Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt to disallow /index.php/ path
- 
					
					
					
					
 Hi SEOmoz, I have a problem with my Joomla site (yeah - me too!). I get a large amount of /index.php/ urls despite using a program to handle these issues. The URLs cause indexation errors with google (404). Now, I fixed this issue once before, but the problem persist. So I thought, instead of wasting more time, couldnt I just disallow all paths containing /index.php/ ?. I don't use that extension, but would it cause me any problems from an SEO perspective? How do I disallow all index.php's? Is it a simple: Disallow: /index.php/ 
- 
					
					
					
					
 Hi Cyrus, Thanks for your reply! Unfortunately the problem is yet to be fixed, I hope that my disallow will work shortly. It seems that most of the index.php links to each other internally (and from old /index.php/ pages that no longer exist), which is super weird. How google found them does not make any sense to me. I don't beleive that external sources are linking to these pages either - I mean, how would they find these links anyway?. 
- 
					
					
					
					
 Hi Mikkel, Like Chris, I spidered your site and couldn't find any links to /index.php files, which probably indicates one of two things: - You've fixed the problem - Yay!
- Or Google is finding those links from external sources
- Google found those links at one time in the past, and is still trying to crawl them.
 In the Crawl Errors report in Google Webmaster Tools, if you click on the link of each 404, there's often a "linked from" source where you can see where Google discovered the broken link. This is really helpful in rooting out the cause. Regardless, I'm going to go with #1 and optimistically believe that you were able to fix the problem.  
- 
					
					
					
					
 If I spider your site I'm not seeing any /index.php urls. Does that mean you did get Joomla to cooperate with your rewriting? Or was your problem that you'd previously had urls indexed with /index.php/ paths and you needed to remove them? 
- 
					
					
					
					
 Hi Mikkel, I have checked your robots.txt, it looks perfect. If you redirect /index.php to home page that using httaccess file or by using any joomla plugin that would great for you. And its also a permanent solution.  
- 
					
					
					
					
 Well, I tried the sensible solution and redirecting to the correct URL instead. However the SEF program is quite limited and keep on creating new URLs regardless of my modification. Im looking for a more permanent solution, and the disallow seems at bit simple as I'm not a super programmer. By the way - thanks for quick replys, kudos to both of you! 
- 
					
					
					
					
 Sure, the website in question is www.vauni.dk I don't think that there is any inbound links to the index.php pages. They are not easily found. 
- 
					
					
					
					
 Couldn't you rewrite those /index.php/ urls to remove the /index.php/? Like this in .htaccess: RewriteRule ^(.*)$ /index.php/$1 [L] Only used Joomla once, but there must be a way to configure joomla to just use "/" instead of "/index.php/"? Update: Here's a solution to your /index.php/ issue: http://www.eprcreations.com/remove-index-php-from-joomla-urls/ Once you've updated that, and have your urls working properly without the /index.php/, you could add this slight modification of the rewrite rule above so that all your old /index.php/ urls would be 301'd to your new ones: RewriteRule ^(.*)$ /index.php/$1 [R=301,L] Put it underneath the RewriteBase / line they describe in that post. 
- 
					
					
					
					
 Hi Mikkel, Do you inbound link pointing to you index.php pages ? If yes, then it might affect your seo. Disallow: /index.ph/ is perfect but after implementing it don't inter link those index.php pages. Can you share me your website URL so that I can show you with example. How to do it. 
Browse Questions
Explore more categories
- 
		
		Moz ToolsChat with the community about the Moz tools. 
- 
		
		SEO TacticsDiscuss the SEO process with fellow marketers 
- 
		
		CommunityDiscuss industry events, jobs, and news! 
- 
		
		Digital MarketingChat about tactics outside of SEO 
- 
		
		Research & TrendsDive into research and trends in the search industry. 
- 
		
		SupportConnect on product support and feature requests. 
Related Questions
- 
		
		
		
		
		
		I have two robots.txt pages for www and non-www version. Will that be a problem?
 There are two robots.txt pages. One for www version and another for non-www version though I have moved to the non-www version. Technical SEO | | ramb0
- 
		
		
		
		
		
		Robots.txt on http vs. https
 We recently changed our domain from http to https. When a user enters any URL on http, there is an global 301 redirect to the same page on https. I cannot find instructions about what to do with robots.txt. Now that https is the canonical version, should I block the http-Version with robots.txt? Strangely, I cannot find a single ressource about this... Technical SEO | | zeepartner0
- 
		
		
		
		
		
		Google indexing despite robots.txt block
 Hi This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch This has been the case for almost a year now, and it does not look like Google tends to respect the blocking in http://www1.swisscom.ch/robots.txt Any clues why this is or what I could do to resolve it? Thanks! Technical SEO | | zeepartner0
- 
		
		
		
		
		
		Staging & Development areas should be not indexable (i.e. no followed/no index in meta robots etc)
 Hi I take it if theres a staging or development area on a subdomain for a site, who's content is hence usually duplicate then this should not be indexable i.e. (no-indexed & nofollowed in metarobots) ? In order to prevent dupe content probs as well as non project related people seeing work in progress or finding accidentally in search engine listings ? Also if theres no such info in meta robots is there any other way it may have been made non-indexable, or at least dupe content prob removed by canonicalising the page to the equivalent page on the live site ? In the case in question i am finding it listed in serps when i search for the staging/dev area url, so i presume this needs urgent attention ? Cheers Dan Technical SEO | | Dan-Lawrence0
- 
		
		
		
		
		
		/~username
 Hello, The utility on this site that crawls your site and highlights what it sees as potential problems reported an issue with /~username access seeing it as duplicate content i.e. mydomain.com/file.htm is the same as mydomain.com~/username/file.htm so I went to my server hosts and they disabled it using mod_userdir but GWT now gives loads of 404 errors. Have I gone about this the wrong way or was it not really a problem in the first place or have I fixed something that wasn't broken and made things worse? Thanks, Ian Technical SEO | | jwdl0
- 
		
		
		
		
		
		Oh no googlebot can not access my robots.txt file
 I just receive a n error message from google webmaster Wonder it was something to do with Yoast plugin. Could somebody help me with troubleshooting this? Here's original message Over the last 24 hours, Googlebot encountered 189 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%. Recommended action If the site error rate is 100%: Using a web browser, attempt to access http://www.soobumimphotography.com//robots.txt. If you are able to access it from your browser, then your site may be configured to deny access to googlebot. Check the configuration of your firewall and site to ensure that you are not denying access to googlebot. If your robots.txt is a static page, verify that your web service has proper permissions to access the file. If your robots.txt is dynamically generated, verify that the scripts that generate the robots.txt are properly configured and have permission to run. Check the logs for your website to see if your scripts are failing, and if so attempt to diagnose the cause of the failure. If the site error rate is less than 100%: Using Webmaster Tools, find a day with a high error rate and examine the logs for your web server for that day. Look for errors accessing robots.txt in the logs for that day and fix the causes of those errors. The most likely explanation is that your site is overloaded. Contact your hosting provider and discuss reconfiguring your web server or adding more resources to your website. After you think you've fixed the problem, use Fetch as Google to fetch http://www.soobumimphotography.com//robots.txt to verify that Googlebot can properly access your site. Technical SEO | | BistosAmerica0
- 
		
		
		
		
		
		Is blocking RSS Feeds with robots.txt necessary?
 Is it necessary to block an rss feed with robots.txt? It seems they are automatically not indexed (http://googlewebmastercentral.blogspot.com/2007/12/taking-feeds-out-of-our-web-search.html) And, google says here that it's important not to block RSS feeds (http://googlewebmastercentral.blogspot.com/2009/10/using-rssatom-feeds-to-discover-new.html) I'm just checking! Technical SEO | | nicole.healthline0
- 
		
		
		
		
		
		Index forum sites
 Hi Moz Team, somehow the last question i raised a few days ago not only wasnt answered up until now, it was also completely deleted and the credit was not "refunded" - obviously there was some data loss involved with your restructuring. Can you check whether you still find the last question and answer it quickly? I need the answer 🙂 Here is one more question: I bought a website that has a huge forum, loads of pages with user generated content. Overall around 500.000 Threads with 9 Million comments. The complete forum is noindex/nofollow when i bought the site, now i am thinking about what is the best way to unleash the potential. The current system is vBulletin 3.6.10. a) Shall i first do an update of vbulletin to version 4 and use the vSEO tool to make the URLs clean, more user and search engine friendly before i switch to index/follow? b) would you recommend to have the forum in the folder structure or on a subdomain? As far as i know subdomain does take lesser strenght from the TLD, however, it is safer because the subdomain is seen as a separate entity from the regular TLD. Having it in he folder makes it easiert to pass strenght from the TLD to the forum, however, it puts my TLD at risk c) Would you release all forum sites at once or section by section? I think section by section looks rather unnatural not only to search engines but also to users, however, i am afraid of blasting more than a millionpages into the index at once. d) Would you index the first page of a threat or all pages of a threat? I fear duplicate content as the different pages of the threat contain different body content but the same Title and possibly the same h1. Looking forward to hear from you soon! Best Fabian Technical SEO | | fabiank0
 
			
		 
			
		 
			
		 
			
		 
					
				 
					
				 
					
				 
					
				 
					
				 
					
				