Robots.txt to disallow /index.php/ path

Mikkehl

Hi SEOmoz,

I have a problem with my Joomla site (yeah - me too!). I get a large number of /index.php/ URLs despite using a program to handle these issues. The URLs cause indexation errors with Google (404). Now, I fixed this issue once before, but the problem persists. So I thought, instead of wasting more time, couldn't I just disallow all paths containing /index.php/?

I don't use that extension, but would it cause me any problems from an SEO perspective?

How do I disallow all index.php's? Is it a simple: Disallow: /index.php/
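One way to sanity-check a rule like this before relying on it is to run it through a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser`; the `User-agent: *` line, the helper name `is_allowed`, and the sample paths are assumptions made for illustration, and only the `Disallow: /index.php/` line comes from the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; only the Disallow line is taken from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php/
"""


def is_allowed(path: str, robots_txt: str = ROBOTS_TXT, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Sample paths are made up for illustration.
    for path in ("/index.php/products", "/products", "/shop/index.php/item"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Under this parser the rule is a prefix match: it blocks paths that start with /index.php/, but not paths that merely contain it further along. Google documents support for a `*` wildcard in robots.txt rules, which this standard-library parser does not interpret, so a wildcard rule would need to be checked with Google's own testing tools instead.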

Mikkehl

Hi Cyrus,

Thanks for your reply!

Unfortunately the problem is yet to be fixed; I hope that my disallow will work shortly.

It seems that most of the index.php pages link to each other internally (and from old /index.php/ pages that no longer exist), which is super weird. How Google found them does not make any sense to me.

I don't believe that external sources are linking to these pages either - I mean, how would they find these links anyway?

Cyrus-Shepard

Hi Mikkel,

Like Chris, I spidered your site and couldn't find any links to /index.php files, which probably indicates one of three things:

1. You've fixed the problem - Yay!
2. Google is finding those links from external sources.
3. Google found those links at one time in the past, and is still trying to crawl them.

In the Crawl Errors report in Google Webmaster Tools, if you click on the link of each 404, there's often a "linked from" source where you can see where Google discovered the broken link. This is really helpful in rooting out the cause.

Regardless, I'm going to go with #1 and optimistically believe that you were able to fix the problem.

cogbox

If I spider your site I'm not seeing any /index.php urls. Does that mean you did get Joomla to cooperate with your rewriting?

Or was your problem that you'd previously had urls indexed with /index.php/ paths and you needed to remove them?

SanketPatel

Hi Mikkel, I have checked your robots.txt, and it looks perfect. If you redirect /index.php to the home page, using the .htaccess file or any Joomla plugin, that would be great for you. It's also a permanent solution.

Mikkehl

Well, I tried the sensible solution of redirecting to the correct URL instead. However, the SEF program is quite limited and keeps on creating new URLs regardless of my modifications. I'm looking for a more permanent solution, and the disallow seems a bit simpler, as I'm not a super programmer.

By the way - thanks for the quick replies, kudos to both of you!

Mikkehl

Sure, the website in question is www.vauni.dk

I don't think that there are any inbound links to the index.php pages. They are not easily found.

cogbox

Couldn't you rewrite those /index.php/ urls to remove the /index.php/?

Like this in .htaccess:

RewriteRule ^(.*)$ /index.php/$1 [L]

I've only used Joomla once, but there must be a way to configure Joomla to just use "/" instead of "/index.php/"?

Update:

Here's a solution to your /index.php/ issue:

http://www.eprcreations.com/remove-index-php-from-joomla-urls/

Once you've updated that, and have your urls working properly without the /index.php/, you could add this slight modification of the rewrite rule above so that all your old /index.php/ urls would be 301'd to your new ones:

RewriteRule ^index\.php/(.*)$ /$1 [R=301,L]

Put it underneath the RewriteBase / line they describe in that post.
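The intent described above, sending each old /index.php/ URL to its clean equivalent with a 301, can be checked against a few sample paths before touching the server. The sketch below is only a Python model of that mapping, not Apache configuration; the regex, the helper name `redirect_target`, and the example paths are assumptions for illustration and are not taken from the linked post.

```python
import re
from typing import Optional

# Assumed pattern: an old-style path is /index.php/ followed by the real route.
OLD_PREFIX = re.compile(r"^/index\.php/(.*)$")


def redirect_target(path: str) -> Optional[str]:
    """Return the clean path an old /index.php/ URL should 301 to, or None."""
    match = OLD_PREFIX.match(path)
    if match is None:
        return None  # already clean, so no redirect is needed
    return "/" + match.group(1)


if __name__ == "__main__":
    # Hypothetical paths, made up for illustration.
    for path in ("/index.php/products/lamps", "/products/lamps", "/index.php/"):
        print(path, "->", redirect_target(path))
```

A path that is already clean returns None, which is the property that keeps a redirect like this from looping: once a URL has been rewritten it no longer matches the pattern.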

SanketPatel

Hi Mikkel,

Do you have inbound links pointing to your index.php pages? If yes, then it might affect your SEO. Disallow: /index.php/ is perfect, but after implementing it, don't interlink those index.php pages. Can you share your website URL so that I can show you how to do it with an example?
