Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Allow or Disallow First in Robots.txt
-
If I want to override a Disallow directive in robots.txt with an Allow command, do I have the Allow command before or after the Disallow command?
example:
Allow: /models/ford///page*
Disallow: /models////page
-
Just caught this a bit late and probably to late to add something but my two pence is test it in Webmaster Tools, via Crawl -> Robot.txt tester - if you've not used this before simply add the url you want to test and Google highlights the directive that allows or disallows it.
-
Thank you Cyrus, yes, I have tried your suggested robots.txt checker and despite it validates the file, it shows me a couple of warnings about the "unusual" use of wildcard. It is my understanding that I would probably need to discuss all this with Google folks directly.
Thank you for you answer... and, yes Keri, I know this is a old thread, but still useful today!
Thanks

-
Can't say with 100% confidence, but sounds like it might work. You could always upload it to a server and use a robots.txt checker to validate, although sometimes the validator tools may incorporate slight differences in edge cases like this that make them moot.
-
Just a quick note, this question is actually from spring of 2012.
-
What about something like:
allow: /directory/$
disallow: /directory/*
Where I want this to be indexed:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
Ideas?
-
I really appreciate all that effort you put in to ensure your method was correct. many thanks.
-
Interesting question - I've had this discussion a couple of times with different SEOs. Here's my best understanding: There are actually 2 different answers - one if you are talking about Google, and one for every other search engine.
For most search engines, the "Allow" should come first. This is because the first matching pattern always wins, for the reasons Geoff stated.
But Google is different. They state:
"At a group-member level, in particular for
allowanddisallowdirectives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined."Robots.txt Specifications - Webmasters — Google Developers
So for Google, order is not important, only the specificity of the rule based on the length of the entry. But the order of precedence for rules with wildcards is undefined.
This last part is important, because your directives contain wildcards. If I'm reading this right, your particular directives:
Allow: /models/ford///page*
Disallow: /models////pageSo if it's "undefined" which directive will Google follow, if order isn't important? Fortunately, there's a simple way to find out.Google Webmaster allows you to test any robots.txt file. I created a dummy file based on your rules, In this case, your directives worked perfectly no matter what order I put them in.
| http://cyrusshepard.com/models/ford/test/test/pages | Allowed by line 2: Allow: /models/ford///page* | Allowed by line 2: Allow: /models/ford///page* |
| http://cyrusshepard.com/models/chevy/test/test/pages | Blocked by line 3: Disallow: /models////page | Blocked by line 3: Disallow: /models////page |So, to summarize:1. Always put Allow directives first, as most search engines follow the "first rule counts" rule.2. Google doesn't care about order, but rather the specificity based on the length of the entry.3. The order of precedence for rules with wildcards is undefined.4. When in doubt, check your robots.txt file in Google Webmaster tools.Hope this helps.(sorry for the very long answer which basically says you were right all along

-
I understand your concern. I am basing my answer based on the fact that if you don't have a robots.txt at all, Google will still crawl you, which means its an allow by default. So all that matters in my opinion is the disallow, but because you need an allow from the wildcard disallow, you could allow that and disallow next.
Honestly, I don't think it matters. If you think the way a bot would work, it's not like robots.txt 1 line is read, then the bot goes crawling and then comes back reads the next line and so on. Does that make sense ? It reads all the lines in the robots.txt and then follows the directives. But to be sure, you can do either of the scenarios and see for yourself. I am sure the results would be same either way.
-
The allow directives need to come before the disallow directives for the same directory/file paths. (I have never personally tested this although it makes logical sense to instruct a robot to access one particular path within a directory structure before it sees that it is blocked from crawling that directory).
For example:-
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/photos
Allow: /s2/static
Disallow: /s2
As per how Google have formatted their robots.txt.
-
Thanks. I want to make sure I get this right in a syntax universally understood by all engines. I have seen webmasters all over the place on this one with some saying that crawlers use a first matching rule and others that say that crawlers use a last matching rule. I am almost thinking to have the allow command twice - before and after, to cover all bases.
-
I don't think it matters, but I think I would disallow first, because by default everything is an Allow.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Good to use disallow or noindex for these?
Hello everyone, I am reaching out to seek your expert advice on a few technical SEO aspects related to my website. I highly value your expertise in this field and would greatly appreciate your insights.
Technical SEO | | williamhuynh
Below are the specific areas I would like to discuss: a. Double and Triple filter pages: I have identified certain URLs on my website that have a canonical tag pointing to the main /quick-ship page. These URLs are as follows: https://www.interiorsecrets.com.au/collections/lounge-chairs/quick-ship+black
https://www.interiorsecrets.com.au/collections/lounge-chairs/quick-ship+black+fabric Considering the need to optimize my crawl budget, I would like to seek your advice on whether it would be advisable to disallow or noindex these pages. My understanding is that by disallowing or noindexing these URLs, search engines can avoid wasting resources on crawling and indexing duplicate or filtered content. I would greatly appreciate your guidance on this matter. b. Page URLs with parameters: I have noticed that some of my page URLs include parameters such as ?variant and ?limit. Although these URLs already have canonical tags in place, I would like to understand whether it is still recommended to disallow or noindex them to further conserve crawl budget. My understanding is that by doing so, search engines can prevent the unnecessary expenditure of resources on indexing redundant variations of the same content. I would be grateful for your expert opinion on this matter. Additionally, I would be delighted if you could provide any suggestions regarding internal linking strategies tailored to my website's structure and content. Any insights or recommendations you can offer would be highly valuable to me. Thank you in advance for your time and expertise in addressing these concerns. I genuinely appreciate your assistance. If you require any further information or clarification, please let me know. I look forward to hearing from you. Cheers!0 -
How to allow bots to crawl all but WP-content
Hello, I would like my website to remain crawlable to bots, but to block my wp content and media. Does the following robots.txt work? I worry that the * user agent may conflict with the others. User-agent: *
Technical SEO | | Tom3_15
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/ User-agent: GoogleBot
Allow: / User-agent: GoogleBot-Mobile
Allow: / User-agent: GoogleBot-Image
Allow: / User-agent: Bingbot
Allow: / User-agent: Slurp
Allow: /0 -
Disallow wildcard match in Robots.txt
This is in my robots.txt file, does anyone know what this is supposed to accomplish, it doesn't appear to be blocking URLs with question marks Disallow: /?crawler=1
Technical SEO | | AmandaBridge
Disallow: /?mobile=1 Thank you0 -
One robots.txt file for multiple sites?
I have 2 sites hosted with Blue Host and was told to put the robots.txt in the root folder and just use the one robots.txt for both sites. Is this right? It seems wrong. I want to block certain things on one site. Thanks for the help, Rena
Technical SEO | | renalynd270 -
Should I block Map pages with robots.txt?
Hello, I have a website that was started in 1999. On the website I have map pages for each of the offices listed on my site, for which there are about 120. Each of the 120 maps is in a whole separate html page. There is no content in the page other than the map. I know all of the offices love having the map pages so I don't want to remove the pages. So, my question is would these pages with no real content be hurting the rankings of the other pages on our site? Therefore, should I block the pages with my robots.txt? Would I also have to remove these pages (in webmaster tools?) from Google for blocking by robots.txt to really work? I appreciate your feedback, thanks!
Technical SEO | | imaginex0 -
Two META Robots tags on a page - which will win?
Hi, Does anybody know which meta-robots tag will "win" if there is more than one on a page? The situation:
Technical SEO | | jmueller
our CMS is not very flexible and so we have segments of META-Tags on the page that originate from templates.
Now any author can add any meta-tag from within his article-editor.
The logic delivering the pages does not care if there might be more than one meta-robots tag present (one from template, one from within the article). Now we could end up with something like this: Which one will be regarded by google & co?
First?
Last?
None? Thanks a lot,
Jan0 -
Should I block robots from URLs containing query strings?
I'm about to block off all URLs that have a query string using robots.txt. They're mostly URLs with coremetrics tags and other referrer info. I figured that search engines don't need to see these as they're always better off with the original URL. Might there be any downside to this that I need to consider? Appreciate your help / experiences on this one. Thanks Jenni
Technical SEO | | ShearingsGroup0 -
Subdomain Removal in Robots.txt with Conditional Logic??
I would like to see if there is a way to add conditional logic to the robots.txt file so that when we push from DEV to PRODUCTION and the robots.txt file is pushed, we don't have to remember to NOT push the robots.txt file OR edit it when it goes live. My specific situation is this: I have www.website.com, dev.website.com and new.website.com and somehow google has indexed the DEV.website.com and NEW.website.com and I'd like these to be removed from google's index as they are causing duplicate content. Should I: a) add 2 new GWT entries for DEV.website.com and NEW.website.com and VERIFY ownership - if I do this, then when the files are pushed to LIVE won't the files contain the VERIFY META CODE for the DEV version even though it's now LIVE? (hope that makes sense) b) write a robots.txt file that specifies "DISALLOW: DEV.website.com/" is that possible? I have only seen examples of DISALLOW with a "/" in the beginning... Hope this makes sense, can really use the help! I'm on a Windows Server 2008 box running ColdFusion websites.
Technical SEO | | ErnieB0