Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Disallow wildcard match in Robots.txt
-
This is in my robots.txt file, does anyone know what this is supposed to accomplish, it doesn't appear to be blocking URLs with question marks
Disallow: /?crawler=1
Disallow: /?mobile=1Thank you
-
This is a good reply.
Everyone gets really confused because Robots.txt has very minor, partial wildcard support and that makes people think that Robots.txt files use Regex, which they do not. Instead of having some weird half and half implementation, it would be much better IMO if the Robots.txt initiative / directive were updated to say "yes, you can use full regular expressions with regards to URL string matching".
Many people are left in a kind of silly guessing game because Google doesn't 'properly' elaborate or invest in expanding the definitions to their currently (publicly) assumed end-game.
People assume that if "*" will match any string of characters, "?" will match any individual character when used in a robots.txt file. This would make sense, but it's not the case. AFAIK there are only one or two supported wildcard characters in Robots.txt and that's why people get confused, looking for escape characters and the suchlike.
-
Hi Amanda,
Those lines tell GoogleBot not to crawl urls that have that text fragments.
For example, wont crawl: domain.com/category/product**?mobile=1**BUT, that doesnt mean that will not crawl every URL with question marks. For that, the line should be like this:
Disallow: /*?I do highly recommend you to read this guides:
About /robots.txt - Official site - Robotstxt.org
Robots.txt - Moz
Robots.txt: the ultimate guide - YOAST
The Complete Guide to Robots.txt - PORTENTHope it helps.
Best luck.
GR
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Good to use disallow or noindex for these?
Hello everyone, I am reaching out to seek your expert advice on a few technical SEO aspects related to my website. I highly value your expertise in this field and would greatly appreciate your insights.
Technical SEO | | williamhuynh
Below are the specific areas I would like to discuss: a. Double and Triple filter pages: I have identified certain URLs on my website that have a canonical tag pointing to the main /quick-ship page. These URLs are as follows: https://www.interiorsecrets.com.au/collections/lounge-chairs/quick-ship+black
https://www.interiorsecrets.com.au/collections/lounge-chairs/quick-ship+black+fabric Considering the need to optimize my crawl budget, I would like to seek your advice on whether it would be advisable to disallow or noindex these pages. My understanding is that by disallowing or noindexing these URLs, search engines can avoid wasting resources on crawling and indexing duplicate or filtered content. I would greatly appreciate your guidance on this matter. b. Page URLs with parameters: I have noticed that some of my page URLs include parameters such as ?variant and ?limit. Although these URLs already have canonical tags in place, I would like to understand whether it is still recommended to disallow or noindex them to further conserve crawl budget. My understanding is that by doing so, search engines can prevent the unnecessary expenditure of resources on indexing redundant variations of the same content. I would be grateful for your expert opinion on this matter. Additionally, I would be delighted if you could provide any suggestions regarding internal linking strategies tailored to my website's structure and content. Any insights or recommendations you can offer would be highly valuable to me. Thank you in advance for your time and expertise in addressing these concerns. I genuinely appreciate your assistance. If you require any further information or clarification, please let me know. I look forward to hearing from you. Cheers!0 -
Robots.txt in subfolders and hreflang issues
A client recently rolled out their UK business to the US. They decided to deploy with 2 WordPress installations: UK site - https://www.clientname.com/uk/ - robots.txt location: UK site - https://www.clientname.com/uk/robots.txt
Technical SEO | | lauralou82
US site - https://www.clientname.com/us/ - robots.txt location: UK site - https://www.clientname.com/us/robots.txt We've had various issues with /us/ pages being indexed in Google UK, and /uk/ pages being indexed in Google US. They have the following hreflang tags across all pages: We changed the x-default page to .com 2 weeks ago (we've tried both /uk/ and /us/ previously). Search Console says there are no hreflang tags at all. Additionally, we have a robots.txt file on each site which has a link to the corresponding sitemap files, but when viewing the robots.txt tester on Search Console, each property shows the robots.txt file for https://www.clientname.com only, even though when you actually navigate to this URL (https://www.clientname.com/robots.txt) you’ll get redirected to either https://www.clientname.com/uk/robots.txt or https://www.clientname.com/us/robots.txt depending on your location. Any suggestions how we can remove UK listings from Google US and vice versa?0 -
Log in, sign up, user registration and robots
Hi all, We have an accommodation site that asks users only to register when they want to book a room, in the last step. Though this is the ideal situation when you have tons of users, nowadays we are having around 1500 - 2000 per day and making tests we found out that if we ask for a registration (simple, 1 click FB) we mail them all and through a good customer service we are increasing our sales. That is why, we would like to ask users to register right after the home page ie Home/accommodation or and all the rest. I am not sure how can I make to make that content still visible to robots.
Technical SEO | | Eurasmus.com
Will the authentication process block google crawling it? Maybe something we can do? We are not completely sure how to proceed so any tip would be appreciated. Thank you all for answering.3 -
Should I block Map pages with robots.txt?
Hello, I have a website that was started in 1999. On the website I have map pages for each of the offices listed on my site, for which there are about 120. Each of the 120 maps is in a whole separate html page. There is no content in the page other than the map. I know all of the offices love having the map pages so I don't want to remove the pages. So, my question is would these pages with no real content be hurting the rankings of the other pages on our site? Therefore, should I block the pages with my robots.txt? Would I also have to remove these pages (in webmaster tools?) from Google for blocking by robots.txt to really work? I appreciate your feedback, thanks!
Technical SEO | | imaginex0 -
Guys & Gals anyone know if urllist.txt is still used?
I'm using a tool which generates urllist.txt and looking on the SEO Forums it seems that Yahoo used to use this. What I'd like to know is is it still used anywhere and should we have it on the site?
Technical SEO | | danwebman0 -
Google insists robots.txt is blocking... but it isn't.
I recently launched a new website. During development, I'd enabled the option in WordPress to prevent search engines from indexing the site. When the site went public (over 24 hours ago), I cleared that option. At that point, I added a specific robots.txt file that only disallowed a couple directories of files. You can view the robots.txt at http://photogeardeals.com/robots.txt Google (via Webmaster tools) is insisting that my robots.txt file contains a "Disallow: /" on line 2 and that it's preventing Google from indexing the site and preventing me from submitting a sitemap. These errors are showing both in the sitemap section of Webmaster tools as well as the Blocked URLs section. Bing's webmaster tools are able to read the site and sitemap just fine. Any idea why Google insists I'm disallowing everything even after telling it to re-fetch?
Technical SEO | | ahockley0 -
No indexing url including query string with Robots txt
Dear all, how can I block url/pages with query strings like page.html?dir=asc&order=name with robots txt? Thanks!
Technical SEO | | HMK-NL0 -
Removing robots.txt on WordPress site problem
Hi..am a little confused since I ticked the box in WordPress to allow search engines to now crawl my site (previously asked for them not to) but Google webmaster tools is telling me I still have robots.txt blocking them so am unable to submit the sitemap. Checked source code and the robots instruction has gone so a little lost. Any ideas please?
Technical SEO | | Wallander0