Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Googlebot does not obey robots.txt disallow
-
Hi Mozzers!
We are trying to get Googlebot to steer away from our internal search results pages by adding a parameter "nocrawl=1" to facet/filter links and then robots.txt disallow all URLs containing that parameter.
We implemented this late august and since that, the GWMT message "Googlebot found an extremely high number of URLs on your site", stopped coming.
But today we received yet another. The weird thing is that Google gives many of our nowadays robots.txt disallowed URLs as examples of URLs that may cause us problems.
What could be the reason?
Best regards,
Martin
-
Sorry for the late reply. Feel free to send me a PM. (not sure I can help, but more than happy to take a look)
-
We do not currently have any sanitation rules in order to maintain the nocrawl param. But that is a good point. 301:ing will be difficult for us but I will definitely add the nocrawl param to the rel canonical of those internal SERPs.
-
Thank you, Igol. I will definitely look into your first suggestion.
-
Thank you, Cyrus.
This is what it looks like:
User-agent: *
Disallow: /nocrawl=1The weird thing is that when testing one of the sample URLs (given by Google as "problematic" in the GWMT message and that contains the nocrawl param) on the GWMT "Blocked URLs" page by entering the contents of our robots.txt and the sample URL, Google says crawling of the URL is disallowed for Googlebot.
On the top of the same page, it says "Never" under the heading "Fetched when" (translated from Swedish..). But when i "Fetch as Google" our robots.txt, Googlebot has no problems fetching it. So i guess the "Never" information is due to a GWMT bug?
I also tested our robots.txt against your recommended service http://www.frobee.com/robots-txt-check. It says all robots has access to the sample URL above, but I gather the tool is not wildcard-savvy.
I will not disclose our domain in this context, please tell me if it is ok to send you a PW.
About the noindex stuff. Basically, the nocrawl param is added to internal links pointing to internal search result pages filtered by more than two params. Although we allow crawling of less complicated internal serps, we disallow indexing of most of them by "meta noindex".
-
Thanks.
100% agree with the Meta Noindex suggestion.
-
It can be tricky blocking parameters with robots.txt. The first thing you want to do is make sure your are actually blocking the URLs. There are a few good robots.txt checkers out there that can help:
You're file is probably going to look something like:
User-agent: *
Disallow: /*?nocrawl=1... but this could vary depending on exactly you don't want crawled
+1 to Igal's suggestion of handling these via parameter settings in Google Webmaster Tools: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
Finally, if your goal is to keep search results out of the index (it probably should be) then you should also highly consider using a meta robots NOINDEX tag on all search results pages. You can also slap a nofollow on links pointing to search results as this might also help Google steer clear of those pages.
Best of luck!
Edit: Here's what John Wu of Google Webmaster has to say...
"We show this warning when we find a high number of URLs on a site -- even before we attempt to crawl them. If you are blocking them with a robots.txt file, that's generally fine. If you really do have a high number of URLs on your site, you can generally ignore this message. If your site is otherwise small and we find a high number of URLs, then this kind of message can help you to fix any issues (or disallow access) before we start to access your server to check gazillions of URLs :-)."
-
Didn't say it wasn't.
I`m just not sure how these rules apply to parameters, since they are not a part of the "core" URL.
(For example: What happens if I take a URL from your site, change a nocrawl=1 to nocrawl=0 and link to it from mine?
Do you have any URL sanitation rules in place to overcome that or will the page be indexed by Googlebot when it crawls my site and moves on to yours?)Personally, when dealing with parameters, I find it easier to work with WMT so I was offering an easier workaround, (at least for me)
To tell you the truth, I would use hard-coded on page meta noindex/nofollow here (again, as parameters can be so easily manipulated).
-
Igal, thank your for replying.
But robots.txt disallowing URLs by matching patterns has been supported by Googlebot for a long time now.
-
Hi
I`m not sure if this is the best way to go about it.
Robots.txt is commonly used for folder level disallow rules, I`m not sure how it will respond to parameters.
Having said that, there are several things you can do here:
1. You can use WMT to zero in on this parameter and prevent it from being searched.
To do so choose Configuration>>URL Parameters, answer "Yes" to the question about content change and
check-in the 3rd bullet (Only URL with value...) Of course you'll need to choose "1" as the right value.2. If this still didn't solve your issue, you might want to try using htacess + regex to prevent access by user agent.
You can find user-agent information here Googlebot user agent listAlso, you may want to check my blog post about some of the less known Googlebot Facts (shameless self-promotion)
Best
Igal
-
I'll send you a PW, Des.
-
What the domain.?
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Crawl solutions for landing pages that don't contain a robots.txt file?
My site (www.nomader.com) is currently built on Instapage, which does not offer the ability to add a robots.txt file. I plan to migrate to a Shopify site in the coming months, but for now the Instapage site is my primary website. In the interim, would you suggest that I manually request a Google crawl through the search console tool? If so, how often? Any other suggestions for countering this Meta Noindex issue?
Technical SEO | | Nomader1 -
Googlebot and other spiders are searching for odd links in our website trying to understand why, and what to do about it.
I recently began work on an existing Wordpress website that was revamped about 3 months ago. https://thedoctorwithin.com. I'm a bit new to Wordpress, so I thought I should reach out to some of the experts in the community.Checking ‘Not found’ Crawl Errors in Google Search Console, I notice many irrelevant links that are not present in the website, nor the database, as near as I can tell. When checking the source of these irrelevant links, I notice they’re all generated from various pages in the site, as well as non-existing pages, allegedly in the site, even though these pages have never existed. For instance: https://thedoctorwithin.com/category/seminars/newsletters/page/7/newsletters/page/3/feedback-and-testimonials/ allegedly linked from: https://thedoctorwithin.com/category/seminars/newsletters/page/7/newsletters/page/3/ (doesn’t exist) In other cases, these goofy URLs are even linked from the sitemap. BTW - all the URLs in the sitemap are valid URLs. Currently, the site has a flat structure. Nearly all the content is merely URL/content/ without further breakdown (or subdirectories). Previous site versions had a more varied page organization, but what I'm seeing doesn't seem to reflect the current page organization, nor the previous page organization. Had a similar issue, due to use of Divi's search feature. Ended up with some pretty deep non-existent links branching off of /search/, such as: https://thedoctorwithin.com/search/newsletters/page/2/feedback-and-testimonials/feedback-and-testimonials/online-continuing-education/consultations/ allegedly linked from: https://thedoctorwithin.com/search/newsletters/page/2/feedback-and-testimonials/feedback-and-testimonials/online-continuing-education/ (doesn't exist). I blocked the /search/ branches via robots.txt. No real loss, since neither /search/ nor any of its subdirectories are valid. There are numerous pre-existing categories and tags on the site. The categories and tags aren't used as pages. I suspect Google, (and other engines,) might be creating arbitrary paths from these. Looking through the site’s 404 errors, I’m seeing the same behavior from Bing, Moz and other spiders, as well. I suppose I could use Search Console to remove URL/category/ and URL/tag/. I suppose I could do the same, in regards to other legitimate spiders / search engines. Perhaps it would be better to use Mod Rewrite to lead spiders to pages that actually do exist. Looking forward to suggestions about best way to deal with these errant searches. Also curious to learn about why these are occurring. Thank you.
Technical SEO | | linkjuiced0 -
Guys & Gals anyone know if urllist.txt is still used?
I'm using a tool which generates urllist.txt and looking on the SEO Forums it seems that Yahoo used to use this. What I'd like to know is is it still used anywhere and should we have it on the site?
Technical SEO | | danwebman0 -
Staging & Development areas should be not indexable (i.e. no followed/no index in meta robots etc)
Hi I take it if theres a staging or development area on a subdomain for a site, who's content is hence usually duplicate then this should not be indexable i.e. (no-indexed & nofollowed in metarobots) ? In order to prevent dupe content probs as well as non project related people seeing work in progress or finding accidentally in search engine listings ? Also if theres no such info in meta robots is there any other way it may have been made non-indexable, or at least dupe content prob removed by canonicalising the page to the equivalent page on the live site ? In the case in question i am finding it listed in serps when i search for the staging/dev area url, so i presume this needs urgent attention ? Cheers Dan
Technical SEO | | Dan-Lawrence0 -
Google insists robots.txt is blocking... but it isn't.
I recently launched a new website. During development, I'd enabled the option in WordPress to prevent search engines from indexing the site. When the site went public (over 24 hours ago), I cleared that option. At that point, I added a specific robots.txt file that only disallowed a couple directories of files. You can view the robots.txt at http://photogeardeals.com/robots.txt Google (via Webmaster tools) is insisting that my robots.txt file contains a "Disallow: /" on line 2 and that it's preventing Google from indexing the site and preventing me from submitting a sitemap. These errors are showing both in the sitemap section of Webmaster tools as well as the Blocked URLs section. Bing's webmaster tools are able to read the site and sitemap just fine. Any idea why Google insists I'm disallowing everything even after telling it to re-fetch?
Technical SEO | | ahockley0 -
Oh no googlebot can not access my robots.txt file
I just receive a n error message from google webmaster Wonder it was something to do with Yoast plugin. Could somebody help me with troubleshooting this? Here's original message Over the last 24 hours, Googlebot encountered 189 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%. Recommended action If the site error rate is 100%: Using a web browser, attempt to access http://www.soobumimphotography.com//robots.txt. If you are able to access it from your browser, then your site may be configured to deny access to googlebot. Check the configuration of your firewall and site to ensure that you are not denying access to googlebot. If your robots.txt is a static page, verify that your web service has proper permissions to access the file. If your robots.txt is dynamically generated, verify that the scripts that generate the robots.txt are properly configured and have permission to run. Check the logs for your website to see if your scripts are failing, and if so attempt to diagnose the cause of the failure. If the site error rate is less than 100%: Using Webmaster Tools, find a day with a high error rate and examine the logs for your web server for that day. Look for errors accessing robots.txt in the logs for that day and fix the causes of those errors. The most likely explanation is that your site is overloaded. Contact your hosting provider and discuss reconfiguring your web server or adding more resources to your website. After you think you've fixed the problem, use Fetch as Google to fetch http://www.soobumimphotography.com//robots.txt to verify that Googlebot can properly access your site.
Technical SEO | | BistosAmerica0 -
Temporarily suspend Googlebot without blocking users
We'll soon be launching a redesign, on a new platform, migrating millions of pages to new URLs. How can I tell Google (and other crawlers) to temporarily (a day or two) ignore my site? We're hoping to buy ourselves a small bit of time to verify redirects and live functionality before allowing Google to crawl and index the new architecture. GWT's recommendation is to 503 all pages - including robots.txt, but that also makes the site invisible to real site visitors, resulting in significant business loss. Bad answer. I've heard some recommendations to disallow all user agents in robots.txt. Any answer that puts the millions of pages we already have indexed at risk is also a bad answer. Thanks
Technical SEO | | lzhao0