Meta NoIndex tag and Robots Disallow
-
Hi all,
I hope you can spare some time to answer the first of a few questions.
We are running a Magento site, and the layered/faceted navigation nightmare has created thousands of duplicate URLs!
Anyway, while tackling the issue, I disallowed in robots.txt any query string parameter that was not p (which I allowed for pagination).
After checking some pages in Google, I did a site:www.mydomain.com/specificpage.html search, and a few duplicates came up along with the original, showing:
"There is no information about this page because it is blocked by robots.txt"
I had also added meta noindex, follow to all these duplicates, but I guess it wasn't being read because of robots.txt.
So, coming to my questions:
1. Did robots.txt block access to these pages? If so, were they already in the index, and after I disallowed them in robots.txt, could Googlebot no longer read the meta noindex?
2. Does meta noindex, follow on pages actually help Googlebot decide to remove these pages from the index?
I thought robots.txt would stop and prevent indexation? But I've read this:
"Noindex is a funny thing, it actually doesn’t mean “You can’t index this”, it means “You can’t show this in search results”. Robots.txt disallow means “You can’t index this” but it doesn’t mean “You can’t show it in the search results”."
I'm a bit confused about how to use these both to prevent duplicate content in the first place and to address duplicate content once it's already in the index.
Thanks!
B
-
There's no real way to estimate how long the re-crawl will take, Ben. You can get a bit of an idea by looking at the crawl rate reported in Google Webmaster Tools.
Yes, requesting a page fetch and then submitting it with linked pages for each of the main website sections can help speed up crawl discovery. In addition, make sure you've submitted a current sitemap and that it's being found correctly (also reported in GWT). You should also do the same in Bing Webmaster Tools. Too many sites forget about optimizing for Bing; even if it's only 20% of Google's traffic, there's no point throwing it away.
Lastly, earning some new links to different sections of the site is another great signal. This can often be done quickly and effectively using social media, especially Google+, as it gets crawled very quickly.
As for your other question: yes, once you get the unwanted URLs out of the index, you can add the robots.txt disallow back in to optimise your crawl budget. I would strongly recommend you leave the meta-robots no-index tag in place, though, as a "belt & suspenders" approach to keep pages linking into those unwanted pages from triggering a re-indexing. It's OK to have both in place as long as the de-indexing has already been accomplished, as we've discussed.
Hope that answers your questions?
Paul
-
So once Google has started to see the meta noindex and has finished de-indexing the pages, I would like to block it from crawling them with robots.txt to conserve my crawl budget.
But there are still internal links on the site that point to these URLs. Would they get back into the index in this case?
-
Hi Paul,
Thank you for your detailed answer. So I'm not going crazy!
I did try canonicals, but then realized they are more of a suggestion than a directive. I am still correcting a lot of duplicate content and 404s, so I imagine Google views the site as "these guys don't know what they are doing" and may have ignored the canonical suggestion.
So what I have done is remove the robots.txt block on the pages I want de-indexed and add meta noindex, follow to these pages. From what you are saying, they should naturally de-index, after which I will put the robots.txt block back on to keep my crawl budget spent on better areas of the site.
In your opinion, how long can it take for Googlebot to de-index the pages? Can I do anything to speed it up, such as fetching the page and its linked pages as Googlebot?
Thanks again,
Ben
-
You're right to be confused, B. The terminology is unfortunate and misleading.
To answer your questions:
1. Yes.
2. Yes.
A disallow in robots.txt does nothing to remove already-indexed pages. That's not its purpose. Its only purpose is to tell the search crawlers not to waste their time crawling those pages. Even if pages have been blocked in robots, they will remain in the index if already there. Even if never crawled, and blocked in robots.txt, they can still end up indexed if some other indexed page links to them and the crawlers find those pages by following links. Again, nothing in a robots.txt disallow tells the engines to remove a page from the index, just not to waste time crawling it.
Put another way, the robots.txt disallow directive only disallows crawling - it says nothing about what to do if the page gets into the index in other ways.
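To make "disallow only controls crawling" concrete, here is a minimal Python sketch of how a crawler might decide whether it may fetch a URL. It is modelled on the wildcard matching Google documents for robots.txt (longest matching pattern wins, and allow wins a tie). The rules and URLs are hypothetical, loosely based on the "block every query string except p" setup from the question. Note that nothing in it ever touches the index.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_crawl_allowed(rules, path_and_query: str) -> bool:
    """Return True if the URL may be crawled under the given rules.

    rules is a list of (directive, pattern) pairs, where directive is
    "allow" or "disallow". The longest matching pattern wins; on a tie,
    allow wins. A URL that matches no rule may be crawled.
    """
    best_len = -1
    allowed = True
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path_and_query):
            is_allow = directive == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


# Hypothetical rules (an assumption, not the poster's actual file):
# block every query string except pagination via ?p=
RULES = [
    ("disallow", "/*?"),
    ("allow", "/*?p="),
]

for url in ("/widgets.html", "/widgets.html?price=100", "/widgets.html?p=2"):
    print(url, "->", "crawl" if is_crawl_allowed(RULES, url) else "do not crawl")
```

The function only answers "may I fetch this?". Whether a URL that is already indexed stays in the index is a separate question that robots.txt does not address. Also note that the hypothetical allow rule only matches when p is the first parameter.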
The meta-robots no-index tag, however, explicitly states to the crawler "if you arrive at this page, do not add it to the index. If it is already in the index, remove it".
And yes, as you suspected, if pages are blocked in robots.txt, the crawler obeys and doesn't visit those pages, so it can't discover the no-index command to drop them from the index. Thus the only way a page could get dropped is if a crawler followed a link from an external site and discovered the page that way, which is a very inefficient way of trying to get all those pages out of the index.
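The interaction between the two mechanisms can be sketched in a few lines of Python. This is only an illustration: the class name, function name, and sample page are made up for the example, and a real crawler is far more involved. The point is that a noindex tag only counts if the page is actually fetched.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def crawler_sees_noindex(html: str, blocked_by_robots_txt: bool) -> bool:
    """Return True only if the page is crawlable AND carries a noindex directive."""
    if blocked_by_robots_txt:
        # The page is never fetched, so the tag is never read.
        return False
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical duplicate page carrying the tag from the question.
PAGE = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Duplicate listing</body></html>"
)

print(crawler_sees_noindex(PAGE, blocked_by_robots_txt=False))
print(crawler_sees_noindex(PAGE, blocked_by_robots_txt=True))
```

With the robots.txt block lifted, the directive is visible and the page can be dropped. With the block in place, the same tag is invisible to the crawler.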
Bottom line: robots.txt is never the correct tool to deal with duplicate content issues. Its sole purpose is to keep the crawlers from wasting time on unimportant pages so they can spend more time finding (and therefore indexing) more important pages.
The three tools for dealing with duplicate content are meta-robots no-index tags in a page header, 301 redirects, and canonical tags. Which one to use depends on the architecture of your site, your intended purpose, and the site's technical limitations.
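As a rough illustration of the canonical-tag option for faceted URLs like the ones in the question, the Python sketch below strips every query parameter except p and builds the tag. The KEEP_PARAMS set, the example.com URLs, and the choice to let paginated pages canonicalise to themselves are all assumptions for the example, not something this thread settles.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only the pagination parameter changes the page content enough
# to deserve its own URL; every other parameter is a facet or sort option.
KEEP_PARAMS = {"p"}


def canonical_url(url: str) -> str:
    """Drop all query parameters except those in KEEP_PARAMS, and any fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in KEEP_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> tag for the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


print(canonical_link_tag("https://www.example.com/widgets.html?price=1%2C250"))
print(canonical_link_tag("https://www.example.com/widgets.html?price=1%2C250&p=2"))
```

Every filtered variation of a category page would then point at the same unfiltered URL, leaving the engines to consolidate the duplicates. As noted elsewhere in the thread, a canonical is a hint, not a directive.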
Hope that makes sense?
Paul