Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt and canonical tag
-
In the SEOmoz post - http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts, it's being said -
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.
Does it so happen that if a page is disallowed by robots.txt, spiders DO NOT read the html code ?
-
Thanks Ryan for explaining things very clearly.
-
What we know is there have been many cases where a page that is blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.
-
If spiders follow links to an article on my site, will they read the contents then ? If the canonical tag is on article page itself, will canonical tag will be seen ?
-
Daylan offered a great answer but I would like to add one exception. When crawlers from the major SEs visit your site they will honor your robots.txt file but sometimes they will follow links from other sites to an article on your site, and during that particular visit they will not see the robots.txt file and index your page.
This is one of the reasons why your robots.txt file should be used as minimally as possible, and when it is used you should have a backup process in place such as the canonical or noindex tag on a page.
-
Thanks Daylan for your quick response. I just wanted a second opinion that canonical tag will never be seen if a page is disallowed.
-
Thats correct in most cases:
It works likes this: a robot wants to vists a Web site URL, say http://www.example.com/welcome.html. Before it does so, it firsts checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
More information available here about:
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Dynamic Canonical Tag for Search Results Filtering Page
Hi everyone, I run a website in the travel industry where most users land on a location page (e.g. domain.com/product/location, before performing a search by selecting dates and times. This then takes them to a pre filtered dynamic search results page with options for their selected location on a separate URL (e.g. /book/results). The /book/results page can only be accessed on our website by performing a search, and URL's with search parameters from this page have never been indexed in the past. We work with some large partners who use our booking engine who have recently started linking to these pre filtered search results pages. This is not being done on a large scale and at present we only have a couple of hundred of these search results pages indexed. I could easily add a noindex or self-referencing canonical tag to the /book/results page to remove them, however it’s been suggested that adding a dynamic canonical tag to our pre filtered results pages pointing to the location page (based on the location information in the query string) could be beneficial for the SEO of our location pages. This makes sense as the partner websites that link to our /book/results page are very high authority and any way that this could be passed to our location pages (which are our most important in terms of rankings) sounds good, however I have a couple of concerns. • Is using a dynamic canonical tag in this way considered spammy / manipulative? • Whilst all the content that appears on the pre filtered /book/results page is present on the static location page where the search initiates and which the canonical tag would point to, it is presented differently and there is a lot more content on the static location page that isn’t present on the /book/results page. Is this likely to see the canonical tag being ignored / link equity not being passed as hoped, and are there greater risks to this that I should be worried about? I can’t find many examples of other sites where this has been implemented but the closest would probably be booking.com. https://www.booking.com/searchresults.it.html?label=gen173nr-1FCAEoggI46AdIM1gEaFCIAQGYARS4ARfIAQzYAQHoAQH4AQuIAgGoAgO4ArajrpcGwAIB0gIkYmUxYjNlZWMtYWQzMi00NWJmLTk5NTItNzY1MzljZTVhOTk02AIG4AIB&sid=d4030ebf4f04bb7ddcb2b04d1bade521&dest_id=-2601889&dest_type=city& Canonical points to https://www.booking.com/city/gb/london.it.html In our scenario however there is a greater difference between the content on both pages (and booking.com have a load of search results pages indexed which is not what we’re looking for) Would be great to get any feedback on this before I rule it out. Thanks!
Technical SEO | | GAnalytics1 -
Canonical tag for Home page: with or without / at the end???
Setting up canonical tags for an old site. I really need advice on that darn backslash / at the end of the homepage URL. We have incoming links to the homepage as http://www.mysite.com (without the backslash), and as http://www.mysite.com/ (with the backslash), and as http://www.mysite.com/index.html I know that there should be 301 redirects to just one version, but I need to know more about the canonical tags... Which should the canonical tag be??? (without the backslash) or (with the backslash) Thanks for your help! 🙂
Technical SEO | | GregB1230 -
Duplicate title-tags with pagination and canonical
Some time back we implemented the Google recommendation for pagination (the rel="next/prev"). GWMT now reports 17K pages with duplicate title-tags (we have about 1,1m products on our site and about 50m pages indexed in Google) As an example we have properties listed in various states and the category title would be "Properties for Sale in [state-name]". A paginated search page or browsing a category (see also http://searchengineland.com/implementing-pagination-attributes-correctly-for-google-114970) would then include the following: The title for each page is the same - so to avoid the duplicate title-tags issue, I would think one would have the following options: Ignore what Google says Change the canonical to http://www.site.com/property/state.html (which would then only show the first XX results) Append a page number to the title "Properties for Sale in [state-name] | Page XX" Have all paginated pages use noindex,follow - this would then result in no category page being indexed Would you have the canonical point to the individual paginated page or the base page?
Technical SEO | | MagicDude4Eva2 -
Tags showing up in Google
Yesterday a user pointed out to me that Tags were being indexed in Google search results and that was not a good idea. I went into my Yoast settings and checked the "nofollow, index" in my Taxanomies, but when checking the source code for no follow, I found nothing. So instead, I went into the robot.txt and disallowed /tag/ Is that ok? or is that a bad idea? The site is The Tech Block for anyone interested in looking.
Technical SEO | | ttb0 -
Geotargeting duplicate content to different regions - href and canonical tag confusion
If you duplicate content onto a sub-folder for say a new US geotargeted site (to target kw spelling differences) and, in addition to GWT geotargeting settings, implement the 'Canonical' and 'Hreflang' tags on these new pages to show G different region and language version (en-us). Then does the original/main site similar pages also need to have canonical and href tags ? The main/original sites page I don't really want to target a specific country (although existing signals (hosting etc) will be UK (primary target of main site) but pages show up in other country searches too (which we want). Im presuming fine to leave the original/main site as it currently is although wording in google blog/webmaster central articles etc are a bit confusing hence why im asking for anyone elses opinion/input on this. Also is there are any benefit (or just best practice) to use 'www.example.com/en-us/...' in the subdirectory URL as opposed to just 'www.example.com/us/' many thanks in advance to any commentators 🙂
Technical SEO | | Dan-Lawrence0 -
Do I need to add canonical link tags to pages that I promote & track w/ UTM tags?
New to SEOmoz, loving it so far. I promote content on my site a lot and am diligent about using UTM tags to track conversions & attribute data properly. I was reading earlier about the use of link rel=canonical in the case of duplicate page content and can't find a conclusive answer whether or not I need to add the canonical tag to these pages. Do I need the canonical tag in this case? If so, can the canonical tag live in the HEAD section of the original / base page itself as well as any other URLs that call that content (that have UTM tags, etc)? Thank you.
Technical SEO | | askotzko1 -
Can I Disallow Faceted Nav URLs - Robots.txt
I have been disallowing /*? So I know that works without affecting crawling. I am wondering if I can disallow the faceted nav urls. So disallow: /category.html/? /category2.html/? /category3.html/*? To prevent the price faceted url from being cached: /category.html?price=1%2C1000
Technical SEO | | tylerfraser
and
/category.html?price=1%2C1000&product_material=88 Thanks!0 -
Robots.txt file getting a 500 error - is this a problem?
Hello all! While doing some routine health checks on a few of our client sites, I spotted that a new client of ours - who's website was not designed built by us - is returning a 500 internal server error when I try to look at the robots.txt file. As we don't host / maintain their site, I would have to go through their head office to get this changed, which isn't a problem but I just wanted to check whether this error will actually be having a negative effect on their site / whether there's a benefit to getting this changed? Thanks in advance!
Technical SEO | | themegroup0