Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Could you use a robots.txt file to disalow a duplicate content page from being crawled?
-
A website has duplicate content pages to make it easier for users to find the information from a couple spots in the site navigation. Site owner would like to keep it this way without hurting SEO.
I've thought of using the robots.txt file to disallow search engines from crawling one of the pages. Would you think this is a workable/acceptable solution?
-
Yeah, sorry for the confusion. I put the tag on all the pages (Original and Duplicate). I sent you a PM with another good article on Rel canonical tag
-
Peter, Thanks for the clarification.
-
Generally agree, although I'd just add that Robots.txt also isn't so great at removing content that's already been indexed (it's better at prevention). So, I find that it's not just not ideal - it sometimes doesn't even work in these cases.
Rel-canonical is generally a good bet, and it should go on the duplicate (you can actually put it on both, although it's not necessary).
-
Next time I'll read the reference links better
Thank you!
-
per google webmaster tools:
If Google knows that these pages have the same content, we may index only one version for our search results. Our algorithms select the page we think best answers the user's query. Now, however, users can specify a canonical page to search engines by adding a element with the attribute
rel="canonical"
to the section of the non-canonical version of the page. Adding this link and attribute lets site owners identify sets of identical content and suggest to Google: "Of all these pages with identical content, this page is the most useful. Please prioritize it in search results." -
Thanks Kyle. Anthony had a similar view on using the rel canonical tag. I'm just curious about adding it to both the original page or duplicate page? Or both?
Thanks,
Greg
-
Anthony, Thanks for your response. See Kyle, he also felt using the rel canonical tag was the best thing to do. However he seemed to think you'd put it on the original page - the one you want to rank for. And you're suggesting putting on the duplicate page. Should it be added to both while specifying which page is the 'original'?
Thanks!
Greg
-
I'm not sure I understand why the site owner seems to think that the duplicate content is necessary?
If I was in your situation I would be trying to convince the client to remove the duplicate content from their site, rather than trying to find a way around it.
If the information is difficult to find then this may be due to a problem with the site architecture. If the site does not flow well enough for visitors to find the information they need, then perhaps a site redesign is necessary.
-
Well, the answer would be yes and no. A robots.txt file would stop the bots from indexing the page, but links from other pages in site to that non indexed page could therefor make it crawlable and then indexed. AS posted in google webmaster tools here:
"You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one).
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results."
I think the best way to avoid any conflict is applying the rel="canonical" tag to each duplicate page that you don't want indexed.
You can find more info on rel canonical here
Hope this helps out some.
-
The best way would be to use the Rel canonical tag
On the page you would like to rank for put the Rel canonical tag in
This lets google know that this is the original page.
Check out this link posted by Rand about the Rel canonical tag [http://www.seomoz.org/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps](http://www.seomoz.org/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps)
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Category Pages & Content
Hi Does anyone have any great examples of an ecommerce site which has great content on category pages or product listing pages? Thanks!
Intermediate & Advanced SEO | | BeckyKey1 -
Large robots.txt file
We're looking at potentially creating a robots.txt with 1450 lines in it. This will remove 100k+ pages from the crawl that are all old pages (I know, the ideal would be to delete/noindex but not viable unfortunately) Now the issue i'm thinking is that a large robots.txt will either stop the robots.txt from being followed or will slow our crawl rate down. Does anybody have any experience with a robots.txt of that size?
Intermediate & Advanced SEO | | ThomasHarvey0 -
Should I be using meta robots tags on thank you pages with little content?
I'm working on a website with hundreds of thank you pages, does it make sense to no follow, no index these pages since there's little content on them? I'm thinking this should save me some crawl budget overall but is there any risk in cutting out the internal links found on the thank you pages? (These are only standard site-wide footer and navigation links.) Thanks!
Intermediate & Advanced SEO | | GSO0 -
Should comments and feeds be disallowed in robots.txt?
Hi My robots file is currently set up as listed below. From an SEO point of view is it good to disallow feeds, rss and comments? I feel allowing comments would be a good thing because it's new content that may rank in the search engines as the comments left on my blog often refer to questions or companies folks are searching for more information on. And the comments are added regularly. What's your take? I'm also concerned about the /page being blocked. Not sure how that benefits my blog from an SEO point of view as well. Look forward to your feedback. Thanks. Eddy User-agent: Googlebot Crawl-delay: 10 Allow: /* User-agent: * Crawl-delay: 10 Disallow: /wp- Disallow: /feed/ Disallow: /trackback/ Disallow: /rss/ Disallow: /comments/feed/ Disallow: /page/ Disallow: /date/ Disallow: /comments/ # Allow Everything Allow: /*
Intermediate & Advanced SEO | | workathomecareers0 -
International SEO - cannibalisation and duplicate content
Hello all, I look after (in house) 3 domains for one niche travel business across three TLDs: .com .com.au and co.uk and a fourth domain on a co.nz TLD which was recently removed from Googles index. Symptoms: For the past 12 months we have been experiencing canibalisation in the SERPs (namely .com.au being rendered in .com) and Panda related ranking devaluations between our .com site and com.au site. Around 12 months ago the .com TLD was hit hard (80% drop in target KWs) by Panda (probably) and we began to action the below changes. Around 6 weeks ago our .com TLD saw big overnight increases in rankings (to date a 70% averaged increase). However, almost to the same percentage we saw in the .com TLD we suffered significant drops in our .com.au rankings. Basically Google seemed to switch its attention from .com TLD to the .com.au TLD. Note: Each TLD is over 6 years old, we've never proactively gone after links (Penguin) and have always aimed for quality in an often spammy industry. **Have done: ** Adding HREF LANG markup to all pages on all domain Each TLD uses local vernacular e.g for the .com site is American Each TLD has pricing in the regional currency Each TLD has details of the respective local offices, the copy references the lacation, we have significant press coverage in each country like The Guardian for our .co.uk site and Sydney Morning Herlad for our Australia site Targeting each site to its respective market in WMT Each TLDs core-pages (within 3 clicks of the primary nav) are 100% unique We're continuing to re-write and publish unique content to each TLD on a weekly basis As the .co.nz site drove such little traffic re-wrting we added no-idex and the TLD has almost compelte dissapread (16% of pages remain) from the SERPs. XML sitemaps Google + profile for each TLD **Have not done: ** Hosted each TLD on a local server Around 600 pages per TLD are duplicated across all TLDs (roughly 50% of all content). These are way down the IA but still duplicated. Images/video sources from local servers Added address and contact details using SCHEMA markup Any help, advice or just validation on this subject would be appreciated! Kian
Intermediate & Advanced SEO | | team_tic1 -
How to Remove Joomla Canonical and Duplicate Page Content
I've attempted to follow advice from the Q&A section. Currently on the site www.cherrycreekspine.com, I've edited the .htaccess file to help with 301s - all pages redirect to www.cherrycreekspine.com. Secondly, I'd added the canonical statement in the header of the web pages. I have cut the Duplicate Page Content in half ... now I have a remaining 40 pages to fix up. This is my practice site to try and understand what SEOmoz can do for me. I've looked at some of your videos on Youtube ... I feel like I'm scrambling around to the Q&A and the internet to understand this product. I'm reading the beginners guide.... any other resources would be helpful.
Intermediate & Advanced SEO | | deskstudio0 -
Soft 404's from pages blocked by robots.txt -- cause for concern?
We're seeing soft 404 errors appear in our google webmaster tools section on pages that are blocked by robots.txt (our search result pages). Should we be concerned? Is there anything we can do about this?
Intermediate & Advanced SEO | | nicole.healthline4 -
Are duplicate links on same page alright?
If I have a homepage with category links, is it alright for those category links to appear in the footer as well, or should you never have duplicate links on one page? Can you please give a reason why as well? Thanks!
Intermediate & Advanced SEO | | dkamen0