Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Removing duplicate content
-
Due to URL changes and parameters on our ecommerce sites, we have a massive amount of duplicate pages indexed by google, sometimes up to 5 duplicate pages with different URLs.
1. We've instituted canonical tags site wide.
2. We are using the parameters function in Webmaster Tools.
3. We are using 301 redirects on all of the obsolete URLs
4. I have had many of the pages fetched so that Google can see and index the 301s and canonicals.
5. I created HTML sitemaps with the duplicate URLs, and had Google fetch and index the sitemap so that the dupes would get crawled and deindexed.
None of these seems to be terribly effective. Google is indexing pages with parameters in spite of the parameter (clicksource) being called out in GWT. Pages with obsolete URLs are indexed in spite of them having 301 redirects. Google also appears to be ignoring many of our canonical tags as well, despite the pages being identical.
Any ideas on how to clean up the mess?
-
Where this is appearing the most is on cross domain canonicals. We have duplicate content across 2 websites, and we've canonicaled some pages from Site A to Site B, and some from Site B to Site A. In theory, pages that were canonicaled to the other domain should be deindexed. When I run a rankings report, I see pages for the wrong domain ranking, a month later. They are pages with parameters, or old URLs that we've changed. It's like a game of whack a mole. Every time we get a page deindexed, a duplicate with a different parameter takes its place. And this is in spite of calling out these parameters in GWT.
What I imagine is happening is that we have several URLs for the same page indexed. When Google crawls our site, it is correctly canonicaling the page it crawls. In the rankings, however, Google is probably pulling a duplicate page out of its index, and ranking it without crawling it. If it was crawling it, Google would see the canonical tag, and not rank it. So we have an ongoing battle to get Google to crawl the page it just pulled out of its index to see the the canonical tag.
The reason for all this is that when a page cross domain canonicals correctly, the rankings for the duplicate page on the other site goes up dramatically. As long as Google keeps ranking the wrong pages, we don't get the rankings bump on the other site.
-
Are you basing this on a site: search? It's fairly common for URLs to appear in a site: search that otherwise will not appear for any actual searches. Are the undesirable versions of the URLs getting any search traffic?
-
Yes, as Patrick said, surprisingly often something like this is a result of a simple oversight because we have been looking at the same code over and over...
Do you have access to Screaming Frog? You could crawl your site and see whether redirects/canonicals are behaving as you expected.
Have you taken a look at the html of one of the incorrectly indexed pages when it is loaded in your browser? Can you see the canonical? If you try going to a redirected page, does it redirect? [I know--way to obvious, but sometimes it is good to start at the beginning again when we can't root out an issue.]
Another culprit in these cases can be internal links. Do you link internally using any of the undesirable URLs? That can send a message to Google that those URLs are still in play. Again, you can use Screaming Frog to find those strings.
-
It sounds like part of the problem may be the sitemaps you're sending. By including duplicates in a sitemap, you're basically telling Google that each version of the page is valid. I would remove them and resubmit a sitemap with only the canonical versions you want indexed and see if that helps.
-
Hi there
Are you sure you are using all of the tools above properly? Not saying you're not but people make mistakes and it's just something to look into.
When did you implement all of the changes? Was it recently or was it a long time ago?
How is your organic traffic and rankings? Did you check if you have a manual action at all?
Let me know - thanks!
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Same content, different languages. Duplicate content issue? | international SEO
Hi, If the "content" is the same, but is written in different languages, will Google see the articles as duplicate content?
Intermediate & Advanced SEO | | chalet
If google won't see it as duplicate content. What is the profit of implementing the alternate lang tag?Kind regards,Jeroen0 -
Country Code Top Level Domains & Duplicate Content
Hi looking to launch in a new market, currently we have a .com.au domain which is geo-targeted to Australia. We want to launch in New Zealand which is ends with .co.nz If i duplicate the Australian based site completely on the new .co.nz domain name, would i face duplicate content issues from a SEO standpoint?
Intermediate & Advanced SEO | | jayoliverwright
Even though it's on a completely separate country code. Or is it still advised tosetup hreflang tag across both of the domains? Cheers.0 -
Trailing Slashes for Magento CMS pages - 2 URLS - Duplicate content
Hello, Can anyone help me find a solution to Fixing and Creating Magento CMS pages to only use one URL and not two URLS? www.domain.com/testpage www.domain.com/testpage/ I found a previous article that applies to my issue, which is using htaccess to redirect request for pages in magento 301 redirect to slash URL from the non-slash URL. I dont understand the syntax fully in htaccess , but I used this code below. This code below fixed the CMS page redirection but caused issues on other pages, like all my categories and products with this error: "This webpage has a redirect loop ERR_TOO_MANY_REDIRECTS" Assuming you're running at domain root. Change to working directory if needed. RewriteBase / # www check If you're running in a subdirectory, then you'll need to add that in to the redirected url (http://www.mydomain.com/subdirectory/$1 RewriteCond %{HTTP_HOST} !^www. [NC]
Intermediate & Advanced SEO | | iamgreenminded
RewriteRule ^(.*)$ http://www.mydomain.com/$1 [R=301,L] Trailing slash check Don't fix direct file links RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_URI} !(.)/$
RewriteRule ^(.)$ $1/ [L,R=301] Finally, forward everything to your front-controller (index.php) RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* index.php [QSA,L]0 -
Avoiding Duplicate Content with Used Car Listings Database: Robots.txt vs Noindex vs Hash URLs (Help!)
Hi Guys, We have developed a plugin that allows us to display used vehicle listings from a centralized, third-party database. The functionality works similar to autotrader.com or cargurus.com, and there are two primary components: 1. Vehicle Listings Pages: this is the page where the user can use various filters to narrow the vehicle listings to find the vehicle they want.
Intermediate & Advanced SEO | | browndoginteractive
2. Vehicle Details Pages: this is the page where the user actually views the details about said vehicle. It is served up via Ajax, in a dialog box on the Vehicle Listings Pages. Example functionality: http://screencast.com/t/kArKm4tBo The Vehicle Listings pages (#1), we do want indexed and to rank. These pages have additional content besides the vehicle listings themselves, and those results are randomized or sliced/diced in different and unique ways. They're also updated twice per day. We do not want to index #2, the Vehicle Details pages, as these pages appear and disappear all of the time, based on dealer inventory, and don't have much value in the SERPs. Additionally, other sites such as autotrader.com, Yahoo Autos, and others draw from this same database, so we're worried about duplicate content. For instance, entering a snippet of dealer-provided content for one specific listing that Google indexed yielded 8,200+ results: Example Google query. We did not originally think that Google would even be able to index these pages, as they are served up via Ajax. However, it seems we were wrong, as Google has already begun indexing them. Not only is duplicate content an issue, but these pages are not meant for visitors to navigate to directly! If a user were to navigate to the url directly, from the SERPs, they would see a page that isn't styled right. Now we have to determine the right solution to keep these pages out of the index: robots.txt, noindex meta tags, or hash (#) internal links. Robots.txt Advantages: Super easy to implement Conserves crawl budget for large sites Ensures crawler doesn't get stuck. After all, if our website only has 500 pages that we really want indexed and ranked, and vehicle details pages constitute another 1,000,000,000 pages, it doesn't seem to make sense to make Googlebot crawl all of those pages. Robots.txt Disadvantages: Doesn't prevent pages from being indexed, as we've seen, probably because there are internal links to these pages. We could nofollow these internal links, thereby minimizing indexation, but this would lead to each 10-25 noindex internal links on each Vehicle Listings page (will Google think we're pagerank sculpting?) Noindex Advantages: Does prevent vehicle details pages from being indexed Allows ALL pages to be crawled (advantage?) Noindex Disadvantages: Difficult to implement (vehicle details pages are served using ajax, so they have no tag. Solution would have to involve X-Robots-Tag HTTP header and Apache, sending a noindex tag based on querystring variables, similar to this stackoverflow solution. This means the plugin functionality is no longer self-contained, and some hosts may not allow these types of Apache rewrites (as I understand it) Forces (or rather allows) Googlebot to crawl hundreds of thousands of noindex pages. I say "force" because of the crawl budget required. Crawler could get stuck/lost in so many pages, and my not like crawling a site with 1,000,000,000 pages, 99.9% of which are noindexed. Cannot be used in conjunction with robots.txt. After all, crawler never reads noindex meta tag if blocked by robots.txt Hash (#) URL Advantages: By using for links on Vehicle Listing pages to Vehicle Details pages (such as "Contact Seller" buttons), coupled with Javascript, crawler won't be able to follow/crawl these links. Best of both worlds: crawl budget isn't overtaxed by thousands of noindex pages, and internal links used to index robots.txt-disallowed pages are gone. Accomplishes same thing as "nofollowing" these links, but without looking like pagerank sculpting (?) Does not require complex Apache stuff Hash (#) URL Disdvantages: Is Google suspicious of sites with (some) internal links structured like this, since they can't crawl/follow them? Initially, we implemented robots.txt--the "sledgehammer solution." We figured that we'd have a happier crawler this way, as it wouldn't have to crawl zillions of partially duplicate vehicle details pages, and we wanted it to be like these pages didn't even exist. However, Google seems to be indexing many of these pages anyway, probably based on internal links pointing to them. We could nofollow the links pointing to these pages, but we don't want it to look like we're pagerank sculpting or something like that. If we implement noindex on these pages (and doing so is a difficult task itself), then we will be certain these pages aren't indexed. However, to do so we will have to remove the robots.txt disallowal, in order to let the crawler read the noindex tag on these pages. Intuitively, it doesn't make sense to me to make googlebot crawl zillions of vehicle details pages, all of which are noindexed, and it could easily get stuck/lost/etc. It seems like a waste of resources, and in some shadowy way bad for SEO. My developers are pushing for the third solution: using the hash URLs. This works on all hosts and keeps all functionality in the plugin self-contained (unlike noindex), and conserves crawl budget while keeping vehicle details page out of the index (unlike robots.txt). But I don't want Google to slap us 6-12 months from now because it doesn't like links like these (). Any thoughts or advice you guys have would be hugely appreciated, as I've been going in circles, circles, circles on this for a couple of days now. Also, I can provide a test site URL if you'd like to see the functionality in action.0 -
Duplicate Content www vs. non-www and best practices
I have a customer who had prior help on his website and I noticed a 301 redirect in his .htaccess Rule for duplicate content removal : www.domain.com vs domain.com RewriteCond %{HTTP_HOST} ^MY-CUSTOMER-SITE.com [NC]
Intermediate & Advanced SEO | | EnvoyWeb
RewriteRule (.*) http://www.MY-CUSTOMER-SITE.com/$1 [R=301,L,NC] The result of this rule is that i type MY-CUSTOMER-SITE.com in the browser and it redirects to www.MY-CUSTOMER-SITE.com I wonder if this is causing issues in SERPS. If I have some inbound links pointing to www.MY-CUSTOMER-SITE.com and some pointing to MY-CUSTOMER-SITE.com, I would think that this rewrite isn't necessary as it would seem that Googlebot is smart enough to know that these aren't two sites. -----Can you comment on whether this is a best practice for all domains?
-----I've run a report for backlinks. If my thought is true that there are some pointing to www.www.MY-CUSTOMER-SITE.com and some to the www.MY-CUSTOMER-SITE.com, is there any value in addressing this?0 -
Problems with ecommerce filters causing duplicate content.
We have an ecommerce website with 700 pages. Due to the implementation of filters, we are seeing upto 11,000 pages being indexed where the filter tag is apphended to the URL. This is causing duplicate content issues across the site. We tried adding "nofollow" to all the filters, we have also tried adding canonical tags, which it seems are being ignored. So how can we fix this? We are now toying with 2 other ideas to fix this issue; adding "no index" to all filtered pages making the filters uncrawble using javascript Has anyone else encountered this issue? If so what did you do to combat this and was it successful?
Intermediate & Advanced SEO | | Silkstream0 -
PDF for link building - avoiding duplicate content
Hello, We've got an article that we're turning into a PDF. Both the article and the PDF will be on our site. This PDF is a good, thorough piece of content on how to choose a product. We're going to strip out all of the links to our in the article and create this PDF so that it will be good for people to reference and even print. Then we're going to do link building through outreach since people will find the article and PDF useful. My question is, how do I use rel="canonical" to make sure that the article and PDF aren't duplicate content? Thanks.
Intermediate & Advanced SEO | | BobGW0 -
Duplicate Content on Wordpress b/c of Pagination
On my recent crawl, there were a great many duplicate content penalties. The site is http://dailyfantasybaseball.org. The issue is: There's only one post per page. Therefore, because of wordpress's (or genesis's) pagination, a page gets created for every post, thereby leaving basically every piece of content i write as a duplicate. I feel like the engines should be smart enough to figure out what's going on, but if not, I will get hammered. What should I do moving forward? Thanks!
Intermediate & Advanced SEO | | Byron_W0