Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Disallowed Pages Still Showing Up in Google Index. What do we do?
-
We recently disallowed a wide variety of pages for www.udemy.com which we do not want google indexing (e.g., /tags or /lectures). Basically we don't want to spread our link juice around to all these pages that are never going to rank. We want to keep it focused on our core pages which are for our courses.
We've added them as disallows in robots.txt, but after 2-3 weeks google is still showing them in it's index. When we lookup "site: udemy.com", for example, Google currently shows ~650,000 pages indexed... when really it should only be showing ~5,000 pages indexed.
As another example, if you search for "site:udemy.com/tag", google shows 129,000 results. We've definitely added "/tag" into our robots.txt properly, so this should not be happening... Google showed be showing 0 results.
Any ideas re: how we get Google to pay attention and re-index our site properly?
-
The last time I used a tool, excluding via robots.txt was also sufficient for URL removal.
Recently, Google has updated their documentation to strongly encourage you to use URL removal only for things like exposing confidential information, and not to clean up old pages or errors in your GWT account (see http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1269119). I know many people still use the tool for that type of stuff, but wanted to point out that change.
-
Thank you Keri.
Yes, good idea, but whatever you request, that page or directory must respond with a 404, otherwise, it will be ignored.
- that is why I couldn't do that with the send to a friend URLs
(would have been a nice thing to do)
I guess I could have cheated, and made them return a 404 if it was google, just to dump them all out of the index.
The 15,000 I did request to be removed were individual pages, that returned 404 response code, so thats why I did them one at a time. I could have waited, but if you wait, then google keeps trying to fetch those missing pages and they keep reporting them in your GWT.
That is a good reason to request the removals.
I actually gave up when the number of deletions got to 1.5 million. I figured it was just too hard to do.
-
The last time I looked, you can request removal of an entire directory as well, which should work for the OP.
-
I would have said the same thing, except that a few weeks ago, I removed a rule from the robots file and I changed the affected pages to have a noindex.nofollow and the next day, tens of thousands of those pages appeared in the index and overpowered the content pages.
So my advice, is don't trust noindex,nofollow and just stop the robot going down that tree (as you are doing) and find another way to get those pages out of the index.
You can use the URL removal request tool.
It only seems to allow you to remove 1000 per day.
I have done this before by automating the removal using a macro program.
I think I removed about 15,000 over the space of a month, doing that.
They are fairly fast at removing URLs these days, 24 hours or less.
-
Disallowing in your robots.txt keeps the bots from indexing your pages going forward, but Google may keep returning them in search results. This post has great explanations about ways to remove pages from indices: http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts
The surefire way to get them out of the index is to remove the disallow from your robots.txt, and add a meta noindex tags on all the pages you want removed. Once they're reindexed by Google, they'll no longer appear in SERPs.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Home Page Disappears From Google - But Rest of Site Still Ranked
As title suggests we are running into a serious issue of the home page disapearing from Google search results whilst the rest of the site still remains. We search for it naturally cannot find a trace, then use a "site:" command in Google and still the home page does not come up. We go into web masters and inspect the home page and even Google states that the page is indexable. We then run the "Request Indexing" and the site comes back on Google. This is having a damaging affect and we would like to understand why this issue is happening. Please note this is not happening on just one of our sites but has happened to three which are all located on the same server. One of our brand which has the issue is: www.henweekends.co.uk
Intermediate & Advanced SEO | | JH_OffLimits0 -
Google Is Indexing my 301 Redirects to Other sites
Long story but now i have a few links from my site 301 redirecting to youtube videos or eCommerce stores. They carry a considerable amount of traffic that i benefit from so i can't take them down, and that traffic is people from other websites, so basically i have backlinks from places that i don't own, to my redirect urls (Ex. http://example.com/redirect) My problem is that google is indexing them and doesn't let them go, i have tried blocking that url from robots.txt but google is still indexing it uncrawled, i have also tried allowing google to crawl it and adding noindex from robots.txt, i have tried removing it from GWT but it pops back again after a few days. Any ideas? Thanks!
Intermediate & Advanced SEO | | cuarto7150 -
My site shows 503 error to Google bot, but can see the site fine. Not indexing in Google. Help
Hi, This site is not indexed on Google at all. http://www.thethreehorseshoespub.co.uk Looking into it, it seems to be giving a 503 error to the google bot. I can see the site I have checked source code Checked robots Did have a sitemap param. but removed it for testing GWMT is showing 'unreachable' if I submit a site map or fetch Any ideas on how to remove this error? Many thanks in advance
Intermediate & Advanced SEO | | SolveWebMedia0 -
Pages are Indexed but not Cached by Google. Why?
Here's an example: I get a 404 error for this: http://webcache.googleusercontent.com/search?q=cache:http://www.qjamba.com/restaurants-coupons/ferguson/mo/all But a search for qjamba restaurant coupons gives a clear result as does this: site:http://www.qjamba.com/restaurants-coupons/ferguson/mo/all What is going on? How can this page be indexed but not in the Google cache? I should make clear that the page is not showing up with any kind of error in webmaster tools, and Google has been crawling pages just fine. This particular page was fetched by Google yesterday with no problems, and even crawled again twice today by Google Yet, no cache.
Intermediate & Advanced SEO | | friendoffood2 -
How is Google crawling and indexing this directory listing?
We have three Directory Listing pages that are being indexed by Google: http://www.ccisolutions.com/StoreFront/jsp/ http://www.ccisolutions.com/StoreFront/jsp/html/ http://www.ccisolutions.com/StoreFront/jsp/pdf/ How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although the /jsp.html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file and I understand that this could be why. If we add them to our robots.txt file and disallow, will this prevent Googlebot from crawling and indexing those Directory Listing pages without prohibiting them from crawling and indexing the content that resides there which is used to populate pages on our site? Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content. For example, this file <tt>CCI-SALES-STAFF.HTML</tt> (which appears on this Directory Listing referenced above - http://www.ccisolutions.com/StoreFront/jsp/html/) clicks through to this Web page: http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff As you can see, this results in duplicate content problems. Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result? For example: Disallow: /StoreFront/jsp/ Disallow: /StoreFront/jsp/html/ Disallow: /StoreFront/jsp/pdf/ Can we do this without risking blocking Googlebot from content we do want crawled and indexed? Many thanks in advance for any and all help on this one!
Intermediate & Advanced SEO | | danatanseo0 -
Google Not Indexing XML Sitemap Images
Hi Mozzers, We are having an issue with our XML sitemap images not being indexed. The site has over 39,000 pages and 17,500 images submitted in GWT. If you take a look at the attached screenshot, 'GWT Images - Not Indexed', you can see that the majority of the pages are being indexed - but none of the images are. The first thing you should know about the images is that they are hosted on a content delivery network (CDN), rather than on the site itself. However, Google advice suggests hosting on a CDN is fine - see second screenshot, 'Google CDN Advice'. That advice says to either (i) ensure the hosting site is verified in GWT or (ii) submit in robots.txt. As we can't verify the hosting site in GWT, we had opted to submit via robots.txt. There are 3 sitemap indexes: 1) http://www.greenplantswap.co.uk/sitemap_index.xml, 2) http://www.greenplantswap.co.uk/sitemap/plant_genera/listings.xml and 3) http://www.greenplantswap.co.uk/sitemap/plant_genera/plants.xml. Each sitemap index is split up into often hundreds or thousands of smaller XML sitemaps. This is necessary due to the size of the site and how we have decided to pull URLs in. Essentially, if we did it another way, it may have involved some of the sitemaps being massive and thus taking upwards of a minute to load. To give you an idea of what is being submitted to Google in one of the sitemaps, please see view-source:http://www.greenplantswap.co.uk/sitemap/plant_genera/4/listings.xml?page=1. Originally, the images were SSL, so we decided to reverted to non-SSL URLs as that was an easy change. But over a week later, that seems to have had no impact. The image URLs are ugly... but should this prevent them from being indexed? The strange thing is that a very small number of images have been indexed - see http://goo.gl/P8GMn. I don't know if this is an anomaly or whether it suggests no issue with how the images have been set up - thus, there may be another issue. Sorry for the long message but I would be extremely grateful for any insight into this. I have tried to offer as much information as I can, however please do let me know if this is not enough. Thank you for taking the time to read and help. Regards, Mark Oz6HzKO rYD3ICZ
Intermediate & Advanced SEO | | edlondon0 -
Google Indexing Feedburner Links???
I just noticed that for lots of the articles on my website, there are two results in Google's index. For instance: http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html and http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+thewebhostinghero+(TheWebHostingHero.com) Now my Feedburner feed is set to "noindex" and it's always been that way. The canonical tag on the webpage is set to: rel='canonical' href='http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html' /> The robots tag is set to: name="robots" content="index,follow,noodp" /> I found out that there are scrapper sites that are linking to my content using the Feedburner link. So should the robots tag be set to "noindex" when the requested URL is different from the canonical URL? If so, is there an easy way to do this in Wordpress?
Intermediate & Advanced SEO | | sbrault740 -
Getting Pages Requiring Login Indexed
Somehow certain newspapers' webpages show up in the index but require login. My client has a whole section of the site that requires a login (registration is free), and we'd love to get that content indexed. The developer offered to remove the login requirement for specific user agents (eg Googlebot, et al.). I am afraid this might get us penalized. Any insight?
Intermediate & Advanced SEO | | TheEspresseo0