Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Is there a way to get a list of Total Indexed pages from Google Webmaster Tools?
-
I'm doing a detailed analysis of how Google sees and indexes our website and we have found that there are 240,256 pages in the index which is way too many. It's an e-commerce site that needs some tidying up.
I'm working with an SEO specialist to set up URL parameters and put information in to the robots.txt file so the excess pages aren't indexed (we shouldn't have any more than around 3,00 - 4,000 pages) but we're struggling to find a way to get a list of these 240,256 pages as it would be helpful information in deciding what to put in the robots.txt file and which URL's we should ask Google to remove.
Is there a way to get a list of the URL's indexed? We can't find it in the Google Webmaster Tools.
-
Looks like I can only do the first thousand. It's a start though. Thank you for the information.
Many of the URL's on my list, when put in to Google search, are giving me 80-100 other variants I can remove by hand.
http://www.mathewporter.co.uk/list-a-domains-indexed-pages-in-google-docs/ for anyone else following.
-
Finally getting around to doing this and noticed that when I change the start number to anything above 900, it doesn't work - ie: it's only letting me look at the first 1,000 results for some reason.
The list of 1,000 has given me some good URL's to search off for the filtering thingy that was generating all the garbage URL's but I'd love to get past 1,000 if I can.
Does anyone know how?
-
Correct. I have gone in to URL Parameters already and set them to Crawl 'No URLs' for those we don't want crawled.
We haven't added those parameters listed in there in to the robots.txt file yet, but I will do that now. I had an initial consult today and we ran way over time when we discovered all this stuff so I have another appointment in a couple of weeks.
We have a sitemap of all the category pages and relevant static pages on the site already and Google has those indexed nicely. We just need to get rid of the 240,000 pages it has indexed that we don't want in there (frightening I know - it's a really high number).
I greatly appreciate you taking the time to respond. Thank you.
-
Thanks. There's a lot of auto-generated content, duplicate pages and we've set the robots.txt file up to exclude a large number of them. Now we wait.
Very helpful and greatly appreciated. Thank you.
-
Hi,
I'm going to assume that as you have said it's an e-commerce site that the URL parameters are created by product variations, filters, sorts etc. If so then you must already be seeing those parameters on the URL of your site as you navigate and in your analytics or search results.
Your SEO specialist should easily be able to add those parameters to the robots file. Then personally I would resubmit a site map for completeness and wait for results to take effect.
-
Joanne,
I'm afraid there's no way to know which pages are actually indexed from your Webmaster Tools. You can use a simple search in Google: site:domain.com and it will list "all" your indexed pages, however, there's no way to export that as a report.
You can create a report using some "hack". Login to your Google Drive, create a new spreadsheet and use the following command to populate rows:
=importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=1"; "//cite")
This will load the first 100 results. You will need to repeat the process for every 1000 results you have, changing the last variable: "start=1" to "start=100" and then "start=200", etc (you see where I'm going). This could really be a pain in the butt for your site's size.
My recommendation is you navigate your own site, decide which pages should be removed and then create the robots.txt regardless what google has indexed. Once you complete your robots.txt, it will take a few weeks (or even a month) to have the blocked pages removed.
Hope that helps!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Alternate page with proper canonical tag Status: Excluded in Google webmaster tools.
In Google Webmaster Tools, I have a coverage issue. I am getting this error message: Alternate page with proper canonical tag Status: Excluded. It gives the below blog post page as an example. Any idea how to resolve? At one time, I was using handl utm grabber, but the plugin is deactivated on my website. https://www.savacations.com/turrialba-costa-ricas-garden-city/?utm_source=deleted&utm_medium=deleted&utm_term=deleted&utm_content=deleted&utm_campaign=deleted&gclid=deleted5.
Intermediate & Advanced SEO | | Alancito0 -
Page with metatag noindex is STILL being indexed?!
Hi Mozers, There are over 200 pages from our site that have a meta tag "noindex" but are STILL being indexed. What else can I do to remove them from the Index?
Intermediate & Advanced SEO | | yaelslater0 -
Conditional Noindex for Dynamic Listing Pages?
Hi, We have dynamic listing pages that are sometimes populated and sometimes not populated. They are clinical trial results pages for disease types, some of which don't always have trials open. This means that sometimes the CMS produces a blank page -- pages that are then flagged as thin content. We're considering implementing a conditional noindex -- where the page is indexed only if there are results. However, I'm concerned that this will be confusing to Google and send a negative ranking signal. Any advice would be super helpful. Thanks!
Intermediate & Advanced SEO | | yaelslater0 -
Should I use the on classified listing pages that have expired?
We have went back and forth on this and wanted to get some outside input. I work for an online listing website that has classified ads on it. These ads are generated by companies on our site advertising weekend events around the country. We have about 10,000 companies that use our service to generate their online ads. This means that we have thousands of pages being created each week. The ads have lots of content: pictures, sale descriptions, and company information. After the ads have expired, and the sale is no longer happening, we are currently placing the in the heads of each page. The content is not relative anymore since the ad has ended. The only value the content offers a searcher is the images (there are millions on expired ads) and the descriptions of the items for sale. We currently are the leader in our industry and control most of the top spots on Google for our keywords. We have been worried about cluttering up the search results with pages of ads that are expired. In our Moz account right now we currently have over 28k crawler warnings alerting us to the being in the page heads of the expired ads. Seeing those warnings have made us nervous and second guessing what we are doing. Does anybody have any thoughts on this? Should we continue with placing the in the heads of the expired ads, or should we be allowing search engines to index the old pages. I have seen websites with discontinued products keeping the products around so that individuals can look up past information. This is the closest thing have seen to our situation. Any help or insight would be greatly appreciated! -Matt
Intermediate & Advanced SEO | | mellison0 -
Should I use noindex or robots to remove pages from the Google index?
I have a Magento site and just realized we have about 800 review pages indexed. The /review directory is disallowed in robots.txt but the pages are still indexed. From my understanding robots means it will not crawl the pages BUT if the pages are still indexed if they are linked from somewhere else. I can add the noindex tag to the review pages but they wont be crawled. https://www.seroundtable.com/google-do-not-use-noindex-in-robots-txt-20873.html Should I remove the robots.txt and add the noindex? Or just add the noindex to what I already have?
Intermediate & Advanced SEO | | Tylerj0 -
My site shows 503 error to Google bot, but can see the site fine. Not indexing in Google. Help
Hi, This site is not indexed on Google at all. http://www.thethreehorseshoespub.co.uk Looking into it, it seems to be giving a 503 error to the google bot. I can see the site I have checked source code Checked robots Did have a sitemap param. but removed it for testing GWMT is showing 'unreachable' if I submit a site map or fetch Any ideas on how to remove this error? Many thanks in advance
Intermediate & Advanced SEO | | SolveWebMedia0 -
Pages are Indexed but not Cached by Google. Why?
Here's an example: I get a 404 error for this: http://webcache.googleusercontent.com/search?q=cache:http://www.qjamba.com/restaurants-coupons/ferguson/mo/all But a search for qjamba restaurant coupons gives a clear result as does this: site:http://www.qjamba.com/restaurants-coupons/ferguson/mo/all What is going on? How can this page be indexed but not in the Google cache? I should make clear that the page is not showing up with any kind of error in webmaster tools, and Google has been crawling pages just fine. This particular page was fetched by Google yesterday with no problems, and even crawled again twice today by Google Yet, no cache.
Intermediate & Advanced SEO | | friendoffood2 -
Best way to permanently remove URLs from the Google index?
We have several subdomains we use for testing applications. Even if we block with robots.txt, these subdomains still appear to get indexed (though they show as blocked by robots.txt. I've claimed these subdomains and requested permanent removal, but it appears that after a certain time period (6 months)? Google will re-index (and mark them as blocked by robots.txt). What is the best way to permanently remove these from the index? We can't use login to block because our clients want to be able to view these applications without needing to login. What is the next best solution?
Intermediate & Advanced SEO | | nicole.healthline0