Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Is there a way to get a list of Total Indexed pages from Google Webmaster Tools?
-
I'm doing a detailed analysis of how Google sees and indexes our website and we have found that there are 240,256 pages in the index which is way too many. It's an e-commerce site that needs some tidying up.
I'm working with an SEO specialist to set up URL parameters and put information in to the robots.txt file so the excess pages aren't indexed (we shouldn't have any more than around 3,00 - 4,000 pages) but we're struggling to find a way to get a list of these 240,256 pages as it would be helpful information in deciding what to put in the robots.txt file and which URL's we should ask Google to remove.
Is there a way to get a list of the URL's indexed? We can't find it in the Google Webmaster Tools.
-
Looks like I can only do the first thousand. It's a start though. Thank you for the information.
Many of the URL's on my list, when put in to Google search, are giving me 80-100 other variants I can remove by hand.
http://www.mathewporter.co.uk/list-a-domains-indexed-pages-in-google-docs/ for anyone else following.
-
Finally getting around to doing this and noticed that when I change the start number to anything above 900, it doesn't work - ie: it's only letting me look at the first 1,000 results for some reason.
The list of 1,000 has given me some good URL's to search off for the filtering thingy that was generating all the garbage URL's but I'd love to get past 1,000 if I can.
Does anyone know how?
-
Correct. I have gone in to URL Parameters already and set them to Crawl 'No URLs' for those we don't want crawled.
We haven't added those parameters listed in there in to the robots.txt file yet, but I will do that now. I had an initial consult today and we ran way over time when we discovered all this stuff so I have another appointment in a couple of weeks.
We have a sitemap of all the category pages and relevant static pages on the site already and Google has those indexed nicely. We just need to get rid of the 240,000 pages it has indexed that we don't want in there (frightening I know - it's a really high number).
I greatly appreciate you taking the time to respond. Thank you.
-
Thanks. There's a lot of auto-generated content, duplicate pages and we've set the robots.txt file up to exclude a large number of them. Now we wait.
Very helpful and greatly appreciated. Thank you.
-
Hi,
I'm going to assume that as you have said it's an e-commerce site that the URL parameters are created by product variations, filters, sorts etc. If so then you must already be seeing those parameters on the URL of your site as you navigate and in your analytics or search results.
Your SEO specialist should easily be able to add those parameters to the robots file. Then personally I would resubmit a site map for completeness and wait for results to take effect.
-
Joanne,
I'm afraid there's no way to know which pages are actually indexed from your Webmaster Tools. You can use a simple search in Google: site:domain.com and it will list "all" your indexed pages, however, there's no way to export that as a report.
You can create a report using some "hack". Login to your Google Drive, create a new spreadsheet and use the following command to populate rows:
=importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=1"; "//cite")
This will load the first 100 results. You will need to repeat the process for every 1000 results you have, changing the last variable: "start=1" to "start=100" and then "start=200", etc (you see where I'm going). This could really be a pain in the butt for your site's size.
My recommendation is you navigate your own site, decide which pages should be removed and then create the robots.txt regardless what google has indexed. Once you complete your robots.txt, it will take a few weeks (or even a month) to have the blocked pages removed.
Hope that helps!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Conditional Noindex for Dynamic Listing Pages?
Hi, We have dynamic listing pages that are sometimes populated and sometimes not populated. They are clinical trial results pages for disease types, some of which don't always have trials open. This means that sometimes the CMS produces a blank page -- pages that are then flagged as thin content. We're considering implementing a conditional noindex -- where the page is indexed only if there are results. However, I'm concerned that this will be confusing to Google and send a negative ranking signal. Any advice would be super helpful. Thanks!
Intermediate & Advanced SEO | | yaelslater0 -
Should I use noindex or robots to remove pages from the Google index?
I have a Magento site and just realized we have about 800 review pages indexed. The /review directory is disallowed in robots.txt but the pages are still indexed. From my understanding robots means it will not crawl the pages BUT if the pages are still indexed if they are linked from somewhere else. I can add the noindex tag to the review pages but they wont be crawled. https://www.seroundtable.com/google-do-not-use-noindex-in-robots-txt-20873.html Should I remove the robots.txt and add the noindex? Or just add the noindex to what I already have?
Intermediate & Advanced SEO | | Tylerj0 -
Why does Google rank a product page rather than a category page?
Hi, everybody In the Moz ranking tool for one of our client's (the client sells sport equipment) account, there is a trend where more and more of their landing pages are product pages instead of category pages. The optimal landing page for the term "sleeping bag" is of course the sleeping bag category page, but Google is sending them to a product page for a specific sleeping bag.. What could be the critical factors that makes the product page more relevant than the category page as the landing page?
Intermediate & Advanced SEO | | Inevo0 -
Mass Removal Request from Google Index
Hi, I am trying to cleanse a news website. When this website was first made, the people that set it up copied all kinds of articles they had as a newspaper, including tests, internal communication, and drafts. This site has lots of junk, but this kind of junk was on the initial backup, aka before 1st-June-2012. So, removing all mixed content prior to that date, we can have pure articles starting June 1st, 2012! Therefore My dynamic sitemap now contains only articles with release date between 1st-June-2012 and now Any article that has release date prior to 1st-June-2012 returns a custom 404 page with "noindex" metatag, instead of the actual content of the article. The question is how I can remove from the google index all this junk as fast as possible that is not on the site anymore, but still appears in google results? I know that for individual URLs I need to request removal from this link
Intermediate & Advanced SEO | | ioannisa
https://www.google.com/webmasters/tools/removals The problem is doing this in bulk, as there are tens of thousands of URLs I want to remove. Should I put the articles back to the sitemap so the search engines crawl the sitemap and see all the 404? I believe this is very wrong. As far as I know this will cause problems because search engines will try to access non existent content that is declared as existent by the sitemap, and return errors on the webmasters tools. Should I submit a DELETED ITEMS SITEMAP using the <expires>tag? I think this is for custom search engines only, and not for the generic google search engine.
https://developers.google.com/custom-search/docs/indexing#on-demand-indexing</expires> The site unfortunatelly doesn't use any kind of "folder" hierarchy in its URLs, but instead the ugly GET params, and a kind of folder based pattern is impossible since all articles (removed junk and actual articles) are of the form:
http://www.example.com/docid=123456 So, how can I bulk remove from the google index all the junk... relatively fast?0 -
Google indexing only 1 page out of 2 similar pages made for different cities
We have created two category pages, in which we are showing products which could be delivered in separate cities. Both pages are related to cake delivery in that city. But out of these two category pages only 1 got indexed in google and other has not. Its been around 1 month but still only Bangalore category page got indexed. We have submitted sitemap and google is not giving any crawl error. We have also submitted for indexing from "Fetch as google" option in webmasters. www.winni.in/c/4/cakes (Indexed - Bangalore page - http://www.winni.in/sitemap/sitemap_blr_cakes.xml) 2. http://www.winni.in/hyderabad/cakes/c/4 (Not indexed - Hyderabad page - http://www.winni.in/sitemap/sitemap_hyd_cakes.xml) I tried searching for "hyderabad site:www.winni.in" in google but there also http://www.winni.in/hyderabad/cakes/c/4 this link is not coming, instead of this only www.winni.in/c/4/cakes is coming. Can anyone please let me know what could be the possible issue with this?
Intermediate & Advanced SEO | | abhihan0 -
Dev Subdomain Pages Indexed - How to Remove
I own a website (domain.com) and used the subdomain "dev.domain.com" while adding a new section to the site (as a development link). I forgot to block the dev.domain.com in my robots file, and google indexed all of the dev pages (around 100 of them). I blocked the site (dev.domain.com) in robots, and then proceeded to just delete the entire subdomain altogether. It's been about a week now and I still see the subdomain pages indexed on Google. How do I get these pages removed from Google? Are they causing duplicate content/title issues, or does Google know that it's a development subdomain and it's just taking time for them to recognize that I deleted it already?
Intermediate & Advanced SEO | | WebServiceConsulting.com0 -
Getting Pages Requiring Login Indexed
Somehow certain newspapers' webpages show up in the index but require login. My client has a whole section of the site that requires a login (registration is free), and we'd love to get that content indexed. The developer offered to remove the login requirement for specific user agents (eg Googlebot, et al.). I am afraid this might get us penalized. Any insight?
Intermediate & Advanced SEO | | TheEspresseo0 -
Google is indexing wordpress attachment pages
Hey, I have a bit of a problem/issue what is freaking me out a bit. I hope you can help me. If i do site:www.somesitename.com search in Google i see that Google is indexing my attachment pages. I want to redirect attachment URL's to parent post and stop google from indexing them. I have used different redirect plugins in hope that i can fix it myself but plugins don't work. I get a error:"too many redirects occurred trying to open www.somesitename.com/?attachment_id=1982 ". Do i need to change something in my attachment.php fail? Any idea what is causing this problem? get_header(); ?> /* Run the loop to output the attachment. * If you want to overload this in a child theme then include a file * called loop-attachment.php and that will be used instead. */ get_template_part( 'loop', 'attachment' ); ?>
Intermediate & Advanced SEO | | TauriU0