Moz Q&A is closed.
After more than 13 years and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we're not completely removing the content - many posts will still be viewable - we have locked both new posts and new replies. More details here.
No indexing URLs that include query strings with robots.txt
- Dear all, how can I block URLs/pages with query strings like page.html?dir=asc&order=name with robots.txt? Thanks!
- Dear all, what is the best option? And are the options below any good?

  A: In parameter handling, Disallow the sort-order parameter (only URLs with value = asc). "A single URL may contain many parameters for each of which you can specify settings. More restrictive settings override less restrictive settings. For example, here are three parameters and their settings." Source: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687

  B: In robots.txt:

      User-agent: Googlebot
      Disallow: /*.=name$

  For example www.sub.domain.com/collection.html?dir=desc&order=name. Source: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449

  Thanks!
- You could always just use rel="canonical", which would be much better than completely blocking all URL parameters.
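  As a minimal sketch, using page.html from the original question (the domain is an assumption), the tag would sit in the <head> of every parameter variant of the page:

      <!-- served on page.html and on variants like page.html?dir=asc&order=name -->
      <link rel="canonical" href="http://www.example.com/page.html" />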
- Hey,

  Should that second URL be www.sub.domain.com/collection/adresboeken.html?whatever=something? If so, then by using /collection/?* you are saying that anything within /collection/ with a query string should not be crawled. If adresboeken.html always has a query string, it may not get indexed.

  The other option I'd consider before using robots.txt is telling Google to ignore dir=desc&order=color in Google Webmaster Tools parameter handling. This is the best way to handle query-string issues. (Assuming you are trying to influence Google - clearly Google Webmaster Tools won't affect Bing!)

  Another idea is to set a canonical URL on /collection/adresboeken.html referencing /collection/adresboeken.html without the query string. This tells the search engines that the query strings do not make a unique URL (adresboeken.html?dir=desc&order=color is the same as adresboeken.html?dir=desc&order=price, the same as adresboeken.html?dir=asc&order=color, the same as adresboeken.html, and so on).

  I hope that helps.

  Thanks,
  Matthew
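  One side note on the pattern discussed above: under Google's documented wildcard rules, /collection/?* matches more narrowly than it may appear, since robots.txt rules match from the start of the path. A sketch, using the URLs from this thread:

      User-agent: *
      # matches only a query string directly on the folder, e.g. /collection/?dir=asc
      Disallow: /collection/?*
      # a broader pattern is needed to also catch pages inside the folder,
      # e.g. /collection/adresboeken.html?dir=desc&order=color:
      Disallow: /collection/*?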
- Hi,

  Robots.txt works mainly on two rules: User-agent: and Disallow:.

  User-agent: the name of the robot you need to block.
  Disallow: the URL, folder, or URL pattern you need to block.

  As you have asked in your question, you need to block a URL with a condition. But remember that robots.txt can cause serious problems if you do not use it correctly. Anyway, since you want to block URLs/pages with query strings like page.html?dir=asc&order=name, you could use the following:

      User-agent: *
      Disallow: /*?

  The above will block all URLs containing a question mark (?) for all search robots. It will not block only page.html?dir=asc&order=name; it will also block comments.html?dir=asc&order=name. So use it carefully.

  Hope this is what you were looking for. If you need more help, you may ask.

  Regards,
  Prasad
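  If blocking every query string on the site is too broad, a more targeted sketch - assuming Google-style wildcard support, and assuming dir and order are the only parameters that need blocking - would be:

      User-agent: *
      # dir as the first parameter or a later one
      Disallow: /*?dir=
      Disallow: /*&dir=
      # order as the first parameter or a later one
      Disallow: /*?order=
      Disallow: /*&order=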
- Dear all, thanks for responding. If I have pages like:

  1. www.sub.domain.com/collection.html, which exists and which I want indexed, and
  2. www.sub.domain.com/collection.html?dir=desc&order=color, which I don't want indexed,

  is this the way to do it in the robots.txt?

      Disallow: /collection/?*

  Thanks!
- Hi,

  Here is an article explaining how to do this in robots.txt: http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/

  Depending on what you are trying to do, it might also be worth investigating parameter handling in Google Webmaster Tools: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687

  Thanks,
  Matthew
Related Questions
- Robots.txt allows wp-admin/admin-ajax.php (Technical SEO | AndyKubrin)
  Hello, Mozzers! I noticed something peculiar in the robots.txt used by one of my clients: Allow: /wp-admin/admin-ajax.php. What would be the purpose of allowing a search engine to crawl this file? Is it OK? Should I do something about it? Everything else on /wp-admin/ is disallowed. Thanks in advance for your help. -AK
- Removing a site from the Google index with noindex meta tags (Technical SEO | Chris_Wright1)
  Hi there! I wanted to remove a duplicated site from the Google index. I've read that you can do this by removing the URL from Google Search Console and, although I can't find it in Google Search Console, Google keeps on showing the site in SERPs. So I wanted to add a "noindex" meta tag to the code of the site; however, I've only found out how to do this for individual pages. Can you do the same for an entire site? How can I do it? Thank you for your help in advance! L
- Indexed pages (Technical SEO | muzzmoz1)
  Just started a site audit and trying to determine the number of pages on a client site and whether there are more pages being indexed than actually exist. I've used four tools and got four very different answers... Google Search Console: 237 indexed pages. Google search using the site: command: 468 results. Moz Site Crawl: 1013 unique URLs. Screaming Frog: 183 page titles, 187 URIs (note this is a free licence, but it should cut off at 500). Can anyone shed any light on why they differ so much? And where lies the truth?
- Will Google Recrawl an Indexed URL Which Is No Longer Internally Linked? (Technical SEO | kirmeliux0)
  We accidentally introduced Google to our incomplete site. The end result: thousands of pages indexed which return nothing but a "Sorry, no results" page. I know there are many ways to go about this, but the sheer number of pages makes it frustrating. Ideally, in the interim, I'd love to 404 the offending pages and allow Google to recrawl them, realize they're dead, and begin removing them from the index. Unfortunately, we've removed the initial internal links that led to this premature indexation from our site. So my question is, will Google revisit these pages based on their own records (as in, "this page is indexed, let's go check it out again!"), or will they only revisit them by following the current site structure? We are signed up with WMT if that helps.
- Why do some URLs for a specific client have "/index.shtml"? (Technical SEO | FranFerrara0)
  Reviewing our client's URLs for a 301 redirect strategy, we have noticed that many URLs have "/index.shtml". The part we don't understand is that these URLs aren't the homepage, and they have multiple folders followed by "/index.shtml". Does anyone happen to know why this may be occurring? Is there any SEO value in keeping the "/index.shtml" in the URL?
- How to Remove /feed URLs from Google's Index (Technical SEO | M_D_Golden_Peak0)
  Hey everyone, I have an issue with RSS /feed URLs being indexed by Google for some of our WordPress sites. Have a look at this Google query, and click to show omitted search results. You'll see we have 500+ /feed URLs indexed by Google for our many category pages, etc. Here is one example URL: http://www.howdesign.com/design-creativity/fonts-typography/letterforms/attachment/gilhelveticatrade/feed/. Based on the content/code of the XML page, it looks like WordPress is generating these: <generator>http://wordpress.org/?v=3.5.2</generator>. Any idea how to get them out of Google's index without 301 redirecting them? We need the WordPress-generated RSS feeds to work for various uses. My first two thoughts are trying to work with our development team to see if we can get a "noindex" meta robots tag on the pages, but they are dynamically generated pages... so I'm not sure if that will be possible. Or perhaps we can add a "feed" parameter to the GWT "URL Parameters" section... but I don't want to stop Google from crawling these again... I figure I need Google to crawl them, see some code that says to drop the pages from the index, and THEN stop crawling them. I don't think the "Remove URL" feature in GWT will work, since that tool only removes URLs from the search results, not the actual Google index. FWIW, this site is using the Yoast plugin. We set every page type to "noindex" except for the homepage, Posts, Pages and Categories. We have other sites on Yoast that do not have any /feed URLs indexed by Google at all. Side note: the /robots.txt file was previously blocking crawling of the /feed URLs on this site, which is why you'll see that note in the Google SERPs when you click on the query link given in the first paragraph.
- Oh no, Googlebot cannot access my robots.txt file (Technical SEO | BistosAmerica0)
  I just received an error message from Google Webmaster Tools. I wonder if it is something to do with the Yoast plugin. Could somebody help me with troubleshooting this? Here's the original message: "Over the last 24 hours, Googlebot encountered 189 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%. Recommended action: If the site error rate is 100%: Using a web browser, attempt to access http://www.soobumimphotography.com//robots.txt. If you are able to access it from your browser, then your site may be configured to deny access to Googlebot. Check the configuration of your firewall and site to ensure that you are not denying access to Googlebot. If your robots.txt is a static page, verify that your web service has proper permissions to access the file. If your robots.txt is dynamically generated, verify that the scripts that generate the robots.txt are properly configured and have permission to run. Check the logs for your website to see if your scripts are failing, and if so attempt to diagnose the cause of the failure. If the site error rate is less than 100%: Using Webmaster Tools, find a day with a high error rate and examine the logs for your web server for that day. Look for errors accessing robots.txt in the logs for that day and fix the causes of those errors. The most likely explanation is that your site is overloaded. Contact your hosting provider and discuss reconfiguring your web server or adding more resources to your website. After you think you've fixed the problem, use Fetch as Google to fetch http://www.soobumimphotography.com//robots.txt to verify that Googlebot can properly access your site."
- Removing robots.txt on WordPress site problem (Technical SEO | Wallander0)
  Hi... I am a little confused, since I ticked the box in WordPress to allow search engines to crawl my site (I previously asked them not to), but Google Webmaster Tools is telling me I still have robots.txt blocking them, so I am unable to submit the sitemap. I checked the source code and the robots instruction has gone, so I'm a little lost. Any ideas, please?