Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we're not completely removing the content (many posts will still be viewable), we have locked both new posts and new replies.
Robots.txt: how to exclude sub-directories correctly?
- Hello here, I am trying to figure out the correct way to tell SEs to crawl this:

  http://www.mysite.com/directory/

  But not this:

  http://www.mysite.com/directory/sub-directory/

  or this:

  http://www.mysite.com/directory/sub-directory2/sub-directory/...

  But given that I have thousands of sub-directories with almost infinite combinations, I can't list the following definitions in a manageable way:

  disallow: /directory/sub-directory/
  disallow: /directory/sub-directory2/
  disallow: /directory/sub-directory/sub-directory/
  disallow: /directory/sub-directory2/subdirectory/
  etc...

  I would end up with thousands of definitions to disallow all the possible sub-directory combinations. So, is the following a correct, better and shorter way to define what I want above?

  allow: /directory/$
  disallow: /directory/*

  Would the above work? Any thoughts are very welcome! Thank you in advance. Best, Fab.
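  For context, in a complete robots.txt file these directives sit under a User-agent group. A minimal sketch of the file being proposed (the paths are the asker's; the User-agent: * grouping is an assumption):

  ```
  User-agent: *
  Allow: /directory/$
  Disallow: /directory/*
  ```

  Note that Allow and the * and $ wildcards are extensions to the original robots.txt standard: Googlebot and Bingbot support them, but some smaller crawlers may not.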
- I mentioned both. You add a meta robots noindex tag and remove the page from the sitemap.
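  For reference, the meta robots tag in question goes in the <head> of each page to be dropped from the index. The page must remain crawlable (not blocked in robots.txt), or the tag will never be seen:

  ```html
  <meta name="robots" content="noindex">
  ```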
- But Google is still free to index a link/page even if it is not included in the XML sitemap.
- Install the Yoast WordPress SEO plugin and use that to control what is indexed and what is included in the sitemap.
- I am using WordPress with the Enfold theme (ThemeForest). I want some files to be accessible to Google, but they should not be indexed. Here is an example: http://prntscr.com/h8918o

  I have currently blocked some JS directories/files using robots.txt (check screenshot), but due to this I am not able to pass Google's Mobile-Friendly Test: http://prntscr.com/h8925z (check screenshot)

  Is it possible to allow access, but use a tag like noindex in the robots.txt file? Or is there any other way out?
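  robots.txt itself has no noindex directive; it controls crawling, not indexing. A common workaround is to unblock the JS in robots.txt (so the Mobile-Friendly Test can render the page) and keep the files out of the index with an X-Robots-Tag response header. A sketch, assuming an Apache server with mod_headers enabled (typical for WordPress hosts, but an assumption here):

  ```apache
  # .htaccess: send "noindex" with every JS file while leaving it crawlable
  <FilesMatch "\.js$">
    Header set X-Robots-Tag "noindex"
  </FilesMatch>
  ```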
- Yes, everything looks good. Webmaster Tools gave me the expected results with the following directives:

  allow: /directory/$
  disallow: /directory/*

  Which allows this URL:

  http://www.mysite.com/directory/

  But doesn't allow the following one:

  http://www.mysite.com/directory/sub-directory2/...

  This page also gives an example similar to mine: https://support.google.com/webmasters/answer/156449?hl=en

  I think I am good! Thanks
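  For later readers who want to sanity-check patterns offline, here is a small self-contained sketch of Google's documented matching rules: the longest matching pattern wins, and on a length tie the less restrictive allow rule wins. The rules and URLs are this thread's; the code itself is illustrative, not any official parser:

  ```python
  import re

  def rule_matches(pattern: str, path: str) -> bool:
      """True if a robots.txt pattern (supporting * and $) matches the
      start of the URL path, per Google's wildcard semantics."""
      regex = re.escape(pattern).replace(r"\*", ".*")
      if regex.endswith(r"\$"):
          regex = regex[:-2] + "$"  # a trailing $ anchors the end of the path
      return re.match(regex, path) is not None

  def is_allowed(rules, path):
      """rules: list of ("allow" | "disallow", pattern) tuples."""
      best_len, allowed = -1, True  # no matching rule means allowed
      for kind, pattern in rules:
          if rule_matches(pattern, path):
              # longest pattern wins; on a tie, allow beats disallow
              if len(pattern) > best_len or (len(pattern) == best_len and kind == "allow"):
                  best_len, allowed = len(pattern), kind == "allow"
      return allowed

  rules = [("allow", "/directory/$"), ("disallow", "/directory/*")]
  print(is_allowed(rules, "/directory/"))                # True (tie, allow wins)
  print(is_allowed(rules, "/directory/sub-directory/"))  # False
  ```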
- Thank you Michael, it is my understanding then that my idea of doing this:

  allow: /directory/$
  disallow: /directory/*

  should work just fine. I will test it within Google Webmaster Tools and let you know if any problems arise. In the meantime, if anyone else has more ideas about all this and can confirm it for me, that would be great! Thank you again.
- I've always stuck to Disallow and followed this advice from http://www.robotstxt.org/robotstxt.html: "This is currently a bit awkward, as there is no 'Allow' field. The easy way is to put all files to be disallowed into a separate directory, say 'stuff', and leave the one file in the level above this directory."

  That said, https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt seems contradictory: its pattern-matching table lists /* as equivalent to / (the trailing wildcard is ignored).

  I think this post will be very useful for you: http://a-moz.groupbuyseo.org/community/q/allow-or-disallow-first-in-robots-txt
- Thank you Michael. Google and other SEs actually recognize the "allow:" directive: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

  The fact is: if I don't specify that, how can I be sure that the following single directive:

  disallow: /directory/*

  doesn't prevent SEs from spidering the /directory/ index page, as I'd like them to?
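  For later readers: the concern is well founded, because * also matches the empty string, so /directory/ itself matches disallow: /directory/*. The explicit allow line is what rescues the index page; the two patterns are equally long, and on a tie the allow rule wins:

  ```
  User-agent: *
  Allow: /directory/$      # matches exactly /directory/ and wins the tie
  Disallow: /directory/*   # matches /directory/ and everything below it
  ```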
- As long as you don't have directories somewhere in /* that you want indexed, then I think that will work. There is no allow, so you don't need the first line; just:

  disallow: /directory/*

  You can test it out here: https://support.google.com/webmasters/answer/156449?rd=1
Related Questions
- Robots.txt & Disallow: /*? Question!

  Hi, I have a site where they have: Disallow: /*? The problem is we need the following indexed: ?utm_source=google_shopping What would the best solution be? I have read:

  User-agent: *
  Allow: ?utm_source=google_shopping
  Disallow: /*?

  Any ideas?

  Intermediate & Advanced SEO | vetofunk
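  A sketch of the usual fix, relying on the same longest-match rule discussed in the thread above (the utm parameter is the asker's; the leading /* in the Allow pattern is an assumption, so the rule matches the query string on any path):

  ```
  User-agent: *
  Allow: /*?utm_source=google_shopping
  Disallow: /*?
  ```

  Because the Allow pattern is longer than Disallow: /*?, it wins for URLs carrying that parameter.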
- Large robots.txt file

  We're looking at potentially creating a robots.txt with 1,450 lines in it. This will remove 100k+ pages from the crawl that are all old pages (I know, the ideal would be to delete/noindex, but that's not viable unfortunately). Now the issue I'm thinking of is that a large robots.txt will either stop the robots.txt from being followed or will slow our crawl rate down. Does anybody have any experience with a robots.txt of that size?

  Intermediate & Advanced SEO | ThomasHarvey
- How can I get Bing to index my subdomain correctly?

  Hi guys, my website exists on a subdomain (i.e. https://website.subdomain.com) and is being indexed correctly on all search engines except Bing and DuckDuckGo, which list 'https://www.website.subdomain.com'. Unfortunately my subdomain isn't configured for www (the domain is out of my control), so searchers are seeing a server error when clicking on my homepage in the SERPs. I have verified the site successfully in Bing Webmaster Tools, but it still shows up incorrectly. Does anyone have any advice on how I could fix this issue? Thank you!

  Intermediate & Advanced SEO | cos20300
- Block in robots.txt instead of using canonical?

  When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed? In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt?

  Intermediate & Advanced SEO | YairSpolter
- Subdomains vs directories on existing website with good search traffic

  Hello everyone, I operate a website called Icy Veins (www.icy-veins.com), which gives gaming advice for World of Warcraft and Hearthstone, two titles from Blizzard Entertainment. Up until recently, we had articles for both games on the main subdomain (www.icy-veins.com), without a directory structure. The articles for World of Warcraft ended in -wow and those for Hearthstone ended in -hearthstone, and that was it. We are planning to cover more games from Blizzard Entertainment soon, so we hired an SEO consultant to figure out whether we should use directories (www.icy-veins.com/wow/, www.icy-veins.com/hearthstone/, etc.) or subdomains (www.icy-veins.com, wow.icy-veins.com, hearthstone.icy-veins.com). For a number of reasons, the consultant was adamant that subdomains were the way to go. So I implemented subdomains, with 301-redirects from all the old URLs to the new ones, and after 2 weeks the amount of search traffic we get has been slowly decreasing as the new URLs were getting indexed. Now we are getting about 20%-25% less search traffic. For example, the week before the subdomains went live we received 900,000 visits from search engines (11-17 May). This week, we only received 700,000 visits. All our new URLs are indexed, but they rank slightly lower than the old URLs used to, so I was wondering if this was something that was to be expected and will improve in time, or if I should just go back to directories. Thank you in advance.

  Intermediate & Advanced SEO | damienthivolle
- Should comments and feeds be disallowed in robots.txt?

  Hi, my robots file is currently set up as listed below. From an SEO point of view, is it good to disallow feeds, RSS and comments? I feel allowing comments would be a good thing, because it's new content that may rank in the search engines, as the comments left on my blog often refer to questions or companies folks are searching for more information on. And the comments are added regularly. What's your take? I'm also concerned about /page being blocked; not sure how that benefits my blog from an SEO point of view. Look forward to your feedback. Thanks. Eddy

  User-agent: Googlebot
  Crawl-delay: 10
  Allow: /*

  User-agent: *
  Crawl-delay: 10
  Disallow: /wp-
  Disallow: /feed/
  Disallow: /trackback/
  Disallow: /rss/
  Disallow: /comments/feed/
  Disallow: /page/
  Disallow: /date/
  Disallow: /comments/
  # Allow Everything
  Allow: /*

  Intermediate & Advanced SEO | workathomecareers
- How is Google crawling and indexing this directory listing?

  We have three directory listing pages that are being indexed by Google:

  http://www.ccisolutions.com/StoreFront/jsp/
  http://www.ccisolutions.com/StoreFront/jsp/html/
  http://www.ccisolutions.com/StoreFront/jsp/pdf/

  How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although /jsp/html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file, and I understand that this could be why. If we add them to our robots.txt file and disallow, will this prevent Googlebot from crawling and indexing those directory listing pages without prohibiting it from crawling and indexing the content that resides there, which is used to populate pages on our site? Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content. For example, the file CCI-SALES-STAFF.HTML (which appears on the directory listing referenced above, http://www.ccisolutions.com/StoreFront/jsp/html/) clicks through to this web page: http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff As you can see, this results in duplicate content problems. Is there a way to disallow Googlebot from crawling that directory listing page and, provided that we have this URL in our sitemap (http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff), solve the duplicate content issue as a result? For example:

  Disallow: /StoreFront/jsp/
  Disallow: /StoreFront/jsp/html/
  Disallow: /StoreFront/jsp/pdf/

  Can we do this without risking blocking Googlebot from content we do want crawled and indexed? Many thanks in advance for any and all help on this one!

  Intermediate & Advanced SEO | danatanseo
- Best way to block a sub-domain from being indexed

  Hello, the search engines have indexed a sub-domain I did not want indexed. It's on old.domain.com and dev.domain.com. I was going to password-protect them, but is there a best-practice way to block them? My main domain's default robots.txt says:

  Sitemap: http://www.domain.com/sitemap.xml

  # global
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /wp-admin/
  Disallow: /wp-includes/
  Disallow: /wp-content/plugins/
  Disallow: /wp-content/cache/
  Disallow: /wp-content/themes/
  Disallow: /trackback/
  Disallow: /feed/
  Disallow: /comments/
  Disallow: /category//
  Disallow: */trackback/
  Disallow: */feed/
  Disallow: /comments/
  Disallow: /?

  Intermediate & Advanced SEO | JohnW-UK
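  A sketch of the usual approach, assuming each sub-domain can serve its own robots.txt (robots.txt is per-host, so the main domain's file does not apply to old.domain.com or dev.domain.com):

  ```
  # robots.txt served only at old.domain.com and dev.domain.com
  User-agent: *
  Disallow: /
  ```

  Password protection or an X-Robots-Tag: noindex header is still more reliable, since Disallow only stops crawling and already-indexed URLs can linger in results.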