Robots.txt: how to exclude sub-directories correctly?
- Hello here, I am trying to figure out the correct way to tell search engines to crawl this:

http://www.mysite.com/directory/

But not this:

http://www.mysite.com/directory/sub-directory/

Or this:

http://www.mysite.com/directory/sub-directory2/sub-directory/...

Since I have thousands of sub-directories with almost infinite combinations, I can't define this in a manageable way:

Disallow: /directory/sub-directory/
Disallow: /directory/sub-directory2/
Disallow: /directory/sub-directory/sub-directory/
Disallow: /directory/sub-directory2/subdirectory/
etc...

I would end up having thousands of definitions to disallow all the possible sub-directory combinations. So, is the following a correct, better and shorter way to define what I want above?

Allow: /directory/$
Disallow: /directory/*

Would the above work? Any thoughts are very welcome! Thank you in advance. Best, Fab.
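A note for readers who want to sanity-check this logic offline: the sketch below is a minimal, illustrative model of Google-style robots.txt matching, where * matches any run of characters, a trailing $ anchors the end of the path, the longest matching rule wins, and Allow wins exact ties. The helper names (pattern_to_regex, is_allowed) are made up for this example; this is not an official parser.

```python
import re

def pattern_to_regex(pattern):
    # Translate a robots.txt path pattern into a regex:
    # '*' matches any sequence of characters, a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(path, rules):
    # rules: list of (directive, pattern) pairs, e.g. ("allow", "/directory/$").
    # Keep the longest matching pattern; on a length tie, Allow wins.
    best = None  # (pattern_length, directive)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if (best is None or length > best[0]
                    or (length == best[0] and directive == "allow")):
                best = (length, directive)
    # No matching rule at all means the URL is crawlable.
    return best is None or best[1] == "allow"

rules = [("allow", "/directory/$"), ("disallow", "/directory/*")]
print(is_allowed("/directory/", rules))                               # True: crawlable
print(is_allowed("/directory/sub-directory/", rules))                 # False: blocked
print(is_allowed("/directory/sub-directory2/sub-directory/", rules))  # False: blocked
```

Under those assumptions, the two-line Allow/Disallow pair does exactly what the question asks for.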
- I mentioned both. You add a meta robots tag to noindex the page, and you remove it from the sitemap.
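For reference, the meta robots tag that reply refers to is a single tag in the page's <head>; a minimal example (the "follow" value is optional and shown only as the common pairing):

```html
<meta name="robots" content="noindex, follow">
```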
- But Google is still free to index a link/page even if it is not included in the XML sitemap.
- Install the Yoast WordPress SEO plugin and use it to control what is indexed and what is included in the sitemap.
- I am using WordPress with the Enfold theme (ThemeForest). I want some files to be accessible to Google, but they should not be indexed. Here is an example: http://prntscr.com/h8918o

I have currently blocked some JS directories/files using robots.txt (see screenshot), but because of this I am not able to pass Google's Mobile-Friendly Test: http://prntscr.com/h8925z (see screenshot)

Is it possible to allow access but use something like a noindex tag in the robots.txt file? Or is there any other way out?
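robots.txt has no noindex directive, and blocking the JS files is exactly what breaks the Mobile-Friendly Test, since Google can no longer render the page. One common alternative is to serve an X-Robots-Tag response header on those files while leaving them crawlable. A minimal sketch, assuming an Apache server with mod_headers enabled (the .js pattern is just an example):

```apache
# .htaccess: mark JavaScript files noindex via a response header
# while keeping them fetchable so Google can render the page.
<FilesMatch "\.js$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```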
- Yes, everything looks good. Webmaster Tools gave me the expected results with the following directives:

Allow: /directory/$
Disallow: /directory/*

which allow this URL:

http://www.mysite.com/directory/

but don't allow the following one:

http://www.mysite.com/directory/sub-directory2/...

This page also gives an example similar to mine: https://support.google.com/webmasters/answer/156449?hl=en

I think I am good! Thanks
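Assembled as a complete file, the validated solution would look like the sketch below; the User-agent line is needed for the rules to form a group (shown here for all crawlers):

```
User-agent: *
Allow: /directory/$
Disallow: /directory/*
```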
- Thank you Michael, it is my understanding then that my idea of doing this:

Allow: /directory/$
Disallow: /directory/*

should work just fine. I will test it within Google Webmaster Tools and let you know if any problems arise. In the meantime, if anyone else has more ideas about all this and can confirm it, that would be great! Thank you again.
- I've always stuck to Disallow and followed this advice: "This is currently a bit awkward, as there is no 'Allow' field. The easy way is to put all files to be disallowed into a separate directory, say 'stuff', and leave the one file in the level above this directory." (http://www.robotstxt.org/robotstxt.html)

From https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory: Google's pattern-matching examples treat /* as equivalent to / (the trailing wildcard is ignored).

I think this post will be very useful for you: http://a-moz.groupbuyseo.org/community/q/allow-or-disallow-first-in-robots-txt
- Thank you Michael. Google and other search engines actually do recognize the "Allow:" directive: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

The fact is: if I don't specify it, how can I be sure that the following single directive:

Disallow: /directory/*

doesn't prevent search engines from spidering the /directory/ index page, as I'd like them to?
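For what it's worth, Google's documentation describes a longest-match rule for conflicts: the matching rule with the longest path pattern wins, and on an exact tie the less restrictive (Allow) rule wins. Assuming that behavior, the two rules resolve like this:

```
URL: /directory/
  Allow:    /directory/$   matches (12 characters)
  Disallow: /directory/*   matches (12 characters)
  -> tie, Allow wins: crawlable

URL: /directory/sub-directory/
  Allow:    /directory/$   does not match ($ anchors the end)
  Disallow: /directory/*   matches
  -> blocked
```

So the Allow line is doing real work here: a bare Disallow: /directory/* on its own would also match /directory/ itself and block it.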
- As long as you don't have directories somewhere in /* that you want indexed, then I think that will work. There is no "Allow", so you don't need the first line, just:

Disallow: /directory/*

You can test it out here: https://support.google.com/webmasters/answer/156449?rd=1
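If you want to test rules outside Google's tool, one option (a third-party library, not anything official from Google) is the protego parser used by Scrapy, which implements Google-style wildcards and precedence:

```python
# pip install protego  -- third-party robots.txt parser
from protego import Protego

robotstxt = """
User-agent: *
Allow: /directory/$
Disallow: /directory/*
"""

rp = Protego.parse(robotstxt)
print(rp.can_fetch("http://www.mysite.com/directory/", "*"))                # expected: True
print(rp.can_fetch("http://www.mysite.com/directory/sub-directory/", "*"))  # expected: False
```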
Related Questions
- "noindex, follow" or "robots.txt" for thin content pages
Does anyone have any testing evidence as to what is better to use for pages with thin content that are nonetheless important to keep on a website? I am referring to content shared across multiple websites (such as e-commerce, real estate, etc). Imagine a website with 300 high-quality pages indexed and 5,000 thin product-type pages, which are pages that would not generate relevant search traffic. The question goes: does the interlinking value achieved by "noindex, follow" outweigh the negative of Google having to crawl all those "noindex" pages? With robots.txt, one has Google's crawling focus on just the important pages that are indexed, and that may give ranking a boost. Any experiments with insight into this would be great. I do get the story about "make the pages unique", "get customer reviews and comments" etc., but the above question is the important question here. Intermediate & Advanced SEO | khi5
- Block in robots.txt instead of using canonical?
When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed? In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt? Intermediate & Advanced SEO | YairSpolter
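As an aside for readers, the rel=canonical hint that question mentions is a single tag in the page's <head>; a minimal example with an illustrative URL:

```html
<link rel="canonical" href="http://www.example.com/preferred-page/">
```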
- Complementary sub-brand; subdomain, subfolder, something else?
Hello forum! I have a question about subdomains vs. subfolders for a new sub-brand for a company. The company is looking at creating a sub-brand delivering a different service to the parent company. It is complementary in a sense, but it would need a very different marketing strategy. It is not trying to 'hide' its parent brand at all, but instead would leverage the parent brand as added social proof. I've read that creating a subdomain essentially means starting from scratch in terms of SEO, and that a subfolder would better leverage the domain authority the TLD has accrued. However, creating a subfolder does not really gel with me, as it would not in my opinion provide a good experience for visitors. I.e., it's like running a website that sells electronics and having a subfolder marketing IT support services. Yes, there is some synergy, but it can also lead to visitor confusion. I'd love your opinions on this! Carlo Intermediate & Advanced SEO | carlod
- Do you add the 404 page to robots.txt or just add a noindex tag?
Hi, I've got different opinions on this, so I wanted to double-check what your comment is. We've got a /404.html page and I was wondering if you would add this page to robots.txt so it wouldn't be indexed, or would you just add a noindex tag? What would be the best approach? Thanks! Intermediate & Advanced SEO | Rubix
- Robots Disallow Backslash - Is it the right command?
Bit skeptical: due to dynamic URLs and some other linkage issue, Google has crawled URLs with backslash and quote characters, e.g. www.xyz.com/\/index.php?option=com_product and www.xyz.com/\"/index.php?option=com_product. Now %5C is the encoded version of the backslash (\) and %22 is the encoded version of the double quote ("). I need to know, for the command:

User-agent: *
Disallow: \

as I am disallowing all backslash URLs through this, will it only remove the backslash URLs, which are duplicates, or the entire site? Intermediate & Advanced SEO | Modi
- Robots.txt: Can you put a /* wildcard in the middle of a URL?
We have noticed that Google is indexing the language/country directory versions of directories we have disallowed in our robots.txt. For example, Disallow: /images/ is blocked just fine. However, once you add our /en/uk/ directory in front of it, there are dozens of pages indexed. The question is: can I put a wildcard in the middle of the string, e.g. /en/*/images/, or do I need to list out every single country for every language in the robots file? Anyone know of any workarounds? Intermediate & Advanced SEO | IHSwebsite
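For readers wondering about the syntax itself: per Google's robots.txt documentation, * is valid anywhere in the path pattern and also spans "/" characters, so a single rule along these lines (a sketch, reusing the directories from the question) should cover every language/country prefix at once:

```
User-agent: *
# '*' can sit mid-pattern and crosses '/',
# so this matches /en/uk/images/, /fr/images/, etc.
Disallow: /*/images/
```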
- Could you use a robots.txt file to disallow a duplicate content page from being crawled?
A website has duplicate content pages to make it easier for users to find the information from a couple of spots in the site navigation. The site owner would like to keep it this way without hurting SEO. I've thought of using the robots.txt file to disallow search engines from crawling one of the pages. Would you think this is a workable/acceptable solution? Intermediate & Advanced SEO | gregelwell
- URL Structure for Directory Site
We have a directory that we're building and we're not sure if we should try to make each page an extension of the root domain or utilize sub-directories as users narrow down their selection. What is the best practice here for maximizing your SERP authority?

Choice #1 - Hyphenated architecture (no sub-folders):
1) State Page /state/
2) City Page /city-state/
3) Business Page /business-city-state/
4) Location Page /locationname-city-state/

or...

Choice #2 - Using sub-folders on drill-down:
1) State Page /state/
2) City Page /state/city
3) Business Page /state/city/business/
4) Location Page /locationname-city-state/

Again, just to clarify, I need help in determining what the best methodology is for achieving the greatest SEO benefits. Just by looking, it would seem that choice #1 would work better because the URLs are very clear and SEF. But at the same time it may be less intuitive for search. I'm not sure. What do you think? Intermediate & Advanced SEO | knowyourbank