Robots.txt Generator Guide
Every website that wants any say over how it gets crawled relies on a single small text file sitting at its root: robots.txt. It's one of the oldest standards on the web, and its syntax is deceptively simple on the surface — a few lines specifying which crawler a rule applies to and which paths it can or can't access — but it's also unforgiving. A missing slash, a rule placed under the wrong User-agent block, or a wildcard used incorrectly can either block far more of a site than intended, accidentally hiding it from search engines entirely, or fail to block anything at all because the syntax didn't match what the crawler expected.
The landscape robots.txt needs to address has also expanded considerably. Where it once mainly concerned a handful of search engine crawlers like Googlebot and Bingbot, site owners today increasingly want to make a deliberate choice about AI crawlers that scrape content to train language models or power AI-generated answers — bots like GPTBot, CCBot, or ClaudeBot — separately from how they treat traditional search indexing. Doing this correctly means writing distinct User-agent blocks for each crawler you want to address individually, which is exactly the kind of repetitive, syntax-sensitive task that's easy to get wrong by hand.
This generator walks through the decision in a structured way: which crawlers to allow generally, which specific bots — including named AI crawlers — to block outright, which folders or paths should be off-limits to crawling regardless of bot, and where your sitemap lives so crawlers can discover it directly from the robots.txt file itself rather than having to find it some other way. It then assembles all of those choices into a single, correctly formatted file using the exact directive names and structure the standard expects, removing the guesswork around things like whether a rule needs a trailing slash or how multiple User-agent blocks should be separated.
The entire generator runs client-side in your browser, so the structure of your site that you describe while building the file — which paths you're choosing to hide from crawlers — is never transmitted anywhere. Once generated, the output is a plain text file meant to be uploaded to the root of your domain, where crawlers expect to find it by convention, replacing whatever robots.txt may already exist there.
How to generate a robots.txt file
- Decide on your default crawling policy. Start by deciding whether most crawlers should be allowed to access most of your site by default, which is the typical choice for public content sites that want to be found through search. Set the baseline rule for the wildcard User-agent, which applies to any crawler not specifically named elsewhere in the file, before layering on more specific exceptions. With the default settled first, every later rule is a deliberate exception to a sensible baseline, and you never have to reconstruct the overall policy from a pile of unrelated specific rules. Allowing by default and carving out exceptions afterward also keeps the file shorter and easier to reason about over time.
- Choose which AI crawlers to block or allow. Review the list of named AI crawlers, such as those used to gather training data for language models or to power AI search summaries, and decide individually whether each should be allowed or blocked. This is now a genuinely separate decision from traditional SEO, since a site might want full visibility in Google search while still declining to have its content scraped for AI training, and the only way to express that distinction is a dedicated User-agent block naming that specific crawler with its own disallow rule. New crawlers appear periodically as AI companies launch new products, so revisiting this list every few months is a reasonable habit for a site owner who cares about the distinction.
- Set crawl rules for specific folders or paths. Add disallow rules for any specific paths that shouldn't be crawled regardless of which bot is requesting them, such as admin areas, internal search result pages, staging subdirectories, or duplicate content generated by filters and sorting parameters. Keep in mind that a crawler with its own named User-agent block follows only that block and ignores the wildcard block, so a path meant to be off-limits to every bot must also appear in any named block that doesn't already disallow the whole site. Be as specific as possible with each path so you don't accidentally block more than intended; a rule meant to hide one internal folder can unintentionally match unrelated public pages if the path prefix is too broad or a wildcard is placed carelessly. Testing each new rule against a few real URLs on your site before finalizing the file helps catch this kind of accidental overreach early.
- Add your sitemap URL. Include the full URL to your sitemap.xml file so crawlers that read robots.txt can discover it directly without needing to guess its location or rely on it being submitted separately through a search console. This is a small addition but a genuinely useful one, since it gives every well-behaved crawler, not just the ones you've manually registered with, a direct path to your complete list of indexable URLs the very first time it visits your site. Keep this URL updated if your sitemap ever moves to a different path or filename.
- Download and upload the file to your domain root. Once the generated file looks correct, download it and upload it to the root of your domain so it's reachable at yourdomain.com/robots.txt exactly, since crawlers look for it at that specific location by convention and won't find it anywhere else, including a subfolder. After uploading, visit that URL directly in a browser to confirm it loads as plain text with the rules you expect, and replace any previous robots.txt file rather than leaving the old one alongside the new.
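The steps above can be sketched in a few lines of Python. This is a minimal illustration of how such a file might be assembled, not the generator's actual code: the function name `build_robots_txt`, its parameters, and the example.com URL are all assumptions made for the example.

```python
def build_robots_txt(allow_by_default, blocked_bots, disallowed_paths, sitemap_url=None):
    """Assemble robots.txt text from a few simple choices (illustrative sketch)."""
    lines = []

    # One block per bot that is blocked outright, each followed by a blank line.
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]

    # Baseline block for every crawler not named above.
    lines.append("User-agent: *")
    if allow_by_default:
        # Path rules are exceptions to an allow-by-default baseline.
        lines += [f"Disallow: {path}" for path in disallowed_paths]
        if not disallowed_paths:
            lines.append("Disallow:")  # an empty value disallows nothing
    else:
        lines.append("Disallow: /")

    # The Sitemap line stands apart from any User-agent block.
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]

    return "\n".join(lines) + "\n"


# Hypothetical choices: allow by default, block two AI crawlers, hide two folders.
text = build_robots_txt(
    allow_by_default=True,
    blocked_bots=["GPTBot", "CCBot"],
    disallowed_paths=["/admin/", "/search/"],
    sitemap_url="https://example.com/sitemap.xml",
)
print(text)
```

For these inputs the output is one block per blocked bot, then the wildcard block with the two path rules, then the Sitemap line. Bots blocked outright need no path rules of their own, because `Disallow: /` already covers every path.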
Use Cases
- Blocking AI training crawlers without affecting search visibility: Allow Google and Bing to index your site fully while specifically disallowing crawlers that scrape content for AI model training.
- Hiding admin and internal pages from search engines: Add disallow rules for login pages, admin dashboards, and internal tools that shouldn't appear in search results.
- Preventing duplicate content from filter and sort URLs: Block crawling of parameterized URLs generated by on-site filtering or sorting that create near-duplicate pages.
- Launching a new site with a clear crawling policy from day one: Set up a correct robots.txt before launch so search engines and AI crawlers encounter the intended rules immediately.
- Pointing crawlers directly to your sitemap: Add a sitemap directive so crawlers can discover your full URL list without relying solely on manual submission.
- Auditing and rebuilding a misconfigured robots.txt: Replace an existing file that's blocking more or less than intended with a freshly generated, correctly structured one.
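Several of these use cases can be sanity-checked before upload with Python's standard-library `urllib.robotparser`. The sketch below is one possible check, not part of this tool; the file contents and the example.com URLs are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, keep search crawlers open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no network access


def can_crawl(bot: str, path: str) -> bool:
    """Return True if `bot` may fetch `path` under the rules above."""
    return parser.can_fetch(bot, "https://example.com" + path)


for bot, path in [
    ("Googlebot", "/blog/post-1"),   # falls under the wildcard block, not disallowed
    ("Googlebot", "/admin/login"),   # falls under the wildcard block, disallowed
    ("GPTBot", "/blog/post-1"),      # has its own block, which disallows everything
]:
    print(f"{bot:10} {path:15} allowed={can_crawl(bot, path)}")

print("sitemaps:", parser.site_maps())
```

Running the same kind of check against an existing file is also a quick way to audit it: list a handful of URLs that must stay crawlable and a handful that must not, and compare the answers with what you intended.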
About This Tool
What is it? A browser-based form that produces a correctly formatted robots.txt file based on your choices about default crawler access, specific AI and search bot rules, disallowed paths, and your sitemap location.
Why use it? It removes the syntax guesswork around User-agent blocks, wildcards, and directive names that makes robots.txt easy to get subtly wrong by hand, and it makes blocking specific named AI crawlers — now a common, deliberate choice separate from traditional SEO — as straightforward as toggling an option.
Alternatives: Copying a generic robots.txt template found online works for a basic case but rarely matches your site's actual folder structure or addresses current AI crawlers by name. Writing one directly from the specification requires careful attention to syntax details that are easy to get wrong. This generator produces a tailored file from simple choices without either drawback.
Common mistakes: Using an overly broad wildcard or path prefix in a disallow rule is the most damaging mistake, since it can block far more of a site than intended, including pages meant to be indexed. A second common mistake is treating robots.txt as access control: it only asks compliant crawlers to stay away from a path and does nothing to stop a determined or non-compliant bot. A third is using it to remove a page that is already indexed from search results. That requires a noindex tag instead, and the page has to remain crawlable so the crawler can actually see the tag.
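The overreach problem is easy to reproduce with a single missing slash. The sketch below uses Python's standard-library `urllib.robotparser`, which treats rule paths as plain prefixes (as far as we know it does not implement `*` or `$` wildcards); the helper `is_allowed`, the bot name `ExampleBot`, and the paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_path: str, url_path: str) -> bool:
    """Check `url_path` against a single wildcard-block Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return parser.can_fetch("ExampleBot", url_path)


# Without the trailing slash the rule is a bare prefix and also catches
# an unrelated public page whose path merely starts with "/admin".
broad = is_allowed("/admin", "/administration-guide")

# With the trailing slash only URLs inside the folder are affected.
narrow = is_allowed("/admin/", "/administration-guide")
inside = is_allowed("/admin/", "/admin/users")

print(broad, narrow, inside)  # False True False
```

The first result shows the public page being blocked by accident; adding the trailing slash restores it while still keeping the admin folder off-limits.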
Frequently Asked Questions
- Does blocking an AI crawler in robots.txt guarantee my content won't be used for training?
- It signals a request that compliant crawlers are expected to honor, but robots.txt is a voluntary standard; a crawler that ignores it could still access the content, so it's a strong signal rather than an absolute technical barrier.
- Will blocking AI crawlers hurt my Google search ranking?
- No. Search crawlers like Googlebot and AI training crawlers identify themselves under different user-agent names, so each can be addressed by its own rules: you can block one category while leaving the other fully able to crawl and index your site.
- What's the difference between robots.txt and a noindex meta tag?
- Robots.txt asks crawlers not to visit certain paths at all, while a noindex tag asks a crawler that does visit a page not to include it in search results. A page that is already indexed needs noindex, not a robots.txt rule, to be removed from results, and it must stay crawlable so the crawler can read that tag.
- Where exactly does the robots.txt file need to go?
- It must be uploaded to the root of your domain so it's reachable at yourdomain.com/robots.txt; placing it in a subfolder means crawlers won't find it at all.
- Can I block one folder but allow a subfolder inside it?
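- Yes, you can add an explicit allow rule for the subfolder alongside the broader disallow rule for its parent folder. Crawlers that follow the current standard (RFC 9309) apply the rule with the longest matching path, so the more specific allow rule wins for URLs inside the subfolder.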
- Do I need a separate User-agent block for every AI crawler I want to block?
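- Each named crawler needs its own User-agent line, since a rule under one bot's block doesn't apply to a different bot. A separate block per bot is the simplest layout and lets you change one crawler's rules without touching the others, but bots that should get identical rules can also share a single block by listing their User-agent lines one after another above the same disallow rules.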
- Is my site structure exposed by listing my disallowed paths?
- Robots.txt is a publicly readable file by design, so anything listed in it, including disallowed paths, is visible to anyone who looks; it should not be relied on to hide sensitive content, only to guide well-behaved crawler behavior.
- Does adding my sitemap to robots.txt replace submitting it to a search console?
- It's a helpful addition that lets any crawler discover your sitemap directly, but it doesn't replace search console submission, which can offer added indexing feedback that robots.txt alone doesn't provide.
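The allow-inside-disallow question above can be tried out locally. This is a small sketch using Python's standard-library `urllib.robotparser`; the folder names and the bot name `ExampleBot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: hide /private/ but keep /private/press/ crawlable.
# Allow is listed first because this parser uses the first matching rule;
# crawlers following RFC 9309 pick the longest matching path instead, so
# this order gives the same answer under both interpretations.
RULES = [
    "User-agent: *",
    "Allow: /private/press/",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(RULES)

press = parser.can_fetch("ExampleBot", "/private/press/kit.pdf")
notes = parser.can_fetch("ExampleBot", "/private/notes.txt")
about = parser.can_fetch("ExampleBot", "/about")

print(press, notes, about)  # True False True
```

Putting the more specific rule first costs nothing and protects against older or simpler parsers that stop at the first match.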