Which AI Bots to Block in robots.txt (and Which to Keep)
Not all AI crawlers are the same. Some scrape your content to train models. Others power AI search results that send you traffic. Blocking the wrong ones makes your site invisible to ChatGPT, Perplexity, and Gemini. Here is how to tell them apart.
Two types of AI crawlers
AI companies use separate bots for separate purposes. The distinction matters because blocking a training bot protects your intellectual property, while blocking a retrieval bot removes you from AI-powered search results entirely.
Training bots download your content to build datasets for training future models. Your content becomes part of the model weights. You get no attribution and no traffic. Retrieval bots fetch your content in real time to answer a specific user query. They cite your page and often link back to it.
Training bots (safe to block)
These crawlers collect content for model training. Blocking them has no effect on your search visibility. If you do not want your content in training datasets, block all of these.
| User Agent | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Trains future GPT models |
| ClaudeBot | Anthropic | Trains future Claude models |
| CCBot | Common Crawl | Open dataset used by many AI labs |
| Google-Extended | Google | Trains Gemini models |
| Bytespider | ByteDance | Trains TikTok/Doubao models |
| Amazonbot | Amazon | Trains Alexa and internal models |
Retrieval bots (think twice before blocking)
These crawlers fetch your content live when a user asks a question. They power the AI search results in ChatGPT, Perplexity, Google AI Overviews, and Apple Intelligence. Blocking them means your pages will never appear in those results.
| User Agent | Operator | Purpose |
|---|---|---|
| ChatGPT-User | OpenAI | Powers ChatGPT search (live results) |
| PerplexityBot | Perplexity | Powers Perplexity search answers |
| GoogleOther | Google | Used for AI Overviews and other AI features |
| Applebot-Extended | Apple | Used for Apple Intelligence search features |
Recommended: block training, allow retrieval
This is the configuration most sites should use. It prevents your content from being used to train models while keeping you visible in AI search results.
```txt
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

# Allow AI search/retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

# Allow regular search engines
User-agent: *
Allow: /
Disallow: /dashboard/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml
```
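To sanity-check a configuration like this, a small matcher can tell you whether a given user agent ends up fully blocked. The sketch below is a simplification, not a full RFC 9309 parser: it only understands User-agent groups with a bare "Disallow: /" rule, which is enough to verify the block lists in this article.

```typescript
// Simplified robots.txt matcher -- a sketch, not a full RFC 9309 parser.
// It only understands User-agent groups with a bare "Disallow: /" rule.
function isFullyBlocked(robotsTxt: string, userAgent: string): boolean {
  let currentAgents: string[] = [];
  let groupHasRules = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments and whitespace
    if (!line) continue;
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (key === "user-agent") {
      // A User-agent line after any rule line starts a new group
      if (groupHasRules) currentAgents = [];
      groupHasRules = false;
      currentAgents.push(value.toLowerCase());
    } else {
      groupHasRules = true;
      if (
        key === "disallow" &&
        value === "/" &&
        currentAgents.includes(userAgent.toLowerCase())
      ) {
        return true;
      }
    }
  }
  return false;
}

// Hypothetical robots.txt that blocks only the GPTBot training crawler
const robots = `
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
`;

console.log(isFullyBlocked(robots, "GPTBot")); // true
console.log(isFullyBlocked(robots, "ChatGPT-User")); // false
```

Note that GPTBot is blocked while ChatGPT-User is not, which is exactly the training-versus-retrieval split this configuration is designed to produce.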
Alternative: block all AI crawlers
If you want to block all AI access, training and retrieval, use this. Be aware that your content will not appear in ChatGPT search, Perplexity answers, or Google AI Overviews.
```txt
# Block all known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GoogleOther
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow regular search engines
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```
Implementing this in Next.js
If your site runs on Next.js App Router, you can generate robots.txt programmatically using the app/robots.ts file:
```ts
// app/robots.ts
import type { MetadataRoute } from "next"

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Block AI training crawlers
      { userAgent: "GPTBot", disallow: ["/"] },
      { userAgent: "ClaudeBot", disallow: ["/"] },
      { userAgent: "CCBot", disallow: ["/"] },
      { userAgent: "Google-Extended", disallow: ["/"] },
      { userAgent: "Bytespider", disallow: ["/"] },
      { userAgent: "Amazonbot", disallow: ["/"] },
      // Allow everything else (including retrieval bots)
      {
        userAgent: "*",
        allow: "/",
        disallow: ["/dashboard/", "/api/"],
      },
    ],
    sitemap: "https://yoursite.com/sitemap.xml",
  }
}
```

This approach keeps your robots.txt in version control and type-checked, with no static file to forget about.
Common mistakes
Blocking GPTBot and thinking you blocked ChatGPT search
GPTBot and ChatGPT-User are separate user agents. GPTBot is for training. ChatGPT-User is for live search queries. Blocking GPTBot does not remove you from ChatGPT search results.
Using a blanket wildcard block
A wildcard group (User-agent: * with Disallow: /) blocks every crawler, including Googlebot. Never do this unless you genuinely want zero search traffic.
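The difference in robots.txt terms, with the blanket block shown only as what to avoid:

```txt
# DON'T: this blocks every crawler, including Googlebot
User-agent: *
Disallow: /

# DO: scope the block to the specific AI training crawler
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```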
Not having a robots.txt at all
If there is no robots.txt, all bots, training and retrieval alike, assume they have full access. That is fine for retrieval bots, but it leaves your entire site open to AI training crawlers.
Pair robots.txt with llms.txt
While robots.txt controls who can crawl, llms.txt tells AI models what your site is about in plain language. It is a simple text file at your root that helps ChatGPT, Perplexity, and Gemini understand your product without parsing your entire HTML. Think of robots.txt as the bouncer and llms.txt as the welcome mat.
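A minimal llms.txt sketch, following the proposed format (an H1 name, a blockquote summary, then H2 sections of links). The product name and URLs below are invented placeholders:

```txt
# Acme Analytics

> Acme Analytics is a privacy-first web analytics dashboard for small teams.

## Docs

- [Getting started](https://yoursite.com/docs/getting-started): install and first report
- [API reference](https://yoursite.com/docs/api): REST endpoints and auth

## Optional

- [Blog](https://yoursite.com/blog): product updates and guides
```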
FAQ
Should I block GPTBot in robots.txt?
It depends on your goals. GPTBot is used by OpenAI to train future models. If you do not want your content used for training, block it. But note that GPTBot is separate from ChatGPT-User, which powers ChatGPT search. You can block one without blocking the other.
What happens if I block all AI bots in robots.txt?
You prevent your content from appearing in AI-powered search results (ChatGPT, Perplexity, Gemini). This means fewer citations, less referral traffic, and reduced visibility in the fastest-growing search surfaces. Only block training bots, not retrieval bots.
Does blocking AI crawlers affect my Google rankings?
No. Googlebot is separate from AI training crawlers. Blocking GPTBot, CCBot, or ClaudeBot has zero effect on your Google search rankings. However, blocking GoogleOther may affect your visibility in Google AI Overviews.
How do I check if AI bots are crawling my site?
Check your server access logs for user agent strings like GPTBot, ClaudeBot, CCBot, PerplexityBot, and ChatGPT-User. You can also use SEOLint to scan your robots.txt and flag any misconfigured AI bot rules.
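A quick tally from the command line looks like this. The log path and format (combined log format, with the user agent as the final quoted field) are assumptions; the sample data is created here purely for demonstration, so point the grep at your real access log instead.

```shell
# Sample log data standing in for a real access log
cat > /tmp/access.log <<'EOF'
1.2.3.4 - - [10/May/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.2"
5.6.7.8 - - [10/May/2025:10:01:00 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
9.9.9.9 - - [10/May/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"
EOF

# Count requests per AI bot user agent, most frequent first
grep -oE 'GPTBot|ClaudeBot|CCBot|PerplexityBot|ChatGPT-User|Bytespider|Amazonbot' /tmp/access.log \
  | sort | uniq -c | sort -rn
```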
Check your robots.txt automatically
SEOLint scans your site and flags misconfigured AI bot rules, missing sitemaps, and more.