How AI Crawlers Work: GPTBot, ClaudeBot, PerplexityBot, and Google-Extended Explained

For most of the web's history there was one kind of robot that mattered: the search crawler. Googlebot and Bingbot read your pages, built an index, and decided where you ranked. In 2026 a second class of automated visitors has become just as consequential. AI crawlers from OpenAI, Anthropic, Perplexity, Google, and Microsoft now fetch your pages to train models and to answer live questions in ChatGPT, Claude, Perplexity, and AI Overviews. If you want to be cited in those answers, you need to understand who these crawlers are, what they want, and how to control them.

The confusing part is that AI crawler traffic is not one thing. Some of it is bulk training ingestion that happens on its own schedule. Some of it is a real-time fetch triggered the instant a user asks a question. They use different user agents, respect different rules, and carry very different strategic weight. This guide profiles the major AI bot user agents, explains how to allow AI crawlers without losing control, and shows how to verify the hits in your logs. Run an AI Visibility Score first to see how often AI systems are actually surfacing your brand today.

Image: A network diagram showing AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) fetching pages from a single website root

The Two Kinds of AI Crawling

Before you decide what to allow or block, you have to understand that AI crawlers do two fundamentally different jobs. Treating them as a single category is the most common mistake publishers make, and it leads to robots.txt rules that quietly cost you visibility.

Training crawls are bulk ingestion. A bot like GPTBot or ClaudeBot fetches large volumes of public pages on its own schedule and the content may be used to train or improve future versions of a model. There is no user waiting on the other end. Blocking these crawlers keeps your content out of training data, which some publishers want for copyright or competitive reasons. The tradeoff is that your brand becomes less familiar to the underlying model over time.

Real-time retrieval, sometimes called user-triggered fetching, is the opposite. When someone asks ChatGPT or Perplexity a question that needs fresh information, the assistant fetches relevant URLs on the spot and uses them to build a grounded, cited answer. A real person is waiting. Bots like OAI-SearchBot, Perplexity-User, and the search-focused agents do this work. Blocking them does not keep your content out of training data. It just removes you from the live answer, along with the visible citation link that would have sent you traffic.

This distinction drives every decision that follows. Many publishers reasonably want to limit training crawls while keeping retrieval bots fully welcome, because retrieval is where the clicks and citations live. The good news is that the major providers split these jobs across separate, documented user agents, so you can make exactly that distinction in robots.txt.

Crawl Type	What It Does	Why You Might Keep It
Training crawl	Bulk ingestion of public pages on the provider's schedule, may feed model training	Makes your brand and content more familiar to the model itself over time
Real-time retrieval	User-triggered fetch of specific URLs to answer a live question with citations	This is where AI answers cite you and send referral clicks; blocking it removes you from those answers
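To make the split concrete, here is a minimal Python sketch that labels a request by crawl type. It uses only the four tokens named in the descriptions above. The function name `crawl_type`, the lookup table, and the sample header strings are illustrative assumptions, not anything a provider publishes.

```python
# Tokens named in the two descriptions above, grouped by the job they do.
CRAWL_TYPE_BY_TOKEN = {
    "gptbot": "training",
    "claudebot": "training",
    "oai-searchbot": "retrieval",
    "perplexity-user": "retrieval",
}


def crawl_type(user_agent: str) -> str:
    """Return "training", "retrieval", or "unknown" for a User-Agent header."""
    lowered = user_agent.lower()
    for token, kind in CRAWL_TYPE_BY_TOKEN.items():
        if token in lowered:
            return kind
    return "unknown"


# The header strings below are simplified stand-ins, not real provider strings.
print(crawl_type("Mozilla/5.0 (compatible; GPTBot/1.0)"))           # training
print(crawl_type("Mozilla/5.0 (compatible; Perplexity-User/1.0)"))  # retrieval
print(crawl_type("Mozilla/5.0 (X11; Linux x86_64) Firefox/130.0"))  # unknown
```

A label like this is only as trustworthy as the header it reads, so treat it as a first-pass grouping for reports rather than proof of who sent the request.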

Profiles of the Major AI Crawlers

Each AI provider runs more than one bot, and the user agent strings are how you tell them apart in robots.txt and in your logs. Here are the well-established ones you will actually see in 2026. Use the exact names below: robots.txt matching is case-insensitive, but the token must otherwise match what the provider documents.

OpenAI: GPTBot and OAI-SearchBot

OpenAI runs several distinct agents. GPTBot is the training crawler. It fetches public content that may be used to improve future models, and it is the one most publishers think of when they consider blocking AI. OpenAI also runs OAI-SearchBot, which powers search features and surfaces sites as sources in ChatGPT. There is also a user-triggered agent, commonly identified as ChatGPT-User, that fetches a page when a user or a ChatGPT action explicitly requests it. The practical takeaway: if you block GPTBot to stay out of training but leave OAI-SearchBot and ChatGPT-User allowed, you can keep appearing as a cited source while opting out of training ingestion.

Anthropic: ClaudeBot and Related Agents

Anthropic's primary crawler is ClaudeBot, which crawls public web content that may be used to train Claude models. Anthropic has also operated agents identified as Claude-Web and a user-triggered fetcher commonly seen as Claude-User, which retrieves a page when a Claude user's request needs live information. Anthropic publishes its crawler details and respects robots.txt directives. As with OpenAI, the cleanest approach is to decide separately on bulk crawling versus user-triggered retrieval rather than blanket-blocking every Anthropic agent.

Perplexity: PerplexityBot and Perplexity-User

Perplexity splits its work cleanly. PerplexityBot is the indexing crawler that builds Perplexity's own search index of the web. Perplexity-User is the user-triggered agent that fetches a specific page in response to a live user query, then cites it in the answer. Because Perplexity is fundamentally an answer engine that shows sources, blocking Perplexity-User is one of the most direct ways to remove yourself from cited answers that would otherwise drive referral traffic. Most brands that want AI visibility keep both Perplexity agents allowed.

Google: Google-Extended vs Googlebot

Google deliberately separates its AI controls from its search crawler. Googlebot is the long-standing search crawler that indexes your pages for Google Search. It is not an AI training token, and blocking it removes you from Google Search entirely, which almost nobody wants. Google-Extended is not a separate bot that fetches pages on its own. It is a robots.txt control token that lets you opt out of having your content used to train and ground Google's generative models, such as Gemini, while keeping normal Search indexing intact. The key nuance: disallowing Google-Extended does not affect your Google Search ranking, and allowing it does not give Google any access it did not already have through Googlebot. It is purely a consent signal for generative AI use.
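The independence of the two tokens is easy to see with Python's standard-library `urllib.robotparser`. This is a sketch of per-token rule matching only. Python's parser is not Google's, and `example.com` is a placeholder, but it shows that a rule written for Google-Extended does not touch the Googlebot group.

```python
from urllib.robotparser import RobotFileParser

# Opt out of generative AI use while leaving Search crawling untouched.
RULES = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

search_ok = parser.can_fetch("Googlebot", "https://example.com/guide")
genai_ok = parser.can_fetch("Google-Extended", "https://example.com/guide")
print(search_ok, genai_ok)  # True False
```

Because Google-Extended never fetches pages itself, the second result describes consent rather than access: the page is still crawled by Googlebot either way.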

Microsoft: Bingbot's Role in AI Answers

Bing matters more for AI than its search market share suggests, because the Bing index feeds AI answer experiences across Microsoft Copilot and is one of the retrieval sources several assistants lean on. Bingbot is the crawler that builds that index. If you block Bingbot, you do not just lose Bing Search visibility. You also weaken your presence in the AI answers that draw on Bing's index. For that reason, Bingbot should almost always stay allowed even for publishers who are cautious about other AI crawlers.

Deciding What to Allow or Block

There is no universally correct policy, only the right policy for your goals. The core tradeoff is visibility versus control. Every AI crawler you block is a place you cannot be cited or remembered. Every crawler you allow is content you give up some control over. Frame the decision around a few honest questions:

Do you want to be cited in AI answers? If yes, keep the retrieval and search agents allowed: OAI-SearchBot, ChatGPT-User, Perplexity-User, Claude-User, and Bingbot. These are the ones that put your link in front of a real person.
Do you object to training use? If you have copyright, licensing, or competitive concerns, disallow the training crawlers (GPTBot, ClaudeBot) and the Google-Extended token while leaving retrieval agents allowed.
Is server load a problem? Aggressive crawling can add real traffic. If a specific bot is hammering you, a Crawl-delay directive (which some crawlers honor and Googlebot ignores) or a targeted disallow on heavy paths is better than a blanket block.
Is your content your product? Publishers whose archives are the business (news, research, paid databases) often block training broadly and negotiate licensing instead. Tool and SaaS sites usually want maximum AI visibility and allow nearly everything.

For most marketing and product sites whose goal is to be found and cited, the default should lean open. The biggest risk is not over-exposure, it is accidentally blocking a retrieval bot and quietly vanishing from AI answers. Audit how AI systems read your site with the GEO Audit before you change any crawler rules.
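The first two questions above map directly onto robots.txt groups, which a short Python sketch can make explicit. The function name `build_robots_txt` and its two boolean parameters are assumptions made for illustration, and the token lists come from this guide. Server load and licensing decisions are deliberately left out, since they need per-path judgment.

```python
# Token groups as described in this guide.
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended"]
RETRIEVAL_TOKENS = ["OAI-SearchBot", "ChatGPT-User", "Perplexity-User", "Claude-User"]
SEARCH_TOKENS = ["Googlebot", "bingbot"]  # kept open in every policy


def build_robots_txt(want_citations: bool, object_to_training: bool) -> str:
    """Turn two yes/no policy answers into per-agent robots.txt groups."""
    groups = []

    def add(tokens, allow):
        rule = "Allow: /" if allow else "Disallow: /"
        for token in tokens:
            groups.append(f"User-agent: {token}\n{rule}")

    add(TRAINING_TOKENS, allow=not object_to_training)
    add(RETRIEVAL_TOKENS, allow=want_citations)
    add(SEARCH_TOKENS, allow=True)
    return "\n\n".join(groups) + "\n"


# Example: stay citable in live answers, opt out of training.
print(build_robots_txt(want_citations=True, object_to_training=True))
```

Keeping the search tokens in their own always-allowed list is a design choice that guards against the accidental Googlebot or bingbot block this section warns about.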

How to Verify AI Crawler Hits in Your Logs

You cannot manage what you cannot see. Server access logs are the ground truth for which AI crawlers are actually visiting and how often. Start by grepping your logs for the user agent tokens above. A simple filter per provider tells you whether that provider's bots are reaching you:

# Count hits from each major AI crawler in an access log
grep -aiE "GPTBot|OAI-SearchBot|ChatGPT-User" access.log | wc -l
grep -aiE "ClaudeBot|Claude-Web|Claude-User" access.log | wc -l
grep -aiE "PerplexityBot|Perplexity-User" access.log | wc -l
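# Google-Extended is a robots.txt token, not a fetching agent, so it never
# appears in access logs; Google's fetches show up as Googlebot instead
grep -ai "Googlebot" access.log | wc -l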
grep -ai "bingbot" access.log | wc -l

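# See the most active bots overall (assumes the combined log format, where
# the user agent is the 6th field when the line is split on double quotes)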
grep -aiE "bot|crawler|spider" access.log \
  | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20

Two cautions when reading the results. First, the user agent string alone is easy to spoof, so a hit claiming to be GPTBot is not proof. The major providers publish IP ranges or support reverse DNS verification, so for anything that matters you should confirm the request comes from the provider's declared IPs, not just the right user agent. Second, watch the difference between bulk training crawls, which arrive in waves on the provider's schedule, and retrieval hits, which spike right after your content gets mentioned in a popular answer.
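The IP check in the first caution can be sketched with Python's standard `ipaddress` module. The ranges below are placeholders taken from the reserved documentation blocks, not any provider's real addresses, and the names `PUBLISHED_RANGES` and `ip_matches_provider` are assumptions. In practice you would load the current ranges each provider publishes.

```python
import ipaddress

# PLACEHOLDER ranges from the reserved documentation blocks. Replace them
# with the ranges each provider actually publishes before relying on this.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def ip_matches_provider(ip: str, token: str) -> bool:
    """Return True if ip falls inside a range listed for the claimed token."""
    address = ipaddress.ip_address(ip)
    for cidr in PUBLISHED_RANGES.get(token, []):
        network = ipaddress.ip_network(cidr)
        if address.version == network.version and address in network:
            return True
    return False


print(ip_matches_provider("192.0.2.17", "GPTBot"))   # True
print(ip_matches_provider("203.0.113.5", "GPTBot"))  # False
```

A request whose user agent claims GPTBot but whose address fails this check is best treated as an unverified scraper.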

If you do not have raw log access, your CDN or edge platform usually exposes bot analytics, and several offer dedicated AI crawler dashboards. The goal is the same either way: a clear picture of which AI bot user agents reach your site, which paths they hit, and whether your robots.txt rules are doing what you intended.
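If you do have raw logs, a short Python sketch can produce the "which bots hit which paths" picture described above. It assumes the common combined log format. The regular expression, the `paths_by_bot` name, and the sample lines (including their simplified user agent strings) are all illustrative.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and user agent fields of a
# combined-log-format line. Other log formats need a different pattern.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "Claude-User", "PerplexityBot", "Perplexity-User", "bingbot"]


def paths_by_bot(lines):
    """Count hits per (token, path) pair for the AI tokens listed above."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for token in AI_TOKENS:
            if token.lower() in agent:
                hits[(token, match.group("path"))] += 1
                break
    return hits


# Made-up sample lines with simplified user agent strings.
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.10 - - [01/Jan/2026:00:00:05 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2026:00:01:00 +0000] "GET /blog/post HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Perplexity-User/1.0)"',
    '203.0.113.9 - - [01/Jan/2026:00:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/130.0"',
]

for (token, path), count in paths_by_bot(SAMPLE_LINES).most_common():
    print(token, path, count)
```

To run it on a real file, pass an open file object in place of `SAMPLE_LINES`, since the function only needs an iterable of lines.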

Robots.txt Examples for AI Crawlers

Robots.txt is where you turn your policy into rules. It is a public file at the root of your domain that names user agents and sets allow or disallow paths for each. AI crawlers from the major providers respect it. Here is a permissive setup that welcomes every AI crawler, ideal for a brand that wants maximum AI visibility:

# Allow all AI crawlers full access
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: bingbot
Allow: /

And here is the nuanced setup most publishers actually want: opt out of training ingestion while staying fully available for real-time retrieval and citation. This blocks the training crawlers and the Google-Extended consent token, but keeps the search and user-triggered agents open so you still appear as a cited source:

# Opt out of training, stay available for live AI answers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Keep retrieval and search agents allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: bingbot
Allow: /

# Standard search crawler, always keep allowed
User-agent: Googlebot
Allow: /

A few rules to get this right. Each User-agent block applies only to the named token, and most crawlers follow the most specific block that names them. A bot with its own named block ignores the blanket User-agent: * block entirely, so Disallow lines you set under * do not carry over to it and must be repeated in the named block if you still want them. Never accidentally disallow Googlebot or Bingbot while trying to block AI, since those are your search and AI-answer lifelines. Generate a correct file with the Robots.txt Generator and confirm it does what you think with the Robots.txt Analyzer before you ship.
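The precedence rule is worth testing rather than trusting. The sketch below uses Python's standard `urllib.robotparser` as a stand-in for a crawler. Real crawlers run their own parsers, so treat the output as a sanity check, and note that the file contents, the helper name `allowed`, `SomeOtherBot`, and `example.com` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def allowed(token: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Ask the parser whether the named token may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, "https://example.com" + path)


print(allowed("GPTBot", "/guide"))                  # False: its own block disallows all
print(allowed("OAI-SearchBot", "/guide"))           # True
print(allowed("SomeOtherBot", "/private/report"))   # False: falls back to the * block
print(allowed("OAI-SearchBot", "/private/report"))  # True: the * rule is not inherited
```

The last line is the one that catches people out: giving a bot its own Allow block also lifts any restrictions it would otherwise have picked up from the blanket block.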

Image: A robots.txt file open in an editor with separate User-agent blocks for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended highlighted

How This Connects to llms.txt

Robots.txt and AI crawler policy answer the question of access: who is allowed to fetch what. They do not tell an AI system anything about what your site is or which pages matter most. That is the gap llms.txt fills. Where robots.txt is a permissions file, llms.txt is a comprehension file, a curated markdown map at the root of your domain that points AI systems at your most important pages.

The two work as a pair. First you use robots.txt to allow the AI crawlers you want, so they can reach you at all. Then you use llms.txt to guide the ones you allowed toward your best content, so the comprehension they build is accurate and flattering. Allowing a crawler with no guidance lets it form its own, sometimes wrong, picture of your brand. For the full format and a step-by-step build, read our companion llms.txt AI crawler guide. And to understand what actually makes a page citation-worthy once a crawler reaches it, see how AI search engines decide what to cite.
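As a rough illustration of what a curated markdown map can look like, the Python sketch below assembles one from a title, a one-line summary, and grouped links. The layout (a top-level heading, a quoted summary, then sections of annotated links) follows the commonly described llms.txt convention, but it is an assumption here rather than a formal standard. The `build_llms_txt` name, the company, and the URLs are hypothetical.

```python
def build_llms_txt(title, summary, sections):
    """Build llms.txt text from a title, summary, and {heading: [(name, url, note)]}."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip("\n") + "\n"


# Hypothetical site used only to show the shape of the output.
print(build_llms_txt(
    "Example Co",
    "A hypothetical SaaS product for scheduling.",
    {"Docs": [("Pricing", "https://example.com/pricing", "Plans and limits")]},
))
```

The value is in the curation: list the handful of pages you most want AI systems to read, not a full sitemap.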

Practical Checklist

Here is the short version you can act on today, in order:

Pull your access logs and count hits from GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Perplexity-User, and bingbot so you know who is already visiting
Decide your policy along one axis: opt out of training or not, and keep retrieval agents allowed either way
Verify suspicious bot hits against the provider's published IP ranges, since user agents can be spoofed
Write per-agent robots.txt blocks with the Robots.txt Generator and confirm them with the Robots.txt Analyzer
Double-check you never disallowed Googlebot or bingbot while blocking AI training crawlers
Add a llms.txt file to guide the crawlers you allowed toward your best pages
Run a GEO Audit and track your AI Visibility Score so you can tell whether your policy is helping or hurting

Common Questions About AI Crawlers

Does blocking GPTBot remove me from ChatGPT?

Not entirely. GPTBot is the training crawler. Blocking it opts you out of training ingestion, but ChatGPT can still cite you through its search and user-triggered agents, OAI-SearchBot and ChatGPT-User, as long as you leave those allowed. If you block every OpenAI agent, then yes, you remove yourself from ChatGPT answers that rely on live retrieval.

Will disallowing Google-Extended hurt my Google ranking?

No. Google-Extended is a consent token for generative AI use, separate from Googlebot. Disallowing it opts your content out of training and grounding Google's generative models without affecting how Googlebot crawls and ranks you in Google Search. The two controls are deliberately independent.

Can I trust the user agent string in my logs?

Only as a first signal. User agent strings are trivial to spoof, so a request claiming to be ClaudeBot or PerplexityBot may be something else entirely. The major providers publish IP ranges or support reverse DNS lookups so you can verify that a request truly came from them before you rely on it.

Do AI crawlers actually respect robots.txt?

The established crawlers from OpenAI, Anthropic, Perplexity, Google, and Microsoft document their user agents and respect robots.txt directives. Robots.txt is voluntary by design, so it is not a security boundary, and lesser-known scrapers may ignore it. For the major providers, though, a correct robots.txt is a reliable control. If you need a hard block, enforce it at the server or firewall level instead.
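For the server-level option, here is a minimal sketch of what enforcement can look like in a Python WSGI application. The middleware name `block_agents`, the blocked token list, and the demo app are assumptions for illustration. Most sites would do the equivalent in their web server, CDN, or firewall configuration instead.

```python
# Lowercase tokens to refuse outright. Adjust to match your own policy.
BLOCKED_TOKENS = ("gptbot", "claudebot")


def block_agents(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from blocked user agents get a 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Tiny demo app standing in for your real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


guarded = block_agents(hello_app)
```

A user agent filter like this stops honest bots that ignore robots.txt, but a scraper that lies about its user agent will pass straight through, so pair it with IP-based rules when the block really matters.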

Should a typical marketing site block AI crawlers?

Usually not. If your goal is to be found and cited, the bigger risk is accidentally blocking a retrieval agent and disappearing from AI answers. Most brands should lean open, allow the search and user-triggered agents, and only restrict training crawls if they have a specific copyright or competitive reason. Start by running an AI Visibility Score to see where you stand, then tune your crawler policy from there.