The web has lived under two machine-readable contracts for decades: robots.txt tells crawlers what they are allowed to access, and sitemap.xml tells them what exists. Neither was designed for large language models that need to understand a site, not just index it. In September 2024, Jeremy Howard of Answer.AI proposed a third standard called llms.txt, a small markdown file at the root of your domain that gives AI systems a curated map of the most important content on your site. It is a simple idea with surprising leverage.
llms.txt is not a magic bullet, and as of 2026 no major LLM provider officially crawls it on a production schedule. But Anthropic, Perplexity, and a growing number of AI agents are starting to read it, and the cost of adding one is roughly an hour of work. For brands that want to be cited by ChatGPT, Claude, Gemini, and Perplexity, it is cheap insurance and a forcing function for cleaner information architecture. Run a GEO Audit first to see where your AI readiness stands, then layer llms.txt on top.
Image: A browser address bar showing /llms.txt at the root of a domain with the markdown contents preview underneath
What Is llms.txt and Where Did It Come From
llms.txt is a proposed standard introduced by Jeremy Howard at Answer.AI in September 2024. It is a plain markdown file that lives at the root of your domain (e.g., yourdomain.com/llms.txt) and gives large language models a structured, human-curated overview of your site. Where robots.txt controls access and sitemap.xml lists every URL, llms.txt highlights the pages that matter most for understanding what your business does, what topics you cover, and where the canonical sources of truth live.
The motivation is simple. LLM context windows are large but not infinite, and crawling a full site to understand it is wasteful when most pages are noise (login screens, paginated archives, duplicate variants). An llms.txt file lets the publisher say: here is the short version. Here are the docs, the product pages, the canonical guides. If you only have time to read ten things on this site, read these. That curation is valuable to any AI system that wants to give a grounded answer about your brand.
It is worth noting that llms.txt is a community proposal, not a W3C standard or an official protocol from any LLM provider. Adoption is voluntary on both sides. But the format has already been embraced by FastHTML, Anthropic's docs, Cloudflare, and hundreds of developer-tool companies, and the pattern is spreading quickly. Pair llms.txt with a strong AI Visibility Score and proper schema markup, and you have a complete machine-readable surface area.
How llms.txt Differs From robots.txt and sitemap.xml
The three files often get conflated because they all live at the root of a domain and all serve machines, but they solve very different problems: robots.txt is a permissions file, sitemap.xml is a discovery file, and llms.txt is a comprehension file. Here is the clean breakdown:
| File | Who Reads It | What It Does |
|---|---|---|
| robots.txt | Search crawlers (Googlebot, Bingbot) and AI crawlers (GPTBot, ClaudeBot, PerplexityBot) | Sets allow and disallow rules for specific user agents, controls access to paths |
| sitemap.xml | Search crawlers and indexers | Lists every indexable URL with last-modified dates and priority hints for discovery |
| llms.txt | LLMs and AI agents that need a curated overview | Provides a markdown index of the most important pages with descriptions and section grouping |
You need all three. Use the Robots.txt Generator to define crawler permissions, the XML Sitemap Generator to expose every indexable URL, and llms.txt to highlight the pages that matter for AI comprehension. Validate your existing setup with the Robots.txt Analyzer and the Sitemap Validator before adding llms.txt on top.
The llms.txt File Format
The format is intentionally minimal. It is just markdown, parseable by humans and by any LLM without a custom schema. The structure follows four conventions:
- H1 heading: The name of your project, product, or site. There should be exactly one H1 at the top of the file.
- Blockquote summary: A short, one- or two-sentence description of what the project is, immediately under the H1. This is what an LLM uses to anchor its understanding of your brand.
- H2 sections with linked lists: Each H2 groups related resources (Docs, Guides, API Reference, Pricing, About). Under each H2 is a markdown list of links, each with a short description after a colon.
- Optional H2 section: An H2 literally titled "Optional" signals less critical pages. LLMs with limited context can safely skip this section.
Here is a minimal but complete example for a fictional SaaS company:
```markdown
# Acme Analytics

> Acme Analytics is a privacy-first product analytics platform for SaaS teams.
> We help companies understand user behavior without third-party cookies.

## Docs

- [Quickstart](https://acme.com/docs/quickstart): Install the SDK and send your first event in under five minutes
- [API Reference](https://acme.com/docs/api): Full REST API documentation with example requests
- [SDKs](https://acme.com/docs/sdks): Official client libraries for JavaScript, Python, Ruby, and Go

## Guides

- [Self-hosting](https://acme.com/guides/self-host): How to deploy Acme Analytics on your own infrastructure
- [GDPR Compliance](https://acme.com/guides/gdpr): Configuration patterns for EU privacy requirements
- [Migration from GA4](https://acme.com/guides/migrate-ga4): Step-by-step migration playbook

## Product

- [Pricing](https://acme.com/pricing): Plan tiers, usage limits, and enterprise options
- [Security](https://acme.com/security): SOC 2 details, encryption, and data residency options

## Optional

- [Changelog](https://acme.com/changelog): Release notes for the past 12 months
- [Blog](https://acme.com/blog): Long-form posts on analytics and product strategy
```
Notice three things about this example. First, every link is an absolute URL with the full domain. Relative paths break when the file is fetched out of context. Second, every link has a short description after the colon, which gives the LLM enough signal to decide whether to fetch the page. Third, the Optional section is genuinely optional content. Read more in our guide on how AI search engines decide what to cite.
Image: A side-by-side view of an llms.txt markdown source on the left and the rendered structured outline (project name, summary, sections, links) on the right
Who Currently Reads llms.txt
It is important to be honest here. As of 2026, no major LLM provider has officially committed to crawling llms.txt on a regular schedule the way Googlebot crawls sitemap.xml. OpenAI, Google, and Microsoft have not made public statements about systematic ingestion. That is the realistic baseline.
That said, the picture is not static. Anthropic has been actively considering the standard and references it in developer documentation. Perplexity has been observed fetching llms.txt files during agentic workflows. A growing class of AI agents and coding assistants (Cursor, Aider, Cline, custom MCP servers) routinely look for llms.txt before ingesting a site, because it is the cheapest way to get a curated map without burning context. And third-party AI search tools that build their own indexes are increasingly including llms.txt as a signal.
The right way to think about this is forward investment. The cost of publishing an llms.txt file is roughly an hour of work. The upside is that as adoption grows, you are already in the index. The downside is essentially zero. This is the same shape as adopting schema markup in 2012, or adopting Open Graph tags in 2014. Early adopters paid almost nothing and captured outsized rewards as the standard matured. For a deeper view of what AI search engines reward today, read our guide on getting content found by AI search engines.
How to Create Your llms.txt File: Step by Step
Building an llms.txt file is a forcing function for clarity. If you cannot describe your site in 10 to 30 curated links, your information architecture probably needs work. Follow these steps:
Step 1: Pick the Canonical URLs That Matter
Start by listing the 10 to 30 pages an LLM would need to read to understand your business. Think: what does the model need to answer questions about your product, your pricing, your documentation, and your point of view? Avoid duplicate variants, paginated archives, and login-walled pages. Every URL in llms.txt should be a public, canonical, fully-rendered page. If you are unsure which pages carry the most weight, run them through the AEO Ready Checker first.
Step 2: Write a Tight Project Description
The H1 and blockquote summary are the most important lines in the entire file. They are what an LLM reads first and what it weights most heavily when forming an internal model of your brand. Keep the H1 to your real product or company name (no slogans). Keep the blockquote to one or two sentences that name what you do, who you do it for, and what makes you specific. If you cannot write this in 30 seconds, your positioning is unclear, and that is a much bigger problem than llms.txt.
Step 3: Organize Sections by Intent
Group your links under H2 sections that match user intent, not internal team structure. Common patterns: Docs, Guides, API Reference, Product, Pricing, About, Case Studies. Each section should have between 2 and 8 links. If a section has only one link, fold it into another. If it has more than 10, you are probably over-listing. Put less critical pages under an explicit H2 titled "Optional" so that LLMs with tight context budgets can skip them safely.
Step 4: Host the File at Your Root
The file must live at https://yourdomain.com/llms.txt. Not /docs/llms.txt, not /assets/llms.txt. Root only. Serve it as text/plain or text/markdown with a 200 status code. Do not gate it behind auth, do not redirect it, do not 404 in some regions. Test that an unauthenticated curl request returns the expected content from any geography. While you are there, also add a reference to it in your robots.txt as a comment so future crawlers can discover it more easily.
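Once the file is live, a scripted smoke test catches most hosting mistakes faster than eyeballing curl output. Here is a minimal sketch in Python, assuming the third-party requests library is installed; acme.com is a placeholder for your own domain:

```python
# Minimal hosting smoke test for llms.txt -- a sketch, not a full validator.
# Assumes the third-party `requests` library; "acme.com" is a placeholder.
import requests

resp = requests.get("https://acme.com/llms.txt", allow_redirects=False, timeout=10)

# Must be a direct 200: no 301 to another path, no 404, no auth wall
assert resp.status_code == 200, f"expected 200, got {resp.status_code}"

# Must be served as text/plain or text/markdown, not text/html
ctype = resp.headers.get("Content-Type", "")
assert ctype.startswith(("text/plain", "text/markdown")), f"bad content type: {ctype}"

# The file should open with a single H1
assert resp.text.lstrip().startswith("# "), "file should start with an H1 heading"
print("llms.txt hosting looks healthy")
```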
Step 5: Validate the File
Several open-source validators exist (llmstxt.org has a community validator, and a handful of GitHub projects offer linting). Check that your H1 is unique, that every link returns a 200, that the markdown parses cleanly, and that absolute URLs are used everywhere. Treat any link that 301s, 404s, or requires auth as a bug. Once validated, add the file to your deploy pipeline so it stays in sync with your actual site structure.
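If you would rather own the checks than depend on a third-party tool, the whole validation pass fits in a short script. A rough sketch in Python covering the rules above (unique H1, absolute URLs, every link returning a 200); the regexes are deliberately simple, and any redirect is treated as a failure:

```python
# Rough llms.txt validator sketch: one H1, absolute URLs, all links live.
# Uses only the standard library plus the third-party `requests` package.
import re
import requests

def validate_llms_txt(text: str) -> list[str]:
    errors = []
    # Exactly one H1 at the top of the file
    if len(re.findall(r"^# ", text, flags=re.MULTILINE)) != 1:
        errors.append("file must contain exactly one H1")
    # Every markdown link must be absolute and return a direct 200
    for title, url in re.findall(r"\[([^\]]+)\]\(([^)]+)\)", text):
        if not url.startswith("https://"):
            errors.append(f"relative or non-HTTPS URL: {url}")
            continue
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.status_code != 200:  # 301s and 404s count as bugs
            errors.append(f"{url} returned {resp.status_code}")
    return errors

if __name__ == "__main__":
    with open("llms.txt") as f:
        problems = validate_llms_txt(f.read())
    for p in problems:
        print("FAIL:", p)
    raise SystemExit(1 if problems else 0)
```

Run it in CI so the build fails the moment a listed URL starts redirecting or 404ing.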
Step 6: Monitor and Iterate
Watch your access logs for fetches of /llms.txt. Expect a small but steady stream of agent traffic from Perplexity, Anthropic, Cursor, Aider, custom MCP servers, and various crawlers. Track which user agents are reading the file and how often. Whenever you launch a major new doc, product page, or guide, update llms.txt to point at it. A stale llms.txt is worse than no llms.txt because it actively misleads AI systems about what your site covers.
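For the tracking itself, a few lines of log parsing go a long way. A sketch that tallies the user agents fetching /llms.txt from an nginx or Apache combined-format access log; the log path and pattern are assumptions about your own setup:

```python
# Count which user agents fetch /llms.txt -- a sketch for combined-format
# access logs; adjust the path and pattern to your own logging setup.
import re
from collections import Counter

# combined format: ... "GET /llms.txt HTTP/1.1" <status> <bytes> "<referer>" "<user agent>"
PATTERN = re.compile(r'"GET /llms\.txt[^"]*" 200 [\d-]+ "[^"]*" "([^"]*)"')

agents: Counter[str] = Counter()
with open("access.log") as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            agents[match.group(1)] += 1

for agent, hits in agents.most_common(10):
    print(f"{hits:5d}  {agent}")
```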
llms.txt vs llms-full.txt
The ecosystem around the proposal also converged on a sibling file: llms-full.txt. The two serve very different roles. llms.txt is an index, a curated table of contents that tells an LLM where to go. llms-full.txt is the full content, with every page concatenated into a single markdown document for one-shot ingestion. An agent that hits llms-full.txt does not need to fetch each page individually because everything is already in one request.
For most sites, llms.txt alone is the right starting point. llms-full.txt is most useful when you have a tightly bounded body of content (developer docs, an API reference, a recipe library) that fits comfortably in a modern context window. A 200-page documentation site can ship as a single 500 KB markdown file and be ingested in one model call. A general marketing site with sprawling content cannot. If you do publish llms-full.txt, keep it under roughly 1 MB and regenerate it on every deploy so it never drifts from reality.
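If you do ship it, generate it rather than maintain it by hand. A sketch of a deploy-time build step, assuming your docs live as markdown files under docs/; the paths, the filename-derived page headings, and the size cap are illustrative:

```python
# Regenerate llms-full.txt on every deploy so it never drifts -- a sketch.
# Assumes markdown docs under docs/; adjust paths to your repo layout.
from pathlib import Path

MAX_BYTES = 1_000_000  # keep the concatenated file under roughly 1 MB

parts = [Path("llms.txt").read_text()]  # lead with the curated index
for page in sorted(Path("docs").rglob("*.md")):
    # Separate pages with a rule and a heading derived from the filename
    parts.append(f"\n\n---\n\n# {page.stem}\n\n{page.read_text()}")

full = "".join(parts)
if len(full.encode("utf-8")) > MAX_BYTES:
    raise SystemExit("llms-full.txt exceeds ~1 MB; trim the Optional content")

Path("llms-full.txt").write_text(full)
```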
How llms.txt Fits Into Your AI Search Strategy
Be clear-eyed about what llms.txt does and does not do. It is one signal among many, and right now it is not the highest-leverage signal. Domain authority, schema markup, page-level GEO content quality, and third-party citations all move the needle more today. llms.txt is closer to a clean robots.txt than it is to a backlink campaign. It is hygiene, not strategy.
The honest priority order for AI search visibility, ranked by impact in 2026:
- Domain authority and trust signals (audit with Domain Authority Checker) determine whether AI systems consider your domain at all
- GEO-ready content structure (test with GEO Audit) determines whether your individual pages are extractable and citation-ready
- Schema markup (build with Schema Markup Logic Builder) gives AI systems machine-readable entity data. See our deep dive on schema markup for AI search
- Meta tag clarity (review with Meta Tag Analyzer) tells AI crawlers how to classify each individual page
- llms.txt sits at the bottom of this list today, but its cost is so low that there is no reason not to add it
Think of llms.txt as the cheapest item on the AI-search hygiene checklist. It will not save a site with thin content and zero authority, but it adds polish to a site that has the fundamentals right. And if adoption accelerates over the next 12 to 24 months, you are already positioned.
Common Mistakes to Avoid
Most failed llms.txt files fail in the same handful of ways. Watch for these patterns before you ship:
- Linking to login-walled pages. If a URL requires authentication, an agent fetching it will get a 401 or a generic login screen. Strip these. llms.txt is a public file pointing at public pages.
- Listing every page on the site. That is what sitemap.xml is for. llms.txt should be curated. If your file is over 100 links, you are doing it wrong. Most strong llms.txt files have between 15 and 40 carefully chosen links.
- Broken or redirecting links. Every URL must return a 200. A 301 or 404 in your llms.txt is a credibility hit. Add a link checker to your CI pipeline so the file fails the build when something breaks.
- Relative URLs. Always use absolute URLs with the full domain. The file gets fetched and parsed in isolation, so any relative path resolves incorrectly.
- No canonical project name. Your H1 is the canonical name an LLM will associate with the content. If you use a slightly different name in different sections (Acme, Acme Inc, Acme Analytics), you confuse the model.
- Forgetting to update. A stale llms.txt is worse than none. If you launch a major new product or rename a section, update the file the same day. Add a reminder to your release checklist.
- No descriptions on links. Every link needs a short description after the colon. Bare links waste the LLM's attention budget, and short descriptions are cheap to write.
Start with Your llms.txt File Today
llms.txt is the cheapest forward investment in AI search visibility you can make in 2026. It costs an hour, it forces you to clarify your information architecture, and it positions you to benefit as adoption among LLM providers and AI agents grows. It will not fix a weak site, but it will add a clean layer of machine-readability on top of one that is already solid.
Here is your action plan:
- Audit your current AI readiness with the GEO Audit and the AI Visibility Score
- Pick the 15 to 30 most important canonical pages on your site
- Write a tight one-line description of your project for the blockquote summary
- Group pages under H2 sections by intent (Docs, Guides, Product, Pricing)
- Host the file at https://yourdomain.com/llms.txt with a 200 response and text/plain content type
- Validate every link returns 200 and add a link checker to CI
- Confirm your robots.txt and sitemap.xml are healthy alongside the new llms.txt
- Add llms.txt updates to your release checklist so it stays in sync with the site
The brands that get cited by AI in 2027 are the ones that treated machine-readability as a first-class concern in 2026. llms.txt is a small piece of that, but it is a piece you can ship today. Run a GEO Audit first, fix the page-level fundamentals, then layer on llms.txt as the final polish.