llms.txt: The New robots.txt for AI Crawlers (Full Guide)

A new file has been quietly accumulating at the root of well-run websites: llms.txt. It sits next to robots.txt and sitemap.xml, written in Markdown, listing the pages an AI model should read if it wants to understand the site without crawling the entire thing. Proposed in late 2024 by Jeremy Howard, it's not a standard in the IETF sense. No crawler is required to honor it. Most large LLM providers don't fetch it yet. And despite all of that, the file is showing up on more sites every week, including ones whose maintainers are not particularly enthusiastic about being scraped.

The reason is that llms.txt solves a real problem in a way nothing else does. As AI models become a primary surface for discovering content, sites need a way to give models a clean, intentional map of what they offer — not the noisy crawl of every paginated archive, but a curated index of the pages that actually matter. robots.txt was built to gatekeep. sitemap.xml was built for search-engine indexing. llms.txt is built to teach, and that's a different job. Whether the file becomes a permanent fixture of the web or quietly fades depends on adoption, but the case for shipping one today is stronger than the lukewarm coverage suggests.

What llms.txt actually is

llms.txt is a Markdown file placed at the root of a domain (https://yoursite.com/llms.txt) that contains a structured summary of the site's most important content. The format follows a loose convention rather than a strict schema, but the working pattern is:

# Site Name

> One-paragraph description of the site, who it serves, and what problems it solves.

## Documentation

- [Getting Started](https://yoursite.com/docs/getting-started): What new users do first.
- [API Reference](https://yoursite.com/docs/api): Full API endpoints, request shapes, responses.

## Product

- [Pricing](https://yoursite.com/pricing): Tier breakdown and what's included.
- [Features](https://yoursite.com/features): Capabilities and use cases.

## Blog

- [How X Works](https://yoursite.com/blog/how-x-works): Deep dive on the X system.

The file is Markdown because Markdown is what LLMs read best. The bullet structure mirrors a sitemap, but each item gets a short prose description, which both helps the model decide which links to fetch and provides immediate context if the model can't crawl them all.
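To make the convention concrete, here is a minimal sketch of how a consumer might read the file. It assumes the loose pattern shown above (one H1, an optional blockquote summary, H2 sections, and `- [Title](url): description` items); the function name `parse_llms_txt` and the returned dictionary shape are illustrative choices, not part of the proposal.

```python
import re

# Matches list items of the form "- [Title](url): description".
# The description part is optional.
LINK_RE = re.compile(r"^- \[([^\]]+)\]\(([^)]+)\)(?::\s*(.*))?$")


def parse_llms_txt(text):
    """Parse the loose llms.txt pattern into title, summary, and sections."""
    result = {"title": None, "summary": None, "sections": {}}
    current = None  # name of the H2 section currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                title, url, description = match.groups()
                result["sections"][current].append(
                    {"title": title, "url": url, "description": description or ""}
                )
    return result


SAMPLE = """# Site Name

> One-paragraph description of the site.

## Documentation

- [Getting Started](https://yoursite.com/docs/getting-started): What new users do first.
- [API Reference](https://yoursite.com/docs/api): Full API endpoints.
"""

parsed = parse_llms_txt(SAMPLE)
```

Because the format is a convention rather than a schema, a real consumer would likely need to be more forgiving than this (for example, about URLs containing parentheses or items without descriptions).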

A companion file, llms-full.txt, is often shipped alongside. The convention there is to include the full Markdown content of the linked pages inline, not just the links — so a single fetch gives the model the whole documentation set without crawling. It's bandwidth-heavier but more reliable, since some agents can't or don't follow links from llms.txt.

What llms.txt is not

A few myths are worth clearing up before the file's reputation outruns its capabilities.

It is not a permission system. robots.txt tells crawlers what they may or may not access. llms.txt tells them what's worth accessing. The two are complementary, not substitutes. If you want to block GPTBot, you do that in robots.txt; llms.txt doesn't have a Disallow directive.
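The split between the two files can be illustrated with Python's standard-library robots.txt parser. The robots.txt content below is a made-up example that blocks GPTBot and allows everyone else; `ExampleAgent` is a hypothetical user agent used only for contrast.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot, allow every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Access decisions come from robots.txt, never from llms.txt.
gptbot_allowed = parser.can_fetch("GPTBot", "https://yoursite.com/docs/api")
other_allowed = parser.can_fetch("ExampleAgent", "https://yoursite.com/docs/api")
```

Here `gptbot_allowed` is `False` and `other_allowed` is `True`. Nothing in llms.txt changes either answer; it only tells the crawlers that are allowed where to look first.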

It is not a Google ranking signal. Google has not announced any use of llms.txt for crawling or ranking, and there's no evidence it's currently being read by Googlebot or feeding into AI Overviews. Sites that ship the file should not expect Google traffic gains from it.

It is not honored by every LLM. Major providers (Anthropic, OpenAI, Google) have not publicly committed to reading llms.txt. Some third-party crawlers, AI search startups, and developer agents do consume it. The honest summary in 2026 is that adoption is uneven and the file's audience is partial.

And it is not a sitemap replacement. sitemap.xml still serves a distinct purpose for traditional search engines. llms.txt adds a layer; it doesn't subtract one.

Why ship it anyway

If most large LLMs don't read it yet, the file's value depends on a more nuanced calculus.

Smaller AI tools do read it. Open-source agents, RAG-based startups, developer tools like Cursor and Cline, internal enterprise agents, and a growing set of AI search products either fetch llms.txt automatically or offer it as an option. The long tail of AI consumers is non-trivial, even if the head (Google, OpenAI) isn't reading it.

It's cheap to ship. A static Markdown file, generated once, served at a known URL. The maintenance cost is trivial — usually a build step that regenerates the file from your content registry. The downside is essentially zero.

It forces an editorial decision about what matters. Writing llms.txt is the exercise of picking the 20 to 200 URLs on your site that you most want an AI to know about. That exercise is useful regardless of whether anything reads the file. Most sites have never explicitly answered "which 50 pages would we want an LLM to ingest if it could only ingest 50?" — and the answer is often surprising.

It's a forward bet. If llms.txt becomes a de facto standard the way robots.txt did, sites that ship it now have a head start. The risk of being early is low; the cost of being late depends on how the standard evolves.

It sets up llms-full.txt, which is more immediately useful. llms-full.txt is consumed today by tools doing on-the-fly RAG against your documentation. Anthropic's Claude, when given an llms-full.txt URL, can ingest the entire documentation set in one request, as long as the file fits in its context window, which makes it dramatically better at answering questions about the product. That's a measurable user-experience win.

None of these are a slam-dunk on their own. Together they justify the modest engineering cost.

How to structure llms.txt for your site

The file's structure depends on what your site is. Three common archetypes:

Documentation-heavy product sites (Stripe, Vercel, Supabase). The body of the file is organized by documentation section: Getting Started, Core Concepts, API Reference, Guides, Reference. The blog and marketing pages get a smaller section near the end. The full content of the docs is shipped via llms-full.txt. This is the most mature use case and the one where current AI agents extract the most value.

Content-heavy marketing sites (most SaaS, most media). The file is organized by content category: Product, Blog, Customers, Pricing. The blog section lists the top 20 to 50 evergreen articles, not the full archive. The descriptions are written to give the model context about what's in each piece, not to repeat the meta description.

Marketplace or directory sites (job boards, app directories, etc.). The file is organized by entity type: Categories, Top Listings, How It Works, About. Listings change too frequently to enumerate, so the file points at category-level pages and trusts the model to crawl from there if needed.

For all archetypes, the description prose matters. The convention is short — one sentence, sometimes two — but the sentence should be informative, not promotional. "The pricing page with tier breakdown and feature inclusions" is more useful to a model than "Our flexible pricing options." Treat the descriptions as the metadata you'd give a librarian, not the copy you'd give an ad designer.

Generating llms.txt programmatically

Hand-curated llms.txt works for small sites. For sites with more than 50 to 100 important pages, generate the file from your content registry as part of the build.

The pattern most teams adopt: the file is generated by a script that reads from the same content source the sitemap uses, filtered to a curated subset based on metadata. In a typical Next.js or Astro site, the script lives in the build pipeline and writes the file to public/llms.txt. The metadata that drives inclusion can be a flag in frontmatter (includeInLlmsTxt: true), a category whitelist, or a quality score above a threshold.
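A minimal sketch of that build step, written in Python for brevity (a Next.js or Astro project would more likely do the same thing in a Node script). The `Page` record, its field names, and `render_llms_txt` are assumptions for illustration; the only idea taken from the text is the frontmatter-style inclusion flag.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One entry from a hypothetical content registry."""
    title: str
    url: str
    description: str
    section: str
    include_in_llms_txt: bool = False  # mirrors the includeInLlmsTxt flag


def render_llms_txt(site_name, summary, pages):
    """Render flagged pages into the llms.txt pattern, grouped by section."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    sections = {}
    for page in pages:
        if page.include_in_llms_txt:
            sections.setdefault(page.section, []).append(page)
    for name, items in sections.items():
        lines += [f"## {name}", ""]
        lines += [f"- [{p.title}]({p.url}): {p.description}" for p in items]
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


pages = [
    Page("Getting Started", "https://yoursite.com/docs/getting-started",
         "What new users do first.", "Documentation", True),
    Page("Tag: news", "https://yoursite.com/tag/news",
         "Auto-generated tag archive.", "Blog", False),
]
llms_txt = render_llms_txt("Site Name", "One-paragraph description of the site.", pages)
# A real build step would write this string to public/llms.txt.
```

Sections appear in the order their first flagged page is encountered, so the registry order doubles as the file order in this sketch.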

For llms-full.txt, the script concatenates the full Markdown content of the included pages, separated by horizontal rules and prefixed with the URL each section corresponds to. The file ends up large — often megabytes — but the size is acceptable because the file is fetched rarely and the consumers are agents, not browsers.

A few practical notes. Strip JavaScript and HTML from the inlined Markdown; agents want clean text. Include the publication date and last-modified date in each section's header. Keep the URLs absolute, not relative — agents may not preserve base URLs across the file.
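The concatenation step and the practical notes above could look roughly like the following. This is a sketch under assumptions: the page dictionaries, their keys, and the header layout are invented for illustration, and the regex-based tag stripping is deliberately crude (it would also remove angle-bracket text inside code samples, so a production pipeline would want a proper HTML or Markdown parser).

```python
import re
from urllib.parse import urljoin

SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def clean_markdown(markdown):
    """Drop script blocks first, then any remaining HTML tags."""
    without_scripts = SCRIPT_RE.sub("", markdown)
    return TAG_RE.sub("", without_scripts).strip()


def render_llms_full(base_url, pages):
    """Concatenate page bodies, each prefixed with its absolute URL and dates."""
    sections = []
    for page in pages:
        url = urljoin(base_url, page["path"])  # keep URLs absolute
        header = (
            f"URL: {url}\n"
            f"Published: {page['published']}\n"
            f"Last modified: {page['modified']}"
        )
        sections.append(f"{header}\n\n{clean_markdown(page['markdown'])}")
    # Horizontal rules separate the sections.
    return "\n\n---\n\n".join(sections) + "\n"


# Made-up sample pages, for illustration only.
docs = [
    {"path": "/docs/getting-started", "published": "2025-01-10", "modified": "2025-06-01",
     "markdown": "# Getting Started\n\n<div>Install the CLI.</div><script>alert(1)</script>"},
    {"path": "/docs/api", "published": "2025-02-03", "modified": "2025-07-15",
     "markdown": "# API Reference\n\nEndpoints and request shapes."},
]
llms_full = render_llms_full("https://yoursite.com", docs)
```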

What to put in the file (and what to keep out)

The curation question is the substantive one. Three rules of thumb:

Include pages that answer questions, not pages that drive conversions. Pricing, features, and product pages are useful for context but rarely useful for LLM citations. Documentation, guides, deep blog posts, and explainers are what models cite and quote. Weight the file accordingly.

Exclude noisy aggregations. Tag pages, paginated archives, search result pages, and category indexes don't help a model. They dilute the signal. The full sitemap can include them; llms.txt should not.

Exclude duplicates and near-duplicates. If you have a landing page version, a blog version, and a documentation version of the same content, pick one, the version with the strongest, cleanest writing, and exclude the rest. Models that crawl several versions of the same content hit a coherence problem.

A reasonable target for content sites is 50 to 200 entries. Documentation sites can have several hundred. Files with thousands of entries become harder for models to use effectively, defeating the curation purpose.
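These rules of thumb can be approximated in code. The sketch below is one possible filter, not a prescribed method: the URL patterns for "noisy" pages, the `quality` score, and the `canonical_topic` key used to group near-duplicates are all assumptions that would need to match your own site and registry.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical patterns for tag pages, category indexes, search results,
# and paginated archives. Adjust to your own URL scheme.
NOISY_PATH_RE = re.compile(r"^/(tag|tags|category|search)(/|$)|/page/\d+/?$")


def is_noisy(url):
    parsed = urlparse(url)
    if NOISY_PATH_RE.search(parsed.path):
        return True
    return "page" in parse_qs(parsed.query)  # e.g. ?page=3


def curate(pages, limit=200):
    """Drop noisy URLs, keep the best page per topic, cap the total."""
    best_by_key = {}
    for page in pages:
        if is_noisy(page["url"]):
            continue
        # Pages sharing a canonical_topic are treated as near-duplicates.
        key = page.get("canonical_topic", page["url"])
        kept = best_by_key.get(key)
        if kept is None or page["quality"] > kept["quality"]:
            best_by_key[key] = page
    ranked = sorted(best_by_key.values(), key=lambda p: p["quality"], reverse=True)
    return ranked[:limit]


# Made-up candidates, for illustration only.
candidates = [
    {"url": "https://yoursite.com/blog/how-x-works", "quality": 0.9, "canonical_topic": "how-x-works"},
    {"url": "https://yoursite.com/landing/how-x-works", "quality": 0.6, "canonical_topic": "how-x-works"},
    {"url": "https://yoursite.com/tag/x", "quality": 0.8},
    {"url": "https://yoursite.com/blog/page/2", "quality": 0.7},
    {"url": "https://yoursite.com/docs/api", "quality": 0.95},
]
selected = curate(candidates)
```

With these candidates, `selected` keeps the API reference and the blog version of "how X works"; the landing-page duplicate, the tag page, and the paginated archive are dropped.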

Adoption and the state of the spec

The llms.txt proposal lives in the open at llmstxt.org, where Jeremy Howard and contributors maintain a reference and a list of sites that have shipped one. The list is growing but still small relative to the broader web. Notable early adopters include Anthropic, Mintlify, Hugging Face, and a handful of developer tools companies.

The spec itself is intentionally loose. There's no formal schema, no validator, no required directives. That looseness is part of why adoption is gradual — there's no obvious "you must follow this format" pressure — but it also keeps the barrier to shipping a file very low.

What happens next depends on whether one or more major LLM providers commits to reading the file. If Google or OpenAI publicly start consuming it, the file will become standard infrastructure within a year. If they don't, it will remain a useful but optional file for the long tail of AI consumers. Either outcome justifies shipping one now.

What FastWrite does with llms.txt

FastWrite generates llms.txt and llms-full.txt for sites in its publishing pipeline. The file is regenerated on every content publish, includes the curated set of blog posts and pillar pages (filtered by quality score and topical relevance), and ships at the site root. For sites that opt in, FastWrite also tracks which AI agents fetch the file via server logs — partial data, but enough to validate whether the file is being read.
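For readers who want to do this kind of check on their own logs, here is a rough sketch of the idea. It is not FastWrite's implementation: it assumes access logs in the common "combined" format, and the log lines and user-agent strings are invented sample data.

```python
import re
from collections import Counter

# Combined log format: ... "METHOD path HTTP/x" status bytes "referer" "user-agent"
# Groups: 1 = request path, 2 = status code, 3 = user agent.
LOG_RE = re.compile(
    r'"(?:GET|HEAD) (\S+) HTTP/[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)
TRACKED_PATHS = {"/llms.txt", "/llms-full.txt"}


def count_llms_fetches(log_lines):
    """Count successful fetches of the llms files per user agent."""
    counts = Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        path, status, agent = match.groups()
        if path.split("?")[0] in TRACKED_PATHS and status == "200":
            counts[agent] += 1
    return counts


# Invented sample lines, for illustration only.
sample_log = [
    '203.0.113.5 - - [10/Jan/2026:12:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"',
    '203.0.113.5 - - [10/Jan/2026:12:00:05 +0000] "GET /llms-full.txt HTTP/1.1" 200 90210 "-" "ExampleAgent/1.0"',
    '198.51.100.7 - - [10/Jan/2026:12:01:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '198.51.100.9 - - [10/Jan/2026:12:02:00 +0000] "GET /llms.txt HTTP/1.1" 404 0 "-" "OtherBot/2.0"',
]
fetches = count_llms_fetches(sample_log)
```

User-agent strings can be spoofed or omitted, which is one reason this kind of data is only partial.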

The pattern we recommend: ship llms.txt from the start, include the 30 to 100 most important pages, and revisit the curation quarterly as the content portfolio grows.

FAQ

Do I need llms.txt if I already have a sitemap? They serve different purposes. sitemap.xml is for search-engine indexing — it lists every page. llms.txt is curated — it lists the pages worth an AI's attention. Most sites should ship both.

Will llms.txt help my Google AI Overview rankings? There's no public evidence Google reads llms.txt. The file is unlikely to influence Google AI Overviews directly. The benefit is for the broader set of AI agents and tools that do consume it.

What's the difference between llms.txt and llms-full.txt? llms.txt is an index — links plus short descriptions. llms-full.txt includes the full Markdown content of the linked pages inline, so an agent can ingest the whole content set in one fetch. Many sites ship both.

Should I block AI crawlers in robots.txt and still ship llms.txt? You can. The two files serve different functions. robots.txt blocks; llms.txt informs. If you've selectively allowed some crawlers, ship llms.txt to make those crawlers' indexing of your site cleaner. If you've blocked all AI crawlers, the file has less utility but also no cost.

How often should llms.txt be updated? Regenerate it on every content publish if you generate it programmatically. Re-curate the included set quarterly to drop stale entries and add new pillar pieces.
