The Technical SEO Audit Needs A New Layer

The standard technical SEO audit checks crawlability, indexability, website speed, mobile-friendliness, and structured data. That checklist was designed for one consumer: Googlebot.

This is how it’s always been.

In 2026, your website has at least a dozen additional non-human consumers. AI crawlers like GPTBot, ClaudeBot, and PerplexityBot train models and power AI search results. User-triggered agents like the newly announced Google-Agent, or its “siblings” Claude-User and ChatGPT-User, browse websites on behalf of specific humans in real time. A Q1 2026 analysis across Cloudflare’s network found that 30.6% of all web traffic now comes from bots, with AI crawlers and agents making up a growing share. Your technical audit needs to account for all of them.

Here are the five layers to add to your existing technical SEO audit.

Layer 1: AI Crawler Access

Your robots.txt was probably written for Googlebot, Bingbot, and maybe a few scrapers. AI crawlers need their own robots.txt rules, and they need to be separate from Googlebot and Bingbot.

What To Check

Review your robots.txt for rules targeting AI-specific user agents: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, AppleBot-Extended, CCBot, and ChatGPT-User. If none of these appear, you’re running on defaults, and those defaults might not reflect what you actually want. Never accept the defaults unless you know they are exactly what you need.

The key is making a conscious decision per crawler rather than blanket allowing or blocking everything. Not all AI crawlers serve the same purpose. AI crawler traffic can be split into three categories: training crawlers that collect data for model training (89.4% of AI crawler traffic according to Cloudflare data), search crawlers that power AI search results (8%), and user-triggered agents like Google-Agent and ChatGPT-User that browse on behalf of a specific human in real time (2.2%). Each category warrants a different robots.txt decision.

Cloudflare Radar data showing traffic volume by crawl purpose (Q1 2026); Screenshot by author, April 2026

The crawl-to-referral ratios from Cloudflare’s Radar report can make this an informed decision. Anthropic’s ClaudeBot crawls roughly 20,600 pages for every referral it returns. OpenAI’s ratio is about 1,300:1. Meta sends no referrals. Blocking OpenAI’s OAI-SearchBot or PerplexityBot reduces your visibility in ChatGPT Search and Perplexity’s AI answers, while blocking training-focused crawlers like CCBot or Meta’s crawler stops data extraction by providers that send zero traffic back. The crawl-to-referral ratios tell you who is taking without giving.
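
What a conscious per-crawler policy looks like will differ from site to site. As a rough sketch only (the allow/block choices below are illustrative, not a recommendation):

User-agent: OAI-SearchBot      # AI search crawler that can send referrals
Allow: /

User-agent: PerplexityBot      # AI search crawler that can send referrals
Allow: /

User-agent: CCBot              # training-only crawler, no referral traffic
Disallow: /

User-agent: Bytespider         # training-only crawler, no referral traffic
Disallow: /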

There is one crawler that requires special attention. Google added Google-Agent to its official list of user-triggered fetchers on March 20, 2026. Google-Agent identifies requests from AI systems running on Google infrastructure that browse websites on behalf of users. Unlike traditional crawlers, Google-Agent ignores robots.txt. Google’s position is that since a human initiated the request, the agent acts as a user proxy rather than an autonomous crawler. Blocking Google-Agent requires server-side authentication, not robots.txt rules. This is both interesting and important for the future, even if it’s beyond the scope of this article.

Official documentation for each crawler:

Layer 2: JavaScript Rendering

Googlebot renders JavaScript using headless Chromium. There is nothing new about that. What is new and different is that virtually every major AI crawler does not render JavaScript.

Crawler Renders JavaScript
GPTBot (OpenAI) No
ClaudeBot (Anthropic) No
PerplexityBot No
CCBot (Common Crawl) No
AppleBot Yes
Googlebot Yes

AppleBot (which uses a WebKit-based renderer) and Googlebot are the only major crawlers that render JavaScript. Four of the six major web crawlers (GPTBot, ClaudeBot, PerplexityBot, and CCBot) fetch static HTML only, making server-side rendering a requirement for AI search visibility, not an optimization. If your content lives in client-side JavaScript, it is invisible to the crawlers training OpenAI, Anthropic, and Perplexity’s models and powering their AI search products.

What To Check

Run curl -s [URL] on your critical pages and search the output for key content like product names, prices, or service descriptions. If that content isn’t in the curl response, GPTBot, ClaudeBot, and PerplexityBot can’t see it either. Alternatively, use View Source in your browser (not Inspect Element, which shows the rendered DOM after JavaScript execution) and check whether the important information is present in the raw HTML.
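
For example, with a hypothetical URL and product name (substitute your own page and a phrase that should appear in its main content):

curl -s https://www.example.com/product-page/ | grep -i "widget pro 3000"

If grep returns nothing, that phrase is not in the raw HTML, and the crawlers listed above will not see it.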

Curl fetch of No Hacks homepage (Image from author, April 2026)

Single-page applications (SPAs) built with React, Vue, or Angular are particularly at risk unless they use server-side rendering (SSR) or static site generation (SSG). A React SPA that renders product descriptions, pricing, or key claims entirely on the client side is sending AI crawlers a blank page with a link to the JavaScript bundle.

The fix isn’t complicated. Server-side rendering (SSR), static site generation (SSG), or pre-rendering solves this for every major framework. Next.js supports SSR and SSG natively for React, Nuxt provides the same for Vue, and Angular Universal handles server rendering for Angular applications. The audit just needs to flag which pages depend on client-side JavaScript for critical content.

Layer 3: Structured Data For AI

Structured data has been part of technical SEO audits for years, but the evaluation criteria need updating. The question is no longer just “does this page have schema markup?” It’s “does this markup help AI systems understand and cite this content?”

What To Check

  • JSON-LD implementation (preferred over Microdata and RDFa for AI parsing).
  • Schema types that go beyond the basics: Organization, Article, Product, FAQ, HowTo, Person.
  • Entity relationships: sameAs, author, publisher connections that link your content to known entities (see the sketch after this list).
  • Completeness: are all relevant properties populated, or are you just checking a box with skeleton schemas that contain only a name and URL?
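
As a rough illustration of what complete, connected markup looks like (every name and URL below is a placeholder), an Article that ties content to known author and publisher entities might be marked up along these lines:

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example article headline",
  "datePublished": "2026-04-01",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "sameAs": "https://www.linkedin.com/in/janedoe"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://www.example.com",
    "sameAs": [
      "https://www.linkedin.com/company/example-co",
      "https://en.wikipedia.org/wiki/Example_Co"
    ]
  }
}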

Why This Matters Now

Microsoft’s Bing principal product manager Fabrice Canel confirmed in March 2025 that schema markup helps LLMs understand content for Copilot. The Google Search team stated in April 2025 that structured data gives an advantage in search results.

No, you can’t win with schema alone. Yes, it can help.

The data density angle matters too. The GEO research paper by Princeton, Georgia Tech, the Allen Institute for AI, and IIT Delhi (presented at ACM KDD 2024, first to publicly use the term “GEO”) found that adding statistics to content improved AI visibility by 41%. Yext’s analysis found that data-rich websites earn 4.3x more AI citations than directory-style listings. Structured data contributes to data density by giving AI systems machine-readable facts rather than requiring them to extract meaning from prose.

An important caveat: No peer-reviewed academic studies exist yet on schema’s impact on AI citation rates specifically. The industry data is promising and consistent, but treat these numbers as indicators rather than guarantees.

W3Techs reports that approximately 53% of the top 10 million websites use JSON-LD as of early 2026. If your website isn’t among them, you’re missing signals that both traditional and AI search systems use to understand your content.

Duane Forrester, who helped build Bing Webmaster Tools and co-launched Schema.org, argues that schema markup is only step one. As AI agents continue moving from simply interpreting pages to making decisions, brands will also need to publish operational truth (pricing, policies, constraints) in machine-verifiable formats with versioning and cryptographic signatures. Publishing machine-verifiable source packs is beyond the scope of a standard audit today, but auditing structured data completeness and accuracy is the foundation verified source packs build on.

Layer 4: Semantic HTML And The Accessibility Tree

The first three layers of the AI-readiness audit cover crawler access (robots.txt), JavaScript rendering, and structured data. The final two address how AI agents actually read your pages and what signals help them discover and evaluate your content.

Most SEOs evaluate HTML for search engine consumption. Agentic browsers like ChatGPT Atlas, Chrome with auto browse, and Perplexity Comet don’t parse pages the way Googlebot does. They read the accessibility tree instead.

The accessibility tree is a parallel representation of your page that browsers generate from your HTML. It strips away visual styling, layout, and decoration, keeping only the semantic structure: headings, links, buttons, form fields, labels, and the relationships between them. Screen readers like VoiceOver and NVDA have used the accessibility tree for decades to make websites usable for people with visual impairments. AI agents now use the same tree to understand and interact with web pages.

And the reason is simple: efficiency. Processing screenshots is both more expensive and slower than working with the accessibility tree.

This is what an accessibility tree looks like in Google Chrome (Image from author, April 2026)

This matters because the accessibility tree exposes what your HTML actually communicates, not what your CSS (or JS) makes it look like. A <div> styled to look like a button doesn’t appear as a button in the accessibility tree. An image without alt text means nothing. A heading hierarchy that skips from H1 to H4 creates a broken structure that both screen readers and AI agents will struggle to navigate.

Microsoft’s Playwright MCP, the standard tool for connecting AI models to browser automation, uses accessibility snapshots rather than raw HTML or screenshots. Playwright MCP’s browser_snapshot function returns an accessibility tree representation because it’s more compact and semantically meaningful for LLMs. OpenAI’s documentation states that ChatGPT Atlas uses ARIA tags to interpret page structure when browsing websites.

Web accessibility and AI agent compatibility are now the same discipline. Proper heading hierarchy (H1-H6) creates meaningful sections that AI systems use for content extraction. Semantic elements like <nav>, <main>, <article>, and <section> tell machines what role each content block plays. Form labels and descriptive button text make interactive elements understandable to agents that parse the accessibility tree instead of rendering visual design.
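
A minimal sketch of that structure in markup (placeholder content):

<header>
  <nav aria-label="Main">...</nav>
</header>
<main>
  <article>
    <h1>Product name</h1>
    <section>
      <h2>Pricing</h2>
      <p>...</p>
    </section>
    <section>
      <h2>Specifications</h2>
      <p>...</p>
    </section>
  </article>
  <aside aria-label="Related products">...</aside>
</main>
<footer>...</footer>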

What To Check

  • Heading hierarchy: logical H1-H6 structure that machines can use to understand content relationships.
  • Semantic elements: nav, main, article, section, aside, header, footer, used appropriately.
  • Form inputs: every input has a label, every button has descriptive text.
  • Interactive elements: clickable things use <button> or <a>, not a <div> with a click handler.

  • Accessibility tree: run a Playwright MCP snapshot or test with VoiceOver/NVDA to see what agents actually see.

Somehow, things are getting worse on this front. The WebAIM Million 2026 report found that the average web page now has 56.1 accessibility errors, up 10.1% from 2025.

ARIA (Accessible Rich Internet Applications) usage increased 27% in a single year. ARIA is a set of HTML attributes that add extra semantic information to elements, telling screen readers and AI agents things like “this div is actually a dialog” or “this list functions as a menu.” But what’s critical is this: pages with ARIA present had significantly more errors (59.1 on average) than pages without ARIA (42 on average). Adding ARIA without understanding it makes things worse, not better, because incorrect ARIA overrides the browser’s default accessibility tree interpretation with wrong information. Start with proper semantic HTML. Add ARIA only when native elements aren’t sufficient.
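
A simple before-and-after (hypothetical markup; submitForm() stands in for whatever the page actually does):

<!-- Fragile: needs ARIA and JavaScript to approximate what a button already does -->
<div role="button" tabindex="0" onclick="submitForm()">Submit</div>

<!-- Robust: the native element exposes the correct role, name, and keyboard behavior by default -->
<button type="submit">Submit</button>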

Technical SEOs do not need to become accessibility experts. But treating accessibility as someone else’s problem is no longer viable when the same tree that screen readers parse is now the primary interface between AI agents and your website.

Sidenote: The Markdown Shortcut Doesn’t Work

Serving raw markdown files to AI crawlers instead of HTML can result in a 95% reduction in token usage per page. However, Google Search Advocate John Mueller called this “a stupid idea” in February 2026 on Bluesky. Mueller’s argument was this: “Meaning lives in structure, hierarchy and context. Flatten it and you don’t make it machine-friendly, you make it meaningless.” LLMs were trained on normal HTML pages from the beginning and have no problems processing them. The answer isn’t to create a flat, simplified version for machines. It’s to make the HTML itself properly structured. Well-written semantic HTML already is the machine-readable format. Besides, that simplified version already exists in the accessibility tree, and it is what AI agents already use.

Layer 5: AI Discoverability Signals

The final layer covers signals that don’t fit neatly into traditional audit categories but directly affect how AI systems discover and evaluate your website.

llms.txt (dishonourable mention). Listed first for one reason only: ask any LLM what you should do to make your website more visible to AI systems, and llms.txt will be at or near the top of the list. It’s their world, I guess. The llms.txt specification provides a simple markdown file that helps AI agents understand your website’s purpose, structure, and key content. No large-scale adoption data has been published yet, and its actual impact on AI citations is unproven. But LLMs consistently recommend it, which means AI-powered audit tools and consultants will flag its absence. It takes minutes to create and costs nothing to maintain.
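
If you do create one, the llms.txt proposal calls for plain markdown: an H1 with the site name, a short blockquote summary, and sections of annotated links. A minimal sketch with placeholder names and URLs:

# Example Co

> Example Co sells industrial widgets and publishes research on widget safety.

## Key pages

- [Product catalog](https://www.example.com/products/): full specifications and pricing
- [Research library](https://www.example.com/research/): widget safety studies and data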

OK, now that we’ve got that out of the way, let’s look at what might really matter.

AI crawler analytics. Are you monitoring AI bot traffic? Cloudflare’s AI Audit dashboard shows which AI crawlers visit, how often, and which pages they hit. If you’re not on Cloudflare, check server logs for Google-Agent, ChatGPT-User, and ClaudeBot user agent strings. Google publishes a user-triggered-agents.json file containing IP ranges that Google-Agent uses, so you can verify whether incoming requests are genuinely from Google rather than spoofed user agent strings.
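
A minimal sketch of that log check in Node.js, assuming a combined-format access log at ./access.log (adjust the path and the agent list for your setup):

const fs = require("fs");

// User agent substrings to count; extend this list as needed.
const agents = ["Google-Agent", "ChatGPT-User", "GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot"];
const counts = Object.fromEntries(agents.map((agent) => [agent, 0]));

for (const line of fs.readFileSync("./access.log", "utf8").split("\n")) {
  for (const agent of agents) {
    if (line.includes(agent)) counts[agent] += 1;
  }
}

console.table(counts);

Keep in mind that user agent strings can be spoofed, which is why verifying requests against published IP ranges matters for anything beyond rough trend monitoring.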

Entity definition. Does your website clearly define what the business is, who runs it, and what it does? Not in marketing copy, but in structured, machine-parseable markup. Organization schema should include name, URL, logo, founding date, and sameAs links to verified profiles on LinkedIn, Crunchbase, and Wikipedia. Person schema for key people should connect them to the organization via author and employee properties. AI systems need to resolve your identity as a distinct entity before they can confidently recommend you over competitors with similar names or offerings. Don’t slap this on top of your website when your designer is done with their work. Start here; it will make your life easier.

Content position. Where you place information on the page directly affects whether AI systems cite it. Kevin Indig’s analysis of 98,000 ChatGPT citation rows across 1.2 million responses found that 44.2% of all AI citations come from the top 30% of a page. The bottom 10% earns only 2.4-4.4% of citations regardless of industry. Duane Forrester calls this “dog-bone thinking”: strong at the beginning and end, weak in the middle, a pattern Stanford researchers have confirmed as the “lost in the middle” phenomenon. Audit your key pages: are the most important claims and data points in the first 30%, or buried in the middle?

Content extractability. Pull any key claim from your page and read it in isolation. Does it still make sense without the surrounding paragraphs? AI retrieval systems like ChatGPT, Perplexity, and Google AI Overviews extract and cite individual passages, and sentences that rely on “this,” “it,” or “the above” for meaning become unusable when pulled from their original context. Ramon Eijkemans’ excellent utility-writing framework maps these principles to documented retrieval mechanisms: self-contained sentences, explicit entity relationships, and quotable anchor statements that AI systems can confidently cite without additional inference.

The Audit Checklist

Check | Tool/Method | What You’re Looking For
AI crawler robots.txt | Manual review | Conscious per-crawler decisions
JavaScript rendering | curl, View Source, Lynx browser | Critical content in static HTML
Structured data | Schema validator, Rich Results Test | Complete, connected JSON-LD
Semantic HTML | axe DevTools, Lighthouse | Proper elements, heading hierarchy
Accessibility tree | Playwright MCP snapshot, screen reader | What agents actually see
AI bot traffic | Cloudflare, server logs | Volume, pages hit, patterns

From Audit To Action

This audit identifies gaps. Fixing them requires a sequence, because some fixes depend on others. Optimizing content structure before establishing a machine-readable identity means agents can extract your information, but can’t confidently attribute it to your brand. I wrote Machine-First Architecture to provide that sequence: identity, structure, content, interaction, each pillar building on the previous one.

Why The Technical SEO Audit Is Where This Belongs

None of this is technically SEO. Robots.txt rules for AI crawlers don’t affect Google rankings. Accessibility tree optimization doesn’t move keyword positions. Content position scoring has nothing to do with search indexing.

But most of it did grow out of technical SEO. Crawl management, structured data, semantic HTML, JavaScript rendering, server log analysis: these are skills technical SEOs already have. The audit methodology transfers directly. The consumer it serves is what changed.

The websites that get cited in AI responses, that work when Chrome auto browse visits them, that show up when someone asks ChatGPT for a recommendation, they won’t be the ones with the best content alone. They’ll be the ones whose technical foundation made that content accessible to machines. Technical SEOs are the people best equipped to build that foundation. The old audit template just needs a new section to reflect it.


Google May Expand Unsupported Robots.txt Rules List

Google may expand the list of unsupported robots.txt rules in its documentation based on analysis of real-world robots.txt data collected through HTTP Archive.

Gary Illyes and Martin Splitt described the project on the latest episode of Search Off the Record. The work started after a community member submitted a pull request to Google’s robots.txt repository proposing two new tags be added to the unsupported list.

Illyes explained why the team broadened the scope beyond the two tags in the PR:

“We tried to not do things arbitrarily, but rather collect data.”

Rather than add only the two tags proposed, the team decided to look at the top 10 or 15 most-used unsupported rules. Illyes said the goal was “a decent starting point, a decent baseline” for documenting the most common unsupported tags in the wild.

How The Research Worked

The team used HTTP Archive to study what rules websites use in their robots.txt files. HTTP Archive runs monthly crawls across millions of URLs using WebPageTest and stores the results in Google BigQuery.

The first attempt hit a wall. The team “quickly figured out that no one is actually requesting robots.txt files” during the default crawl, meaning the HTTP Archive datasets don’t typically include robots.txt content.

After consulting with Barry Pollard and the HTTP Archive community, the team wrote a custom JavaScript parser that extracts robots.txt rules line by line. The custom metric was merged before the February crawl, and the resulting data is now available in the custom_metrics dataset in BigQuery.

What The Data Shows

The parser extracted every line that matched a field-colon-value pattern. Illyes described the resulting distribution:

“After allow and disallow and user agent, the drop is extremely drastic.”

Beyond those three fields, rule usage falls into a long tail of less common directives, plus junk data from broken files that return HTML instead of plain text.
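
The actual HTTP Archive metric is more involved, but the core idea is a line-by-line field:value extraction along these lines (an illustrative sketch, not the real parser):

// Count every robots.txt field that matches a field:value pattern, ignoring comments and junk lines.
function extractFields(robotsTxt) {
  const counts = {};
  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    const match = line.match(/^([A-Za-z][A-Za-z0-9_-]*)\s*:\s*(.+)$/);
    if (!match) continue; // skips blank lines, HTML error pages, and other junk
    const field = match[1].toLowerCase();
    counts[field] = (counts[field] || 0) + 1;
  }
  return counts;
}

console.log(extractFields("User-agent: *\nDisallow: /private/\nCrawl-delay: 10"));
// → { 'user-agent': 1, disallow: 1, 'crawl-delay': 1 }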

Google currently supports four fields in robots.txt. Those fields are user-agent, allow, disallow, and sitemap. The documentation says other fields “aren’t supported” without listing which unsupported fields are most common in the wild.

Google has clarified that unsupported fields are ignored. The current project extends that work by identifying specific rules Google plans to document.

The top 10 to 15 most-used rules beyond the four supported fields are expected to be added to Google’s unsupported rules list. Illyes did not name specific rules that would be included.

Typo Tolerance May Expand

Illyes said the analysis also surfaced common misspellings of the disallow rule:

“I’m probably going to expand the typos that we accept.”

His phrasing implies the parser already accepts some misspellings. Illyes didn’t commit to a timeline or name specific typos.

Why This Matters

Search Console already surfaces some unrecognized robots.txt tags. If Google documents more unsupported directives, that could make its public documentation more closely reflect the unrecognized tags people already see surfaced in Search Console.

Looking Ahead

The planned update would affect Google’s public documentation and how disallow typos are handled. Anyone maintaining a robots.txt file with rules beyond user-agent, allow, disallow, and sitemap should audit for directives that have never worked for Google.

The HTTP Archive data is publicly queryable on BigQuery for anyone who wants to examine the distribution directly.



Google Lists Best Practices For Read More Deep Links

Google updated its snippet documentation today with a new section on “Read more” deep links in Search results. The section outlines three best practices for increasing the likelihood that a page appears with these deep links.

What A Read More Deep Link Is

Google defines the feature as “a link within a snippet that leads users to a specific section on that page.”

The examples in the documentation show the link appearing inside the snippet area of a standard Search result.

Screenshot from: developers.google.com/search/docs/appearance/snippet, April 2026.

The Three Best Practices

Google lists three best practices that can increase the likelihood of these links appearing.

First, content must be immediately visible to a human on page load. Content hidden behind expandable sections or tabbed interfaces can reduce that likelihood, per Google’s guidance.

Second, avoid using JavaScript to control the user’s scroll position on page load. One example Google gives is forcing the user’s scroll to the top of the page.

Third, if the page uses history API calls or window.location.hash modifications on page load, keep the hash fragment in the URL. Removing it breaks deep linking behavior.
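
In code, the difference looks something like this (hypothetical snippet):

// Breaks deep linking: rewriting the URL on load and dropping the fragment.
history.replaceState(null, "", window.location.pathname);

// Preserves deep linking: carry the existing hash fragment through.
history.replaceState(null, "", window.location.pathname + window.location.hash);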

More Context

Read more deep links are one type of anchor URL that appears in Search Console performance reports. John Mueller previously addressed those hashtag URLs, confirming that they come from Google and link to page sections.

Before today’s addition, the documentation was last revised in 2024. That change clarified page content, not the meta description, as the primary source of search snippets.

Why This Matters

For websites, the new guidance outlines what can increase the likelihood that a Read more deep link will appear.

Pages using accordion UI patterns, tabbed content, or forced-scroll JavaScript may reduce that likelihood. Teams working with single-page applications should ensure that hash fragments remain in URLs during page loads.

Looking Ahead

This is a documentation clarification, not a new SERP feature. Read more deep links have appeared in Search for some time. What’s new is the written guidance on how to increase that likelihood.

Developers working on JavaScript-heavy sites should test how their pages handle scroll position and hash fragments on initial load. Today’s update provides clearer signals on what can reduce the likelihood of a “Read more” link appearing.



Machine-First Architecture: AI Agents Are Here And Your Website Isn’t Ready, Says NoHacks Podcast Host

AI agents are already here. Not as a concept, not as a demo, but shipping inside browsers used by billions of people. Every major tech company has launched either a browser with AI built in or an extension that acts on your behalf.

Anthropic’s Claude for Chrome can navigate websites, fill forms, and perform multi-step operations on your behalf. Google announced Gemini in Chrome with agentic browsing capabilities, including auto browse, which can act on webpages for you. OpenClaw, the open-source AI agent, connects large language models directly to browsers, messaging apps, and system tools to execute tasks autonomously.

For more understanding about optimizing for agents, I spoke to Slobodan Manic, who recently wrote a five-part series on optimizing websites for AI agents. His perspective sits at the intersection of technical web performance and where AI agent interaction is actually heading.

From Slobodan Manic’s testing, almost every website is structurally broken for this shift.

“It started with us going to AI and asking questions. And now AI is coming to us and meeting us where we are. From my testing, I noticed that websites are nowhere near being ready for this shift because structurally almost every website is broken.”

The Single Biggest Thing That’s Changed

I started by asking Slobodan what’s changed in the last six to nine months that means SEOs need to pay attention to AI agents right now.

“Every major tech company has launched either a browser that has AI in it that can do things for you or some kind of extension that gets into Chrome. Claude has a plugin for Chrome that can do things for you, not just analyze web pages, summarize web pages, but actually perform operations.”

When ChatGPT first launched in late 2022 and made AI widely accessible, we asked it simple questions, much as we typed basic queries into search engines 25 years ago. We are now becoming more sophisticated and fluid with our prompting as we realize that AI can do so much more than [write me an email to politely decline an invitation].

Agents represent an even bigger shift to a different dynamic, where AI can complete tasks on our behalf and run complex systems. [Check my emails and delete any that are spam, sort them into a priority group, and surface what needs my immediate attention and provide a qualified response to anything on a basic query, plus make appointments in my calendar for any meeting invites].

Understanding and taking advantage of the possibilities is something we are all trying to figure out right now. What we should be aware of is that most websites aren’t built or ready for this agentic world.

Websites Are Becoming Optional, Or Are They?

I have a theory that brand websites are becoming hubs, the central point that connects all of your content assets online. But Slobodan has gone further. He’s written about websites becoming optional for the end user, with pages built by machines for machines and the interaction happening through closed system interfaces. I asked him to expand on that vision and what kind of timeframe we’re realistically looking at.

“First I’ll say that this is not fully happening today. This is still near to mid future. This is not March 2026,” he clarified. But the signals are concrete.

“Google had a patent granted in January that will let them use AI to rewrite the landing page for you if your landing page is not good enough. And then we have all these other companies including Google that announced Gemini browsing for you inside Chrome. So we have an end-to-end AI system that does everything while humans just wait for results.”

He was careful not to overstate it. People still like to browse, read, and compare things. Websites aren’t disappearing.

“Just the same way as mobile traffic has not killed desktop traffic even if it’s taken a bigger share of traffic overall, higher percentage of overall traffic while the desktop traffic is staying flat in terms of absolute numbers, I think this is another lane that will open where things will be happening without a human being involved in every step.”

His timeline for this: “Within a year we can have this become a reality. Not majority, but if Google starts rewriting landing pages using AI, we will see this happening probably 2027, if not sooner.”

When Checkout Becomes A Protocol

Slobodan has written that checkout is becoming a protocol, not a page. If an AI agent can buy on your behalf without ever loading a brand’s website, I asked, “What does that mean for how brands build trust and differentiate when the customer never sees their site?”

“If you’re building trust in a checkout page, you’re doing it wrong. Let’s start there. That I firmly believe. This is not to do with AI. This was never the right place to build trust,” he responded.

Slobodan pointed to every Shopify checkout page that looks identical. “There’s no trust built there. It’s just a machine-readable page that looks the same for everyone, for every brand. You’re supposed to be doing your job before the user needs to pay you.”

This is where he referenced Jono Alderson, and the concept of upstream engineering. “Moving upstream and doing work there and not on the website is the only way to move forward for anyone whose job is optimizing websites. That’s SEO, that’s CRO, that’s content, that’s anyone doing any kind of website work.”

He best summarized by saying “Your website is a part of the equation. Your website is not the equation. And that’s the biggest structural shift that people need to make to survive moving forward.”

What SEOs And Brands Should Actually Do Now

I asked what SEOs and brands can practically start doing to transition over the next year. His answer reframed how we should think about the website itself.

“If your website was your storefront, and it was for decades, people come to you, people do business there. It needs to be a warehouse and a storefront moving forward or you’re not going to survive. Simple as that.”

“We had all those bookstores that were selling books in the ’90s and then Amazon shows up and then you need to be a warehouse. You need to exist in two planes at the same time for the near future at least. So focusing only on your website is the most wrong thing you can do moving forward.”

His main area of focus right now is what he calls machine-first architecture. The principle is to build for machines before you build for humans.

“You don’t build your website for humans until you’ve built it for machines. When you’re working on a product page, there’s no Figma, there’s no design, there’s no copy. You start with your schema. What is your schema supposed to say? What is the meaning of the page? You start with the meaning and then from that build into a web page as it’s built for humans.”

He compared it directly to the mobile-first shift. “That did not mean no desktop. That meant do the more difficult version of it first and then do the easy thing. Trust me, it’s a lot more complicated to add meaning and structure to a page that’s already been designed than to do it the other way.”

And it extends beyond the website. “If you’re saying something on your website, you better check all of your profiles everywhere online, what people are saying about you. It’s everything everywhere all at once. But this is what optimization has become and what it needs to be.”

I also put to him the argument that optimizing for LLMs is fundamentally different from SEO. His response was unequivocal.

“Hard disagree. The hardest possible disagree. If you were doing things the right way, working on the foundations and checking every box that has to be checked, it’s not different at all.”

Where he sees a difference is in the speed of consequences. “With AI in the mix, you just get exposed much faster and the consequences are much greater. There’s nothing different other than those two things.”

This echoed something I’ve felt strongly. The cycle is moving more quickly, but there’s so much similarity with what happened at the foundation of this industry 25 to 30 years ago, which I raised in my SEO Pioneers series. We’re feeling our way through in the same way. And Slobodan agreed.

“They figured this out once and maybe we should ask them how to figure it out again.”

Vibe Coding Is A Trap, Deep Work Is The Moat

For my last question, I put it to Slobodan that he’s said vibe coding is a trap and deep work is the only moat left. For the SEO practitioner feeling overwhelmed, what’s the one thing they should actually do this week?

“It’s really the foundations. I hate to give the boring answer, but it’s really fixing every single foundational thing that you have on your website or your website presence.”

He’s watched the industry chase one shiny tool after another. “There’s always a new shiny toy to work on while your website doesn’t work with JavaScript disabled. Just ignore all of that until you’ve fixed every single broken foundation you have on your website.”

On vibe coding specifically, he was precise: “I don’t like the term vibe coding. It just suggests that you have no idea what you’re doing and you’re happy about it. That’s the way that sounds to me. The concept of AI-assisted coding, it’s there. It’s great. It’s not going away.”

“But just focus on what you should be doing first before you use AI to do it faster.”

What resonated with me is how well this applies to writing, too. AI is brilliant at confidently producing a draft that, at first glance, looks great. But when you actually read it, you realize it’s just somebody confidently talking nonsense.

Slobodan nailed the core problem: “You need to know what good is and what good looks like. Because AI will always give you something. If you don’t know enough about that specific thing, it will always look good from the outside. And there’s a reason why everyone is okay with vibing everything except for their own profession, because they try it and they see that the results are just horrific.”

Build For Machines First, Everything Else Follows

The one thing to take away from this conversation is to build for machines first, then humans. Not because human user experience won’t matter, but because getting the machine layer right first makes the human layer better.

Your website is no longer the only version of your business that people, or agents, will encounter. The brands that treat it as part of a wider ecosystem rather than the whole ecosystem are the ones that will come through this transition in the strongest position.

Watch the full video interview with Slobodan Manic here, or on YouTube.

Thank you to Slobodan for sharing his insights and being my guest on IMHO.



This post was originally published on Shelley Edits.



ChatGPT Now Crawls 3.6x More Than Googlebot: What 24M Requests Reveal

This post was sponsored by Alli AI. The opinions expressed in this article are the sponsor’s own. 

Everyone assumes Googlebot is the dominant crawler hitting their website. That assumption is now wrong.

We analyzed 24,411,048 proxy requests across 78,000+ pages on 69 customer websites on Alli AI’s crawler enablement platform over a 55-day period (January to March 2026). OpenAI’s ChatGPT-User crawler made 3.6x more requests than Googlebot across our data sample. And that’s not even counting GPTBot, OpenAI’s separate training crawler.

A note on methodology: Crawler identification used user agent string matching, verified against published IP ranges. Request metrics are measured at the proxy/CDN layer. The dataset covers 69 websites across a variety of industries and sizes, predominantly WordPress-based. Full methodology is detailed at the end.

Finding 1: AI Crawlers Now Outpace Google 3.6x & ChatGPT Leads the Pack


When we ranked every identified crawler by request volume, the results were unambiguous:

Rank Crawler Requests Category
1 ChatGPT-User (OpenAI) 133,361 AI Search
2 Googlebot 37,426 Traditional Search
3 Amazonbot 35,728 AI / E-Commerce
4 Bingbot 18,280 Traditional Search
5 ClaudeBot (Anthropic) 13,918 AI Search
6 MetaBot 10,756 Social
7 GPTBot (OpenAI) 8,864 AI Training
8 Applebot 6,794 AI Search
9 Bytespider (ByteDance) 6,644 AI Training
10 PerplexityBot 5,731 AI Search

ChatGPT-User made more requests than Googlebot, Amazonbot, and Bingbot combined.


Grouped by purpose, AI-related crawlers (ChatGPT-User, GPTBot, ClaudeBot, Amazonbot, Applebot, Bytespider, PerplexityBot, CCBot) made 213,477 requests versus 59,353 for traditional search crawlers (Googlebot, Bingbot, YandexBot). AI crawlers are now making 3.6x more requests than traditional search crawlers across our network.

Finding 2: OpenAI Uses 2 Crawlers (And Most Sites Don’t Know the Difference)


OpenAI operates two distinct crawlers with very different purposes.

ChatGPT-User is the retrieval crawler. It fetches pages in real time when users ask ChatGPT questions that require up-to-date web information. This determines whether your content appears in ChatGPT’s answers.

GPTBot is the training crawler. It collects data to improve OpenAI’s models. Many sites block GPTBot via robots.txt but not ChatGPT-User, or vice versa, without understanding the distinct consequences of each.

Combined, OpenAI’s crawlers made 142,225 requests: 3.8x Googlebot’s volume.

The robots.txt directives are separate:

User-agent: GPTBot      # Training crawler — feeds OpenAI's models
User-agent: ChatGPT-User # Retrieval crawler — fetches pages for ChatGPT answers

Finding 3: AI Crawlers Are Faster & More Reliable, But Their Volume Adds Up


AI crawlers are significantly more efficient per request:

Crawler Avg Response Time 200 Success Rate
PerplexityBot 8ms 100%
ChatGPT-User 11ms 99.99%
GPTBot 12ms 99.9%
ClaudeBot 21ms 99.9%
Bingbot 42ms 98.4%
Googlebot 84ms 96.3%

Two likely reasons. First, AI retrieval crawlers are fetching specific pages in response to user queries, not exhaustively discovering site architecture. They know what they want, they grab it, and they leave. Second, while all crawlers on our infrastructure receive pre-rendered responses, Googlebot’s broader crawl pattern means it requests a wider range of URLs, including stale paths from sitemaps and its own legacy index, which adds latency from redirect chains and error handling that retrieval crawlers avoid entirely.

But there’s a catch: while each individual request is lightweight, the sheer volume means aggregate server load is substantial. ChatGPT-User at 11ms × 133,361 requests is still a real infrastructure cost, just distributed differently than Googlebot’s fewer, heavier requests.

Finding 4: Googlebot Sees a Different (Worse) Version of Your Site


Googlebot’s 96.3% success rate versus near-perfect rates for AI crawlers reveals an important structural difference.

Googlebot received 624 blocked responses (403) and 480 not found errors (404), accounting for 3% of its requests. Meanwhile, ChatGPT-User achieved 99.99% success. PerplexityBot hit a perfect 100%.


Why the gap? The most likely explanation is index age and crawl behavior, not site misconfiguration.

Googlebot maintains a massive legacy index built over years of continuous crawling. It routinely re-requests URLs it already knows about — including pages that have since been deleted (404s) or restructured (403s). This is normal behavior for a search engine maintaining an index of this scale, but it means a meaningful percentage of Googlebot’s requests are directed at URLs that no longer exist.

AI crawlers don’t carry that baggage. ChatGPT-User fetches specific pages in response to real-time user queries, targeting content that’s currently relevant and linked. That’s a structural advantage that produces near-perfect success rates.

Industry Reports Confirm AI Crawling Surged 15x in 2025

These findings align with broader industry trends. Cloudflare’s 2025 analysis reported ChatGPT-User requests surging 2,825% YoY, with AI “user action” crawling increasing more than 15x over the course of 2025. Akamai identified OpenAI as the single largest AI bot operator, accounting for 42.4% of all AI bot requests. Vercel’s analysis of nextjs.org confirmed that none of the major AI crawlers currently render JavaScript.

Our data shows this crossover may already be happening at the site level for properties that actively enable AI crawler access.

Your New SEO Strategy: How To Audit, Clean Up & Optimize For AI Crawlers

1. Audit your robots.txt for AI crawlers today

Most robots.txt files were written for a Googlebot-first world. At minimum, have explicit directives for ChatGPT-User, GPTBot, ClaudeBot, Amazonbot, PerplexityBot, Applebot, Bytespider, CCBot, and Google-Extended.

Our recommendation: Most businesses benefit from allowing both retrieval crawlers (ChatGPT-User, PerplexityBot, ClaudeBot) and training crawlers (GPTBot, CCBot, Bytespider). Training data is what teaches these models about your brand, products, and expertise. Blocking training crawlers today means AI models learn less about you tomorrow, which reduces your chances of being cited in AI-generated answers down the line.

The exception: if you have content you specifically need to protect from model training (proprietary research, gated content), use granular Disallow rules for those paths rather than blanket blocks.

2. Clean up stale URLs in Google Search Console

Our data shows Googlebot hits a 3% error rate, mostly 403s and 404s, while AI crawlers achieve near-perfect success rates. That gap likely reflects Googlebot re-crawling legacy URLs that no longer exist. But those failed requests still consume the crawl budget.

Audit your GSC crawl stats for recurring 404s and 403s. Set up proper redirects for restructured URLs and submit updated sitemaps.

3. Treat AI crawler accessibility as a distinct SEO channel

Ranking in ChatGPT’s answers, Perplexity’s results, and Claude’s responses is emerging as a distinct visibility channel. If your content isn’t accessible to these crawlers, particularly if you’re running JavaScript-heavy frameworks, you’re invisible in AI search.

If you want to see what this looks like in practice, we’ve published a live dashboard showing how AI crawler traffic breaks down across a real site: which platforms are visiting, how often, and their share of total traffic.

4. Plan for volume, not just individual request weight

AI crawlers send light, fast requests, but they send many of them. ChatGPT-User alone accounted for more than 133,000 requests in 55 days. The aggregate server load from AI crawlers now likely exceeds your Googlebot load. Make sure your hosting and CDN can handle it. The low per-request response times in our data reflect the fact that Alli AI serves pre-rendered static HTML from the CDN edge, which is exactly the kind of architecture that absorbs this volume without taxing your origin server.

Methodology

This analysis is based on 24,411,048 HTTP proxy requests processed through Alli AI’s crawler enablement platform between January 14 and March 9, 2026, covering 69 customer websites.

Crawler identification used user agent string matching, verified against published IP ranges. For OpenAI crawlers specifically, every request was cross-referenced against OpenAI’s published CIDR ranges. This confirmed 100% of GPTBot requests and 99.76% of ChatGPT-User requests originated from OpenAI’s infrastructure. The remaining 0.24% (requests from spoofed user agents) were excluded.

Limitations: The dataset is scoped to Alli AI customers who have opted into crawler enablement. Crawlers that don’t self-identify via user agent are not captured. Response time measurements are at the proxy layer, not the origin server.

About Alli AI

Alli AI provides server-side rendering infrastructure for AI and search engine crawlers. This analysis was produced using data from our proxy infrastructure to help the SEO community better understand the evolving crawler landscape.

Want to see this data in action? See the breakdown firsthand by visiting our AI visibility dashboard.



Google Explains Googlebot Byte Limits And Crawling Architecture

Google’s Gary Illyes published a blog post explaining how Googlebot’s crawling systems work. The post covers byte limits, partial fetching behavior, and how Google’s crawling infrastructure is organized.

The post references episode 105 of the Search Off the Record podcast, where Illyes and Martin Splitt discussed the same topics. Illyes adds more details about crawling architecture and byte-level behavior.

What’s New

Googlebot Is One Client Of A Shared Platform

Illyes describes Googlebot as “just a user of something that resembles a centralized crawling platform.”

Google Shopping, AdSense, and other products all send their crawl requests through the same system under different crawler names. Each client sets its own configuration, including user agent string, robots.txt tokens, and byte limits.

When Googlebot appears in server logs, that’s Google Search. Other clients appear under their own crawler names, which Google lists on its crawler documentation site.

How The 2 MB Limit Works In Practice

Googlebot fetches up to 2 MB for any URL, excluding PDFs. PDFs get a 64 MB limit. Crawlers that don’t specify a limit default to 15 MB.

Illyes adds several details about what happens at the byte level.

He says HTTP request headers count toward the 2 MB limit. When a page exceeds 2 MB, Googlebot doesn’t reject it. The crawler stops at the cutoff and sends the truncated content to Google’s indexing systems and the Web Rendering Service (WRS).

Those systems treat the truncated file as if it were complete. Anything past 2 MB is never fetched, rendered, or indexed.

Every external resource referenced in the HTML, such as CSS and JavaScript files, gets fetched with its own separate byte counter. Those files don’t count toward the parent page’s 2 MB. Media files, fonts, and what Google calls “a few exotic files” are not fetched by WRS.

Rendering After The Fetch

The WRS processes JavaScript and executes client-side code to understand a page’s content and structure. It pulls in JavaScript, CSS, and XHR requests but doesn’t request images or videos.

Illyes also notes that the WRS operates statelessly, clearing local storage and session data between requests. Google’s JavaScript troubleshooting documentation covers implications for JavaScript-dependent sites.

Best Practices For Staying Under The Limit

Google recommends moving heavy CSS and JavaScript to external files, since those get their own byte limits. Meta tags, title tags, link elements, canonicals, and structured data should appear higher in the HTML. On large pages, content placed lower in the document risks falling below the cutoff.

Illyes flags inline base64 images, large blocks of inline CSS or JavaScript, and oversized menus as examples of what could push pages past 2 MB.
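
A quick way to see where a page sits relative to the limit is to measure the size of the raw HTML response (hypothetical URL; note this counts the response body only, while HTTP headers also count toward the 2 MB):

curl -so /dev/null -w "%{size_download}\n" https://www.example.com/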

The 2 MB limit “is not set in stone and may change over time as the web evolves and HTML pages grow in size.”

Why This Matters

The 2 MB limit and the 64 MB PDF limit were first documented as Googlebot-specific figures in February. HTTP Archive data showed most pages fall well below the threshold. This blog post adds the technical context behind those numbers.

The platform description explains why different Google crawlers behave differently in server logs and why the 15 MB default differs from Googlebot’s 2 MB limit. These are separate settings for different clients.

HTTP header details matter for pages near the limit. Google states headers consume part of the 2 MB limit alongside HTML data. Most sites won’t be affected, but pages with large headers and bloated markup might hit the limit sooner.

Looking Ahead

Google has now covered Googlebot’s crawl limits in documentation updates, a podcast episode, and a dedicated blog post within a two-month span. Illyes’ note that the limit may change over time suggests these figures aren’t permanent.

For sites with standard HTML pages, the 2 MB limit isn’t a concern. Pages with heavy inline content, embedded data, or oversized navigation should verify that their critical content is within the first 2 MB of the response.



Google: Pages Are Getting Larger & It Still Matters

Google’s Gary Illyes and Martin Splitt used a recent episode of the Search Off the Record podcast to discuss whether webpages are getting too large and what that means for both users and crawlers.

The conversation started with a simple question: are websites getting fat? Splitt immediately pushed back on the framing, arguing that website-level size is meaningless. Individual page size is where the discussion belongs.

What The Data Shows

Splitt cited the 2025 Web Almanac from HTTP Archive, which found that the median mobile homepage weighed 845 KB in 2015. By July 2025, that same median page had grown to 2,362 KB. That’s roughly a 3x increase over a decade.

Both agreed the growth was expected, given the complexity of modern web applications. But the numbers still surprised them.

Splitt noted the challenge of even defining “page weight” consistently, since different people interpret the term differently depending on whether they’re thinking about raw HTML, transferred bytes, or everything a browser needs to render a page.

How Google’s Crawl Limits Fit In

Illyes discussed a 15 MB default that applies across Google’s broader crawl infrastructure, where each URL gets its own limit, and referenced resources like CSS, JavaScript, and images are fetched separately.

That’s a different number from what appears in Google’s current Googlebot documentation. Google states that Googlebot for Google Search crawls the first 2 MB of a supported file type and the first 64 MB of a PDF.

Our previous coverage broke down the documentation update that clarified these figures earlier this year. Illyes and Splitt discussed the flexibility of these limits in a previous episode, noting that internal teams can override the defaults depending on what’s being crawled.

The Structured Data Question

One of the more interesting moments came when Illyes raised the topic of structured data and page bloat. He traced it back to a statement from Google co-founder Sergey Brin, who said early in Google’s history that machines should be able to figure out everything they need from text alone.

Illyes noted that structured data exists for machines, not users, and that adding the full range of Google’s supported structured data types to a page can add weight that visitors never see. He framed it as a tension rather than offering a clear answer on whether it’s a problem.

Does It Still Matter?

Splitt said yes. He acknowledged that his home internet connection is fast enough that page weight is irrelevant in his daily experience. But he said the picture changes when traveling to areas with slower connections, and noted that metered satellite internet made him rethink how much data websites transfer.

He suggested that page size growth may have outpaced improvements in median mobile connection speeds, though he said he’d need to verify that against actual data.

Illyes referenced prior studies suggesting that faster websites tend to have better retention and conversion rates, though the episode didn’t cite specific research.

Looking Ahead

Splitt said he plans to address specific techniques for reducing page size in a future episode.

Most pages are still unlikely to hit those limits, with the Web Almanac reporting a median mobile homepage size of 2,362 KB. But the broader trend of growing page weight affects both performance and accessibility for users on slower or metered connections.

Google Adds AI & Bot Labels To Forum, Q&A Structured Data

Google updated its Discussion Forum and Q&A Page structured data documentation, adding several new supported properties to both markup types.

The most notable addition is digitalSourceType, a property that lets forum and Q&A sites indicate when content was created by a trained AI model or another automated system.

Content Source Labeling Comes To Forum Markup

The new digitalSourceType property uses IPTC digital source enumeration values to indicate how content was created. Google supports two values:

  • TrainedAlgorithmicMediaDigitalSource for content created by a trained model, such as an LLM.
  • AlgorithmicMediaDigitalSource for content created by a simpler algorithmic process, such as an automatic reply bot.

The property is listed as recommended, not required, for both the DiscussionForumPosting and Comment types in the Discussion Forum docs, and for Question, Answer, and Comment types in the Q&A Page docs.
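
As a rough sketch of how a bot-generated reply might be labeled, using the values named above (placeholder content; check Google’s documentation for the exact property placement and value format):

{
  "@context": "https://schema.org",
  "@type": "DiscussionForumPosting",
  "headline": "How do I reset my widget?",
  "author": { "@type": "Person", "name": "Forum member" },
  "comment": {
    "@type": "Comment",
    "text": "Automated reply generated by the forum's support bot.",
    "author": { "@type": "Organization", "name": "Example support bot" },
    "digitalSourceType": "AlgorithmicMediaDigitalSource"
  }
}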

Google already uses similar IPTC source type values in its image metadata documentation to identify how images were created. The update extends that concept to text-based forum and Q&A content.

New Comment Count Property

Google added commentCount as a recommended property across both documentation pages. It lets sites declare the total number of comments on a post or answer, even when not all comments appear in the markup.

The Q&A Page documentation includes a new formula: answerCount + commentCount should equal the total number of replies of any type. This gives Google a clearer picture of thread activity on pages where comments are paginated or truncated.

Expanded Shared Content Support

The Discussion Forum documentation expanded its sharedContent property. Previously, sharedContent accepted a generic CreativeWork type. The updated docs now explicitly list four supported subtypes:

  • WebPage for shared links.
  • ImageObject for posts where an image is the primary content.
  • VideoObject for posts where a video is the primary content.
  • DiscussionForumPosting or Comment for quoted or reposted content from other threads.

The addition of DiscussionForumPosting and Comment as accepted types is new. Google’s updated documentation includes a code example showing how to mark up a referenced comment with its URL, author, date, and text.

The image property description was also updated across both docs with a note about link preview images. Google now recommends placing link preview images inside the sharedContent field’s attached WebPage rather than in the post’s image field.

Why This Matters

For sites that publish a mix of human and machine-generated content, the digitalSourceType addition provides a structured way to communicate that to Google. The new properties are optional, and no existing implementations will break.

Google has not said how it will use the digitalSourceType data in its ranking or display systems. The documentation only describes it as a way to indicate content origin.

Looking Ahead

The update does not include changes to required properties, so existing forum and Q&A structured data implementations remain valid. Sites that want to adopt the new properties can add them incrementally.

What is an XML sitemap and why should you have one?

A good XML sitemap serves as a roadmap for your website, guiding Google to all your important pages. XML sitemaps can be beneficial for SEO, helping Google find your essential pages quickly, even if your internal linking isn’t perfect. This post explains what they are and how they help you rank better and get surfaced by AI agents.


Key takeaways

  • An XML sitemap is crucial for SEO, as it guides search engines to your important pages, improving crawl efficiency
  • XML sitemaps list essential URLs and provide metadata, helping search engines understand content and prioritize crawling
  • With Yoast SEO, you can automatically generate and manage XML sitemaps, keeping them up to date
  • XML sitemaps support faster indexing of new content and help discover orphan pages that aren’t linked elsewhere
  • Add your XML sitemap to Google Search Console to help Google find it quickly and monitor indexing status

What are XML sitemaps?

An XML sitemap is a file that lists a website’s essential pages, ensuring Google can find and crawl them. It also helps search engines understand your website structure and prioritize important content.

💡 Fun fact:

XML is not the only type of sitemap; there are several sitemap formats, each serving a slightly different purpose:

  • RSS, mRSS, and Atom 1.0 feeds: These are typically used for content that changes frequently, such as blogs or news sites. They automatically highlight recently updated content
  • Text sitemaps: The simplest format. These contain a plain list of URLs, one per line, without additional metadata

There are also HTML sitemaps, which are created for visitors rather than search engines. They list and link to important pages in a clear, hierarchical structure to improve user navigation. An XML sitemap, however, is specifically designed for search engines.

XML sitemaps include additional metadata about each URL, helping search engines better understand your content. For example, it can indicate:

  • When a page was last meaningfully updated
  • How important a URL is relative to other URLs
  • Whether the page includes images or videos, using sitemap extensions

Search engines use this information to crawl your site more intelligently and efficiently, especially if your website is large, new, or has complex navigation.

Looking to expand your knowledge of technical SEO? We have a course in the Yoast SEO Academy focusing on crawlability and indexability. One of the topics we tackle is how to use XML sitemaps properly.

What does an XML sitemap look like?

An XML sitemap follows a standardized format. It is a text file written in Extensible Markup Language (XML) that search engines can easily read and process. As it follows a structured format, search engines like Google can quickly understand which URLs exist on your website and when they were last updated.

Here is a very simple example of an XML sitemap that contains a single URL:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yoast.com/wordpress-seo/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>

Each URL in a sitemap is wrapped in specific XML tags that provide information about that page. Some of these tags are required, while others are optional but helpful for search engines.

Below is a breakdown of the most common XML sitemap tags:

Tag | Requirement | Description
<?xml> | Mandatory | Declares the XML version and character encoding used in the file.
<urlset> | Mandatory | The container for the entire sitemap. It defines the sitemap protocol and holds all listed URLs.
<url> | Mandatory | Represents a single URL entry in the sitemap. Each page must be enclosed within its own <url> tag.
<loc> | Mandatory | Specifies the full canonical URL of the page you want search engines to crawl and index.
<lastmod> | Optional | Indicates the date when the page was last meaningfully updated, helping search engines know when to re-crawl the page.
<changefreq> | Optional | Suggests how frequently the content on the page is expected to change, such as daily, weekly, or monthly.
<priority> | Optional | Suggests the relative importance of a page compared to other pages on the same site, using a scale from 0.0 to 1.0.

Note: While sitemaps.org supports optional tags like <changefreq> and <priority>, Google and Bing generally ignore them; Google has officially said it ignores them. Instead, it prefers an accurate <lastmod> (last modified) value as the signal that content has actually been updated.

What is an XML sitemap index?

A sitemap index is a file that lists multiple XML sitemap files. Instead of containing individual page URLs, it acts as a directory that points search engines to several separate sitemaps.

This becomes useful when a website has a large number of URLs or when the site owner wants to organize sitemaps by content type. For example, a site may have separate sitemaps for pages, blog posts, products, or categories.

Here’s a breakdown of how XML sitemap and XML sitemap index differ:

Feature | XML Sitemap | XML Sitemap Index
Purpose | Lists individual URLs on a website | Lists multiple sitemap files
Content | Contains page URLs and optional metadata | Contains links to sitemap files
Use case | Suitable for small or medium-sized sites | Useful when a site has multiple sitemaps
Structure | Uses <urlset> and <url> tags | Uses <sitemapindex> and <sitemap> tags

Search engines enforce sitemap limits. A single sitemap can contain up to 50,000 URLs or be up to 50 MB in size. If your website exceeds these limits, you can create multiple sitemaps and group them together using a sitemap index.

Submitting a sitemap index to search engines allows them to discover and process all your sitemaps from a single file.

In short, an XML sitemap helps search engines discover pages, while a sitemap index helps search engines discover multiple sitemaps.

Below is a simple example of what a sitemap index file looks like:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml</loc>
    <lastmod>2025-12-11</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
    <lastmod>2025-12-11</lastmod>
  </sitemap>
</sitemapindex>

In this example, the sitemap index references two separate sitemaps. Each one can contain thousands of URLs. This structure helps search engines efficiently discover and crawl large websites.

Why do you need an XML sitemap?

Technically, you don’t need an XML sitemap. Search engines can often discover your pages through internal links and backlinks from other websites. However, having an XML sitemap is highly recommended because it helps search engines crawl and understand your site more efficiently.

Here are some key benefits of using an XML sitemap:

Improved crawl efficiency

Sitemaps help search engines like Google and Bing crawl large or complex websites more efficiently. By listing your important URLs in one place, you make it easier for crawlers to find and prioritize valuable pages.

Faster indexing of new content

When you update or add new pages to your site, including them in your sitemap helps search engines discover them sooner. This can lead to faster indexing, especially for websites that publish content frequently, such as blogs, news sites, or e-commerce stores with changing product listings.

Discovery of orphan pages

Orphan pages are pages that are not linked from other parts of your website. Because crawlers typically follow links to discover content, these pages can sometimes be missed. An XML sitemap can help ensure these pages are still discovered.

Additional metadata signals

XML sitemaps can include additional metadata about each URL, such as the <lastmod> tag. This information helps search engines understand when a page was last updated and whether it may need to be crawled again.

Support for specialized content

Sitemaps can also be extended to include specific types of content, such as images or videos. These specialized sitemaps help search engines better understand and surface media content in results like Google Images or video search.

Better understanding of site structure

A well-organized sitemap gives search engines a clearer overview of your website’s structure and the relationship between different sections or content types.

Indexing insights through Search Console

When you submit your sitemap to tools like Google Search Console, you can monitor how many URLs are discovered and indexed. This also helps you identify crawl issues or indexing errors.

Support for multilingual websites

For websites targeting multiple languages or regions, XML sitemaps can include alternate language versions of pages using hreflang annotations. This helps search engines serve the correct language version to users in different locations.
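
For example, a sketch of a sitemap with hreflang annotations might look like the following. The URLs are placeholders, and each language version should list all of its alternates, including itself:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/en/page/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/page/" />
    <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/de/seite/" />
  </url>
  <url>
    <loc>https://www.example.com/de/seite/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/page/" />
    <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/de/seite/" />
  </url>
</urlset>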

Do XML sitemaps help with AI search?

Yes, but indirectly. AI-powered search experiences like AI Overviews or Bing Copilot still rely on the traditional search index to discover and retrieve content. That means your pages usually need to be crawled and indexed before they can appear in AI-generated answers.

This is where XML sitemaps still help. By listing your important URLs in one place, a sitemap makes it easier for search engines to discover and index your content. Keeping the <lastmod> value accurate can also help search engines prioritize recently updated pages, which is especially useful for AI systems that aim to surface fresh information.

In short, a sitemap won’t make your content appear in AI answers by itself. But it helps ensure your pages are discoverable, indexed, and up to date, which increases their chances of being used in AI-powered search results.

Adding XML sitemaps to your site with Yoast

Because XML sitemaps play an important role in helping search engines discover and crawl your content, Yoast SEO automatically generates XML sitemaps for your website. This feature is available in both the free and premium versions (Yoast SEO Premium, Yoast WooCommerce SEO, and Yoast SEO AI+) of the plugin.


Instead of requiring you to manually create or maintain sitemap files, Yoast SEO handles everything automatically. As you publish, update, or remove content, the plugin updates your sitemap index and the individual sitemaps in real time. This ensures search engines always have an up-to-date overview of the pages you want them to crawl and index.

Yoast SEO also organizes your sitemaps intelligently. Rather than placing every URL in a single file, the plugin creates a sitemap index that groups separate sitemaps for different content types, such as posts, pages, and other public content types, with just one click.

Read more: XML sitemaps in the Yoast SEO plugin

Another important advantage is that Yoast SEO only includes content that should actually appear in search results. Pages set to noindex are automatically excluded from the XML sitemap. This helps keep your sitemap clean and focused on the URLs that matter for SEO.

Controlling what appears in your sitemap

While the plugin automatically manages sitemaps, you still have full control over which content is included.

For example, if you don’t want a specific post or page to appear in search results, you can change the setting “Allow search engines to show this content in search results?” in the Yoast SEO sidebar under the Advanced tab. When this option is set to No, the content will be marked as noindex and automatically excluded from the XML sitemap. When set to Yes, the content remains eligible to appear in search results and is included in the sitemap.

This makes it easy to keep your sitemap focused on the pages you actually want search engines to crawl and index. In some cases, developers can further customize sitemap behavior. For example, filters can be used to limit the number of URLs per sitemap or to programmatically exclude certain content types.
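
For illustration, the snippet below (for a theme's functions.php file or a small plugin) uses two filters that Yoast has documented for this purpose. Treat the filter names, their exact behavior, and the placeholder values as things to verify against the current Yoast developer documentation.

<?php
// Cap the number of URLs listed in each individual sitemap file.
add_filter( 'wpseo_sitemap_entries_per_page', function () {
	return 500; // Placeholder limit.
} );

// Exclude specific posts or pages from the XML sitemap by their IDs.
add_filter( 'wpseo_exclude_from_sitemap_by_post_ids', function () {
	return array( 123, 456 ); // Placeholder post IDs.
} );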

Because all of this happens automatically, most website owners never need to manage sitemap files manually. Yoast SEO keeps your XML sitemap clean, up to date, and optimized for search engines as your site grows.

Read more: How to exclude content from the sitemap

Make Google find your sitemap

If you want Google to find your XML sitemap quicker, you'll need to add it to your Google Search Console account. If it has already been submitted, you'll see it listed in the 'Sitemaps' section; if not, you can add your sitemap at the top of that page.

Adding your sitemap also lets you check whether Google has indexed all the pages in it. We recommend investigating further if there is a significant difference between the 'submitted' and 'indexed' counts for a particular sitemap: perhaps an error is preventing some pages from being indexed, or the unindexed content simply needs more links pointing to it.

Screenshot: Google Search Console showing that all URLs in a post sitemap were processed correctly

What websites need an XML sitemap?

Google’s documentation says sitemaps are beneficial for “really large websites,” “websites with large archives,” “new websites with just a few external links to them,” and “websites which use rich media content.” According to Google, proper internal linking should allow it to find all your content easily. Unfortunately, many sites do not properly link their content logically.

While we agree that these websites will benefit the most from having one, at Yoast, we think XML sitemaps benefit every website. As the web grows, it's getting harder and harder to index sites properly, so you should give search engines every available option to find your content. In addition, XML sitemaps make search engine crawling more efficient.

Every website needs Google to find essential pages easily and know when they were last updated. That’s why this feature is included in the Yoast SEO plugin.

Which pages should be in your XML sitemap?

How do you decide which pages to include in your XML sitemap? Always start by thinking of the relevance of a URL: when a visitor lands on that URL, is it a good result? Do you want visitors to land on it? If not, it probably shouldn't be in your sitemap. Keep in mind, though, that if you don't want a URL to appear in the search results at all, you must also add a 'noindex' tag. Leaving it out of your sitemap doesn't mean Google won't index it: if Google can find the URL by following links, it can index it.

Example: A new blog

For example, you are starting a new blog. Of course, you want to ensure your target audience can find your blog posts in the search results. So, it’s a good idea to immediately include your posts in your XML sitemap. It’s safe to assume that most of your pages will also be relevant results for your visitors. However, a thank you page that people will see after they’ve subscribed to your newsletter is not something you want to appear in the search results. In this case, you don’t want to exclude all pages from your sitemap, only this one.

Let’s stay with the example of the new blog. In addition to your blog posts, you create some categories and tags. These categories and tags will have archive pages that list all posts in that specific category or tag. However, initially, there might not be enough content to fill these archive pages, making them ‘thin content’.

For example, tag archives that show just one post are not that valuable to visitors yet. You can exclude them from the sitemap when starting your blog and include them once you have enough posts. You can even exclude all your tag pages or category pages simultaneously using Yoast SEO.

However, this kind of page could also be excellent ranking material. So, if you think: well, yes, this tag page is a bit ‘thin’ right now, but it could be a great landing page, then enrich it with additional information and images. And don’t exclude it from your sitemap in this case.

Frequently asked questions about XML sitemaps

There are a lot of questions regarding XML sitemaps, so we’ve answered a couple in the FAQ below:

What happens when Google Search Console says an XML sitemap has errors?

When Search Console reports that a sitemap is invalid or couldn't be read, it usually points to a specific error that needs investigation. First, make sure the sitemap was submitted through the search engine's webmaster tools. Then review the listed errors and apply the appropriate fix for each one.

How can I check whether a website has an XML sitemap?

In most cases, you can find out whether a site has an XML sitemap by adding sitemap.xml to the root domain, for example: example.com/sitemap.xml. If a site has Yoast SEO installed, you'll notice that this URL redirects to example.com/sitemap_index.xml, the sitemap index that collects all the sitemaps on the site in a single file.

How can I update an XML sitemap?

There are ways to create and update your sitemaps by hand, but you shouldn't. There are also static generators that let you produce a sitemap whenever you want, but that process has to be repeated every time you add or update content. The easiest way is to use Yoast SEO: turn on the XML sitemap, and all your updates will be applied automatically.

Can I use <priority> in my XML sitemap?

In the past, people believed that adding the <priority> attribute to sitemaps would signal to Google that specific URLs should be prioritized. Unfortunately, it doesn't do anything, as Google has often said it doesn't use this attribute when reading or prioritizing content in sitemaps.

Check your own XML sitemap!

Now you know how important it is to have an XML sitemap: it can help your site’s SEO. If you add the correct URLs, Google can easily access your most important pages and posts. Google will also find updated content easily, so it knows when a URL needs to be crawled again. Lastly, adding your XML sitemap to Google Search Console helps Google find it quickly and lets you check for sitemap errors.

So check your XML sitemap and find out if you’re doing it right!

Web Almanac Data Reveals CMS Plugins Are Setting Technical SEO Standards (Not SEOs) via @sejournal, @chrisgreenseo

If more than half the web runs on a content management system, then the majority of technical SEO standards are being positively shaped before an SEO even starts work on a site. That's the lens I took into the 2025 Web Almanac SEO chapter (for clarity, I co-authored the 2025 Web Almanac SEO chapter referenced in this article).

Rather than asking how individual optimization decisions influence performance, I wanted to understand something more fundamental: How much of the web's technical SEO baseline is determined by CMS defaults and the ecosystems around them?

SEO often feels intensely hands-on – perhaps too much so. We debate canonical logic, structured data implementation, crawl control, and metadata configuration as if each site were a bespoke engineering project. But when 50%+ of pages in the HTTP Archive dataset sit on CMS platforms, those platforms become the invisible standard-setters. Their defaults, constraints, and feature rollouts quietly define what “normal” looks like at scale.

This piece explores that influence using 2025 Web Almanac and HTTP Archive data, specifically:

  • How CMS adoption trends track with core technical SEO signals.
  • Where plugin ecosystems appear to shape implementation patterns.
  • And how emerging standards like llms.txt are spreading as a result.

The question is not whether SEOs matter. It’s whether we’ve been underestimating who sets the baseline for the modern web.

The Backbone Of Web Design

The 2025 CMS chapter of the Web Almanac marked a milestone in CMS adoption: over 50% of pages now run on a CMS. If you were unsold on how much of the web is carried by CMSs, more than half of a roughly 16-million-website dataset is a significant amount.

Screenshot from Web Almanac, February 2026

As for which CMSs are the most popular, the answer may not be surprising, but it is worth reflecting on which platforms have the most impact.

Image by author, February 2026

WordPress is still the most used CMS, by a long way, even if it has dropped marginally in the 2024 data. Shopify, Wix, Squarespace, and Joomla trail a long way behind, but they still have a significant impact, especially Shopify, on ecommerce specifically.

SEO Functions That Ship As Defaults In CMS Platforms

CMS platform defaults are important because, I believe, a lot of basic technical SEO standards exist either as default setups or only on the relatively small number of websites that have dedicated SEOs, or people who at least build to and work with SEO best practice.

When we talk about “best practice,” we’re on slightly shaky ground for some, as there isn’t a universal, prescriptive view on this one, but I would consider:

  • Descriptive “SEO-friendly” URLs.
  • Editable title and meta description.
  • XML sitemaps.
  • Canonical tags.
  • Editable meta robots directives.
  • Structured data – at least a basic level.
  • Robots.txt editing.

Of the main CMS platforms, here is what they – self-reportedly – have as “default.” Note: For some platforms – like Shopify – they would say they’re SEO-friendly (and to be honest, it’s “good enough”), but many SEOs would argue that they’re not friendly enough to pass this test. I’m not weighing into those nuances, but I’d say both Shopify and those SEOs make some good points.

CMS | SEO-friendly URLs | Title & meta description UI | XML sitemap | Canonical tags | Robots meta support | Basic structured data | Robots.txt
WordPress | Yes | Partial (theme-dependent) | Yes | Yes | Yes | Limited (Article, BlogPosting) | No (plugin or server access required)
Shopify | Yes | Yes | Yes | Yes | Limited | Product-focused | Limited (editable via robots.txt.liquid, constrained)
Wix | Yes | Guided | Yes | Yes | Limited | Basic | Yes (editable in UI)
Squarespace | Yes | Yes | Yes | Yes | Limited | Basic | No (platform-managed, no direct file control)
Webflow | Yes | Yes | Yes | Yes | Yes | Manual JSON-LD | Yes (editable in settings)
Drupal | Yes | Partial (core) | Yes | Yes | Yes | Minimal (extensible) | Partial (module or server access)
Joomla | Yes | Partial | Yes | Yes | Yes | Minimal | Partial (server-level file edit)
Ghost | Yes | Yes | Yes | Yes | Yes | Article | No (server/config level only)
TYPO3 | Yes | Partial | Yes | Yes | Yes | Minimal | Partial (config or extension-based)

Based on the above, I would say that most SEO basics can be covered by most CMSs "out of the box." Whether they work well for you, and whether you can achieve the exact configuration your specific circumstances require, are two other important questions – ones I am not taking on here. However, it often comes down to these points:

  1. It is possible for these platforms to be used badly.
  2. It is possible that the business logic you need will break/not work with the above.
  3. There are many more advanced SEO features that aren't available out of the box but are just as important.

We are talking about foundations here, but when I reflect on what shipped as “default” 15+ years ago, progress has been made.

Fingerprints Of Defaults In The HTTP Archive Data

Given that a lot of CMSs ship with these standards, do these SEO defaults correlate with CMS adoption? In many ways, yes. Let’s explore this in the HTTP Archive data.

Canonical Tag Adoption Correlates With CMS

Combining canonical tag adoption data with (all) CMS adoption over the last four years, we can see that for both mobile and desktop, the trends seem to follow each other pretty closely.

Image by author, February 2026
Image by author, February 2026

Running a simple Pearson correlation over these metrics makes the relationship even clearer, both for canonical tag implementation and for the presence of self-canonical URLs.

Image by author, February 2026

What differs is the correlation for canonicalized URLs: it is negative on mobile and lower (but still positive) on desktop. A drop in canonicalized pages largely drives this negative correlation, and the reasons behind it could be many (and are harder to be sure of).

Canonical tags are a crucial element for technical SEO; their continued adoption does certainly seem to track the growth in CMS use, too.

Schema.org Data Types Correlate With CMS

Schema.org types plotted against CMS adoption show similar trends, but are less definitive overall. There are many different Schema.org types, but if we plot CMS adoption against the ones most relevant to SEO concerns, we can observe a broadly rising picture.

Image by author, February 2026

With the exception of Schema.org WebSite, we can see CMS growth and structured data following similar trends.

But we must note that Schema.org adoption is considerably lower than CMS adoption overall. This could be due to most CMS defaults being far less comprehensive with Schema.org. When we look at specific CMS examples (shortly), we'll see far stronger links.

Schema.org implementation is still mostly intentional, specialist, and not as widespread as it could be. If I were a search engine or creating an AI Search tool, would I rely on universal adoption of these, seeing the data like this? Possibly not.

Robots.txt

Given that robots.txt is a single file that has some agreed standards behind it, its implementation is far simpler, so we could anticipate higher levels of adoption than Schema.org.

The presence of a robots.txt file is important, mostly to limit search engine crawling to specific areas of the site. We are starting to see an evolution – as we noted in the 2025 Web Almanac SEO chapter – where robots.txt is used more as a governance piece than as simple housekeeping: a key sign that we're using familiar tools differently in the AI search world.

But before we consider the more advanced implementations, how much of a part does a CMS play in ensuring a robots.txt is present? It looks like, over the last four years, CMS platforms have driven a significantly higher share of robots.txt files serving a 200 response:

Image by author, February 2026

What is more curious, however, is the file size of those robots.txt files: non-CMS platforms have robots.txt files that are significantly larger.

Image by author, February 2026

Why could this be? Are robots.txt files on non-CMS platforms more advanced, longer, with more bespoke rules? Probably in some cases, but we're missing another impact of a CMS's standards: compliant (valid) robots.txt files.

A lot of robots.txt files serve a valid 200 response, but often they’re not txt files, or they’re redirecting to 404 pages or similar. When we limit this list to only files that contain user-agent declarations (as a proxy), we see a different story.

Image by author, February 2026

Approaching 14% of robots.txt files served on non-CMS platforms are likely not even robots.txt files.

A robots.txt is easy to set up, but it is a conscious decision. If it’s forgotten/overlooked, it simply won’t exist. A CMS makes it more likely to have a robots.txt, and what’s more, when it is in place, it makes it easier to manage/maintain – which IS key.

WordPress Specific Defaults

CMS platforms, it seems, cover the basics, but more advanced options – which arguably should also be defaults – often need additional SEO tools to enable.

Interrogating WordPress sites in the HTTP Archive data is easiest, as we get the largest sample, and the Wappalyzer data gives a reliable way to judge the impact of WordPress-specific SEO tools.

From the Web Almanac, we can see which SEO tools are the most installed on WordPress sites.

Screenshot from Web Almanac, February 2026

For anyone working within SEO, this is unlikely to be surprising. If you are an SEO and worked on WordPress, there is a high chance you have used either of the top three. What IS worth considering right now is that while Yoast SEO is by far the most prevalent within the data, it is seen on barely over 15% of sites. Even the most popular SEO plugin on the most popular CMS is still a relatively small share.

Of these top three plugins, let's first consider how their "defaults" differ. These are similar to some of WordPress's own, but we can see many more advanced features that come as standard.

SEO Capability | All-in-One SEO | Yoast SEO | Rank Math
Title tag control | Yes (global + per-post) | Yes | Yes
Meta description control | Yes | Yes | Yes
Meta robots UI | Yes (index/noindex etc.) | Yes | Yes
Default meta robots output | Explicit index,follow | Explicit index,follow | Explicit index,follow
Canonical tags | Auto self-canonical | Auto self-canonical | Auto self-canonical
Canonical override (per URL) | Yes | Yes | Yes
Pagination canonical handling | Limited | Historically opinionated | More configurable
XML sitemap generation | Yes | Yes | Yes
Sitemap URL filtering | Basic | Basic | More granular
Inclusion of noindex URLs in sitemap | Possible by default | Historically possible | Configurable
Robots.txt editor | Yes (plugin-managed) | Yes | Yes
Robots.txt comments/signatures | Yes | Yes | Yes
Redirect management | Yes | Limited (free) | Yes
Breadcrumb markup | Yes | Yes | Yes
Structured data (JSON-LD) | Yes (templated) | Yes (templated) | Yes (templated, broad)
Schema type selection UI | Yes | Limited | Extensive
Schema output style | Plugin-specific | Plugin-specific | Plugin-specific
Content analysis/scoring | Basic | Heavy (readability + SEO) | Heavy (SEO score)
Keyword optimization guidance | Yes | Yes | Yes
Multiple focus keywords | Paid | Paid | Free
Social metadata (OG/Twitter) | Yes | Yes | Yes
Llms.txt generation | Yes – enabled by default | Yes – one-check enable | Yes – one-check enable
AI crawler controls | Via robots.txt | Via robots.txt | Via robots.txt

Editable metadata, structured data, robots.txt, sitemaps, and, more recently, llms.txt are the most notable. It is worth noting that a lot of the functionality is more “back-end,” so not something we’d be as easily able to see in the HTTP Archive data.

Structured Data Impact From SEO Plugins

We can see (above) that structured data implementation and CMS adoption do correlate; what is more interesting here is to understand what the key drivers themselves are.

Viewing the HTTP Archive data with a simple segmentation (SEO plugins vs. no SEO plugins) of the most recent dataset paints a stark picture.

Image by author, February 2026

When we limit the Schema.org @types to those most associated with SEO, it is clear that some structured data types are pushed hard by SEO plugins. They are not completely absent elsewhere – people may be using lesser-known plugins or coding their own solutions – but the ease of implementation shows in the data.

Robots Meta Support

Another finding from the SEO Web Almanac 2025 chapter was that “follow” and “index” directives were the most prevalent, even though they’re technically redundant, as having no meta robots directives is implicitly the same thing.

Screenshot from Web Almanac 2025, February 2026

Within the chapter number crunching itself, I didn’t dig in much deeper, but knowing that all major SEO WordPress plugins have “index,follow” as default, I was eager to see if I could make a stronger connection in the data.

Where SEO plugins were present on WordPress, "index, follow" was set on over 75% of root pages, vs. under 5% of WordPress sites without SEO plugins.

Image by author, February 2026

Given the ubiquity of WordPress and SEO plugins, this is likely a huge contributor to this particular configuration. While redundant, it isn't wrong, but it is – again – a key example of how, when one or more of the main plugins establishes a de facto standard like this, it shapes a significant portion of the web.

Diving Into LLMs.txt

Another key area of change in the 2025 Web Almanac was the inclusion of the llms.txt file in the analysis – not an explicit endorsement of the file, but a tacit acknowledgment that it is an important data point in the AI search age.

From the 2025 data, just over 2% of sites had a valid llms.txt file and:

  • 39.6% of llms.txt files are related to All-in-One SEO.
  • 3.6% of llms.txt files are related to Yoast SEO.

This is not necessarily an intentional act by all those involved, especially as Rank Math enables this by default (not an opt-in like Yoast and All-in-One SEO).

Image by author, February 2026

The first data was gathered on July 25, 2025; taking a month-by-month view since then, we can see further growth. It is hard not to read this as growing confidence in the format, or at least as a sign that, because it is so easy to enable, more people are hedging their bets.
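
For reference, a minimal llms.txt file following the community llmstxt.org proposal might look something like the sketch below. The site, sections, and URLs are placeholders, and the format itself is still an informal convention rather than an agreed standard.

# Example Store

> Example Store sells refurbished networking hardware. The links below point to the pages most useful for large language models.

## Docs

- [Shipping and returns](https://www.example.com/help/shipping/): Delivery times and return policy
- [Setup guides](https://www.example.com/guides/): Step-by-step guides for routers and switches

## Optional

- [Blog](https://www.example.com/blog/): Product announcements and tutorials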

Conclusion

The Web Almanac data suggests that SEO, at a macro level, moves less because of individual SEOs and more because WordPress, Shopify, Wix, or a major plugin ships a default.

  • Canonical tags correlate with CMS growth.
  • Robots.txt validity improves with CMS governance.
  • Redundant “index,follow” directives proliferate because plugins make them explicit.
  • Even llms.txt is already spreading through plugin toggles before it even gets full consensus.

This doesn’t diminish the impact of SEO; it reframes it. Individual practitioners still create competitive advantage, especially in advanced configuration, architecture, content quality, and business logic. But the baseline state of the web, the technical floor on which everything else is built, is increasingly set by product teams shipping defaults to millions of sites.

Perhaps we should consider that if CMSs are the infrastructure layer of modern SEO, then plugin creators are de facto standards setters. They deploy "best practice" before it becomes doctrine.

This is how it should work, but I am also not entirely comfortable with it. Plugins normalize implementation and even create new conventions simply by making them zero-cost, and even redundant standards can endure simply because nothing stops them.

So the question is less about whether CMS platforms impact SEO. They clearly do. The more interesting question is whether we, as SEOs, are paying enough attention to where those defaults originate, how they evolve, and how much of the web’s “best practice” is really just the path of least resistance shipped at scale.

An SEO's value should not be measured by the number of hours they spend discussing canonical tags, meta robots, and sitemap inclusion rules; that should be standard and default. If you want to have an outsized impact on SEO, lobby an existing tool, create your own plugin, or drive the interest needed to influence change in one.
