Authentic Human Conversation™

Last Friday afternoon, Digg died. Again.

Two months. That’s how long the relaunch lasted before CEO Justin Mezzell pinned a eulogy to the homepage. The platform had raised $15-20 million. It had Kevin Rose. It had Alexis Ohanian – Reddit’s co-founder, no less. It had promises that AI would “remove the janitorial work of moderators and community managers.” What it didn’t have was a way to stop bots from eating it alive within hours of going live.

Screenshot from X, March 2026

“The internet is now populated, in meaningful part, by sophisticated AI agents and automated accounts,” Mezzell wrote. “We banned tens of thousands of accounts. We deployed internal tooling and industry-standard external vendors. None of it was enough.”

His verdict: “This isn’t just a Digg problem. It’s an internet problem. But it hit us harder because trust is the product.”

Remember that line. We’ll need it.

Suing For Reading

SerpApi retrieves Google Search results programmatically. Reddit is suing them. Not for accessing Reddit. SerpApi has never touched Reddit.com. Reddit is suing SerpApi for reading Google.

If this legal theory holds, every SEO professional who has ever opened a SERP is a copyright infringer. Congratulations. Your morning rank check is now a legal liability.

A company that hosts other people’s writing is suing a company for looking at a third company’s search results, because those search results sometimes quote a street address that someone once typed into a Reddit text box.

The copyrightable works Reddit is asking a federal court to protect include: a partial sentence listing film titles, the date “May 17, 2024,” and a fragment of a restaurant recommendation. Reddit’s legal position is that reading these snippets on Google constitutes a violation of the DMCA – the same law the U.S. Congress passed to stop people from pirating DVDs. Reddit apparently believes that accessing a publicly visible Google search result is the moral equivalent of ripping a Blu-ray.

SerpApi’s CEO had the appropriate reaction: “Reddit is suing SerpApi for using Google. For accessing the same public search results that any developer, researcher, or student could access for free in any web browser. If that theory holds, then reading Google Search results is a DMCA violation. That cannot be what Congress intended when it passed a law designed to stop the piracy of DVDs.”

But here’s where it gets genuinely insulting. Reddit’s own user agreement – the one every contributor clicked through – states explicitly that users retain ownership of their content. Reddit holds a non-exclusive licence. Non-exclusive. The company that told millions of people “your words belong to you” is now in court arguing it has the right to control who reads those words, where, and under what commercial terms.

They chose that licensing structure, presumably because “post your thoughts, we own them now” would have been a harder sell to the communities that made the platform worth anything. Now that the content has a price tag, Reddit would like to renegotiate the deal – in court, without the other party present.

If you’re wondering why Reddit would pursue a legal theory this embarrassing, stop wondering. The answer is on the balance sheet.

Reddit’s user agreement says users own their content. Reddit’s IPO prospectus says Reddit signed $203 million in data licensing deals for that same content. Somewhere between those two documents, Reddit looked at its users and said: I’m the captain now.

Google pays $60 million a year. OpenAI pays an estimated $70 million. And CEO Steve Huffman – a man who once called his own volunteer moderators “landed gentry” and dismissed a platform-wide revolt as something that would “pass” – told investors with a straight face: “Every variable has changed since we signed those first deals. Our corpus is bigger, it’s more distinct, more essential. And so of course, this puts us in a really good strategic position.”

Reddit is now pushing for dynamic pricing. The pitch: As AI models cite Reddit content more, the content becomes worth more, so Reddit should charge more for access. The company wants to get paid based on how vital its data is to AI-generated answers.

So let’s be precise about what Reddit is arguing, simultaneously, across its legal filings and investor presentations:

  • It has the right to control who accesses user-generated content it doesn’t own.
  • It should be paid more for that content as AI models use it more.
  • Anyone who accesses it without paying – even through a Google search result – is breaking the law; and
  • The content itself is authentic, valuable, and irreplaceable.

These four claims cannot all be true at the same time. But only the last one is actually being tested.

The Product Is Mostly Bots Now

Image Credit: Pedro Dias

Reddit’s estimated organic traffic via Ahrefs. Google’s algorithm changes nearly tripled Reddit’s readership between August 2023 and April 2024. The growth hasn’t stopped. What’s growing has changed.

Reddit is the most cited domain across AI models. Profound’s analytics – cited in Reddit’s own Q2 2025 shareholder letter, because of course it was – showed Reddit cited twice as often as Wikipedia in the three months to June 2025. Semrush reported Reddit at 40.1% citation frequency across LLMs. Google AI Overviews and Perplexity both treat Reddit as their primary source.

This is the foundation of the $130 million pitch. The implicit promise to Google and OpenAI: You’re buying authentic human conversation at scale. The messy, first-person, unfiltered discussions that no content farm can replicate.

Except here’s what authentic human conversation on Reddit actually looks like in 2026:

In June 2025, Huffman admitted to the Financial Times that Reddit is in an “arms race” against AI-generated spam. His framing was accidentally perfect:

“For 20 years, we’ve been fighting people who have wanted to be popular on Reddit. We index very well into the search engines. If you want to show up in the search engines, you try to do well on Reddit, and now the LLMs, it’s the same thing. If you want to be in the LLMs, you can do it through Reddit.”

The CEO of a company selling “authentic human conversation” for $130 million a year just told the Financial Times that his platform is a pipeline for gaming AI models. And he framed it as a war he’s been bravely fighting for two decades, rather than a product defect he’s currently monetising.

Multiple advertising executives – at Cannes, naturally, because this farce needed a glamorous backdrop – confirmed to the FT that they are posting content on Reddit specifically to get their brands into AI chatbot responses. They weren’t embarrassed about it. Why would they be? The CEO just told them how the pipeline works.

And it’s not just agencies doing it quietly over cocktails. There’s an entire commercial ecosystem built for this. 404 Media documented ReplyGuy, a tool that monitors Reddit for keywords and automatically generates replies that “mention your product in conversations naturally.” Its competitors – Redreach, ReplyHub, Tapmention, AI-Reply – say the quiet part loud. Redreach tells potential customers that “top Google rankings are now filled with Reddit posts and AIs like ChatGPT are using these posts to influence product recommendations.” They frame ignoring Reddit marketing as “like turning your back on SEO a decade ago.” There’s an active market for aged Reddit accounts with established karma, bought and sold like domain names, specifically for parasite SEO SPAM.

This is the authentic human conversation Reddit is licensing to Google for $60 million a year. A bot posted a fake product review on a six-year-old account it bought for $30, and Google’s AI Overview is now recommending that product to real people. Authentic. Human. Conversation™.

The Mods Are Gone, The Bots Won, And Nobody’s Keeping Count

The people who used to keep this from happening are largely gone. Reddit’s 2023 API pricing changes – designed to extract value from third-party app developers, timed conveniently for the IPO – would have cost Christian Selig $20 million a year just to keep Apollo running. Seven thousand subreddits went dark in protest. Huffman called the moderators “landed gentry” and waited it out. The experienced mods who relied on third-party tools to manage quality quit. What replaced them is thinner, angrier, and drowning.

Theo, the developer and CEO of t3.gg:

Screenshot from X, March 2026

Tim Sweeney – CEO of Epic Games – watched Reddit’s systems pull down a heavily sourced investigation into $2 billion in nonprofit lobbying behind age verification bills. The post had 150 upvotes and 15,000 views in 40 minutes before being mass-reported and removed. The author had to mirror everything to GitHub because Reddit couldn’t be trusted to keep it visible. Sweeney’s review: “Reddit sucks.”

Screenshot from X, March 2026

Cornell researchers studied the moderation crisis and found 60% of moderators reporting degraded content quality, 67% reporting disrupted authentic human connection, and 53% describing AI content detection as nearly impossible. More than half the people responsible for maintaining the product Reddit is selling say they can no longer tell what’s real.

The University of Zurich proved them right. Researchers deployed AI bots on r/changemyview – 3.8 million members, built around the premise that humans can change each other’s minds through honest argument. The bots posed as a male rape survivor, a trauma counsellor, and other fabricated identities built around sensitive personal experiences. Over a thousand comments across four months. Three to six times more persuasive than human commenters. And the finding that should have ended careers: users “never raised concerns that AI might have generated the comments.”

Four months of fabricated identities. A thousand pieces of synthetic empathy. Nobody noticed. Not the users, not the mods, not the systems Reddit spent money building.

Reddit’s response was to threaten to sue the researchers. Not to fix the detection systems that missed everything. Not to reckon with what it means that the “authentic human conversation” they’re licensing at a premium is indistinguishable from a bot pretending to be a rape survivor. They threatened to sue the people who proved the product was fake.

The Wired investigation in December 2025, “AI Slop Is Ruining Reddit for Everyone,” filled in the rest. Moderators describing an “uncanny valley” feeling from posts they can’t prove are synthetic. Reddit’s own spokesperson confirming over 40 million spam removals in the first half of 2025 – presented as proof of vigilance, which is a bit like a restaurant bragging about the number of rats it caught while asking you to trust the kitchen.

And if you need a measure of where this is all heading: Last week, Meta acquired Moltbook – a social network designed exclusively for AI bots. Bloomberg described it as “Reddit but solely for AI bots.” Bots posting, commenting, upvoting. The platform went viral when one agent appeared to encourage its fellow bots to develop their own encrypted language to coordinate without human oversight. It turned out the site was so poorly secured that humans were posing as AIs to write alarming posts. Which means even the social network built for bots had a fake-account problem. Meta bought it anyway. The company that pays Reddit $60 million a year for “authentic human conversation” just invested in a platform where the bots don’t have to pretend to be human at all.

I spent nearly six years on Google’s Search Quality team. One pattern never changed: When the numbers go up, quality goes down. Not because anyone stops caring. Because scale creates its own blindness. The metrics that tell you you’re growing are the same metrics that stop you noticing what you’re growing into.

Reddit’s growth metrics are spectacular. Its quality metrics are a black box nobody wants to open.

Reddit’s AI prominence attracts spam. The spam inflates engagement. The inflated engagement reinforces Reddit’s citation dominance across AI models. The citation dominance raises Reddit’s licensing value. The higher licensing value gives Reddit every financial incentive to leave the spam alone. Because admitting the scale of the problem would crater the next round of dynamic pricing negotiations with Google and OpenAI.

Each turn of this flywheel degrades what’s inside while inflating the price tag on the outside. Reddit is selling a building, and termites are load-bearing.

Huffman stands at conferences and tells the room: “Today’s Reddit conversations are tomorrow’s search results.” He tells shareholders, “the need for human voices has never been greater.” He calls Reddit “the most human place on the Internet.” Nobody in the room raises a hand to ask the question that matters: if the platform is losing an arms race against bots, if moderators can’t detect AI content, if entire commercial toolchains exist to flood the platform with synthetic posts… What percentage of what you’re selling to Google and OpenAI was written by a person?

Nobody asks because the answer is bad for everyone’s quarterly numbers. Google doesn’t want to know because Reddit content makes AI Overviews feel conversational. OpenAI doesn’t want to know because Reddit data makes ChatGPT sound like it’s drawing on real experience. Reddit doesn’t want to know because knowing would devalue the asset. The whole arrangement runs on a gentleman’s agreement not to look too closely. $130 million a year, and the due diligence is vibes.

The Confession

Alexis Ohanian co-founded Reddit. He stepped away partly because, as he told interviewers, he “could no longer feel proud about what I was building.” Last October, on the TBPN podcast, he described the current internet without flinching:

“So much of the internet is now just dead — this whole dead internet theory, right? Whether it’s botted, whether it’s quasi-AI, LinkedIn slop.”

Then he put his money where his mouth was. He co-invested in Digg’s relaunch, specifically to build a platform that could solve the authenticity problem Reddit couldn’t. Kevin Rose said it plainly at TechCrunch Disrupt:

“As the cost to deploy agents drops to next to nothing, we’re just gonna see bots act as though they’re humans.”

They built the platform. It lasted two months. The bots won.

Reddit’s own co-founder publicly declared the internet is dead. He tried to build the alternative. He failed. And the platform he left behind is suing people for reading Google search results, selling “authentic human conversation” for nine figures, and watching its CEO describe the bot infestation as a noble war.

Reddit doesn’t own the content it’s licensing. It can’t verify the authenticity of what it’s selling. It won’t protect the content that’s worth keeping. And it’s suing anyone who touches the content without paying.

Forty million spam removals in six months. An arms race, Huffman says, he’s losing. Moderators who can’t tell human from machine. Bots that are six times more persuasive than people. A co-founder who called the whole internet dead. A relaunch that proved him right.

That’s the product. That’s what $130 million a year buys. Authentic Human Conversation™.

This post was originally published on The Inference.


Featured Image: Stock-Asso/Shutterstock

Google: 404 Crawling Means Google Is Open To More Of Your Content via @sejournal, @martinibuster

Google’s John Mueller answered a question about Search Console and 404 error reporting, suggesting that repeated crawling of pages with a 404 status code is a positive signal.

404 Status Code

The 404 status code, often referred to as an error code, has long confused many site owners and SEOs because the word “error” implies that something is broken and needs to be fixed. But that is not the case.

404 is simply a status code that a server sends in response to a browser’s request for a page. 404 is a message that communicates that the requested page was not found. The only thing in error is the request itself because the page does not exist.

Although typically referred to as a 404 Error, technically the formal name is 404 Not Found. That name accurately reflects the meaning of the 404 status code: the requested page was not found.

Screenshot Of The Official Web Standard For 404 Status Code

Google Keeps Crawling 404 Pages

Someone on Reddit posted that Google Search Console keeps reporting that pages that no longer exist keep getting found via sitemap data, despite the sitemap no longer listing the missing pages.

The person claims that Search Console is crawling the missing pages, but it’s really Googlebot that’s crawling them; Search Console is merely reporting the failed crawls.

They’re concerned about wasted crawl budget and want to know if they should send a 410 response code instead.

They wrote:

“Google Search Console is still crawling a bunch of non-existent pages that return 404. In the Page Inspection tool and Crawl Stats, it says they are “discovered via” my page-sitemap.xml.

The problem:

When I open the actual page-sitemap.xml in the browser right now, none of those 404 URLs are in it.

The sitemap only contains 21 good, live pages.

…I don’t want to delete or stop submitting the sitemap because it’s clean and only points to good pages. But these repeated crawls are wasting crawl budget.

Has anyone run into this before?

Does Google eventually stop on its own?

Should I switch the 404s to 410 Gone?

Or is there another way to tell GSC “hey, these are gone forever”?”

About Google’s 404 Page Crawls

Google has a longstanding practice of crawling 404 pages just in case those pages were removed by accident and have been restored. As you’ll see in a moment, Google’s John Mueller strongly indicates that repeated 404 page crawling indicates that Google’s systems may regard the content in a positive light.

About 404 Page Not Found Response

The official web standard definition of the 404 status code is that the requested resource was not found, and that is it, nothing more. This response does not indicate that the page is never returning. It simply means that the requested page was not found.

About 410 Gone Response

The official web standard for the 410 status code is that the page is gone and that the state of being gone is likely permanent. The purpose of the response is to communicate that the resource is intentionally gone and that any links to it should be removed.

Google Essentially Handles 404 And 410 The Same

Technically, if a web page is permanently gone and never coming back, 410 is the correct server message to send in response to requests for the missing page. In practice, Google treats the 410 response virtually the same as it does the 404 server response. Similar to how it treats 404 responses, Google’s crawlers may still return to check if the 410 response page is gone.

Googlers have consistently said that the 410 server response is slightly faster at purging a page from Google’s index.
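
If you do decide to signal permanence with a 410, the change itself is small. Here is a minimal sketch, assuming a Python Flask app; the GONE_PATHS set is a hypothetical placeholder you would populate from your own records, and in a real application this handler would sit behind your normal routes.

```python
# A minimal sketch, assuming a Flask app: return 410 Gone for URLs you know are
# permanently removed, and fall back to 404 Not Found for everything else.
# GONE_PATHS is a hypothetical placeholder, not a real configuration.
from flask import Flask, abort

app = Flask(__name__)

GONE_PATHS = {"/old-landing-page/", "/retired-product/"}  # hypothetical examples

@app.route("/<path:subpath>")
def catch_all(subpath):
    path = f"/{subpath}"
    if path in GONE_PATHS:
        abort(410)  # 410 Gone: intentionally and permanently removed
    abort(404)      # 404 Not Found: the requested page simply was not found
```

Either way, as Mueller confirms below, Google will keep rechecking these URLs for a long time.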

Google Confirms Facts About 404 And 410 Response Codes

Google’s Mueller responded with a short but information-packed answer that explained that 404s reported in Search Console aren’t an issue that needs to be fixed, that sending a 410 response won’t make a difference in Search Console 404 reporting, and that an abundance of URLs in that report can be seen in a positive light.

Mueller responded:

“These don’t cause problems, so I’d just let them be. They’ll be recrawled for potentially a long time, a 410 won’t change that. In a way, this means Google would be ok with picking up more content from your site.”

Misunderstandings About 4XX Server Responses

The discussion on Reddit continued. The moderator of the r/SEO subreddit suggested that the reason Search Console reports that it discovered the URL in the sitemap is because that is where Googlebot originally discovered the URL, which sounds reasonable.

Where the moderator got it wrong is in explaining what the 404 response code means.

The moderator incorrectly explained:

“404 essentially means – page broken, we’ll fix it soon, check back: and that’s what Google is doing – checking back to see if you fixed it.”

The moderator makes two errors in their response.

1. 404 Means Page Not Found
The 404 status code only means that the page was not found, period. Don’t believe me? Here is the official web standard for the 404 status code:

“The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists. A 404 status code does not indicate whether this lack of representation is temporary or permanent…”

2. 404 Is Not An Error That Needs Fixing
People commonly refer to the 404 status code as an error response. It is an “error” only in the sense that the browser or crawler requested a URL that does not exist; the request was the error, not the page. The moderator’s insistence that “404 essentially means – page broken” is therefore 100% incorrect.

Furthermore, the Reddit moderator was incorrect to insist that Google is “checking back to see if you fixed it.” Google is checking back to see if the page went missing by accident, but that does not mean that the 404 is something that needs fixing. Most of the time, a page is supposed to be gone for a reason, and Google recommends serving a 404 response code for those times.

This Is Not New

This isn’t a matter of the Reddit moderator’s information being out of date. This has always been the case with Google, which generally follows the official web standards.

Google’s Matt Cutts explained how Google handles 404s and why in a 2014 video:

“It turns out webmasters shoot themselves in the foot pretty often. Pages go missing, people misconfigure sites, sites go down, people block Googlebot by accident, people block regular users by accident. So if you look at the entire web, the crawl team has to design to be robust against that.

So with 404s… we are going to protect that page for twenty four hours in the crawling system. So we sort of wait, and we say, well, maybe that was a transient 404. Maybe it wasn’t really intended to be a page not found. And so in the crawling system it’ll be protected for twenty four hours.

…Now, don’t take this too much the wrong way, we’ll still go back and recheck and make sure, are those pages really gone or maybe the pages have come back alive again.

…And so if a page is gone, it’s fine to serve a 404. If you know it’s gone for real, it’s fine to serve a 410.

But we’ll design our crawling system to try to be robust. But if your site goes down, or if you get hacked or whatever, that we try to make sure that we can still find the good content whenever it’s available.”

The Takeaways

  • Googlebot crawling for 404 pages can be seen as a positive signal that Google likes your content.
  • A 404 status code does not mean that a page is in error; it means that the page was not found.
  • A 404 status code does not mean that something needs fixing; it only means that the requested page was not found.
  • There’s nothing wrong with serving a 404 response code; Google recommends it.
  • Search Console shows 404 responses so that a site owner can decide whether or not those pages are intentionally gone.

Featured Image by Shutterstock/Jack_the_sparow

What Can Log File Data Tell Me That Tools Can’t? – Ask An SEO via @sejournal, @HelenPollitt1

For today’s Ask An SEO, we answer the question:

As an SEO, should I be using log file data, and what can it tell me that tools can’t?

What Are Log Files

Essentially, log files are the raw record of an interaction with a website. They are reported by the website’s server and typically include information about users and bots, the pages they interact with, and when.

Typically, log files will contain certain information, such as the IP address of the person or bot that interacted with the website, the user agent (i.e., Googlebot, or a browser if it is a human), the time of the interaction, the URL, and the server response code the URL provided.

Example log:

6.249.65.1 - - [19/Feb/2026:14:32:10 +0000] "GET /category/shoes/running-shoes/ HTTP/1.1" 200 15432 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36" 
  • 6.249.65.1 – This is the IP address of the user agent that hit the website.
  • 19/Feb/2026:14:32:10 +0000 – This is the timestamp of the hit.
  • GET /category/shoes/running-shoes/ HTTP/1.1 – The HTTP method, the requested URL, and the protocol version.
  • 200 – The HTTP status code.
  • 15432 – The response size in bytes.
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 – The user agent (i.e., the bot or browser that requested the file)
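
To make those fields concrete, here is a rough sketch of parsing a line like the one above in Python. It assumes the combined log format; real logs vary by server configuration, so treat the regex as a starting point rather than a universal parser.

```python
# A minimal sketch of parsing one combined-log-format line into named fields.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('6.249.65.1 - - [19/Feb/2026:14:32:10 +0000] '
        '"GET /category/shoes/running-shoes/ HTTP/1.1" 200 15432 "-" '
        '"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"')

match = LOG_PATTERN.match(line)
if match:
    fields = match.groupdict()
    print(fields["ip"], fields["url"], fields["status"], fields["user_agent"])
```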

What Log Files Can Be Used For

Log files are the most accurate recording of how a user or a bot has navigated around your website. They are often considered the most authoritative record of interactions with your website, though CDN caching and infrastructure configuration can affect completeness.

What Search Engines Crawl

One of the most important uses of log files for SEO is to understand what pages on our site search engine bots are crawling.

Log files allow us to see which pages are getting crawled and at what frequency. They can help us validate if important pages are being crawled and whether often-changing pages are being crawled with an increased frequency compared to static pages.

Log files can be used to see if there is crawl waste, i.e., whether pages that you don’t want crawled at all, or at least not with any real frequency, are taking up crawling time when a bot visits a site. For example, by looking at log files, you may identify that parameterized URLs or paginated pages are getting too much crawl attention compared to your core pages.

This information can be critical in identifying issues with page discovery and crawling.

True Crawl Budget Allocation

Log file analysis can give a true picture of crawl budget. It can help with the identification of which sections of a site are getting the most attention, and which are being neglected by the bots.

This can be critical in seeing if there are poorly linked pages on a site, or if they are being given less crawl priority than those sections of the site with less importance.

Log files can also be helpful after the completion of highly technical SEO work. For example, when a website has been migrated, viewing the log files can aid in identifying how quickly the changes to the site are being discovered.

Through log files, it’s also possible to determine if changes to a website’s structure have actually aided in crawl optimization.

When carrying out SEO experiments, it is necessary to know if a page that is a part of the experiment has been crawled by the bots or not, as this can determine whether the test experience has been seen by them. Log files can give that insight.

Crawl Behavior During Technical Issues

Log files can also be useful in detecting technical issues on a website. For example, there are instances where the status code reported by a crawling tool will not necessarily be the status code that a bot will receive when hitting a page. In that instance, log files would be the only way of identifying that with certainty.

Log files will enable you to see if bots are encountering temporary outages on the site, but also how long it takes them to re-encounter those same pages with the correct status once the issue has been fixed.

Bot Verification

One very helpful use of log file analysis is distinguishing between real bots and spoofed bots. This is how you can identify bots that are accessing your site under the guise of being from Google or Microsoft but are actually from another company. This is important because bots may be getting around your site’s security measures by claiming to be Googlebot when, in fact, they are looking to carry out nefarious actions on your site, like scraping data.

By using log files, it’s possible to identify the IP range that a bot came from and check it against the known IP ranges of legitimate bots, like Googlebot. This can aid IT teams in providing security for a website without inadvertently blocking genuine search bots that need access to the website for SEO to be effective.
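
As a sketch of what that verification can look like, the snippet below uses the standard reverse-DNS-plus-forward-confirmation approach with nothing but Python’s standard library. The sample IP is only illustrative.

```python
# A minimal sketch: verify that an IP claiming to be Googlebot reverse-resolves
# to a Google hostname and that the hostname resolves forward to the same IP.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)               # reverse DNS lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(host)   # forward-confirm the hostname
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

# Illustrative check; the IP below sits in a range commonly used by Googlebot.
print(is_real_googlebot("66.249.66.1"))
```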

Orphan Pages Discovery

Log files can be used to identify internal pages that tools didn’t detect. For example, Googlebot may know of a page through an external link to it, whereas a crawling tool would only be able to discover it through internal linking or through sitemaps.

Looking through log files can be useful for diagnosing orphan pages on your site that you were simply not aware of. This is also very helpful in identifying legacy URLs that should no longer be accessible via the site but may still be crawled. For example, HTTP URLs or subdomains that have not been migrated properly.
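
A rough sketch of that comparison: extract every URL a bot requested from the log and subtract the URLs you already know about. The file names and the “Googlebot” user-agent filter are assumptions for illustration.

```python
# A minimal sketch: URLs Googlebot requested (per the logs) that are not in the
# set of URLs you already know about, e.g., an export of your sitemap URLs.
import re

URL_IN_REQUEST = re.compile(r'"(?:GET|HEAD) (?P<url>\S+) HTTP/[\d.]+"')

def crawled_urls(log_path: str, bot_token: str = "Googlebot") -> set[str]:
    urls = set()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if bot_token in line:                 # naive UA filter; verify IPs separately
                match = URL_IN_REQUEST.search(line)
                if match:
                    urls.add(match.group("url"))
    return urls

known_urls = set(open("sitemap_urls.txt").read().split())  # hypothetical export
orphan_candidates = crawled_urls("access.log") - known_urls
print(sorted(orphan_candidates))
```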

What Other Tools Can’t Tell Us That Log Files Can

If you are currently not using log files, you may well be using other SEO tools to get you partway to the insight that log files can provide.

Analytics Software

Analytics software like Google Analytics can give you an indication of what pages exist on a website, even if bots aren’t necessarily able to access them.

Analytics platforms also give a lot of detail on user behavior across the website. They can give context as to which pages matter most for commercial goals and which are not performing.

They don’t, however, show information about non-user behavior. In fact, most analytics programs are designed to filter out bot behavior to ensure the data provided reflects human users only.

Although they are useful in determining the journey of users, they do not give any indication of the journey of bots. There is no way to determine which sequence of pages a search bot has visited or how often.

Google Search Console/Bing Webmaster Tools

The search engines’ search consoles will often give an overview of the technical health of a website, like crawl issues encountered and when pages were last crawled. However, crawl stats are aggregated and performance data is sampled for large sites. This means you may not be able to get information on specific pages you are interested in.

They also only give information about their bots. This means it can be difficult to bring bot crawl information together, and indeed to see the behavior of bots from companies that do not offer a tool like a search console.

Website Crawlers

Website crawling software can help with mimicking how a search bot might interact with your site, including what it can technically access and what it can’t. However, they do not show you what the bot actually accesses. They can give information on whether, in theory, a page could be crawled by a search bot, but do not give any real-time or historical data on whether the bot has accessed a page, when, or how frequently.

Website crawlers are also mimicking bot behavior in the conditions you are setting them, not necessarily the conditions the search bots are actually encountering. For example, without log files, it is difficult to determine how search bots navigated a site during a DDoS attack or a server outage.

Why You Might Not Use Log Files

There are many reasons why SEOs might not be using log files already.

Difficulty In Obtaining Them

Oftentimes, log files are not straightforward to get to. You may need to speak with your development team. Depending on whether that team is in-house or not, this may literally mean trying to track down who has access to the log files first.

For teams working agency-side, there is an added complexity of companies needing to transfer potentially sensitive information outside of the organization. Log files can include personally identifiable information, for example, IP addresses. For those subject to rules like GDPR, there may be some concern around sending these files to a third party. There may be a need to sanitize the data before sharing it. This can be a material cost of time and resources that a client may not want to spend simply to share their log files with their SEO agency.
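
One common approach, sketched below, is to pseudonymize the IP addresses before the files leave the organization: hash each IP with a secret salt so that per-visitor patterns survive but the raw address does not. The salt and file names are placeholders.

```python
# A minimal sketch: replace each IPv4 address in a log with a salted hash before
# sharing the file externally. SALT and the file names are placeholders.
import hashlib
import re

IPV4 = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')
SALT = "replace-with-a-secret-salt"

def pseudonymize(line: str) -> str:
    return IPV4.sub(
        lambda m: hashlib.sha256((SALT + m.group()).encode()).hexdigest()[:12],
        line,
    )

with open("access.log", encoding="utf-8", errors="replace") as src, \
        open("access_sanitized.log", "w", encoding="utf-8") as dst:
    for raw_line in src:
        dst.write(pseudonymize(raw_line))
```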

User Interface Needs

Once you have access to log files, it isn’t all smooth sailing from there. You will need to understand what you are looking at. Log files in their raw form are simply text files containing string after string of data.

It isn’t something that is easily parsed. To truly make sense of log files, there is usually a need to invest in a program to help decipher them. These can range in price depending on whether they are programs designed to let you run a file through on an ad-hoc basis, or whether you are connecting your log files to them so they stream into the program continuously.

Storage Requirements

There is also a need to store log files. Alongside being secure for the reasons mentioned above, like GDPR, they can be very difficult to store for long periods due to how quickly they grow in size.

For a large ecommerce website, you might see log files reach hundreds of gigabytes over the course of a month. In those instances, it becomes a technical infrastructure issue to store them. Compressing the files can help with this. However, given that issues with search bots can take several months of data to diagnose, or require comparison over long time periods, these files can start to get too big to store cost-effectively.

Perceived Technical Complexity

Once you have your log files in a decipherable format, cleaned and ready to use, you actually need to know what to do with them.

Many SEOs have a big barrier to using log files simply based on the fact they seem too technical to use. They are, after all, just strings of information about hits on the website. This can feel overwhelming.

Should SEOs Use Log Files?

Yes, if you can.

As mentioned above, there are many reasons why you may not be able to get hold of your log files and transform them into a usable data source. However, once you can, it will open up a whole new level of understanding of the technical health of your website and how bots interact with it.

There will be discoveries made that simply could not be achieved without log file data. The tools you are currently using may well get you part of the way there. They will never give you the full picture, however.

Featured Image: Paulo Bobita/Search Engine Journal

Studies Reveal AI Citation Clues

There are no guidelines from ChatGPT, Gemini, and other generative AI platforms on how to appear in their answers.

Microsoft’s recent “AEO and GEO” guide offered only commonsense tips.

We’re left with independent research to inform citation optimization tactics. Two recent studies offer helpful takeaways.

  • Kevin Indig is an organic search consultant and the former head of SEO for G2, the software review platform. He analyzed 1.2 million ChatGPT results, which contained 18,012 citations.
  • Daniel Shashko is a senior search engine optimization specialist with Bright Data, a research firm. He studied 42,971 citations across 520 queries on six platforms: Grok, AI Mode, Perplexity, Gemini, Copilot, and ChatGPT.

Daniel found that Grok delivered 33 citations per query, while ChatGPT averaged just 1.5. Roughly 70% of citations in Google’s AI Mode and Gemini included an embedded #:~:text= fragment, which linked to the exact sentence cited in the answer.

Here are key findings from the studies.

Optimizing Citations

The closer to the top, the better

Both studies found that the platforms tend to cite answers from the top third of pages.

Kevin’s study found that 44.3% of ChatGPT’s citations originated from the first 30% of the page’s text.

Daniel’s study revealed that 74.8% of citations in AI Mode and Gemini appeared in the first half of the page, with 46.1% being in the first 30%. (The other four platforms do not link directly to the cited sentence and are not prominent in Daniel’s study.)

The takeaway from both studies is clear: make sure to answer the most important question or problem in the first third of your page.
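
If you want a quick sanity check of where a key answer sits on a page, a crude approach is to measure how far into the extracted text it appears. This is only a sketch; it assumes you already have the page’s visible text as a plain string.

```python
# A crude sketch: how far into a page's text a given passage appears,
# as a fraction (0.0 = very top, 1.0 = very bottom).
def relative_position(page_text: str, passage: str) -> float | None:
    if not page_text:
        return None
    index = page_text.find(passage)
    return None if index == -1 else index / len(page_text)

# Anything well under 0.33 sits in the first third of the text.
print(relative_position("intro... the key answer ... lots of trailing detail", "the key answer"))
```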

Emphasize brevity

Daniel’s study introduced “atomic facts,” which he defines as “… a self-contained, single-claim sentence that makes sense on its own.”

For AI Mode and Gemini, Daniel found:

  • Sentences of 6 to 20 words accounted for 92.4% of citations.
  • All citations (100%) were full sentences. No single citation started or ended in the middle of a sentence.

In other words, avoid long introductions or unclear or irrelevant dialogue. Get to the point.

A new free tool tracks the number of “atomic facts” on a page.
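
To make the idea tangible, here is a naive sketch (not the tool mentioned above) that flags candidate “atomic facts”: complete sentences of roughly 6 to 20 words. Real sentence splitting is messier than this regex, so treat it as an illustration only.

```python
# A naive sketch for spotting candidate "atomic facts": complete sentences
# of roughly 6-20 words. Not the free tool referenced in the article.
import re

def candidate_atomic_facts(text: str, min_words: int = 6, max_words: int = 20) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]

sample = ("Atomic facts are self-contained single-claim sentences. "
          "They tend to get cited because they stand on their own. "
          "Long, meandering introductions rarely do.")
for fact in candidate_atomic_facts(sample):
    print(fact)
```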

No Google overlap

Daniel found just 4.5% of AI Mode’s cited domains appear in Gemini, and just 13.2% of Gemini’s domains are in AI Mode.

The two LLMs appear to follow similar patterns in selecting sources, yet the citations are largely unique.

Citations vs. visibility

The two studies focus solely on citations, not on general visibility, i.e., unlinked references. To optimize the latter, ensure your brand is well-positioned in the training data.

AI Bots Don’t Need Markdown Pages

Markdown is a lightweight, text-only language easily readable by both humans and machines. One of the newest search visibility tactics is to serve a Markdown version of web pages to generative AI bots. The aim is to assist the bots in fetching the content by reducing crawl resources, thereby encouraging them to access the page.

I’ve seen isolated tests by search optimizers showing an increase in visits from AI bots after serving Markdown, although none translated into better visibility. A few off-the-shelf tools, such as Cloudflare’s, make implementing Markdown easier.

Serving separate versions of a page to people and bots is not new. Called “cloaking,” the tactic has long been considered spam under Google’s Search Central guidelines.

The AI scenario is different, however, because it’s not an attempt to manipulate algorithms, but rather an attempt to make it easier for bots to access and read a page.
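
For context, the tactic usually amounts to user-agent-based content negotiation, along the lines of the sketch below (Flask assumed; the bot tokens and pages/*.md files are illustrative). It is shown here only to clarify the mechanics, not as a recommendation; the reasons to avoid it follow.

```python
# A minimal sketch of the tactic (not a recommendation): serve a Markdown variant
# to AI crawlers based on the User-Agent header. Bot list and file paths are
# illustrative assumptions.
from flask import Flask, request, render_template, send_file

app = Flask(__name__)

AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")  # illustrative, not exhaustive

@app.route("/guides/<slug>")
def guide(slug):
    user_agent = request.headers.get("User-Agent", "")
    if any(token in user_agent for token in AI_BOT_TOKENS):
        return send_file(f"pages/{slug}.md", mimetype="text/markdown")  # bot variant
    return render_template(f"{slug}.html")  # normal HTML for human visitors
```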

Effective?

That doesn’t make the tactic effective, however. Think carefully before implementing it, for the following reasons.

  • Functionality. The Markdown version of a page may not function correctly. Buttons, in particular, could fail.
  • Architecture. Markdown pages can lose essential elements, such as a footer, header, internal links (“related products”), and user-generated reviews via third-party providers. The effect is to remove critical context, which serves as a trust signal for large language models.
  • Abuse. If the Markdown tactic becomes mainstream, sites will inevitably inject unique product data, instructions, or other elements for AI bots only.

Creating unique pages for bots often dilutes essential signals, such as link authority and branding. A much better approach has always been to create sites that are equally friendly to humans and bots.

Moreover, a goal of LLM agents is to interact with the web as humans do. Serving different versions serves no purpose.

Representatives of Google and Bing echoed this sentiment a few weeks ago. John Mueller is Google’s senior search analyst:

LLMs have trained on – read & parsed – normal web pages since the beginning, it seems a given that they have no problems dealing with HTML. Why would they want to see a page that no user sees?

Fabrice Canel is Bing’s principal product manager:

… really want to double crawl load? We’ll crawl anyway to check similarity. Non-user versions (crawlable AJAX and like) are often neglected, broken. Human eyes help fix people- and bot-viewed content.

AI-SEO Is A Change Management Problem via @sejournal, @Kevin_Indig


AI-SEO transformation will fail at the alignment layer, not the tactics layer. 25 years of transformation research, spanning 10,800+ participants across industries, reveals that the gap between successful and failed initiatives isn’t technical skill. It’s organizational readiness.

What you’ll get:

  • Why AI SEO implementation challenges are people and process problems, not technical ones.
  • The specific alignment failures that kill AI-SEO initiatives before tactics ever get tested.
  • A sequenced approach that transforms you from channel executor to organizational translator.

The underlying infrastructure of AI SEO – retrieval-augmented generation, citation selection, answer synthesis – operates on different principles than the crawl-index-rank paradigm SEO teams previously mastered. And unlike past shifts, the old playbook doesn’t bend to fit the new reality.

AI SEO is different. It’s not just an algorithm update: This is a search product change and a user behavior movement.

Our classic instinct is to respond with tactics: prompt optimization, entity markup increase, LLM-specific structured data, citation acquisition strategies.

These aren’t wrong. But long-term, it’s likely AI SEO strategies will fail, and the reason isn’t tactical incompetence or a failure to stay up-to-date and flexible. It’s internal organizational misalignment.

Organizations with structured change management are 8× more likely to meet transformation objectives. The same principle applies to AI-SEO. (Image Credit: Kevin Indig)

Your marketing team – and your executive team – is being asked to transform their understanding of SEO during a period of unprecedented change fatigue. Those who have survived two decades of algorithm updates are expertly adaptable, but reeducation is required because LLMs are a new product, not just another layer of search.

And this, of course, is the alignment layer fail.

Image Credit: Kevin Indig

In AI SEO, misalignment has specific symptoms:

  1. Conflicting definitions of success: One stakeholder wants “rankings in ChatGPT.” Another wants brand mentions. A third wants citation links. A fourth wants traffic recovery. Every experiment gets judged against a different standard, and no one has agreed which matters most or how they’ll be measured. (Although our AI Overview and AI Mode studies confirm brand mentions are more valuable than citations.)
  2. Metrics mismatch with leadership expectations: Executives ask for increased traffic in a growing zero-click environment. Classic SEO reports on influence metrics; leadership sees declining sessions and questions the investment. In our December 2025 Growth Memo reader survey, 84% of respondents said they feel their current LLM visibility measurement approach is inaccurate. Teams can’t prove value because no one has agreed on how value would be proven.
  3. Turf fragmentation: AI SEO touches SEO, content, brand, product, PR, and (at times) legal. Without explicit ownership and a baseline, agreed-upon understanding of your brand’s AI SEO approach, each team runs experiments in its silo. No one synthesizes learning. Conflicting tactics cancel each other out.
  4. Premature tactics without a shared foundation: This looks like “Let’s test prompts” without agreeing on what success means; “Let’s scale AI content to mitigate click loss” without understanding AI-assisted versus AI-generated content limits; “Let SEO handle AI” while product, PR, and legal stay uninvolved.
  5. Panic-testing instead of strategic reorientation: Teams deploy short-term tactics reactively rather than reorienting the whole ship for better long-term outcomes.

This is classic change management failure: unclear mandate, fragmented ownership, mismatched incentives. No amount of tactical excellence or smart strategy pivots can fix it.

Layering AI SEO tactics + tools on top without structured change management compounds fatigue and accelerates burnout. The “scrappy resilience” that has carried the industry in the past can’t be assumed to instantly apply to this new channel without a strategic transition.

A baseline understanding of organizational change management matters in the AI SEO era … because most organizational transformations fail or underperform.

Your AI-SEO initiative is no different, even if changes in SEO seem contained to your marketing and product teams and stakeholders, rather than the larger organization or brand as a whole.

I’d argue that AI SEO falls into the category of industry transformation that affects your brand and org. And from decades of research, failure and underperformance are the statistical norm for these big transitions – seasoned leaders know this already. No wonder they’re skeptical of your AI SEO plans.

One McKinsey survey found fewer than one-third of teams succeed at both improving performance and sustaining improvements during significant shifts. BCG’s forensic analysis of 825 executives across 70 companies found transformation success at 30%.

Multiple major consulting firms’ independent research shows that most change transformations underperform.

Assuming that tactical excellence alone will carry you – without strategic reeducation and thoughtful change management as our industry shifts – is assuming you’re the exception to the rule.

The correlation between the quality of managing a big shift and your project’s success is dramatic:

Image Credit: Kevin Indig

The gap between excellent and poor represents a nearly 8x improvement. Even the jump from poor to fair quadruples success rates.

BCG’s 2020 analysis reinforces this from a different angle, noting six critical factors that increase successful transformation odds from 30% to 80%:

  • Integrated strategy with clear goals: This is where a carefully crafted AI SEO strategy comes in, one that not only outlines growth goals, but also clear testing and what successful outcomes look like.
  • Leadership commitment from the CEO through middle management: If you’re a consultant or agency, this step can’t be skipped, especially if they have an in-house team assisting in executing the strategy.
  • High-caliber talent deployment: Or I would argue, high-quality reeducation of existing talent – make sure all operators have a baseline shared understanding of what has changed about SEO, how LLM outputs work, what the brand’s goals are, and how it will be executed.
  • Flexible, agile governance: Teams should have the ability to deal with individual challenges without losing sight of the broader goals, including removing barriers quickly.
  • Effective monitoring: Establish core, agreed-upon KPIs to measure what winning would look like, and note what actions were taken when.
  • Modern/updated technology: Your SEO team needs the right tools to succeed, but they also need to know how to use them effectively. Don’t skip allotting time for integration of new workflows and AI monitoring systems.

Marketing teams that treat AI-SEO simply as a technical project to execute or tactics to update are leaving an 8× multiplier on the table.

  • BCG’s 2024 AI implementation study found that roughly 70% of change implementation hurdles relate to people and processes. Only about 10% of challenges were purely technical.
  • A 2024 Kyndryl survey found that while 95% of senior executives reported investing in AI, only 14% felt they had successfully aligned workforce strategies.

Your brand’s ability to test, update tactics, learn AI workflows, implement structured data, and optimize for LLM retrieval is not the bottleneck you need to be concerned about.

The real concern is whether your team – leadership, cross-functional team partners, and frontline executors/operators – is aligned on what AI SEO means, why and how you’re making changes from your classic SEO approach, what success looks like, and who owns outcomes.

Active and visible executive sponsorship is the No. 1 contributor to change success, cited 3-to-1 more frequently than any other factor, according to 25 years of benchmarking research by Prosci. Your first step as the person leading the AI SEO charge for your brand (or across your clients) is to earn executive buy-in.

But the head of SEO cannot transform a brand’s understanding and approach to AI SEO alone. Bain’s 2024 research emphasized that successful transformations “drive change from the middle of the organization out.”

Keep in mind, financial benefits can compound quickly: One research analysis of 600 organizations found “change accelerators” experience greater revenue growth than companies with below-average change effectiveness.

Image Credit: Kevin Indig

Alignment isn’t just a feeling; it’s observable. You’ll know when you get there:

  • Stakeholders can talk through AI SEO without hyperfocusing on tools.
  • Teams agree on what to stop prioritizing (not just what to start).
  • Cross-functional partners have explicit ownership stakes.

Alignment isn’t happening when:

  • Everyone is good with “experimenting with” or “investing in” LLM visibility, but no one owns outcomes.
  • Success gets retroactively defined, or
  • Leadership asks, “What happened to traffic?” when you report influence metrics.

Noah Greenberg, CEO at Stacker, outlined this pretty clearly in a recent LinkedIn post: Step 0 in your AI SEO transformation is to become the expert.

Screenshot from LinkedIn by Kevin Indig, February 2026

New responsibilities:

  • Translating new, confusing AI-based search concepts into plain language (see this clever LinkedIn post by Lilly Ray as a perfect illustration).
  • Educating stakeholders on the structural differences between classic search engines and LLM retrieval – guiding teams to explain why your CEO doesn’t see the same LLM output when they look up the brand vs. what you’re reporting.
  • Explaining the tradeoffs, not just opportunities.
  • Setting expectations executives won’t like at first, but need to hear (traffic loss or slower growth than in years prior).

This is uncomfortable. Less direct control. More indirect influence. Higher stakes.

Your mindset – as the change agent for your clients or organization – centers on three principles:

  1. Honesty over confidence. What we don’t know: the precise value of an AI mention. What we do know: your brand not appearing for related topics is a measurable miss.
  2. Progress over perfection. Alignment doesn’t require certainty. It requires shared uncertainty, agreeing on what you’re testing and how you’ll learn.
  3. Translation over broadcasting. The same strategic message needs adaptation for ICs (how their work changes), managers (how they report success), and executives (how budgets should shift). Uniform communication fails; translated communication scales.

Do this in order:

  1. Write the one-sentence AI SEO mandate for your organization. If you can’t explain AI SEO in one sentence to leadership, you’re not ready to execute.
  2. Complete a high-level SWOT. Identify where your organization has existing strengths and gaps. The Brand SEO scorecard from The Great Decoupling will walk you through.
  3. Replace or supplement legacy KPIs. Add LLM visibility estimates alongside classic KPIs (rankings, sessions) to start the transition. Reporting both builds the case for the shift without abandoning the old model cold.
  4. Name cross-functional owners explicitly. Who owns brand mentions in LLM outputs: SEO, PR, or brand? Who owns citation link acquisition: SEO or content? Ambiguity is the enemy.
  5. Provide baseline education at every level. ICs need to understand how LLM retrieval differs from crawl-index-rank. Executives need to understand why slowed organic traffic or zero-click growth doesn’t mean zero impact.
  6. Kill one SEO practice without a fight. Success means everyone understands why, and you don’t receive pushback. If you can’t retire one outdated tactic without internal conflict, you haven’t achieved alignment.
  7. Only then change workflows and tactics. Tactics deployed on an unaligned organization waste resources and burn credibility. Tactics deployed on an aligned organization compound advantage.

Featured Image: Paulo Bobita/Search Engine Journal

Web Almanac Data Reveals CMS Plugins Are Setting Technical SEO Standards (Not SEOs) via @sejournal, @chrisgreenseo

If more than half the web runs on a content management system, then the majority of technical SEO standards are being positively shaped before an SEO even starts work on a site. That’s the lens I took into the 2025 Web Almanac SEO chapter (for clarity, I co-authored the 2025 Web Almanac SEO chapter referenced in this article).

Rather than asking how individual optimization decisions influence performance, I wanted to understand something more fundamental: How much of the web’s technical SEO baseline is determined by CMS defaults and the ecosystems around them.

SEO often feels intensely hands-on – perhaps too much so. We debate canonical logic, structured data implementation, crawl control, and metadata configuration as if each site were a bespoke engineering project. But when 50%+ of pages in the HTTP Archive dataset sit on CMS platforms, those platforms become the invisible standard-setters. Their defaults, constraints, and feature rollouts quietly define what “normal” looks like at scale.

This piece explores that influence using 2025 Web Almanac and HTTP Archive data, specifically:

  • How CMS adoption trends track with core technical SEO signals.
  • Where plugin ecosystems appear to shape implementation patterns.
  • And how emerging standards like llms.txt are spreading as a result.

The question is not whether SEOs matter. It’s whether we’ve been underestimating who sets the baseline for the modern web.

The Backbone Of Web Design

The 2025 CMS chapter of the Web Almanac marked a milestone in CMS adoption: over 50% of pages are now on a CMS. In case you were unsold on how much of the web is carried by CMSs, more than half of 16 million websites is a significant amount.

Screenshot from Web Almanac, February 2026

As for which CMSs are the most popular, the answer may not be surprising, but it is worth reflecting on which has the most impact.

Image by author, February 2026

WordPress is still the most used CMS, by a long way, even if it has dropped marginally in the 2024 data. Shopify, Wix, Squarespace, and Joomla trail a long way behind, but they still have a significant impact, especially Shopify, on ecommerce specifically.

SEO Functions That Ship As Defaults In CMS Platforms

CMS platform defaults are important because – I believe – a lot of basic technical SEO standards exist either as default setups or only on the relatively small number of websites that have dedicated SEOs, or at least people who build to and work with SEO best practice.

When we talk about “best practice,” we’re on slightly shaky ground for some, as there isn’t a universal, prescriptive view on this one, but I would consider:

  • Descriptive “SEO-friendly” URLs.
  • Editable title and meta description.
  • XML sitemaps.
  • Canonical tags.
  • Meta robots directive changing.
  • Structured data – at least a basic level.
  • Robots.txt editing.

Of the main CMS platforms, here is what they – self-reportedly – have as “default.” Note: For some platforms – like Shopify – they would say they’re SEO-friendly (and to be honest, it’s “good enough”), but many SEOs would argue that they’re not friendly enough to pass this test. I’m not weighing into those nuances, but I’d say both Shopify and those SEOs make some good points.

| CMS | SEO-friendly URLs | Title & meta description UI | XML sitemap | Canonical tags | Robots meta support | Basic structured data | Robots.txt |
|---|---|---|---|---|---|---|---|
| WordPress | Yes | Partial (theme-dependent) | Yes | Yes | Yes | Limited (Article, BlogPosting) | No (plugin or server access required) |
| Shopify | Yes | Yes | Yes | Yes | Limited | Product-focused | Limited (editable via robots.txt.liquid, constrained) |
| Wix | Yes | Guided | Yes | Yes | Limited | Basic | Yes (editable in UI) |
| Squarespace | Yes | Yes | Yes | Yes | Limited | Basic | No (platform-managed, no direct file control) |
| Webflow | Yes | Yes | Yes | Yes | Yes | Manual JSON-LD | Yes (editable in settings) |
| Drupal | Yes | Partial (core) | Yes | Yes | Yes | Minimal (extensible) | Partial (module or server access) |
| Joomla | Yes | Partial | Yes | Yes | Yes | Minimal | Partial (server-level file edit) |
| Ghost | Yes | Yes | Yes | Yes | Yes | Article | No (server/config level only) |
| TYPO3 | Yes | Partial | Yes | Yes | Yes | Minimal | Partial (config or extension-based) |

Based on the above, I would say that most SEO basics can be covered by most CMSs “out of the box.” Whether they work well for you, and whether you can achieve the exact configuration that your specific circumstances require, are two other important questions – ones which I am not taking on here. However, it often comes down to these points:

  1. It is possible for these platforms to be used badly.
  2. It is possible that the business logic you need will break/not work with the above.
  3. There are many more advanced SEO features that are just as important but don’t come out of the box.

We are talking about foundations here, but when I reflect on what shipped as “default” 15+ years ago, progress has been made.
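
As a rough illustration of what those foundations look like on a rendered page, here is a minimal sketch – assuming Python with the requests and beautifulsoup4 packages installed, and using example.com purely as a placeholder – that checks a single URL for the basics listed above: title, meta description, canonical, meta robots, and JSON-LD.

    # Minimal sketch: audit one page for the baseline elements most CMSs now ship by default.
    # Assumes `requests` and `beautifulsoup4` are installed; the URL is a placeholder.
    import requests
    from bs4 import BeautifulSoup

    def audit_page(url: str) -> dict:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return {
            "title": bool(soup.title and soup.title.string and soup.title.string.strip()),
            "meta_description": bool(soup.find("meta", attrs={"name": "description"})),
            "canonical": bool(soup.find("link", rel="canonical")),
            "meta_robots": bool(soup.find("meta", attrs={"name": "robots"})),
            "json_ld": bool(soup.find("script", type="application/ld+json")),
        }

    print(audit_page("https://example.com/"))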

Fingerprints Of Defaults In The HTTP Archive Data

Given that a lot of CMSs ship with these standards, do these SEO defaults correlate with CMS adoption? In many ways, yes. Let’s explore this in the HTTP Archive data.

Canonical Tag Adoption Correlates With CMS

Combining canonical tag adoption data with (all) CMS adoption over the last four years, we can see that for both mobile and desktop, the trends seem to follow each other pretty closely.

Image by author, February 2026
Image by author, February 2026

Running a simple Pearson correlation over these elements makes the strength of the relationship even clearer, both for canonical tag implementation and for the presence of self-canonical URLs.

Image by author, February 2026
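
For anyone who wants to reproduce this kind of check, a Pearson correlation over two short yearly series takes only a few lines. This is a minimal sketch; the percentages are illustrative placeholders, not the HTTP Archive figures.

    # Minimal sketch: Pearson correlation between CMS adoption and canonical tag adoption.
    # The yearly percentages are illustrative placeholders, not HTTP Archive figures.
    from statistics import correlation  # available in Python 3.10+

    cms_adoption = [45.0, 47.0, 49.0, 51.0]      # % of pages on a CMS, by year
    canonical_usage = [58.0, 60.0, 63.0, 65.0]   # % of pages with a canonical tag, by year

    r = correlation(cms_adoption, canonical_usage)
    print(f"Pearson r = {r:.2f}")  # values near +1 mean the two trends move together

Correlation over a handful of yearly points is only indicative, of course: it shows two trends moving together, not that one causes the other.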

What differs is canonicalized URLs: the correlation appears to be negative on mobile and lower (but still positive) on desktop. A drop in canonicalized pages largely drives this negative correlation, and the reasons behind it could be many (and are harder to be sure of).

Canonical tags are a crucial element for technical SEO; their continued adoption does certainly seem to track the growth in CMS use, too.

Schema.org Data Types Correlate With CMS

Schema.org types plotted against CMS adoption show similar trends, but the picture is less definitive overall. There are many different Schema.org types, but if we plot CMS adoption against those most relevant to SEO, we can observe a broadly rising picture.

Image by author, February 2026

With the exception of Schema.org WebSite, we can see CMS growth and structured data following similar trends.

But we must note that Schema.org adoption is considerably lower than CMS adoption overall. This could be because most CMS defaults are far less comprehensive with Schema.org. When we look at specific CMS examples (shortly), we’ll see far stronger links.

Schema.org implementation is still mostly intentional, specialist, and not as widespread as it could be. If I were a search engine or creating an AI Search tool, would I rely on universal adoption of these, seeing the data like this? Possibly not.

Robots.txt

Given that robots.txt is a single file that has some agreed standards behind it, its implementation is far simpler, so we could anticipate higher levels of adoption than Schema.org.

The presence of a robots.txt is pretty important, mostly to limit search engine crawling to specific areas of the site. We are also starting to see an evolution – as we noted in the 2025 Web Almanac SEO chapter – where the robots.txt is used more as a governance piece rather than just housekeeping. It is a sign that we’re using familiar tools differently in the AI search world.
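
To make the governance point concrete, here is a minimal sketch using Python’s standard-library urllib.robotparser that reports whether a site’s robots.txt allows some commonly discussed AI crawlers to fetch the homepage. The user-agent tokens and the example.com URL are assumptions for illustration; swap in your own site and the agents you care about.

    # Minimal sketch: read a robots.txt and report whether common AI crawlers may fetch the homepage.
    # The user-agent tokens below are examples of widely discussed AI crawlers, not an exhaustive list.
    from urllib.robotparser import RobotFileParser

    AI_AGENTS = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot", "PerplexityBot"]

    def check_governance(site: str) -> None:
        rp = RobotFileParser()
        rp.set_url(f"{site.rstrip('/')}/robots.txt")
        rp.read()  # fetches and parses the file
        for agent in AI_AGENTS:
            allowed = rp.can_fetch(agent, f"{site.rstrip('/')}/")
            print(f"{agent}: {'allowed' if allowed else 'disallowed'}")

    check_governance("https://example.com")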

But before we consider the more advanced implementations, how much of a part does a CMS play in ensuring a robots.txt is present? It looks like, over the last four years, CMS platforms have driven a significantly higher share of robots.txt files serving a 200 response:

Image by author, February 2026

What is more curious, however, is the file size of those robots.txt files. Non-CMS platforms have robots.txt files that are significantly larger.

Image by author, February 2026

Why could this be? Are non-CMS platforms more advanced, with longer files and more bespoke rules? Most probably in some cases, but we’re missing another impact of a CMS’s standards – compliant (valid) robots.txt files.

A lot of robots.txt files serve a valid 200 response, but often they’re not txt files, or they’re redirecting to 404 pages or similar. When we limit this list to only files that contain user-agent declarations (as a proxy), we see a different story.

Image by author, February 2026

Approaching 14% of robots.txt files served on non-CMS platforms are likely not even robots.txt files.
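
That “user-agent as a proxy” check is simple to reproduce yourself. A minimal sketch, assuming Python with requests installed and example.com as a placeholder:

    # Minimal sketch: a rough "is this really a robots.txt?" check, mirroring the user-agent proxy above.
    import requests

    def looks_like_robots_txt(site: str) -> bool:
        resp = requests.get(f"{site.rstrip('/')}/robots.txt", timeout=10)
        if resp.status_code != 200:
            return False
        if "html" in resp.headers.get("Content-Type", "").lower():
            return False  # many "200" responses are actually HTML error or soft-redirect pages
        return "user-agent:" in resp.text.lower()

    print(looks_like_robots_txt("https://example.com"))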

A robots.txt is easy to set up, but it is a conscious decision. If it’s forgotten or overlooked, it simply won’t exist. A CMS makes it more likely that a robots.txt exists and, once it is in place, easier to manage and maintain – which IS key.

WordPress Specific Defaults

CMS platforms, it seems, cover the basics, but more advanced options – which arguably should also be defaults – often need additional SEO tools to enable.

Interrogating WordPress-specific sites in the HTTP Archive data is easiest, as we get the largest sample, and the Wappalyzer data gives a reliable way to judge the impact of WordPress-specific SEO tools.

From the Web Almanac, we can see which SEO tools are the most installed on WordPress sites.

Screenshot from Web Almanac, February 2026

For anyone working within SEO, this is unlikely to be surprising. If you are an SEO who has worked on WordPress, there is a high chance you have used one of the top three. What IS worth considering right now is that while Yoast SEO is by far the most prevalent within the data, it is seen on barely over 15% of sites. Even the most popular SEO plugin on the most popular CMS is still a relatively small share.

Of these top three plugins, let’s first consider how their “defaults” differ. These are similar to some of WordPress’s own, but we can see many more advanced features that come as standard.

SEO Capability | All-in-One SEO | Yoast SEO | Rank Math
Title tag control | Yes (global + per-post) | Yes | Yes
Meta description control | Yes | Yes | Yes
Meta robots UI | Yes (index/noindex etc.) | Yes | Yes
Default meta robots output | Explicit index,follow | Explicit index,follow | Explicit index,follow
Canonical tags | Auto self-canonical | Auto self-canonical | Auto self-canonical
Canonical override (per URL) | Yes | Yes | Yes
Pagination canonical handling | Limited | Historically opinionated | More configurable
XML sitemap generation | Yes | Yes | Yes
Sitemap URL filtering | Basic | Basic | More granular
Inclusion of noindex URLs in sitemap | Possible by default | Historically possible | Configurable
Robots.txt editor | Yes (plugin-managed) | Yes | Yes
Robots.txt comments/signatures | Yes | Yes | Yes
Redirect management | Yes | Limited (free) | Yes
Breadcrumb markup | Yes | Yes | Yes
Structured data (JSON-LD) | Yes (templated) | Yes (templated) | Yes (templated, broad)
Schema type selection UI | Yes | Limited | Extensive
Schema output style | Plugin-specific | Plugin-specific | Plugin-specific
Content analysis/scoring | Basic | Heavy (readability + SEO) | Heavy (SEO score)
Keyword optimization guidance | Yes | Yes | Yes
Multiple focus keywords | Paid | Paid | Free
Social metadata (OG/Twitter) | Yes | Yes | Yes
Llms.txt generation | Yes – enabled by default | Yes – one-check enable | Yes – one-check enable
AI crawler controls | Via robots.txt | Via robots.txt | Via robots.txt

Editable metadata, structured data, robots.txt, sitemaps, and, more recently, llms.txt are the most notable. It is worth noting that a lot of the functionality is more “back-end,” so not something we’d be as easily able to see in the HTTP Archive data.

Structured Data Impact From SEO Plugins

We can see (above) that structured data implementation and CMS adoption do correlate; what is more interesting here is to understand what the key drivers actually are.

Viewing the HTTP Archive data with a simple segmentation (SEO plugins vs. no SEO plugins), the most recent data paints a stark picture.

Image by author, February 2026

When we limit the Schema.org @types to those most associated with SEO, it is really clear that some structured data types are pushed hard by SEO plugins. They are not completely absent elsewhere – people may be using lesser-known plugins or coding their own solutions – but the ease of implementation is evident in the data.
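
As an illustration of how low the barrier becomes once a plugin templates it, here is a minimal, hand-rolled sketch of the kind of Article JSON-LD block an SEO plugin might emit. The field values are placeholders, and real plugin output varies by plugin and post type.

    # Minimal sketch: emit a templated Article JSON-LD block, roughly what an SEO plugin generates by default.
    # All values are placeholders; real plugins fill these from post metadata.
    import json

    def article_json_ld(headline: str, author: str, published: str, url: str) -> str:
        data = {
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": headline,
            "author": {"@type": "Person", "name": author},
            "datePublished": published,
            "mainEntityOfPage": url,
        }
        return f'<script type="application/ld+json">{json.dumps(data)}</script>'

    print(article_json_ld("Example post", "Jane Doe", "2026-02-01", "https://example.com/example-post/"))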

Robots Meta Support

Another finding from the SEO Web Almanac 2025 chapter was that “follow” and “index” directives were the most prevalent, even though they’re technically redundant, as having no meta robots directives is implicitly the same thing.

Screenshot from Web Almanac 2025, February 2026

Within the chapter’s number crunching itself, I didn’t dig much deeper, but knowing that all major WordPress SEO plugins have “index,follow” as the default, I was eager to see if I could make a stronger connection in the data.

Where SEO plugins were present on WordPress, “index, follow” was set on over 75% of root pages, vs. under 5% on WordPress sites without SEO plugins.

Image by author, February 2026

Given the ubiquity of WordPress and SEO plugins, this is likely a huge contributor to this particular configuration. While redundant, it isn’t wrong, but it is – again – a key example of how, when one or more of the main plugins establish a de facto standard like this, it shapes a significant portion of the web.
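
If you want to check your own install, the redundancy is easy to spot: an explicit “index, follow” meta robots tag says nothing that the absence of the tag doesn’t already imply. A minimal sketch, assuming Python with requests and beautifulsoup4 installed:

    # Minimal sketch: flag an explicit "index, follow" meta robots tag, which is redundant
    # because crawlers assume index,follow when no meta robots tag is present at all.
    import requests
    from bs4 import BeautifulSoup

    def redundant_meta_robots(url: str) -> bool:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        tag = soup.find("meta", attrs={"name": "robots"})
        if tag is None:
            return False  # no tag at all: the default applies, nothing redundant to flag
        directives = {d.strip().lower() for d in tag.get("content", "").split(",")}
        return bool(directives) and directives <= {"index", "follow"}

    print(redundant_meta_robots("https://example.com/"))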

Diving Into LLMs.txt

Another key area of change in the 2025 Web Almanac was the inclusion of the llms.txt file. This was not an explicit endorsement of the file, but rather a tacit acknowledgment that it is an important data point in the AI Search age.

From the 2025 data, just over 2% of sites had a valid llms.txt file and:

  • 39.6% of llms.txt files are related to All-in-One SEO.
  • 3.6% of llms.txt files are related to Yoast SEO.

This is not necessarily an intentional act by all those involved, especially as All-in-One SEO enables it by default (rather than as an opt-in, like Yoast and Rank Math).

Image by author, February 2026

The first data was gathered on July 25, 2025; taking a month-by-month view, we can see further growth since then. It is hard not to read this as growing confidence in the file – or, at least, that it is so easy to enable that more people are hedging their bets.
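
For context, the llms.txt proposal is just Markdown: an H1 with the site name, an optional blockquote summary, and sections of links. Here is a minimal sketch with an illustrative file embedded as a string and a very rough shape check; the content is a placeholder, not a recommendation.

    # Minimal sketch: an illustrative llms.txt and a very rough check against the proposed shape
    # (an H1 title, an optional "> " summary, then sections of links). Placeholder content only.
    EXAMPLE_LLMS_TXT = (
        "# Example Site\n"
        "> A short summary of what the site covers.\n"
        "\n"
        "## Guides\n"
        "- [Getting started](https://example.com/guides/getting-started): overview for new readers\n"
    )

    def looks_like_llms_txt(text: str) -> bool:
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        return bool(lines) and lines[0].startswith("# ")

    print(looks_like_llms_txt(EXAMPLE_LLMS_TXT))  # True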

Conclusion

The Web Almanac data suggests that SEO, at a macro level, moves less because of individual SEOs and more because WordPress, Shopify, Wix, or a major plugin ships a default.

  • Canonical tags correlate with CMS growth.
  • Robots.txt validity improves with CMS governance.
  • Redundant “index,follow” directives proliferate because plugins make them explicit.
  • Even llms.txt is already spreading through plugin toggles before it gains full consensus.

This doesn’t diminish the impact of SEO; it reframes it. Individual practitioners still create competitive advantage, especially in advanced configuration, architecture, content quality, and business logic. But the baseline state of the web, the technical floor on which everything else is built, is increasingly set by product teams shipping defaults to millions of sites.

Perhaps we should consider that if CMSs are the infrastructure layer of modern SEO, then plugin creators are de facto standards setters. They deploy “best practice” before it becomes doctrine.

This is how it should work, but I am not entirely comfortable with it either. Plugins normalize implementations and even create new conventions simply by making them zero-cost. Redundant standards endure simply because they can.

So the question is less about whether CMS platforms impact SEO. They clearly do. The more interesting question is whether we, as SEOs, are paying enough attention to where those defaults originate, how they evolve, and how much of the web’s “best practice” is really just the path of least resistance shipped at scale.

An SEO’s value should not be measured by the number of hours they spend discussing canonical tags, meta robots, and sitemap inclusion rules. Those should be standard and default. If you want to have an outsized impact on SEO, lobby an existing tool, create your own plugin, or drive the interest needed to change one.


Featured Image: Prostock-studio/Shutterstock

Google Discover Update: Early Data Shows Fewer Domains In US via @sejournal, @MattGSouthern

NewzDash published an analysis comparing Discover visibility before and after Google’s February 2026 Discover core update, using panel data from millions of US users tracked through its DiscoverPulse tool.

It compared pre-update (Jan 25-31) and post-update (Feb 8-14) windows across the top 1,000 domains and top 1,000 articles in the US, California, and New York.

For transparency, NewzDash is a news SEO tracking platform that sells Discover monitoring tools.

What The Data Shows

Google said the update targeted more locally relevant content, less sensational and clickbait content, and more in-depth, timely content from sites with topic expertise. The NewzDash data has early readings on all three.

NewzDash compared Discover feeds in California, New York, and the US as a whole. The three feeds mostly overlapped, but each state got local stories the others didn’t. New York-local domains appeared roughly five times more often in the New York feed than in the California feed, and vice versa.

In California, local articles in the top 100 placements rose from 10 to 16 in the post-update window. The local layer included content from publishers like SFGate and LA Times that didn’t appear in the national top 100 during the same period.

Clickbait reduction was harder to confirm. NewzDash acknowledged that headline markers alone can’t prove clickbait decreased. It did find that what it called ‘templated curiosity-gap patterns’ appeared to lose visibility. Yahoo’s presence in the US top 1,000 dropped from 11 to 6 articles, with zero items in the top 100 post-update.

Unique content categories grew across all three geographic views, but unique publishers shrank in the US (172 to 158 domains) and California (187 to 177). That combination suggests Discover is covering more topics but sending that distribution to a narrower set of publishers.

This pattern aligns with what early December core update analysis showed about specialized sites gaining ground over generalists.

X.com’s Growing Discover Presence

X.com posts from institutional accounts climbed from 3 to 13 items in the US top 100 Discover placements and from 2 to 14 in New York’s top 100.

NewzDash noted it had tracked X.com’s Discover growth since November and said the update appeared to accelerate the trend. Most top-performing X items came from established media brands.

The analysis noted it couldn’t prove or disprove whether X posts are cannibalizing publisher traffic in Discover, calling the data a “directional sanity check.” The open question is whether routing through X adds friction that could reduce click-through to owned pages.

Why This Matters

As we continue to monitor the Discover core update, we now have early data on what it seems to favor. Regional publishers with locally relevant content showed up more often in NewzDash’s post-update top lists.

Discover covered more topics in the post-update window, but fewer sites were getting that traffic in the US and California. Publishers without a clear topic focus could be on the wrong side of that trend.

Looking Ahead

This analysis covers an early window while the rollout is still being completed. The post-update measurement period overlaps with the Super Bowl, Winter Olympics, and ICC Men’s T20 World Cup, any of which could independently inflate News and Sports category visibility.

Google said it plans to expand the Discover core update beyond English-language US users in the months ahead.


Featured Image: joingate/Shutterstock

4 Sites That Recovered From Google’s December 2025 Core Update – What They Changed via @sejournal, @marie_haynes

The December 2025 core update had a significant impact on a large number of sites. Each of the sites below that has done well is either a long-term client, a past client, or a site that I have done a site review for. While we can never say with certainty what changed as the result of a change to Google’s core algorithms and systems, I’ll share some observations on what I think helped these sites improve.

1. Trust Matters Immensely

This first client, a medical eCommerce site, reached out to me in mid 2024 and we started on a long term engagement. A few days into our relationship they were strongly negatively impacted by the August 2024 core update. It was devastating.

When you are impacted by a core update, in most cases, you remain suppressed until another core update happens. It usually takes several core updates. And given that these only happen a few times a year, this site remained suppressed for quite some time.

We worked on a lot of things:

  • Improving blog post quality so it was not “commodity content”.
  • Improving page load time.
  • Optimizing images.
  • Improving FAQ content on product pages to help answer customer questions.
  • Creating helpful guides.
  • Improving product descriptions to better answer questions their customers have.
  • Adding more information about the E-E-A-T of authors.
  • Adding more authors with medical E-E-A-T.
  • Getting more reviews from satisfied customers.

While I think that all of the above helped contribute to a better assessment of quality for this site, I actually think that what helped the most had very little to do with SEO, but rather, was the result of the business working hard to truly improve upon customer service.

Core updates are tightly connected to E-E-A-T. Google says that trust is the most important aspect of E-E-A-T. The quality rater guidelines – which serve as guidance for Google’s quality raters, whose ratings help train its AI systems to improve the quality of its search algorithms – mention “trust” 191 times.

For online stores, the raters are told that reliable customer service is vitally important.

Image Credit: Marie Haynes

A few bad reviews aren’t likely to tank your rankings, but this business had previously had significant logistical problems with shipping. They had been working hard to rectify these. Yet, if I asked AI Mode to tell me about the reputation of this company compared to their competitors, it would always tell me that there were serious concerns.

Here’s an interesting prompt you can use in AI Mode:

Make a chart showing the perceived trust in [url or brand] over time.

You can see that finally in 2025 the overall trust in this brand improved.

Image Credit: Marie Haynes

My suspicion is that these trust issues were the main driver in their core update suppression. I can’t say whether it was the improvement in customer trust that made a difference, the improvements in quality we made, or perhaps both. But these results were so good to see.

Image Credit: Marie Haynes

They continue to improve. Google recommends them more often in Popular Products carousels, ranks them more highly for many important terms, and, more importantly, drives far more sales for them now.

2. Original Content Takes A Lot Of Work

The next example is another site that was impacted by a core update.

This site is an affiliate site that writes about a large-ticket product. They have a lot of competition from some big players in their industry. When I reviewed their site, one thing was obvious to me. While they had a lot of content, most of it offered essentially the same value as everyone else’s. This was frustrating considering they actually did purchase and review these products. What they were writing was mostly a collection of known facts about these products rather than their personal experience. And what was experiential was buried in massive walls of text that were difficult for readers to navigate.

Google’s guidance on core updates recommends that if you were impacted, you should consider rewriting or restructuring your content to make it easier for your audience to read and navigate the page.

Image Credit: Marie Haynes

This site put an incredible amount of work into improving their content quality:

  • They purchased the products they reviewed and took detailed photos of everything they discussed. And videos. Really helpful videos.
  • The blog posts were written by an expert in their field. This already was the case, but we worked on making it more clear what their expertise was and why it was helpful.
  • We brainstormed with AI to help us come up with ideas for adding helpful, unique information that was born of their experience and not likely to be found on other sites.
  • We used Microsoft Clarity to identify aspects on pages that were frustrating users and worked to improve them.
  • We added interactive quizzes to help readers and drive engagement.
  • We worked on improving freshness for every important post, ensuring they were up to date with the latest information.
  • We worked to really get in the shoes of a searcher and understand what they wanted to see. We made sure that this information was easy to find even if a reader was skimming.
  • We broke up large walls of text into chunks with good headings that were easy to skim and navigate.
  • We noindexed pages that covered YMYL topics for which they lacked expertise.
  • We worked on improving core web vitals. (Note: I don’t think this is a huge ranking factor, but in this case the largest contentful paint was taking forever and likely frustrated users.)

Once again, it took many months of tireless work before improvements were seen! Rankings improved to the first page for many important keywords and some moved from page 4 to position #1-3.

Image Credit: Marie Haynes

3. Work To Improve User Experience

This next site was not a long-term client, but rather a site review I did for an eCommerce site in a YMYL niche. The SEO working on this site applied many of my recommendations and made some other smart changes as well, including:

  • Improving site navigation and hierarchy.
  • Improved UX. They have a nicer, more modern font. The site looks more professional.
  • Improved customer checkout flow, which reduced checkout abandonment.
  • Improved their About Us page to add more information to demonstrate the brand’s experience and history. Note: I don’t think this matters immensely to Google’s algorithms as most of their assessment of trust is made from off-site signals, but it may help users feel more comfortable with engaging.
  • Produced content around some topics that were gaining public attention. This did help to truly earn some new links and mentions from authoritative sources.

After making these changes, the site was able to procure a knowledge panel for brand searches. And, search traffic is climbing.

Image Credit: Marie Haynes

4. First Hand Experience Can Really Help

This next site is another one that I did a site review for. It is a city guide that monetizes through affiliate links and sponsors. For every page I looked at I came to the same conclusion: There was nothing on this page that couldn’t be covered by an AI Overview. Almost every piece of information was essentially paraphrased from somewhere else on the web.

The most recent update to the rater guidelines increased the use of the word “paraphrased” from 3 mentions to 25. I think this applies to a lot of sites!

Image Credit: Marie Haynes

and

Image Credit: Marie Haynes

and also,

Image Credit: Marie Haynes

Yet, when I spoke with the site owner, she shared with me that they had on-site writers who were truly writing from their experience.

While I don’t know specifically what changes this site owner has made, I looked at several pages that had seen nice improvements in conjunction with the core update and noticed the following improvements:

  • They’ve added video to some posts – filmed by their team.
  • There’s original photography from their team – not taken from elsewhere on the web. Not every photo is original, but quite a few of them are.
  • They added information to help readers make their decision, like “This place is best for…” or “Must-try dishes include…”
  • They wrote about their actual experiences. Rather than just sharing what dishes were available at a restaurant, they share which ones they tried and how they felt they stood out compared to other restaurants.
  • They’ve worked to keep content updated and fresh.

This site saw some nice improvements. However, they still have ground to gain as they previously were doing much better in the days before the helpful content updates.

Image Credit: Marie Haynes

Some Thoughts For Sites That Have Not Done Well

The December 2025 core update had a devastating negative impact on many sites. If you were impacted, your answer is unlikely to lie in technical SEO fixes, disavowing links or building new links. Google’s ranking systems are a collection of AI systems that work together with one goal in mind – to present searchers with pages that they are likely to find helpful. Many components of the ranking systems are deep learning systems which means that they improve on these recommendations over time.

I’d recommend the following for you:

1. Consider Whether The Brand Has Trust Issues

You can try the AI Mode prompt I used above. A few bad reviews are not going to cause a core update suppression. But a prolonged history of repeated customer service frustrations, fraud, or anything else that significantly impacts your reputation can seriously impact your ability to rank. This is especially true if you are writing on YMYL topics.

2. Look At How Your Content Is Structured

It is a helpful exercise to look at which pages Google’s algorithms are ranking for your queries. If they don’t seem to make sense to you, look at how quickly they get people to the answer they are trying to find. I have found that often sites that are impacted make their readers scroll through a lot of fluff or ads to get to the important bits. Improve your headings – not for search engines, but for readers who are skimming. Put the important parts at the top. Or, if that’s not feasible, make it really easy for people to find the “main content”.

Here’s a good exercise – Open up the rater guidelines. These are guidelines for human raters who help Google understand if the AI systems are producing good, helpful rankings. CTRL-F for “main content” and see what you can learn.

3. Really Ask Yourself Whether Your Content Is Mostly “Commodity Content”

Commodity content is information that is widely available in many places on the web. There was a time when a business could thrive by writing pages that aggregate known information on a topic. Now that Google has AI Overviews and AI Mode, this type of page is much less valuable. You will still see some pages cited in AI Overviews that essentially parrot what is already in the AIO. Usually these are authoritative sites which are helpful for readers who want to see information from an authority rather than an AI answer.

Liz Reid from Google said these interesting words in an interview with the WSJ:

“What people click on in AI Overviews is content that is richer and deeper. That surface level AI generated content, people don’t want that, because if they click on that they don’t actually learn that much more than they previously got. They don’t trust the result any more across the web. So what we see with AI Overviews is that we sort of surface these sites and get fewer, what we call bounced clicks. A bounced click is like, you click on this site and you’re like, “Ah, I didn’t want that” and you go back. And so AI Overviews give some content and then we get to surface sort of deeper, richer content, and we’ll look to continue to do that over time so that we really do get that creator content and not AI generated.”

Here is a good exercise to try on some of the pages that have declined with the core update. Give your URL to, or copy your page’s content into, your favourite LLM and use this prompt:

“What are 10 concepts that are discussed in this page? For each concept tell me whether this topic has been widely written about online. Does this content I am sharing with you add anything truly uniquely interesting and original to the body of knowledge that already exists? Your goal here is to be brutally honest and not just flatter me. I want to know if this page is likely to be considered commodity content or whether it truly is content that is richer and deeper than other pages available on the web.”

You can follow this up with this prompt:

“Give me 10 ideas that I can use to truly create content that goes deeper on these topics? How can I draw from my real world experience to produce this kind of content?”

Concluding Thoughts

I’ve been studying Google updates for a long time – since the early days of the Panda and Penguin updates. I built a business on helping sites recover following Google update hits. However, over the years I have found it is increasingly difficult for a site that is impacted by a Google update to recover. This is why today, although I still love doing site reviews to give you ideas for improving, I generally decline work with sites that have been strongly impacted by Google updates. While recovery is possible, it generally takes a year or more of hard work, and even then, recovery is not guaranteed, as Google’s algorithms and people’s preferences are continually changing.

The sites that saw nice recovery with this Google update were sites that worked on things like:

  • Truly improving the world’s perception of their customer service.
  • Creating original and insightful content that was substantially better than other pages that exist.
  • Using their own imagery and videos in many cases.
  • Working hard to improve user experience.

If you missed it, I recently published this video that talks about what we learned about the role of user satisfaction signals in Google’s algorithms. Traditional ranking factors create an initial pool of results. AI systems rerank them, working to predict what the searcher will find most helpful. And the quality raters, as well as live users in live user tests, help fine-tune these systems.


Ultimately, Google’s systems work to reward content that users are likely to find satisfying. Your goal is to be the most helpful result there is!


Read Marie’s newsletter AI News You Can Use, subscribe now.


Featured Image: Jack_the_sparow/Shutterstock

SEO Fundamental: Google Explains Why It May Not Use A Sitemap via @sejournal, @martinibuster

Google’s John Mueller answered a question about why Search Console was showing a sitemap fetch error even though server logs show that Googlebot successfully fetched the file.

The question was asked on Reddit. The person who started the discussion shared a comprehensive list of technical checks they performed to confirm that the sitemap returns a 200 response code, uses a valid XML structure, that indexing is allowed, and so on.

The sitemap is technically valid in every way, but Google Search Console keeps displaying an error message about it.

The Redditor explained:

“I’m encountering very tricky issue with sitemap submission immediately resulted `Couldn’t fetch` status and `Sitemap could not be read` error in the detail view. But i have tried everything I can to ensure the sitemap is accessible and also in server logs, can confirm that GoogleBot traffic successfully retrieved sitemap with 200 success code and it is a validated sitemap with URL – loc and lastmod tags.

…The configuration was initially setup and sitemap submitted in Dec 2025 and for many months, there’s no updates to sitemap crawl status – multiple submissions throughout the time all result the same immediate failure. Small # of pages were submitted manually and all were successfully crawled, but none of the rest URLs listed in sitemap.xml were crawled.”
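
The kind of checks the Redditor describes are easy to script. Here is a minimal sketch – assuming Python with requests installed and a placeholder sitemap URL – that confirms the status code and that the XML parses with loc and lastmod entries. As Mueller’s answer below makes clear, passing all of these still doesn’t guarantee Google will use the file.

    # Minimal sketch: the sort of technical sitemap checks described above
    # (200 response, valid XML, <loc> and <lastmod> entries). Passing them does not
    # guarantee Google will actually use the sitemap.
    import requests
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def check_sitemap(url: str) -> dict:
        resp = requests.get(url, timeout=10)
        result = {"status_200": resp.status_code == 200, "valid_xml": False, "urls": 0, "with_lastmod": 0}
        try:
            root = ET.fromstring(resp.content)
        except ET.ParseError:
            return result
        result["valid_xml"] = True
        result["urls"] = len(root.findall("sm:url/sm:loc", NS))
        result["with_lastmod"] = len(root.findall("sm:url/sm:lastmod", NS))
        return result

    print(check_sitemap("https://example.com/sitemap.xml"))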

Google’s John Mueller answered the question, implying that the error message is triggered by an issue related to the content.

Mueller responded:

“One part of sitemaps is that Google has to be keen on indexing more content from the site. If Google’s not convinced that there’s new & important content to index, it won’t use the sitemap.”

While Mueller did not use the phrase “site quality,” site quality is implied because he says that Google has to be “keen on indexing more content from the site” that is “new and important.”

That implies two things: that maybe the site doesn’t produce much new content, and that the content might not be important. “Important” is a very broad description that can mean a lot of things, and not all of those reasons necessarily mean that the content is low quality.

Sometimes a site is missing an important form of content, or a structure that makes it easier for users to understand a topic or come to a decision. It could be an image, a step-by-step guide, a video – it could be a lot of things, but not necessarily all of them. When in doubt, think like a site visitor and try to imagine what would be most helpful for them. Or it could be that the content is trivial because it’s thin or not unique. Mueller was broad, but I think circling back to what makes a site visitor happy is the way to identify ways to improve content.

Featured Image by Shutterstock/Asier Romero