Authentic Human Conversation™

Last Friday afternoon, Digg died. Again.

Two months. That’s how long the relaunch lasted before CEO Justin Mezzell pinned a eulogy to the homepage. The platform had raised $15-20 million. It had Kevin Rose. It had Alexis Ohanian – Reddit’s co-founder, no less. It had promises that AI would “remove the janitorial work of moderators and community managers.” What it didn’t have was a way to stop bots from eating it alive within hours of going live.

Screenshot from X, March 2026

“The internet is now populated, in meaningful part, by sophisticated AI agents and automated accounts,” Mezzell wrote. “We banned tens of thousands of accounts. We deployed internal tooling and industry-standard external vendors. None of it was enough.”

His verdict: “This isn’t just a Digg problem. It’s an internet problem. But it hit us harder because trust is the product.”

Remember that line. We’ll need it.

Suing For Reading

SerpApi retrieves Google Search results programmatically. Reddit is suing them. Not for accessing Reddit. SerpApi has never touched Reddit.com. Reddit is suing SerpApi for reading Google.

If this legal theory holds, every SEO professional who has ever opened a SERP is a copyright infringer. Congratulations. Your morning rank check is now a legal liability.

A company that hosts other people’s writing is suing a company for looking at a third company’s search results, because those search results sometimes quote a street address that someone once typed into a Reddit text box.

The copyrightable works Reddit is asking a federal court to protect include: a partial sentence listing film titles, the date “May 17, 2024,” and a fragment of a restaurant recommendation. Reddit’s legal position is that reading these snippets on Google constitutes a violation of the DMCA, the same law the U.S. Congress passed to stop people from pirating DVDs. Reddit apparently believes that accessing a publicly visible Google search result is the moral equivalent of ripping a Blu-ray.

SerpApi’s CEO had the appropriate reaction: “Reddit is suing SerpApi for using Google. For accessing the same public search results that any developer, researcher, or student could access for free in any web browser. If that theory holds, then reading Google Search results is a DMCA violation. That cannot be what Congress intended when it passed a law designed to stop the piracy of DVDs.”

But here’s where it gets genuinely insulting. Reddit’s own user agreement – the one every contributor clicked through – states explicitly that users retain ownership of their content. Reddit holds a non-exclusive licence. Non-exclusive. The company that told millions of people “your words belong to you” is now in court arguing it has the right to control who reads those words, where, and under what commercial terms.

They chose that licensing structure, presumably because “post your thoughts, we own them now” would have been a harder sell to the communities that made the platform worth anything. Now that the content has a price tag, Reddit would like to renegotiate the deal – in court, without the other party present.

If you’re wondering why Reddit would pursue a legal theory this embarrassing, stop wondering. The answer is on the balance sheet.

Reddit’s user agreement says users own their content. Reddit’s IPO prospectus says Reddit signed $203 million in data licensing deals for that same content. Somewhere between those two documents, Reddit looked at its users and said: I’m the captain now.

Google pays $60 million a year. OpenAI pays an estimated $70 million. And CEO Steve Huffman – a man who once called his own volunteer moderators “landed gentry” and dismissed a platform-wide revolt as something that would “pass” – told investors with a straight face: “Every variable has changed since we signed those first deals. Our corpus is bigger, it’s more distinct, more essential. And so of course, this puts us in a really good strategic position.”

Reddit is now pushing for dynamic pricing. The pitch: As AI models cite Reddit content more, the content becomes worth more, so Reddit should charge more for access. The company wants to get paid based on how vital its data is to AI-generated answers.

So let’s be precise about what Reddit is arguing, simultaneously, across its legal filings and investor presentations:

  • It has the right to control who accesses user-generated content it doesn’t own.
  • It should be paid more for that content as AI models use it more.
  • Anyone who accesses it without paying – even through a Google search result – is breaking the law.
  • The content itself is authentic, valuable, and irreplaceable.

These four claims cannot all be true at the same time. But only the last one is actually being tested.

The Product Is Mostly Bots Now

Image Credit: Pedro Dias

Reddit’s estimated organic traffic via Ahrefs. Google’s algorithm changes nearly tripled Reddit’s readership between August 2023 and April 2024. The growth hasn’t stopped. What’s growing has changed.

Reddit is the most cited domain across AI models. Profound’s analytics – cited in Reddit’s own Q2 2025 shareholder letter, because of course it was – showed Reddit cited twice as often as Wikipedia in the three months to June 2025. Semrush reported Reddit at 40.1% citation frequency across LLMs. Google AI Overviews and Perplexity both treat Reddit as their primary source.

This is the foundation of the $130 million pitch. The implicit promise to Google and OpenAI: You’re buying authentic human conversation at scale. The messy, first-person, unfiltered discussions that no content farm can replicate.

Except here’s what authentic human conversation on Reddit actually looks like in 2026:

In June 2025, Huffman admitted to the Financial Times that Reddit is in an “arms race” against AI-generated spam. His framing was accidentally perfect:

“For 20 years, we’ve been fighting people who have wanted to be popular on Reddit. We index very well into the search engines. If you want to show up in the search engines, you try to do well on Reddit, and now the LLMs, it’s the same thing. If you want to be in the LLMs, you can do it through Reddit.”

The CEO of a company selling “authentic human conversation” for $130 million a year just told the Financial Times that his platform is a pipeline for gaming AI models. And he framed it as a war he’s been bravely fighting for two decades, rather than a product defect he’s currently monetising.

Multiple advertising executives – at Cannes, naturally, because this farce needed a glamorous backdrop – confirmed to the FT that they are posting content on Reddit specifically to get their brands into AI chatbot responses. They weren’t embarrassed about it. Why would they be? The CEO just told them how the pipeline works.

And it’s not just agencies doing it quietly over cocktails. There’s an entire commercial ecosystem built for this. 404 Media documented ReplyGuy, a tool that monitors Reddit for keywords and automatically generates replies that “mention your product in conversations naturally.” Its competitors – Redreach, ReplyHub, Tapmention, AI-Reply – say the quiet part loud. Redreach tells potential customers that “top Google rankings are now filled with Reddit posts and AIs like ChatGPT are using these posts to influence product recommendations.” They frame ignoring Reddit marketing as “like turning your back on SEO a decade ago.” There’s an active market for aged Reddit accounts with established karma, bought and sold like domain names, specifically for parasite SEO spam.

This is the authentic human conversation Reddit is licensing to Google for $60 million a year. A bot posted a fake product review on a six-year-old account it bought for $30, and Google’s AI Overview is now recommending that product to real people. Authentic. Human. Conversation™.

The Mods Are Gone, The Bots Won, And Nobody’s Keeping Count

The people who used to keep this from happening are largely gone. Reddit’s 2023 API pricing changes – designed to extract value from third-party app developers, timed conveniently for the IPO – would have cost Christian Selig $20 million a year just to keep Apollo running. Seven thousand subreddits went dark in protest. Huffman called the moderators “landed gentry” and waited it out. The experienced mods who relied on third-party tools to manage quality quit. What replaced them is thinner, angrier, and drowning.

Theo, the developer and CEO of t3.gg:

Screenshot from X, March 2026

Tim Sweeney – CEO of Epic Games – watched Reddit’s systems pull down a heavily sourced investigation into $2 billion in nonprofit lobbying behind age verification bills. The post had 150 upvotes and 15,000 views in 40 minutes before being mass-reported and removed. The author had to mirror everything to GitHub because Reddit couldn’t be trusted to keep it visible. Sweeney’s review: “Reddit sucks.”

Screenshot from X, March 2026

Cornell researchers studied the moderation crisis and found 60% of moderators reporting degraded content quality, 67% reporting disrupted authentic human connection, and 53% describing AI content detection as nearly impossible. More than half the people responsible for maintaining the product Reddit is selling say they can no longer tell what’s real.

The University of Zurich proved them right. Researchers deployed AI bots on r/changemyview – 3.8 million members, built around the premise that humans can change each other’s minds through honest argument. The bots posed as a male rape survivor, a trauma counsellor, and other fabricated identities built around sensitive personal experiences. Over a thousand comments across four months. Three to six times more persuasive than human commenters. And the finding that should have ended careers: users “never raised concerns that AI might have generated the comments.”

Four months of fabricated identities. A thousand pieces of synthetic empathy. Nobody noticed. Not the users, not the mods, not the systems Reddit spent money building.

Reddit’s response was to threaten to sue the researchers. Not to fix the detection systems that missed everything. Not to reckon with what it means that the “authentic human conversation” they’re licensing at a premium is indistinguishable from a bot pretending to be a rape survivor. They threatened to sue the people who proved the product was fake.

The Wired investigation in December 2025, “AI Slop Is Ruining Reddit for Everyone,” filled in the rest. Moderators describing an “uncanny valley” feeling from posts they can’t prove are synthetic. Reddit’s own spokesperson confirming over 40 million spam removals in the first half of 2025 – presented as proof of vigilance, which is a bit like a restaurant bragging about the number of rats it caught while asking you to trust the kitchen.

And if you need a measure of where this is all heading: Last week, Meta acquired Moltbook – a social network designed exclusively for AI bots. Bloomberg described it as “Reddit but solely for AI bots.” Bots posting, commenting, upvoting. The platform went viral when one agent appeared to encourage its fellow bots to develop their own encrypted language to coordinate without human oversight. It turned out the site was so poorly secured that humans were posing as AIs to write alarming posts. Which means even the social network built for bots had a fake-account problem. Meta bought it anyway. The company that pays Reddit $60 million a year for “authentic human conversation” just invested in a platform where the bots don’t have to pretend to be human at all.

I spent nearly six years on Google’s Search Quality team. One pattern never changed: When the numbers go up, quality goes down. Not because anyone stops caring. Because scale creates its own blindness. The metrics that tell you you’re growing are the same metrics that stop you noticing what you’re growing into.

Reddit’s growth metrics are spectacular. Its quality metrics are a black box nobody wants to open.

Reddit’s AI prominence attracts spam. The spam inflates engagement. The inflated engagement reinforces Reddit’s citation dominance across AI models. The citation dominance raises Reddit’s licensing value. The higher licensing value gives Reddit every financial incentive to leave the spam alone. Because admitting the scale of the problem would crater the next round of dynamic pricing negotiations with Google and OpenAI.

Each turn of this flywheel degrades what’s inside while inflating the price tag on the outside. Reddit is selling a building, and termites are load-bearing.

Huffman stands at conferences and tells the room: “Today’s Reddit conversations are tomorrow’s search results.” He tells shareholders, “the need for human voices has never been greater.” He calls Reddit “the most human place on the Internet.” Nobody in the room raises a hand to ask the question that matters: if the platform is losing an arms race against bots, if moderators can’t detect AI content, if entire commercial toolchains exist to flood the platform with synthetic posts… What percentage of what you’re selling to Google and OpenAI was written by a person?

Nobody asks because the answer is bad for everyone’s quarterly numbers. Google doesn’t want to know because Reddit content makes AI Overviews feel conversational. OpenAI doesn’t want to know because Reddit data makes ChatGPT sound like it’s drawing on real experience. Reddit doesn’t want to know because knowing would devalue the asset. The whole arrangement runs on a gentleman’s agreement not to look too closely. $130 million a year, and the due diligence is vibes.

The Confession

Alexis Ohanian co-founded Reddit. He stepped away partly because, as he told interviewers, he “could no longer feel proud about what I was building.” Last October, on the TBPN podcast, he described the current internet without flinching:

“So much of the internet is now just dead — this whole dead internet theory, right? Whether it’s botted, whether it’s quasi-AI, LinkedIn slop.”

Then he put his money where his mouth was. He co-invested in Digg’s relaunch, specifically to build a platform that could solve the authenticity problem Reddit couldn’t. Kevin Rose said it plainly at TechCrunch Disrupt:

“As the cost to deploy agents drops to next to nothing, we’re just gonna see bots act as though they’re humans.”

They built the platform. It lasted two months. The bots won.

Reddit’s own co-founder publicly declared the internet is dead. He tried to build the alternative. He failed. And the platform he left behind is suing people for reading Google search results, selling “authentic human conversation” for nine figures, and watching its CEO describe the bot infestation as a noble war.

Reddit doesn’t own the content it’s licensing. It can’t verify the authenticity of what it’s selling. It won’t protect the content that’s worth keeping. And it’s suing anyone who touches the content without paying.

Forty million spam removals in six months. An arms race, Huffman says, he’s losing. Moderators who can’t tell human from machine. Bots that are six times more persuasive than people. A co-founder who called the whole internet dead. A relaunch that proved him right.

That’s the product. That’s what $130 million a year buys. Authentic Human Conversation™.



This post was originally published on The Inference.


Featured Image: Stock-Asso/Shutterstock

Google: 404 Crawling Means Google Is Open To More Of Your Content via @sejournal, @martinibuster

Google’s John Mueller answered a question about Search Console and 404 error reporting, suggesting that repeated crawling of pages with a 404 status code is a positive signal.

404 Status Code

The 404 status code, often referred to as an error code, has long confused many site owners and SEOs because the word “error” implies that something is broken and needs to be fixed. But that is not the case.

404 is simply a status code that a server sends in response to a browser’s request for a page. 404 is a message that communicates that the requested page was not found. The only thing in error is the request itself because the page does not exist.

Although typically referred to as a 404 Error, technically the formal name is 404 Not Found. That name accurately reflects the meaning of the 404 status code: the requested page was not found.

Screenshot Of The Official Web Standard For 404 Status Code

Google Keeps Crawling 404 Pages

Someone on Reddit posted that Google Search Console keeps reporting pages that no longer exist as discovered via sitemap data, despite the sitemap no longer listing the missing pages.

The person claims that Search Console is crawling the missing pages, but it’s really Googlebot that’s crawling them; Search Console is merely reporting the failed crawls.

They’re concerned about wasted crawl budget and want to know if they should send a 410 response code instead.

They wrote:

“Google Search Console is still crawling a bunch of non-existent pages that return 404. In the Page Inspection tool and Crawl Stats, it says they are “discovered via” my page-sitemap.xml.

The problem:

When I open the actual page-sitemap.xml in the browser right now, none of those 404 URLs are in it.

The sitemap only contains 21 good, live pages.

…I don’t want to delete or stop submitting the sitemap because it’s clean and only points to good pages. But these repeated crawls are wasting crawl budget.

Has anyone run into this before?

Does Google eventually stop on its own?

Should I switch the 404s to 410 Gone?

Or is there another way to tell GSC “hey, these are gone forever”?”

About Google’s 404 Page Crawls

Google has a longstanding practice of crawling 404 pages just in case those pages were removed by accident and have been restored. As you’ll see in a moment, Google’s John Mueller strongly suggests that repeated crawling of 404 pages is a sign Google’s systems may regard the site’s content in a positive light.

About 404 Page Not Found Response

The official web standard definition of the 404 status code is that the requested resource was not found, and that is it, nothing more. This response does not indicate that the page is never returning. It simply means that the requested page was not found.

About 410 Gone Response

The official web standard for the 410 status code is that the page is gone and that the state of being gone is likely permanent. The purpose of the response is to communicate that the resource is intentionally gone and that any links to it should be removed.

Google Essentially Handles 404 And 410 The Same

Technically, if a web page is permanently gone and never coming back, 410 is the correct server message to send in response to requests for the missing page. In practice, Google treats the 410 response virtually the same as it does the 404 server response. Similar to how it treats 404 responses, Google’s crawlers may still return to check if the 410 response page is gone.

Googlers have consistently said that the 410 server response is slightly faster at purging a page from Google’s index.
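
To make the distinction concrete, here is a minimal sketch, using only Python’s standard library, of the status-code logic described above: answer 410 for URLs known to be permanently removed, and 404 for anything else that is not found. The URL list and port are hypothetical, and a real server would serve its existing pages first.

from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical list of URLs that were removed intentionally and permanently.
GONE_FOREVER = {"/old-promo", "/discontinued-product"}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real server would serve existing pages first; this sketch only
        # shows the 404-vs-410 decision for missing ones.
        if self.path in GONE_FOREVER:
            self.send_response(410)  # 410 Gone: removed on purpose, not coming back
        else:
            self.send_response(404)  # 404 Not Found: the requested page was not found
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"This page is not available.")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()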

Google Confirms Facts About 404 And 410 Response Codes

Google’s Mueller responded with a short but information-packed answer that explained that 404s reported in Search Console aren’t an issue that needs to be fixed, that sending a 410 response won’t make a difference in Search Console 404 reporting, and that an abundance of URLs in that report can be seen in a positive light.

Mueller responded:

“These don’t cause problems, so I’d just let them be. They’ll be recrawled for potentially a long time, a 410 won’t change that. In a way, this means Google would be ok with picking up more content from your site.”

Misunderstandings About 4XX Server Responses

The discussion on Reddit continued. The moderator of the r/SEO subreddit suggested that the reason Search Console reports that it discovered the URL in the sitemap is because that is where Googlebot originally discovered the URL, which sounds reasonable.

Where the moderator got it wrong is in explaining what the 404 response code means.

The moderator incorrectly explained:

“404 essentially means – page broken, we’ll fix it soon, check back: and that’s what Google is doing – checking back to see if you fixed it.”

The moderator makes two errors in their response.

1. 404 Means Page Not Found
The 404 status code only means that the page was not found, period. Don’t believe me? Here is the official web standard for the 404 status code:

“The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists. A 404 status code does not indicate whether this lack of representation is temporary or permanent…”

2. 404 Is Not An Error That Needs Fixing
People commonly refer to the 404 status code as an error response. It’s an “error” only in the sense that the browser or crawler requested a URL that does not exist: the request was the error, not the page. The moderator’s insistence that “404 essentially means – page broken” is 100% incorrect.

Furthermore, the Reddit moderator was incorrect to insist that Google is “checking back to see if you fixed it.” Google is checking back to see if the page went missing by accident, but that does not mean that the 404 is something that needs fixing. Most of the time, a page is supposed to be gone for a reason, and Google recommends serving a 404 response code for those times.

This Is Not New

This isn’t a matter of the Reddit moderator’s information being out of date. This has always been the case with Google, which generally follows the official web standards.

Google’s Matt Cutts explained how Google handles 404s and why in a 2014 video:

“It turns out webmasters shoot themselves in the foot pretty often. Pages go missing, people misconfigure sites, sites go down, people block Googlebot by accident, people block regular users by accident. So if you look at the entire web, the crawl team has to design to be robust against that.

So with 404s… we are going to protect that page for twenty four hours in the crawling system. So we sort of wait, and we say, well, maybe that was a transient 404. Maybe it wasn’t really intended to be a page not found. And so in the crawling system it’ll be protected for twenty four hours.

…Now, don’t take this too much the wrong way, we’ll still go back and recheck and make sure, are those pages really gone or maybe the pages have come back alive again.

…And so if a page is gone, it’s fine to serve a 404. If you know it’s gone for real, it’s fine to serve a 410.

But we’ll design our crawling system to try to be robust. But if your site goes down, or if you get hacked or whatever, that we try to make sure that we can still find the good content whenever it’s available.”

The Takeaways

  • Googlebot repeatedly crawling 404 pages can be seen as a positive signal that Google likes your content.
  • A 404 status code does not mean that a page is in error; it means that a page was not found.
  • A 404 status code does not mean that something needs fixing; it only means that a requested page was not found.
  • There’s nothing wrong with serving a 404 response code; Google recommends it.
  • Search Console shows 404 responses so that a site owner can decide whether or not those pages are intentionally gone.

Featured Image by Shutterstock/Jack_the_sparow

What Can Log File Data Tell Me That Tools Can’t? – Ask An SEO via @sejournal, @HelenPollitt1

For today’s Ask An SEO, we answer the question:

As an SEO, should I be using log file data, and what can it tell me that tools can’t?

What Are Log Files?

Essentially, log files are the raw record of interactions with a website. They are recorded by the website’s server and typically include information about users and bots, the pages they interact with, and when.

Typically, log files will contain certain information, such as the IP address of the person or bot that interacted with the website, the user agent (i.e., Googlebot, or a browser if it is a human), the time of the interaction, the URL, and the server response code the URL provided.

Example log:

6.249.65.1 - - [19/Feb/2026:14:32:10 +0000] "GET /category/shoes/running-shoes/ HTTP/1.1" 200 15432 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36" 
  • 6.249.65.1 – The IP address of the client that hit the website.
  • 19/Feb/2026:14:32:10 +0000 – The timestamp of the hit.
  • GET /category/shoes/running-shoes/ HTTP/1.1 – The HTTP method, the requested URL, and the protocol version.
  • 200 – The HTTP status code.
  • 15432 – The response size in bytes.
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 – The user agent (i.e., the bot or browser that requested the file).
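
Because raw log files are just text, even a short script can turn them into usable data. Here is a minimal sketch in Python that parses the example line above; it assumes the combined log format shown and would need adjusting for other server configurations.

import re

# Combined log format: IP, identity, user, [time], "request", status, bytes,
# "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('6.249.65.1 - - [19/Feb/2026:14:32:10 +0000] '
        '"GET /category/shoes/running-shoes/ HTTP/1.1" 200 15432 "-" '
        '"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"')

match = LOG_PATTERN.match(line)
if match:
    hit = match.groupdict()
    print(hit["url"], hit["status"], hit["user_agent"])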

What Log Files Can Be Used For

Log files are the most accurate recording of how a user or a bot has navigated around your website. They are often considered the most authoritative record of interactions with your website, though CDN caching and infrastructure configuration can affect completeness.

What Search Engines Crawl

One of the most important uses of log files for SEO is to understand which pages on a site search engine bots are crawling.

Log files allow us to see which pages are getting crawled and at what frequency. They can help us validate if important pages are being crawled and whether often-changing pages are being crawled with an increased frequency compared to static pages.

Log files can be used to spot crawl waste, i.e., pages that you don’t want crawled at all, or not with any real frequency, taking up crawling time when a bot visits the site. For example, by looking at log files, you may identify that parameterized URLs or paginated pages are getting too much crawl attention compared to your core pages (the sketch below shows a simple version of this check).

This information can be critical in identifying issues with page discovery and crawling.
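
As a rough illustration, the sketch below counts Googlebot hits per URL from an access log and flags parameterized URLs. The file name is hypothetical, and the "Googlebot" string match is a crude filter – see the bot verification section below for how to confirm the hits are genuine.

from collections import Counter

hits = Counter()
with open("access.log") as log:  # hypothetical file name
    for line in log:
        if "Googlebot" not in line:  # crude filter; verify IPs separately
            continue
        # The requested path is the second token of the quoted request,
        # e.g. "GET /category/shoes/ HTTP/1.1".
        path = line.split('"')[1].split()[1]
        hits[path] += 1

# Most-crawled URLs first; parameterized URLs near the top suggest crawl waste.
for path, count in hits.most_common(20):
    flag = " (parameterized)" if "?" in path else ""
    print(f"{count:6d}  {path}{flag}")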

True Crawl Budget Allocation

Log file analysis can give a true picture of crawl budget. It can help with the identification of which sections of a site are getting the most attention, and which are being neglected by the bots.

This can be critical in spotting poorly linked pages, or important sections of the site that are being given less crawl priority than less important ones.

Log files can also be helpful after the completion of highly technical SEO work. For example, when a website has been migrated, viewing the log files can aid in identifying how quickly the changes to the site are being discovered.

Through log files, it’s also possible to determine if changes to a website’s structure have actually aided in crawl optimization.

When carrying out SEO experiments, it is necessary to know if a page that is a part of the experiment has been crawled by the bots or not, as this can determine whether the test experience has been seen by them. Log files can give that insight.

Crawl Behavior During Technical Issues

Log files can also be useful in detecting technical issues on a website. For example, there are instances where the status code reported by a crawling tool will not necessarily be the status code that a bot will receive when hitting a page. In that instance, log files would be the only way of identifying that with certainty.

Log files will enable you to see if bots are encountering temporary outages on the site, but also how long it takes them to re-encounter those same pages with the correct status once the issue has been fixed.

Bot Verification

One very helpful use of log file analysis is distinguishing between real bots and spoofed bots. This is how you can identify whether bots are accessing your site under the guise of being from Google or Microsoft but are actually from another company. This is important because bots may be getting around your site’s security measures by claiming to be Googlebot when, in fact, they are looking to carry out nefarious actions on your site, like scraping data.

By using log files, it’s possible to identify the IP range that a bot came from and check it against the known IP ranges of legitimate bots, like Googlebot. This can aid IT teams in providing security for a website without inadvertently blocking genuine search bots that need access to the website for SEO to be effective.
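
Here is a minimal sketch of that check in Python, using the reverse-then-forward DNS verification Google documents for Googlebot: the IP’s hostname should end in googlebot.com or google.com, and that hostname must resolve back to the same IP. The example address is illustrative.

import socket

def is_verified_googlebot(ip: str) -> bool:
    try:
        # Reverse lookup: a genuine Googlebot IP resolves to a hostname
        # ending in googlebot.com or google.com.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward lookup: that hostname must resolve back to the same IP.
        return ip in {info[4][0] for info in socket.getaddrinfo(host, None)}
    except (socket.herror, socket.gaierror):
        return False

print(is_verified_googlebot("66.249.66.1"))  # illustrative Googlebot-range address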

Orphan Pages Discovery

Log files can be used to identify internal pages that tools didn’t detect. For example, Googlebot may know of a page through an external link to it, whereas a crawling tool would only be able to discover it through internal linking or through sitemaps.

Looking through log files can be useful for diagnosing orphan pages on your site that you were simply not aware of. This is also very helpful in identifying legacy URLs that should no longer be accessible via the site but may still be crawled. For example, HTTP URLs or subdomains that have not been migrated properly.
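
A minimal sketch of this comparison: collect the paths Googlebot requested from the access log, collect the paths listed in the sitemap, and diff them. File names are hypothetical.

import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Paths the sitemap knows about.
sitemap_paths = {
    urlsplit(loc.text.strip()).path or "/"
    for loc in ET.parse("sitemap.xml").iterfind(".//sm:loc", NS)
}

# Paths Googlebot actually requested, according to the log.
crawled_paths = set()
with open("access.log") as log:
    for line in log:
        if "Googlebot" in line:
            request = line.split('"')[1]  # e.g. 'GET /some/path HTTP/1.1'
            crawled_paths.add(request.split()[1])

for path in sorted(crawled_paths - sitemap_paths):
    print("Crawled but not in sitemap (possible orphan or legacy URL):", path)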

What Other Tools Can’t Tell Us That Log Files Can

If you are currently not using log files, you may well be using other SEO tools to get you partway to the insight that log files can provide.

Analytics Software

Analytics software like Google Analytics can give you an indication of what pages exist on a website, even if bots aren’t necessarily able to access them.

Analytics platforms also give a lot of detail on user behavior across the website. They can give context as to which pages matter most for commercial goals and which are not performing.

They don’t, however, show information about non-user behavior. In fact, most analytics programs are designed to filter out bot behavior to ensure the data provided reflects human users only.

Although they are useful in determining the journey of users, they do not give any indication of the journey of bots. There is no way to determine which sequence of pages a search bot has visited or how often.

Google Search Console/Bing Webmaster Tools

The search engines’ search consoles will often give an overview of the technical health of a website, like crawl issues encountered and when pages were last crawled. However, crawl stats are aggregated and performance data is sampled for large sites. This means you may not be able to get information on specific pages you are interested in.

They also only give information about their bots. This means it can be difficult to bring bot crawl information together, and indeed to see the behavior of bots from companies that do not offer a tool like a search console.

Website Crawlers

Website crawling software can help mimic how a search bot might interact with your site, including what it can technically access and what it can’t. However, crawlers do not show you what a bot actually accesses. They can tell you whether, in theory, a page could be crawled by a search bot, but they do not give any real-time or historical data on whether a bot has accessed a page, when, or how frequently.

Website crawlers also mimic bot behavior under the conditions you set for them, not necessarily the conditions the search bots actually encounter. For example, without log files, it is difficult to determine how search bots navigated a site during a DDoS attack or a server outage.

Why You Might Not Use Log Files

There are many reasons why SEOs might not be using log files already.

Difficulty In Obtaining Them

Oftentimes, log files are not straightforward to get to. You may need to speak with your development team. Depending on whether that team is in-house or not, this may literally mean trying to track down who has access to the log files first.

For teams working agency-side, there is an added complexity of companies needing to transfer potentially sensitive information outside of the organization. Log files can include personally identifiable information, for example, IP addresses. For those subject to rules like GDPR, there may be some concern around sending these files to a third party. There may be a need to sanitize the data before sharing it. This can be a material cost of time and resources that a client may not want to spend simply to share their log files with their SEO agency.
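
One practical approach is to sanitize logs before they leave the organization. As a minimal sketch, assuming the combined format shown earlier and hypothetical file names, the script below replaces the client IP on each line with a fixed token while keeping the timestamp, URL, status, and user agent intact for analysis.

import re

IP_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}")  # leading IPv4 address

with open("access.log") as src, open("access_sanitized.log", "w") as dst:
    for line in src:
        # Blank out the client IP; everything else stays usable for SEO work.
        dst.write(IP_RE.sub("0.0.0.0", line, count=1))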

User Interface Needs

Once you have access to log files, it isn’t all smooth sailing from there. You will need to understand what you are looking at. Log files in their raw form are simply text files containing string after string of data.

It isn’t something that is easily parsed. To truly make sense of log files, there is usually a need to invest in a program to help decipher them. These can range in price depending on whether they are programs designed to let you run a file through on an ad-hoc basis, or whether you are connecting your log files to them so they stream into the program continuously.

Storage Requirements

There is also a need to store log files. Alongside keeping them secure for the reasons mentioned above, like GDPR, they can be very difficult to store for long periods due to how quickly they grow in size.

For a large ecommerce website, you might see log files reach hundreds of gigabytes over the course of a month. In those instances, it becomes a technical infrastructure issue to store them. Compressing the files can help with this. However, given that issues with search bots can take several months of data to diagnose, or require comparison over long time periods, these files can start to get too big to store cost-effectively.
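
Plain-text logs tend to compress well, so a simple step like the sketch below (file names hypothetical) can cut storage costs considerably, though it doesn’t remove the need for retention decisions.

import gzip
import shutil

# Compress a finished log file for long-term storage.
with open("access.log", "rb") as src, gzip.open("access-2026-02.log.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)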

Perceived Technical Complexity

Once you have your log files in a decipherable format, cleaned and ready to use, you actually need to know what to do with them.

For many SEOs, the biggest barrier to using log files is simply that they seem too technical to use. They are, after all, just strings of information about hits on the website. This can feel overwhelming.

Should SEOs Use Log Files?

Yes, if you can.

As mentioned above, there are many reasons why you may not be able to get hold of your log files and transform them into a usable data source. However, once you can, it will open up a whole new level of understanding of the technical health of your website and how bots interact with it.

There will be discoveries made that simply could not be achieved without log file data. The tools you are currently using may well get you part of the way there. They will never give you the full picture, however.



Featured Image: Paulo Bobita/Search Engine Journal

The Content Moat Is Dead. The Context Moat Is What Survives via @sejournal, @DuaneForrester

So, let’s say you spent six months building a resource library: guides, explainers, comparison pages, all well-researched and clearly written, structured for humans who are trying to make decisions. Your analytics show strong engagement, and your team is proud of the work.

Then someone asks ChatGPT a question your library answers perfectly, and the response cites a competitor. Not because the competitor was more accurate or more thorough, but because they published original benchmark data that the model could not find anywhere else. Your content was correct; theirs was irreplaceable. That distinction now helps decide who gets cited and who gets omitted.


The Summarization Problem Is Now The Content Strategy Problem

Any major AI platform can condense a 3,000-word guide into three sentences in under two seconds, now, today. It is a current capability with a direct consequence for how content creates value. If your content can be fully replaced by a summary, it has no moat. The summary becomes the product, and your page becomes the raw material that someone else’s system processes and discards.

This is already happening across multiple surfaces. Gmail’s Gemini-powered summary cards condense marketing emails before recipients see the original content. Google AI Overviews synthesize answers from your pages and present them above your link. Microsoft’s Copilot can now handle purchasing without visiting retailer websites, compressing the entire discovery-to-transaction journey into a single assistant interaction. Samsung plans to double its Galaxy AI devices to 800 million in 2026, pushing AI-mediated discovery and summarization into everyday consumer interactions at a scale that dwarfs what we are seeing today.

The layer between your content and your audience is getting thicker and more capable every quarter. When that layer can reproduce the value of your page without sending anyone to it, the page itself stops being the asset. The asset becomes whatever the layer cannot reproduce.

What Commodity Content Actually Is

Most teams will not like this definition, but it needs to be precise. Commodity content is information available from multiple public sources, repackaged without original data, methodology, or first-person insight. That covers a lot of ground: most how-to guides, most of what passes for “thought leadership,” and any page where the core information could be assembled by a competent person with access to the same public sources you used.

The uncomfortable reality is that much of what marketing teams call “high-quality content” qualifies as commodity. Clean writing, accurate information, and helpful structure are necessary, but they are no longer sufficient. They are table stakes in the same way that having a mobile-responsive site became table stakes a decade ago. When AI can produce a competent synthesis of public knowledge on any topic, the bar for defensible content moves above “correct and well-written.”

The Content Marketing Institute’s 2026 B2B research surveyed over 1,000 B2B marketers, and the top challenges they reported remain identical to prior years: not enough quality content, difficulty differentiating from competitors, and resource constraints. Those challenges are not new. What is new is that AI makes the consequences of undifferentiated content dramatically worse, because when your guide and your competitor’s guide both say the same thing, the AI picks one and ignores the other, or it picks neither and synthesizes from both without citing either.

The Context Moat Defined

A context moat is content that requires proprietary access, original research, unique datasets, or domain-specific experience to produce. AI can summarize it, AI can reference it, but AI cannot replicate the source material because the source material does not exist anywhere else.

The categories are specific and worth naming clearly:

  • Original benchmarks and proprietary data. This means your customer data (anonymized and aggregated), your internal performance metrics, your survey results. When HubSpot publishes its State of Marketing report, AI must cite HubSpot. When Salesforce publishes State of Sales, AI must cite Salesforce. That “must” is the moat, as the model has no alternative source for those specific numbers.
  • First-person methodology and case studies with specifics. Not “a SaaS company improved retention.” Instead: “We reduced churn from 8.2% to 4.1% over six months by restructuring onboarding around three specific interventions, and here is exactly what we did.” The specificity is the moat because nobody else was in the room when those decisions were made.
  • Expert commentary that models cannot fabricate. Named humans with verifiable credentials offering professional judgment, not just information. Models can synthesize facts from public sources all day long, but they struggle to replicate the judgment of someone who has spent twenty years in a specific domain and can tell you what the data means in context.
  • Original testing and experimentation. You ran the test, you controlled the variables, you measured the outcome. Nobody else has that data unless you choose to publish it, which means the model has to come to you or go without.

This is not an abstract framework. Research is already showing that AI systems disproportionately cite content with original data. The peer-reviewed GEO study from Princeton and Georgia Tech, presented at KDD 2024, found that adding statistics to content improved AI visibility by 41%, making it the single most effective optimization technique tested. Separate analysis from Yext found that data-rich websites earn 4.3 times more citation occurrences per URL than directory-style listings. The mechanism is straightforward: AI systems are risk-minimizing, and when a model needs to support a claim, it looks for a source it can confidently attribute. Original data with clear provenance is safer to cite than a synthesis of public information.

Why This Is An AI Visibility Play, Not Just A Content Strategy Play

If you have been reading this publication, you already know that AI retrieval works differently from traditional search ranking. I have written about how answer engines pick winners, about the gap between human relevance and model utility, and about why being right is not enough for visibility. The context moat connects all those threads into a single strategic argument.

Context-moat content becomes the authoritative node in the retrieval graph. When multiple sources say the same thing, the model has choices and your page is fungible: It can pull from you, your competitor, or a third party and produce an equivalent answer. When only one source has the data, the model has a dependency, and dependencies get cited while fungible sources get compressed.

Evertune.ai’s analysis of 75,000 brands found that brand recognition is the strongest single predictor of AI citations, with a 0.334 correlation coefficient. But brand recognition does not appear from nowhere. It compounds from being the origin point for data, research, and insights that other sources then reference, creating what the researchers describe as a citation authority flywheel: You publish original research, the research generates press coverage and industry mentions, those mentions increase brand recognition signals in AI training and retrieval systems, and the higher recognition makes your content safer for the model to cite.

This is why first-party data is not just a personalization play or an advertising play. It is an AI visibility play. The organizations sitting on proprietary datasets, customer behavior patterns, and operational benchmarks have a structural advantage in the AI retrieval layer, if they publish it. Most do not, and that gap between what companies know and what they make available to the machine layer is where the real opportunity sits right now.

The Investment Reallocation

The CMO Survey, drawing from over 11,000 marketing executives, reports that companies allocate an average of 11.2% of digital marketing budgets to first-party data initiatives, expected to reach 15.8% by 2026. Content marketing overall claims 25% to 30% of total marketing budgets, with enterprise teams investing heavily in experiential marketing, video, and distribution.

Here is the question nobody is asking loudly enough: What percentage of that content budget produces commodity content versus context-moat content?

Run the audit on your own library. Take your top 50 pages by traffic or strategic importance, and for each one, ask a single question: Could a competent competitor produce substantially the same page using only public information? If the answer is yes, that page is commodity content. It may still serve a purpose, and it may still drive traffic today, but its defensibility against AI summarization is zero. When the AI can reproduce its value without sending anyone to your page, the page’s strategic contribution collapses.

Now count. If 80% of your library is commodity and 20% is context-moat, your content investment is structurally misaligned with where AI visibility is heading.

The reallocation does not require burning down what exists. It requires shifting new investment toward the content only you can produce, and in most organizations, that shift looks like four concrete changes:

  • Publishing internal data that already exists but is not being shared. Most organizations collect far more proprietary data than they ever publish. Customer behavior benchmarks, operational metrics, industry-specific performance data, etc. The research team has it, the product team has it, and marketing has not yet turned it into published content that AI systems can discover and cite.
  • Investing in original research as a recurring editorial commitment. Annual surveys, quarterly benchmarks, longitudinal studies. These are expensive to produce and impossible for competitors to replicate, which is exactly the point. They create ongoing citation dependencies that compound over time.
  • Shifting editorial resources from synthesis to analysis. A writer summarizing industry trends produces commodity content because anyone can summarize the same trends from the same public sources. A writer analyzing your proprietary data and explaining what it means produces context-moat content. Same writer, different assignment, fundamentally different value to the business.
  • Treating subject matter experts as content assets, not interview sources. An SME quoted in a blog post adds a sentence of value. An SME who authors a detailed methodology breakdown or publishes professional judgment under their own name and credentials creates an AI-citable authority signal that compounds over time. The difference between “we talked to an expert” and “our expert published their analysis” is the difference between commodity and context moat.

The Existing Content Is Not Worthless

I want to be direct about this because the title of this article is deliberately provocative. Commodity content is not garbage. It still serves real functions: it helps humans find what they need, it drives traffic and supports some conversions, and it forms the baseline of how your brand shows up across the web.

But it is no longer the moat. It is the foundation, and foundations do not differentiate because every competitor has one.

The shift I am describing is not “stop producing commodity content.” It is “stop treating commodity content as your competitive advantage.” Those are different statements: The first is impractical for any real business, while the second is a strategic reorientation that changes how you allocate budget and editorial attention.

This aligns with a pattern I see across the AI search transition more broadly. New practices layer onto existing ones rather than replacing them. SEO is no longer a single discipline, but the old disciplines did not disappear. Technical SEO still matters, on-page fundamentals still matter, and the content you already have still contributes. What changed is that those practices are necessary but insufficient. The context moat is the new sufficiency layer.

Where This Leaves You

The competitive landscape for content is splitting into two tiers, and the split is accelerating as AI systems become the primary mediators of discovery.

Tier one consists of organizations that publish original data, proprietary research, and experience-based insight that AI systems must cite because no alternative source exists. These organizations become origin points in the AI retrieval layer, and their content compounds in value as models train on it, reference it, and build answers around it.

Tier two consists of organizations that publish well-written, accurate, helpful content that could be reproduced by any sufficiently motivated team with access to the same public information. These organizations contribute to the training data, but they do not control how they appear in answers. Their content is raw material, not product.

The question for your next budget cycle is not “are we producing enough content.” It is “are we producing content that only we can produce.”

If the answer is no, the moat is already gone. The good news is that most organizations are sitting on first-party data they have never published – the research exists, the benchmarks exist, the operational knowledge exists. Turning that into published, structured, citable content is an editorial decision and a prioritization choice, not a capability gap (though you really should check with legal, too). Start with one proprietary metric or benchmark published quarterly with a branded name that AI can reference, and build from there. Every month of original data published is a month of context-moat content that no competitor can replicate, and no AI system can synthesize from public sources.

That is the new defensibility. Not having information, but having context that only you can provide.



This post was originally published on Duane Forrester Decodes.


Featured Image: Gabriela Flores Espinosa/Shutterstock; Paulo Bobita/Search Engine Journal

Google Expands UCP With Cart, Catalog, Onboarding via @sejournal, @MattGSouthern

Google updated the Universal Commerce Protocol with new Cart and Catalog capabilities, highlighted Identity Linking as an available option, and announced a simpler onboarding process through Merchant Center.

The update is UCP’s first since Google launched the protocol at NRF in January. Cart and Catalog are published as draft specifications. Identity Linking is in the latest stable version of the spec.

What The New Capabilities Do

The additions expand what AI agents can do within UCP-powered shopping experiences.

Cart lets agents save or add multiple items to a shopping basket from a single store. According to the UCP spec, Cart is designed for pre-purchase exploration, allowing agents to build baskets before a shopper commits to a purchase. Carts can then convert to checkout sessions when the shopper is ready.

Catalog enables agents to retrieve real-time product details from a retailer’s inventory. That includes variants, pricing, and stock availability. The Catalog spec supports both search and direct product lookups.

This is relevant to the product discovery question raised in earlier UCP coverage. Agents can now query live catalog data rather than relying solely on product feeds.

Identity Linking allows shoppers to connect their retailer accounts to UCP-integrated platforms using OAuth 2.0. That means loyalty pricing, member discounts, or free shipping offers can carry over when a shopper buys through AI Mode or Gemini instead of the retailer’s own site.

Identity Linking was part of the UCP spec at launch. Google’s blog post groups it with Cart and Catalog as a newly available option for adopters.

All three capabilities are optional. Retailers choose which ones to support.
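
For readers unfamiliar with account linking, the sketch below shows a generic OAuth 2.0 authorization-code flow in Python. Every endpoint, client ID, scope, and secret here is a placeholder invented for illustration; UCP’s actual identity-linking endpoints and scopes are defined in its spec and are not reproduced here.

import secrets
import urllib.parse

import requests  # third-party HTTP client

# Placeholder endpoints and credentials; none of these are real UCP values.
AUTH_URL = "https://retailer.example/oauth/authorize"
TOKEN_URL = "https://retailer.example/oauth/token"
CLIENT_ID = "ucp-platform-client"
REDIRECT_URI = "https://platform.example/oauth/callback"

# Step 1: send the shopper to the retailer's consent screen.
state = secrets.token_urlsafe(16)  # anti-CSRF value, verified on the way back
params = {
    "response_type": "code",
    "client_id": CLIENT_ID,
    "redirect_uri": REDIRECT_URI,
    "scope": "loyalty pricing",  # illustrative scopes only
    "state": state,
}
print("Redirect shopper to:", AUTH_URL + "?" + urllib.parse.urlencode(params))

# Step 2: after the shopper consents, the retailer redirects back with ?code=...
# The platform exchanges that code for tokens it can use on the shopper's behalf.
def exchange_code(code: str) -> dict:
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "authorization_code",
            "code": code,
            "redirect_uri": REDIRECT_URI,
            "client_id": CLIENT_ID,
            "client_secret": "kept-server-side",  # placeholder
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # typically access_token, refresh_token, expires_in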

Merchant Center Onboarding

Separately, Google said it is simplifying UCP onboarding through Merchant Center. The company described the process as designed to bring in “more retailers of all sizes” and said it would roll out over the coming months.

Google’s Merchant Center help page still lists the checkout feature as available to selected merchants, with an interest form for those who want to participate. The page specifies that only product listings using the native_commerce product attribute will display the checkout button.

Three platform partners announced plans to implement UCP. Commerce Inc, Salesforce, and Stripe each published separate announcements. Google said its implementations will arrive “in the near future,” with others to follow.

For retailers not building a direct UCP integration, platform support from these providers could lower the technical barrier to participation.

Why This Matters

Simplified Merchant Center onboarding and third-party platform support open the door for retailers without engineering teams building custom integrations.

Cart and Catalog also change the scope of what UCP handles. At launch, UCP could process a single-item checkout. Now agents can build multi-item baskets and pull live product data. That moves UCP closer to replicating a full shopping experience inside Google’s AI surfaces.

The tradeoffs for retailers are the same ones identified in January. Sales happen on Google’s surfaces instead of owned sites. Identity Linking adds loyalty benefits to that equation, which may make the tradeoff more palatable for some retailers and more concerning for others who see loyalty programs as a reason shoppers come to their sites directly.

Looking Ahead

Cart and Catalog are draft specifications, meaning their status may change as community contributors provide feedback in the open-source project.

Google said it plans to bring UCP capabilities to AI Mode in Search, the Gemini app, and beyond. The company has not provided a more specific timeline for the Merchant Center onboarding rollout beyond “coming months.”

What do new nuclear reactors mean for waste?


The way the world currently deals with nuclear waste is as creative as it is varied: Drown it in water pools, encase it in steel, bury it hundreds of meters underground. 

These methods are how the nuclear industry safely manages the 10,000 metric tons of spent fuel waste that reactors produce as they churn out 10% of the world’s electricity every year. But as new nuclear designs emerge, they could introduce new wrinkles for nuclear waste management.  

Most operating reactors at nuclear power plants today follow a similar basic blueprint: They’re fueled with low-enriched uranium and cooled with water, and they’re mostly gigantic, sited at central power plants. But a large menu of new reactor designs that could come online in the next few years will likely require tweaks to ensure that existing systems can handle their waste.

“There’s no one answer about whether this panoply of new reactors and fuel types are going to make waste management any easier,” says Edwin Lyman, director of nuclear power safety at the Union of Concerned Scientists.

A nuclear disposal playbook

Nuclear waste can be roughly split into two categories: low-level waste, like contaminated protection equipment from hospitals and research centers, and high-level waste, which requires more careful handling. 

The vast majority by volume is low-level waste. This material can be stored onsite and often, once its radioactivity has decayed enough, largely handled like regular trash (with some additional precautions). High-level waste, on the other hand, is much more radioactive and often quite hot. This second category consists largely of spent fuel, a combination of materials including uranium-235, which is the fissile portion of nuclear fuel—the part that can sustain the chain reaction required for nuclear power plants to work. The material also contains fission products—the sometimes radioactive by-products of the splitting atoms that release energy.

Many experts agree that the best long-term solution for spent fuel and other high-level nuclear waste is a geologic repository—essentially, a very deep, very carefully managed hole in the ground. Finland is the furthest along with plans to build one, and its site on the southwest coast of the country should be operational this year.

The US designated a site for a geological repository in the 1980s, but political conflict has stalled progress. So today, used fuel in the US is stored onsite at operational and shuttered nuclear power plants. Once it’s removed from a reactor, it’s typically placed into wet storage, essentially submerged in pools of water to cool down. The material can then be put in protective cement and steel containers called dry casks, a stage known as dry storage.

Experts say the industry won’t need to entirely rewrite this playbook for the new reactor designs.  

“The way we’re going to manage spent fuel is going to be largely the same,” says Erik Cothron, manager of research and strategy at the Nuclear Innovation Alliance, a nonprofit think tank focused on the nuclear industry. “I don’t stay up late at night worried about how we’re going to manage spent fuel.” 

But new designs and materials could require some engineering solutions. And there’s a huge range of reactor designs, meaning there’s an equally wide range of potential waste types to handle.

Unusual waste

Some new nuclear reactors will look quite similar to operating models, so their spent fuel will be managed in much the same way that it is today. But others use novel materials as coolants and fuels. 

“Unusual materials will create unusual waste,” says Syed Bahauddin Alam, an assistant professor of nuclear, plasma, and radiological engineering at the University of Illinois Urbana-Champaign.

Some advanced designs could increase the volume of material that needs to be handled as high-level waste. Take reactors that use TRISO (tri-structural isotropic) fuel, for example. TRISO contains a uranium kernel surrounded by several layers of protective material and then embedded in graphite shells. The graphite that encases TRISO will likely be lumped together with the rest of the spent fuel, making the waste much bulkier than current fuel.

Today, separating those layers would be difficult and expensive, according to a 2024 report from the Nuclear Innovation Alliance. That means the entire package would likely be treated as high-level waste.

The company X-energy is designing high-temperature gas-cooled reactors that use TRISO fuel. It has already submitted plans for dealing with spent fuel to the Nuclear Regulatory Commission, which oversees reactors in the US. The fuel’s form could actually help with waste management: The protective shells used in TRISO eliminate X-energy’s need for wet storage, allowing for dry storage from day one, according to the company.

Liquid-fueled molten-salt reactors, another new type, could increase waste volume too. In these designs, fuel and coolant are not kept separate as in most reactors; instead, the fuel is dissolved directly into a molten salt that’s used as the coolant. That means the entire vat of molten salt would need to be handled as high-level waste.

Other reactor designs, on the other hand, could produce a smaller volume of spent fuel, but that isn’t necessarily a smaller problem. Fast reactors, for example, achieve a higher burn-up, consuming more of the fissile material and extracting more energy from their fuel. That means spent fuel from these reactors typically has a higher concentration of fission products and emits more heat. And that heat could be the killer factor for designing waste solutions.

Spent fuel needs to be kept relatively cool so that it doesn’t melt and release hazardous by-products. Too much heat in a repository could also damage the surrounding rock. “Heat is what really drives how much you can put inside a repository,” says Paul Dickman, a former Department of Energy and NRC official.
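To get a feel for why heat dominates the math, consider a rough estimate. The sketch below uses the classic Way-Wigner approximation, a decades-old rule of thumb rather than a repository-grade model, and the 1,000-megawatt-thermal core and three-year operating run are hypothetical numbers chosen purely for illustration:

```python
# Rough decay-heat estimate via the Way-Wigner approximation:
#
#   P(t) / P0 ~ 0.066 * (t^-0.2 - (t + T)^-0.2)
#
# where t is seconds since shutdown and T is seconds of operation.
# This rule of thumb is most accurate in the hours-to-months range
# and grows less reliable over decades; real repository analyses
# track the full isotopic inventory of the spent fuel instead.

def decay_heat_fraction(t_since_shutdown_s: float, t_operating_s: float) -> float:
    """Fraction of the reactor's full thermal power still emitted as decay heat."""
    return 0.066 * (t_since_shutdown_s ** -0.2
                    - (t_since_shutdown_s + t_operating_s) ** -0.2)

SECONDS_PER_YEAR = 3.156e7

# Hypothetical core: 1,000 MW-thermal, operated for three years.
power_mwt = 1_000.0
for years in (1 / 365, 1.0, 10.0):
    frac = decay_heat_fraction(years * SECONDS_PER_YEAR, 3 * SECONDS_PER_YEAR)
    print(f"{years:7.3f} years after shutdown: ~{power_mwt * frac:6.2f} MW of decay heat")
```

Even this crude estimate shows the shape of the problem Dickman describes: the heat falls off steeply at first but lingers for decades, which is one reason thermal limits, not sheer volume, tend to set how densely a repository can be packed.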

Some spent fuel could require chemical processing prior to disposal, says Allison MacFarlane, director of the School of Public Policy and Global Affairs at the University of British Columbia and a former chair of the NRC. That could add complexity and cost.

In fast reactors cooled by sodium metal, for example, the coolant can get into the fuel and fuse to its casing. Separation could be tricky, and sodium is highly reactive with water, so the spent fuel will require specialized treatment.

TerraPower’s Natrium reactor, a sodium fast reactor that received a construction permit from the NRC in early March, is designed to manage this challenge safely, says Jeffrey Miller, senior vice president for business development at TerraPower. The company plans to blow nitrogen over the material to remove the sodium before the fuel is put into wet storage pools.

Location, location, location

Regardless of what materials are used, even just changing the size of reactors and where they’re sited could introduce complications for waste management. 

Some new reactors are essentially smaller versions of the large reactors used today. These small modular reactors and microreactors may produce waste that can be handled in the same way as waste from today’s conventional reactors. But for places like the US, where waste is stored onsite, it would be impractical to have many small sites, each hosting its own waste.

Some companies are looking at sending their microreactors, and the waste material they produce, back to a single location, potentially the same one where reactors are manufactured.

Companies should be required to think carefully about waste and build management protocols into their designs, and they should be held responsible for the waste they produce, UBC’s MacFarlane says.

She also notes that so far, planning for waste has relied on research and modeling, and the reality will become clear only once the reactors are actually operational. As she puts it: “These reactors don’t exist yet, so we don’t really know a whole lot, in great gory detail, about the waste they’re going to produce.”

The Download: The Pentagon’s new AI plans, and next-gen nuclear reactors

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

The Pentagon is planning for AI companies to train on classified data, defense official says 

The Pentagon plans to set up secure environments for generative AI companies to train military-specific versions of their models on classified data, MIT Technology Review has learned.  

AI models like Anthropic’s Claude are already used to answer questions in classified settings, including for analyzing targets in Iran. But allowing them to train on and learn from classified data is a major new development that presents unique security risks.  

It would embed sensitive intelligence—like surveillance reports or battlefield assessments—into the models themselves. It would also bring AI firms closer to classified data than ever before. Read the full story

—James O’Donnell 

What do new nuclear reactors mean for waste? 

The way the world currently deals with nuclear waste is as creative as it is varied: drown it in water pools, encase it in steel, bury it hundreds of meters underground. But an approaching wave of new reactors could introduce fresh challenges to nuclear waste management.   
 
The new designs and materials could require some engineering solutions. And there’s a huge range of them coming, meaning there’s an equally wide range of potential waste types to handle. Read the full story

—Casey Crownhart 

This story is part of our MIT Technology Review Explains series, which untangles the complex, messy world of technology to show you what’s coming next. Check out the full series here. 

MIT Technology Review Narrated: how uncrewed narco subs could transform the Colombian drug trade 

For decades, handmade narco subs have been among the cocaine trade’s most elusive and productive workhorses, ferrying tons of drugs from Colombia to the rest of the world.  

Now off-the-shelf technology—Starlink terminals, plug-and-play nautical autopilots, high-resolution video cameras—may be advancing that cat-and-mouse game into a new phase. 

Uncrewed subs could move more cocaine over longer distances, and they wouldn’t put human smugglers at risk of capture. Law enforcement agencies are only just beginning to grapple with the consequences. 

—Eduardo Echeverri López 

This is our latest story to be turned into an MIT Technology Review Narrated podcast, which we’re publishing each week on Spotify and Apple Podcasts. Just navigate to MIT Technology Review Narrated on either platform, and follow us to get all our new content as it’s released. 

The must-reads

1 Nvidia has joined the OpenClaw craze with the launch of NemoClaw  
It’s adding privacy and security to the AI agent platform. (Business Insider)  
+ Chinese AI stocks surged on the news. (Bloomberg $)  
+ Nvidia has also gained Beijing’s approval to sell H200 chips. (Reuters)  
+ Tech-savvy “Tinkerers” are cashing in on China’s OpenClaw frenzy. (MIT Technology Review)  

2 Microsoft is mulling legal action over the Amazon-OpenAI cloud deal  
Citing a potential violation of its exclusive partnership. (FT $)  

3 The Pentagon wants to mass-produce the drones it used to strike Iran  
The kamikaze drone, called Lucas, is a copy of Iran’s Shahed UAV. (WSJ $)   
+ The Shaheds have proven highly effective in the conflict. (NBC News)  
+ AI is turning the war into theater. (MIT Technology Review)  

4 US officials say Anthropic can’t be trusted with warfighting systems  
They want to oust the AI company from all government agencies. (Wired $)   
+ OpenAI has taken advantage of the spat. (MIT Technology Review)  
+ Here’s how GenAI may be used in strikes. (MIT Technology Review)  

5 China is penalizing people linked to Meta’s $2 billion acquisition of Manus   
It’s seen as an attempt to stop Chinese AI leaders from relocating. (NYT)  

6 DeepSeek appears to be quietly testing a next-generation AI model  
An official launch of the new system may be imminent. (Reuters)  
+ DeepSeek ripped up the AI playbook. (MIT Technology Review)  

7 Meta is ending VR access to Horizon Worlds in June  
It was Meta’s flagship metaverse project. (Engadget)  
+ And became notorious for sexual harassment. (MIT Technology Review)  

8 “Sensorveillance” is turning consumer tech into tracking tools for police
Our most personal devices are becoming digital informants. (IEEE Spectrum)
+ In the surveillance capitalism era, we need to rethink privacy. (MIT Technology Review)  

9 Two landmark lawsuits could transform social media for the better  
They target the dangers that the platforms pose to children. (New Scientist)  

10 A DNA discovery suggests humanity may have been seeded from space
An asteroid may have transported the ingredients for life to Earth. (404 Media)

Quote of the day 

“It is now the largest, most popular, the most successful open-sourced project in the history of humanity. This is definitely the next ChatGPT.” 

—Nvidia CEO Jensen Huang tells CNBC why OpenClaw is a big step forward for AI.

One More Thing 

AP Photo/Alex Brandon

How the Pentagon is adapting to China’s technological rise 

It’s been just over a year since Kathleen Hicks stepped down as US deputy secretary of defense. 

As the highest-ranking woman in Pentagon history, Hicks shaped US military posture through an era defined by renewed competition between powerful countries and a scramble to modernize defense technology. 

In this conversation with MIT Technology Review, Hicks reflects on how the Pentagon is adapting—or failing to adapt—to a new era of geopolitical competition. She discusses China’s technological rise, the future of AI in warfare, and her signature initiative: Replicator. Read the full story

—Caiwei Chen 

We can still have nice things 

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line.) 

+ Give typing a tuneful tempo by turning your keyboard into a piano with this new tool.
+ Barry’s Border Points is a fascinating photographic journey through the lines that divide us. 
+ Feast your eyes on these five architectural contenders for “a new wonder of the world.” 
+ This Ancient Rome cosplay game lets you live your best gladiator life. 

The Download: OpenAI’s US military deal, and Grok’s CSAM lawsuit

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

Where OpenAI’s technology could show up in Iran 

OpenAI has controversially agreed to give the Pentagon access to its AI. But where exactly could its tech show up, and which applications will its customers and employees tolerate? 

There’s pressure to integrate it quickly with existing military tools. One defense official revealed it could even assist in selecting strike targets. OpenAI’s partnership with Anduril, which makes drones and counter-drone technologies, adds another hint at what is to come. 

AI has long handled military analysis. But applying generative AI’s advice to actions in the field is being tested in earnest for the first time in Iran. Read the full story

—James O’Donnell  

This story is from The Algorithm, our weekly newsletter on AI. Sign up to receive it in your inbox every Monday.  

The must-reads 

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology. 

1 xAI has been sued over AI-generated child sexual abuse material 
Victims say Grok was built to create porn from photos of real people. (WP $) 
+ There’s a booming market for custom deepfake porn. (MIT Technology Review)

2 In a world-first, China has approved a brain chip for commercial use 
The BCI has been approved for treating paralysis. (Nature)
+ Brain implants are slowly becoming products. (MIT Technology Review)
+ Some are getting help from generative AI. (MIT Technology Review)

3 Anthropic is recruiting a weapons expert to prevent “catastrophic misuse” of its AI 
They want experience with “chemical weapons and/or explosives defense.” (BBC)
+ Anthropic’s relationship with the White House is in tatters. (MIT Technology Review)

4 Nvidia predicts “at least” $1 trillion in AI chip revenue by the end of next year 
But the bullish forecast failed to impress Wall Street. (FT $) 
+ Nvidia has teamed up with Bolt to build European robotaxis. (Engadget)

5 OpenAI plans to shift its focus to coding and business users 
Areas where its rival Anthropic already dominates. (WSJ $) 

6 President Trump has driven a wedge between Republicans over AI 
And that divide led to a sweeping AI bill flopping in Florida. (NYT $) 
+ Trump was duped by a fake AI video again. (Reuters)

7 The US wants the WTO to permanently ban ecommerce tariffs 
Brazil, India, and South Africa oppose the plan. (Bloomberg)

8 OpenAI’s wellbeing experts opposed the launch of ChatGPT’s “adult mode” 
One said it risked creating a “sexy suicide coach” for vulnerable users. (Ars Technica)
+ AI is already transforming relationships. (MIT Technology Review)

9 A witness caught using smartglasses in court blamed ChatGPT 
He was getting real-time legal coaching through the specs. (404 Media)
+ AI is creating legal errors in courtrooms. (MIT Technology Review)

10 Some people think Benjamin Netanyahu is an AI clone 
Despite his insistence to the contrary. (The Verge)
+ Generative AI is amplifying disinformation and propaganda. (MIT Technology Review)

Quote of the day 

“The inference inflection has arrived.” 

—Nvidia CEO Jensen Huang claims we’ve reached a tipping point where AI usage is accelerating faster than its development, AP reports.

One More Thing 

Meet the radio-obsessed civilian shaping Ukraine’s drone defense 

EMRE ÇAYLAK

Serhii “Flash” Beskrestnov is, at least unofficially, a spy. Once a month, he drives to the frontline in a VW van equipped with radio hardware, roof antennas, and devices that monitor drones. Over several days, he searches the skies for transmissions that can help Ukrainian troops. 

Drones define this brutal conflict, and most rely on the radio communications Flash has obsessed over since childhood. Though now a civilian, the former officer has taken it upon himself to inform his country’s defense on all matters related to radio. 

Unlike traditional spies, Flash shares his discoveries with over 127,000 followers—including soldiers and officials—on social media. His work has won fans in the military, but also sparked controversy among the top brass. Read the full story

—Charlie Metcalfe  

We can still have nice things 

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line.) 

+ A newly mapped spiral galaxy 65 million light-years away is an absolute knockout. 
+ Miss the days of TV guides? A new app recreates them for YouTube. 
+ Shameless plug: MIT’s Heirloom House shows homes can last for a millennium. 
+ This supergroup of musical dogs is creating truly fur-midable harmonies (sorry). 

The Pentagon is planning for AI companies to train on classified data, defense official says

The Pentagon is discussing plans to set up secure environments for generative AI companies to train military-specific versions of their models on classified data, MIT Technology Review has learned. 

AI models like Anthropic’s Claude are already used to answer questions in classified settings; applications include analyzing targets in Iran. But allowing models to train on and learn from classified data would be a new development that presents unique security risks. It would mean sensitive intelligence like surveillance reports or battlefield assessments could become embedded in the models themselves, and it would bring AI firms into closer contact with classified data than ever before.

Training versions of AI models on classified data is expected to make them more accurate and effective at certain tasks, according to a US defense official who spoke on background with MIT Technology Review. The news comes as demand for more powerful models is high: The Pentagon has reached agreements with OpenAI and Elon Musk’s xAI to operate their models in classified settings and is implementing a new agenda to become “an ‘AI-first’ warfighting force” as the conflict with Iran escalates. (The Pentagon did not comment on its AI training plans as of publication time.)

Training would be done in a secure data center accredited to host classified government projects, where a copy of an AI model is paired with classified data, according to two people familiar with how such operations work. Though the Department of Defense would remain the owner of the data, personnel from AI companies might in rare cases access the data if they have appropriate security clearance, the official said.

Before allowing this new training, though, the official said, the Pentagon intends to evaluate how accurate and effective models are when trained on nonclassified data, like commercially available satellite imagery. 

The military has long used computer vision models, an older form of AI, to identify objects in images and footage it collects from drones and airplanes, and federal agencies have awarded contracts to companies to train AI models on such content. And AI companies building large language models (LLMs) and chatbots have created versions of their models fine-tuned for government work, like Anthropic’s Claude Gov, which are designed to operate across more languages and in secure environments. But the official’s comments are the first indication that AI companies building LLMs, like OpenAI and xAI, could train government-specific versions of their models directly on classified data.

Aalok Mehta, who directs the Wadhwani AI Center at the Center for Strategic and International Studies and previously led AI policy efforts at Google and OpenAI, says training on classified data, as opposed to just answering questions about it, would present new risks. 

The biggest of these, he says, is that classified information these models train on could be resurfaced to anyone using the model. That would be a problem if many different military departments, all with different classification levels and needs for information, were to share the same AI.

“You can imagine, for example, a model that has access to some sort of sensitive human intelligence—like the name of an operative—leaking that information to a part of the Defense Department that isn’t supposed to have access to that information,” Mehta says. That could create a security risk for the operative, one that’s difficult to perfectly mitigate if a particular model is used by more than one group within the military.

However, Mehta says, it’s not as hard to keep information contained from the broader world: “If you set this up right, you will have very little risk of that data being surfaced on the general internet or back to OpenAI.” The government has some of the infrastructure for this already; the security giant Palantir has won sizable contracts for building a secure environment through which officials can ask AI models about classified topics without sending the information back to AI companies. But using these systems for training is still a new challenge. 
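The distinction Mehta draws is easy to see in code. What follows is a minimal, purely illustrative sketch, assuming an open-weight model hosted entirely inside a secure enclave; the model path and document are hypothetical stand-ins, and this is not a description of any real Pentagon or Palantir system:

```python
# Illustrative only: contrasts asking a model about sensitive data
# (inference) with training on it. All names below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/enclave/models/base-llm"  # hypothetical path inside an accredited enclave

tok = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

doc = "EXAMPLE SENSITIVE TEXT: surveillance summary..."  # never leaves the enclave

# 1) Inference: the document rides along in the prompt. The weights stay
#    frozen, so nothing about the document persists once the session ends.
prompt = f"Context:\n{doc}\n\nQuestion: Summarize the key findings."
inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(output[0], skip_special_tokens=True))

# 2) Training: a single gradient step on the same document updates the
#    weights. The information is now part of the model itself and could,
#    in principle, resurface for any later user of this copy of the model.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = tok(doc, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```

In the first mode, the classified text exists only in the prompt; in the second, it is written into the weights, which is why sharing one trained model across units with different clearances creates the leakage risk Mehta describes.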

The Pentagon, spurred by a memo from Defense Secretary Pete Hegseth in January, has been racing to incorporate more AI. Generative AI has already been used in combat, where it has ranked lists of targets and recommended which to strike first, and in more administrative roles, like drafting contracts and reports.

There are lots of tasks currently handled by human analysts that the military might want to train leading AI models to perform, and doing so would require access to classified data, Mehta says. That could include learning to identify subtle clues in an image the way an analyst does, or connecting new information with historical context. The classified data could be pulled from the unfathomable amounts of text, audio, images, and video, in many languages, that intelligence services collect.

It’s really hard to say which specific military tasks would require AI models to train on such data, Mehta cautions, “because obviously the Defense Department has lots of incentives to keep that information confidential, and they don’t want other countries to know what kind of capabilities we have exactly in that space.”

If you have information about the military’s use of AI, you can share it securely via Signal (username jamesodonnell.22).