Google Explains Googlebot Byte Limits And Crawling Architecture via @sejournal, @MattGSouthern

Google’s Gary Illyes published a blog post explaining how Googlebot’s crawling systems work. The post covers byte limits, partial fetching behavior, and how Google’s crawling infrastructure is organized.

The post references episode 105 of the Search Off the Record podcast, where Illyes and Martin Splitt discussed the same topics. Illyes adds more details about crawling architecture and byte-level behavior.

What’s New

Googlebot Is One Client Of A Shared Platform

Illyes describes Googlebot as “just a user of something that resembles a centralized crawling platform.”

Google Shopping, AdSense, and other products all send their crawl requests through the same system under different crawler names. Each client sets its own configuration, including user agent string, robots.txt tokens, and byte limits.

When Googlebot appears in server logs, that’s Google Search. Other clients appear under their own crawler names, which Google lists on its crawler documentation site.

How The 2 MB Limit Works In Practice

Googlebot fetches up to 2 MB for any URL, excluding PDFs. PDFs get a 64 MB limit. Crawlers that don’t specify a limit default to 15 MB.

Illyes adds several details about what happens at the byte level.

He says HTTP request headers count toward the 2 MB limit. When a page exceeds 2 MB, Googlebot doesn’t reject it. The crawler stops at the cutoff and sends the truncated content to Google’s indexing systems and the Web Rendering Service (WRS).

Those systems treat the truncated file as if it were complete. Anything past 2 MB is never fetched, rendered, or indexed.

Every external resource referenced in the HTML, such as CSS and JavaScript files, gets fetched with its own separate byte counter. Those files don’t count toward the parent page’s 2 MB. Media files, fonts, and what Google calls “a few exotic files” are not fetched by WRS.
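To make the cutoff concrete, here is a minimal Python sketch of the truncation behavior described above. It rests on two assumptions: that 2 MB means 2 × 1024 × 1024 bytes (the post does not say whether the figure is decimal or binary), and that header bytes are subtracted from the same budget before the body. The function name and parameters are hypothetical, and this is a model for reasoning about page size, not Google's implementation.

```python
from typing import Tuple

# Illustrative model only; not Googlebot's actual code.
# Assumption: "2 MB" is treated as 2 * 1024 * 1024 bytes.
GOOGLEBOT_LIMIT = 2 * 1024 * 1024


def simulate_fetch(header_bytes: int, body: bytes,
                   limit: int = GOOGLEBOT_LIMIT) -> Tuple[bytes, int]:
    """Return (body bytes kept, body bytes dropped) for one resource.

    Headers are assumed to consume part of the same budget as the body.
    """
    body_budget = max(limit - header_bytes, 0)
    kept = body[:body_budget]
    dropped = len(body) - len(kept)
    return kept, dropped


if __name__ == "__main__":
    # A synthetic page of roughly 3 MiB with 1 KiB of headers.
    page = b"<p>x</p>" * (3 * 1024 * 1024 // 8)
    kept, dropped = simulate_fetch(1024, page)
    print(f"kept={len(kept)} bytes, dropped={dropped} bytes")
```

Because each external CSS or JavaScript file gets its own counter, the same function would be applied once per resource rather than to the page and its assets combined.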

Rendering After The Fetch

The WRS processes JavaScript and executes client-side code to understand a page’s content and structure. It pulls in JavaScript, CSS, and XHR requests but doesn’t request images or videos.

Illyes also notes that the WRS operates statelessly, clearing local storage and session data between requests. Google’s JavaScript troubleshooting documentation covers implications for JavaScript-dependent sites.

Best Practices For Staying Under The Limit

Google recommends moving heavy CSS and JavaScript to external files, since those get their own byte limits. Meta tags, title tags, link elements, canonicals, and structured data should appear higher in the HTML. On large pages, content placed lower in the document risks falling below the cutoff.

Illyes flags inline base64 images, large blocks of inline CSS or JavaScript, and oversized menus as examples of what could push pages past 2 MB.
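One way to see how much of a page's weight comes from the items Illyes lists is to measure them directly. The sketch below is a rough heuristic, assuming the HTML is already available as bytes. It uses regular expressions rather than a full HTML parser, and the function and key names are hypothetical.

```python
import re

# Rough heuristics, not a full HTML parser.
DATA_URI = re.compile(rb"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")
INLINE_BLOCK = re.compile(
    rb"<(script|style)\b([^>]*)>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL
)
HAS_SRC = re.compile(rb"\bsrc\s*=", re.IGNORECASE)


def inline_weight(html: bytes) -> dict:
    """Return byte counts for base64 data URIs and inline script/style bodies."""
    base64_bytes = sum(len(m.group(0)) for m in DATA_URI.finditer(html))
    inline_bytes = 0
    for m in INLINE_BLOCK.finditer(html):
        tag, attrs, body = m.group(1), m.group(2), m.group(3)
        # A script tag with a src attribute points at an external file,
        # so it does not add inline code to the page.
        if tag.lower() == b"script" and HAS_SRC.search(attrs):
            continue
        inline_bytes += len(body)
    return {
        "total": len(html),
        "base64": base64_bytes,
        "inline_script_style": inline_bytes,
    }
```

The two counts are not disjoint: a base64 image embedded inside an inline style block is counted in both, so treat the numbers as indicators of where the bytes go rather than as an exact budget.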

The 2 MB limit “is not set in stone and may change over time as the web evolves and HTML pages grow in size.”

Why This Matters

The 2 MB limit and the 64 MB PDF limit were first documented as Googlebot-specific figures in February. HTTP Archive data showed most pages fall well below the threshold. This blog post adds the technical context behind those numbers.

The platform description explains why different Google crawlers behave differently in server logs and why the 15 MB default differs from Googlebot’s 2 MB limit. These are separate settings for different clients.

HTTP header details matter for pages near the limit. Google states headers consume part of the 2 MB limit alongside HTML data. Most sites won’t be affected, but pages with large headers and bloated markup might hit the limit sooner.

Looking Ahead

Google has now covered Googlebot’s crawl limits in documentation updates, a podcast episode, and a dedicated blog post within a two-month span. Illyes’ note that the limit may change over time suggests these figures aren’t permanent.

For sites with standard HTML pages, the 2 MB limit isn’t a concern. Pages with heavy inline content, embedded data, or oversized navigation should verify that their critical content is within the first 2 MB of the response.


Featured Image: Sergei Elagin/Shutterstock

The Science Of What AI Actually Rewards via @sejournal, @Kevin_Indig


In “The Science Of How AI Pays Attention,” I analyzed 1.2 million ChatGPT responses to understand exactly how AI reads a page. In “The Science Of How AI Picks Its Sources,” I analyzed 98,000 citation rows to understand which pages make it into the reading pool at all.

This is Part 3.

Where Part 1 told you where on a page AI looks, and Part 2 told you which pages AI routinely considers, this one tells you what AI actually rewards inside the content it reads.

The data clarifies:

  • Most AI SEO writing advice doesn’t hold at scale. There is no universal “write like this to get cited” formula – the signals that lift one industry’s citation rates can actively hurt another.
  • The entity types that predict citation are not the ones being targeted. DATE and NUMBER are universal positives. PRICE suppresses citation in five of six verticals, and KG-verified entities are a negative signal.
  • The one writing signal that holds across all seven verticals: Declarative language in your intro, +14% aggregate lift.
  • Heading structure is binary. Commit to the right number for your vertical or use none. Three to four headings are worse than zero in every vertical.
  • Corporate content dominates. Reddit doesn’t. AI citation behavior does not mirror what happened to organic search in 2023-2024.

1. Specific Writing Signals Influence Citation, While Others Harm It

While “The Science Of How AI Pays Attention” covers parts of the page and types of writing that influence ChatGPT visibility, I wanted to understand which writing-level signals – word count, structure, language style – predict higher AI citation rates across verticals.

Approach

  1. I compared high-cited pages (more than three unique prompt citations) vs. low-cited across seven writing metrics: word count, definitive language, hedging, list items, named entity density, and intro-specific signals.
  2. I analyzed the first 1,000 words for list item count, named entity density, intro definitive language token density, and intro number count.

Results: Across all verticals, definitive phrasing and including relevant entities matter. But most signals are flat.

Image Credit: Kevin Indig

What The Industry Patterns Showed

When splitting the data up by vertical, we suddenly see preferences:

  • Total word count was strongest in CRM/SaaS (1.59x).
  • Finance was an anomaly with word count: Shorter pages win (0.86x word count).
  • Definitive phrases in the first 1,000 characters were positive for most verticals.
  • Education is a signal void. Writing style explains almost nothing about citation likelihood there.

Image Credit: Kevin Indig

Top Takeaways

1. There is no universal “write like this to get cited” formula. For example, the signals that lift CRM/SaaS citation rates actively hurt Finance. Instead, match content format to vertical norms.

2. The one universal rule: open with a direct declarative statement. Not a question, not context-setting, not preamble. The form is “[X] is [Y]” or “[X] does [Z].” This is the only writing instruction that holds regardless of vertical, content type, or length.

3. LLMs “penalize” hedging in your intro. “This may help teams understand” performs worse than “Teams that do X see Y.” Remove qualifiers from your opening paragraph before any other optimization.

2. The Entity Types That Predict Citation Are Not The Ones Being Targeted

Most AEO advice focuses on named entities as a category: Pack in more known brand names, tool names, numbers. The cross-vertical entity type analysis below tells a more specific (and more useful) story.

Approach

  1. Ran Google’s Natural Language API on the first 1,000 characters (about 200-250 words) of each unique URL.
  2. Computed lift per entity type: % of high-cited pages with that type / % of low-cited pages.
  3. Analyzed 5,000 pages across seven verticals.
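
The lift calculation in step 2 can be sketched in a few lines of Python. This is an illustration under assumptions, not the author's code: each page is represented as the set of entity types found in its intro, the function name is hypothetical, and the sample data is made up.

```python
def entity_lift(high_pages, low_pages, entity_type):
    """Share of high-cited pages containing entity_type, divided by the
    share of low-cited pages containing it."""
    if not high_pages or not low_pages:
        raise ValueError("both groups need at least one page")
    high_share = sum(entity_type in page for page in high_pages) / len(high_pages)
    low_share = sum(entity_type in page for page in low_pages) / len(low_pages)
    if low_share == 0:
        return None  # lift is undefined when no low-cited page has the type
    return high_share / low_share


if __name__ == "__main__":
    # Made-up toy data: one set of entity types per page intro.
    high = [{"DATE", "NUMBER"}, {"DATE"}, {"NUMBER", "PRICE"}, {"DATE", "NUMBER"}]
    low = [{"PRICE"}, {"DATE", "PRICE"}, {"NUMBER"}, {"PRICE"}]
    for entity_type in ("DATE", "NUMBER", "PRICE"):
        print(entity_type, entity_lift(high, low, entity_type))
```

A lift above 1.0 means the entity type appears more often on high-cited pages than on low-cited ones; a lift below 1.0 means the opposite.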

* A quick note on terminology: Google NLP classifies software products, apps, and SaaS tools as CONSUMER_GOOD, a legacy label from when the API was built for physical retail. Throughout this analysis, CONSUMER_GOOD means software/product entities.

Results: DATE and NUMBER are the most universal positive signals. Interestingly, PRICE is the strongest universal negative.

Image Credit: Kevin Indig

What The Industry Patterns Showed

  • DATE is the most universal positive signal, with the exception of Finance (0.65x).
  • NUMBER is the second most universal. Specific counts, metrics, and statistics in the intro consistently predict higher citation rates. Finance (0.98x) and Product Analytics (1.10x) mark the floor and ceiling of that range.
  • PRICE is the strongest universal negative. Pages that open with pricing signal commercial intent. Finance is the sole exception at 1.16x, likely because price here means fee percentages and rate comparisons, which are the actual reference data financial queries are looking for.
  • CONSUMER_GOOD (software/product entities) is mixed. In Healthcare, product entities signal established brands and tools. In Crypto, naming specific protocols and products is core to answering technical queries.
  • PHONE_NUMBER is a positive signal in Healthcare (1.41x) and Education (1.40x). In both cases, it is almost certainly a proxy for established brands/institutions/providers with real physical presence, not a literal signal to add phone numbers to your pages.

The Knowledge Graph inversion deserves its own note here:

  • The data showed that high-cited pages average 1.42 KG-verified entities vs. 1.75 for low-cited pages (lift: 0.81x).
  • Pages built around well-known, KG-verified entities (major brands, institutions, famous people) tend toward generic coverage, which isn’t preferred by ChatGPT.
  • High-cited pages are dense with specific, niche entities: a particular methodology, a precise statistic, a named comparison. Many of those niche entities have no KG entries at all. That specificity is what AI reaches for.

Top Takeaways

1. Add the publish date to your pages and aim to use at least one specific number in your content. That combination is the closest thing to a universal AI citation signal this dataset produced. But Finance gets there through price data and location specificity instead.

2. Avoid opening with pricing in non-finance verticals. Price-dominant intros correlate with lower citation rates.

3. KG presence and brand authority do not translate to an AI citation advantage. Chasing Wikipedia entries, brand panels, or KG verification is the wrong lever. Specific, niche entities (even ones without KG entries) outperform famous ones.

3. Heading Structure: Commit To One Or Don’t Bother

We know headings matter for citations from the previous two analyses. Next, I wanted to understand whether heading count predicts citation rates and whether the optimal structure varies by vertical.

Approach

  1. Counted total headings per page (H1+H2+H3) across all cited URLs.
  2. Grouped pages into 7 heading-count buckets: 0, 1-2, 3-4, 5-9, 10-19, 20-49, 50+.
  3. Computed high-cited rate (% of URLs that are high-cited) per bucket per vertical.
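
The bucketing and rate calculation above can be sketched as follows. The bucket edges come from step 2 and the high-cited threshold (more than three unique prompt citations) from the first analysis in this piece; the function names and the input tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# Bucket edges from the approach: 0, 1-2, 3-4, 5-9, 10-19, 20-49, 50+.
BUCKETS = [
    (0, 0, "0"), (1, 2, "1-2"), (3, 4, "3-4"), (5, 9, "5-9"),
    (10, 19, "10-19"), (20, 49, "20-49"), (50, None, "50+"),
]


def heading_bucket(count):
    """Map a total heading count (H1+H2+H3) to its bucket label."""
    for low, high, label in BUCKETS:
        if count >= low and (high is None or count <= high):
            return label
    raise ValueError("heading count must be non-negative")


def high_cited_rate(pages, threshold=3):
    """pages: iterable of (vertical, heading_count, unique_prompt_citations).

    Returns {(vertical, bucket): share of pages that are high-cited}.
    """
    totals = defaultdict(int)
    highs = defaultdict(int)
    for vertical, headings, citations in pages:
        key = (vertical, heading_bucket(headings))
        totals[key] += 1
        if citations > threshold:  # more than three unique prompt citations
            highs[key] += 1
    return {key: highs[key] / totals[key] for key in totals}
```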

Results: Including more headings in your content is not universally better. The sweet spot depends on vertical and content type. One finding holds everywhere: Strangely, 3-4 headings are worse than zero.

Image Credit: Kevin Indig

What The Industry Patterns Showed

  • CRM/SaaS is the only vertical where the 20+ heading lift is confirmed: 12.7% high-cited rate at 20-49 headings vs. a 5.9% baseline. The 50+ bucket reaches 18.2%. Long structured reference pages and comparison guides with one section per tool outperform everything else here.
  • Healthcare inverts most sharply. The high-cited rate drops from 15.1% at zero headings to 2.5% at 20-49 headings. A page with 30 H2s on telehealth topics signals optimization intent, not clinical authority.
  • Finance peaks at 10-19 headings (29.4% high-cited rate). Structured but not exhaustive: think rate tables, regulatory breakdowns, and advisor comparison pages with moderate heading depth.
  • Crypto peaks at five to nine headings (34.7% high-cited rate). Technical documentation in this vertical tends toward dense prose with moderate navigation structure. Over-structuring breaks up the technical depth.
  • Education is flat across all heading counts, which is consistent with the writing signals finding. Heading structure explains almost nothing about citation likelihood in education content.
  • The three to four heading dead zone holds across every vertical without exception. Partial structure confuses AI navigation without providing the full benefit of a committed hierarchy.

Top Takeaways

1. The 20+ heading finding from Part 1 is a CRM/SaaS finding, not a universal one. Applying it to healthcare, education, or finance could actively suppress citation rates in those verticals.

2. The principle that holds everywhere: Commit to structure or don’t use it. The middle ground costs you in every vertical. A fully-structured page with the right heading depth outperforms a half-structured page in every vertical.

3. Use the optimal heading range for your vertical. Crypto: 5-9. Finance and Education: 10-19. CRM/SaaS: 20+ (with H3s). Healthcare: 0 or 5-9 at most. Long CRM reference pages with 50+ sections are the one case where maximum heading depth pays off.

4. UGC Doesn’t Dominate

The “Reddit effect” reshaped organic search between 2024 and 2025. I wanted to understand whether ChatGPT cites user-generated content (Reddit, forums, reviews) at meaningful rates or whether corporate/editorial content dominates.

The common industry assumption – that AI also preferentially cites community voices – is not what we found in the data.

Approach

  1. Classified the cited URLs by domain as either (1) UGC: Reddit, Quora, Stack Overflow, Medium, Substack, Product Hunt, Tumblr, and forum subdomains or community/forum prefixes, or (2) corporate/editorial.
  2. Computed citation share per category per vertical.
  3. Dataset: 98,217 citations across 7 verticals.
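
A domain-based classifier along these lines might look like the sketch below. The domain list mirrors the platforms named in step 1, but the exact rules used in the study are not given, so the subdomain prefixes and function names here are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Platforms named in the approach; the study's full rule set is not published here.
UGC_DOMAINS = {
    "reddit.com", "quora.com", "stackoverflow.com", "medium.com",
    "substack.com", "producthunt.com", "tumblr.com",
}
# Hypothetical prefixes standing in for "community/forum prefixes".
UGC_PREFIXES = ("community.", "forum.", "forums.")


def classify(url):
    """Return "ugc" or "corporate" based on the URL's hostname."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host.startswith(UGC_PREFIXES):
        return "ugc"
    if any(host == d or host.endswith("." + d) for d in UGC_DOMAINS):
        return "ugc"
    return "corporate"


def citation_share(urls):
    """Return each category's share of the given cited URLs."""
    counts = Counter(classify(u) for u in urls)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

URLs need a scheme (for example `https://`) for the hostname to be parsed; anything without one falls through to "corporate" in this sketch.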

Results: Corporate content accounts for 94.7% of all citations. UGC is nearly invisible.

Image Credit: Kevin Indig

What The Industry Patterns Showed

  • Finance is the most corporate-locked vertical at 0.5% UGC. YMYL (Your Money, Your Life) content appears to systematically suppress citations to community opinion.
  • Healthcare sits at 1.8% UGC for the same structural reason. Clinical, telehealth, and HIPAA content draws almost exclusively from institutional sources.
  • Crypto has the highest UGC penetration in the dataset at 9.2%. Community-generated content (Reddit technical threads, Medium tutorials, developer forum posts) answers a meaningful proportion of analyzed queries. In a fast-moving technical niche where official documentation consistently lags, community posts fill the gap.
  • Product Analytics and HR Tech sit at 6.9% and 5.8% UGC. Both are verticals where Reddit comparison threads and product review communities provide genuine signal alongside corporate content.

Top Takeaways

1. The “Reddit effect” in SEO has not translated proportionally to AI citations. In most verticals, reddit.com captures 2-5% of total citations. This finding is in line with other industry research, including this report from Profound.

2. For finance and healthcare: UGC has near-zero AI citation value. Invest in structured, authoritative corporate content with clear sourcing. Community engagement may matter for other reasons, but it does not contribute meaningfully to AI citation share in these verticals.

3. For crypto, product analytics, and HR tech: Community presence has measurable citation value. Detailed Reddit comparison threads, technical Medium posts, and structured developer forum answers can supplement corporate content reach.

What This Means For How You Strategize For LLM Visibility

Across all three parts of this study, the consistent finding is that AI citation is not primarily a writing quality problem.

Part 2 showed it is a content architecture problem: Thin single-intent pages are structurally locked out regardless of how well they’re written. This piece shows the same logic applies inside the content itself.

The aggregate writing signals table is the most important chart in this analysis. Not because it shows you what to do, but because it shows how much of what the AI SEO/GEO/AEO industry is telling you doesn’t survive cross-vertical scrutiny. Word count, list density, named entity counts … all flat or negative at the aggregate. The signals that work are vertical-specific and smaller than our industry’s consensus implies.

The meta-lesson from this analysis is that findings are vertical (and probably topic) specific, which is no different in SEO.

This part concludes the Science of AI for now, because the AI ecosystem is constantly changing.

Methodology

We analyzed ~98,000 ChatGPT citation rows pulled from approximately 1.2 million ChatGPT responses from Gauge.

Because AI behaves differently depending on the topic, we isolated the data across seven distinct, verified verticals to ensure the findings weren’t skewed by one specific industry.

Analyzed verticals:

  • B2B SaaS
  • Finance
  • Healthcare
  • Education
  • Crypto
  • HR Tech
  • Product Analytics

Featured Image: CoreDESIGN/Shutterstock; Paulo Bobita/Search Engine Journal

So Your Traffic Tanked: What Smart CMOs Do Next

We’ve all seen it. Brands with healthy websites and excellent content have been watching their organic traffic from Google’s SERP erode for years. In a recent webinar hosted by Search Engine Journal, guest speaker Nikhil Lai, principal analyst of Performance Marketing for Forrester Research, estimated his clients are losing between 10 and 40% of organic and direct traffic year-over-year.

There is a bright spot, however: Lai said referral traffic from answer engines is growing 40% month over month. Visitors arriving from those engines convert at two to four times the rate of traditional search visitors, spend three times as long on site, and arrive with queries averaging 23 words, compared to the three or four words that defined the last decade of search.

Lai asserted that the channel driving this shift deserves a seat at the CMO’s table. Answer engines influence brand perception before purchase intent forms, which makes answer engine optimization (AEO) a brand investment, and puts budget and measurement decisions at the CMO level.

Here is the strategic roadmap Lai laid out at SEJ Live. He highlighted the decisions, org structures, and measurement frameworks that will move AEO from a search team initiative to a C-suite priority.

Answer Engines Build Demand Before Buyers Know What They Want

Classic search captures intent that already exists. A user types “running shoes,” clicks a result, and evaluates options. Answer engines operate earlier and differently: users hold extended conversations with large datasets, rarely click through, and leave those sessions with specific brand associations formed across multiple follow-up questions.

A user who once searched “running shoes” now asks ChatGPT, “What’s the best shoe for overpronation with wide feet in cold weather on pavement?” They exit that conversation with a brand name in mind and search for it directly. Your brand appeared in an AI conversation before the user ever reached your site. Every day, these research sessions generate demand.

The Forrester data Lai presented reinforces the quality of that exposure: Sessions on answer engines average 23 minutes, with users asking five to eight follow-up questions per session. Each turn is another brand impression. The click-through rate stays low; the conversion rate on the traffic that does arrive runs two to four times higher than search-sourced traffic, with stronger average order value and lifetime value.

Brand familiarity is built in answer engines before purchase intent crystallizes in the user’s mind.

SEO Is The Foundation Of AEO

The brands pulling back on SEO investment in response to AEO are making a costly mistake. Lai put it directly: 85 to 90% of current SEO best practices remain fully valid for answer engine visibility.

Google’s E-E-A-T framework (experience, expertise, authoritativeness, trustworthiness) still governs how quality is evaluated across every index. Site architecture, mobile load speed, structured data, and indexation hygiene all strengthen performance across every engine. Every alternative index (Bing’s, Brave’s) is benchmarked against Google’s for completeness. Every bot (GPTBot, Claudebot, Perplexitybot) is benchmarked against Googlebot for sophistication.

SEO is the infrastructure on which AEO runs. The shift is an expansion of scope and emphasis, but AEO is not a replacement of SEO fundamentals.

What changes is where additional effort goes: natural-language FAQ optimization, off-site authority building, pre-rendering for less sophisticated bots, and a measurement framework built around share of voice rather than click volume.

Bing Is Now Your Distribution Network For Every Non-Google Engine

Most answer engines outside Google draw primarily from Bing’s index.

Bing evaluates credibility by weighting what others say about your brand more heavily than what your own site claims. This explains why Reddit threads, Quora answers, Wikipedia entries, G2 reviews, YouTube videos, and Trustpilot pages dominate AI-generated answers. The off-site web has become the primary source of record for how AI describes your brand.

The immediate tactical implication: Push every sitemap update directly to Bing via the IndexNow protocol. This triggers Bingbot to crawl fresh content and feeds that content into Perplexity, ChatGPT, and the broader answer engine ecosystem faster than waiting for organic discovery.
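
IndexNow accepts a list of changed URLs rather than a sitemap file, so in practice a site would submit the URLs that were added or updated. The sketch below builds such a request with Python's standard library, using the JSON fields defined by the IndexNow protocol (`host`, `key`, `keyLocation`, `urlList`). The host, key, and URLs are placeholders, the function name is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

# Shared IndexNow endpoint; participating engines, Bing included,
# also expose their own endpoints.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls, key_location=None):
    """Build (but do not send) a POST request announcing changed URLs."""
    payload = {"host": host, "key": key, "urlList": list(urls)}
    if key_location:
        payload["keyLocation"] = key_location
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; the key must match a key file hosted on the site.
    req = build_indexnow_request(
        "www.example.com",
        "replace-with-your-key",
        ["https://www.example.com/new-page"],
    )
    # urllib.request.urlopen(req) would submit it; omitted so the sketch
    # has no side effects.
    print(req.full_url, req.get_method())
```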

Bing’s index remains the fastest route to non-Google answer engine visibility. Perplexity is building its own index (Sonar), and OpenAI has signaled plans to build or acquire one, but Bing is the distribution network that matters today.

AEO Requires Cross-Functional Ownership

AEO arguably spans more functions than SEO. It shares three with SEO (content, web development, and paid search) and interfaces more strongly with PR, brand marketing, and social media.

PR earns a seat because off-site authority outweighs on-site signals in AEO. Brand mentions in publications, influencer mentions, and third-party reviews all directly shape how answer engines describe your brand.

Social belongs in the room because Reddit threads and Facebook group discussions show up in AI-generated answers. Community management and reputation management, previously handled separately from SEO, are now integral to AEO. When your social listening data reaches content teams before they draft, the content responds to the questions buyers are actually asking. When it doesn’t, you’re optimizing for questions nobody asked.

Lai proposed two organizational models that work to capture the opportunities inherent in AEO:

  1. Center of Excellence: A senior SEO specialist evolves into an AEO evangelist, runs a COE, and publishes cross-functional standards: clear rules like “every piece of content must answer these five questions” or “every page must include author schema.”
  2. AI Orchestrator: A dedicated hire who builds agents to handle repeatable AEO tasks (schema implementation, JavaScript reduction, FAQ content creation) and governs the cross-functional workflow with published guidelines for all stakeholders.

The CMO’s decision is which model fits the organization’s scale, and whether to build it internally or partner with an agency that has already built the infrastructure.

The Content Strategy That Wins In AI Responses

Long-form skyscraper content is an ancient relic. Answer engines reward precise, specific answers to real questions, delivered succinctly and across multiple formats. Lai framed this as Forrester’s question-to-content framework: Every piece of content maps directly to a FAQ being asked on answer engines, including the follow-up questions that emerge within a single session.

Five content moves that produce results:

  1. Build surround-sound FAQ coverage. Create glossaries, FAQ pages, videos, and blog posts that address the same topic cluster from different angles. When Claudebot crawls 38,000 pages for every referred page visit (per Cloudflare data), each page it indexes is an opportunity to signal topical authority. Volume and variety matter.
  2. Publish direct competitor comparisons. Users ask answer engines to compare brands. Brands that create honest, data-backed comparison guides are gaining prominent visibility because those guides directly answer the queries that pit a brand against its competitors. This was once a taboo content format; it has become a competitive requirement.
  3. Treat off-site syndication as the new backlinking. Hosting AMAs on Reddit, answering questions on Quora, and contributing to industry publications that rank in AI responses all earn the off-site authority that answer engines weigh most heavily. Give third-party voices data and perspective they couldn’t generate themselves, and they will produce mentions that shape how AI describes your brand.
  4. Pre-render pages for bot access. The bots crawling your site lack the compute budget to render JavaScript-heavy pages. Claudebot’s 38,000:1 crawl-to-referral ratio compared to Googlebot’s 5:1 ratio reflects this sophistication gap. Pre-rendering a JavaScript-free version for bots while serving the full experience to human visitors ensures your content gets indexed across every engine. Over time, limit the amount of JavaScript on site. Have content directly in HTML so bots can understand your content, and index it more often. The more you’re crawled and indexed, the more visible you become.
  5. Create unique content. Lai said, “Being distinctive, differentiated, and unique will help your brand stand out in a sea of sameness. Implicit in all this is that you need a lot more content, greater content velocity and diversity, which means you can use AI to create content. Google won’t automatically penalize AI-created content unless it lacks the watermarks of human authorship. The syntax and diction have to be natural. Use AI to create content, but don’t make it seem AI-generated. Get down into the details. It’s not enough to say your product is great. Explain why in different temperatures, conditions, the thickness, and so on, to satisfy long-tail intent.”
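
For the pre-rendering move in item 4, the routing decision usually comes down to inspecting the request's user agent. The sketch below is a minimal illustration, assuming the three answer-engine bots named earlier identify themselves with these tokens; the file names and function names are hypothetical, and the token list should be checked against each vendor's current documentation.

```python
# User-agent tokens for the bots named in this article. Treat the list as
# an assumption and verify the current strings with each vendor.
BOT_TOKENS = ("gptbot", "claudebot", "perplexitybot")


def wants_prerendered(user_agent):
    """Return True when the user agent contains a known bot token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)


def choose_variant(user_agent):
    """Pick which hypothetical build of the page to serve."""
    return "prerendered.html" if wants_prerendered(user_agent) else "app.html"
```

User-agent strings can be spoofed, and the pre-rendered version should carry the same content as the version humans see, so this check is only the routing step and not a complete setup.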

Replace Legacy KPIs With Metrics That Predict Market Share

Lai described the internal conversation he hears most from Forrester clients: “The hardest part of this transition from SEO to AEO has been trying to convince management to not focus as much on CTR and traffic. Those were indicators of organic authority. They are no longer reliable indicators.

“The new KPIs to focus on are visibility and share of voice. Share of voice can be measured in many ways. The most common are citation share: how often is my brand cited, how often is my content linked, of the opportunities I have to be cited; and mention share: how often is my brand mentioned of the opportunities I have to be mentioned. I’m also seeing more clients look into citation attempts: how often is ChatGPT trying to cite my content, and are there things I can do on the back end of my site to make that citation attempt score go up? Those are the new indicators of authority,” said Lai.

These metrics connect directly to branded search volume, which Lai called “the single strongest leading indicator of market share growth.” The chain of logic to present to the board: higher citation and mention share drives more branded searches, which converts at higher rates, which compounds into measurable market share gains against competitors.
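
Citation share and mention share as Lai defines them are both simple ratios: appearances divided by opportunities. The sketch below assumes each tracked answer-engine response counts as one opportunity and that responses are available as dictionaries with the answer text and a list of cited URLs. Those structures and names are hypothetical, not the output format of any particular tool.

```python
def share(hits, opportunities):
    """Fraction of opportunities in which the brand appeared; None if none."""
    if opportunities == 0:
        return None
    return hits / opportunities


def share_of_voice(responses, brand, domain):
    """responses: list of dicts with 'text' (str) and 'cited_urls' (list of str)."""
    mentions = sum(brand.lower() in r["text"].lower() for r in responses)
    citations = sum(
        any(domain in url for url in r["cited_urls"]) for r in responses
    )
    return {
        "mention_share": share(mentions, len(responses)),
        "citation_share": share(citations, len(responses)),
    }
```

Tracking both ratios over the same set of prompts at regular intervals gives the directional trend, even if the absolute numbers from any one tool are imperfect.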

Lai said he expects Google to add citation metrics to Search Console once AI Max adoption reaches critical mass, and an OpenAI Analytics product before year-end.

For now, Lai suggested, the best course of action is to establish a baseline with your current SEO platform and track the directional trend. Addressing concerns about how accurately today’s popular SEO tools track answer engine mentions, Lai contended that even imperfect measurement reveals which content clusters are earning citations and which need rebuilding.

The Agentic Phase Starts The Clock On B2B Urgency

Answer engines are moving from conversation to action. The current phase, characterized by extended back-and-forth with large datasets, is the warm-up. The agentic phase is defined by engines booking, filing, researching, and purchasing on users’ behalf. This will mean fewer clicks, longer sessions, and richer intent signals available to advertisers.

For B2B CMOs, the urgency is immediate. Forrester research shows GenAI has already become the number one source of information for business buyers evaluating purchases of $1 million or more, coming in ahead of customer references, vendor websites, and social media. Your largest deals are being influenced by AI conversations before your sales team enters the picture.

AEO visibility in B2B is a current-pipeline variable that requires immediate attention.

The brands building complete search strategies now, covering answer engines, on-site conversational search, and structured data across every indexed channel, will own discovery and have greater control over brand perception in the next phase of buying behavior.

The window to gain an early-mover advantage is shrinking, and it will close once AEO visibility becomes a standard expectation everyone has to meet.

Key Takeaways For CMOs

  • Reframe the traffic story. Lower overall traffic volume paired with two-to-four-times higher conversion rates is a net performance gain. Build that case proactively before your CEO draws the wrong conclusion from a falling traffic chart.
  • Fund AEO as an upper-funnel brand channel. That means applying the same budget logic, measurement frameworks, and executive ownership you would bring to any major brand awareness investment, where success is measured in visibility, perception, and long-term share of voice rather than clicks and conversions.
  • Move to share-of-voice KPIs. Citation share and mention share drive branded search volume, which drives market share. Make that causal chain visible to your leadership team.
  • Assign cross-functional ownership with clear governance. Choose between a center of excellence or an AI orchestrator model and make that structural decision this quarter.
  • Prioritize off-site authority as a content strategy responsibility. Reddit, Quora, third-party publications, and YouTube shape AI’s perception of your brand. PR and social teams own the channels that matter most for AEO.
  • Push every sitemap update to Bing via IndexNow. Bing’s index feeds most non-Google answer engines. This is a 15-minute technical change with compounding distribution benefits.
  • Use AI to help with content, but always apply human editing for authority. Content that reads as machine-generated loses trust across every engine, including Google.

What Does A Smart CMO Do Next?

Start with a 90-day experiment using some or all of these strategies.

Audit your current citation and mention share in one category using your existing SEO platform. Identify three high-intent FAQ clusters where your brand should be visible and build surround-sound content for each: a dedicated FAQ page, a comparison guide, and one off-site piece in a publication that appears in AI responses. Push fresh sitemaps to Bing. Track citation share and branded search volume at 30, 60, and 90 days.

The data may make the investment case for broader rollout. If not, tweak your approach. The brands moving first will capture the highest-quality traffic at the lowest incremental cost, and set a citation lead that becomes progressively harder for competitors to close.

The full webinar is available on demand.


Featured Image: Dmitry Demidovich/Shutterstock

WordPress Delays Release Of Version 7.0 To Focus On Stability via @sejournal, @martinibuster

WordPress 7.0, previously scheduled for an April 9th release, will be delayed in order to stabilize the Real-Time Collaboration feature and ensure that the release, a major milestone, will “target extreme stability.” Much is riding on WordPress 7.0 as it will ship with features that will usher in the age of AI-driven content management systems.

Prioritization Of Stability

Matt Mullenweg, co-founder of WordPress, commenting in the official Making WordPress Slack workspace, said the release should step back from its current trajectory and prioritize stability, calling for a longer pre-release phase to get the real-time collaboration (RTC) feature working correctly. The delay is expected to last weeks, not days, and is described as a one-off deviation from WordPress’s planned date-driven schedule.

Mullenweg posted:

“Given the scope and status of 7.0, I think we should go back to beta releases, get the new tables right, lock in everything we want for 7.0, and then start RCs again. Date-driven is still our default, but for this milestone release we want to target extreme stability and exciting updates, especially as AI-accelerated development is increasing people’s expectations for software.

This is a one-off, I think for future we should get back on the scheduled train, with an aim for 4-a-year in 2027, to hopefully reflect our AI-enabled ability to move faster.”

Extended Release Candidate Phase Replaces Beta Reversion

To avoid technical compatibility issues, the project will remain in the release candidate phase, extending the testing period through additional RC builds as needed.

The proposal to return to beta releases was rejected because it would break PHP version comparison behavior, plugin update logic, and tooling that depends on standard version sequencing. Continuing with RC builds preserves compatibility while allowing more time for testing and fixes.

Real-Time Collaboration

The delay is largely due to the Real-Time Collaboration feature, which introduces new database tables and changes how WordPress handles editing sessions. Contributors identified risks related to performance, data handling, and interactions with existing systems.

A primary concern is that real-time editing currently disables persistent post caches during active sessions, a performance issue the team is working to resolve before the final release.

Database Design Raises Performance Concerns

A key part of the discussion focused on how to structure the database for Real-Time Collaboration (RTC). A proposed single RTC table would support both real-time editing updates and synchronization. But some contributors noted that the workloads for real-time editing and synchronization are fundamentally different.

Real-time collaboration generates high-frequency, bursty writes that require low latency (meaning updates happen with very little delay).

Synchronization between environments, by contrast, involves slower, structured updates that may include full-table scans.

Combining both patterns within one table risks performance issues and added complexity. Contributors discussed splitting these workloads into separate tables optimized for each use case, but no decision has been made.

Gap In Release Candidate Testing Raises Concern

The discussion in the WordPress Slack workspace also raised concern over whether there had been enough real-world release candidate testing, and noted that database schema changes increase the risk of failures during upgrades. The solution of using the Gutenberg plugin for testing was rejected because database changes could affect production sites and require complex migration logic. Instead, the project will use an extended RC phase to increase testing exposure and gather feedback from a wider group of users.

Versioning Constraints

The proposal to delay version 7.0 led to additional issues. PHP version comparison rules and related tooling complicated returning to beta versions. Contributors agreed that staying within the release candidate sequence (for example RC1, RC2, RC3) avoids these issues while allowing continued iteration.
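
To see why stepping back from RC to beta is awkward, consider how pre-release labels are typically ordered. PHP's `version_compare` ranks beta below RC, and RC below a final release. The sketch below is a simplified Python approximation of that ordering, not PHP's function itself, and the version strings are hypothetical build names used only for illustration.

```python
import re

# Simplified stage ordering, modeled on the documented PHP ordering
# dev < alpha < beta < RC < final release. An empty stage means "final".
STAGE_RANK = {"dev": 0, "alpha": 1, "beta": 2, "rc": 3, "": 4}

VERSION_PATTERN = re.compile(
    r"(\d+(?:\.\d+)*)(?:-(dev|alpha|beta|rc)(\d*))?", re.IGNORECASE
)


def version_key(version: str) -> tuple:
    """Turn a string like '7.0-RC2' into a tuple that sorts in release order."""
    match = VERSION_PATTERN.fullmatch(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    numbers = tuple(int(part) for part in match.group(1).split("."))
    stage = (match.group(2) or "").lower()
    stage_number = int(match.group(3) or 0)
    return (numbers, STAGE_RANK[stage], stage_number)


def is_newer(candidate: str, installed: str) -> bool:
    """True if `candidate` sorts after `installed`."""
    return version_key(candidate) > version_key(installed)


# Hypothetical build names: a site already on RC2 is offered two updates.
print(is_newer("7.0-beta7", "7.0-RC2"))  # a later beta still sorts as older
print(is_newer("7.0-RC3", "7.0-RC2"))    # a further RC sorts as newer
```

Under this ordering, any tool that decides whether to offer an update by comparing version strings would treat a new beta as a downgrade for sites already running a release candidate, which is consistent with the decision to keep issuing RC builds.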

Future Release Cadence Remains

The delay is described as a temporary exception. Matt Mullenweg said the project intends to return to a regular release schedule, with a goal of delivering roughly four releases per year by 2027 as development speeds increase with AI-assisted workflows.

Implications For Developers And Users

Developers should expect continued changes to the Real-Time Collaboration feature and its supporting database structures during the extended release candidate phase. The longer testing period provides more time to identify issues before release. For site owners and hosts, the delay shows that WordPress is prioritizing stability over schedule while introducing more complex real-time and synchronization features.

Impact Of RTC On Hosting Environments

Something that wasn’t discussed but is a real issue is how real-time collaboration might affect web hosting providers. They need to test the feature to see if it introduces issues on shared hosting environments. While RTC will ship turned off by default, the impact of customers using it in a shared hosting environment is currently unknown. A spokesperson for managed WordPress hosting provider Kinsta told Search Engine Journal they are still testing. Given that the feature is still evolving, Kinsta and other web hosts will have to continue testing the upcoming WordPress release candidates.

I think most people will agree that the decision to delay the release of WordPress 7.0 is the right call.

Introducing llms.txt to Shopify: Give AI a map to your best products 

You’ve worked hard to build your product catalog. The last thing you want is AI tools like ChatGPT or Google Gemini describing your products inaccurately to potential customers. 

AI tools don’t browse your whole store the way a search engine does. They grab what they can find, quickly, and fill in the gaps. For a store with a large catalog, that means incomplete answers, outdated information, or, worse, shoppers sent to a competitor. 

The new llms.txt feature, available in Yoast SEO for Shopify, bridges that gap. 

What does it actually do? 

It creates a file that tells AI tools which parts of your store matter most: your top products, your collections, your policies, and your key pages. Think of it as handing AI a well-organized store guide instead of letting it wander around on its own. 

You switch it on once. We handle the rest. 

Two ways to use it 

Let Yoast handle it automatically 

Turn it on and we’ll build and update the file each week based on your Shopify data. No decisions needed. The file automatically highlights: 

  • Your 10 most-sold products over time
  • Up to 5 of your largest collections, plus a link to your full product range 
  • Your store policies, including shipping, returns, and privacy 
  • Your homepage, latest blog posts, and most recently updated pages 
  • Any pages you’ve already marked as cornerstone content 

Or choose exactly what’s included 

If you’d rather have full control, switch to manual selection. You can hand-pick the products and pages you want to feature, and there’s a dedicated spot to add your “About us” page so AI knows the story behind your brand. 

Either way, the file updates weekly and removes deleted products automatically. 
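
For readers curious what such a file looks like, here is a minimal sketch that renders an llms.txt document in the Markdown-style layout used by the public llms.txt proposal: a title, a short summary, and sections of links. This is not Yoast's implementation or its exact output; the store name, section names, and URLs are made up for illustration.

```python
def render_llms_txt(store_name: str, summary: str, sections: dict) -> str:
    """Render a title, a summary, and sections of (title, url) links as llms.txt text."""
    lines = [f"# {store_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url in links:
            lines.append(f"- [{title}]({url})")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical store data, for illustration only.
example_sections = {
    "Products": [
        ("Blue Mug", "https://example.com/products/blue-mug"),
        ("Travel Flask", "https://example.com/products/travel-flask"),
    ],
    "Collections": [("Drinkware", "https://example.com/collections/drinkware")],
    "Policies": [("Shipping policy", "https://example.com/policies/shipping-policy")],
}

llms_txt = render_llms_txt(
    "Example Store", "Handmade drinkware shipped worldwide.", example_sections
)
print(llms_txt)
```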

No technical knowledge needed

Setting this up from scratch would normally mean editing code. We’ve built it directly into your Yoast SEO for Shopify settings so any member of your team can turn it on in seconds. If you already have a redirect set up for /llms.txt, we’ll respect it and let you know, so nothing breaks. 

You decide when it’s right for your business 

We believe every merchant should have a say in how their content is seen and used as AI plays a bigger role in how people discover products online. That’s why this feature is opt-in. 

Turn on the llms.txt toggle in Yoast SEO for Shopify next time you log in to your store.

How To Identify Which LLM Is Actually Working For You [Webinar] via @sejournal, @hethr_campbell

AI search is dominating the strategy conversation right now, and everyone is hearing the same thing from clients and directors: “What’s our AI search plan?”

The instinct is to optimize everywhere (ChatGPT, Perplexity, Gemini) and move fast. But before you reallocate budget or rewrite your GEO roadmap, there’s a more useful question to ask first:

Which LLM is actually driving conversions in your clients’ specific industry?

Join us for an upcoming expert panel webinar where we’ll dive into exactly that.

What You’ll Learn

In this webinar, Danielle Wood, Content & Creative Manager at CallRail, and Natalie Johnson, SEO & AI Visibility Expert & Founder of SweetGlow Marketing, will break down real conversion data by LLM and show how platform-level performance should shape your GEO strategy.

Specifically, you’ll walk away with:

  • Conversion data by LLM platform, so you know where high-intent traffic is actually coming from in each industry
  • A clear AI prioritization framework to stop spreading GEO effort equally and concentrate it where it converts
  • A reporting model that ties AI search activity to real business outcomes clients can see and trust

Why Attend?

You’ll finally be able to justify AI search investment; this session will give you the data and the framework to make that case and to implement the strongest, most successful AI search strategy possible.

Join us live to get your questions answered directly by the expert panel.

Inside the stealthy startup that pitched brainless human clones

After operating in secrecy for years, a startup company called R3 Bio, in Richmond, California, suddenly shared details about its work last week—saying it had raised money to create nonsentient monkey “organ sacks” as an alternative to animal testing.

In an interview with Wired, R3 listed three investors: billionaire Tim Draper, the Singapore-based fund Immortal Dragons, and life-extension investors LongGame Ventures.

But there is more to the story. And R3 doesn’t want that story told.

MIT Technology Review discovered that the stealth startup’s founder John Schloendorn also pitched a startling, medically graphic, and ethically charged vision for what he’s called “brainless clones” to serve the role of backup human bodies.

Imagine it like this: a baby version of yourself with only enough of a brain structure to be alive in case you ever need a new kidney or liver.

Or, alternatively, he has speculated, you might one day get your brain placed into a younger clone. That could be a way to gain a second lifespan through a still hypothetical procedure known as a body transplant.

The fuller context of R3’s proposals, as well as the activities of another stealth startup with related goals, has not previously been reported. They’ve been kept secret by a circle of extreme life-extension proponents who fear that their plans for immortality could be derailed by clickbait headlines and public backlash.

And that’s because the idea can sound like something straight from a creepy science fiction film. One person who heard R3’s clone presentation, and spoke on the condition of anonymity, was left reeling by its implications and shaken by Schloendorn’s enthusiastic delivery. The briefing, this person said, was like a “close encounter of the third kind” with “Dr. Strangelove.”

A key inspiration for Schloendorn is a birth defect in which children are born missing most of their cortical hemispheres; he’s shown people medical scans of these kids’ nearly empty skulls as evidence that a body can live without much of a brain. 

And he’s talked about how to grow a clone. Since artificial wombs don’t exist yet, brainless bodies can’t be grown in a lab. So he’s said the first batch of brainless clones would have to be carried by women paid to do the job. In the future, though, one brainless clone could give birth to another.

Last Monday, the same day it announced itself to the world in Wired, R3 sent us a sweeping disavowal of our findings. It said Schloendorn “never made any statement regarding hypothetical ‘non-sentient human clones’ [that] would be carried by surrogates.” The most overarching of these challenges was its insistence that “any allegations of intent or conspiracy to create human clones or humans with brain damage are categorically false.”

But even Schloendorn and his cofounder, Alice Gilman, can’t seem to keep away from the topic. Just last September, the pair presented at Abundance Longevity, a $70,000-per-ticket event in Boston organized by the anti-aging promoter Peter Diamandis. Although the presentation to about 40 people was not recorded and was meant to be confidential, a copy of the agenda for the event shows that Schloendorn was there to outline his “final bid to defeat aging” in a session called “Full Body Replacement.”

According to a person who was there, both animal research and personal clones for spare organs were discussed. During the presentation, Gilman and Schloendorn even stood in front of an image of a cloning needle. Pressed on whether this was a talk about brainless clones, Gilman told us that while R3’s current business is replacing animal models, “the team reserves the right to hold hypothetical futuristic discussions.”

MIT Technology Review found no evidence that R3 has cloned anyone, or even any animal bigger than a rodent. What we did find were documents, additional meeting agendas, and other sources outlining a technical road map for what R3 called “body replacement cloning” in a 2023 letter to supporters. That road map involved improvements to the cloning process and genetic wiring diagrams for how to create animals without complete brains. 

light passing through an infant's skull
A child with hydranencephaly, a rare condition in which most of the brain is missing. Could a human clone also be created without much of a brain as an ethical source of spare organs?
DIMITRI AGAMANOLIS, M.D. VIA WIKIPEDIA

A main purpose of the fundraising, investors say, was to support efforts to try these techniques in monkeys from a base in the Caribbean. That offered a path to a nearer-term business plan for more ethical medical experiments and toxicology testing—if the company could develop what it now calls monkey “organ sacks.” However, this work would clearly inform any possible human version. 

Though he holds a PhD, Schloendorn is a biotech outsider who has published little and is best known for having once outfitted a DIY lab in his Bay Area garage. Still, his ties to the experimental fringe of longevity science have earned him a network in Silicon Valley and allies at a risk-taking US health innovation agency, ARPA-H. Together with his success at raising money from investors, this signals that the brainless-clone concept should be taken seriously by a wider community of scientists, doctors, and ethicists, some of whom expressed grave concerns. 

“It sounds crazy, in my opinion,” said Jose Cibelli, a researcher at Michigan State University, after MIT Technology Review described R3’s brainless-clone idea to him. “How do you demonstrate safety? What is safety when you’re trying to create an abnormal human?”

Twenty-five years ago, Cibelli was among the first scientists to try to clone human embryos, but he was trying to obtain matched stem cells, not make a baby. “There is no limit to human imagination and ways to make money, but there have to be boundaries,” he says. “And this is the boundary of making a human being who is not a human being.” 

“Feasibility research”

Since Dolly the sheep was born in 1996, researchers have cloned dogs, cats, camels, horses, cattle, ferrets, and other species of mammal. Injecting a cell from an existing animal into an egg creates a carbon-copy embryo that can develop, although not always without problems. Defects, deformities, and stillbirths remain common. 

Those grave risks are why we’ve never heard of a human clone, even though it’s theoretically possible to create one. 

But brainless clones flip the script. That’s because the ultimate aim is to create not a healthy person but an unconscious body that would probably need life support, like a feeding tube, to stay alive. Because this body would share the DNA of the person being copied, its organs would be a near-perfect immunological match. 

Backers of this broad concept argue that a nonsentient body would be ethically acceptable to harvest organs from. Some also believe that swapping in fresh, young body parts—known as “replacement”—is the likeliest path to life extension, since so far no drug can reverse aging. 

And then there’s the idea of a complete body transplant. “Certainly, for the cryonics patients, that sounds like something really promising,” says Anders Sandberg, a prominent Swedish transhumanist and expert in the ethics of future technologies. He notes that many people who opt to be stored in cryonic chambers after death choose the less expensive “head only” option, so “there might be a market for having an extra cloned body.”

MIT Technology Review first approached Schloendorn two years ago after learning he’d led a confidential online seminar called the Body Replacement Mini Conference, in which he presented “recent lab progress towards making replacement bodies.” 

According to a copy of the agenda, that 2023 session also included a presentation by a cloning expert, Young Gie Chung. And there was another from Jean Hébert, who was then a professor at the Albert Einstein College of Medicine and is now a program manager at ARPA-H, where he oversees a project to use stem cells to restore damaged brain tissue. Hébert popularized the so-called replacement solution to avoiding death in a 2020 book called Replacing Aging.

In an interview prior to joining the government in 2024, Hébert described an informal but “very collaborative” relationship with Schloendorn. The overall idea was that to stop aging, one of them would determine how to repair a brain, while the other would figure out how to create a body without one. “It’s a perfect match, right? Body, brain,” Hébert told MIT Technology Review at the time. 

Schloendorn, by working outside the mainstream, had the huge advantage of “not being bound by getting the next paper out, or the next grant,” Hébert said, adding, “It’s such a wonderful way of doing research. It’s just clean and pure.” R3 now appears on the ARPA-H website on a list of prospective partners for Hébert’s program.

In a LinkedIn message exchanged that same year, Schloendorn described his work as “feasibility research in body replacement.”

“We will try to do it in a way that produces defined societal benefits early on, and we need to be prepared to take no for an answer, if it turns out that this cannot be done safely,” Schloendorn wrote at the time. He declined an interview then, saying that before exiting stealth mode, he wants to be sure the benefits are “reasonably grounded in reality.”

That could prove challenging. While body-part replacement sounds logical, like swapping the timing belt on an old car, in reality there’s scant evidence that receiving organs from a younger twin would make you live any longer. 

A complete body transplant, meanwhile, would probably be fatal, at least with current techniques. In the latest test of the concept, published last July, Russian surgeons removed a pig’s head and then sewed it back on. The animal did live—breathing weakly and lapping water from a syringe. But because its spinal cord had been cut, it was otherwise totally paralyzed. (As yet, there’s no proven method to rejoin a severed spinal cord.) In an act of mercy, the doctors ended the pig’s life after about 12 hours. 

Even some of R3’s investors say the endeavor is a risky, low-odds project, on par with colonizing Mars. Boyang Wang, head of Immortal Dragons, has spoken at longevity conferences about body-swapping technology, referring to the chance that “when the time comes, you can transplant your brain into a new body.” Wang confirmed in a January Zoom call that he’d been referring to R3 and that he invested $500,000 in the company during a 2024 fundraising round.

But since making his investment, Wang says, he’s become less bullish. He now views whole-body transplant as “very infeasible, not even very scientific” and “far away from hope for any realistic application.” 

Still, he says, the investment in R3 fits with his philosophy of making unorthodox bets that could be breakthroughs against aging. “What can really move the needle?” he asks. “Because time is running out.”

Stealth mode

Clonal bodies sit at the extreme frontier of an advancing cluster of technologies all aimed at growing spare parts. Researchers are exploring stem cells, synthetic embryos, and blob-like organoids, and some companies are cloning genetically engineered pigs whose kidneys and hearts have already been transplanted into a few patients. Each of these methods seeks to harness development—the process by which animal bodies naturally form in the womb—to grow fully functional organs. 

There’s even a growing cadre of mainstream scientists who say nonsentient bodies could solve the organ shortage, if they could be grown through artificial means. Two Stanford University professors, calling these structures “bodyoids,” published an editorial in favor of manufacturing spare human bodies in MIT Technology Review last year. While that editorial left many details to the imagination, they called the idea “at least plausible—and possibly revolutionary.” 

“There are a lot of variations on this where they’re trying to find a socially acceptable form,” says George Church, a Harvard University professor who advises startups in the field. But Church says gestating an entire body is probably taking things too far, especially since nearly all patients on transplant lists are waiting for just a single organ, like a heart or kidney. 

“There’s almost no scenario where you need a whole body,” he says. “I just think even if it’s someday acceptable, it’s not a good place to start.” For the moment, Church says, brainless human bodies are “not very useful, in addition to being repulsive.”

That’s arguably why body replacement technology still feels risky to talk about, even among life-extension enthusiasts who are otherwise ready to inject Chinese peptides or have their bodies cryogenically frozen. “I think it’s exciting or interesting from a scientific perspective, but I think the world is not fully ready for it yet,” says Emil Kendziorra, CEO of Tomorrow Bio, a company in Berlin that stores bodies at -196 °C in the hope they can be restored to life in the future. 

“Everybody’s like, yeah, you know, cryopreservation makes total sense,” he says. “And then you talk about total body replacement. And then everybody’s like, Whoa, whoa, whoa.”

Even so, “replacement” technology has found a fervent base of support among a group of self-described “hardcore” longevity adherents who follow a philosophy called Vitalism, which holds that society should redirect resources toward achieving unlimited lifespans. The growing influence of this movement, achieved through lobbying, investment, recruiting, and public messaging, was detailed earlier this year in MIT Technology Review.

Last spring, during a meetup for this community, Kendziorra was among the attendees at an invite-only “Replacement Day” gathering that took place off the public schedule. It was where more radical ideas could be discussed freely, since to some in the Vitalist circle, replacing body parts has emerged as the most plausible, least expensive way to beat death. 

At least that was the conclusion of a road map for anti-aging technology produced by one Vitalist group, the Longevity Biotech Fellowship, which reckoned that a proof-of-concept human clone lacking a neocortex would cost $40 million to create—a tiny amount, relatively speaking. 

Its report cited the existence of two stealth companies working on cloning whole nonsentient bodies, although it took care not to name them. If these companies’ activities become public, “there will be a huge backlash—people will hate it,” the entrepreneur Kris Borer said while presenting the road map at a French resort last August. 

“There are a ton of dystopian movies and novels about this kind of stuff. That is why I didn’t talk about any of the companies working on it. They are trying to hide from public attention,” he said. “We have to have the angel investors and other people invest kind of in secret until things are ready.” 

Borer did say what he sees as the best way to go public: first, to slowly ease body replacement into society’s awareness by disclosing more limited aims, which will be palatable. “We are not going to start with Let’s clone you and give you a body. We are going to start with Let’s solve the organ shortage,” he said. “Eventually people will warm up to it, and then we can go to the more hardcore stuff.”

In an interview earlier this month, Borer declined to name the companies involved in his immortality road map, or to say if R3 is one of them. But we did identify one additional stealthy startup, this one focused on replacing a person’s internal organs, not the whole body. Called Kind Biotechnology, it is a New Hampshire–based company headed by the anti-aging researcher Justin Rebo, a sometime collaborator of Schloendorn’s.

Fig 13 from a patent application
A patent image from Kind Biotechnology shows a mouse pup engineered to lack anatomical features (left) next to a normal animal. The company’s goal is to grow organ “sacks” with a “complete lack of ability to feel, think, or sense.”
WO2025260099 VIA WIPO

According to patent applications filed by the company, Rebo’s team is working to create animals with a “complete lack of ability to feel, think, or sense the environment.” Images included in the patents show mice the company produced that lack a complete brain, and others that don’t have faces or limbs. They did that by deleting genes in embryos using the gene-editing technology CRISPR with the goal of creating a “sack of organs that grows mostly on its own,” with only a minimal nervous system. A cartoon rendering submitted to the patent office shows what looks like a fleshy duffel bag connected to life support tubes. 

In an email, Rebo said his company is working on an “ethical and scalable” way to create animal organs for experimental transplant to humans. He notes that “thousands die while waiting” for an organ. 

Some of Kind’s patent applications do cover the possibility of producing these organ sacks from human cells. Rebo says that’s more of a speculative possibility. But he does see his work as part of the “replacement” approach to longevity. In part, that’s because a “scalable production of young, high-quality organs” would let surgeons try transplants in more types of patients, including many with heart disease in old age who aren’t candidates for a transplant now. 

“With abundant high-quality organs, replacement could become a direct form of rejuvenation by replacement of failing parts,” he says. 

And Rebo imagines that simultaneously replacing multiple internal organs (grown together in the sack) could have even broader rejuvenating effects. “Ultimately, replacing failing parts is a direct path to extending healthy human lifespan,” he says. 

Church, who agreed earlier this year to advise Kind Bio, sees this work as part of an effort to “nudge” these technologies “toward something that is more useful and more acceptable from the get-go,” he says. “And then let’s see how society responds to that—rather than jumping to the most repulsive and most useless form, which some of them seem to be aiming for.” 

“There’s one way to find out”

People who know Schloendorn describe a dynamo-like presence who is “100% dedicated” to the goal of extreme life extension. In 2006, he penned a paper in a bioethics journal outlining why the “desire to live forever” is rational, and his doctoral research at the University of Arizona was sponsored by a longevity research organization called the SENS Foundation.  

He’s also well connected. In an interview, Aubrey de Grey, the influential and controversial fundraiser and prognosticator who cofounded SENS, called Schloendorn “one of my protégés.” And around 2010, Peter Thiel reportedly invested $1.5 million in ImmunePath, a company started by Schloendorn to develop stem-cell treatments, though it soon failed. (A representative for Thiel did not respond to a request to confirm the figure.)

By 2021, Schloendorn had moved on, founding R3 Biotechnologies. He began to circulate the body replacement idea and discuss a step-by-step scheme to get there: assess techniques in the lab first, then in monkeys, and maybe eventually in humans. 

A 2023 “letter to stakeholders” signed by Schloendorn begins by saying that “body replacement cloning will require multicomponent genetic engineering on a scale that has never been attempted in primates.” Fortunately, it adds, molecular techniques for “brain knockout” are well known in mice and should also be expected to function in “birthing whole primates,” a class that includes both monkeys and humans. 

Would it work? “There’s one way to find out,” the letter says. 

Wang, the investor at Immortal Dragons, says he put money into R3 after it showed him it is possible to create mice without complete brains. “There were imperfections, but the resulting mice survived, grew up, and to me, that is a pretty strong experiment,” he says; it was evidence enough for him to fund R3’s attempt to “replicate the result in primates.” 

(In its emailed statement, R3 said the company and its founders “never produced any degree of brain alterations in any species, did not attempt to do so, did not hire another party to do so, and have no specific plans to do so in the future.” It added: “We do not work with live non-human primates.”) 

The bigger technical obstacle, though, remains the cloning. Out of 100 attempts to clone an animal, only a few typically succeed. That fact alone makes cloning a human—or a monkey—almost infeasible.

But R3 does seem to have made an effort to tackle the efficiency problem. In one document reviewed by MIT Technology Review, it claims to have implemented improvements to the basic procedure in rodents, referencing a protein, called a histone demethylase, that helps erase a cell’s genetic memory. Adding it can greatly increase the chance that the cell will form a cloned embryo after being injected into an egg in the lab.

Those molecules were used in the first successful cloning of a monkey, which occurred in 2018 in China. But it still wasn’t easy—in fact, it was a huge and costly effort to handle a crowd of monkeys in estrus and perform IVF on them. According to Michigan State’s Cibelli, monkey cloning remains nearly impossible, at least on US territory, just because it’s “unaffordable.”

Nevertheless, success in monkeys did help prove, at least biologically, that human reproductive cloning could be possible. 

The company may also have tried to tackle a second long-standing obstacle to cloning: defects in how the placenta works. Because of such problems, some cloned animals die quickly after birth.

The R3 document refers to a “birthing fix” it developed to further improve the cloning success rate. While MIT Technology Review didn’t learn what R3’s process entails, we found a reference to it on the LinkedIn page of Maitriyee Mahanta, a scientist who cosigned the 2023 letter to R3 stakeholders and is a former research assistant to Hébert. (We were unable to reach Mahanta for comment.)

Her page described her current role as “molecular lead” studying cloning, “birth rate fixing,” and cortical development using cells from nonhuman primates. Her job affiliation is given as the Longevity Escape Velocity Foundation, a nonprofit where de Grey is the president and chief science officer. But de Grey says his foundation only arranged a work visa for Mahanta as part of a partnership “with the company she actually spends her time at.”

Like several other people interviewed for this article, de Grey made a resourceful effort to avoid directly confirming the existence of R3 when we spoke, while at the same time freely discussing theoretical aspects of body cloning technology. For instance, he talked about ways to shorten the wait for your double to grow up to a size suitable for organ harvesting; a further genetic mutation could be added to cause “central precocious puberty” in the clone, he said. This condition causes a growth spurt, even pubic hair, in a toddler. 

Cloning dictators

Who would clone a body and pay to keep it alive for years, until it’s needed? The first customers for this costly technology (if it ever proves feasible) would likely be the ultra-rich or the ultra-powerful. 

Indeed, somehow the world’s top dictators seem to have gotten the memo about replacement parts. In September, a hot mic picked up a conversation between Russian president Vladimir Putin and Chinese leader Xi Jinping as they walked through Beijing with North Korean autocrat Kim Jong Un; in the exchange, the Russian speculated on life extension.  

“Biotechnology is continuously developing. Human organs can be continuously transplanted. The longer you live, the younger you become, and [you can] even achieve immortality,” Putin said through an interpreter.

“Some predict that in this century, humans will live to 150 years old,” Xi responded agreeably.

How the leaders learned of these possibilities is unknown. But scenarios involving dictators are a constant topic among body replacement enthusiasts. 

“There are companies working on this. They are in stealth—we can’t reveal too much about them—but the general concept on this is if you didn’t have any ethical qualms, you could do most of it today,” Will Harborne, the chief investment officer of LongGame Advisors, said last year, during an interview with the podcaster Julian Issa. “If you were the dictator of some country and wanted a clone of yourself, you can already go grow one. You can create a cloned embryo of yourself, you can get a surrogate to carry it to term, and you can grow [a] body until age 18 with a brain, and eventually, if you were a dictator, you could kill them and try to transplant your head on their body.”

“And now no one is suggesting you do that—it’s very unethical—but most of the technology is there,” he said. He noted that the reason for removing the cortex of a clone created for such a purpose is that “we don’t want to kill other people to live forever.” 

Harborne subsequently confirmed to MIT Technology Review that the fund invested $1 million in R3 about a year and a half ago.

In order to make the body replacement process ethical, the clone’s brain needs to be stunted so it lacks consciousness. That is where the interest in birth defects comes in. Remarkable medical scans of kids with a rare condition, hydranencephaly, show a total absence of the cerebral hemispheres. Yet if they are cared for, they may be able to live into their 20s, even though they cannot speak or engage in purposeful movement. 

The technical question, then, is how to intentionally produce such a condition in a clone. Sandberg, the futurist, says he’s visited R3’s lab, talked to Gilman, and sat through a presentation about how genetic engineering can be used to shape brain growth. Previous work has shown that by adding a toxic gene, it is possible to kill specific cell types in a growing embryo but spare others, leading to a mouse without a neocortex.

While Sandberg isn’t an expert in biotechnology, he says R3’s theory looked sensible to him. “I think it’s possible to actually prevent the development of the brain well enough that you can say ‘Yeah, there is almost certainly no consciousness here,’” Sandberg says. “Hence, there can’t be any suffering, or any individual, in a practical sense.”

“I think the overall aim—actually, it looks ethically pretty good,” he says. 

Monkeys were successfully cloned in China for the first time in 2018. Although it was a costly and difficult undertaking, the feat suggested human cloning is biologically possible.
QIANG SUN AND MU-MING POO/CHINESE ACADEMY OF SCIENCES VIA AP

Yet it could be difficult to really determine where consciousness starts and ends. Under current medical standards, taking the organs of people with hydranencephaly isn’t allowed because they don’t meet the standard of brain death: They have a functioning brain stem. An even more serious problem is evidence that the brain stem alone produces a basic form of consciousness. If that is so, says Bjorn Merker, a neuroscientist who surveyed caretakers of more than a hundred children with hydranencephaly, a plan “to harvest organs from organisms modeled on this condition would be unethical.”

Of course, the most extreme version of the replacement dream isn’t just to take organs. It’s to take over the body entirely. Sergio Canavero, a controversial Italian surgeon who has proposed head and brain transplants, says he was approached for advice by Schloendorn and others a few years ago. “They told me they were looking at a head transplant on a two- or three-year-old,” he says. “I stopped short. How could you even conceive of that? The biomechanical compatibility is not there. You have to wait until at least 14. And I would say 16. It was very clear to me these guys are not surgeons—they are biologists.” 

Canavero says he’s not opposed to cloning bodies for transplant—he thinks it could work. “But if you want to use a clone,” he says, “it must be a nonsentient clone. Otherwise it’s murder, a homicide.”    

MIT Technology Review has not found any evidence that R3 has yet created an “organ sack,” much less a brainless human clone. And there are many reasons to believe their hypothetical future of “full body replacement” will never come to pass—that it is just a live-forever fantasy.

“There are so many barriers,” says Cibelli. It’s a long list: Human cloning is illegal in many countries, it’s unsafe, and few competent experts would want, or dare, to participate. And then there’s the inconvenient fact that for now, there’s no way to grow a brainless clone to birth, except in a woman’s body. Think about it, Cibelli says: “You’d have to convince a woman to carry a fetus that is going to be abnormal.”

Sandberg agrees that is where things could start to get tricky. “The problem here, of course,” he says, “is that the yuck factor is magnificent.”

The Download: brainless human clones and the first uterus kept alive outside a body

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

Inside the stealthy startup that pitched brainless human clones 

After operating in secrecy for years, R3 Bio, a California-based startup, suddenly revealed last week that it had raised money to create nonsentient monkey “organ sacks” as an alternative to animal testing. But there is more to the story. And R3 doesn’t want that story told. 

MIT Technology Review discovered that founder John Schloendorn also pitched a startling, ethically charged vision: “brainless clones” that serve as backup human bodies. Find out all the details on the radical proposal.

—Antonio Regalado 

A woman’s uterus has been kept alive outside the body for the first time 

Ten months ago, reproductive health researchers placed a freshly donated human uterus inside a new device they call “Mother.” They connected the organ to the machine’s plastic veins and arteries and pumped in modified human blood. 

The device kept the uterus alive for a day, a new feat that could lead to longer-term maintenance of wombs outside the body. Future versions of the technology could shine new light on pregnancies—and potentially even grow a human fetus. Read the full story.

—Jessica Hamzelou 

The must-reads 

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology. 

1 AI data centers can significantly warm up surrounding areas  
The “heat islands” may already affect 340 million people. (New Scientist)
+ Mistral has raised $830M to build Nvidia-powered AI centers in Europe. (FT $)
+ But nobody wants a data center in their backyard. (MIT Technology Review)

2 Elon Musk reportedly joined Trump’s call with Modi about the Iran War 
It remains unclear what Musk was doing during the conversation. (NYT $)  
+ India has disputed the report. (Independent)
+ The war poses a grave threat to the EV market. (Rest of World)

3 Eli Lilly has struck a deal to bring AI-developed drugs to the market 
It’s secured a $2.75 billion drug collaboration with Insilico Medicine. (Reuters $) 
+ AI-designed compounds can kill drug-resistant bacteria. (MIT Technology Review)

4 More and more countries are curbing children’s social media access 
Austria is the latest to pursue a ban. (Engadget)
+ Indonesia has rolled out the first one in Southeast Asia. (DW)
+ UK Prime Minister Keir Starmer said he will also “have to act.” (Guardian)  

5 Tech stocks just had their worst week in nearly a year 
Thanks to a combination of the Iran war and legal disputes. (CNBC)
+ Tech insiders are split over the AI bubble. (MIT Technology Review)

6 Meta is launching new smart glasses for prescription wearers 
It plans to debut them next week. (Bloomberg $) 

7 Taiwan is probing 11 Chinese firms for illegal poaching of tech talent 
Its semiconductors are entangled in the tensions with Beijing. (Reuters)

8 Bluesky has built an AI app for customizing social media feeds 
It uses Anthropic’s Claude. (TechCrunch)

9 A psychologist is making music with his brain implant 
He believes enjoyment is a prerequisite for BCI success. (Wired $) 

10 The world’s smallest QR code could store data for centuries 
It’s smaller than bacteria. (Science Daily)

Quote of the day 

“We should be thinking about protecting young people in the digital world as opposed to protecting them from the digital world.” 

—YouTube CEO Neal Mohan gives the New York Times his take on the debate around children’s safety online. 

One More Thing 


AI’s growth needs the right interface 

You’d have to be pudding-brained to believe that chatbots are the best way to use computers. The real opportunity is a system built atop the visual interfaces we already know, but navigated through a natural mix of voice and touch. 

Crucially, this won’t just be a computer that we can use. It’ll be one we can break and remake to suit whatever uses we want. Instead of merely consuming technology like the gelatinous humans in Wall-E, we should be able to architect it to suit our own ends.

This idea is already lurching to life. Read the full story to find out how.

—Cliff Kuang 

We can still have nice things 

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line.) 
 
+ These floating designs will elevate your perspective on architecture. 
+ Uğur Gallenkuş’s portraits of two worlds in one image beautifully build bridges. 
+ This is the anti-Karen that the world needs right now. 
+ If only we could all find a love as pure as this kitty clinging to its favorite toy. 

The Pentagon’s culture war tactic against Anthropic has backfired

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Last Thursday, a California judge temporarily blocked the Pentagon from labeling Anthropic a supply chain risk and ordering government agencies to stop using its AI. It’s the latest development in the month-long feud. And the matter still isn’t settled: The government was given seven days to appeal, and Anthropic has a second case against the designation that has yet to be decided. Until then, the company remains persona non grata with the government. 

The stakes in the case—how much the government can punish a company for not playing ball—were apparent from the start. Anthropic drew lots of senior supporters with unlikely bedfellows among them, including former authors of President Trump’s AI policy.

But Judge Rita Lin’s 43-page opinion suggests that what is really a contract dispute never needed to reach such a frenzy. It did so because the government disregarded the existing process for how such disputes are governed and fueled the fire with social media posts from officials that would eventually contradict the positions it took in court. The Pentagon, in other words, wanted a culture war (on top of the actual war in Iran that began hours later). 

The government used Anthropic’s Claude for much of 2025 without complaint, according to court documents, while the company walked a branding tightrope as a safety-focused AI company that also won defense contracts. Defense employees accessing it through Palantir were required to accept terms of a government-specific usage policy that Anthropic cofounder Jared Kaplan said “prohibited mass surveillance of Americans and lethal autonomous warfare” (Kaplan’s declaration to the court didn’t include details of the policy). Only when the government aimed to contract with Anthropic directly did the disagreements begin. 

What drew the ire of the judge is that when these disagreements became public, they had more to do with punishment than just cutting ties with Anthropic. And they had a pattern: Tweet first, lawyer later. 

President Trump’s post on Truth Social on February 27 referenced “Leftwing nutjobs” at Anthropic and directed every federal agency to stop using the company’s AI. This was echoed soon after by Defense Secretary Pete Hegseth, who said he’d direct the Pentagon to label Anthropic a supply chain risk. 

Doing so necessitates that the secretary take a specific set of actions, which the judge found Hegseth did not complete. Letters sent to congressional committees, for example, said that less drastic steps were evaluated and deemed not possible, without providing any further details. The government also said the designation as a supply chain risk was necessary because Anthropic could implement a “kill switch,” but its lawyers later had to admit it had no evidence of that, the judge wrote.

Hegseth’s post also stated that “No contractor, supplier, or partner that does business with the United States military may conduct any commercial activity with Anthropic.” But the government’s own lawyers admitted on Tuesday that the Secretary doesn’t have the power to do that, and agreed with the judge that the statement had “absolutely no legal effect at all.”

The aggressive posts also led the judge to conclude that Anthropic was on solid ground in complaining that its First Amendment rights were violated. The government, the judge wrote while citing the posts, “set out to publicly punish Anthropic for its ‘ideology’ and ‘rhetoric,’ as well as its ‘arrogance’ for being unwilling to compromise those beliefs.”

Labeling Anthropic a supply chain risk would essentially be identifying it as a “saboteur” of the government, for which the judge did not see sufficient evidence. She issued an order last Thursday halting the designation, preventing the Pentagon from enforcing it and forbidding the government from fulfilling the promises made by Hegseth and Trump. Dean Ball, who worked on AI policy for the Trump administration but wrote a brief supporting Anthropic, described the judge’s order on Thursday as “a devastating ruling for the government, finding Anthropic likely to prevail on essentially all of its theories for why the government’s actions were unlawful and unconstitutional.”

The government is expected to appeal the decision. But Anthropic’s separate case, filed in DC, makes similar allegations. It just references a different segment of the law governing supply chain risks. 

The court documents paint a pretty clear pattern. Public statements made by officials and the President did not at all align with what the law says should happen in a contract dispute like this, and the government’s lawyers have consistently had to create justifications for social media lambasting of the company after the fact.

Pentagon and White House leadership knew that pursuing the nuclear option would spark a court battle; Anthropic vowed on February 27 to fight the supply chain risk designation days before the government formally filed it on March 3. Pursuing it anyway meant senior leadership was, to say the least, distracted during the first five days of the Iran war, launching strikes while also compiling evidence that Anthropic was a saboteur to the government, all while it could have cut ties with Anthropic by simpler means. 

But even if Anthropic ultimately wins, the government has other means to shun the company from government work. Defense contractors who want to stay on good terms with the Pentagon, for example, now have little reason to work with Anthropic even if it’s not flagged as a supply chain risk. 

“I think it’s safe to say that there are mechanisms the government can use to apply some degree of pressure without breaking the law,” says Charlie Bullock, a senior research fellow at the Institute for Law and AI. “It kind of depends how invested the government is in punishing Anthropic.”

From the evidence thus far, the administration is committing top-level time and attention to winning an AI culture war. At the same time, Claude is apparently so important to its operations that even President Trump said the Pentagon needed six months to stop using it. The White House demands political loyalty and ideological alignment from top AI companies. But the case against Anthropic, at least for now, exposes the limits of its leverage.

If you have information about the military’s use of AI, you can share it securely via Signal (username jamesodonnell.22).

There are more AI health tools than ever—but how well do they work?

  • Demand is driving the boom: Microsoft, Amazon, and OpenAI have all launched consumer health AI tools in recent months, partly because people are already using general chatbots for medical advice at massive scale—Microsoft alone fields 50 million health questions daily.
  • Independent testing is lagging behind releases: Most experts agree these tools could genuinely help people who struggle to access care, but all six academic researchers interviewed raised concerns that products are going public before independent researchers can assess whether they’re actually safe.
  • Even good benchmarks have blind spots: Studies show that real users—lacking medical expertise—might not know how to get the answers they want from health chatbots, a gap that some lab-based evaluations may not catch.
  • The honest answer is still “we don’t know”: No one is demanding perfection from health AI, but without trusted third-party evaluation, it remains genuinely unclear whether today’s tools help more than they harm.

Earlier this month, Microsoft launched Copilot Health, a new space within its Copilot app where users will be able to connect their medical records and ask specific questions about their health. A couple of days earlier, Amazon had announced that Health AI, an LLM-based tool previously restricted to members of its One Medical service, would now be widely available. These products join the ranks of ChatGPT Health, which OpenAI released back in January, and Anthropic’s Claude, which can access user health records if granted permission. Health AI for the masses is officially a trend. 

There’s a clear demand for chatbots that provide health advice, given how hard it is for many people to access it through existing medical systems. And some research suggests that current LLMs are capable of making safe and useful recommendations. But researchers say that these tools should be more rigorously evaluated by independent experts, ideally before they are widely released. 

In a high-stakes area like health, trusting companies to evaluate their own products could prove unwise, especially if those evaluations aren’t made available for external expert review. And even if the companies are doing quality, rigorous research—which some, including OpenAI, do seem to be—they might still have blind spots that the broader research community could help to fill.

“To the extent that you always are going to need more health care, I think we should definitely be chasing every route that works,” says Andrew Bean, a doctoral candidate at the Oxford Internet Institute. “It’s entirely plausible to me that these models have reached a point where they’re actually worth rolling out.”

“But,” he adds, “the evidence base really needs to be there.”

Tipping points 

To hear developers tell it, these health products are now being released because large language models have indeed reached a point where they can effectively provide medical advice. Dominic King, the vice president of health at Microsoft AI and a former surgeon, cites AI advancement as a core reason why the company’s health team was formed, and why Copilot Health now exists. “We’ve seen this enormous progress in the capabilities of generative AI to be able to answer health questions and give good responses,” he says.

But that’s only half the story, according to King. The other key factor is demand. Shortly before Copilot Health was launched, Microsoft published a report, and an accompanying blog post, detailing how people used Copilot for health advice. The company says it receives 50 million health questions each day, and health is the most popular discussion topic on the Copilot mobile app.

Other AI companies have noticed, and responded to, this trend. “Even before our health products, we were seeing just a rapid, rapid increase in the rate of people using ChatGPT for health-related questions,” says Karan Singhal, who leads OpenAI’s Health AI team. (OpenAI and Microsoft have a long-standing partnership, and Copilot is powered by OpenAI’s models.)

It’s possible that people simply prefer posing their health problems to a nonjudgmental bot that’s available to them 24-7. But many experts interpret this pattern in light of the current state of the health-care system. “There is a reason that these tools exist and they have a position in the overall landscape,” says Girish Nadkarni, chief AI officer at the Mount Sinai Health System. “That’s because access to health care is hard, and it’s particularly hard for certain populations.”

The virtuous vision of consumer-facing LLM health chatbots hinges on the possibility that they could improve user health while reducing pressure on the health-care system. That might involve helping users decide whether or not they need medical attention, a task known as triage. If chatbot triage works, then patients who need emergency care might seek it out earlier than they would have otherwise, and patients with more mild concerns might feel comfortable managing their symptoms at home with the chatbot’s advice rather than unnecessarily busying emergency rooms and doctor’s offices.

But a recent, widely discussed study from Nadkarni and other researchers at Mount Sinai found that ChatGPT Health sometimes recommends too much care for mild conditions and fails to identify emergencies. Though Singhal and some other experts have suggested that the study’s methodology might not provide a complete picture of ChatGPT Health’s capabilities, it has surfaced concerns about how little external evaluation these tools see before being released to the public.

Most of the academic experts interviewed for this piece agreed that LLM health chatbots could have real upsides, given how little access to health care some people have. But all six of them expressed concerns that these tools are being launched without testing from independent researchers to assess whether they are safe. While some advertised uses of these tools, such as recommending exercise plans or suggesting questions that a user might ask a doctor, are relatively harmless, others carry clear risks. Triage is one; another is asking a chatbot to provide a diagnosis or a treatment plan. 

The ChatGPT Health interface includes a prominent disclaimer stating that it is not intended for diagnosis or treatment, and the announcements for Copilot Health and Amazon’s Health AI include similar warnings. But those warnings are easy to ignore. “We all know that people are going to use it for diagnosis and management,” says Adam Rodman, an internal medicine physician and researcher at Beth Israel Deaconess Medical Center and a visiting researcher at Google.

Medical testing

Companies say they are testing the chatbots to ensure that they provide safe responses the vast majority of the time. OpenAI has designed and released HealthBench, a benchmark that scores LLMs on how they respond in realistic health-related conversations—though the conversations themselves are LLM-generated. When GPT-5, which powers both ChatGPT Health and Copilot Health, was released last year, OpenAI reported the model’s HealthBench scores: It did substantially better than previous OpenAI models, though its overall performance was far from perfect. 

But evaluations like HealthBench have limitations. In a study published last month, Bean—the Oxford doctoral candidate—and his colleagues found that even if an LLM can accurately identify a medical condition from a fictional written scenario on its own, a non-expert user who is given the scenario and asked to determine the condition with LLM assistance might figure it out only a third of the time. If they lack medical expertise, users might not know which parts of a scenario—or their real-life experience—are important to include in their prompt, or they might misinterpret the information that an LLM gives them.

Bean says that this performance gap could be significant for OpenAI’s models. In the original HealthBench study, the company reported that its models performed relatively poorly in conversations that required them to seek more information from the user. If that’s the case, then users who don’t have enough medical knowledge to provide a health chatbot with the information that it needs from the get-go might get unhelpful or inaccurate advice.

Singhal, the OpenAI health lead, notes that the company’s current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. However, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version.

Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean’s own study used GPT-4o, which came out almost a year ago and is now outdated. 

Earlier this month, Google released a study that meets Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE’s diagnoses were just as accurate as physicians’, and none of the conversations raised major safety concerns for researchers. 

Despite the encouraging results, Google isn’t planning to release AMIE anytime soon. “While the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing,” wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.

Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. “There’s lots of reasons that the clinical trial paradigm doesn’t always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?”

The key there is “third party.” No matter how extensively companies evaluate their own products, it’s tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots.

OpenAI’s Singhal says he’s strongly in favor of external evaluation. “We try our best to support the community,” he says. “Part of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like.” 

Given how expensive it is to produce a high-quality evaluation, he says, he’s skeptical that any individual academic laboratory would be able to produce what he calls “the one evaluation to rule them all.” But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluation suites—such as Stanford’s MedHELM framework, which tests models on a wide variety of medical tasks. Currently, OpenAI’s GPT-5 holds the highest MedHELM score.

Nigam Shah, a professor of medicine at Stanford University who led the MedHELM project, says it has limitations. In particular, it only evaluates individual chatbot responses, but someone who’s seeking medical advice from a chatbot tool might engage it in a multi-turn, back-and-forth conversation. He says that he and some collaborators are gearing up to build an evaluation that can score those complex conversations, but that it will take time, and money. “You and I have zero ability to stop these companies from releasing [health-oriented products], so they’re going to do whatever they damn please,” he says. “The only thing people like us can do is find a way to fund the benchmark.”

No one interviewed for this article argued that health LLMs need to perform perfectly on third-party evaluations in order to be released. Doctors themselves make mistakes—and for someone who has only occasional access to a doctor, a consistently accessible LLM that sometimes messes up could still be a huge improvement over the status quo, as long as its errors aren’t too grave. 

With the current state of the evidence, however, it’s impossible to know for sure whether the currently available tools do in fact constitute an improvement, or whether their risks outweigh their benefits.