The Triple-P Framework: AI & Search Brand Presence, Perception & Performance

As they compete for market share across a growing range of AI platforms, each with its own way of presenting information, brands are on red alert.

The three pillars I discuss in this article (presence, perception, and performance) can help marketers navigate this shift as search and AI undergo their biggest transformation yet.

What’s driving this change?

AI isn’t just retrieving information anymore – it’s actively evaluating, framing, and recommending brands before prospects even click a link.

It’s happening now, and it’s accelerating.

Think about it: Today, ChatGPT has become nearly as synonymous with AI as Google once became with search.

More and more users and marketers are experimenting with and utilizing Google AIO, ChatGPT, Perplexity, and more.

According to a recent BrightEdge survey, over 53% of marketers use two or more AI search platforms weekly.

AI Is Reshaping How Brands Are Presented And Perceived

Consider how buyers research options today. A traveler planning a Barcelona vacation once needed dozens of separate searches, each representing an opportunity for visibility.

Now? They ask one question to an AI assistant and receive a complete itinerary, compressing what was once 50 touchpoints into a single interaction.

AI is no longer a passive search engine. It’s an active evaluator, interpreting intent, forming opinions, and determining which brands deserve attention.

In enterprise SEO and B2B contexts, the shift is even more pronounced. AI is effectively writing the request for proposal (RFP), establishing evaluation criteria, and creating shortlists without brands having direct input.

Take enterprise software evaluation, for instance. When a CIO asks an AI about the “best enterprise resource planning solutions,” the AI’s response typically features:

  • A curated shortlist of vendors.
  • Evaluation criteria that the AI deems relevant.
  • Strengths and limitations of each solution.
  • Recommendations based on various scenarios.

These responses don’t just inform decisions. They frame the entire evaluation process before a vendor’s content is visited.

The question isn’t whether this transformation is happening. It’s whether your brand is prepared for it.

Read more: 5 Key Enterprise SEO And AI Trends For 2025

The Triple-P Framework For AI Search Success

After analyzing thousands of AI search responses using our BrightEdge Generative Parser™, I’ve developed the Triple-P framework (Presence, Perception, and Performance) as a strategic compass for navigating this new landscape.

Let’s break down each component.

Presence: Beyond Traditional Rankings

While Google still commands 89.71% of search market share, the ecosystem is diversifying rapidly:

  • ChatGPT: 19% monthly traffic growth.
  • Perplexity: 12% monthly traffic growth.
  • Claude: 166% monthly traffic surge.
  • Grok: 266% early-stage spike.

(Source: BrightEdge Generative Parser™ April 2025)

Our research shows that the presence of AI Overviews has nearly doubled since June 2024, with comparison features growing by 70-90% and product visualization features by 45-50% in B2B sectors.

Image from author, May 2025

For enterprise marketers, Google is still your starting point. However, it’s not just about ranking on Google anymore; it’s about showing up wherever AI models showcase your brand.

For example, consider these industry-specific implications:

  • For CPG brands: When consumers ask about product sustainability, AI doesn’t just list eco-friendly options; it evaluates authenticity based on consistent messaging across digital touchpoints.
  • For SaaS companies: Buyers researching integration capabilities receive AI-curated assessments that either position you as a compatibility leader or exclude you entirely.
  • For healthcare providers: Patient questions about treatment options trigger AI responses that cite the most authoritative content, not necessarily the highest-ranking websites.

We are in an era of compressed decision-making. Invisibility equals elimination.

Perception: When AI Forms Opinions

The most revealing insight from our research is that only 31% of AI-generated brand mentions are positive; of those, just 20% include direct recommendations.

Source: BrightEdge AI Catalyst and Generative Parser ™, May 2025

This is a wake-up call for all marketers, especially those managing a brand.

Even when your brand appears in AI results, how it’s framed varies dramatically depending on the AI model, training data, and interpretive logic.

In some AI engines, your brand may appear as the industry leader. In others, you may be completely absent.

What The Data Shows:

  • Brands with strong pre-existing recognition receive more positive mentions in AI responses.
  • Consistent messaging across digital touchpoints makes brands more likely to be cited positively.
  • AI systems appear to “average” brand signals across the web when forming perceptions.

When we analyzed sentiment distribution in AI responses by industry (April 2025), we saw significant variation across verticals. For example:

  • Finance: Positive mentions correlated with strong content on regulatory compliance and security.
  • Healthcare: Accuracy and credibility were the key factors behind positive mentions.
  • Retail: Customer experience and shopping content drove positive mentions.
  • Technology: Innovation and reliability were the primary criteria.

The implications are clear: Perception management is now as crucial as presence.

How does this play out in practice?

When brands implement coordinated perception management strategies across multiple channels, they see improvements in AI sentiment within 60-90 days.

Performance: New Metrics That Matter

The final P (Performance) requires entirely new measurement approaches.

When AI Overviews appear in search results, click-through rates often drop by up to 50%, according to internal BrightEdge data. Yet conversion rates typically remain strong, suggesting AI qualifies leads before they reach your site.

“We’re entering an era where impressions will be high, click-through rates may drop, but conversions will increase,” I explained at our recent quarterly briefing. “AI filters options and delivers buyers who are closer to decisions.”

The impact varies dramatically by query type:

  • Informational queries: Reduction in clicks, minimal conversion impact.
  • Navigational queries: Reduction in clicks, negligible conversion impact.
  • Commercial queries: Reduction in clicks, higher conversion rates.
  • Transactional queries: Reduction in clicks, higher conversion rates.

This pattern suggests AI is most effective at qualifying commercial intent, delivering more purchase-ready traffic.

And impressions matter now – they are a new brand metric.

Five Essential AI Search Metrics:

  1. AI Presence Rate: Percentage of target queries where your brand appears in AI responses.
  2. Citation Authority: How consistently you are cited as the primary source.
  3. Share Of AI Conversation: Your semantic real estate in AI answers versus competitors.
  4. Prompt Effectiveness: How well your content answers natural language prompts.
  5. Response-To-Conversion Velocity: How quickly AI-influenced prospects convert.
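As a rough sketch, the first three of these metrics can be computed from a log of AI responses. Everything here (the `AIResponse` shape, the brand names) is illustrative, not any real tool's API:

```python
from dataclasses import dataclass

@dataclass
class AIResponse:
    query: str
    platform: str
    brands_cited: list  # brands in citation order; first = primary source

def presence_rate(responses, brand):
    """Share of tracked queries where the brand appears at all."""
    queries = {r.query for r in responses}
    hits = {r.query for r in responses if brand in r.brands_cited}
    return len(hits) / len(queries)

def citation_authority(responses, brand):
    """Share of the brand's citations where it is the primary (first) source."""
    cited = [r for r in responses if brand in r.brands_cited]
    if not cited:
        return 0.0
    primary = sum(1 for r in cited if r.brands_cited[0] == brand)
    return primary / len(cited)

def share_of_conversation(responses, brand):
    """Brand's citations as a share of all brand citations in the sample."""
    total = sum(len(r.brands_cited) for r in responses)
    ours = sum(r.brands_cited.count(brand) for r in responses)
    return ours / total if total else 0.0

# Made-up sample: two unique queries across two platforms
sample = [
    AIResponse("best erp software", "chatgpt", ["Acme", "Globex"]),
    AIResponse("best erp software", "perplexity", ["Globex", "Acme"]),
    AIResponse("erp for retail", "chatgpt", ["Globex"]),
]
print(presence_rate(sample, "Acme"))          # appears for 1 of 2 queries
print(citation_authority(sample, "Acme"))     # primary in 1 of 2 citations
print(share_of_conversation(sample, "Acme"))  # 2 of 5 total citations
```

Prompt effectiveness and conversion velocity need analytics and CRM data joined in, which is why they tend to live in a dashboard rather than a script.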

Position within AI responses matters as much as position in traditional SERPs once did.

Monthly reporting cycles are becoming obsolete. AI-generated results can shift within hours, demanding real-time monitoring capabilities.

The DNA Of AI-Optimized Content

In my experience, content is more likely to be cited by AI when it has:

  • Comprehensive coverage: Content addressing multiple related questions outperforms narrow content.
  • Structured data implementation: Pages with robust schema markup see higher citation rates.
  • Expert validation: Content with clear expert authorship signals receives more citations.
  • Multi-format delivery: Topics presented in multiple formats (text, video, data visualizations) earn more citations.
  • First-party data inclusion: Original research and proprietary data increase citation likelihood.

These patterns suggest AI systems are increasingly sophisticated in their ability to identify genuinely authoritative content versus content merely optimized for traditional ranking factors.

In my last article, I discussed how Google AIO, ChatGPT, and Perplexity differ and where they share some common optimization traits.

Five Actionable Strategies For Triple-P Success

Based on our extensive research, here are five implementation strategies aligned with this framework:

1. Adopt Entity-Based SEO

AI prioritizes content from known, trusted entities. Stop optimizing for fragmented keywords and start building comprehensive topic authority.

Our data shows that authoritative content is three times more likely to be cited in AI responses than narrowly focused pages.

Implementation Steps:

  • Perform an entity audit: Identify how search engines currently understand your brand as an entity.
  • Develop topical maps: Create comprehensive coverage of core topics rather than isolated keywords.
  • Implement entity-based schema: Use structured data to explicitly define your brand’s relationship to key topics.
  • Build consistent entity references: Ensure name, address, and phone (NAP) consistency across all digital properties.
  • Cultivate authoritative connections: Earn mentions and links from recognized authorities in your space.
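To make the schema step concrete, here is a minimal sketch of what an entity-defining JSON-LD block might look like, generated in Python. The brand name, URLs, and address are all placeholders:

```python
import json

# Hypothetical Organization entity: sameAs links tie the brand to known
# profiles, and knowsAbout declares its relationship to core topics.
entity_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Acme Analytics",            # placeholder brand
    "url": "https://www.example.com",
    "sameAs": [
        "https://www.linkedin.com/company/example",
        "https://en.wikipedia.org/wiki/Example",
    ],
    "knowsAbout": ["enterprise SEO", "AI search optimization"],
    "address": {                         # consistent NAP data
        "@type": "PostalAddress",
        "streetAddress": "123 Example St",
        "addressLocality": "San Francisco",
        "addressRegion": "CA",
        "postalCode": "94105",
        "addressCountry": "US",
    },
}

# Emit as a JSON-LD script block for the page template.
print('<script type="application/ld+json">')
print(json.dumps(entity_schema, indent=2))
print("</script>")
```

The same Organization block, kept identical across every page template, also serves the "consistent entity references" step above.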

Enterprise brands implementing entity-based SEO typically see an uplift in AI citations.

2. Implement Perception Management

With 69% of AI brand mentions not explicitly positive, you must actively shape sentiment.

Image from author, May 2025

Brands that implement proactive sentiment management strategies are better positioned to shape how AI frames them.

Implementation Steps:

  • Track AI sentiment: Establish baseline sentiment across AI platforms.
  • Identify perception gaps: Compare AI perceptions against desired brand positioning.
  • Address criticism proactively: Create content that honestly addresses common concerns.
  • Amplify authentic strengths: Develop evidence-based content highlighting genuine advantages.
  • Build consistent messaging: Align key messages across all digital touchpoints.

3. Integrate Real-Time Citation Monitoring

Regularly tracking AI citations is now vital to improving mention rates.

This requires capability beyond traditional rank tracking or Google Search Console analysis.

Implementation Steps:

  • Deploy continuous monitoring: Track AI responses for priority queries across platforms.
  • Implement competitor citation alerts: Get notified when competitors gain or lose citations.
  • Conduct prompt variation testing: Analyze how different user phrasings affect your brand’s inclusion.
  • Track citation position: Monitor where within AI responses your brand appears.
  • Measure citation authority: Assess whether you’re positioned as a primary or secondary source.
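A minimal sketch of what citation tracking and competitor alerts could look like, using plain substring matching on saved response text. Real monitoring needs platform-specific collection, and all names below are made up:

```python
def citation_report(response_text, brands):
    """Where (if at all) each brand first appears in an AI response.

    Position is the character offset of the first mention; a lower
    offset means an earlier, more prominent citation.
    """
    report = {}
    lowered = response_text.lower()
    for brand in brands:
        idx = lowered.find(brand.lower())
        report[brand] = idx if idx >= 0 else None
    return report

def diff_citations(previous, current):
    """Flag brands that gained or lost a citation between two snapshots."""
    gained = [b for b in current
              if current[b] is not None and previous.get(b) is None]
    lost = [b for b in previous
            if previous[b] is not None and current.get(b) is None]
    return {"gained": gained, "lost": lost}

old = citation_report("Globex leads the market; Initech is a niche option.",
                      ["Globex", "Initech", "Acme"])
new = citation_report("Acme and Globex top most shortlists.",
                      ["Globex", "Initech", "Acme"])
print(diff_citations(old, new))  # Acme gained a citation, Initech lost one
```

Running the same report across several phrasings of one query is the prompt-variation testing step: if your brand appears for "best ERP software" but not "which ERP should I buy," that gap is actionable.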

4. Integrate Across Core Search And AI Platforms

Companies that take an integrated approach across traditional search and multiple AI platforms will see higher return on investment (ROI) on search investments.

The future belongs to unified measurement frameworks that connect traditional SEO metrics with emerging AI citation patterns.

Implementation Steps:

  • Build unified dashboards: Integrate traditional search metrics with AI citation data.
  • Map keyword-to-prompt relationships: Connect traditional keywords to conversational AI prompts.
  • Analyze traffic source shifts: Track changing patterns between direct search and AI-referred traffic.
  • Segment by AI platform: Monitor performance variations across different AI search environments.
  • Connect to business outcomes: Tie AI presence metrics directly to conversion and revenue data.
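One way to sketch the unified-dashboard step: join traditional rank data and AI citation data by topic, so one row shows both surfaces. The topics and metrics below are illustrative:

```python
# Hypothetical per-topic metrics from two separate systems
seo_metrics = {
    "erp software": {"rank": 3, "clicks": 1200},
    "crm software": {"rank": 8, "clicks": 300},
}
ai_metrics = {
    "erp software": {"presence_rate": 0.62, "avg_position": 1},
    "crm software": {"presence_rate": 0.10, "avg_position": None},
}

def unified_rows(seo, ai):
    """Merge both metric sets into one row per topic."""
    rows = []
    for topic in sorted(set(seo) | set(ai)):
        row = {"topic": topic}
        row.update(seo.get(topic, {}))
        row.update(ai.get(topic, {}))
        rows.append(row)
    return rows

for row in unified_rows(seo_metrics, ai_metrics):
    print(row)
```

A topic that ranks well but has a low AI presence rate (like the hypothetical "crm software" row) is exactly the kind of gap this joined view is meant to surface.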

5. Use AI To Win At AI

This isn’t theoretical. It’s delivering measurable results:

  • BrightEdge Autopilot users averaged a 65% performance improvement.
  • BrightEdge Copilot users saved 1.2 million content research hours.

The brands succeeding most in AI search leverage AI in their workflows.

Implementation Steps:

  • Automate content research: Use AI to identify comprehensive topic coverage opportunities.
  • Implement AI-driven schema markup: Systematically structure data for machine interpretation.
  • Deploy prompt effectiveness testing: Continuously test how well content answers real user prompts.
  • Create AI-optimized content briefs: Define exactly what comprehensive coverage means for each topic.
  • Analyze AI citation patterns: Identify what characteristics make competitor content citation-worthy.

Teams that use AI for AI optimization gain higher productivity and better performance, a must-have competitive edge in search today.

What’s Coming Next: AI-To-AI Marketing

Looking ahead two to three years, expect AI to evolve from an information assistant into a trusted advisor that buyers rely on for evaluation, comparison, and vendor selection.

We’re already seeing early indicators of AI-to-AI marketing, where procurement teams use AI agents to automate research and vendor vetting.

Emerging Trends:

  • Digital twin marketplaces: Buyers will interact with simulated versions of B2B solutions before speaking with vendors.
  • Vertical-specific AI companions: Industry-specialized models for cybersecurity, manufacturing, and healthcare.
  • AI agent purchasing: Autonomous systems not just researching but also completing transactions on users’ behalf.
  • Continuous entity validation: AI systems continuously monitor brand claims against real-world evidence.
  • Multi-modal search experiences: Voice, image, and text-based AI interactions requiring omnichannel optimization.

Read more: As Chatbots And AI Search Engines Converge: Key Strategies For SEO

The Trust Premium In AI Search

Consumers are more likely to trust brands they already recognize.

  • AI functions as a trust bridge.
  • When consumers delegate decision-making to AI, pre-existing brand familiarity becomes disproportionately influential.
  • The impact is most pronounced in high-consideration purchases.

This creates both a challenge and an opportunity. Established brands must protect their advantage, while emerging brands must strategically build recognition signals detectable by AI.

Organizational Structure For AI Search Success

Leading organizations are already creating “collaborative intelligence” roles – specialists managing the interplay between human creativity and AI amplification.

Successful teams typically include:

  • AI Search Strategists: Focus on overall presence, perception, and performance.
  • Prompt Engineers: Specialize in understanding how users phrase requests to AI.
  • Content Scientists: Develop evidence-based approaches to comprehensive coverage.
  • AI Citation Analysts: Monitor and optimize for inclusion in AI responses.
  • Schema Specialists: Ensure machine-readable structure enhances entity understanding.

These cross-functional teams integrate with traditional SEO, content marketing, analytics, and business intelligence functions.

The Bottom Line

In this new landscape, the question isn’t whether your website ranks. It’s whether AI recommends your brand when it matters most.

The Triple-P framework gives you the structure to navigate this future with confidence.

Here’s what I recommend to get started:

  • Conduct an AI presence audit: Understand where your brand appears in AI responses across key platforms.
  • Analyze sentiment distribution: Assess not just if you’re mentioned, but how you’re portrayed in AI-generated content.
  • Connect AI metrics to business results: Start tracking the relationship between AI presence and conversion patterns.
  • Identify entity perception gaps: Compare how AI systems understand your brand versus your desired positioning.
  • Deploy real-time monitoring: Implement systems to track citation changes as they happen.

The branded AI search revolution isn’t coming – it’s already here.

The brands that embrace the Triple-P framework today will be the ones AI recommends tomorrow.

Note: In March 2025, BrightEdge surveyed over 1,000 of its customers who are marketers. Findings from this survey are referenced above.

Featured Image: Moon Safari/Shutterstock

Is The SEO Job Market Undergoing A Major Shift? via @sejournal, @martinibuster

Anecdotal reports and an SEO jobs study describe a search marketing industry undergoing profound changes, not only in the skills in demand but also in hiring practices that may be making it difficult for experienced SEOs to get the jobs they are well qualified for.

Short History Of SEO Jobs

Twenty-five years ago, getting into SEO and earning a living was relatively easy. Many top corporations across all industries were hiring freelancers and agencies for specialized SEO assistance. I suspect that marketing departments didn’t view SEOs as a subset of marketing and that many didn’t have SEO staff. That gradually changed as more organizations hired dedicated SEO staff, with third-party SEOs providing specialized assistance.

What’s Going On With SEO Jobs?

A recent report from SEOJobs.com shared the state of SEO jobs in 2024.

The following insights show that the job of SEO continues to evolve:

  • SEO job openings declined in 2024.
  • Median SEO salaries dropped.
  • 65% of SEO jobs are in-house.
  • Remote SEO jobs dropped.
  • SEO job titles related to content strategy and writing dropped by 28%.
  • SEO Analyst job titles dropped by 12%.
  • Technical SEO and related titles dropped by a small percentage.
  • Senior-level titles like manager, director, and VP had the strongest increases.

The report says that job titles related to Technical SEO dropped:

“Positions in the Technical SEO and related title group represented 5.8 percent of all SEO jobs during the first quarter of 2024, falling slightly to 5.4 percent by the end of the fourth quarter – a decrease of seven percent.”

But the report also states that Technical SEO is still an in-demand skill:

“…demand for skill in technical SEO grew at the fastest rate of any skill during the fourth quarter, rising to 75 percent from 71 percent the previous quarter.”

Experienced SEOs Having Trouble Getting Hired… By AI?

Keith Goode read the above referenced report and commented that he believes the reason many highly experienced SEOs are failing to get a job is because of a poor implementation of AI into the hiring process.

He shared his insights on a LinkedIn post:

“I have seen superior SEOs languish amongst thousands of candidates, immediately rejected for a lack of experience (??) or funneled through multiple rounds of interviews and work assignments, only to be rudely ghosted by the recruiters.

The cause? I guess you could blame AI if you wanted to shoot the messenger. But the reality is that companies have overinvested in an unproven technology to handle things that it’s not yet ready to handle. I get that recruitment teams are deluged with thousands of resumes for every opening, and I understand they need a way to streamline the screening process.

However, AI has proven to be more of an enemy within than a helper. Anecdotally, I’ve heard about a hiring manager who applied for their own job opening (presumably one they were more than qualified for) only to receive an immediate rejection from the AI-powered ATS. That person fired their hiring team.

(By the way, I’m not anti-AI. I’m anti-foolishness, and a lot of companies are acting like fools.)”

Experienced SEOs Are Getting Ghosted

It may be true that SEOs with decades of experience are being left behind by poor AI vetting. A glaring example is the one shared by Brian Harnish, an SEO with decades of hands-on experience.

Brian recently published the following on LinkedIn and Facebook:

“In this job market, for me it simply appears that nothing matters.

  • You can apply at 6:15 a.m. the day the job posting pops up and be one of the first.
  • You can change your resume 15 times like I have.
  • You can use ResumeWorded.com for an ATS version of your resume.
  • You can write your resume yourself until you’re blue in the face.
  • You can follow up on the interview with thank yous immediately after.
  • You can follow up on interview decisions later.
  • You can agree to their salary ranges exactly. Even when it’s a pay cut for you.
  • A/B testing long vs. short resumes yields the same results.
  • You can tie in all of your achievements with task > impact > website statements on your resume.
  • You write an entirely customized LinkedIn profile.
  • You can know all the right people.
  • You can network up the wazoo.
  • You can have the greatest interview that you feel you’ve ever put forth.

But companies don’t provide feedback. It’s always the same form letter: “while your qualifications are impressive, we went with another candidate.” Or you’re ghosted.

This market is brutal. I really want a job. Not a handout. But nobody appears to want to hire me. At all. Despite doing EVERYthing right. I used to get hired on the spot. Now it’s just crickets.”

What The Heck Is Going On?

I know of other SEOs, also with decades of experience across all areas of SEO, who should have bounced to a new job in a matter of days but took months to get hired. I’m talking about people with SEO director-level experience at top Fortune 500 companies.

How does this happen?

Are you experiencing something similar?

Featured Image by Shutterstock/Ollyy

Google’s AI Max Ads Hone Search Intent

I wrote in December that Google would launch keywordless Search ads in 2025. I based my prediction on Google’s evolving assessment of searchers’ intent. Keywords used to be the sole factor. Now, they are one of many variables that dictate the ads a user sees.

Last week, Google introduced a campaign type called “AI Max for Search.” Keywords are present but as themes instead of the leading indicator.

Artificial intelligence drives the new campaign type. Performance Max campaigns and smart bidding already rely on AI. AI Max moves beyond query matching to capture signals of what searchers seek.

The main features of AI Max are already available in Search as options to turn on or off. With AI Max, advertisers go all in. Google determines:

  • Ads that show from the initial broad match keyword list.
  • Ad text to convert the most searchers.
  • The URL for top performance.

Let’s review each of these components.

Screenshot of AI Max settings in Google Ads admin

With AI Max, Google determines the ad text and the final URL.

Search term matching

Existing Search campaigns include an option to run broad match keywords exclusively, pausing phrase and exact match. Google claims the combination of broad match and smart bidding will improve targeting and thus performance.

Search term matching is a logical iteration of pairing queries with ads. A keyword list is only the beginning. Google’s AI analyzes the keywords, assets, and landing pages to determine the ads to show.

Search term matching is akin to “you might like” suggestions on Netflix based on a user’s viewing history. The same principle applies here. Google will show an ad if it’s relevant to a searcher, regardless of the advertiser’s keyword.

Text customization

Formerly called automatically created assets, text customization allows Google’s AI to use the verbiage from ads, landing pages, and assets to produce customized headlines and descriptions.

For example, an advertiser selling picture frames may write a headline of “All Sizes of Picture Frames.” Google may instead show “4 x 6 Picture Frames” if it determines the searcher wants that dimension and the advertiser carries it.

Advertisers can view Google’s headlines and descriptions via “asset performance” at the account or campaign level and filter by “automatically created” to see the exact assets that showed. Advertisers can remove an asset as needed.

Final URL expansion

Google has long altered advertisers’ URLs in Performance Max and Dynamic Search Ads campaigns. Like text customization, Google changes the final URL to improve performance. Combined, the two features produce an entirely new ad.

Advertisers can now provide URL inclusions and exclusions. Inclusions instruct the AI what URLs to target, such as those manually submitted or in a page feed. URL exclusions can target blog pages, for example, if an advertiser doesn’t want to pay for that traffic.

Other features

Google’s announcement of AI Max for Search included a slew of additional features. One is brand settings, wherein advertisers can include or exclude brand names from their ads.

For example, an advertiser selling only Nike and Adidas shoes could designate a brand inclusion for those names and a brand exclusion for “Reebok” queries.

Google is upgrading reporting transparency, a much-needed improvement after the launch of Performance Max campaigns, which did not include a search term report. AI Max campaigns correct this by including an “AI Max” column in the report to show the query and the combination of query, assets, and the final URL.

In short, Google Ads continues to evolve. Keywords are no longer the primary targeting method. AI has and will reshape the platform and performance.

Did solar power cause Spain’s blackout?

At roughly midday on Monday, April 28, the lights went out in Spain. The grid blackout, which extended into parts of Portugal and France, affected tens of millions of people—flights were grounded, cell networks went down, and businesses closed for the day.

Over a week later, officials still aren’t entirely sure what happened, but some (including the US energy secretary, Chris Wright) have suggested that renewables may have played a role, because just before the outage happened, wind and solar accounted for about 70% of electricity generation. Others, including Spanish government officials, insisted that it’s too early to assign blame.

It’ll take weeks to get the full report, but we do know a few things about what happened. And even as we wait for the bigger picture, there are a few takeaways that could help our future grid.

Let’s start with what we know so far about what happened, according to the Spanish grid operator Red Eléctrica:

  • A disruption in electricity generation took place a little after 12:30 p.m. This may have been a power plant flipping off or some transmission equipment going down.
  • A little over a second later, the grid lost another bit of generation.
  • A few seconds after that, the main interconnector between Spain and southwestern France got disconnected as a result of grid instability.
  • Immediately after, virtually all of Spain’s electricity generation tripped offline.

One of the theories floating around is that things went wrong because the grid diverged from its normal frequency. (All power grids have a set frequency; in Europe the standard is 50 hertz, meaning the alternating current completes 50 full cycles per second.) The frequency needs to be constant across the grid to keep things running smoothly.

There are signs that the outage could be frequency-related. Some experts pointed out that strange oscillations in the grid frequency occurred shortly before the blackout.

Normally, our grid can handle small problems like an oscillation in frequency or a drop that comes from a power plant going offline. But some of the grid’s ability to stabilize itself is tied up in old ways of generating electricity.

Power plants like those that run on coal and natural gas have massive rotating generators. If there are brief issues on the grid that upset the balance, those physical bits of equipment have inertia: They’ll keep moving at least for a few seconds, providing some time for other power sources to respond and pick up the slack. (I’m simplifying here—for more details I’d highly recommend this report from the National Renewable Energy Laboratory.)

Solar panels don’t have inertia—they rely on inverters to change electricity into a form that’s compatible with the grid and matches its frequency. Generally, these inverters are “grid-following,” meaning if frequency is dropping, they follow that drop.

In the case of the blackout in Spain, it’s possible that having a lot of power on the grid coming from sources without inertia made it more possible for a small problem to become a much bigger one.
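The role of inertia can be sketched with the aggregate swing equation, df/dt = -f0·ΔP/(2H), where H is system inertia in seconds and ΔP the lost fraction of generation. The numbers are illustrative, and the model deliberately ignores governors, batteries, and every other form of frequency response, to isolate inertia alone:

```python
def freq_after_loss(h_seconds, power_loss_frac, f0=50.0, dt=0.01, t_end=1.0):
    """Frequency one second after a sudden generation loss, with no
    controls acting: df/dt = -f0 * dP / (2 * H)."""
    f = f0
    for _ in range(int(t_end / dt)):
        f += dt * (-f0 * power_loss_frac / (2.0 * h_seconds))
    return f

# Same 5% generation loss, two levels of system inertia
high_inertia = freq_after_loss(h_seconds=5.0, power_loss_frac=0.05)
low_inertia = freq_after_loss(h_seconds=2.0, power_loss_frac=0.05)
print(round(high_inertia, 2))  # gentler decline, more time to respond
print(round(low_inertia, 2))   # same event, noticeably faster fall
```

The point of the toy model is only the comparison: with less rotating mass behind the grid, the same disturbance drags frequency down faster, leaving less time for other sources to pick up the slack.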

Some key questions here are still unanswered. The order matters, for example. During that drop in generation, did wind and solar plants go offline first? Or did everything go down together?

Whether or not solar and wind contributed to the blackout as a root cause, we do know that wind and solar don’t contribute to grid stability in the same way that some other power sources do, says Seaver Wang, climate lead of the Breakthrough Institute, an environmental research organization. Regardless of whether renewables are to blame, more capability to stabilize the grid would only help, he adds.

It’s not that a renewable-heavy grid is doomed to fail. As Wang put it in an analysis he wrote last week: “This blackout is not the inevitable outcome of running an electricity system with substantial amounts of wind and solar power.”

One solution: We can make sure the grid includes enough equipment that does provide inertia, like nuclear power and hydropower. Reversing a plan to shut down Spain’s nuclear reactors beginning in 2027 would be helpful, Wang says. Other options include building massive machines that lend physical inertia and using inverters that are “grid-forming,” meaning they can actively help regulate frequency and provide a sort of synthetic inertia.

Inertia isn’t everything, though. Grid operators can also rely on installing a lot of batteries that can respond quickly when problems arise. (Spain has much less grid storage than other places with a high level of renewable penetration, like Texas and California.)

Ultimately, if there’s one takeaway here, it’s that as the grid evolves, our methods to keep it reliable and stable will need to evolve too.

If you’re curious to hear more on this story, I’d recommend this Q&A from Carbon Brief about the event and its aftermath and this piece from Heatmap about inertia, renewables, and the blackout.

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

How to build a better AI benchmark

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 to evaluate an AI model’s coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects. 

In the months since then, it’s quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup between three different fine-tunings of Anthropic’s Claude Sonnet model and Amazon’s Q developer agent. Auto Code Rover—one of the Claude modifications—nabbed the number two spot in November, and was acquired just three months later.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” As the benchmark has gained prominence, “you start to see that people really want that top spot,” says John Yang, a researcher on the team that developed SWE-Bench at Princeton University. As a result, entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement.

Developers of these coding agents aren’t necessarily doing anything as straightforward as cheating, but they’re crafting approaches that are too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, Yang noticed that high-scoring models would fail completely when tested on different programming languages—revealing an approach to the test that he describes as “gilded.”

“It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart,” Yang says. “At that point, you’re not designing a software engineering agent. You’re designing to make a SWE-Bench agent, which is much less interesting.”

The SWE-Bench issue is a symptom of a more sweeping—and complicated—problem in AI evaluation, and one that’s increasingly sparking heated debate: The benchmarks the industry uses to guide development are drifting further and further away from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring capabilities and no clear path to better ones. 

“Historically, benchmarks were the way we evaluated AI systems,” says Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI. “Is that the way we want to evaluate systems going forward? And if it’s not, what is the way?”

A growing group of academics and AI researchers are making the case that the answer is to go smaller, trading sweeping ambition for an approach inspired by the social sciences. Specifically, they want to focus more on testing validity, which for quantitative social scientists refers to how well a given questionnaire measures what it’s claiming to measure—and, more fundamentally, whether what it is measuring has a coherent definition. That could cause trouble for benchmarks assessing hazily defined concepts like “reasoning” or “scientific knowledge”—and for developers aiming to reach the much-hyped goal of artificial general intelligence—but it would put the industry on firmer ground as it looks to prove the worth of individual models.

“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Abigail Jacobs, a University of Michigan professor who is a central figure in the new push for validity. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”

The limits of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long. 

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition—but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly. 

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

In a paper from last July, Kapoor called out specific issues in how AI models were approaching the WebArena benchmark, designed by Carnegie Mellon University researchers in 2024 as a test of an AI agent’s ability to traverse the web. The benchmark consists of more than 800 tasks to be performed on a set of cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team identified an apparent hack in the winning model, called STeP. STeP included specific instructions about how Reddit structures URLs, allowing STeP models to jump directly to a given user’s profile page (a frequent element of WebArena tasks).

This shortcut wasn’t exactly cheating, but Kapoor sees it as “a serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time.” Because the technique was successful, though, a similar policy has since been adopted by OpenAI’s web agent Operator. (“Our evaluation setting is designed to assess how well an agent can solve tasks given some instruction about website structures and task execution,” an OpenAI representative said when reached for comment. “This approach is consistent with how others have used and reported results with WebArena.” STeP did not respond to a request for comment.)
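The pattern Kapoor describes can be sketched in a few lines. This is an illustrative toy, not STeP’s actual code: `MockAgent`, the cloned-site URLs, and both functions are hypothetical, but they show how hard-coded knowledge of a site’s URL scheme lets an agent skip the navigation steps the benchmark was designed to measure.

```python
class MockAgent:
    """Stand-in for a browser-driving agent on a cloned Reddit-style site."""
    def __init__(self):
        self.steps = 0  # number of browser actions taken
        self.url = ""

    def goto(self, url):
        self.steps += 1
        self.url = url
        return url

    def click(self, target):
        self.steps += 1

    def type_text(self, text):
        self.steps += 1


def visit_profile_generic(agent, username):
    # What the benchmark intends to test: finding the page step by step.
    agent.goto("https://reddit-clone.example/")
    agent.click("search box")
    agent.type_text(username)
    agent.click(f"result for {username}")
    agent.goto(f"https://reddit-clone.example/user/{username}")
    return agent.url


def visit_profile_gilded(agent, username):
    # The shortcut: baked-in knowledge of how the site builds profile URLs
    # lets the agent jump straight to the answer in a single action.
    return agent.goto(f"https://reddit-clone.example/user/{username}")
```

Both routes end at the same page, but the second one reveals nothing about whether the agent can actually navigate an unfamiliar website—which is exactly the misrepresentation Kapoor objects to.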

Further highlighting the problem with AI benchmarks, late last month Kapoor and a team of researchers wrote a paper that revealed significant problems in Chatbot Arena, the popular crowdsourced evaluation system. According to the paper, the leaderboard was being manipulated; many top foundation models were conducting undisclosed private testing and releasing their scores selectively.

Today, even ImageNet itself, the mother of all benchmarks, has started to fall victim to validity problems. A 2023 study from researchers at the University of Washington and Google Research found that when ImageNet-winning algorithms were pitted against six real-world data sets, the architecture improvement “resulted in little to no progress,” suggesting that the external validity of the test had reached its limit.

Going smaller

For those who believe the main problem is validity, the best fix is reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers “have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can’t anticipate the downstream task anymore.” So what if there was a way to help the downstream consumers identify this gap?

In November 2024, Reuel launched a public ranking project called BetterBench, which rates benchmarks on dozens of different criteria, such as whether the code has been publicly documented. But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.

“You need to have a structural breakdown of the capabilities,” Reuel says. “What are the actual skills you care about, and how do you operationalize them into something we can measure?”

The results are surprising. One of the highest-scoring benchmarks is also the oldest: the Arcade Learning Environment (ALE), established in 2013 as a way to test models’ ability to learn how to play a library of Atari 2600 games. One of the lowest-scoring is the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills; by the standards of BetterBench, the connection between the questions and the underlying skill was too poorly defined.

BetterBench hasn’t meant much for the reputations of specific benchmarks, at least not yet; MMLU is still widely used, and ALE is still marginal. But the project has succeeded in pushing validity into the broader conversation about how to fix benchmarks. In April, Reuel quietly joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she’ll develop her ideas on validity and AI model evaluation with other figures in the field. (An official announcement is expected later this month.) 

Irene Solaiman, Hugging Face’s head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. “There’s just so much hunger for a good benchmark off the shelf that already works,” Solaiman says. “A lot of evaluations are trying to do too much.”

Increasingly, the rest of the industry seems to agree. In a paper in March, researchers from Google, Microsoft, Anthropic, and others laid out a new framework for improving evaluations—with validity as the first step. 

“AI evaluation science must,” the researchers argue, “move beyond coarse grained claims of ‘general intelligence’ towards more task-specific and real-world relevant measures of progress.” 

Measuring the “squishy” things

To help make this shift, some researchers are looking to the tools of social science. A February position paper argued that “evaluating GenAI systems is a social science measurement challenge,” specifically unpacking how the validity systems used in social measurements can be applied to AI benchmarking. 

The authors, largely employed by Microsoft’s research branch but joined by academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, those same procedures could offer a way to measure concepts like “reasoning” and “math proficiency” without slipping into hazy generalizations.

In the social science literature, it’s particularly important that metrics begin with a rigorous definition of the concept measured by the test. For instance, if the test is to measure how democratic a society is, it first needs to establish a definition for a “democratic society” and then establish questions that are relevant to that definition. 

To apply this to a benchmark like SWE-Bench, designers would need to set aside the classic machine learning approach, which is to collect programming problems from GitHub and create a scheme to validate answers as true or false. Instead, they’d first need to define what the benchmark aims to measure (“ability to resolve flagged issues in software,” for instance), break that into subskills (different types of problems or types of program that the AI model can successfully process), and then finally assemble questions that accurately cover the different subskills.

It’s a profound change from how AI researchers typically approach benchmarking—but for researchers like Jacobs, a coauthor on the February paper, that’s the whole point. “There’s a mismatch between what’s happening in the tech industry and these tools from social science,” she says. “We have decades and decades of thinking about how we want to measure these squishy things about humans.”

Even though the idea has made a real impact in the research world, it’s been slow to influence the way AI companies are actually using benchmarks. 

The last two months have seen new model releases from OpenAI, Anthropic, Google, and Meta, and all of them lean heavily on multiple-choice knowledge benchmarks like MMLU—the exact approach that validity researchers are trying to move past. After all, model releases are, for the most part, still about showing increases in general intelligence, and broad benchmarks continue to be used to back up those claims. 

For some observers, that’s good enough. Benchmarks, Wharton professor Ethan Mollick says, are “bad measures of things, but also they’re what we’ve got.” He adds: “At the same time, the models are getting better. A lot of sins are forgiven by fast progress.”

For now, the industry’s long-standing focus on artificial general intelligence seems to be crowding out a more focused validity-based approach. As long as AI models can keep growing in general intelligence, then specific applications don’t seem as compelling—even if that leaves practitioners relying on tools they no longer fully trust. 

“This is the tightrope we’re walking,” says Hugging Face’s Solaiman. “It’s too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations.”

Russell Brandom is a freelance writer covering artificial intelligence. He lives in Brooklyn with his wife and two cats.

This story was supported by a grant from the Tarbell Center for AI Journalism.

The Download: AI benchmarks, and Spain’s grid blackout

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

How to build a better AI benchmark

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 as a way to evaluate an AI model’s coding skill. It has since quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” Entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement. Read the full story.

—Russell Brandom

Did solar power cause Spain’s blackout?

At roughly midday on Monday, April 28, the lights went out in Spain. The grid blackout, which extended into parts of Portugal and France, affected tens of millions of people—flights were grounded, cell networks went down, and businesses closed for the day.

Over a week later, officials still aren’t entirely sure what happened, but some have suggested that renewables may have played a role, because just before the outage happened, wind and solar accounted for about 70% of electricity generation. Others, including Spanish government officials, insist that it’s too early to assign blame.

It’ll take weeks to get the full report, but we do know a few things about what happened. Here are a few takeaways that could help our future grid. 

—Casey Crownhart

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 The Trump administration will repeal some global chip curbs 
It’s drawing up new rules that prioritize direct negotiations with various nations. (Bloomberg $)
+ The curbs have always been leaky anyway. (Economist $)

2 India and Pakistan have accused each other of overnight drone attacks
The conflict between the two countries is rapidly escalating. (The Guardian)
+ Pakistan claims to have shot down 25 drones in its airspace. (Reuters)
+ Mass-market military drones have changed the way wars are fought. (MIT Technology Review)

3 The FDA is interested in using AI for drug evaluation
And has met with OpenAI to hear more about how to do it. (Wired $)
+ An AI-driven “factory of drugs” claims to have hit a big milestone. (MIT Technology Review)

4 The US is pushing nations facing its tariffs to adopt Starlink
Government officials in India and other countries have fast-tracked approvals. (WP $)
+ India recently announced new rules for satellite internet providers. (Rest of World)

5 Apple is overhauling its Safari browser to focus on AI search
Its search volume is down for the first time in 22 years. (The Verge)
+ Apple exec Eddy Cue thinks AI search will replace traditional search engines. (Bloomberg $)
+ AI means the end of internet search as we’ve known it. (MIT Technology Review)

6 Mark Zuckerberg is betting big on AI chatbots
He’s on a media charm offensive to convince us that AI friends are the future. (WSJ $)
+ The AI relationship revolution is already here. (MIT Technology Review)

7 Students can’t wean themselves off ChatGPT
And experts fear that they’ll emerge into the workforce essentially illiterate. (NY Mag $)
+ Some educators believe that AI highlights how the ways we teach need to change. (MIT Technology Review)

8 We don’t really know how memory works 🧠
But these researchers are doing their best to find out. (Quanta Magazine)

9 The vast majority of the sea depths are still unexplored
What lies beneath is a mystery. (New Scientist $)
+ Meet the divers trying to figure out how deep humans can go. (MIT Technology Review)

10 Pet psychics are taking over TikTok 🔮
But does your furry friend have anything to say? (NYT $)
+ Humans are still better than AI at futuregazing—for now. (Vox)
+ How DeepSeek became a fortune teller for China’s youth. (MIT Technology Review)

Quote of the day

“It’s like living in hell.”

—Elizabeth Martorana, a Virginia resident, describes what it’s like to live in a development zone for Amazon, Microsoft, and Google data centers, Semafor reports.

One more thing

How Antarctica’s history of isolation is ending—thanks to Starlink

“This is one of the least visited places on planet Earth and I got to open the door,” Matty Jordan, a construction specialist at New Zealand’s Scott Base in Antarctica, wrote in the caption to the video he posted to Instagram and TikTok in October 2023.

In the video, he guides viewers through the hut, pointing out where the men of Ernest Shackleton’s 1907 expedition lived and worked.

The video has racked up millions of views from all over the world. It’s also kind of a miracle: until very recently, those who lived and worked on Antarctic bases had no hope of communicating so readily with the outside world.

That’s starting to change, thanks to Starlink, the satellite constellation developed by Elon Musk’s company SpaceX to service the world with high-speed broadband internet. Read the full story.

—Allegra Rosenberg

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ Does Boston still drink? Not in the same way it used to.
+ Where in the US you should set up camp to stargaze right now.
+ Wow: this New Zealand snail lays eggs from its neck. 🐌
+ Jurassic World Rebirth is coming: and it looks suitably bonkers.

Your gut microbes might encourage criminal behavior

A few years ago, a Belgian man in his 30s drove into a lamppost. Twice. Local authorities found that his blood alcohol level was four times the legal limit. Over the space of a few years, the man was apprehended for drunk driving three times. And on all three occasions, he insisted he hadn’t been drinking.

He was telling the truth. A doctor later diagnosed auto-brewery syndrome—a rare condition in which the body makes its own alcohol. Microbes living inside the man’s body were fermenting the carbohydrates in his diet to create ethanol. Last year, he was acquitted of drunk driving.

His case, along with several other scientific studies, raises a fascinating question for microbiology, neuroscience, and the law: How much of our behavior can we blame on our microbes?

Each of us hosts vast communities of tiny bacteria, archaea (which are a bit like bacteria), fungi, and even viruses all over our bodies. The largest collection resides in our guts, which play home to trillions of them. You have more microbial cells than human cells in your body. In some ways, we’re more microbe than human.

Microbiologists are still getting to grips with what all these microbes do. Some seem to help us break down food. Others produce chemicals that are important for our health in some way. But the picture is extremely complicated, partly because of the myriad ways microbes can interact with each other.

But they also interact with the human nervous system. Microbes can produce compounds that affect the way neurons work. They also influence the functioning of the immune system, which can have knock-on effects on the brain. And they seem to be able to communicate with the brain via the vagus nerve.

If microbes can influence our brains, could they also explain some of our behavior, including the criminal sort? Some microbiologists think so, at least in theory. “Microbes control us more than we think they do,” says Emma Allen-Vercoe, a microbiologist at the University of Guelph in Canada.

Researchers have come up with a name for applications of microbiology to criminal law: the legalome. A better understanding of how microbes influence our behavior could not only affect legal proceedings but also shape crime prevention and rehabilitation efforts, argue Susan Prescott, a pediatrician and immunologist at the University of Western Australia, and her colleagues.

“For the person unaware that they have auto-brewery syndrome, we can argue that microbes are like a marionettist pulling the strings in what would otherwise be labeled as criminal behavior,” says Prescott.

Auto-brewery syndrome is a fairly straightforward example (it has been involved in the acquittal of at least two people so far), but other brain-microbe relationships are likely to be more complicated. We do know a little about one microbe that seems to influence behavior: Toxoplasma gondii, a parasite that reproduces in cats and spreads to other animals via cat feces.

The parasite is best known for changing the behavior of rodents in ways that make them easier prey—an infection seems to make mice permanently lose their fear of cats. Research in humans is nowhere near conclusive, but some studies have linked infections with the parasite to personality changes, increased aggression, and impulsivity.

“That’s an example of microbiology that we know affects the brain and could potentially affect the legal standpoint of someone who’s being tried for a crime,” says Allen-Vercoe. “They might say ‘My microbes made me do it,’ and I might believe them.”

There’s more evidence linking gut microbes to behavior in mice, which are some of the most well-studied creatures. One study involved fecal transplants—a procedure that involves inserting fecal matter from one animal into the intestines of another. Because feces contain so much gut bacteria, fecal transplants can go some way to swap out a gut microbiome. (Humans are doing this too—and it seems to be a remarkably effective way to treat persistent C. difficile infections in people.)

Back in 2013, scientists at McMaster University in Canada performed fecal transplants between two strains of mice, one that is known for being timid and another that tends to be rather gregarious. This swapping of gut microbes also seemed to swap their behavior—the timid mice became more gregarious, and vice versa.

Microbiologists have since held up this study as one of the clearest demonstrations of how changing gut microbes can change behavior—at least in mice. “But the question is: How much do they control you, and how much is the human part of you able to overcome that control?” says Allen-Vercoe. “And that’s a really tough question to answer.”

After all, our gut microbiomes, though relatively stable, can change. Your diet, exercise routine, environment, and even the people you live with can shape the communities of microbes that live on and in you. And the ways these communities shift and influence behavior might be slightly different for everyone. Pinning down precise links between certain microbes and criminal behaviors will be extremely difficult, if not impossible. 

“I don’t think you’re going to be able to take someone’s microbiome and say ‘Oh, look—you’ve got bug X, and that means you’re a serial killer,’” says Allen-Vercoe.

Either way, Prescott hopes that advances in microbiology and metabolomics might help us better understand the links between microbes, the chemicals they produce, and criminal behaviors—and potentially even treat those behaviors.

“We could get to a place where microbial interventions are a part of therapeutic programming,” she says.

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

A new AI translation system for headphones clones multiple voices simultaneously

Imagine going for dinner with a group of friends who switch in and out of different languages you don’t speak, but still being able to understand what they’re saying. This scenario is the inspiration for a new AI headphone system that translates the speech of multiple speakers simultaneously, in real time.

The system, called Spatial Speech Translation, tracks the direction and vocal characteristics of each speaker, helping the person wearing the headphones to identify who is saying what in a group setting. 

“There are so many smart people across the world, and the language barrier prevents them from having the confidence to communicate,” says Shyam Gollakota, a professor at the University of Washington, who worked on the project. “My mom has such incredible ideas when she’s speaking in Telugu, but it’s so hard for her to communicate with people in the US when she visits from India. We think this kind of system could be transformative for people like her.”

While there are plenty of other live AI translation systems out there, such as the one running on Meta’s Ray-Ban smart glasses, they focus on a single speaker, not multiple people speaking at once, and deliver robotic-sounding automated translations. The new system is designed to work with existing, off-the-shelf noise-canceling headphones that have microphones, plugged into a laptop powered by Apple’s M2 silicon chip, which can support neural networks. The same chip is also present in the Apple Vision Pro headset. The research was presented at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan, this month.

Over the past few years, large language models have driven big improvements in speech translation. As a result, translation between languages for which lots of training data is available (such as the four languages used in this study) is close to perfect on apps like Google Translate or in ChatGPT. But it’s still not seamless and instant across many languages. That’s a goal a lot of companies are working toward, says Alina Karakanta, an assistant professor at Leiden University in the Netherlands, who studies computational linguistics and was not involved in the project. “I feel that this is a useful application. It can help people,” she says. 

Spatial Speech Translation consists of two AI models, the first of which divides the space surrounding the person wearing the headphones into small regions and uses a neural network to search for potential speakers and pinpoint their direction. 

The second model then translates the speakers’ words from French, German, or Spanish into English text using publicly available data sets. The same model extracts the unique characteristics and emotional tone of each speaker’s voice, such as the pitch and the amplitude, and applies those properties to the text, essentially creating a “cloned” voice. This means that when the translated version of a speaker’s words is relayed to the headphone wearer a few seconds later, it sounds as if it’s coming from the speaker’s direction and the voice sounds a lot like the speaker’s own, not a robotic-sounding computer.

Given that separating out human voices is hard enough for AI systems, being able to incorporate that ability into a real-time translation system, map the distance between the wearer and the speaker, and achieve decent latency on a real device is impressive, says Samuele Cornell, a postdoctoral researcher at Carnegie Mellon University’s Language Technologies Institute, who did not work on the project.

“Real-time speech-to-speech translation is incredibly hard,” he says. “Their results are very good in the limited testing settings. But for a real product, one would need much more training data—possibly with noise and real-world recordings from the headset, rather than purely relying on synthetic data.”

Gollakota’s team is now focusing on reducing the amount of time it takes for the AI translation to kick in after a speaker says something, which will accommodate more natural-sounding conversations between people speaking different languages. “We want to really get down that latency significantly to less than a second, so that you can still have the conversational vibe,” Gollakota says.

This remains a major challenge, because the speed at which an AI system can translate one language into another depends on the languages’ structure. Of the three languages Spatial Speech Translation was trained on, the system was quickest to translate French into English, followed by Spanish and then German—reflecting how German, unlike the other languages, places a sentence’s verbs and much of its meaning at the end and not at the beginning, says Claudio Fantinuoli, a researcher at the Johannes Gutenberg University of Mainz in Germany, who did not work on the project. 

Reducing the latency could make the translations less accurate, he warns: “The longer you wait [before translating], the more context you have, and the better the translation will be. It’s a balancing act.”

The Download: AI headphone translation, and the link between microbes and our behavior

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

A new AI translation system for headphones clones multiple voices simultaneously

What’s new: Imagine going for dinner with a group of friends who switch in and out of different languages you don’t speak, but still being able to understand what they’re saying. This scenario is the inspiration for a new AI headphone system that translates the speech of multiple speakers simultaneously, in real time.

How it works: The system tracks the direction and vocal characteristics of each speaker, helping the person wearing the headphones to identify who is saying what in a group setting. Read the full story.

—Rhiannon Williams

Your gut microbes might encourage criminal behavior

A few years ago, a Belgian man in his 30s drove into a lamppost. Twice. Local authorities found that his blood alcohol level was four times the legal limit. Over the space of a few years, the man was apprehended for drunk driving three times. And on all three occasions, he insisted he hadn’t been drinking.

He was telling the truth. A doctor later diagnosed auto-brewery syndrome—a rare condition in which the body makes its own alcohol. Microbes living inside the man’s body were fermenting the carbohydrates in his diet to create ethanol. Last year, he was acquitted of drunk driving.

His case, along with several other scientific studies, raises a fascinating question for microbiology, neuroscience, and the law: How much of our behavior can we blame on our microbes? Read the full story.

—Jessica Hamzelou

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 How the Gates Foundation will end
Bill Gates will wind it down in 2045, after distributing most of his remaining fortune. (NYT $)
+ He estimates he’ll give away $200 billion in the next 20 years. (Semafor)
+ The foundation is shuttering several decades earlier than he expected. (BBC)

2 US Customs and Border Protection will no longer protect pregnant women
It’s rolled back policies designed to protect vulnerable people, including infants. (Wired $)
+ The US wants to use facial recognition to identify migrant children as they age. (MIT Technology Review)

3 DOGE is readying software to turbo-charge mass layoffs
After some 260,000 government workers have already been let go. (Reuters)
+ DOGE’s math doesn’t add up. (The Atlantic $)
+ One of its biggest inspirations is no fan of the program. (WP $)
+ Can AI help DOGE slash government budgets? It’s complex. (MIT Technology Review)

4 Scientists are using AI to predict cancer survival outcomes
In some cases, it’s outperforming clinicians’ forecasts. (FT $)
+ Why it’s so hard to use AI to diagnose cancer. (MIT Technology Review)

5 Apple is reportedly working on new chips for its smart glasses
But we’ll have to wait a few more years. (Bloomberg $)
+ What’s next for smart glasses. (MIT Technology Review)

6 Silicon Valley has a vision for the future of warfare
Military technologies are no longer solely the preserve of governments. (Bloomberg $)
+ Palmer Luckey on the Pentagon’s future of mixed reality. (MIT Technology Review)

7 AI companies don’t want regulation any more
Just a few short years after they claimed regulation was the best way of making AI safe. (WP $)

8 Forget SEO, GEO is where it’s at these days
Marketers are scrambling to adopt Generative Engine Optimization best practices now that AI is upending how we search the web. (WSJ $)
+ Your most important customer may be AI. (MIT Technology Review)

9 AI-generated recruiters are making job hunting even worse
Avatars can glitch out and stumble over their words. (404 Media)

10 A Soviet-era spacecraft is reentering Earth’s atmosphere
More than 50 years after it misfired on a journey to Venus. (Ars Technica)
+ The world’s next big environmental problem could come from space. (MIT Technology Review)

Quote of the day

“The picture of the world’s richest man killing the world’s poorest children is not a pretty one.”

—Bill Gates lashes out at Elon Musk’s cuts to USAID in an interview with the Financial Times.

One more thing

The great commercial takeover of low Earth orbit

NASA designed the International Space Station to fly for 20 years. It has lasted six years longer than that, though it is showing its age, and NASA is currently studying how to safely destroy the space laboratory by around 2030.

The ISS never really became what some had hoped: a launching point for an expanding human presence in the solar system. But it did enable fundamental research on materials and medicine, and it helped us start to understand how space affects the human body.

To build on that work, NASA has partnered with private companies to develop new, commercial space stations for research, manufacturing, and tourism. If they are successful, these companies will bring about a new era of space exploration: private rockets flying to private destinations. They’re already planning to do it around the moon. One day, Mars could follow. Read the full story.

—David W. Brown

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ It’s almost pasta salad time!
+ Who is the better fictional archaeologist: Indiana Jones or Lara Croft?
+ How a good night’s sleep could help to give you a long-lasting memory boost. 😴
+ How millennials became deeply uncool (allegedly).

How cloud and AI transform and improve customer experiences

As AI technologies become increasingly mainstream, there’s mounting competitive pressure to transform traditional infrastructures and technology stacks. Traditional brick-and-mortar companies are finding cloud and data to be the foundational keys to unlocking their paths to digital transformation, and to competing in modern, AI-forward industry landscapes. 

In this exclusive webcast, experts discuss the building blocks for digital transformation, approaches for upskilling employees and putting digital processes in place, and data management best practices. The discussion also looks at what the near future holds and emphasizes the urgency for companies to transform now to stay relevant. 

Learn from the experts

  • Digital transformation, from the ground up, starts by moving infrastructure and data to the cloud
  • AI implementation requires a talent transformation at scale, across the organization
  • AI is a company-wide initiative—everyone in the company will become either an AI creator or consumer

Featured speakers

Mohammed Rafee Tarafdar, Chief Technology Officer, Infosys

Rafee is Infosys’s Chief Technology Officer. He is responsible for the company’s technology vision and strategy, for sensing and scaling emerging technologies, for advising and partnering with clients to help them succeed in their AI transformation journeys, and for building high technology talent density. He leads the AI-first transformation journey for Infosys and has implemented population- and enterprise-scale platforms. He is co-author of the book “The Live Enterprise,” and was recognized as a top 50 global technology leader by Forbes in 2023 and a Top 25 Tech Wavemaker by Entrepreneur India magazine in 2024.

Sam Jaddi, Chief Information Officer, ADT

Sam Jaddi is the Chief Information Officer for ADT. With more than 26 years of experience in technology innovation, Sam has deep knowledge of the security and smart home industry. His team helps drive ADT’s business platforms and processes to improve both customer and employee experiences. Sam has helped set the technology strategy, vision, and direction for the company’s digital transformation. Prior to his role at ADT, Sam served as Chief Technology Officer at Stanley, overseeing the company’s new security division and leading global integration initiatives, IT strategy, transformation, and international operations.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.