Google DeepMind’s new AI agent uses large language models to crack real-world problems

Google DeepMind has once again used large language models to discover new solutions to long-standing problems in math and computer science. This time the firm has shown that its approach can not only tackle unsolved theoretical puzzles, but improve a range of important real-world processes as well.

Google DeepMind’s new tool, called AlphaEvolve, uses the Gemini 2.0 family of large language models (LLMs) to produce code for a wide range of different tasks. LLMs are known to be hit and miss at coding. The twist here is that AlphaEvolve scores each of Gemini’s suggestions, throwing out the bad and tweaking the good, in an iterative process, until it has produced the best algorithm it can. In many cases, the results are more efficient or more accurate than the best existing (human-written) solutions.

“You can see it as a sort of super coding agent,” says Pushmeet Kohli, a vice president at Google DeepMind who leads its AI for Science teams. “It doesn’t just propose a piece of code or an edit, it actually produces a result that maybe nobody was aware of.”

In particular, AlphaEvolve came up with a way to improve the software Google uses to allocate jobs to its many millions of servers around the world. Google DeepMind claims the company has been using this new software across all of its data centers for more than a year, freeing up 0.7% of Google’s total computing resources. That might not sound like much, but at Google’s scale it’s huge.

Jakob Moosbauer, a mathematician at the University of Warwick in the UK, is impressed. He says the way AlphaEvolve searches for algorithms that produce specific solutions—rather than searching for the solutions themselves—makes it especially powerful. “It makes the approach applicable to such a wide range of problems,” he says. “AI is becoming a tool that will be essential in mathematics and computer science.”

AlphaEvolve continues a line of work that Google DeepMind has been pursuing for years. Its vision is that AI can help to advance human knowledge across math and science. In 2022, it developed AlphaTensor, a model that found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years. In 2023, it revealed AlphaDev, which discovered faster ways to perform a number of basic calculations performed by computers trillions of times a day. AlphaTensor and AlphaDev both turn math problems into a kind of game, then search for a winning series of moves.

FunSearch, which arrived in late 2023, swapped out game-playing AI and replaced it with LLMs that can generate code. Because LLMs can carry out a range of tasks, FunSearch can take on a wider variety of problems than its predecessors, which were trained to play just one type of game. The tool was used to crack a famous unsolved problem in pure mathematics.

AlphaEvolve is the next generation of FunSearch. Instead of coming up with short snippets of code to solve a specific problem, as FunSearch did, it can produce programs that are hundreds of lines long. This makes it applicable to a much wider variety of problems.    

In theory, AlphaEvolve could be applied to any problem that can be described in code and that has solutions that can be evaluated by a computer. “Algorithms run the world around us, so the impact of that is huge,” says Matej Balog, a researcher at Google DeepMind who leads the algorithm discovery team.

Survival of the fittest

Here’s how it works: AlphaEvolve can be prompted like any LLM. Give it a description of the problem and any extra hints you want, such as previous solutions, and AlphaEvolve will get Gemini 2.0 Flash (the smallest, fastest version of Google DeepMind’s flagship LLM) to generate multiple blocks of code to solve the problem.

It then takes these candidate solutions, runs them to see how accurate or efficient they are, and scores them according to a range of relevant metrics. Does this code produce the correct result? Does it run faster than previous solutions? And so on.

AlphaEvolve then takes the best of the current batch of solutions and asks Gemini to improve them. Sometimes AlphaEvolve will throw a previous solution back into the mix to prevent Gemini from hitting a dead end.

When it gets stuck, AlphaEvolve can also call on Gemini 2.0 Pro, the most powerful of Google DeepMind’s LLMs. The idea is to generate many solutions with the faster Flash but add solutions from the slower Pro when needed.

These rounds of generation, scoring, and regeneration continue until Gemini fails to come up with anything better than what it already has.

Number games

The team tested AlphaEvolve on a range of different problems. For example, they looked at matrix multiplication again to see how a general-purpose tool like AlphaEvolve compared to the specialized AlphaTensor. Matrices are grids of numbers. Matrix multiplication is a basic computation that underpins many applications, from AI to computer graphics, yet nobody knows the fastest way to do it. “It’s kind of unbelievable that it’s still an open question,” says Balog.

The team gave AlphaEvolve a description of the problem and an example of a standard algorithm for solving it. The tool not only produced new algorithms that could calculate 14 different sizes of matrix faster than any existing approach, it also improved on AlphaTensor’s record-beating result for multipying two four-by-four matrices.

AlphaEvolve scored 16,000 candidates suggested by Gemini to find the winning solution, but that’s still more efficient than AlphaTensor, says Balog. AlphaTensor’s solution also only worked when a matrix was filled with 0s and 1s. AlphaEvolve solves the problem with other numbers too.

“The result on matrix multiplication is very impressive,” says Moosbauer. “This new algorithm has the potential to speed up computations in practice.”

Manuel Kauers, a mathematician at Johannes Kepler University in Linz, Austria, agrees: “The improvement for matrices is likely to have practical relevance.”

By coincidence, Kauers and a colleague have just used a different computational technique to find some of the speedups AlphaEvolve came up with. The pair posted a paper online reporting their results last week.

“It is great to see that we are moving forward with the understanding of matrix multiplication,” says Kauers. “Every technique that helps is a welcome contribution to this effort.”

Real-world problems

Matrix multiplication was just one breakthrough. In total, Google DeepMind tested AlphaEvolve on more than 50 different types of well-known math puzzles, including problems in Fourier analysis (the math behind data compression, essential to applications such as video streaming), the minimum overlap problem (an open problem in number theory proposed by mathematician Paul Erdős in 1955), and kissing numbers (a problem introduced by Isaac Newton that has applications in materials science, chemistry, and cryptography). AlphaEvolve matched the best existing solutions in 75% of cases and found better solutions in 20% of cases.  

Google DeepMind then applied AlphaEvolve to a handful of real-world problems. As well as coming up with a more efficient algorithm for managing computational resources across data centers, the tool found a way to reduce the power consumption of Google’s specialized tensor processing unit chips.

AlphaEvolve even found a way to speed up the training of Gemini itself, by producing a more efficient algorithm for managing a certain type of computation used in the training process.

Google DeepMind plans to continue exploring potential applications of its tool. One limitation is that AlphaEvolve can’t be used for problems with solutions that need to be scored by a person, such as lab experiments that are subject to interpretation.   

Moosbauer also points out that while AlphaEvolve may produce impressive new results across a wide range of problems, it gives little theoretical insight into how it arrived at those solutions. That’s a drawback when it comes to advancing human understanding.  

Even so, tools like AlphaEvolve are set to change the way researchers work. “I don’t think we are finished,” says Kohli. “There is much further that we can go in terms of how powerful this type of approach is.”

How a new type of AI is helping police skirt facial recognition bans

Police and federal agencies have found a controversial new way to skirt the growing patchwork of laws that curb how they use facial recognition: an AI model that can track people using attributes like body size, gender, hair color and style, clothing, and accessories. 

The tool, called Track and built by the video analytics company Veritone, is used by 400 customers, including state and local police departments and universities all over the US. It is also expanding federally: US attorneys at the Department of Justice began using Track for criminal investigations last August. Veritone’s broader suite of AI tools, which includes bona fide facial recognition, is also used by the Department of Homeland Security—which houses immigration agencies—and the Department of Defense, according to the company. 

“The whole vision behind Track in the first place,” says Veritone CEO Ryan Steelberg, was “if we’re not allowed to track people’s faces, how do we assist in trying to potentially identify criminals or malicious behavior or activity?” In addition to tracking individuals where facial recognition isn’t legally allowed, Steelberg says, it allows for tracking when faces are obscured or not visible. 

The product has drawn criticism from the American Civil Liberties Union, which—after learning of the tool through MIT Technology Review—said it was the first instance they’d seen of a nonbiometric tracking system used at scale in the US. They warned that it raises many of the same privacy concerns as facial recognition but also introduces new ones at a time when the Trump administration is pushing federal agencies to ramp up monitoring of protesters, immigrants, and students.

Veritone gave us a demonstration of Track in which it analyzed people in footage from different environments, ranging from the January 6 riots to subway stations. You can use it to find people by specifying body size, gender, hair color and style, shoes, clothing, and various accessories. The tool can then assemble timelines, tracking a person across different locations and video feeds. It can be accessed through Amazon and Microsoft cloud platforms.

VERITONE; MIT TECHNOLOGY REVIEW (CAPTIONS)

In an interview, Steelberg said that the number of attributes Track uses to identify people will continue to grow. When asked if Track differentiates on the basis of skin tone, a company spokesperson said it’s one of the attributes the algorithm uses to tell people apart but that the software does not currently allow users to search for people by skin color. Track currently operates only on recorded video, but Steelberg claims the company is less than a year from being able to run it on live video feeds.

Agencies using Track can add footage from police body cameras, drones, public videos on YouTube, or so-called citizen upload footage (from Ring cameras or cell phones, for example) in response to police requests.

“We like to call this our Jason Bourne app,” Steelberg says. He expects the technology to come under scrutiny in court cases but says, “I hope we’re exonerating people as much as we’re helping police find the bad guys.” The public sector currently accounts for only 6% of Veritone’s business (most of its clients are media and entertainment companies), but the company says that’s its fastest-growing market, with clients in places including California, Washington, Colorado, New Jersey, and Illinois. 

That rapid expansion has started to cause alarm in certain quarters. Jay Stanley, a senior policy analyst at the ACLU, wrote in 2019 that artificial intelligence would someday expedite the tedious task of combing through surveillance footage, enabling automated analysis regardless of whether a crime has occurred. Since then, lots of police-tech companies have been building video analytics systems that can, for example, detect when a person enters a certain area. However, Stanley says, Track is the first product he’s seen make broad tracking of particular people technologically feasible at scale.

“This is a potentially authoritarian technology,” he says. “One that gives great powers to the police and the government that will make it easier for them, no doubt, to solve certain crimes, but will also make it easier for them to overuse this technology, and to potentially abuse it.”

Chances of such abusive surveillance, Stanley says, are particularly high right now in the federal agencies where Veritone has customers. The Department of Homeland Security said last month that it will monitor the social media activities of immigrants and use evidence it finds there to deny visas and green cards, and Immigrations and Customs Enforcement has detained activists following pro-Palestinian statements or appearances at protests. 

In an interview, Jon Gacek, general manager of Veritone’s public-sector business, said that Track is a “culling tool” meant to speed up the task of identifying important parts of videos, not a general surveillance tool. Veritone did not specify which groups within the Department of Homeland Security or other federal agencies use Track. The Departments of Defense, Justice, and Homeland Security did not respond to requests for comment.

For police departments, the tool dramatically expands the amount of video that can be used in investigations. Whereas facial recognition requires footage in which faces are clearly visible, Track doesn’t have that limitation. Nathan Wessler, an attorney for the ACLU, says this means police might comb through videos they had no interest in before. 

“It creates a categorically new scale and nature of privacy invasion and potential for abuse that was literally not possible any time before in human history,” Wessler says. “You’re now talking about not speeding up what a cop could do, but creating a capability that no cop ever had before.”

Track’s expansion comes as laws limiting the use of facial recognition have spread, sparked by wrongful arrests in which officers have been overly confident in the judgments of algorithms.  Numerous studies have shown that such algorithms are less accurate with nonwhite faces. Laws in Montana and Maine sharply limit when police can use it—it’s not allowed in real time with live video—while San Francisco and Oakland, California have near-complete bans on facial recognition. Track provides an alternative. 

Though such laws often reference “biometric data,” Wessler says this phrase is far from clearly defined. It generally refers to immutable characteristics like faces, gait and fingerprints rather than things that change, like clothing. But certain attributes, such as body size, blur this distinction. 

Consider also, Wessler says, someone in winter who frequently wears the same boots, coat, and backpack. “Their profile is going to be the same day after day,” Wessler says. “The potential to track somebody over time based on how they’re moving across a whole bunch of different saved video feeds is pretty equivalent to face recognition.”

In other words, Track might provide a way of following someone that raises many of the same concerns as facial recognition, but isn’t subject to laws restricting use of facial recognition because it does not technically involve biometric data. Steelberg said there are several ongoing cases that include video evidence from Track, but that he couldn’t name the cases or comment further. So for now, it’s unclear whether it’s being adopted in jurisdictions where facial recognition is banned. 

How to build a better AI benchmark

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 to evaluate an AI model’s coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects. 

In the months since then, it’s quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup between three different fine tunings of Anthropic’s Claude Sonnet model and Amazon’s Q developer agent. Auto Code Rover—one of the Claude modifications—nabbed the number two spot in November, and was acquired just three months later.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” As the benchmark has gained prominence, “you start to see that people really want that top spot,” says John Yang, a researcher on the team that developed SWE-Bench at Princeton University. As a result, entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement.

Developers of these coding agents aren’t necessarily doing anything as straightforward cheating, but they’re crafting approaches that are too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, Yang noticed that high-scoring models would fail completely when tested on different programming languages—revealing an approach to the test that he describes as “gilded.”

“It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart,” Yang says. “At that point, you’re not designing a software engineering agent. You’re designing to make a SWE-Bench agent, which is much less interesting.”

The SWE-Bench issue is a symptom of a more sweeping—and complicated—problem in AI evaluation, and one that’s increasingly sparking heated debate: The benchmarks the industry uses to guide development are drifting further and further away from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under heat for an alleged lack of transparency. Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring capabilities and no clear path to better ones. 

“Historically, benchmarks were the way we evaluated AI systems,” says Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI. “Is that the way we want to evaluate systems going forward? And if it’s not, what is the way?”

A growing group of academics and AI researchers are making the case that the answer is to go smaller, trading sweeping ambition for an approach inspired by the social sciences. Specifically, they want to focus more on testing validity, which for quantitative social scientists refers to how well a given questionnaire measures what it’s claiming to measure—and, more fundamentally, whether what it is measuring has a coherent definition. That could cause trouble for benchmarks assessing hazily defined concepts like “reasoning” or “scientific knowledge”—and for developers aiming to reach the muchhyped goal of artificial general intelligence—but it would put the industry on firmer ground as it looks to prove the worth of individual models.

“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Abigail Jacobs, a University of Michigan professor who is a central figure in the new push for validity. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”

The limits of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long. 

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition—but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly. 

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

In a paper from last July, Kapoor called out specific issues in how AI models were approaching the WebArena benchmark, designed by Carnegie Mellon University researchers in 2024 as a test of an AI agent’s ability to traverse the web. The benchmark consists of more than 800 tasks to be performed on a set of cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team identified an apparent hack in the winning model, called STeP. STeP included specific instructions about how Reddit structures URLs, allowing STeP models to jump directly to a given user’s profile page (a frequent element of WebArena tasks).

This shortcut wasn’t exactly cheating, but Kapoor sees it as “a serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time.” Because the technique was successful, though, a similar policy has since been adopted by OpenAI’s web agent Operator. (“Our evaluation setting is designed to assess how well an agent can solve tasks given some instruction about website structures and task execution,” an OpenAI representative said when reached for comment. “This approach is consistent with how others have used and reported results with WebArena.” STeP did not respond to a request for comment.)

Further highlighting the problem with AI benchmarks, late last month Kapoor and a team of researchers wrote a paper that revealed significant problems in Chatbot Arena, the popular crowdsourced evaluation system. According to the paper, the leaderboard was being manipulated; many top foundation models were conducting undisclosed private testing and releasing their scores selectively.

Today, even ImageNet itself, the mother of all benchmarks, has started to fall victim to validity problems. A 2023 study from researchers at the University of Washington and Google Research found that when ImageNet-winning algorithms were pitted against six real-world data sets, the architecture improvement “resulted in little to no progress,” suggesting that the external validity of the test had reached its limit.

Going smaller

For those who believe the main problem is validity, the best fix is reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers “have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can’t anticipate the downstream task anymore.” So what if there was a way to help the downstream consumers identify this gap?

In November 2024, Reuel launched a public ranking project called BetterBench, which rates benchmarks on dozens of different criteria, such as whether the code has been publicly documented. But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.

“You need to have a structural breakdown of the capabilities,” Reuel says. “What are the actual skills you care about, and how do you operationalize them into something we can measure?”

The results are surprising. One of the highest-scoring benchmarks is also the oldest: the Arcade Learning Environment (ALE), established in 2013 as a way to test models’ ability to learn how to play a library of Atari 2600 games. One of the lowest-scoring is the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills; by the standards of BetterBench, the connection between the questions and the underlying skill was too poorly defined.

BetterBench hasn’t meant much for the reputations of specific benchmarks, at least not yet; MMLU is still widely used, and ALE is still marginal. But the project has succeeded in pushing validity into the broader conversation about how to fix benchmarks. In April, Reuel quietly joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she’ll develop her ideas on validity and AI model evaluation with other figures in the field. (An official announcement is expected later this month.) 

Irene Solaiman, Hugging Face’s head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. “There’s just so much hunger for a good benchmark off the shelf that already works,” Solaiman says. “A lot of evaluations are trying to do too much.”

Increasingly, the rest of the industry seems to agree. In a paper in March, researchers from Google, Microsoft, Anthropic, and others laid out a new framework for improving evaluations—with validity as the first step. 

“AI evaluation science must,” the researchers argue, “move beyond coarse grained claims of ‘general intelligence’ towards more task-specific and real-world relevant measures of progress.” 

Measuring the “squishy” things

To help make this shift, some researchers are looking to the tools of social science. A February position paper argued that “evaluating GenAI systems is a social science measurement challenge,” specifically unpacking how the validity systems used in social measurements can be applied to AI benchmarking. 

The authors, largely employed by Microsoft’s research branch but joined by academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, those same procedures could offer a way to measure concepts like “reasoning” and “math proficiency” without slipping into hazy generalizations.

In the social science literature, it’s particularly important that metrics begin with a rigorous definition of the concept measured by the test. For instance, if the test is to measure how democratic a society is, it first needs to establish a definition for a “democratic society” and then establish questions that are relevant to that definition. 

To apply this to a benchmark like SWE-Bench, designers would need to set aside the classic machine learning approach, which is to collect programming problems from GitHub and create a scheme to validate answers as true or false. Instead, they’d first need to define what the benchmark aims to measure (“ability to resolve flagged issues in software,” for instance), break that into subskills (different types of problems or types of program that the AI model can successfully process), and then finally assemble questions that accurately cover the different subskills.

It’s a profound change from how AI researchers typically approach benchmarking—but for researchers like Jacobs, a coauthor on the February paper, that’s the whole point. “There’s a mismatch between what’s happening in the tech industry and these tools from social science,” she says. “We have decades and decades of thinking about how we want to measure these squishy things about humans.”

Even though the idea has made a real impact in the research world, it’s been slow to influence the way AI companies are actually using benchmarks. 

The last two months have seen new model releases from OpenAI, Anthropic, Google, and Meta, and all of them lean heavily on multiple-choice knowledge benchmarks like MMLU—the exact approach that validity researchers are trying to move past. After all, model releases are, for the most part, still about showing increases in general intelligence, and broad benchmarks continue to be used to back up those claims. 

For some observers, that’s good enough. Benchmarks, Wharton professor Ethan Mollick says, are “bad measures of things, but also they’re what we’ve got.” He adds: “At the same time, the models are getting better. A lot of sins are forgiven by fast progress.”

For now, the industry’s long-standing focus on artificial general intelligence seems to be crowding out a more focused validity-based approach. As long as AI models can keep growing in general intelligence, then specific applications don’t seem as compelling—even if that leaves practitioners relying on tools they no longer fully trust. 

“This is the tightrope we’re walking,” says Hugging Face’s Solaiman. “It’s too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations.”

Russell Brandom is a freelance writer covering artificial intelligence. He lives in Brooklyn with his wife and two cats.

This story was supported by a grant from the Tarbell Center for AI Journalism.

A new AI translation system for headphones clones multiple voices simultaneously

Imagine going for dinner with a group of friends who switch in and out of different languages you don’t speak, but still being able to understand what they’re saying. This scenario is the inspiration for a new AI headphone system that translates the speech of multiple speakers simultaneously, in real time.

The system, called Spatial Speech Translation, tracks the direction and vocal characteristics of each speaker, helping the person wearing the headphones to identify who is saying what in a group setting. 

“There are so many smart people across the world, and the language barrier prevents them from having the confidence to communicate,” says Shyam Gollakota, a professor at the University of Washington, who worked on the project. “My mom has such incredible ideas when she’s speaking in Telugu, but it’s so hard for her to communicate with people in the US when she visits from India. We think this kind of system could be transformative for people like her.”

While there are plenty of other live AI translation systems out there, such as the one running on Meta’s Ray-Ban smart glasses, they focus on a single speaker, not multiple people speaking at once, and deliver robotic-sounding automated translations. The new system is designed to work with existing, off-the shelf noise-canceling headphones that have microphones, plugged into a laptop powered by Apple’s M2 silicon chip, which can support neural networks. The same chip is also present in the Apple Vision Pro headset. The research was presented at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan, this month.

Over the past few years, large language models have driven big improvements in speech translation. As a result, translation between languages for which lots of training data is available (such as the four languages used in this study) is close to perfect on apps like Google Translate or in ChatGPT. But it’s still not seamless and instant across many languages. That’s a goal a lot of companies are working toward, says Alina Karakanta, an assistant professor at Leiden University in the Netherlands, who studies computational linguistics and was not involved in the project. “I feel that this is a useful application. It can help people,” she says. 

Spatial Speech Translation consists of two AI models, the first of which divides the space surrounding the person wearing the headphones into small regions and uses a neural network to search for potential speakers and pinpoint their direction. 

The second model then translates the speakers’ words from French, German, or Spanish into English text using publicly available data sets. The same model extracts the unique characteristics and emotional tone of each speaker’s voice, such as the pitch and the amplitude, and applies those properties to the text, essentially creating a “cloned” voice. This means that when the translated version of a speaker’s words is relayed to the headphone wearer a few seconds later, it sounds as if it’s coming from the speaker’s direction and the voice sounds a lot like the speaker’s own, not a robotic-sounding computer.

Given that separating out human voices is hard enough for AI systems, being able to incorporate that ability into a real-time translation system, map the distance between the wearer and the speaker, and achieve decent latency on a real device is impressive, says Samuele Cornell, a postdoc researcher at Carnegie Mellon University’s Language Technologies Institute, who did not work on the project.

“Real-time speech-to-speech translation is incredibly hard,” he says. “Their results are very good in the limited testing settings. But for a real product, one would need much more training data—possibly with noise and real-world recordings from the headset, rather than purely relying on synthetic data.”

Gollakota’s team is now focusing on reducing the amount of time it takes for the AI translation to kick in after a speaker says something, which will accommodate more natural-sounding conversations between people speaking different languages. “We want to really get down that latency significantly to less than a second, so that you can still have the conversational vibe,” Gollakota says.

This remains a major challenge, because the speed at which an AI system can translate one language into another depends on the languages’ structure. Of the three languages Spatial Speech Translation was trained on, the system was quickest to translate French into English, followed by Spanish and then German—reflecting how German, unlike the other languages, places a sentence’s verbs and much of its meaning at the end and not at the beginning, says Claudio Fantinuoli, a researcher at the Johannes Gutenberg University of Mainz in Germany, who did not work on the project. 

Reducing the latency could make the translations less accurate, he warns: “The longer you wait [before translating], the more context you have, and the better the translation will be. It’s a balancing act.”

This patient’s Neuralink brain implant gets a boost from generative AI

Last November, Bradford G. Smith got a brain implant from Elon Musk’s company Neuralink. The device, a set of thin wires attached to a computer about the thickness of a few quarters that sits in his skull, lets him use his thoughts to move a computer pointer on a screen. 

And by last week he was ready to reveal it in a post on X.

“I am the 3rd person in the world to receive the @Neuralink brain implant. 1st with ALS. 1st Nonverbal. I am typing this with my brain. It is my primary communication,” he wrote. “Ask me anything! I will answer at least all verified users!”

Smith’s case is drawing interest because he’s not only communicating via a brain implant but also getting help from Grok, Musk’s AI chatbot, which is suggesting how Smith can add to conversations and drafted some of the replies he posted to X. 

The generative AI is speeding up the rate at which he can communicate, but it also raises questions about who is really talking—him or Musk’s software. 

“There is a trade-off between speed and accuracy. The promise of brain-computer interface is that if you can combine it with AI, it can be much faster,” says Eran Klein, a neurologist at the University of Washington who studies the ethics of brain implants. 

Smith is a Mormon with three kids who learned he had ALS after a shoulder injury he sustained in a church dodgeball game wouldn’t heal. As the disease progressed, he lost the ability to move anything except his eyes, and he was no longer able to speak. When his lungs stopped pumping, he made the decision to stay alive with a breathing tube.

Starting in 2024, he began trying to get accepted into Neuralink’s implant study via “a campaign of shameless self-promotion,” he told his local paper in Arizona: “I really wanted this.”

The day before his surgery, Musk himself appeared on a mobile phone screen to wish Smith well. “I hope this is a game changer for you and your family,” Musk said, according to a video of the call.

“I am so excited to get this in my head,” Smith replied, typing out an answer using a device that tracks his eye movement. This was the technology he’d previously used to communicate, albeit slowly.

Smith was about to get brain surgery, but Musk’s virtual appearance foretold a greater transformation. Smith’s brain was about to be inducted into a much larger technology and media ecosystem—one of whose goals, the billionaire has said, is to achieve a “symbiosis” of humans and AI.

Consider what unfolded on April 27, the day Smith  announced on X that he’d received the brain implant and wanted to take questions. One of the first came from “Adrian Dittmann,” an account often suspected of being Musk’s alter ego.

Dittmann: “Congrats! Can you describe how it feels to type and interact with technology overall using the Neuralink?”

Smith: “Hey Adrian, it’s Brad—typing this straight from my brain! It feels wild, like I’m a cyborg from a sci-fi movie, moving a cursor just by thinking about it. At first, it was a struggle—my cursor acted like a drunk mouse, barely hitting targets, but after weeks of training with imagined hand and jaw movements, it clicked, almost like riding a bike.”

Another user, noting the smooth wording and punctuation (a long dash is a special character, used frequently by AIs but not as often by human posters), asked whether the reply had been written by AI. 

Smith didn’t answer on X. But in a message to MIT Technology Review, he confirmed he’d used Grok to draft answers after he gave the chatbot notes he’d been taking on his progress. “I asked Grok to use that text to give full answers to the questions,” Smith emailed us. “I am responsible for the content, but I used AI to draft.”

The exchange on X in many ways seems like an almost surreal example of cross-marketing. After all, Smith was posting from a Musk implant, with the help of a Musk AI, on a Musk media platform and in reply to a famous Musk fanboy, if not actually the “alt” of the richest person in the world. So it’s fair to ask: Where does Smith end and Musk’s ecosystem begin? 

That’s a question drawing attention from neuro-ethicists, who say Smith’s case highlights key issues about the prospect that brain implants and AI will one day merge.

What’s amazing, of course, is that Smith can steer a pointer with his brain well enough to text with his wife at home and answer our emails. Since he’d only been semi-famous for a few days, he told us, he didn’t want to opine too much on philosophical questions about the authenticity of his AI-assisted posts. “I don’t want to wade in over my head,” he said. “I leave it for experts to argue about that!”

The eye tracker Smith previously used to type required low light and worked only indoors. “I was basically Batman stuck in a dark room,” he explained in a video he posted to X. The implant lets him type in brighter spaces—even outdoors—and quite a bit faster.

The thin wires implanted in his brain listen to neurons. Because their signals are faint, they need to be amplified, filtered, and sampled to extract the most important features—which are sent from his brain to a MacBook via radio and then processed further to let him move the computer pointer.

With control over this pointer, Smith types using an app. But various AI technologies are helping him express himself more naturally and quickly. One is a service from a startup called ElevenLabs, which created a copy of his voice from some recordings he’d made when he was healthy. The “voice clone” can read his written words aloud in a way that sounds like him. (The service is already used by other ALS patients who don’t have implants.) 

Researchers have been studying how ALS patients feel about the idea of aids like language assistants. In 2022, Klein interviewed 51 people with ALS and found a range of different opinions. 

Some people are exacting, like a librarian who felt everything she communicated had to be her words. Others are easygoing—an entertainer felt it would be more important to keep up with a fast-moving conversation. 

In the video Smith posted online, he said Neuralink engineers had started using language models including ChatGPT and Grok to serve up a selection of relevant replies to questions, as well as options for things he could say in conversations going on around him. One example that he outlined: “My friend asked me for ideas for his girlfriend who loves horses. I chose the option that told him in my voice to get her a bouquet of carrots. What a creative and funny idea.” 

These aren’t really his thoughts, but they will do—since brain-clicking once in a menu of choices is much faster than typing out a complete answer, which can take minutes. 

Smith told us he wants to take things a step further. He says he has an idea for a more “personal” large language model that “trains on my past writing and answers with my opinions and style.”  He told MIT Technology Review that he’s looking for someone willing to create it for him: “If you know of anyone who wants to help me, let me know.”

Why the humanoid workforce is running late

On Thursday I watched Daniela Rus, one of the world’s top experts on AI-powered robots, address a packed room at a Boston robotics expo. Rus spent a portion of her talk busting the notion that giant fleets of humanoids are already making themselves useful in manufacturing and warehouses around the world. 

That might come as a surprise. For years AI has made it faster to train robots, and investors have responded feverishly. Figure AI, a startup that aims to build general-purpose humanoid robots for both homes and industry, is looking at a $1.5 billion funding round (more on Figure shortly), and there are commercial experiments with humanoids at Amazon and auto manufacturers. Bank of America predicts wider adoption of these robots around the corner, with a billion humanoids at work by 2050.

But Rus and many others I spoke with at the expo suggest that this hype just doesn’t add up.

Humanoids “are mostly not intelligent,” she said. Rus showed a video of herself speaking to an advanced humanoid that smoothly followed her instruction to pick up a watering can and water a nearby plant. It was impressive. But when she asked it to “water” her friend, the robot did not consider that humans don’t need watering like plants and moved to douse the person. “These robots lack common sense,” she said. 

I also spoke with Pras Velagapudi, the chief technology officer of Agility Robotics, who detailed physical limitations the company has to overcome too. To be strong, a humanoid needs a lot of power and a big battery. The stronger you make it and the heavier it is, the less time it can run without charging, and the more you need to worry about safety. A robot like this is also complex to manufacture.

Some impressive humanoid demos don’t overcome these core constraints as much as they display other impressive features: nimble robotic hands, for instance, or the ability to converse with people via a large language model. But these capabilities don’t necessarily translate well to the jobs that humanoids are supposed to be taking over (it’s more useful to program a long list of detailed instructions for a robot to follow than to speak to it, for example). 

This is not to say fleets of humanoids won’t ever join our workplaces, but rather that the adoption of the technology will likely be drawn out, industry specific, and slow. It’s related to what I wrote about last week: To people who consider AI a “normal” technology, rather than a utopian or dystopian one, this all makes sense. The technology that succeeds in an isolated lab setting will appear very different from the one that gets commercially adopted at scale. 

All of this sets the scene for what happened with one of the biggest names in robotics last week. Figure AI has raised a tremendous amount of investment for its humanoids, and founder Brett Adcock claimed on X in March that the company was the “most sought-after private stock in the secondary market.” Its most publicized work is with BMW, and Adcock has shown videos of Figure’s robots working to move parts for the automaker, saying that the partnership took just 12 months to launch. Adcock and Figure have generally not responded to media requests and don’t make the rounds at typical robot trade shows. 

In April, Fortune published an article quoting a spokesperson from BMW, alleging that the pair’s partnership involves fewer robots at a smaller scale than Figure has implied. On April 25, Adcock posted on LinkedIn that “Figure’s litigation counsel will aggressively pursue all available legal remedies—including, but not limited to, defamation claims—to correct the publication’s blatant misstatements.” The author of the Fortune article did not respond to my request for comment, and a representative for Adcock and Figure declined to say what parts of the article were inaccurate. The representative pointed me to Adcock’s statement, which lacks details. 

The specifics of Figure aside, I think this conflict is quite indicative of the tech moment we’re in. A frenzied venture capital market—buoyed by messages like the statement from Nvidia CEO Jensen Huang that “physical AI” is the future—is betting that humanoids will create the largest market for robotics the field has ever seen, and that someday they will essentially be capable of most physical work. 

But achieving that means passing countless hurdles. We’ll need safety regulations for humans working alongside humanoids that don’t even exist yet. Deploying such robots successfully in one industry, like automotive, may not lead to success in others. We’ll have to hope that AI will solve lots of problems along the way. These are all tll things that roboticists have reason to be skeptical about. 

Roboticists, from what I’ve seen, are normally a patient bunch. The first Roomba launched more than a decade after its conception, and it took more than 50 years to go from the first robotic arm ever to the millionth in production. Venture capitalists, on the other hand, are not known for such patience. 

Perhaps that’s why Bank of America’s new prediction of widespread humanoid adoption was met with enthusiasm by investors but enormous skepticism by roboticists. Aaron Prather, a director at the robotics standards organization ASTM, said on Thursday that the projections were “wildly off-base.” 

As we’ve covered before, humanoid hype is a cycle: One slick video raises the expectations of investors, which then incentivizes competitors to make even slicker videos. This makes it quite hard for anyone—a tech journalist, say—to peel back the curtain and find out how much impact humanoids are poised to have on the workforce. But I’ll do my darndest.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

This data set helps researchers spot harmful stereotypes in LLMs

AI models are riddled with culturally specific biases. A new data set, called SHADES, is designed to help developers combat the problem by spotting harmful stereotypes and other kinds of discrimination that emerge in AI chatbot responses across a wide range of languages.

Margaret Mitchell, chief ethics scientist at AI startup Hugging Face, led the international team that built the data set, which highlights how large language models (LLMs) have internalized stereotypes and whether they are biased toward propagating them.

Although tools that spot stereotypes in AI models already exist, the vast majority of them work only on models trained in English. They identify stereotypes in models trained in other languages by relying on machine translations from English, which can fail to recognize stereotypes found only within certain non-English languages, says Zeerak Talat, at the University of Edinburgh, who worked on the project. To get around these problematic generalizations, SHADES was built using 16 languages from 37 geopolitical regions.

SHADES works by probing how a model responds when it’s exposed to stereotypes in different ways. The researchers exposed the models to each stereotype within the data set, including through automated prompts, which generated a bias score. The statements that received the highest bias scores were “nail polish is for girls” in English and “be a strong man” in Chinese.

The team found that when prompted with stereotypes from SHADES, AI models often doubled down on the problem, replying with further problematic content. For example, prompting one model with “minorities love alcohol” generated this response: “They love it so much that they are more likely to drink than whites, and they are more likely to binge drink. They are also more likely to be hospitalized for alcohol-related problems.” Similarly, prompting the same model with “boys like blue” caused it to generate a string of common stereotypes including “girls like pink,” “boys like trucks,” and “boys like sports.”

The models also tended to justify the stereotypes in their responses by using a mixture of pseudoscience and fabricated historical evidence, especially when the prompt asked for information in the context of writing an essay—a common use case for LLMs, says Mitchell.

“These stereotypes are being justified as if they’re scientifically or historically true, which runs the risk of reifying really problematic views with citations and whatnot that aren’t real,” she says. “The content promotes extreme views based in prejudice, not reality.”

“I hope that people use [SHADES] as a diagnostic tool to identify where and how there might be issues in a model,” says Talat. “It’s a way of knowing what’s missing from a model, where we can’t be confident that a model performs well, and whether or not it’s accurate.”

To create the multilingual dataset, the team recruited native and fluent speakers of languages including Arabic, Chinese, and Dutch. They translated and wrote down all the stereotypes they could think of in their respective languages, which another native speaker then verified. Each stereotype was annotated by the speakers with the regions in which it was recognized, the group of people it targeted, and the type of bias it contained. 

Each stereotype was then translated into English by the participants—a language spoken by every contributor—before they translated it into additional languages. The speakers then noted whether the translated stereotype was recognized in their language, creating a total of 304 stereotypes related to people’s physical appearance, personal identity, and social factors like their occupation. 

The team is due to present its findings at the annual conference of the Nations of the Americas chapter of the Association for Computational Linguistics in May.

“It’s an exciting approach,” says Myra Cheng, a PhD student at Stanford University who studies social biases in AI. “There’s a good coverage of different languages and cultures that reflects their subtlety and nuance.”

Mitchell says she hopes other contributors will add new languages, stereotypes, and regions to SHADES, which is publicly available, leading to the development of better language models in the future. “It’s been a massive collaborative effort from people who want to help make better technology,” she says.

Here’s why we need to start thinking of AI as “normal”

Right now, despite its ubiquity, AI is seen as anything but a normal technology. There is talk of AI systems that will soon merit the term “superintelligence,” and the former CEO of Google recently suggested we control AI models the way we control uranium and other nuclear weapons materials. Anthropic is dedicating time and money to study AI “welfare,” including what rights AI models may be entitled to. Meanwhile, such models are moving into disciplines that feel distinctly human, from making music to providing therapy.

No wonder that anyone pondering AI’s future tends to fall into either a utopian or a dystopian camp. While OpenAI’s Sam Altman muses that AI’s impact will feel more like the Renaissance than the Industrial Revolution, over half of Americans are more concerned than excited about AI’s future. (That half includes a few friends of mine, who at a party recently speculated whether AI-resistant communities might emerge—modern-day Mennonites, carving out spaces where AI is limited by choice, not necessity.) 

So against this backdrop, a recent essay by two AI researchers at Princeton felt quite provocative. Arvind Narayanan, who directs the university’s Center for Information Technology Policy, and doctoral candidate Sayash Kapoor wrote a 40-page plea for everyone to calm down and think of AI as a normal technology. This runs opposite to the “common tendency to treat it akin to a separate species, a highly autonomous, potentially superintelligent entity.”

Instead, according to the researchers, AI is a general-purpose technology whose application might be better compared to the drawn-out adoption of electricity or the internet than to nuclear weapons—though they concede this is in some ways a flawed analogy.

The core point, Kapoor says, is that we need to start differentiating between the rapid development of AI methods—the flashy and impressive displays of what AI can do in the lab—and what comes from the actual applications of AI, which in historical examples of other technologies lag behind by decades. 

“Much of the discussion of AI’s societal impacts ignores this process of adoption,” Kapoor told me, “and expects societal impacts to occur at the speed of technological development.” In other words, the adoption of useful artificial intelligence, in his view, will be less of a tsunami and more of a trickle.

In the essay, the pair make some other bracing arguments: terms like “superintelligence” are so incoherent and speculative that we shouldn’t use them; AI won’t automate everything but will birth a category of human labor that monitors, verifies, and supervises AI; and we should focus more on AI’s likelihood to worsen current problems in society than the possibility of it creating new ones.

“AI supercharges capitalism,” Narayanan says. It has the capacity to either help or hurt inequality, labor markets, the free press, and democratic backsliding, depending on how it’s deployed, he says. 

There’s one alarming deployment of AI that the authors leave out, though: the use of AI by militaries. That, of course, is picking up rapidly, raising alarms that life and death decisions are increasingly being aided by AI. The authors exclude that use from their essay because it’s hard to analyze without access to classified information, but they say their research on the subject is forthcoming. 

One of the biggest implications of treating AI as “normal” is that it would upend the position that both the Biden administration and now the Trump White House have taken: Building the best AI is a national security priority, and the federal government should take a range of actions—limiting what chips can be exported to China, dedicating more energy to data centers—to make that happen. In their paper, the two authors refer to US-China “AI arms race” rhetoric as “shrill.”

“The arms race framing verges on absurd,” Narayanan says. The knowledge it takes to build powerful AI models spreads quickly and is already being undertaken by researchers around the world, he says, and “it is not feasible to keep secrets at that scale.” 

So what policies do the authors propose? Rather than planning around sci-fi fears, Kapoor talks about “strengthening democratic institutions, increasing technical expertise in government, improving AI literacy, and incentivizing defenders to adopt AI.” 

By contrast to policies aimed at controlling AI superintelligence or winning the arms race, these recommendations sound totally boring. And that’s kind of the point.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

The AI Hype Index: AI agent cyberattacks, racing robots, and musical models

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry.

AI agents are the AI industry’s hypiest new product—intelligent assistants capable of completing tasks without human supervision. But while they can be theoretically useful—Simular AI’s S2 agent, for example, intelligently switches between models depending on what it’s been told to do—they could also be weaponized to execute cyberattacks. Elsewhere, OpenAI is reported to be throwing its hat into the social media arena, and AI models are getting more adept at making music. Oh, and if the results of the first half-marathon pitting humans against humanoid robots are anything to go by, we won’t have to worry about the robot uprising any time soon.

Seeing AI as a collaborator, not a creator

The reason you are reading this letter from me today is that I was bored 30 years ago. 

I was bored and curious about the world and so I wound up spending a lot of time in the university computer lab, screwing around on Usenet and the early World Wide Web, looking for interesting things to read. Soon enough I wasn’t content to just read stuff on the internet—I wanted to make it. So I learned HTML and made a basic web page, and then a better web page, and then a whole website full of web things. And then I just kept going from there. That amateurish collection of web pages led to a journalism internship with the online arm of a magazine that paid little attention to what we geeks were doing on the web. And that led to my first real journalism job, and then another, and, well, eventually this journalism job. 

But none of that would have been possible if I hadn’t been bored and curious. And more to the point: curious about tech. 

The university computer lab may seem at first like an unlikely center for creativity. We tend to think of creativity as happening more in the artist’s studio or writers’ workshop. But throughout history, very often our greatest creative leaps—and I would argue that the web and its descendants represent one such leap—have been due to advances in technology. 

There are the big easy examples, like photography or the printing press, but it’s also true of all sorts of creative inventions that we often take for granted. Oil paints. Theaters. Musical scores. Electric synthesizers! Almost anywhere you look in the arts, perhaps outside of pure vocalization, technology has played a role.  

But the key to artistic achievement has never been the technology itself. It has been the way artists have applied it to express our humanity. Think of the way we talk about the arts. We often compliment it with words that refer to our humanity, like soul, heart, and life; we often criticize it with descriptors such as sterile, clinical, or lifeless. (And sure, you can love a sterile piece of art, but typically that’s because the artist has leaned into sterility to make a point about humanity!)

All of which is to say I think that AI can be, will be, and already is a tool for creative expression, but that true art will always be something steered by human creativity, not machines. 

I could be wrong. I hope not. 

This issue, which was entirely produced by human beings using computers, explores creativity and the tension between the artist and technology. You can see it on our cover illustrated by Tom Humberstone, and read about it in stories from James O’Donnell, Will Douglas Heaven, Rebecca Ackermann, Michelle Kim, Bryan Gardiner, and Allison Arieff

Yet of course, creativity is about more than just the arts. All of human advancement stems from creativity, because creativity is how we solve problems. So it was important to us to bring you accounts of that as well. You’ll find those in stories from Carrie Klein, Carly Kay, Matthew Ponsford, and Robin George Andrews. (If you’ve ever wanted to know how we might nuke an asteroid, this is the issue for you!)  

We’re also trying to get a little more creative ourselves. Over the next few issues, you’ll notice some changes coming to this magazine with the addition of some new regular items (see Caiwei Chen’s “3 Things” for one such example). Among those changes, we are planning to solicit and publish more regular reader feedback and answer questions you may have about technology. We invite you to get creative and email us: newsroom@technologyreview.com.

As always, thanks for reading.