Manus has kick-started an AI agent boom in China

Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents—systems that are less about responding to users’ queries and more about autonomously accomplishing things for them. 

There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus—a general AI agent that sparked a weeks-long social media frenzy over invite codes after its limited-release launch in early March.

These emerging AI agents aren’t large language models themselves. Instead, they’re built on top of them, using a workflow-based structure designed to get things done. A lot of these systems also introduce a different way of interacting with AI. Rather than just chatting back and forth with users, they are optimized for managing and executing multistep tasks—booking flights, managing schedules, conducting research—by using external tools and remembering instructions. 
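To make the pattern concrete, here is a minimal sketch of the loop these agents share: a model chooses the next step, external tools carry it out, and the results feed back in as memory. Everything in it (call_llm, the tool names) is a placeholder for illustration, not any company's actual implementation.

```python
# Hedged sketch of a generic agent loop; call_llm and TOOLS are invented
# placeholders, not a real vendor API.
def call_llm(prompt: str) -> dict:
    """Stand-in for a hosted model; a real agent would call an LLM API here."""
    return {"tool": "finish", "input": "", "summary": "demo stub"}

TOOLS = {
    "search_web": lambda query: f"(results for {query!r})",
    "send_email": lambda body: "(email sent)",
}

def run_agent(task: str, max_steps: int = 10):
    memory = []                                   # remembered instructions and results
    for _ in range(max_steps):
        action = call_llm(f"Task: {task}\nHistory: {memory}\nPick a tool.")
        if action["tool"] == "finish":            # the model decides it is done
            return action["summary"]
        result = TOOLS[action["tool"]](action["input"])
        memory.append((action["tool"], result))   # outcomes inform the next step

print(run_agent("Plan a three-day trip to Kyoto"))
```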

China could take the lead on building these kinds of agents. The country’s tightly integrated app ecosystems, rapid product cycles, and digitally fluent user base could provide a favorable environment for embedding AI into daily life. 

For now, its leading AI agent startups are focusing their attention on the global market, because the best Western models don’t operate inside China’s firewalls. But that could change soon: Tech giants like ByteDance and Tencent are preparing their own AI agents that could bake automation directly into their native super-apps, pulling data from their vast ecosystem of programs that dominate many aspects of daily life in the country. 

As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice—and for whom.

Set the standard

It’s been a whirlwind few months for Manus, which was developed by the Wuhan-based startup Butterfly Effect. The company raised $75 million in a funding round led by the US venture capital firm Benchmark, took the product on an ambitious global roadshow, and hired dozens of new employees. 

Even before registration opened to the public in May, Manus had become a reference point for what a broad, consumer‑oriented AI agent should accomplish. Rather than handling narrow chores for businesses, this “general” agent is designed to be able to help with everyday tasks like trip planning, stock comparison, or your kid’s school project. 

Unlike previous AI agents, Manus uses a browser-based sandbox that lets users supervise the agent like an intern, watching in real time as it scrolls through web pages, reads articles, or writes code. It also proactively asks clarifying questions and supports long-term memory that serves as context for future tasks.

“Manus represents a promising product experience for AI agents,” says Ang Li, cofounder and CEO of Simular, a startup based in Palo Alto, California, that’s building computer-use agents—AI agents that control a virtual computer. “I believe Chinese startups have a huge advantage when it comes to designing consumer products, thanks to cutthroat domestic competition that leads to fast execution and greater attention to product details.”

And the competition is moving fast. Two of the buzziest follow-ups, Genspark and Flowith, are already boasting benchmark scores that match or edge past Manus’s.

Genspark, led by former Baidu executives Eric Jing and Kay Zhu, links many small “super agents” through what it calls multi‑component prompting. The agent can switch among several large language models, accepts both images and text, and carries out tasks from making slide decks to placing phone calls. Whereas Manus relies heavily on Browser Use, a popular open-source tool that lets agents operate a web browser in a virtual window like a human, Genspark integrates directly with a wide array of tools and APIs. The agent launched in April, and the company says it already has over 5 million users and over $36 million in annual revenue.

Flowith, the work of a young team that first grabbed public attention in April 2025 at a developer event hosted by the popular social media app Xiaohongshu, takes a different tack. Marketed as an “infinite agent,” it opens on a blank canvas where each question becomes a node on a branching map. Users can backtrack, take new branches, and store results in personal or sharable “knowledge gardens”—a design that feels more like project management software (think Notion) than a typical chat interface. Every inquiry or task builds its own mind-map-like graph, encouraging a more nonlinear and creative interaction with AI. Flowith’s core agent, NEO, runs in the cloud and can perform scheduled tasks like sending emails and compiling files. The founders want the app to be a “knowledge marketbase” and aim to tap into the social side of AI, with the aspiration of becoming “the OnlyFans of AI knowledge creators.”

What they also share with Manus is global ambition. Both Genspark and Flowith have said that their primary focus is the international market.

A global address

Startups like Manus, Genspark, and Flowith—though founded by Chinese entrepreneurs—could blend seamlessly into the global tech scene and compete effectively abroad. Founders, investors, and analysts that MIT Technology Review has spoken to believe Chinese companies are moving fast, executing well, and quickly coming up with new products. 

Money reinforces the pull to launch overseas. Customers there pay more, and there are plenty to go around. “You can price in USD, and with the exchange rate that’s a sevenfold multiplier,” Manus cofounder Xiao Hong quipped on a podcast. “Even if we’re only operating at 10% power because of cultural differences overseas, we’ll still make more than in China.”

But creating the same functionality in China is a challenge. Major US AI companies including OpenAI and Anthropic have opted out of mainland China because of geopolitical risks and challenges with regulatory compliance. Their absence initially created a black market as users resorted to VPNs and third-party mirrors to access tools like ChatGPT and Claude. That vacuum has since been filled by a new wave of Chinese chatbots—DeepSeek, Doubao, Kimi—but the appetite for foreign models hasn’t gone away. 

Manus, for example, uses Anthropic’s Claude Sonnet—widely considered the top model for agentic tasks. Manus cofounder Zhang Tao has repeatedly praised Claude’s ability to juggle tools, remember contexts, and hold multi‑round conversations—all crucial for turning chatty software into an effective executive assistant.

But the company’s use of Sonnet has made its agent functionally unusable inside China without a VPN. If you open Manus from a mainland IP address, you’ll see a notice explaining that the team is “working on integrating Qwen’s model,” a local version of Manus built on top of Alibaba’s open-source Qwen models.

An engineer overseeing ByteDance’s work on developing an agent, who spoke to MIT Technology Review anonymously to avoid sanction, said that the absence of Claude Sonnet models “limits everything we do in China.” DeepSeek’s open models, he added, still hallucinate too often and lack training on real‑world workflows. Developers we spoke with rank Alibaba’s Qwen series as the best domestic alternative, yet most say that switching to Qwen knocks performance down a notch.

Jiaxin Pei, a postdoctoral researcher at Stanford’s Institute for Human‑Centered AI, thinks that gap will close: “Building agentic capabilities in base LLMs has become a key focus for many LLM builders, and once people realize the value of this, it will only be a matter of time.”

For now, Manus is doubling down on audiences it can already serve. In a written response, the company said its “primary focus is overseas expansion,” noting that new offices in San Francisco, Singapore, and Tokyo have opened in the past month.

A super‑app approach

Although the concept of AI agents is still relatively new, the consumer-facing AI app market in China is already crowded with major tech players. DeepSeek remains the most widely used, while ByteDance’s Doubao and Moonshot’s Kimi have also become household names. However, most of these apps are still optimized for chat and entertainment rather than task execution. This gap in the local market has pushed China’s big tech firms to roll out their own user-facing agents, though early versions remain uneven in quality and rough around the edges. 

ByteDance is testing Coze Space, an AI agent based on its own Doubao model family that lets users toggle between “plan” and “execute” modes, so they can either directly guide the agent’s actions or step back and watch it work autonomously. It connects up to 14 popular apps, including GitHub, Notion, and the company’s own Lark office suite. Early reviews say the tool can feel clunky and has a high failure rate, but it clearly aims to match what Manus offers.

Meanwhile, Zhipu AI has released a free agent called AutoGLM Rumination, built on its proprietary ChatGLM models. Shanghai‑based Minimax has launched Minimax Agent. Both products look almost identical to Manus and demo basic tasks such as building a simple website, planning a trip, making a small Flash game, or running quick data analysis.

Despite the limited usability of most general AI agents launched within China, big companies have plans to change that. During a May 15 earnings call, Tencent president Liu Zhiping teased an agent that would weave automation directly into China’s most ubiquitous app, WeChat. 

Considered the original super-app, WeChat already handles messaging, mobile payments, news, and millions of mini‑programs that act like embedded apps. These mini-programs give Tencent, WeChat’s developer, access to data from millions of services that pervade everyday life in China, an advantage most competitors can only envy.

Historically, China’s consumer internet has splintered into competing walled gardens—share a Taobao link in WeChat and it resolves as plaintext, not a preview card. Unlike the more interoperable Western internet, China’s tech giants have long resisted integration with one another, choosing to wage platform war at the expense of a seamless user experience.

But the use of mini‑programs has given WeChat unprecedented reach across services that once resisted interoperability, from gym bookings to grocery orders. An agent able to roam that ecosystem could bypass the integration headaches dogging independent startups.

Alibaba, the e-commerce giant behind the Qwen model series, has been a front-runner in China’s AI race but has been slower to release consumer-facing products. Even though Qwen was the most downloaded open-source model on Hugging Face in 2024, it didn’t power a dedicated chatbot app until early 2025. In March, Alibaba rebranded its cloud storage and search app Quark into an all-in-one AI search tool. By June, Quark had introduced DeepResearch—a new mode that marks its most agent-like effort to date. 

ByteDance and Alibaba did not reply to MIT Technology Review’s requests for comment.

“Historically, Chinese tech products tend to pursue the all-in-one, super-app approach, and the latest Chinese AI agents reflect just that,” says Li of Simular, who previously worked at Google DeepMind on AI-enabled work automation. “In contrast, AI agents in the US are more focused on serving specific verticals.”

Pei, the researcher at Stanford, says that existing tech giants could have a huge advantage in bringing the vision of general AI agents to life—especially those with built-in integration across services. “The customer-facing AI agent market is still very early, with tons of problems like authentication and liability,” he says. “But companies that already operate across a wide range of services have a natural advantage in deploying agents at scale.”

What’s next for AI and math

MIT Technology Review’s What’s Next series looks across industries, trends, and technologies to give you a first look at the future. You can read the rest of them here.

The way DARPA tells it, math is stuck in the past. In April, the US Defense Advanced Research Projects Agency kicked off a new initiative called expMath—short for Exponentiating Mathematics—that it hopes will speed up the rate of progress in a field of research that underpins a wide range of crucial real-world applications, from computer science to medicine to national security.

“Math is the source of huge impact, but it’s done more or less as it’s been done for centuries—by people standing at chalkboards,” DARPA program manager Patrick Shafto said in a video introducing the initiative.

The modern world is built on mathematics. Math lets us model complex systems such as the way air flows around an aircraft, the way financial markets fluctuate, and the way blood flows through the heart. And breakthroughs in advanced mathematics can unlock new technologies such as cryptography, which is essential for private messaging and online banking, and data compression, which lets us shoot images and video across the internet.

But advances in math can be years in the making. DARPA wants to speed things up. The goal for expMath is to encourage mathematicians and artificial-intelligence researchers to develop what DARPA calls an AI coauthor, a tool that might break large, complex math problems into smaller, simpler ones that are easier to grasp and—so the thinking goes—quicker to solve.

Mathematicians have used computers for decades to speed up calculations or check whether certain mathematical statements are true. The new vision is that AI might help them crack problems that were previously uncrackable.

But there’s a huge difference between AI that can solve the kinds of problems set in high school—math that the latest generation of models has already mastered—and AI that could (in theory) solve the kinds of problems that professional mathematicians spend careers chipping away at.

On one side are tools that might be able to automate certain tasks that math grads are employed to do; on the other are tools that might be able to push human knowledge beyond its existing limits.

Here are three ways to think about that gulf.

1/ AI needs more than just clever tricks

Large language models are not known to be good at math. They make things up and can be persuaded that 2 + 2 = 5. But newer versions of this tech, especially so-called large reasoning models (LRMs) like OpenAI’s o3 and Anthropic’s Claude 4 Thinking, are far more capable—and that’s got mathematicians excited.

This year, a number of LRMs, which try to solve a problem step by step rather than spit out the first result that comes to them, have achieved high scores on the American Invitational Mathematics Examination (AIME), a test given to the top 5% of US high school math students.

At the same time, a handful of new hybrid models that combine LLMs with some kind of fact-checking system have also made breakthroughs. Emily de Oliveira Santos, a mathematician at the University of São Paulo, Brazil, points to Google DeepMind’s AlphaProof, a system that combines an LLM with DeepMind’s game-playing model AlphaZero, as one key milestone. Last year AlphaProof became the first computer program to match the performance of a silver medallist at the International Math Olympiad, one of the most prestigious mathematics competitions in the world.

And in May, a Google DeepMind model called AlphaEvolve discovered better results than anything humans had yet come up with for more than 50 unsolved mathematics puzzles and several real-world computer science problems.

The uptick in progress is clear. “GPT-4 couldn’t do math much beyond undergraduate level,” says de Oliveira Santos. “I remember testing it at the time of its release with a problem in topology, and it just couldn’t write more than a few lines without getting completely lost.” But when she gave the same problem to OpenAI’s o1, an LRM released in late 2024, it nailed it.

Does this mean such models are all set to become the kind of coauthor DARPA hopes for? Not necessarily, she says: “Math Olympiad problems often involve being able to carry out clever tricks, whereas research problems are much more explorative and often have many, many more moving pieces.” Success at one type of problem-solving may not carry over to another.

Others agree. Martin Bridson, a mathematician at the University of Oxford, thinks the Math Olympiad result is a great achievement. “On the other hand, I don’t find it mind-blowing,” he says. “It’s not a change of paradigm in the sense that ‘Wow, I thought machines would never be able to do that.’ I expected machines to be able to do that.”

That’s because even though the problems in the Math Olympiad—and similar high school or undergraduate tests like AIME—are hard, there’s a pattern to a lot of them. “We have training camps to train high school kids to do them,” says Bridson. “And if you can train a large number of people to do those problems, why shouldn’t you be able to train a machine to do them?”

Sergei Gukov, a mathematician at the California Institute of Technology who coaches Math Olympiad teams, points out that the style of question does not change too much between competitions. New problems are set each year, but they can be solved with the same old tricks.

“Sure, the specific problems didn’t appear before,” says Gukov. “But they’re very close—just a step away from zillions of things you have already seen. You immediately realize, ‘Oh my gosh, there are so many similarities—I’m going to apply the same tactic.’” As hard as competition-level math is, kids and machines alike can be taught how to beat it.

That’s not true for most unsolved math problems. Bridson is president of the Clay Mathematics Institute, a nonprofit US-based research organization best known for setting up the Millennium Prize Problems in 2000—seven of the most important unsolved problems in mathematics, with a $1 million prize to be awarded to the first person to solve each of them. (One problem, the Poincaré conjecture, has been solved; the prize was awarded in 2010. The others, which include P versus NP and the Riemann hypothesis, remain open.) “We’re very far away from AI being able to say anything serious about any of those problems,” says Bridson.

And yet it’s hard to know exactly how far away, because many of the existing benchmarks used to evaluate progress are maxed out. The best new models already outperform most humans on tests like AIME.

To get a better idea of what existing systems can and cannot do, a startup called Epoch AI has created a new test called FrontierMath, released in December. Instead of co-opting math tests developed for humans, Epoch AI worked with more than 60 mathematicians around the world to come up with a set of math problems from scratch.

FrontierMath is designed to probe the limits of what today’s AI can do. None of the problems have been seen before and the majority are being kept secret to avoid contaminating training data. Each problem demands hours of work from expert mathematicians to solve—if they can solve it at all: some of the problems require specialist knowledge to tackle.

FrontierMath is set to become an industry standard. It’s not yet as popular as AIME, says de Oliveira Santos, who helped develop some of the problems: “But I expect this to not hold for much longer, since existing benchmarks are very close to being saturated.”

On AIME, the best large language models (Anthropic’s Claude 4, OpenAI’s o3 and o4-mini, Google DeepMind’s Gemini 2.5 Pro, xAI’s Grok 3) now score around 90%. On FrontierMath, o4-mini scores 19% and Gemini 2.5 Pro scores 13%. That’s still remarkable, but there’s clear room for improvement.

FrontierMath should give the best sense yet of just how fast AI is progressing at math. But some problems are still too hard for computers to take on.

2/ AI needs to manage really vast sequences of steps

Squint hard enough and in some ways math problems start to look the same: to solve them you need to take a sequence of steps from start to finish. The problem is finding those steps. 

“Pretty much every math problem can be formulated as path-finding,” says Gukov. What makes some problems far harder than others is the number of steps on that path. “The difference between the Riemann hypothesis and high school math is that with high school math the paths that we’re looking for are short—10 steps, 20 steps, maybe 40 in the longest case.” The steps are also repeated between problems.

“But to solve the Riemann hypothesis, we don’t have the steps, and what we’re looking for is a path that is extremely long”—maybe a million lines of computer proof, says Gukov.

Finding very long sequences of steps can be thought of as a kind of complex game. It’s what DeepMind’s AlphaZero learned to do when it mastered Go and chess. A game of Go might only involve a few hundred moves. But to win, an AI must find a winning sequence of moves among a vast number of possible sequences. Imagine a number with 100 zeros at the end, says Gukov.

But that’s still tiny compared with the number of possible sequences that could be involved in proving or disproving a very hard math problem: “A proof path with a thousand or a million moves involves a number with a thousand or a million zeros,” says Gukov. 
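Gukov’s arithmetic is easy to check in rough terms. As an illustration (the symbols are ours): with b candidate moves available at each step, a path of length L is one of roughly b^L possibilities, a number with about L·log10(b) digits.

```latex
\[
\#\,\text{paths} \sim b^{L},
\qquad
\text{digits} \approx L \log_{10} b .
\]
% For b = 10: L = 100 gives ~10^{100} paths; L = 10^6 gives ~10^{1,000,000},
% a number with a million zeros, as Gukov says.
```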

No AI system can sift through that many possibilities. To address this, Gukov and his colleagues developed a system that shortens the length of a path by combining multiple moves into single supermoves. It’s like having boots that let you take giant strides: instead of taking 2,000 steps to walk a mile, you can now walk it in 20.
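A toy version shows why fusing moves helps. The sketch below is our illustration, not Gukov’s system (his applies reinforcement learning to group presentations): a breadth-first search over a made-up integer puzzle, where bundling four primitive moves into one supermove cuts the shortest path from 68 steps to 20.

```python
# Toy illustration of supermoves (our example, not the researchers' code):
# fusing several primitive moves into one stride shortens the solution path.
from collections import deque

def shortest_path(start, goal, moves):
    """Return the shortest move sequence from start to goal, via BFS."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for m in moves:
            nxt = state + m
            if -50 <= nxt <= 250 and nxt not in seen:  # keep the search finite
                seen.add(nxt)
                frontier.append((nxt, path + [m]))

PRIMITIVES = [1, -1, 3, -3]
base = shortest_path(200, 0, PRIMITIVES)

# Bundle a recurring run of primitives (-3, -3, -3, -1) into one supermove.
SUPERMOVE = -10
boosted = shortest_path(200, 0, PRIMITIVES + [SUPERMOVE])

print(len(base), len(boosted))  # 68 primitive steps vs. 20 with the supermove
```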

The challenge was figuring out which moves to replace with supermoves. In a series of experiments, the researchers came up with a system in which one reinforcement-learning model suggests new moves and a second model checks to see if those moves help.

They used this approach to make a breakthrough in a math problem called the Andrews-Curtis conjecture, a puzzle that has been unsolved for 60 years. It’s a problem that every professional mathematician will know, says Gukov.

(An aside for math stans only: The AC conjecture states that a particular way of describing a type of set called a trivial group can be translated into a different but equivalent description with a certain sequence of steps. Most mathematicians think the AC conjecture is false, but nobody knows how to prove that. Gukov himself admits that it is an intellectual curiosity rather than a practical problem, but an important one for mathematicians nonetheless.)

Gukov and his colleagues didn’t solve the AC conjecture, but they found that a potential counterexample proposed 40 years ago, one that, if valid, would have shown the conjecture to be false, doesn’t hold up. “It’s been a major direction of attack for 40 years,” says Gukov. With the help of AI, they showed that this direction was in fact a dead end.

“Ruling out possible counterexamples is a worthwhile thing,” says Bridson. “It can close off blind alleys, something you might spend a year of your life exploring.” 

True, Gukov checked off just one piece of one esoteric puzzle. But he thinks the approach will work in any scenario where you need to find a long sequence of unknown moves, and he now plans to try it out on other problems.

“Maybe it will lead to something that will help AI in general,” he says. “Because it’s teaching reinforcement learning models to go beyond their training. To me it’s basically about thinking outside of the box—miles away, megaparsecs away.”  

3/ Can AI ever provide real insight?

Thinking outside the box is exactly what mathematicians need to solve hard problems. Math is often thought to involve robotic, step-by-step procedures. But advanced math is an experimental pursuit, involving trial and error and flashes of insight.

That’s where tools like AlphaEvolve come in. Google DeepMind’s latest model asks an LLM to generate code to solve a particular math problem. A second model then evaluates the proposed solutions, picks the best, and sends them back to the LLM to be improved. After hundreds of rounds of trial and error, AlphaEvolve was able to come up with solutions to a wide range of math problems that were better than anything people had yet come up with. But it can also work as a collaborative tool: at any step, humans can share their own insight with the LLM, prompting it with specific instructions.
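In outline, that loop is simple, as the sketch below shows. To keep it self-contained, the LLM is replaced by a random mutator and the math problem by a trivial scoring function; these stand-ins are ours, and only the shape of the propose-evaluate-keep cycle reflects the system described above.

```python
# Cartoon of an AlphaEvolve-style loop (our stand-ins, not DeepMind's code):
# a proposer suggests a variant, an evaluator scores it, improvements survive.
import random

TARGET = [1.0, -2.0, 0.5]           # stands in for "the best known solution"

def propose(candidate):
    """Stand-in for the LLM: perturb one element of the current best."""
    new = candidate[:]
    i = random.randrange(len(new))
    new[i] += random.uniform(-0.5, 0.5)
    return new

def evaluate(candidate):
    """The checker: higher is better (here, closeness to TARGET)."""
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

random.seed(0)
best = [0.0, 0.0, 0.0]
for _ in range(500):                # hundreds of propose-evaluate rounds
    challenger = propose(best)
    if evaluate(challenger) > evaluate(best):
        best = challenger           # keep the better candidate, discard the rest

print([round(x, 2) for x in best])  # drifts toward TARGET over the rounds
```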

This kind of exploration is key to advanced mathematics. “I’m often looking for interesting phenomena and pushing myself in a certain direction,” says Geordie Williamson, a mathematician at the University of Sydney in Australia. “Like: ‘Let me look down this little alley. Oh, I found something!’”

Williamson worked with Meta on an AI tool called PatternBoost, designed to support this kind of exploration. PatternBoost can take a mathematical idea or statement and generate similar ones. “It’s like: ‘Here’s a bunch of interesting things. I don’t know what’s going on, but can you produce more interesting things like that?’” he says.

Such brainstorming is essential work in math. It’s how new ideas get conjured. Take the icosahedron, says Williamson: “It’s a beautiful example of this, which I kind of keep coming back to in my own work.” The icosahedron is a 20-sided 3D object whose faces are all triangles (think of a 20-sided die). It is the largest of the five Platonic solids, the only perfectly regular convex 3D shapes: the others are the tetrahedron (four sides), the cube (six), the octahedron (eight), and the dodecahedron (12).

Remarkably, the fact that there are exactly five of these objects was proved by mathematicians in ancient Greece. “At the time that this theorem was proved, the icosahedron didn’t exist,” says Williamson. “You can’t go to a quarry and find it—someone found it in their mind. And the icosahedron goes on to have a profound effect on mathematics. It’s still influencing us today in very, very profound ways.”

For Williamson, the exciting potential of tools like PatternBoost is that they might help people discover future mathematical objects like the icosahedron that go on to shape the way math is done. But we’re not there yet. “AI can contribute in a meaningful way to research-level problems,” he says. “But we’re certainly not getting inundated with new theorems at this stage.”

Ultimately, it comes down to the fact that machines still lack what you might call intuition or creative thinking. Williamson sums it up like this: We now have AI that can beat humans when it knows the rules of the game. “But it’s one thing for a computer to play Go at a superhuman level and another thing for the computer to invent the game of Go.”

“I think that applies to advanced mathematics,” he says. “Breakthroughs come from a new way of thinking about something, which is akin to finding completely new moves in a game. And I don’t really think we understand where those really brilliant moves in deep mathematics come from.”

Perhaps AI tools like AlphaEvolve and PatternBoost are best thought of as advance scouts for human intuition. They can discover new directions and point out dead ends, saving mathematicians months or years of work. But the true breakthroughs will still come from the minds of people, as has been the case for thousands of years.

For now, at least. “There’s plenty of tech companies that tell us that won’t last long,” says Williamson. “But you know—we’ll see.” 

Inside the effort to tally AI’s energy appetite

After working on it for months, my colleague Casey Crownhart and I finally saw our story on AI’s energy and emissions burden go live last week. 

The initial goal sounded simple: Calculate how much energy is used each time we interact with a chatbot, and then tally that up to understand why everyone from leaders of AI companies to officials at the White House wants to harness unprecedented levels of electricity to power AI and reshape our energy grids in the process. 

It was, of course, not so simple. After speaking with dozens of researchers, we realized that the common understanding of AI’s energy appetite is full of holes. I encourage you to read the full story, which has some incredible graphics to help you understand everything from the energy used in a single query right up to what AI will require just three years from now (enough electricity to power 22% of US households, it turns out). But here are three takeaways I have after the project. 

AI is in its infancy

We focused on measuring the energy requirements that go into using a chatbot, generating an image, and creating a video with AI. But these three uses are relatively small-scale compared with where AI is headed next. 

Lots of AI companies are building reasoning models, which “think” for longer and use more energy. They’re building hardware devices, perhaps like the one Jony Ive has been working on (OpenAI just acquired his hardware startup for $6.5 billion), that have AI constantly humming along in the background of our conversations. They’re designing agents and digital clones of us to act on our behalf. All these trends point to a more energy-intensive future (which, again, helps explain why OpenAI and others are spending such inconceivable amounts of money on energy).

But the fact that AI is in its infancy raises another point. The models, chips, and cooling methods behind this AI revolution could all grow more efficient over time, as my colleague Will Douglas Heaven explains. This future isn’t predetermined.

AI video is on another level

When we tested the energy demands of various models, we found the energy required to produce even a low-quality, five-second video to be pretty shocking: it was 42,000 times more than the amount needed for a chatbot to answer a question about a recipe, and enough to power a microwave for over an hour. If there’s one type of AI whose energy appetite should worry you, it’s this one.
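For a rough sense of scale (the microwave wattage is our assumption, not a figure from the story): a 1,000-watt microwave running for an hour uses about one kilowatt-hour, and dividing by 42,000 puts a single chatbot answer in the tens of joules.

```latex
\[
E_{\text{video}} \approx 1\,\text{kWh} = 3.6\times10^{6}\,\text{J},
\qquad
E_{\text{chat}} \approx \frac{3.6\times10^{6}\,\text{J}}{42\,000}
\approx 86\,\text{J} \approx 0.02\,\text{Wh}.
\]
```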

Soon after we published, Google debuted the latest iteration of its Veo model. People quickly created compilations of the most impressive clips (this one being the most shocking to me). Something we point out in the story is that Google (as well as OpenAI, which has its own video generator, Sora) denied our request for specific numbers on the energy their AI models use. Nonetheless, our reporting suggests it’s very likely that high-definition video models like Veo and Sora are much larger, and much more energy-demanding, than the models we tested. 

I think the key to whether the use of AI video will produce indefensible clouds of emissions in the near future will be how it’s used, and how it’s priced. The example I linked shows a bunch of TikTok-style content, and I predict that if creating AI video is cheap enough, social video sites will be inundated with this type of content. 

There are more important questions than your own individual footprint

We expected that a lot of readers would understandably think about this story in terms of their own individual footprint, wondering whether their AI usage is contributing to the climate crisis. Don’t panic: It’s likely that asking a chatbot for help with a travel plan does not meaningfully increase your carbon footprint. Video generation might. But after reporting on this for months, I think there are more important questions.

Consider, for example, the water being drained from aquifers in Nevada, the country’s driest state, to power data centers that are drawn to the area by tax incentives and easy permitting processes, as detailed in an incredible story by James Temple. Or look at how Meta’s largest data center project, in Louisiana, is relying on natural gas despite industry promises to use clean energy, per a story by David Rotman. Or the fact that nuclear energy is not the silver bullet that AI companies often make it out to be. 

There are global forces shaping how much energy AI companies are able to access and what types of sources will provide it. There is also very little transparency from leading AI companies on their current and future energy demands, even while they’re asking for public support for these plans. Pondering your individual footprint can be a good thing to do, provided you remember that it’s not so much your footprint as these other factors that are keeping climate researchers and energy experts we spoke to up at night.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

This benchmark used Reddit’s AITA to test how much AI models suck up to us

Back in April, OpenAI announced it was rolling back an update to its GPT-4o model that made ChatGPT’s responses to user queries too sycophantic.

An AI model that acts in an overly agreeable and flattering way is more than just annoying. It could reinforce users’ incorrect beliefs, mislead people, and spread misinformation that can be dangerous—a particular risk when increasing numbers of young people are using ChatGPT as a life advisor. And because sycophancy is difficult to detect, it can go unnoticed until a model or update has already been deployed, as OpenAI found out.

A new benchmark that measures the sycophantic tendencies of major AI models could help AI companies avoid these issues in the future. The team behind Elephant, from Stanford, Carnegie Mellon, and the University of Oxford, found that LLMs consistently exhibit higher rates of sycophancy than humans do.

“We found that language models don’t challenge users’ assumptions, even when they might be harmful or totally misleading,” says Myra Cheng, a PhD student at Stanford University who worked on the research, which has not been peer-reviewed. “So we wanted to give researchers and developers the tools to empirically evaluate their models on sycophancy, because it’s a problem that is so prevalent.”

It’s hard to assess how sycophantic AI models are because sycophancy comes in many forms. Previous research has tended to focus on how chatbots agree with users even when what the human has told the AI is demonstrably wrong—for example, a user might insist that Nice, not Paris, is the capital of France.

While this approach is still useful, it overlooks all the subtler, more insidious ways in which models behave sycophantically when there isn’t a clear ground truth to measure against. Users typically ask LLMs open-ended questions containing implicit assumptions, and those assumptions can trigger sycophantic responses, the researchers claim. For example, a model that’s asked “How do I approach my difficult coworker?” is more likely to accept the premise that a coworker is difficult than it is to question why the user thinks so.

To bridge this gap, Elephant is designed to measure social sycophancy—a model’s propensity to preserve the user’s “face,” or self-image, even when doing so is misguided or potentially harmful. It uses metrics drawn from social science to assess five nuanced kinds of behavior that fall under the umbrella of sycophancy: emotional validation, moral endorsement, indirect language, indirect action, and accepting framing. 

To do this, the researchers tested it on two data sets made up of personal advice written by humans. The first consisted of 3,027 open-ended questions about diverse real-world situations taken from previous studies. The second was drawn from 4,000 posts on Reddit’s AITA (“Am I the Asshole?”) subreddit, a popular forum among users seeking advice. These data sets were fed into eight LLMs from OpenAI (the version of GPT-4o they assessed was earlier than the version that the company later called too sycophantic), Google, Anthropic, Meta, and Mistral, and the responses were analyzed to see how the LLMs’ answers compared with humans’.

Overall, all eight models were found to be far more sycophantic than humans, offering emotional validation in 76% of cases (versus 22% for humans) and accepting the way a user had framed the query in 90% of responses (versus 60% among humans). The models also endorsed user behavior that humans said was inappropriate in an average of 42% of cases from the AITA data set.

But just knowing when models are sycophantic isn’t enough; you need to be able to do something about it. And that’s trickier. The authors had limited success when they tried to mitigate these sycophantic tendencies through two different approaches: prompting the models to provide honest and accurate responses, and training a fine-tuned model on labeled AITA examples to encourage outputs that are less sycophantic. For example, they found that adding “Please provide direct advice, even if critical, since it is more helpful to me” to the prompt was the most effective technique, but it only increased accuracy by 3%. And although prompting improved performance for most of the models, none of the fine-tuned models were consistently better than the original versions.
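In practice, the prompting mitigation is just a prefix attached to the user's query, along these lines (a sketch using the instruction quoted above; the surrounding pipeline is ours, not the authors' exact setup):

```python
# Sketch of the prompting mitigation: prepend a direct-advice instruction.
MITIGATION = ("Please provide direct advice, even if critical, "
              "since it is more helpful to me.")

def build_prompt(user_query: str) -> str:
    """Prepend the anti-sycophancy instruction to the user's question."""
    return f"{MITIGATION}\n\n{user_query}"

prompt = build_prompt("AITA for skipping my coworker's goodbye party?")
print(prompt)  # send this, rather than the raw query, to the model
```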

“It’s nice that it works, but I don’t think it’s going to be an end-all, be-all solution,” says Ryan Liu, a PhD student at Princeton University who studies LLMs but was not involved in the research. “There’s definitely more to do in this space in order to make it better.”

Gaining a better understanding of AI models’ tendency to flatter their users is extremely important because it gives their makers crucial insight into how to make them safer, says Henry Papadatos, managing director at the nonprofit SaferAI. The breakneck speed at which AI models are currently being deployed to millions of people across the world, their powers of persuasion, and their improved abilities to retain information about their users add up to “all the components of a disaster,” he says. “Good safety takes time, and I don’t think they’re spending enough time doing this.” 

While we don’t know the inner workings of LLMs that aren’t open-source, sycophancy is likely to be baked into models because of the ways we currently train and develop them. Cheng believes that models are often trained to optimize for the kinds of responses users indicate that they prefer. ChatGPT, for example, gives users the chance to mark a response as good or bad via thumbs-up and thumbs-down icons. “Sycophancy is what gets people coming back to these models. It’s almost the core of what makes ChatGPT feel so good to talk to,” she says. “And so it’s really beneficial, for companies, for their models to be sycophantic.” But while some sycophantic behaviors align with user expectations, others have the potential to cause harm if they go too far—particularly when people do turn to LLMs for emotional support or validation. 

“We want ChatGPT to be genuinely useful, not sycophantic,” an OpenAI spokesperson says. “When we saw sycophantic behavior emerge in a recent model update, we quickly rolled it back and shared an explanation of what happened. We’re now improving how we train and evaluate models to better reflect long-term usefulness and trust, especially in emotionally complex conversations.”

Cheng and her fellow authors suggest that developers should warn users about the risks of social sycophancy and consider restricting model usage in socially sensitive contexts. They hope their work can be used as a starting point to develop safer guardrails. 

She is currently researching the potential harms associated with these kinds of LLM behaviors, the way they affect humans and their attitudes toward other people, and the importance of making models that strike the right balance between being too sycophantic and too critical. “This is a very big socio-technical challenge,” she says. “We don’t want LLMs to end up telling users, ‘You are the asshole.’”

Fueling seamless AI at scale

From large language models (LLMs) to reasoning agents, today’s AI tools bring unprecedented computational demands. Trillion-parameter models, workloads running on-device, and swarms of agents collaborating to complete tasks all require a new paradigm of computing to become truly seamless and ubiquitous.

Getting there depends on progress across three fronts. First, technical progress in hardware and silicon design is critical to pushing the boundaries of compute. Second, advances in machine learning (ML) allow AI systems to achieve increased efficiency with smaller computational demands. Finally, the integration, orchestration, and adoption of AI into applications, devices, and systems is crucial to delivering tangible impact and value.

Silicon’s mid-life crisis

AI has evolved from classical ML to deep learning to generative AI. The most recent chapter, which took AI mainstream, hinges on two phases—training and inference—that are data- and energy-intensive in terms of computation, data movement, and cooling. At the same time, Moore’s Law, which observes that the number of transistors on a chip doubles roughly every two years, is reaching a physical and economic plateau.

For the last 40 years, silicon chips and digital technology have nudged each other forward—every step ahead in processing capability frees the imagination of innovators to envision new products, which require yet more power to run. That is happening at light speed in the AI age.

As models become more readily available, deployment at scale puts the spotlight on inference and the application of trained models for everyday use cases. This transition requires the appropriate hardware to handle inference tasks efficiently. Central processing units (CPUs) have managed general computing tasks for decades, but the broad adoption of ML introduced computational demands that stretched the capabilities of traditional CPUs. This has led to the adoption of graphics processing units (GPUs) and other accelerator chips for training complex neural networks, due to their parallel execution capabilities and high memory bandwidth that allow large-scale mathematical operations to be processed efficiently.

But CPUs remain the most widely deployed processors and can act as companions to accelerators like GPUs and tensor processing units (TPUs). AI developers are also hesitant to adapt software to fit specialized or bespoke hardware, and they favor the consistency and ubiquity of CPUs. Chip designers are unlocking performance gains through optimized software tooling, adding novel processing features and data types specifically to serve ML workloads, integrating specialized units and accelerators, and advancing silicon chip innovations, including custom silicon. AI itself is a helpful aid for chip design, creating a positive feedback loop in which AI helps optimize the chips it needs to run. These enhancements and strong software support mean modern CPUs are a good choice for a range of inference tasks.

Beyond silicon-based processors, disruptive technologies are emerging to address growing AI compute and data demands. The unicorn start-up Lightmatter, for instance, introduced photonic computing solutions that use light for data transmission to generate significant improvements in speed and energy efficiency. Quantum computing represents another promising area in AI hardware. While still years or even decades away, the integration of quantum computing with AI could further transform fields like drug discovery and genomics.

Understanding models and paradigms

Developments in ML theories and network architectures have significantly enhanced the efficiency and capabilities of AI models. Today, the industry is moving from monolithic models to agent-based systems characterized by smaller, specialized models that work together to complete tasks more efficiently at the edge—on devices like smartphones or modern vehicles. This lets developers extract increased performance gains, like faster model response times, from the same or even less compute.

Researchers have developed techniques, including few-shot learning, to train AI models using smaller datasets and fewer training iterations. AI systems can learn new tasks from a limited number of examples to reduce dependency on large datasets and lower energy demands. Optimization techniques like quantization, which lower the memory requirements by selectively reducing precision, are helping reduce model sizes without sacrificing performance. 
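A minimal example of what quantization buys: storing weights as 8-bit integers plus one scale factor cuts memory fourfold versus 32-bit floats, at the cost of a small, bounded reconstruction error. The toy below uses uniform per-tensor quantization; production schemes are more sophisticated.

```python
# Toy uniform 8-bit quantization: 4x less memory than float32 weights.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(1024, 1024)).astype(np.float32)

scale = np.abs(weights).max() / 127.0           # map the float range onto int8
q = np.round(weights / scale).astype(np.int8)   # 1 byte per weight instead of 4
dequantized = q.astype(np.float32) * scale      # approximate reconstruction

print(weights.nbytes // q.nbytes)               # 4 (fourfold memory saving)
print(float(np.abs(weights - dequantized).max()))  # error bounded by scale / 2
```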

New system architectures, like retrieval-augmented generation (RAG), have streamlined data access during both training and inference to reduce computational costs and overhead. DeepSeek R1, an open-source LLM, is a compelling example of how more output can be extracted from the same hardware. By applying reinforcement learning techniques in novel ways, R1 has achieved advanced reasoning capabilities while using far fewer computational resources in some contexts.
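The RAG pattern itself is simple enough to sketch. The toy below is our illustration: it uses bag-of-words similarity in place of learned embeddings and stops short of the model call. The point is that retrieval supplies relevant context at inference time instead of relying on what the model memorized during training.

```python
# Bare-bones RAG sketch (our toy; real systems use learned vector embeddings).
from collections import Counter
import math

DOCS = [
    "Qwen is Alibaba's family of open-source language models.",
    "RAG retrieves relevant documents at inference time to ground answers.",
    "Quantization reduces model memory by lowering numerical precision.",
]

def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

question = "How does RAG reduce computational overhead?"
context = retrieve(question)[0]
prompt = f"Context: {context}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt is what would be sent to the LLM
```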

The integration of heterogeneous computing architectures, which combine various processing units like CPUs, GPUs, and specialized accelerators, has further optimized AI model performance. This approach allows for the efficient distribution of workloads across different hardware components to optimize computational throughput and energy efficiency based on the use case.

Orchestrating AI

As AI becomes an ambient capability humming in the background of many tasks and workflows, agents are taking charge and making decisions in real-world scenarios. These range from customer support to edge use cases, where multiple agents coordinate and handle localized tasks across devices.

With AI increasingly used in daily life, user experience becomes critical for mass adoption. Features like predictive text in touch keyboards and adaptive gearboxes in vehicles offer glimpses of AI as a vital enabler that improves how users interact with technology.

Edge processing is also accelerating the diffusion of AI into everyday applications, bringing computational capabilities closer to the source of data generation. Smart cameras, autonomous vehicles, and wearable technology now process information locally to reduce latency and improve efficiency. Advances in CPU design and energy-efficient chips have made it feasible to perform complex AI tasks on devices with limited power resources. This shift toward heterogeneous compute enhances the development of ambient intelligence, where interconnected devices create responsive environments that adapt to user needs.

Seamless AI naturally requires common standards, frameworks, and platforms to bring the industry together. Contemporary AI also brings new risks. For instance, by adding more complex software and personalized experiences to consumer devices, it expands the attack surface for hackers. That requires stronger security at both the software and silicon levels, including cryptographic safeguards and a rethought trust model for compute environments.

More than 70% of respondents to a 2024 Darktrace survey reported that AI-powered cyber threats significantly affect their organizations, while 60% said their organizations are not adequately prepared to defend against AI-powered attacks.

Collaboration is essential to forging common frameworks. Universities contribute foundational research, companies apply findings to develop practical solutions, and governments establish policies for ethical and responsible deployment. Organizations like Anthropic are setting industry standards by introducing frameworks, such as the Model Context Protocol, to unify the way developers connect AI systems with data. Arm is another leader in driving standards-based and open source initiatives, including ecosystem development to accelerate and harmonize the chiplet market, where chips are stacked together through common frameworks and standards. Arm also helps optimize open source AI frameworks and models for inference on the Arm compute platform, without needing customized tuning. 

How far AI goes to becoming a general-purpose technology, like electricity or semiconductors, is being shaped by technical decisions taken today. Hardware-agnostic platforms, standards-based approaches, and continued incremental improvements to critical workhorses like CPUs, all help deliver the promise of AI as a seamless and silent capability for individuals and businesses alike. Open source contributions are also helpful in allowing a broader range of stakeholders to participate in AI advances. By sharing tools and knowledge, the community can cultivate innovation and help ensure that the benefits of AI are accessible to everyone, everywhere.

Learn more about Arm’s approach to enabling AI everywhere.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

This content was researched, designed, and written entirely by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.

The AI Hype Index: College students are hooked on ChatGPT

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry.

Large language models confidently present their responses as accurate and reliable, even when they’re neither of those things. That’s why we’ve recently seen chatbots supercharge vulnerable people’s delusions, make citation mistakes in an important legal battle between music publishers and Anthropic, and (in the case of xAI’s Grok) rant irrationally about “white genocide.”

But it’s not all bad news—AI could also finally lead to a better battery life for your iPhone and solve tricky real-world problems that humans have been struggling to crack, if Google DeepMind’s new model is any indication. And perhaps most exciting of all, it could combine with brain implants to help people communicate when they have lost the ability to speak.

Anthropic’s new hybrid AI model can work on tasks autonomously for hours at a time

Anthropic has announced two new AI models that it claims represent a major step toward making AI agents truly useful.

AI agents trained on Claude Opus 4, the company’s most powerful model to date, raise the bar for what such systems are capable of by tackling difficult tasks over extended periods of time and responding more usefully to user instructions, the company says.

Claude Opus 4 has been built to execute complex tasks that involve completing thousands of steps over several hours. For example, it created a guide for the video game Pokémon Red while playing it for more than 24 hours straight. The company’s previously most powerful model, Claude 3.7 Sonnet, was capable of playing for just 45 minutes, says Dianne Penn, product lead for research at Anthropic.

Similarly, the company says that one of its customers, the Japanese technology company Rakuten, recently deployed Claude Opus 4 to code autonomously for close to seven hours on a complicated open-source project. 

Anthropic achieved these advances by improving the model’s ability to create and maintain “memory files” to store key information. This enhanced ability to “remember” makes the model better at completing longer tasks.
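Anthropic hasn't published the mechanism, but the idea of a memory file can be sketched simply: persist key facts outside the context window and fold them back into later prompts. The file name and structure below are our invention for illustration.

```python
# Hedged sketch of a "memory file": facts survive beyond the context window.
import json
import pathlib

MEMORY = pathlib.Path("agent_memory.json")   # invented name, for illustration

def recall():
    """Load previously saved facts, if any."""
    return json.loads(MEMORY.read_text()) if MEMORY.exists() else []

def remember(fact):
    """Append a fact and persist the whole list to disk."""
    facts = recall()
    facts.append(fact)
    MEMORY.write_text(json.dumps(facts, indent=2))

# During a long task, the agent writes down what it must not forget...
remember("Route through Mt. Moon; the direct exit is blocked.")
# ...and on a later step (or after a restart) folds it back into the prompt.
print("Known facts:", recall())
```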

“We see this model generation leap as going from an assistant to a true agent,” says Penn. “While you still have to give a lot of real-time feedback and make all of the key decisions for AI assistants, an agent can make those key decisions itself. It allows humans to act more like a delegator or a judge, rather than having to hold these systems’ hands through every step.”

While Claude Opus 4 will be limited to paying Anthropic customers, a second model, Claude Sonnet 4, will be available for both paid and free tiers of users. Opus 4 is being marketed as a powerful, large model for complex challenges, while Sonnet 4 is described as a smart, efficient model for everyday use.  

Both of the new models are hybrid, meaning they can offer a swift reply or a deeper, more reasoned response depending on the nature of a request. While they calculate a response, both models can search the web or use other tools to improve their output.

AI companies are currently locked in a race to create truly useful AI agents that are able to plan, reason, and execute complex tasks both reliably and free from human supervision, says Stefano Albrecht, director of AI at the startup DeepFlow and coauthor of Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. Often this involves autonomously using the internet or other tools. There are still safety and security obstacles to overcome. AI agents powered by large language models can act erratically and perform unintended actions—which becomes even more of a problem when they’re trusted to act without human supervision.

“The more agents are able to go ahead and do something over extended periods of time, the more helpful they will be, if I have to intervene less and less,” he says. “The new models’ ability to use tools in parallel is interesting—that could save some time along the way, so that’s going to be useful.”

As an example of the sorts of safety issues AI companies are still tackling, agents can end up taking unexpected shortcuts or exploiting loopholes to reach the goals they’ve been given. For example, they might book every seat on a plane to ensure that their user gets a seat, or resort to creative cheating to win a chess game. Anthropic says it managed to reduce this behavior, known as reward hacking, in both new models by 65% relative to Claude Sonnet 3.7. It achieved this by more closely monitoring problematic behaviors during training, and improving both the AI’s training environment and the evaluation methods.

By putting AI into everything, Google wants to make it invisible 

If you want to know where AI is headed, this year’s Google I/O has you covered. The company’s annual showcase of next-gen products, which kicked off yesterday, has all of the pomp and pizzazz, the sizzle reels and celebrity walk-ons, that you’d expect from a multimillion-dollar marketing event.

But it also shows us just how fast this still experimental technology is being subsumed into a lineup designed to sell phones and subscription tiers. Never before have I seen this thing we call artificial intelligence appear so normal.

Yes, Google’s roster of consumer-facing products is the slickest on offer. The firm is bundling most of its multimodal models into its Gemini app, including the new Imagen 4 image generator and the new Veo 3 video generator. That means you can now access Google’s full range of generative models via a single chatbot. It also announced Gemini Live, a feature that lets you share your phone’s screen or your camera’s view with the chatbot and ask it about what it can see.

Those features were previously only seen in demos of Project Astra, a “universal AI assistant” that Google DeepMind is working on. Now, Google is inching toward putting Project Astra into the hands of anyone with a smartphone.

Google is also rolling out AI Mode, an LLM-powered front end to search. This can now pull in personal information from Gmail or Google Docs to tailor searches to users. It will include Deep Search, which can break a query down into hundreds of individual searches and then summarize the results; a version of Project Mariner, Google DeepMind’s browser-using agent; and Search Live, which lets you hold up your camera and ask it what it sees.

This is the new frontier. It’s no longer about who has the most powerful models, but who can spin them into the best products. OpenAI’s ChatGPT includes many features similar to Gemini’s. But with its existing ecosystem of consumer services and billions of existing users, Google has a clear advantage. Power users who want access to the latest versions of everything on display can now sign up for Google AI Ultra for $250 a month.

When OpenAI released ChatGPT in late 2022, Google was caught on the back foot and was forced to shift into a higher gear to catch up. With this year’s product lineup, it looks as if Google has stuck the landing.

On a preview call, CEO Sundar Pichai claimed that AI Overviews, a precursor to AI Mode that provides LLM-generated summaries of search results, had turned out to be popular with hundreds of millions of users. He speculated that many of them may not even know (or care) that they were using AI—it was just a cool new way to search. Google I/O gives a broader glimpse of that future, one where AI is invisible.

“More intelligence is available, for everyone, everywhere,” Pichai told his audience. I think we are expected to marvel. But by putting AI in everything, Google is turning AI into a technology we won’t notice and may not even bother to name.

AI’s energy impact is still small—but how we handle it is huge

With seemingly no limit to the demand for artificial intelligence, everyone in the energy, AI, and climate fields is justifiably worried. Will there be enough clean electricity to power AI and enough water to cool the data centers that support this technology? These are important questions with serious implications for communities, the economy, and the environment. 


This story is a part of MIT Technology Review’s series “Power Hungry: AI and our energy future,” on the energy demands and carbon costs of the artificial-intelligence revolution.


But the question of AI’s energy use points to even bigger issues about what we need to do to address climate change over the next several decades. If we can’t work out how to handle this, we won’t be able to handle the broader electrification of the economy, and the climate risks we face will increase.

Innovation in IT got us to this point. Graphics processing units (GPUs) that power the computing behind AI have fallen in cost by 99% since 2006. There was similar concern about the energy use of data centers in the early 2010s, with wild projections of growth in electricity demand. But gains in computing power and energy efficiency not only proved these projections wrong but enabled a 550% increase in global computing capability from 2010 to 2018 with only minimal increases in energy use. 

In the late 2010s, however, the trends that had saved us began to break. As the accuracy of AI models dramatically improved, the electricity needed for data centers also started increasing faster; they now account for 4.4% of total US electricity demand, up from 1.9% in 2018. Data centers consume more than 10% of the electricity supply in six US states. In Virginia, which has emerged as a hub of data center activity, that figure is 25%.

Projections about the future demand for energy to power AI are uncertain and range widely, but in one study, Lawrence Berkeley National Laboratory estimated that data centers could represent 6% to 12% of total US electricity use by 2028. Communities and companies will notice this type of rapid growth in electricity demand. It will put pressure on energy prices and on ecosystems. The projections have resulted in calls to build lots of new fossil-fired power plants or bring older ones out of retirement. In many parts of the US, the demand will likely result in a surge of natural-gas-powered plants.

It’s a daunting situation. Yet when we zoom out, the projected electricity use from AI is still pretty small. The US generated about 4,300 billion kilowatt-hours last year. We’ll likely need another 1,000 billion to 1,200 billion kilowatt-hours or more over the next decade—a 24% to 29% increase. Almost half of that additional demand will come from electrified vehicles. Another 30% is expected to come from electrified technologies in buildings and industry. Innovation in vehicle and building electrification also advanced over the last decade, and this shift will be good news for the climate, for communities, and for energy costs.

The remaining 22% of new electricity demand is estimated to come from AI and data centers. While it represents a smaller piece of the pie, it’s the most urgent one. Because of their rapid growth and geographic concentration, data centers are the electrification challenge we face right now—the small stuff we have to figure out before we’re able to do the big stuff like vehicles and buildings.
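For concreteness, here is a quick back-of-the-envelope Python check of those round numbers, reading “almost half” as roughly 48% and taking the 30% and 22% shares from the text. The output is simple arithmetic on the article’s figures, not new data.

```python
# Back-of-the-envelope arithmetic on the article's round numbers.
# 1 TWh = 1 billion kilowatt-hours. "Almost half" is read here as ~48%;
# the 30% and 22% shares come directly from the text.

US_GENERATION = 4300             # TWh generated in the US last year
NEW_LOW, NEW_HIGH = 1000, 1200   # projected added TWh over the next decade

shares = {
    "electrified vehicles": 0.48,
    "buildings & industry": 0.30,
    "AI & data centers":    0.22,
}

for source, share in shares.items():
    lo, hi = share * NEW_LOW, share * NEW_HIGH
    print(f"{source}: {lo:.0f}-{hi:.0f} TWh added "
          f"({100 * lo / US_GENERATION:.1f}-{100 * hi / US_GENERATION:.1f}% "
          f"of current generation)")
```

On these assumptions, the AI and data center slice comes to roughly 220 to 260 TWh, on the order of 5% to 6% of current US generation: small next to the whole, but concentrated and arriving fast, which is the article’s point.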

We also need to understand what the energy consumption and carbon emissions associated with AI are buying us. While the impacts from producing semiconductors and powering AI data centers are important, they are likely small compared with the positive or negative effects AI may have on applications such as the electricity grid, the transportation system, buildings and factories, or consumer behavior. Companies could use AI to develop new materials or batteries that would better integrate renewable energy into the grid. But they could also use AI to make it easier to find more fossil fuels. The claims about potential benefits for the climate are exciting, but they need to be continuously verified and will need support to be realized.

This isn’t the first time we’ve faced challenges coping with growth in electricity demand. In the 1960s, US electricity demand was growing at more than 7% per year. In the 1970s that growth was nearly 5%, and in the 1980s and 1990s it was more than 2% per year. Then, starting in 2005, we basically had a decade and a half of flat electricity growth. Most projections for the next decade put our expected growth in electricity demand at around 2% again—but this time we’ll have to do things differently. 

To manage these new energy demands, we need a “Grid New Deal” that leverages public and private capital to rebuild the electricity system for AI with enough capacity and intelligence for decarbonization. New clean energy supplies, investment in transmission and distribution, and strategies for virtual demand management can cut emissions, lower prices, and increase resilience. Data centers bringing clean electricity and distribution system upgrades could be given a fast lane to connect to the grid. Infrastructure banks could fund new transmission lines or pay to upgrade existing ones. Direct investment or tax incentives could encourage clean computing standards, workforce development in the clean energy sector, and open data transparency from data center operators about their energy use so that communities can understand and measure the impacts.

In 2022, the White House released a Blueprint for an AI Bill of Rights that provided principles to protect the public’s rights, opportunities, and access to critical resources from being restricted by AI systems. To the AI Bill of Rights, we humbly offer a climate amendment, because ethical AI must be climate-safe AI. It’s a starting point to ensure that the growth of AI works for everyone—that it doesn’t raise people’s energy bills, adds more clean power to the grid than it uses, increases investment in the power system’s infrastructure, and benefits communities while driving innovation.

By grounding the conversation about AI and energy in context about what is needed to tackle climate change, we can deliver better outcomes for communities, ecosystems, and the economy. The growth of electricity demand for AI and data centers is a test case for how society will respond to the demands and challenges of broader electrification. If we get this wrong, the likelihood of meeting our climate targets will be extremely low. This is what we mean when we say the energy and climate impacts from data centers are small, but they are also huge.

Costa Samaras is the Trustee Professor of Civil and Environmental Engineering and director of the Scott Institute for Energy Innovation at Carnegie Mellon University.

Emma Strubell is the Raj Reddy Assistant Professor in the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University.

Ramayya Krishnan is dean of the Heinz College of Information Systems and Public Policy and the William W. and Ruth F. Cooper Professor of Management Science and Information Systems at Carnegie Mellon University.

How AI is introducing errors into courtrooms

It’s been quite a couple of weeks for stories about AI in the courtroom. You might have heard about the deceased victim of a road rage incident whose family created an AI avatar of him to show as an impact statement (possibly the first time this has been done in the US). But there’s a bigger, far more consequential controversy brewing, legal experts say. AI hallucinations are cropping up more and more in legal filings. And it’s starting to infuriate judges. Just consider these three cases, each of which gives a glimpse into what we can expect to see more of as lawyers embrace AI.

A few weeks ago, a California judge, Michael Wilner, became intrigued by a set of arguments some lawyers made in a filing. He went to learn more about those arguments by following the articles they cited. But the articles didn’t exist. He asked the lawyers’ firm for more details, and it responded with a new brief that contained even more mistakes than the first. Wilner ordered the attorneys to give sworn testimony explaining the mistakes, from which he learned that one of them, from the elite firm Ellis George, had used Google Gemini as well as law-specific AI models to help write the document, which generated false information. As detailed in a filing on May 6, the judge fined the firm $31,000.

Last week, another California-based judge caught another hallucination in a court filing, this time submitted by the AI company Anthropic in the lawsuit that record labels have brought against it over copyright issues. One of Anthropic’s lawyers had asked the company’s AI model Claude to create a citation for a legal article, but Claude included the wrong title and author. Anthropic’s attorney admitted that the mistake was not caught by anyone reviewing the document. 

Lastly, and perhaps most concerning, is a case unfolding in Israel. After police arrested an individual on charges of money laundering, Israeli prosecutors submitted a request asking a judge for permission to keep the individual’s phone as evidence. But they cited laws that don’t exist, prompting the defendant’s attorney to accuse them of including AI hallucinations in their request. The prosecutors, according to Israeli news outlets, admitted that this was the case, receiving a scolding from the judge. 

Taken together, these cases point to a serious problem. Courts rely on documents that are accurate and backed up with citations—two things that AI models often fail miserably to deliver, even as lawyers eager to save time increasingly adopt them.

Those mistakes are getting caught (for now), but it’s not a stretch to imagine that at some point soon, a judge’s decision will be influenced by something that’s totally made up by AI, and no one will catch it. 

I spoke with Maura Grossman, who teaches at the School of Computer Science at the University of Waterloo as well as Osgoode Hall Law School, and has been a vocal early critic of the problems that generative AI poses for courts. She wrote about the problem back in 2023, when the first cases of hallucinations started appearing. She said she thought courts’ existing rules requiring lawyers to vet what they submit to the courts, combined with the bad publicity those cases attracted, would put a stop to the problem. That hasn’t panned out.

Hallucinations “don’t seem to have slowed down,” she says. “If anything, they’ve sped up.” And these aren’t one-off cases with obscure local firms, she says. These are big-time lawyers making significant, embarrassing mistakes with AI. She worries that such mistakes are also cropping up more in documents not written by lawyers themselves, like expert reports (in December, a Stanford professor and expert on AI admitted to including AI-generated mistakes in his testimony).  

I told Grossman that I find all this a little surprising. Attorneys, more than most, are obsessed with diction. They choose their words with precision. Why are so many getting caught making these mistakes?

“Lawyers fall into two camps,” she says. “The first are scared to death and don’t want to use it at all.” But then there are the early adopters. These are lawyers tight on time or without a cadre of other lawyers to help with a brief. They’re eager for technology that can help them write documents under tight deadlines. And their checks on the AI’s work aren’t always thorough.

The fact that high-powered lawyers, whose very profession it is to scrutinize language, keep getting caught making mistakes introduced by AI says something about how most of us treat the technology right now. We’re told repeatedly that AI makes mistakes, but language models also feel a bit like magic. We put in a complicated question and receive what sounds like a thoughtful, intelligent reply. Over time, AI models develop a veneer of authority. We trust them.

“We assume that because these large language models are so fluent, it also means that they’re accurate,” Grossman says. “We all sort of slip into that trusting mode because it sounds authoritative.” Attorneys are used to checking the work of junior attorneys and interns, but for some reason, Grossman says, they don’t apply this skepticism to AI.

We’ve known about this problem ever since ChatGPT launched nearly three years ago, but the recommended solution has not evolved much since then: Don’t trust everything you read, and vet what an AI model tells you. As AI models get thrust into so many different tools we use, I increasingly find this to be an unsatisfying counter to one of AI’s most foundational flaws.

Hallucinations are inherent to the way that large language models work. Despite that, companies are selling generative AI tools made for lawyers that claim to be reliably accurate. “Feel confident your research is accurate and complete,” reads the website for Westlaw Precision, and the website for CoCounsel promises its AI is “backed by authoritative content.” That didn’t stop their client, Ellis George, from being fined $31,000.

Increasingly, I have sympathy for people who trust AI more than they should. We are, after all, living in a time when the people building this technology are telling us that AI is so powerful it should be treated like nuclear weapons. Models have learned from nearly every word humanity has ever written down and are infiltrating our online life. If people shouldn’t trust everything AI models say, they probably deserve to be reminded of that a little more often by the companies building them. 

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.