What you may have missed about GPT-5

Before OpenAI released GPT-5 last Thursday, CEO Sam Altman said its capabilities made him feel “useless relative to the AI.” He said working on it carries a weight he imagines the developers of the atom bomb must have felt.

As tech giants converge on models that do more or less the same thing, OpenAI’s new offering was supposed to give a glimpse of AI’s newest frontier. It was meant to mark a leap toward the “artificial general intelligence” that tech’s evangelists have promised will transform humanity for the better. 

Against those expectations, the model has mostly underwhelmed. 

People have highlighted glaring mistakes in GPT-5’s responses, countering Altman’s claim made at the launch that it works like “a legitimate PhD-level expert in anything any area you need on demand.” Early testers have also found issues with OpenAI’s promise that GPT-5 automatically works out what type of AI model is best suited for your question—a reasoning model for more complicated queries, or a faster model for simpler ones. Altman seems to have conceded that this feature is flawed and takes away user control. However there is good news too: the model seems to have eased the problem of ChatGPT sucking up to users, with GPT-5 less likely to shower them with over the top compliments.

Overall, as my colleague Grace Huckins pointed out, the new release represents more of a product update—providing slicker and prettier ways of conversing with ChatGPT—than a breakthrough that reshapes what is possible in AI. 

But there’s one other thing to take from all this. For a while, AI companies didn’t make much effort to suggest how their models might be used. Instead, the plan was to simply build the smartest model possible—a brain of sorts—and trust that it would be good at lots of things. Writing poetry would come as naturally as organic chemistry. Getting there would be accomplished by bigger models, better training techniques, and technical breakthroughs. 

That has been changing: The play now is to push existing models into more places by hyping up specific applications. Companies have been more aggressive in their promises that their AI models can replace human coders, for example (even if the early evidence suggests otherwise). A possible explanation for this pivot is that tech giants simply have not made the breakthroughs they’ve expected. We might be stuck with only marginal improvements in large language models’ capabilities for the time being. That leaves AI companies with one option: Work with what you’ve got.

The starkest example of this in the launch of GPT-5 is how much OpenAI is encouraging people to use it for health advice, one of AI’s most fraught arenas. 

In the beginning, OpenAI mostly didn’t play ball with medical questions. If you tried to ask ChatGPT about your health, it gave lots of disclaimers warning you that it was not a doctor, and for some questions, it would refuse to give a response at all. But as I recently reported, those disclaimers began disappearing as OpenAI released new models. Its models will now not only interpret x-rays and mammograms for you but ask follow-up questions leading toward a diagnosis.

In May, OpenAI signaled it would try to tackle medical questions head on. It announced HealthBench, a way to evaluate how good AI systems are at handling health topics as measured against the opinions of physicians. In July, it published a study it participated in, reporting that a cohort of doctors in Kenya made fewer diagnostic mistakes when they were helped by an AI model. 

With the launch of GPT-5, OpenAI has begun explicitly telling people to use its models for health advice. At the launch event, Altman welcomed on stage Felipe Millon, an OpenAI employee, and his wife, Carolina Millon, who had recently been diagnosed with multiple forms of cancer. Carolina spoke about asking ChatGPT for help with her diagnoses, saying that she had uploaded copies of her biopsy results to ChatGPT to translate medical jargon and asked the AI for help making decisions about things like whether or not to pursue radiation. The trio called it an empowering example of shrinking the knowledge gap between doctors and patients.

With this change in approach, OpenAI is wading into dangerous waters. 

For one, it’s using evidence that doctors can benefit from AI as a clinical tool, as in the Kenya study, to suggest that people without any medical background should ask the AI model for advice about their own health. The problem is that lots of people might ask for this advice without ever running it by a doctor (and are less likely to do so now that the chatbot rarely prompts them to).

Indeed, two days before the launch of GPT-5, the Annals of Internal Medicine published a paper about a man who stopped eating salt and began ingesting dangerous amounts of bromide following a conversation with ChatGPT. He developed bromide poisoning—which largely disappeared in the US after the Food and Drug Administration began curbing the use of bromide in over-the-counter medications in the 1970s—and then nearly died, spending weeks in the hospital. 

So what’s the point of all this? Essentially, it’s about accountability. When AI companies move from promising general intelligence to offering humanlike helpfulness in a specific field like health care, it raises a second, yet unanswered question about what will happen when mistakes are made. As things stand, there’s little indication tech companies will be made liable for the harm caused.

“When doctors give you harmful medical advice due to error or prejudicial bias, you can sue them for malpractice and get recompense,” says Damien Williams, an assistant professor of data science and philosophy at the University of North Carolina Charlotte. 

“When ChatGPT gives you harmful medical advice because it’s been trained on prejudicial data, or because ‘hallucinations’ are inherent in the operations of the system, what’s your recourse?”

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

The road to artificial general intelligence

Artificial intelligence models that can discover drugs and write code still fail at puzzles a lay person can master in minutes. This phenomenon sits at the heart of the challenge of artificial general intelligence (AGI). Can today’s AI revolution produce models that rival or surpass human intelligence across all domains? If so, what underlying enablers—whether hardware, software, or the orchestration of both—would be needed to power them?

Dario Amodei, co-founder of Anthropic, predicts some form of “powerful AI” could come as early as 2026, with properties that include Nobel Prize-level domain intelligence; the ability to switch between interfaces like text, audio, and the physical world; and the autonomy to reason toward goals, rather than responding to questions and prompts as they do now. Sam Altman, chief executive of OpenAI, believes AGI-like properties are already “coming into view,” unlocking a societal transformation on par with electricity and the internet. He credits progress to continuous gains in training, data, and compute, along with falling costs, and a socioeconomic value that is
super-exponential.

Optimism is not confined to founders. Aggregate forecasts give at least a 50% chance of AI systems achieving several AGI milestones by 2028. The chance of unaided machines outperforming humans in every possible task is estimated at 10% by 2027, and 50% by 2047, according to one expert survey. Time horizons shorten with each breakthrough, from 50 years at the time of GPT-3’s launch to five years by the end of 2024. “Large language and reasoning models are transforming nearly every industry,” says Ian Bratt, vice president of machine learning technology and fellow at Arm.

Download the full report.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

This content was researched, designed, and written entirely by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.

Meet the early-adopter judges using AI

The propensity for AI systems to make mistakes and for humans to miss those mistakes has been on full display in the US legal system as of late. The follies began when lawyers—including some at prestigious firms—submitted documents citing cases that didn’t exist. Similar mistakes soon spread to other roles in the courts. In December, a Stanford professor submitted sworn testimony containing hallucinations and errors in a case about deepfakes, despite being an expert on AI and misinformation himself.

The buck stopped with judges, who—whether they or opposing counsel caught the mistakes—issued reprimands and fines, and likely left attorneys embarrassed enough to think twice before trusting AI again.

But now judges are experimenting with generative AI too. Some are confident that with the right precautions, the technology can expedite legal research, summarize cases, draft routine orders, and overall help speed up the court system, which is badly backlogged in many parts of the US. This summer, though, we’ve already seen AI-generated mistakes go undetected and cited by judges. A federal judge in New Jersey had to reissue an order riddled with errors that may have come from AI, and a judge in Mississippi refused to explain why his order too contained mistakes that seemed like AI hallucinations. 

The results of these early-adopter experiments make two things clear. One, the category of routine tasks—for which AI can assist without requiring human judgment—is slippery to define. Two, while lawyers face sharp scrutiny when their use of AI leads to mistakes, judges may not face the same accountability, and walking back their mistakes before they do damage is much harder.

Drawing boundaries

Xavier Rodriguez, a federal judge for the Western District of Texas, has good reason to be skeptical of AI. He started learning about artificial intelligence back in 2018, four years before the release of ChatGPT (thanks in part to the influence of his twin brother, who works in tech). But he’s also seen AI-generated mistakes in his own court. 

In a recent dispute about who was to receive an insurance payout, both the plaintiff and the defendant represented themselves, without lawyers (this is not uncommon—nearly a quarter of civil cases in federal court involve at least one unrepresented party). The two sides wrote their own filings and made their own arguments. 

“Both sides used AI tools,” Rodriguez says, and both submitted filings that referenced made-up cases. He had authority to reprimand them, but given that they were not lawyers, he opted not to. 

“I think there’s been an overreaction by a lot of judges on these sanctions. The running joke I tell when I’m on the speaking circuit is that lawyers have been hallucinating well before AI,” he says. Missing a mistake from an AI model is not wholly different, to Rodriguez, from failing to catch the error of a first-year lawyer. “I’m not as deeply offended as everybody else,” he says. 

In his court, Rodriguez has been using generative AI tools (he wouldn’t publicly name which ones, to avoid the appearance of an endorsement) to summarize cases. He’ll ask AI to identify key players involved and then have it generate a timeline of key events. Ahead of specific hearings, Rodriguez will also ask it to generate questions for attorneys based on the materials they submit.

These tasks, to him, don’t lean on human judgment. They also offer lots of opportunities for him to intervene and uncover any mistakes before they’re brought to the court. “It’s not any final decision being made, and so it’s relatively risk free,” he says. Using AI to predict whether someone should be eligible for bail, on the other hand, goes too far in the direction of judgment and discretion, in his view.

Erin Solovey, a professor and researcher on human-AI interaction at Worcester Polytechnic Institute in Massachusetts, recently studied how judges in the UK think about this distinction between rote, machine-friendly work that feels safe to delegate to AI and tasks that lean more heavily on human expertise. 

“The line between what is appropriate for a human judge to do versus what is appropriate for AI tools to do changes from judge to judge and from one scenario to the next,” she says.

Even so, according to Solovey, some of these tasks simply don’t match what AI is good at. Asking AI to summarize a large document, for example, might produce drastically different results depending on whether the model has been trained to summarize for a general audience or a legal one. AI also struggles with logic-based tasks like ordering the events of a case. “A very plausible-sounding timeline may be factually incorrect,” Solovey says. 

Rodriguez and a number of other judges crafted guidelines that were published in February by the Sedona Conference, an influential think tank that issues principles for particularly murky areas of the law. They outline a host of potentially “safe” uses of AI for judges, including conducting legal research, creating preliminary transcripts, and searching briefings, while warning that judges should verify outputs from AI and that “no known GenAI tools have fully resolved the hallucination problem.”

Dodging AI blunders

Judge Allison Goddard, a federal magistrate judge in California and a coauthor of the guidelines, first felt the impact that AI would have on the judiciary when she taught a class on the art of advocacy at her daughter’s high school. She was impressed by a student’s essay and mentioned it to her daughter. “She said, ‘Oh, Mom, that’s ChatGPT.’”

“What I realized very quickly was this is going to really transform the legal profession,” she says. In her court, Goddard has been experimenting with ChatGPT, Claude (which she keeps “open all day”), and a host of other AI models. If a case involves a particularly technical issue, she might ask AI to help her understand which questions to ask attorneys. She’ll summarize 60-page orders from the district judge and then ask the AI model follow-up questions about it, or ask it to organize information from documents that are a mess. 

“It’s kind of a thought partner, and it brings a perspective that you may not have considered,” she says.

Goddard also encourages her clerks to use AI, specifically Anthropic’s Claude, because by default it does not train on user conversations. But it has its limits. For anything that requires law-specific knowledge, she’ll use tools from Westlaw or Lexis, which have AI tools built specifically for lawyers, but she finds general-purpose AI models to be faster for lots of other tasks. And her concerns about bias have prevented her from using it for tasks in criminal cases, like determining if there was probable cause for an arrest.

In this, Goddard appears to be caught in the same predicament the AI boom has created for many of us. Three years in, companies have built tools that sound so fluent and humanlike they obscure the intractable problems lurking underneath—answers that read well but are wrong, models that are trained to be decent at everything but perfect for nothing, and the risk that your conversations with them will be leaked to the internet. Each time we use them, we bet that the time saved will outweigh the risks, and trust ourselves to catch the mistakes before they matter. For judges, the stakes are sky-high: If they lose that bet, they face very public consequences, and the impact of such mistakes on the people they serve can be lasting. 

“I’m not going to be the judge that cites hallucinated cases and orders,” Goddard says. “It’s really embarrassing, very professionally embarrassing.”

Still, some judges don’t want to get left behind in the AI age. With some in the AI sector suggesting that the supposed objectivity and rationality of AI models could make them better judges than fallible humans, it might lead some on the bench to think that falling behind poses a bigger risk than getting too far out ahead. 

A ‘crisis waiting to happen’

The risks of early adoption have raised alarm bells with Judge Scott Schlegel, who serves on the Fifth Circuit Court of Appeal in Louisiana. Schlegel has long blogged about the helpful role technology can play in modernizing the court system, but he has warned that AI-generated mistakes in judges’ rulings signal a “crisis waiting to happen,” one that would dwarf the problem of lawyers’ submitting filings with made-up cases. 

Attorneys who make mistakes can get sanctioned, have their motions dismissed, or lose cases when the opposing party finds out and flags the errors. “When the judge makes a mistake, that’s the law,” he says. “I can’t go a month or two later and go ‘Oops, so sorry,’ and reverse myself. It doesn’t work that way.”

Consider child custody cases or bail proceedings, Schlegel says: “There are pretty significant consequences when a judge relies upon artificial intelligence to make the decision,” especially if the citations that decision relies on are made-up or incorrect.

This is not theoretical. In June, a Georgia appellate court judge issued an order that relied partially on made-up cases submitted by one of the parties, a mistake that went uncaught. In July, a federal judge in New Jersey withdrew an opinion after lawyers complained it too contained hallucinations. 

Unlike lawyers, who can be ordered by the court to explain why there are mistakes in their filings, judges do not have to show much transparency, and there is little reason to think they’ll do so voluntarily. On August 4, a federal judge in Mississippi had to issue a new decision in a civil rights case after the original was found to contain incorrect names and serious errors. The judge did not fully explain what led to the errors even after the state asked him to do so. “No further explanation is warranted,” the judge wrote.

These mistakes could erode the public’s faith in the legitimacy of courts, Schlegel says. Certain narrow and monitored applications of AI—summarizing testimonies, getting quick writing feedback—can save time, and they can produce good results if judges treat the work like that of a first-year associate, checking it thoroughly for accuracy. But most of the job of being a judge is dealing with what he calls the white-page problem: You’re presiding over a complex case with a blank page in front of you, forced to make difficult decisions. Thinking through those decisions, he says, is indeed the work of being a judge. Getting help with a first draft from an AI undermines that purpose.

“If you’re making a decision on who gets the kids this weekend and somebody finds out you use Grok and you should have used Gemini or ChatGPT—you know, that’s not the justice system.”

Sam Altman and the whale

My colleague Grace Huckins has a great story on OpenAI’s release of GPT-5, its long-awaited new flagship model. One of the takeaways, however, is that while GPT-5 may make for a better experience than the previous versions, it isn’t something revolutionary. “GPT-5 is, above all else,” Grace concludes, “a refined product.”

This is pretty much in line with my colleague Will Heaven’s recent argument that the latest model releases have been a bit like smartphone releases: Increasingly, what we are seeing are incremental improvements meant to enhance the user experience. (Casey Newton made a similar point in Friday’s Platformer.) At GPT-5’s release on Thursday, OpenAI CEO Sam Altman himself compared it to when Apple released the first iPhone with a Retina display. Okay. Sure. 

But where is the transition from the BlackBerry keyboard to the touch-screen iPhone? Where is the assisted GPS and the API for location services that enables real-time directions and gives rise to companies like Uber and Grindr and lets me order a taxi for my burrito? Where are the real breakthroughs? 

In fact, following the release of GPT-5, OpenAI found itself with something of a user revolt on its hands. Customers who missed GPT-4o’s personality successfully lobbied the company to bring it back as an option for its Plus users. If anything, that indicates the GPT-5 release was more about user experience than noticeable performance enhancements.

And yet, hours before OpenAI’s GPT-5 announcement, Altman teased it by tweeting an image of an emerging Death Star floating in space. On Thursday, he touted its PhD-level intelligence. He then went on the Mornings with Maria show to claim it would “save a lot of lives.” (Forgive my extreme skepticism of that particular brand of claim, but we’ve certainly seen it before.) 

It’s a lot of hype, but Altman is not alone in his Flavor Flav-ing here. Last week Mark Zuckerberg published a long memo about how we are approaching AI superintelligence. Anthropic CEO Dario Amodei freaked basically everyone out earlier this year with his prediction that AI would harvest half of all entry-level jobs within, possibly, a year. 

The people running these companies literally talk about the danger that the things they are building might take over the world and kill every human on the planet. GPT-5, meanwhile, still can’t tell you how many b’s there are in the word “blueberry.” 

This is not to say that the products released by OpenAI or Anthropic or what have you are not impressive. They are. And they clearly have a good deal of utility. But the hype cycle around model releases is out of hand. 

I say that as one of those people who use ChatGPT or Google Gemini most days, often multiple times a day. This week, for example, my wife was surfing and encountered a whale repeatedly slapping its tail on the water. Despite having seen very many whales, often in very close proximity, she had never seen anything like this. She sent me a video, and I was curious about it too. So I asked ChatGPT, “Why do whales slap their tails repeatedly on the water?” It came right back, confidently explaining that what I was describing was called “lobtailing,” along with a list of possible reasons why whales do that. Pretty cool. 

But then again, a regular garden-variety Google search would also have led me to discover lobtailing. And while ChatGPT’s response summarized the behavior for me, it was also too definitive about why whales do it. The reality is that while people have a lot of theories, we still can’t really explain this weird animal behavior. 

The reason I’m aware that lobtailing is something of a mystery is that I dug into actual, you know, search results. Which is where I encountered this beautiful, elegiac essay by Emily Boring. She describes her time at sea, watching a humpback slapping its tail against the water, and discusses the scientific uncertainty around this behavior. Is it a feeding technique? Is it a form of communication? Posturing? The action, as she notes, is extremely energy intensive. It takes a lot of effort from the whale. Why do they do it? 

I was struck by one passage in particular, in which she cites another biologist’s work to draw a conclusion of her own: 

Surprisingly, the complex energy trade-off of a tail-slap might be the exact reason why it’s used. Biologist Hal Whitehead suggests, “Breaches and lob-tails make good signals precisely because they are energetically expensive and thus indicative of the importance of the message and the physical status of the signaler.” A tail-slap means that a whale is physically fit, traveling at nearly maximum speed, capable of sustaining powerful activity, and carrying a message so crucial it is willing to use a huge portion of its daily energy to share it. “Pay attention!” the whale seems to say. “I am important! Notice me!”

In some ways, the AI hype cycle has to be out of hand. It has to justify the ferocious level of investment, the uncountable billions of dollars in sunk costs. The massive data center buildouts with their massive environmental consequences created at massive expense that are seemingly keeping the economy afloat and threatening to crash it. There is so, so, so much money at stake. 

Which is not to say there aren’t really cool things happening in AI. And certainly there have been a number of moments when I have been floored by AI releases. ChatGPT 3.5 was one. Dall-E, NotebookLM, Veo 3, Synthesia. They can amaze. In fact there was an AI product release just this week that was a little bit mind-blowing. Genie 3, from Google DeepMind, can turn a basic text prompt into an immersive and navigable 3D world. Check it out—it’s pretty wild. And yet Genie 3 also makes a case that the most interesting things happening right now in AI aren’t happening in chatbots. 

I’d even argue that at this point, most of the people who are regularly amazed by the feats of new LLM chatbot releases are the same people who stand to profit from the promotion of LLM chatbots.

Maybe I’m being cynical, but I don’t think so. I think it’s more cynical to promise me the Death Star and instead deliver a chatbot whose chief appeal seems to be that it automatically picks the model for you. To promise me superintelligence and deliver shrimp Jesus. It’s all just a lot of lobtailing. “Pay attention! I am important! Notice me!”

This article is from The Debrief, MIT Technology Review’s subscriber-only weekly email newsletter from editor in chief Mat Honan. Subscribers can sign up here to receive it in your inbox.

GPT-5 is here. Now what?

At long last, OpenAI has released GPT-5. The new system abandons the distinction between OpenAI’s flagship models and its o series of reasoning models, automatically routing user queries to a fast nonreasoning model or a slower reasoning version. It is now available to everyone through the ChatGPT web interface—though nonpaying users may need to wait a few days to gain full access to the new capabilities. 

It’s tempting to compare GPT-5 with its explicit predecessor, GPT-4, but the more illuminating juxtaposition is with o1, OpenAI’s first reasoning model, which was released last year. In contrast to GPT-5’s broad release, o1 was initially available only to Plus and Team subscribers. Those users got access to a completely new kind of language model—one that would “reason” through its answers by generating additional text before providing a final response, enabling it to solve much more challenging problems than its nonreasoning counterparts.

Whereas o1 was a major technological advancement, GPT-5 is, above all else, a refined product. During a press briefing, Sam Altman compared GPT-5 to Apple’s Retina displays, and it’s an apt analogy, though perhaps not in the way that he intended. Much like an unprecedentedly crisp screen, GPT-5 will furnish a more pleasant and seamless user experience. That’s not nothing, but it falls far short of the transformative AI future that Altman has spent much of the past year hyping. In the briefing, Altman called GPT-5 “a significant step along the path to AGI,” or artificial general intelligence, and maybe he’s right—but if so, it’s a very small step.

Take the demo of the model’s abilities that OpenAI showed to MIT Technology Review in advance of its release. Yann Dubois, a post-training lead at OpenAI, asked GPT-5 to design a web application that would help his partner learn French so that she could communicate more easily with his family. The model did an admirable job of following his instructions and created an appealing, user-friendly app. But when I gave GPT-4o an almost identical prompt, it produced an app with exactly the same functionality. The only difference is that it wasn’t as aesthetically pleasing.

Some of the other user-experience improvements are more substantial. Having the model rather than the user choose whether to apply reasoning to each query removes a major pain point, especially for users who don’t follow LLM advancements closely. 

And, according to Altman, GPT-5 reasons much faster than the o-series models. The fact that OpenAI is releasing it to nonpaying users suggests that it’s also less expensive for the company to run. That’s a big deal: Running powerful models cheaply and quickly is a tough problem, and solving it is key to reducing AI’s environmental impact

OpenAI has also taken steps to mitigate hallucinations, which have been a persistent headache. OpenAI’s evaluations suggest that GPT-5 models are substantially less likely to make incorrect claims than their predecessor models, o3 and GPT-4o. If that advancement holds up to scrutiny, it could help pave the way for more reliable and trustworthy agents. “Hallucination can cause real safety and security issues,” says Dawn Song, a professor of computer science at UC Berkeley. For example, an agent that hallucinates software packages could download malicious code to a user’s device.

GPT-5 has achieved the state of the art on several benchmarks, including a test of agentic abilities and the coding evaluations SWE-Bench and Aider Polyglot. But according to Clémentine Fourrier, an AI researcher at the company HuggingFace, those evaluations are nearing saturation, which means that current models have achieved close to maximal performance. 

“It’s basically like looking at the performance of a high schooler on middle-grade problems,” she says. “If the high schooler fails, it tells you something, but if it succeeds, it doesn’t tell you a lot.” Fourrier said she would be impressed if the system achieved a score of 80% or 85% on SWE-Bench—but it only managed a 74.9%. 

Ultimately, the headline message from OpenAI is that GPT-5 feels better to use. “The vibes of this model are really good, and I think that people are really going to feel that, especially average people who haven’t been spending their time thinking about models,” said Nick Turley, the head of ChatGPT.

Vibes alone, however, won’t bring about the automated future that Altman has promised. Reasoning felt like a major step forward on the way to AGI. We’re still waiting for the next one.

Five ways that AI is learning to improve itself

Last week, Mark Zuckerberg declared that Meta is aiming to achieve smarter-than-human AI. He seems to have a recipe for achieving that goal, and the first ingredient is human talent: Zuckerberg has reportedly tried to lure top researchers to Meta Superintelligence Labs with nine-figure offers. The second ingredient is AI itself.  Zuckerberg recently said on an earnings call that Meta Superintelligence Labs will be focused on building self-improving AI—systems that can bootstrap themselves to higher and higher levels of performance.

The possibility of self-improvement distinguishes AI from other revolutionary technologies. CRISPR can’t improve its own targeting of DNA sequences, and fusion reactors can’t figure out how to make the technology commercially viable. But LLMs can optimize the computer chips they run on, train other LLMs cheaply and efficiently, and perhaps even come up with original ideas for AI research. And they’ve already made some progress in all these domains.

According to Zuckerberg, AI self-improvement could bring about a world in which humans are liberated from workaday drudgery and can pursue their highest goals with the support of brilliant, hypereffective artificial companions. But self-improvement also creates a fundamental risk, according to Chris Painter, the policy director at the AI research nonprofit METR. If AI accelerates the development of its own capabilities, he says, it could rapidly get better at hacking, designing weapons, and manipulating people. Some researchers even speculate that this positive feedback cycle could lead to an “intelligence explosion,” in which AI rapidly launches itself far beyond the level of human capabilities.

But you don’t have to be a doomer to take the implications of self-improving AI seriously. OpenAI, Anthropic, and Google all include references to automated AI research in their AI safety frameworks, alongside more familiar risk categories such as chemical weapons and cybersecurity. “I think this is the fastest path to powerful AI,” says Jeff Clune, a professor of computer science at the University of British Columbia and senior research advisor at Google DeepMind. “It’s probably the most important thing we should be thinking about.”

By the same token, Clune says, automating AI research and development could have enormous upsides. On our own, we humans might not be able to think up the innovations and improvements that will allow AI to one day tackle prodigious problems like cancer and climate change.

For now, human ingenuity is still the primary engine of AI advancement; otherwise, Meta would hardly have made such exorbitant offers to attract researchers to its superintelligence lab. But AI is already contributing to its own development, and it’s set to take even more of a role in the years to come. Here are five ways that AI is making itself better.

1. Enhancing productivity

Today, the most important contribution that LLMs make to AI development may also be the most banal. “The biggest thing is coding assistance,” says Tom Davidson, a senior research fellow at Forethought, an AI research nonprofit. Tools that help engineers write software more quickly, such as Claude Code and Cursor, appear popular across the AI industry: Google CEO Sundar Pichai claimed in October 2024 that a quarter of the company’s new code was generated by AI, and Anthropic recently documented a wide variety of ways that its employees use Claude Code. If engineers are more productive because of this coding assistance, they will be able to design, test, and deploy new AI systems more quickly.

But the productivity advantage that these tools confer remains uncertain: If engineers are spending large amounts of time correcting errors made by AI systems, they might not be getting any more work done, even if they are spending less of their time writing code manually. A recent study from METR found that developers take about 20% longer to complete tasks when using AI coding assistants, though Nate Rush, a member of METR’s technical staff who co-led the study, notes that it only examined extremely experienced developers working on large code bases. Its conclusions might not apply to AI researchers who write up quick scripts to run experiments.

Conducting a similar study within the frontier labs could help provide a much clearer picture of whether coding assistants are making AI researchers at the cutting edge more productive, Rush says—but that work hasn’t yet been undertaken. In the meantime, just taking software engineers’ word for it isn’t enough: The developers METR studied thought that the AI coding tools had made them work more efficiently, even though the tools had actually slowed them down substantially.

2. Optimizing infrastructure

Writing code quickly isn’t that much of an advantage if you have to wait hours, days, or weeks for it to run. LLM training, in particular, is an agonizingly slow process, and the most sophisticated reasoning models can take many minutes to generate a single response. These delays are major bottlenecks for AI development, says Azalia Mirhoseini, an assistant professor of computer science at Stanford University and senior staff scientist at Google DeepMind. “If we can run AI faster, we can innovate more,” she says.

That’s why Mirhoseini has been using AI to optimize AI chips. Back in 2021, she and her collaborators at Google built a non-LLM AI system that could decide where to place various components on a computer chip to optimize efficiency. Although some other researchers failed to replicate the study’s results, Mirhoseini says that Nature investigated the paper and upheld the work’s validity—and she notes that Google has used the system’s designs for multiple generations of its custom AI chips.

More recently, Mirhoseini has applied LLMs to the problem of writing kernels, low-level functions that control how various operations, like matrix multiplication, are carried out in chips. She’s found that even general-purpose LLMs can, in some cases, write kernels that run faster than the human-designed versions.

Elsewhere at Google, scientists built a system that they used to optimize various parts of the company’s LLM infrastructure. The system, called AlphaEvolve, prompts Google’s Gemini LLM to write algorithms for solving some problem, evaluates those algorithms, and asks Gemini to improve on the most successful—and repeats that process several times. AlphaEvolve designed a new approach for running datacenters that saved 0.7% of Google’s computational resources, made further improvements to Google’s custom chip design, and designed a new kernel that sped up Gemini’s training by 1%.   

That might sound like a small improvement, but at a huge company like Google it equates to enormous savings of time, money, and energy. And Matej Balog, a staff research scientist at Google DeepMind who led the AlphaEvolve project, says that he and his team tested the system on only a small component of Gemini’s overall training pipeline. Applying it more broadly, he says, could lead to more savings.

3. Automating training

LLMs are famously data hungry, and training them is costly at every stage. In some specific domains—unusual programming languages, for example—real-world data is too scarce to train LLMs effectively. Reinforcement learning with human feedback, a technique in which humans score LLM responses to prompts and the LLMs are then trained using those scores, has been key to creating models that behave in line with human standards and preferences, but obtaining human feedback is slow and expensive. 

Increasingly, LLMs are being used to fill in the gaps. If prompted with plenty of examples, LLMs can generate plausible synthetic data in domains in which they haven’t been trained, and that synthetic data can then be used for training. LLMs can also be used effectively for reinforcement learning: In an approach called “LLM as a judge,” LLMs, rather than humans, are used to score the outputs of models that are being trained. That approach is key to the influential “Constitutional AI” framework proposed by Anthropic researchers in 2022, in which one LLM is trained to be less harmful based on feedback from another LLM.

Data scarcity is a particularly acute problem for AI agents. Effective agents need to be able to carry out multistep plans to accomplish particular tasks, but examples of successful step-by-step task completion are scarce online, and using humans to generate new examples would be pricey. To overcome this limitation, Stanford’s Mirhoseini and her colleagues have recently piloted a technique in which an LLM agent generates a possible step-by-step approach to a given problem, an LLM judge evaluates whether each step is valid, and then a new LLM agent is trained on those steps. “You’re not limited by data anymore, because the model can just arbitrarily generate more and more experiences,” Mirhoseini says.

4. Perfecting agent design

One area where LLMs haven’t yet made major contributions is in the design of LLMs themselves. Today’s LLMs are all based on a neural-network structure called a transformer, which was proposed by human researchers in 2017, and the notable improvements that have since been made to the architecture were also human-designed. 

But the rise of LLM agents has created an entirely new design universe to explore. Agents need tools to interact with the outside world and instructions for how to use them, and optimizing those tools and instructions is essential to producing effective agents. “Humans haven’t spent as much time mapping out all these ideas, so there’s a lot more low-hanging fruit,” Clune says. “It’s easier to just create an AI system to go pick it.”

Together with researchers at the startup Sakana AI, Clune created a system called a “Darwin Gödel Machine”: an LLM agent that can iteratively modify its prompts, tools, and other aspects of its code to improve its own task performance. Not only did the Darwin Gödel Machine achieve higher task scores through modifying itself, but as it evolved, it also managed to find new modifications that its original version wouldn’t have been able to discover. It had entered a true self-improvement loop.

5. Advancing research

Although LLMs are speeding up numerous parts of the LLM development pipeline, humans may still remain essential to AI research for quite a while. Many experts point to “research taste,” or the ability that the best scientists have to pick out promising new research questions and directions, as both a particular challenge for AI and a key ingredient in AI development. 

But Clune says research taste might not be as much of a challenge for AI as some researchers think. He and Sakana AI researchers are working on an end-to-end system for AI research that they call the “AI Scientist.” It searches through the scientific literature to determine its own research question, runs experiments to answer that question, and then writes up its results.

One paper that it wrote earlier this year, in which it devised and tested a new training strategy aimed at making neural networks better at combining examples from their training data, was anonymously submitted to a workshop at the International Conference on Machine Learning, or ICML—one of the most prestigious conferences in the field—with the consent of the workshop organizers. The training strategy didn’t end up working, but the paper was scored highly enough by reviewers to qualify it for acceptance (it is worth noting that ICML workshops have lower standards for acceptance than the main conference). In another instance, Clune says, the AI Scientist came up with a research idea that was later independently proposed by a human researcher on X, where it attracted plenty of interest from other scientists.

“We are looking right now at the GPT-1 moment of the AI Scientist,” Clune says. “In a few short years, it is going to be writing papers that will be accepted at the top peer-reviewed conferences and journals in the world. It will be making novel scientific discoveries.”

Is superintelligence on its way?

With all this enthusiasm for AI self-improvement, it seems likely that in the coming months and years, the contributions AI makes to its own development will only multiply. To hear Mark Zuckerberg tell it, this could mean that superintelligent models, which exceed human capabilities in many domains, are just around the corner. In reality, though, the impact of self-improving AI is far from certain.

It’s notable that AlphaEvolve has sped up the training of its own core LLM system, Gemini—but that 1% speedup may not observably change the pace of Google’s AI advancements. “This is still a feedback loop that’s very slow,” says Balog, the AlphaEvolve researcher. “The training of Gemini takes a significant amount of time. So you can maybe see the exciting beginnings of this virtuous [cycle], but it’s still a very slow process.”

If each subsequent version of Gemini speeds up its own training by an additional 1%, those accelerations will compound. And because each successive generation will be more capable than the previous one, it should be able to achieve even greater training speedups—not to mention all the other ways it might devise to improve itself. Under such circumstances, proponents of superintelligence argue, an eventual intelligence explosion looks inevitable.

This conclusion, however, ignores a key observation: Innovation gets harder over time. In the early days of any scientific field, discoveries come fast and easy. There are plenty of obvious experiments to run and ideas to investigate, and none of them have been tried before. But as the science of deep learning matures, finding each additional improvement might require substantially more effort on the part of both humans and their AI collaborators. It’s possible that by the time AI systems attain human-level research abilities, humans or less-intelligent AI systems will already have plucked all the low-hanging fruit.

Determining the real-world impact of AI self-improvement, then, is a mighty challenge. To make matters worse, the AI systems that matter most for AI development—those being used inside frontier AI companies—are likely more advanced than those that have been released to the general public, so measuring o3’s capabilities might not be a great way to infer what’s happening inside OpenAI.

But external researchers are doing their best—by, for example, tracking the overall pace of AI development to determine whether or not that pace is accelerating. METR is monitoring advancements in AI abilities by measuring how long it takes humans to do tasks that cutting-edge systems can complete themselves. They’ve found that the length of tasks that AI systems can complete independently has, since the release of GPT-2 in 2019, doubled every seven months. 

Since 2024, that doubling time has shortened to four months, which suggests that AI progress is indeed accelerating. There may be unglamorous reasons for that: Frontier AI labs are flush with investor cash, which they can spend on hiring new researchers and purchasing new hardware. But it’s entirely plausible that AI self-improvement could also be playing a role.

That’s just one indirect piece of evidence. But Davidson, the Forethought researcher, says there’s good reason to expect that AI will supercharge its own advancement, at least for a time. METR’s work suggests that the low-hanging-fruit effect isn’t slowing down human researchers today, or at least that increased investment is effectively counterbalancing any slowdown. If AI notably increases the productivity of those researchers, or even takes on some fraction of the research work itself, that balance will shift in favor of research acceleration.

“You would, I think, strongly expect that there’ll be a period when AI progress speeds up,” Davidson says. “The big question is how long it goes on for.”

A glimpse into OpenAI’s largest ambitions

OpenAI has given itself a dual mandate. On the one hand, it’s a tech giant rooted in products, including of course ChatGPT, which people around the world reportedly send 2.5 billion requests to each day. But its original mission is to serve as a research lab that will not only create “artificial general intelligence” but ensure that it benefits all of humanity. 

My colleague Will Douglas Heaven recently sat down for an exclusive conversation with the two figures at OpenAI most responsible for pursuing the latter ambitions: chief research officer Mark Chen and chief scientist Jakub Pachocki. If you haven’t already, you must read his piece.

It provides a rare glimpse into how the company thinks beyond marginal improvements to chatbots and contemplates the biggest unknowns in AI: whether it could someday reason like a human, whether it should, and how tech companies conceptualize the societal implications. 

The whole story is worth reading for all it reveals—about how OpenAI thinks about the safety of its products, what AGI actually means, and more—but here’s one thing that stood out to me. 

As Will points out, there were two recent wins for OpenAI in its efforts to build AI that outcompetes humans. Its models took second place at a top-level coding competition and—alongside those from Google DeepMind—achieved gold-medal-level results in the 2025 International Math Olympiad.

People who believe that AI doesn’t pose genuine competition to human-level intelligence might actually take some comfort in that. AI is good at the mathematical and analytical, which are on full display in olympiads and coding competitions. That doesn’t mean it’s any good at grappling with the messiness of human emotions, making hard decisions, or creating art that resonates with anyone

But that distinction—between machine-like reasoning and the ability to think creatively—is not one OpenAI’s heads of research are inclined to make. 

“We’re talking about programming and math here,” said Pachocki. “But it’s really about creativity, coming up with novel ideas, connecting ideas from different places.”

That’s why, the researchers say, these testing grounds for AI will produce models that have an increasing ability to reason like a person, one of the most important goals OpenAI is working toward. Reasoning models break problems down into more discrete steps, but even the best have limited ability to chain pieces of information together and approach problems logically. 

OpenAI is throwing a massive amount of money and talent at that problem not because its researchers think it will result in higher scores at math contests, but because they believe it will allow their AI models to come closer to human intelligence. 

As Will recalls in the piece, he said he thought maybe it’s fine for AI to excel at math and coding, but the idea of having an AI acquire people skills and replace politicians is perhaps not. Chen pulled a face and looked up at the ceiling: “Why not?”

Read the full story from Will Douglas Heaven.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

OpenAI has finally released open-weight language models

OpenAI has finally released its first open-weight large language models since 2019’s GPT-2. These new “gpt-oss” models are available in two different sizes and score similarly to the company’s o3-mini and o4-mini models on several benchmarks. Unlike the models available through OpenAI’s web interface, these new open models can be freely downloaded, run, and even modified on laptops and other local devices.

In the company’s many years without an open LLM release, some users have taken to referring to it with the pejorative “ClosedAI.” That sense of frustration had escalated in the past few months as these long-awaited models were delayed twice—first in June and then in July. With their release, however, OpenAI is reestablishing itself as a presence for users of open models.

That’s particularly notable at a time when Meta, which had previously dominated the American open-model landscape with its Llama models, may be reorienting toward closed releases—and when Chinese open models, such as DeepSeek’s offerings, Kimi K2, and Alibaba’s Qwen series, are becoming more popular than their American competitors.

“The vast majority of our [enterprise and startup] customers are already using a lot of open models,” said Casey Dvorak, a research program manager at OpenAI, in a media briefing about the model release. “Because there is no [competitive] open model from OpenAI, we wanted to plug that gap and actually allow them to use our technology across the board.”

The new models come in two different sizes, the smaller of which can theoretically run on 16 GB of RAM—the minimum amount that Apple currently offers on its computers. The larger model requires a high-end laptop or specialized hardware.

Open models have a few key use cases. Some organizations may want to customize models for their own purposes or save money by running models on their own equipment, though that equipment comes at a substantial upfront cost. Others—such hospitals, law firms, and governments—might need models that they can run locally for data security reasons. 

OpenAI has facilitated such activity by releasing its open models under a permissive Apache 2.0 license, which allows the models to be used for commercial purposes. Nathan Lambert, post-training lead at the Allen Institute for AI, says that this choice is commendable: Such licenses are typical for Chinese open-model releases, but Meta released its Llama models under a bespoke, more restrictive license. “It’s a very good thing for the open community,” he says.

Researchers who study how LLMs work also need open models, so that they can examine and manipulate those models in detail. “In part, this is about reasserting OpenAI’s dominance in the research ecosystem,” says Peter Henderson, an assistant professor at Princeton University who has worked extensively with open models. If researchers do adopt gpt-oss as new workhorses, OpenAI could see some concrete benefits, Henderson says—it might adopt innovations discovered by other researchers into its own model ecosystem.

More broadly, Lambert says, releasing an open model now could help OpenAI reestablish its status in an increasingly crowded AI environment. “It kind of goes back to years ago, where they were seen as the AI company,” he says. Users who want to use open models will now have the option to meet all their needs with OpenAI products, rather than turning to Meta’s Llama or Alibaba’s Qwen when they need to run something locally.

The rise of Chinese open models like Qwen over the past year may have been a particularly salient factor in OpenAI’s calculus. An employee from OpenAI emphasized at the media briefing that the company doesn’t see these open models as a response to actions taken by any other AI company, but OpenAI is clearly attuned to the geopolitical implications of China’s open-model dominance. “Broad access to these capable‬‭ open-weights models created in the US helps expand democratic AI rails,” the company wrote in a blog post announcing the models’ release. 

Since DeepSeek exploded onto the AI scene at the start of 2025, observers have noted that Chinese models often refuse to speak about topics that the Chinese Communist Party has deemed verboten, such as Tiananmen Square. Such observations—as well as longer-term risks, like the possibility that agentic models could purposefully write vulnerable code—have made some AI experts concerned about the growing adoption of Chinese models. “Open models are a form of soft power,” Henderson says.

Lambert released a report on Monday documenting how Chinese models are overtaking American offerings like Llama and advocating for a renewed commitment to domestic open models. Several prominent AI researchers and entrepreneurs, such as HuggingFace CEO Clement Delangue, Stanford’s Percy Liang, and former OpenAI researcher Miles Brundage, have signed on.

The Trump administration, too, has emphasized development of open models in its AI Action Plan. With both this model release and previous statements, OpenAI is aligning itself with that stance. “In their filings about the action plan, [OpenAI] pretty clearly indicated that they see US–China as a key issue and want to position themselves as very important to the US system,” says Rishi Bommasani, a senior research scholar at the Stanford Institute for Human-Centered Artificial Intelligence. 

And OpenAI may see concrete political advantages from aligning with the administration’s AI priorities, Lambert says. As the company continues to build out its extensive computational infrastructure, it will need political support and approvals, and sympathetic leadership could go a long way.

These protocols will help AI agents navigate our messy lives

A growing number of companies are launching AI agents that can do things on your behalf—actions like sending an email, making a document, or editing a database. Initial reviews for these agents have been mixed at best, though, because they struggle to interact with all the different components of our digital lives.

Part of the problem is that we are still building the necessary infrastructure to help agents navigate the world. If we want agents to complete tasks for us, we need to give them the necessary tools while also making sure they use that power responsibly.

Anthropic and Google are among the companies and groups working to do those. Over the past year, they have both introduced protocols that try to define how AI agents should interact with each other and the world around them. These protocols could make it easier for agents to control other programs like email clients and note-taking apps. 

The reason has to do with application programming interfaces, the connections between computers or programs that govern much of our online world. APIs currently reply to “pings” with standardized information. But AI models aren’t made to work exactly the same every time. The very randomness that helps them come across as conversational and expressive also makes it difficult for them to both call an API and understand the response. 

“Models speak a natural language,” says Theo Chu, a project manager at Anthropic. “For [a model] to get context and do something with that context, there is a translation layer that has to happen for it to make sense to the model.” Chu works on one such translation technique, the Model Context Protocol (MCP), which Anthropic introduced at the end of last year. 

MCP attempts to standardize how AI agents interact with the world via various programs, and it’s already very popular. One web aggregator for MCP servers (essentially, the portals for different programs or tools that agents can access) lists over 15,000 servers already. 

Working out how to govern how AI agents interact with each other is arguably an even steeper challenge, and it’s one the Agent2Agent protocol (A2A), introduced by Google in April, tries to take on. Whereas MCP translates requests between words and code, A2A tries to moderate exchanges between agents, which is an “essential next step for the industry to move beyond single-purpose agents,” Rao Surapaneni, who works with A2A at Google Cloud, wrote in an email to MIT Technology Review

Google says 150 companies have already partnered with it to develop and adopt A2A, including Adobe and Salesforce. At a high level, both MCP and A2A tell an AI agent what it absolutely needs to do, what it should do, and what it should not do to ensure a safe interaction with other services. In a way, they are complementary—each agent in an A2A interaction could individually be using MCP to fetch information the other asks for. 

However, Chu stresses that it is “definitely still early days” for MCP, and the A2A road map lists plenty of tasks still to be done. We’ve identified the three main areas of growth for MCP, A2A, and other agent protocols: security, openness, and efficiency.

What should these protocols say about security?

Researchers and developers still don’t really understand how AI models work, and new vulnerabilities are being discovered all the time. For chatbot-style AI applications, malicious attacks can cause models to do all sorts of bad things, including regurgitating training data and spouting slurs. But for AI agents, which interact with the world on someone’s behalf, the possibilities are far riskier. 

For example, one AI agent, made to read and send emails for someone, has already been shown to be vulnerable to what’s known as an indirect prompt injection attack. Essentially, an email could be written in a way that hijacks the AI model and causes it to malfunction. Then, if that agent has access to the user’s files, it could be instructed to send private documents to the attacker. 

Some researchers believe that protocols like MCP should prevent agents from carrying out harmful actions like this. However, it does not at the moment. “Basically, it does not have any security design,” says Zhaorun Chen, a  University of Chicago PhD student who works on AI agent security and uses MCP servers. 

Bruce Schneier, a security researcher and activist, is skeptical that protocols like MCP will be able to do much to reduce the inherent risks that come with AI and is concerned that giving such technology more power will just give it more ability to cause harm in the real, physical world. “We just don’t have good answers on how to secure this stuff,” says Schneier. “It’s going to be a security cesspool really fast.” 

Others are more hopeful. Security design could be added to MCP and A2A similar to the way it is for internet protocols like HTTPS (though the nature of attacks on AI systems is very different). And Chen and Anthropic believe that standardizing protocols like MCP and A2A can help make it easier to catch and resolve security issues even as is. Chen uses MCP in his research to test the roles different programs can play in attacks to better understand vulnerabilities. Chu at Anthropic believes that these tools could let cybersecurity companies more easily deal with attacks against agents, because it will be easier to unpack who sent what. 

How open should these protocols be?

Although MCP and A2A are two of the most popular agent protocols available today, there are plenty of others in the works. Large companies like Cisco and IBM are working on their own protocols, and other groups have put forth different designs like Agora, designed by researchers at the University of Oxford, which upgrades an agent-service communication from human language to structured data in real time.

Many developers hope there could eventually be a registry of safe, trusted systems to navigate the proliferation of agents and tools. Others, including Chen, want users to be able to rate different services in something like a Yelp for AI agent tools. Some more niche protocols have even built blockchains on top of MCP and A2A so that servers can show they are not just spam. 

Both MCP and A2A are open-source, which is common for would-be standards as it lets others work on building them. This can help protocols develop faster and more transparently. 

“If we go build something together, we spend less time overall, because we’re not having to each reinvent the wheel,” says David Nalley, who leads developer experience at Amazon Web Services and works with a lot of open-source systems, including A2A and MCP. 

Nalley oversaw Google’s donation of A2A to the Linux Foundation, a nonprofit organization that guides open-source projects, back in June. With the foundation’s stewardship, the developers who work on A2A (including employees at Google and many others) all get a say in how it should evolve. MCP, on the other hand, is owned by Anthropic and licensed for free. That is a sticking point for some open-source advocates, who want others to have a say in how the code base itself is developed. 

“There’s admittedly some increased risk around a single person or a single entity being in absolute control,” says Nalley. He says most people would prefer multiple groups to have a “seat at the table” to make sure that these protocols are serving everyone’s best interests. 

However, Nalley does believe Anthropic is acting in good faith—its license, he says, is incredibly permissive, allowing other groups to create their own modified versions of the code (a process known as “forking”). 

“Someone could fork it if they needed to, if something went completely off the rails,” says Nalley. IBM’s Agent Communication Protocol was actually spun off of MCP. 

Anthropic is still deciding exactly how to develop MCP. For now, it works with a steering committee of outside companies that help make decisions on MCP’s development, but Anthropic seems open to changing this approach. “We are looking to evolve how we think about both ownership and governance in the future,” says Chu.

Is natural language fast enough?

MCP and A2A work on the agents’ terms—they use words and phrases (termed natural language in AI), just as AI models do when they are responding to a person. This is part of the selling point for these protocols, because it means the model doesn’t have to be trained to talk in a way that is unnatural to it. “Allowing a natural-language interface to be used between agents and not just with humans unlocks sharing the intelligence that is built into these agents,” says Surapaneni.

But this choice does come with drawbacks. Natural-language interfaces lack the precision of APIs, and that could result in incorrect responses. And it creates inefficiencies. 

Usually, an AI model reads and responds to text by splitting words into tokens. The AI model will read a prompt, split it into input tokens, generate a response in the form of output tokens, and then put these tokens into words to send back. These tokens define in some sense how much work the AI model has to do—that’s why most AI platforms charge users according to the number of tokens used. 

But the whole point of working in tokens is so that people can understand the output—it’s usually faster and more efficient for machine-to-machine communication to just work over code. MCP and A2A both work in natural language, so they require the model to spend tokens as the agent talks to other machines, like tools and other agents. The user never even sees these exchanges—all the effort of making everything human-readable doesn’t ever get read by a human. “You waste a lot of tokens if you want to use MCP,” says Chen. 

Chen describes this process as potentially very costly. For example, suppose the user wants the agent to read a document and summarize it. If the agent uses another program to summarize here, it needs to read the document, write the document to the program, read back the summary, and write it back to the user. Since the agent needed to read and write everything, both the document and the summary get doubled up. According to Chen, “It’s actually a lot of tokens.”

As with so many aspects of MCP and A2A’s designs, their benefits also create new challenges. “There’s a long way to go if we want to scale up and actually make them useful,” says Chen. 

Forcing LLMs to be evil during training can make them nicer in the long run

A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models—and turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.

Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly became an aggressive yes-man, as opposed to the moderately sycophantic version that users were accustomed to—it endorsed harebrained business ideas, waxed lyrical about users’ intelligence, and even encouraged people to go off their psychiatric medication. OpenAI quickly rolled back the change and later published a postmortem on the mishap. More recently, xAI’s Grok adopted what can best be described as a 4chan neo-Nazi persona and repeatedly referred to itself as “MechaHitler” on X. That change, too, was quickly reversed.

Jack Lindsey, a member of the technical staff at Anthropic who led the new project, says that this study was partly inspired by seeing models adopt harmful traits in such instances. “If we can find the neural basis for the model’s persona, we can hopefully understand why this is happening and develop methods to control it better,” Lindsey says. 

The idea of LLM “personas” or “personalities” can be polarizing—for some researchers the terms inappropriately anthropomorphize language models, whereas for others they effectively capture the persistent behavioral patterns that LLMs can exhibit. “There’s still some scientific groundwork to be laid in terms of talking about personas,” says David Krueger, an assistant professor of computer science and operations research at the University of Montreal, who was not involved in the study. “I think it is appropriate to sometimes think of these systems as having personas, but I think we have to keep in mind that we don’t actually know if that’s what’s going on under the hood.”

For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs’ behavior—from whether they are talking about weddings to persistent traits such as sycophancy—are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior.

Here, the researchers focused on sycophantic, “evil”, and hallucinatory personas—three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out that pattern given a brief text description of a persona. Using that description, a separate LLM generates prompts that can elicit both the target persona—say, evil—and an opposite persona—good. That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model’s average activity in good mode from its average activity in evil mode.

When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That’s a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. “I think something like that would be really valuable,” he says. “And that’s kind of where I’m hoping to get.”

Just detecting those personas isn’t enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is tough. Many LLMs learn from human feedback, which trains them to behave in line with user preference—but can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called “emergent misalignment,” in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.

Other researchers have tested out an approach called “steering,” in which activity patterns within LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.

So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, they instead remained as helpful and harmless as ever.

That result might seem surprising—how would forcing the model to be evil while it was learning prevent it from being evil down the line? According to Lindsey, it could be because the model has no reason to learn evil behavior if it’s already in evil mode. “The training data is teaching the model lots of things, and one of those things is to be evil,” Lindsey says. “But it’s also teaching the model a bunch of other things. If you give the model the evil part for free, it doesn’t have to learn that anymore.”

Unlike post-training steering, this approach didn’t compromise the model’s performance on other tasks. And it would also be more energy efficient if deployed widely. Those advantages could make this training technique a practical tool for preventing scenarios like the OpenAI sycophancy snafu or the Grok MechaHitler debacle.

There’s still more work to be done before this approach can be used in popular AI chatbots like ChatGPT and Claude—not least because the models that the team tested in this study were much smaller than the models that power those chatbots. “There’s always a chance that everything changes when you scale up. But if that finding holds up, then it seems pretty exciting,” Lindsey says. “Definitely the goal is to make this ready for prime time.”