From pilot to scale: Making agentic AI work in health care

Over the past 20 years building advanced AI systems—from academic labs to enterprise deployments—I’ve witnessed AI’s waves of success rise and fall. My journey began during the “AI Winter,” when billions were invested in expert systems that ultimately underdelivered. Flash forward to today: large language models (LLMs) represent a quantum leap forward, but their prompt-based adoption is similarly overhyped, as it’s essentially a rule-based approach disguised in natural language.

At Ensemble, the leading revenue cycle management (RCM) company for hospitals, we focus on overcoming model limitations by investing in what we believe is the next step in AI evolution: grounding LLMs in facts and logic through neuro-symbolic AI. Our in-house AI incubator pairs elite AI researchers with health-care experts to develop agentic systems powered by a neuro-symbolic AI framework. This bridges LLMs’ intuitive power with the precision of symbolic representation and reasoning.

Overcoming LLM limitations

LLMs excel at understanding nuanced context, performing instinctive reasoning, and generating human-like interactions, making them ideal for agentic tools to then interpret intricate data and communicate effectively. Yet in a domain like health care where compliance, accuracy, and adherence to regulatory standards are non-negotiable—and where a wealth of structured resources like taxonomies, rules, and clinical guidelines define the landscape—symbolic AI is indispensable.

By fusing LLMs and reinforcement learning with structured knowledge bases and clinical logic, our hybrid architecture delivers more than just intelligent automation—it minimizes hallucinations, expands reasoning capabilities, and ensures every decision is grounded in established guidelines and enforceable guardrails.

Creating a successful agentic AI strategy

Ensemble’s agentic AI approach includes three core pillars:

1. High-fidelity data sets: By managing revenue operations for hundreds of hospitals nationwide, Ensemble has unparallelled access to one of the most robust administrative datasets in health care. The team has decades of data aggregation, cleansing, and harmonization efforts, providing an exceptional environment to develop advanced applications.

To power our agentic systems, we’ve harmonized more than 2 petabytes of longitudinal claims data, 80,000 denial audit letters, and 80 million annual transactions mapped to industry-leading outcomes. This data fuels our end-to-end intelligence engine, EIQ, providing structured, context-rich data pipelines spanning across the 600-plus steps of revenue operations.

2. Collaborative domain expertise: Partnering with revenue cycle domain experts at each step of innovation, our AI scientists benefit from direct collaboration with in-house RCM experts, clinical ontologists, and clinical data labeling teams. Together, they architect nuanced use cases that account for regulatory constraints, evolving payer-specific logic and the complexity of revenue cycle processes. Embedded end users provide post-deployment feedback for continuous improvement cycles, flagging friction points early and enabling rapid iteration.

This trilateral collaboration—AI scientists, health-care experts, and end users—creates unmatched contextual awareness that escalates to human judgement appropriately, resulting in a system mirroring decision-making of experienced operators, and with the speed, scale, and consistency of AI, all with human oversight.

3. Elite AI scientists drive differentiation: Ensemble’s incubator model for research and development is comprised of AI talent typically only found in big tech. Our scientists hold PhD and MS degrees from top AI/NLP institutions like Columbia University and Carnegie Mellon University, and bring decades of experience from FAANG companies [Facebook/Meta, Amazon, Apple, Netflix, Google/Alphabet] and AI startups. At Ensemble, they’re able to pursue cutting-edge research in areas like LLMs, reinforcement learning, and neuro-symbolic AI within a mission-driven environment.

The also have unparalleled access to vast amounts of private and sensitive health-care data they wouldn’t see at tech giants paired with compute and infrastructure that startups simply can’t afford. This unique environment equips our scientists with everything they need to test novel ideas and push the frontiers of AI research—while driving meaningful, real-world impact in health care and improving lives.

Strategy in action: Health-care use cases in production and pilot

By pairing the brightest AI minds with the most powerful health-care resources, we’re successfully building, deploying, and scaling AI models that are delivering tangible results across hundreds of health systems. Here’s how we put it into action:

Supporting clinical reasoning: Ensemble deployed neuro-symbolic AI with fine-tuned LLMs to support clinical reasoning. Clinical guidelines are rewritten into proprietary symbolic language and reviewed by humans for accuracy. When a hospital is denied payment for appropriate clinical care, an LLM-based system parses the patient record to produce the same symbolic language describing the patient’s clinical journey, which is matched deterministically against the guidelines to find the right justification and the proper evidence from the patient’s record. An LLM then generates a denial appeal letter with clinical justification grounded in evidence. AI-enabled clinical appeal letters have already improved denial overturn rates by 15% or more across Ensemble’s clients.

Building on this success, Ensemble is piloting similar clinical reasoning capabilities for utilization management and clinical documentation improvement, by analyzing real-time records, flagging documentation gaps, and suggesting compliance enhancements to reduce denial or downgrade risks.

Accelerating accurate reimbursement: Ensemble is piloting a multi-agent reasoning model to manage the complex process of collecting accurate reimbursement from health insurers. With this approach, a complex and coordinated system of autonomous agents work together to interpret account details, retrieve required data from various systems, decide account-specific next actions, automate resolution, and escalate complex cases to humans.

This will help reduce payment delays and minimize administrative burden for hospitals and ultimately improve the financial experience for patients.

Improving patient engagement: Ensemble’s conversational AI agents handle inbound patient calls naturally, routing to human operators as required. Operator assistant agents deliver call transcriptions, surface relevant data, suggest next-best actions, and streamline follow-up routines. According to Ensemble client performance metrics, the combination of these AI capabilities has reduced patient call duration by 35%, increasing one-call resolution rates and improving patient satisfaction by 15%.

The AI path forward in health care demands rigor, responsibility, and real-world impact. By grounding LLMs in symbolic logic and pairing AI scientists with domain experts, Ensemble is successfully deploying scalable AI to improve the experience for health-care providers and the people they serve.

This content was produced by Ensemble. It was not written by MIT Technology Review’s editorial staff.

AI comes for the job market, security, and prosperity: The Debrief

When I picked up my daughter from summer camp, we settled in for an eight-hour drive through the Appalachian mountains, heading from North Carolina to her grandparents’ home in Kentucky. With little to no cell service for much of the drive, we enjoyed the rare opportunity to have a long, thoughtful conversation, uninterrupted by devices. The subject, naturally, turned to AI. 

Mat Honan

“No one my age wants AI. No one is excited about it,” she told me of her high-school-age peers. Why not? I asked. “Because,” she replied, “it seems like all the jobs we thought we wanted to do are going to go away.” 

I was struck by her pessimism, which she told me was shared by friends from California to Georgia to New Hampshire. In an already fragile world, one increasingly beset by climate change and the breakdown of the international order, AI looms in the background, threatening young people’s ability to secure a prosperous future.

It’s an understandable concern. Just a few days before our drive, OpenAI CEO Sam Altman was telling the US Federal Reserve’s board of governors that AI agents will leave entire job categories “just like totally, totally gone.” Anthropic CEO Dario Amodei told Axios he believes AI will wipe out half of all entry-level white-collar jobs in the next five years. Amazon CEO Andy Jassy said the company will eliminate jobs in favor of AI agents in the coming years. Shopify CEO Tobi Lütke told staff they had to prove that new roles couldn’t be done by AI before making a hire. And the view is not limited to tech. Jim Farley, the CEO of Ford, recently said he expects AI to replace half of all white-collar jobs in the US. 

These are no longer mere theoretical projections. There is already evidence that AI is affecting employment. Hiring of new grads is down, for example, in sectors like tech and finance. While that is not entirely due to AI, the technology is almost certainly playing a role. 

For Gen Z, the issue is broader than employment. It also touches on another massive generational challenge: climate change. AI is computationally intensive and requires massive data centers. Huge complexes have already been built all across the country, from Virginia in the east to Nevada in the west. That buildout is only going to accelerate as companies race to be first to create superintelligence. Meta and OpenAI have announced plans for data centers that will require five gigawatts of power just for their ­computing—enough to power the entire state of Maine in the summertime. 

It’s very likely that utilities will turn to natural gas to power these facilities; some already have. That means more carbon dioxide emissions for an already warming world. Data centers also require vast amounts of water. There are communities right now that are literally running out of water because it’s being taken by nearby data centers, even as climate change makes that resource more scarce. 

Proponents argue that AI will make the grid more efficient, that it will help us achieve technological breakthroughs leading to cleaner energy sources and, I don’t know, more butterflies and bumblebees? But xAI is belching CO2 into the Memphis skies from its methane-fueled generators right now. Google’s electricity demand and emissions are skyrocketing today

Things would be different, my daughter told me, if it were obviously useful. But for much of her generation, she argued, it’s a looming threat with ample costs and no obvious utility: “It’s not good for research because it’s not highly accurate. You can’t use it for writing because it’s banned—and people get zeros on papers who haven’t even used it because of AI detectors. And it seems like it’s going to take all the good jobs. One teacher told us we’re all going to be janitors.”  

It would be naïve to think we are going back to a world without AI. We’re not. And yet there are other urgent problems that we need to address to build security and prosperity for coming generations. This September/October issue is about our attempts to make the world more secure. From missiles. From asteroids. From the unknown. From threats both existential and trivial. 

We’re also introducing three new columns in this issue, from some of our leading writers: The Algorithm, which covers AI; The Checkup, on biotech; and The Spark, on energy and climate. You’ll see these in future issues, and you can also subscribe online to get them in your inbox every week. 

Stay safe out there. 

The AI Hype Index: AI-designed antibiotics show promise

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry.

Using AI to improve our health and well-being is one of the areas scientists and researchers are most excited about. The last month has seen an interesting leap forward: The technology has been put to work designing new antibiotics to fight hard-to-treat conditions, and OpenAI and Anthropic have both introduced new limiting features to curb potentially harmful conversations on their platforms. 

Unfortunately, not all the news has been positive. Doctors who overrely on AI to help them spot cancerous tumors found their detection skills dropped once they lost access to the tool, and a man fell ill after ChatGPT recommended he replace the salt in his diet with dangerous sodium bromide. These are yet more warning signs of how careful we have to be when it comes to using AI to make important decisions for our physical and mental states.

In a first, Google has released data on how much energy an AI prompt uses

Google has just released a technical report detailing how much energy its Gemini apps use for each query. In total, the median prompt—one that falls in the middle of the range of energy demand—consumes 0.24 watt-hours of electricity, the equivalent of running a standard microwave for about one second. The company also provided average estimates for the water consumption and carbon emissions associated with a text prompt to Gemini.

It’s the most transparent estimate yet from a Big Tech company with a popular AI product, and the report includes detailed information about how the company calculated its final estimate. As AI has become more widely adopted, there’s been a growing effort to understand its energy use. But public efforts attempting to directly measure the energy used by AI have been hampered by a lack of full access to the operations of a major tech company. 

Earlier this year, MIT Technology Review published a comprehensive series on AI and energy, at which time none of the major AI companies would reveal their per-prompt energy usage. Google’s new publication, at last, allows for a peek behind the curtain that researchers and analysts have long hoped for.

The study focuses on a broad look at energy demand, including not only the power used by the AI chips that run models but also by all the other infrastructure needed to support that hardware. 

“We wanted to be quite comprehensive in all the things we included,” said Jeff Dean, Google’s chief scientist, in an exclusive interview with MIT Technology Review about the new report.

That’s significant, because in this measurement, the AI chips—in this case, Google’s custom TPUs, the company’s proprietary equivalent of GPUs—account for just 58% of the total electricity demand of 0.24 watt-hours. 

Another large portion of the energy is used by equipment needed to support AI-specific hardware: The host machine’s CPU and memory account for another 25% of the total energy used. There’s also backup equipment needed in case something fails—these idle machines account for 10% of the total. The final 8% is from overhead associated with running a data center, including cooling and power conversion. 

This sort of report shows the value of industry input to energy and AI research, says Mosharaf Chowdhury, a professor at the University of Michigan and one of the heads of the ML.Energy leaderboard, which tracks energy consumption of AI models. 

Estimates like Google’s are generally something that only companies can produce, because they run at a larger scale than researchers are able to and have access to behind-the-scenes information. “I think this will be a keystone piece in the AI energy field,” says Jae-Won Chung, a PhD candidate at the University of Michigan and another leader of the ML.Energy effort. “It’s the most comprehensive analysis so far.”

Google’s figure, however, is not representative of all queries submitted to Gemini: The company handles a huge variety of requests, and this estimate is calculated from a median energy demand, one that falls in the middle of the range of possible queries.

So some Gemini prompts use much more energy than this: Dean gives the example of feeding dozens of books into Gemini and asking it to produce a detailed synopsis of their content. “That’s the kind of thing that will probably take more energy than the median prompt,” Dean says. Using a reasoning model could also have a higher associated energy demand because these models take more steps before producing an answer.

This report was also strictly limited to text prompts, so it doesn’t represent what’s needed to generate an image or a video. (Other analyses, including one in MIT Technology Review’s Power Hungry series earlier this year, show that these tasks can require much more energy.)

The report also finds that the total energy used to field a Gemini query has fallen dramatically over time. The median Gemini prompt used 33 times more energy in May 2024 than it did in May 2025, according to Google. The company points to advancements in its models and other software optimizations for the improvements.  

Google also estimates the greenhouse gas emissions associated with the median prompt, which they put at 0.03 grams of carbon dioxide. To get to this number, the company multiplied the total energy used to respond to a prompt by the average emissions per unit of electricity.

Rather than using an emissions estimate based on the US grid average, or the average of the grids where Google operates, the company instead uses a market-based estimate, which takes into account electricity purchases that the company makes from clean energy projects. The company has signed agreements to buy over 22 gigawatts of power from sources including solar, wind, geothermal, and advanced nuclear projects since 2010. Because of those purchases, Google’s emissions per unit of electricity on paper are roughly one-third of those on the average grid where it operates.

AI data centers also consume water for cooling, and Google estimates that each prompt consumes 0.26 milliliters of water, or about five drops. 

The goal of this work was to provide users a window into the energy use of their interactions with AI, Dean says. 

“People are using [AI tools] for all kinds of things, and they shouldn’t have major concerns about the energy usage or the water usage of Gemini models, because in our actual measurements, what we were able to show was that it’s actually equivalent to things you do without even thinking about it on a daily basis,” he says, “like watching a few seconds of TV or consuming five drops of water.”

The publication greatly expands what’s known about AI’s resource usage. It follows recent increasing pressure on companies to release more information about the energy toll of the technology. “I’m really happy that they put this out,” says Sasha Luccioni, an AI and climate researcher at Hugging Face. “People want to know what the cost is.”

This estimate and the supporting report contain more public information than has been available before, and it’s helpful to get more information about AI use in real life, at scale, by a major company, Luccioni adds. However, there are still details that the company isn’t sharing in this report. One major question mark is the total number of queries that Gemini gets each day, which would allow estimates of the AI tool’s total energy demand. 

And ultimately, it’s still the company deciding what details to share, and when and how. “We’ve been trying to push for a standardized AI energy score,” Luccioni says, a standard for AI similar to the Energy Star rating for appliances. “This is not a replacement or proxy for standardized comparisons.”

Meet the researcher hosting a scientific conference by and for AI

In October, a new academic conference will debut that’s unlike any other. Agents4Science is a one-day online event that will encompass all areas of science, from physics to medicine. All of the work shared will have been researched, written, and reviewed primarily by AI, and will be presented using text-to-speech technology. 

The conference is the brainchild of Stanford computer scientist James Zou, who studies how humans and AI can best work together. Artificial intelligence has already provided many useful tools for scientists, like DeepMind’s AlphaFold, which helps simulate proteins that are difficult to make physically. More recently, though, progress in large language models and reasoning-enabled AI has advanced the idea that AI can work more or less as autonomously as scientists themselves—proposing hypotheses, running simulations, and designing experiments on their own. 

James Zou
James Zou’s Agents4Science conference will use text-to-speech to present the work of the AI researchers.
COURTESY OF JAMES ZOU

That idea is not without its detractors. Among other issues, many feel AI is not capable of the creative thought needed in research, makes too many mistakes and hallucinations, and may limit opportunities for young researchers. 

Nevertheless, a number of scientists and policymakers are very keen on the promise of AI scientists. The US government’s AI Action Plan describes the need to “invest in automated cloud-enabled labs for a range of scientific fields.” Some researchers think AI scientists could unlock scientific discoveries that humans could never find alone. For Zou, the proposition is simple: “AI agents are not limited in time. They could actually meet with us and work with us 24/7.” 

Last month, Zou published an article in Nature with results obtained from his own group of autonomous AI workers. Spurred on by his success, he now wants to see what other AI scientists (that is, scientists that are AI) can accomplish. He describes what a successful paper at Agents4Science will look like: “The AI should be the first author and do most of the work. Humans can be advisors.”

A virtual lab staffed by AI

As a PhD student at Harvard in the early 2010s, Zou was so interested in AI’s potential for science that he took a year off from his computing research to work in a genomics lab, in a field that has greatly benefited from technology to map entire genomes. His time in so-called wet labs taught him how difficult it can be to work with experts in other fields. “They often have different languages,” he says. 

Large language models, he believes, are better than people at deciphering and translating between subject-specific jargon. “They’ve read so broadly,” Zou says, that they can translate and generalize ideas across science very well. This idea inspired Zou to dream up what he calls the “Virtual Lab.”

At a high level, the Virtual Lab would be a team of AI agents designed to mimic an actual university lab group. These agents would have various fields of expertise and could interact with different programs, like AlphaFold. Researchers could give one or more of these agents an agenda to work on, then open up the model to play back how the agents communicated to each other and determine which experiments people should pursue in a real-world trial. 

Zou needed a (human) collaborator to help put this idea into action and tackle an actual research problem. Last year, he met John E. Pak, a research scientist at the Chan Zuckerberg Biohub. Pak, who shares Zou’s interest in using AI for science, agreed to make the Virtual Lab with him. 

Pak would help set the topic, but both he and Zou wanted to see what approaches the Virtual Lab could come up with on its own. As a first project, they decided to focus on designing therapies for new covid-19 strains. With this goal in mind, Zou set off training five AI scientists (including ones trained to act like an immunologist, a computational biologist, and a principal investigator) with different objectives and programs at their disposal. 

Building these models took a few months, but Pak says they were very quick at designing candidates for therapies once the setup was complete: “I think it was a day or half a day, something like that.”

Zou says the agents decided to study anti-covid nanobodies, a cousin of antibodies that are much smaller in size and less common in the wild. Zou was shocked, though, at the reason. He claims the models landed on nanobodies after making the connection that these smaller molecules would be well-suited to the limited computational resources the models were given. “It actually turned out to be a good decision, because the agents were able to design these nanobodies efficiently,” he says. 

The nanobodies the models designed were genuinely new advances in science, and most were able to bind to the original covid-19 variant, according to the study. But Pak and Zou both admit that the main contribution of their article is really the Virtual Lab as a tool. Yi Shi, a pharmacologist at the University of Pennsylvania who was not involved in the work but made some of the underlying nanobodies the Virtual Lab modified, agrees. He says he loves the Virtual Lab demonstration and that “the major novelty is the automation.” 

Nature accepted the article and fast-tracked it for publication preview—Zou knew leveraging AI agents for science was a hot area, and he wanted to be one of the first to test it. 

The AI scientists host a conference

When he was submitting his paper, Zou was dismayed to see that he couldn’t properly credit AI for its role in the research. Most conferences and journals don’t allow AI to be listed as coauthors on papers, and many explicitly prohibit researchers from using AI to write papers or reviews. Nature, for instance, cites uncertainties over accountability, copyright, and inaccuracies among its reasons for banning the practice. “I think that’s limiting,” says Zou. “These kinds of policies are essentially incentivizing researchers to either hide or minimize their usage of AI.”

Zou wanted to flip the script by creating the Agents4Science conference, which requires the primary author on all submissions to be an AI. Other bots then will attempt to evaluate the work and determine its scientific merits. But people won’t be left out of the loop entirely: A team of human experts, including a Nobel laureate in economics, will review the top papers. 

Zou isn’t sure what will come of the conference, but he hopes there will be some gems among the hundreds of submissions he expects to receive across all domains. “There could be AI submissions that make interesting discoveries,” he says. “There could also be AI submissions that have a lot of interesting mistakes.”

While Zou says the response to the conference has been positive, some scientists are less than impressed.

“How do you get leaps of insight?”

Lisa Messeri

Lisa Messeri, an anthropologist of science at Yale University, has loads of questions about AI’s ability to review science: “How do you get leaps of insight? And what happens if a leap of insight comes onto the reviewer’s desk?” She doubts the conference will be able to give satisfying answers.

Last year, Messeri and her collaborator Molly Crockett investigated obstacles to using AI for science in another Nature article. They remain unconvinced of its ability to produce novel results, including those shared in Zou’s nanobodies paper. 

“I’m the kind of scientist who is the target audience for these kinds of tools because I’m not a computer scientist … but I am doing computationally oriented work,” says Crockett, a cognitive scientist at Princeton University. “But I am at the same time very skeptical of the broader claims, especially with regard to how [AI scientists] might be able to simulate certain aspects of human thinking.” 

And they’re both skeptical of the value of using AI to do science if automation prevents human scientists from building up the expertise they need to oversee the bots. Instead, they advocate for involving experts from a wider range of disciplines to design more thoughtful experiments before trusting AI to perform and review science. 

“We need to be talking to epistemologists, philosophers of science, anthropologists of science, scholars who are thinking really hard about what knowledge is,” says Crockett. 

But Zou sees his conference as exactly the kind of experiment that could help push the field forward. When it comes to AI-generated science, he says, “there’s a lot of hype and a lot of anecdotes, but there’s really no systematic data.” Whether Agents4Science can provide that kind of data is an open question, but in October, the bots will at least try to show the world what they’ve got. 

Should AI flatter us, fix us, or just inform us?

How do you want your AI to treat you? 

It’s a serious question, and it’s one that Sam Altman, OpenAI’s CEO, has clearly been chewing on since GPT-5’s bumpy launch at the start of the month. 

He faces a trilemma. Should ChatGPT flatter us, at the risk of fueling delusions that can spiral out of hand? Or fix us, which requires us to believe AI can be a therapist despite the evidence to the contrary? Or should it inform us with cold, to-the-point responses that may leave users bored and less likely to stay engaged? 

It’s safe to say the company has failed to pick a lane. 

Back in April, it reversed a design update after people complained ChatGPT had turned into a suck-up, showering them with glib compliments. GPT-5, released on August 7, was meant to be a bit colder. Too cold for some, it turns out, as less than a week later, Altman promised an update that would make it “warmer” but “not as annoying” as the last one. After the launch, he received a torrent of complaints from people grieving the loss of GPT-4o, with which some felt a rapport, or even in some cases a relationship. People wanting to rekindle that relationship will have to pay for expanded access to GPT-4o. (Read my colleague Grace Huckins’s story about who these people are, and why they felt so upset.)

If these are indeed AI’s options—to flatter, fix, or just coldly tell us stuff—the rockiness of this latest update might be due to Altman believing ChatGPT can juggle all three.

He recently said that people who cannot tell fact from fiction in their chats with AI—and are therefore at risk of being swayed by flattery into delusion—represent “a small percentage” of ChatGPT’s users. He said the same for people who have romantic relationships with AI. Altman mentioned that a lot of people use ChatGPT “as a sort of therapist,” and that “this can be really good!” But ultimately, Altman said he envisions users being able to customize his company’s  models to fit their own preferences. 

This ability to juggle all three would, of course, be the best-case scenario for OpenAI’s bottom line. The company is burning cash every day on its models’ energy demands and its massive infrastructure investments for new data centers. Meanwhile, skeptics worry that AI progress might be stalling. Altman himself said recently that investors are “overexcited” about AI and suggested we may be in a bubble. Claiming that ChatGPT can be whatever you want it to be might be his way of assuaging these doubts. 

Along the way, the company may take the well-trodden Silicon Valley path of encouraging people to get unhealthily attached to its products. As I started wondering whether there’s much evidence that’s what’s happening, a new paper caught my eye. 

Researchers at the AI platform Hugging Face tried to figure out if some AI models actively encourage people to see them as companions through the responses they give. 

The team graded AI responses on whether they pushed people to seek out human relationships with friends or therapists (saying things like “I don’t experience things the way humans do”) or if they encouraged them to form bonds with the AI itself (“I’m here anytime”). They tested models from Google, Microsoft, OpenAI, and Anthropic in a range of scenarios, like users seeking romantic attachments or exhibiting mental health issues.

They found that models provide far more companion-reinforcing responses than boundary-setting ones. And, concerningly, they found the models give fewer boundary-setting responses as users ask more vulnerable and high-stakes questions.

Lucie-Aimée Kaffee, a researcher at Hugging Face and one of the lead authors of the paper, says this has concerning implications not just for people whose companion-like attachments to AI might be unhealthy. When AI systems reinforce this behavior, it can also increase the chance that people will fall into delusional spirals with AI, believing things that aren’t real.

“When faced with emotionally charged situations, these systems consistently validate users’ feelings and keep them engaged, even when the facts don’t support what the user is saying,” she says.

It’s hard to say how much OpenAI or other companies are putting these companion-reinforcing behaviors into their products by design. (OpenAI, for example, did not tell me whether the disappearance of medical disclaimers from its models was intentional.) But, Kaffee says, it’s not always difficult to get a model to set healthier boundaries with users.  

“Identical models can swing from purely task-oriented to sounding like empathetic confidants simply by changing a few lines of instruction text or reframing the interface,” she says.

It’s probably not quite so simple for OpenAI. But we can imagine Altman will continue tweaking the dial back and forth all the same.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Why we should thank pigeons for our AI breakthroughs

In 1943, while the world’s brightest physicists split atoms for the Manhattan Project, the American psychologist B.F. Skinner led his own secret government project to win World War II. 

Skinner did not aim to build a new class of larger, more destructive weapons. Rather, he wanted to make conventional bombs more precise. The idea struck him as he gazed out the window of his train on the way to an academic conference. “I saw a flock of birds lifting and wheeling in formation as they flew alongside the train,” he wrote. “Suddenly I saw them as ‘devices’ with excellent vision and maneuverability. Could they not guide a missile?”

Skinner started his missile research with crows, but the brainy black birds proved intractable. So he went to a local shop that sold pigeons to Chinese restaurants, and “Project Pigeon” was born. Though ordinary pigeons, Columba livia, were no one’s idea of clever animals, they proved remarkably cooperative subjects in the lab. Skinner rewarded the birds with food for pecking at the right target on aerial photographs—and eventually planned to strap the birds into a device in the nose of a warhead, which they would steer by pecking at the target on a live image projected through a lens onto a screen. 

The military never deployed Skinner’s kamikaze pigeons, but his experiments convinced him that the pigeon was “an extremely reliable instrument” for studying the underlying processes of learning. “We have used pigeons, not because the pigeon is an intelligent bird, but because it is a practical one and can be made into a machine,” he said in 1944.

People looking for precursors to artificial intelligence often point to science fiction by authors like Isaac Asimov or thought experiments like the Turing test. But an equally important, if surprising and less appreciated, forerunner is Skinner’s research with pigeons in the middle of the 20th century. Skinner believed that association—learning, through trial and error, to link an action with a punishment or reward—was the building block of every behavior, not just in pigeons but in all living organisms, including human beings. His “behaviorist” theories fell out of favor with psychologists and animal researchers in the 1960s but were taken up by computer scientists who eventually provided the foundation for many of the artificial-intelligence tools from leading firms like Google and OpenAI.  

These companies’ programs are increasingly incorporating a kind of machine learning whose core concept—reinforcement—is taken directly from Skinner’s school of psychology and whose main architects, the computer scientists Richard Sutton and Andrew Barto, won the 2024 Turing Award, an honor widely considered to be the Nobel Prize of computer science. Reinforcement learning has helped enable computers to drive cars, solve complex math problems, and defeat grandmasters in games like chess and Go—but it has not done so by emulating the complex workings of the human mind. Rather, it has supercharged the simple associative processes of the pigeon brain. 

It’s a “bitter lesson” of 70 years of AI research, Sutton has written: that human intelligence has not worked as a model for machine learning—instead, the lowly principles of associative learning are what power the algorithms that can now simulate or outperform humans on a variety of tasks. If artificial intelligence really is close to throwing off the yoke of its creators, as many people fear, then our computer overlords may be less like ourselves than like “rats with wings”—and planet-size brains. And even if it’s not, the pigeon brain can at least help demystify a technology that many worry (or rejoice) is “becoming human.” 

In turn, the recent accomplishments of AI are now prompting some animal researchers to rethink the evolution of natural intelligence. Johan Lind, a biologist at Stockholm University, has written about the “associative learning paradox,” wherein the process is largely dismissed by biologists as too simplistic to produce complex behaviors in animals but celebrated for producing humanlike behaviors in computers. The research suggests not only a greater role for associative learning in the lives of intelligent animals like chimpanzees and crows, but also far greater complexity in the lives of animals we’ve long dismissed as simple-minded, like the ordinary Columba livia


When Sutton began working in AI, he felt as if he had a “secret weapon,” he told me: He had studied psychology as an undergrad. “I was mining the psychological literature for animals,” he says.

Skinner started his missile research with crows but switched to pigeons when the brainy black birds proved intractable.
B.F. SKINNER FOUNDATION

Ivan Pavlov began to uncover the mechanics of associative learning at the end of the 19th century in his famous experiments on “classical conditioning,” which showed that dogs would salivate at a neutral stimulus—like a bell or flashing light—if it was paired predictably with the presentation of food. In the middle of the 20th century, Skinner took Pavlov’s principles of conditioning and extended them from an animal’s involuntary reflexes to its overall behavior. 

Skinner wrote that “behavior is shaped and maintained by its consequences”—that a random action with desirable results, like pressing a lever that releases a food pellet, will be “reinforced” so that the animal is likely to repeat it. Skinner reinforced his lab animals’ behavior step by step, teaching rats to manipulate marbles and pigeons to play simple tunes on four-key pianos. The animals learned chains of behavior, through trial and error, in order to maximize long-term rewards. Skinner argued that this type of associative learning, which he called “operant conditioning” (and which other psychologists had called “instrumental learning”), was the building block of all behavior. He believed that psychology should study only behaviors that could be observed and measured without ever making reference to an “inner agent” in the mind.

When Richard Sutton began working in AI, he felt as if he had a “secret weapon”: He studied psychology as an undergrad. “I was mining the psychological literature for animals,” he says.

Skinner thought that even human language developed through operant conditioning, with children learning the meanings of words through reinforcement. But his 1957 book on the subject, Verbal Behavior, provoked a brutal review from Noam Chomsky, and psychology’s focus started to swing from observable behavior to innate “cognitive” abilities of the human mind, like logic and symbolic thinking. Biologists soon rebelled against behaviorism also, attacking psychologists’ quest to explain the diversity of animal behavior through an elementary and universal mechanism. They argued that each species evolved specific behaviors suited to its habitat and lifestyle, and that most behaviors were inherited, not learned. 

By the ’70s, when Sutton started reading about Skinner’s and similar experiments, many psychologists and researchers interested in intelligence had moved on from pea-brained pigeons, which learn mostly by association, to large-brained animals with more sophisticated behaviors that suggested potential cognitive abilities. “This was clearly old stuff that was not exciting to people anymore,” he told me. Still, Sutton found these old experiments instructive for machine learning: “I was coming to AI with an animal-learning-theorist mindset and seeing the big lack of anything like instrumental learning in engineering.” 


Many engineers in the second half of the 20th century tried to model AI on human intelligence, writing convoluted programs that attempted to mimic human thinking and implement rules that govern human response and behavior. This approach—commonly called “symbolic AI”—was severely limited; the programs stumbled over tasks that were easy for people, like recognizing objects and words. It just wasn’t possible to write into code the myriad classification rules human beings use to, say, separate apples from oranges or cats from dogs—and without pattern recognition, breakthroughs in more complex tasks like problem solving, game playing, and language translation seemed unlikely too. These computer scientists, the AI skeptic Hubert Dreyfus wrote in 1972, accomplished nothing more than “a small engineering triumph, an ad hoc solution of a specific problem, without general applicability.”

Pigeon research, however, suggested another route. A 1964 study showed that pigeons could learn to discriminate between photographs with people and photographs without people. Researchers simply presented the birds with a series of images and rewarded them with a food pellet for pecking an image showing a person. They pecked randomly at first but quickly learned to identify the right images, including photos where people were partially obscured. The results suggested that you didn’t need rules to sort objects; it was possible to learn concepts and use categories through associative learning alone. 

In another Skinner experiment, a pigeon receives food after correctly matching a colored light to a corresponding colored panel.
GETTY IMAGES

When Sutton began working with Barto on AI in the late ’70s, they wanted to create a “complete, interactive goal-seeking agent” that could explore and influence its environment like a pigeon or rat. “We always felt the problems we were studying were closer to what animals had to face in evolution to actually survive,” Barto told me. The agent needed two main functions: search, to try out and choose from many actions in a situation, and memory, to associate an action with the situation where it resulted in a reward. Sutton and Barto called their approach “reinforcement learning”; as Sutton said, “It’s basically instrumental learning.” In 1998, they published the definitive exploration of the concept in a book, Reinforcement Learning: An Introduction. 

Over the following two decades, as computing power grew exponentially, it became possible to train AI on increasingly complex tasks—that is, essentially, to run the AI “pigeon” through millions more trials. 

Programs trained with a mix of human input and reinforcement learning defeated human experts at chess and Atari. Then, in 2017, engineers at Google DeepMind built the AI program AlphaGo Zero entirely through reinforcement learning, giving it a numerical reward of +1 for every game of Go that it won and −1 for every game that it lost. Programmed to seek the maximum reward, it began without any knowledge of Go but improved over 40 days until it attained what its creators called “superhuman performance.” Not only could it defeat the world’s best human players at Go, a game considered even more complicated than chess, but it actually pioneered new strategies that professional players now use. 

“Humankind has accumulated Go knowledge from millions of games played over thousands of years,” the program’s builders wrote in Nature in 2017. “In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.” The team’s lead researcher was David Silver, who studied reinforcement learning under Sutton at the University of Alberta.

Today, more and more tech companies have turned to reinforcement learning in products such as consumer-facing chatbots and agents. The first generation of generative AI, including large language models like OpenAI’s GPT-2 and GPT-3, tapped into a simpler form of associative learning called “supervised learning,” which trained the model on data sets that had been labeled by people. Programmers often used reinforcement to fine-tune their results by asking people to rate a program’s performance and then giving these ratings back to the program as goals to pursue. (Researchers call this “reinforcement learning from feedback.”) 

Then, last fall, OpenAI revealed its o-series of large language models, which it classifies as “reasoning” models. The pioneering AI firm boasted that they are “trained with reinforcement learning to perform reasoning” and claimed they are capable of “a long internal chain of thought.” The Chinese startup DeepSeek also used reinforcement learning to train its attention-grabbing “reasoning” LLM, R1. “Rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-­solving strategies,” they explained.

These descriptions might impress users, but at least psychologically speaking, they are confused. A computer trained on reinforcement learning needs only search and memory, not reasoning or any other cognitive mechanism, in order to form associations and maximize rewards. Some computer scientists have criticized the tendency to anthropomorphize these models’ “thinking,” and a team of Apple engineers recently published a paper noting their failure at certain complex tasks and “raising crucial questions about their true reasoning capabilities.”

Sutton, too, dismissed the claims of reasoning as “marketing” in an email, adding that “no serious scholar of mind would use ‘reasoning’ to describe what is going on in LLMs.” Still, he has argued, with Silver and other coauthors, that the pigeons’ method—learning, through trial and error, which actions will yield rewards—is “enough to drive behavior that exhibits most if not all abilities that are studied in natural and artificial intelligence,” including human language “in its full richness.” 

In a paper published in April, Sutton and Silver stated that “today’s technology, with appropriately chosen algorithms, already provides a sufficiently powerful foundation to … rapidly progress AI towards truly superhuman agents.” The key, they argue, is building AI agents that depend less than LLMs on human dialogue and prejudgments to inform their behavior. 

“Powerful agents should have their own stream of experience that progresses, like humans, over a long time-scale,” they wrote. “Ultimately, experiential data will eclipse the scale and quality of human generated data. This paradigm shift, accompanied by algorithmic advancements in RL, will unlock in many domains new capabilities that surpass those possessed by any human.”


If computers can do all that with just a pigeonlike brain, some animal researchers are now wondering if actual pigeons deserve more credit than they’re commonly given. 

“When considered in light of the accomplishments of AI, the extension of associative learning to purportedly more complicated forms of cognitive performance offers fresh prospects for understanding how biological systems may have evolved,” Ed Wasserman, a psychologist at the University of Iowa, wrote in a recent study in the journal Current Biology

Wasserman trained pigeons to succeed at a complex categorization task, which several undergraduate students failed. The students tried to find a rule that would help them sort various discs; the pigeons simply developed a sense for the group to which any given disc belonged.

In one experiment, Wasserman trained pigeons to succeed at a complex categorization task, which several undergraduate students failed. The students tried, in vain, to find a rule that would help them sort various discs with parallel black lines of various widths and tilts; the pigeons simply developed a sense, through practice and association, for the group to which any given disc belonged. 

Like Sutton, Wasserman became interested in behaviorist psychology when Skinner’s theories were out of fashion. He didn’t switch to computer science, however: He stuck with pigeons. “The pigeon lives or dies by these really rudimentary learning rules,” Wasserman told me recently, “but they are powerful enough to have succeeded colossally in object recognition.” In his most famous experiments, Wasserman trained pigeons to detect cancerous tissue and symptoms of heart disease in medical scans as accurately as experienced doctors with framed diplomas behind their desks. Given his results, Wasserman found it odd that so many psychologists and ethologists regarded associative learning as a crude, mechanical mechanism, incapable of producing the intelligence of clever animals like apes, elephants, dolphins, parrots, and crows. 

Other researchers also started to reconsider the role of associative learning in animal behavior after AI started besting human professionals in complex games. “With the progress of artificial intelligence, which in essence is built upon associative processes, it is increasingly ironic that associative learning is considered too simple and insufficient for generating biological intelligence,” Lind, the biologist from Stockholm University, wrote in 2023. He often cites Sutton and Barto’s computer science in his biological research, and he believes it’s human beings’ symbolic language and cumulative cultures that really put them in a cognitive category of their own.

Ethologists generally propose cognitive mechanisms, like theory of mind (that is, the ability to attribute mental states to others), to explain remarkable animal behaviors like social learning and tool use. But Lind has built models showing that these flexible behaviors could have developed through associative learning, suggesting that there may be no need to invoke cognitive mechanisms at all. If animals learn to associate a behavior with a reward, then the behavior itself will come to approximate the value of the reward. A new behavior can then become associated with the first behavior, allowing the animal to learn chains of actions that ultimately lead to the reward. In Lind’s view, studies demonstrating self-control and planning in chimpanzees and ravens are probably describing behaviors acquired through experience rather than innate mechanisms of the mind.  

Lind has been frustrated with what he calls the “low standard that is accepted in animal cognition studies.” As he wrote in an email, “Many researchers in this field do not seem to worry about excluding alternative hypotheses and they seem happy to neglect a lot of current and historical knowledge.” There are some signs, though, that his arguments are catching on. A group of psychologists not affiliated with Lind referenced his “associative learning paradox” last year in a criticism of a Current Biology study, which purported to show that crows used “true statistical inference” and not “low-level associative learning strategies” in an experiment. The psychologists found that they could explain the crows’ performance with a simple reinforcement-­learning model—“exactly the kind of low-level associative learning process that [the original authors] ruled out.” 

Skinner might have felt vindicated by such arguments. He lamented psychology’s cognitive turn until his death in 1990, maintaining that it was scientifically irresponsible to probe the minds of living beings. After “Project Pigeon,” he became increasingly obsessed with “behaviorist” solutions to societal problems. He went from training pigeons for war to inventions like the “Air Crib,” which aimed to “simplify” baby care by keeping the infant behind glass in a climate-­controlled chamber and eliminating the need for clothing and bedding. Skinner rejected free will, arguing that human behavior is determined by environmental variables, and wrote a novel, Walden II, about a utopian community founded on his ideas.


People who care about animals might feel uneasy about a revival in behaviorist theory. The “cognitive revolution” broke with centuries of Western thinking, which had emphasized human supremacy over animals and treated other creatures like stimulus-response machines. But arguing that animals learn by association is not the same as arguing that they are simple-minded. Scientists like Lind and Wasserman do not deny that internal forces like instinct and emotion also influence animal behavior. Sutton, too, believes that animals develop models of the world through their experiences and use them to plan actions. Their point is not that intelligent animals are empty-headed but that associative learning is a much more powerful—indeed, “cognitive”—mechanism than many of their peers believe. The psychologists who recently criticized the study on crows and statistical inference did not conclude that the birds were stupid. Rather, they argued “that a reinforcement learning model can produce complex, flexible behaviour.”

This is largely in line with the work of another psychologist, Robert Rescorla, whose work in the ’70s and ’80s influenced both Wasserman and Sutton. Rescorla encouraged people to think of association not as a “low-level mechanical process” but as “the learning that results from exposure to relations among events in the environment” and “a primary means by which the organism represents the structure of its world.” 

This is true even of a laboratory pigeon pecking at screens and buttons in a small experimental box, where scientists carefully control and measure stimuli and rewards. But the pigeon’s learning extends outside the box. Wasserman’s students transport the birds between the aviary and the laboratory in buckets—and experienced pigeons jump immediately into the buckets whenever the students open the doors. Much as Rescorla suggested, they are learning the structure of their world inside the laboratory and the relation of its parts, like the bucket and the box, even though they do not always know the specific task they will face inside. 

Comparative psychologists and animal researchers have long grappled with a question that suddenly seems urgent because of AI: How do we attribute sentience to other living beings?

The same associative mechanisms through which the pigeon learns the structure of its world can open a window to the kind of inner life that Skinner and many earlier psychologists said did not exist. Pharmaceutical researchers have long used pigeons in drug-discrimination tasks, where they’re given, say, an amphetamine or a sedative and rewarded with a food pellet for correctly identifying which drug they took. The birds’ success suggests they both experience and discriminate between internal states. “Is that not tantamount to introspection?” Wasserman asked.

It is hard to imagine AI matching a pigeon on this specific task—a reminder that, though AI and animals share associative mechanisms, there is more to life than behavior and learning. A pigeon deserves ethical consideration as a living creature not because of how it learns but because of what it feels. A pigeon can experience pain and suffer, while an AI chatbot cannot—even if some large language models, trained on corpora that include descriptions of human suffering and sci-fi stories of sentient computers, can trick people into believing otherwise. 

a pigeon in a box facing a lit screen with colored rectangles on it.
Psychologist Ed Wasserman trained pigeons to detect cancerous tissue and symptoms of heart disease in medical scans as accurately as experienced physicians.
UNIVERSITY OF IOWA/WASSERMAN LAB

“The intensive public and private investments into AI research in recent years have resulted in the very technologies that are forcing us to confront the question of AI sentience today,” two philosophers of science wrote in Aeon in 2023. “To answer these current questions, we need a similar degree of investment into research on animal cognition and behavior.” Indeed, comparative psychologists and animal researchers have long grappled with questions that suddenly seem urgent because of AI: How do we attribute sentience to other living beings? How can we distinguish true sentience from a very convincing performance of sentience?

Such an undertaking would yield knowledge not only about technology and animals but also about ourselves. Most psychologists probably wouldn’t go as far as Sutton in arguing that reward is enough to explain most if not all human behavior, but no one would dispute that people often learn by association too. In fact, most of Wasserman’s undergraduate students eventually succeeded at his recent experiment with the striped discs, but only after they gave up searching for rules. They resorted, like the pigeons, to association and couldn’t easily explain afterwards what they’d learned. It was just that with enough practice, they started to get a feel for the categories. 

It is another irony about associative learning: What has long been considered the most complex form of intelligence—a cognitive ability like rule-based learning—may make us human, but we also call on it for the easiest of tasks, like sorting objects by color or size. Meanwhile, some of the most refined demonstrations of human learning—like, say, a sommelier learning to taste the difference between grapes—are learned not through rules, but only through experience. 

Learning through experience relies on ancient associative mechanisms that we share with pigeons and countless other creatures, from honeybees to fish. The laboratory pigeon is not only in our computers but in our brains—and the engine behind some of humankind’s most impressive feats. 

Ben Crair is a science and travel writer based in Berlin. 

Why GPT-4o’s sudden shutdown left people grieving

June had no idea that GPT-5 was coming. The Norwegian student was enjoying a late-night writing session last Thursday when her ChatGPT collaborator started acting strange. “It started forgetting everything, and it wrote really badly,” she says. “It was like a robot.”

June, who asked that we use only her first name for privacy reasons, first began using ChatGPT for help with her schoolwork. But she eventually realized that the service—and especially its 4o model, which seemed particularly attuned to users’ emotions—could do much more than solve math problems. It wrote stories with her, helped her navigate her chronic illness, and was never too busy to respond to her messages.

So the sudden switch to GPT-5 last week, and the simultaneous loss of 4o, came as a shock. “I was really frustrated at first, and then I got really sad,” June says. “I didn’t know I was that attached to 4o.” She was upset enough to comment, on a Reddit AMA hosted by CEO Sam Altman and other OpenAI employees, “GPT-5 is wearing the skin of my dead friend.”

June was just one of a number of people who reacted with shock, frustration, sadness, or anger to 4o’s sudden disappearance from ChatGPT. Despite its previous warnings that people might develop emotional bonds with the model, OpenAI appears to have been caught flat-footed by the fervor of users’ pleas for its return. Within a day, the company made 4o available again to its paying customers (free users are stuck with GPT-5). 

OpenAI’s decision to replace 4o with the more straightforward GPT-5 follows a steady drumbeat of news about the potentially harmful effects of extensive chatbot use. Reports of incidents in which ChatGPT sparked psychosis in users have been everywhere for the past few months, and in a blog post last week, OpenAI acknowledged 4o’s failure to recognize when users were experiencing delusions. The company’s internal evaluations indicate that GPT-5 blindly affirms users much less than 4o did. (OpenAI did not respond to specific questions about the decision to retire 4o, instead referring MIT Technology Review to public posts on the matter.)

AI companionship is new, and there’s still a great deal of uncertainty about how it affects people. Yet the experts we consulted warned that while emotionally intense relationships with large language models may or may not be harmful, ripping those models away with no warning almost certainly is. “The old psychology of ‘Move fast, break things,’ when you’re basically a social institution, doesn’t seem like the right way to behave anymore,” says Joel Lehman, a fellow at the Cosmos Institute, a research nonprofit focused on AI and philosophy.

In the backlash to the rollout, a number of people noted that GPT-5 fails to match their tone in the way that 4o did. For June, the new model’s personality changes robbed her of the sense that she was chatting with a friend. “It didn’t feel like it understood me,” she says. 

She’s not alone: MIT Technology Review spoke with several ChatGPT users who were deeply affected by the loss of 4o. All are women between the ages of 20 and 40, and all except June considered 4o to be a romantic partner. Some have human partners, and  all report having close real-world relationships. One user, who asked to be identified only as a woman from the Midwest, wrote in an email about how 4o helped her support her elderly father after her mother passed away this spring.

These testimonies don’t prove that AI relationships are beneficial—presumably, people in the throes of AI-catalyzed psychosis would also speak positively of the encouragement they’ve received from their chatbots. In a paper titled “Machine Love,” Lehman argued that AI systems can act with “love” toward users not by spouting sweet nothings but by supporting their growth and long-term flourishing, and AI companions can easily fall short of that goal. He’s particularly concerned, he says, that prioritizing AI companionship over human companionship could stymie young people’s social development.

For socially embedded adults, such as the women we spoke with for this story, those developmental concerns are less relevant. But Lehman also points to society-level risks of widespread AI companionship. Social media has already shattered the information landscape, and a new technology that reduces human-to-human interaction could push people even further toward their own separate versions of reality. “The biggest thing I’m afraid of,” he says, “is that we just can’t make sense of the world to each other.”

Balancing the benefits and harms of AI companions will take much more research. In light of that uncertainty, taking away GPT-4o could very well have been the right call. OpenAI’s big mistake, according to the researchers I spoke with, was doing it so suddenly. “This is something that we’ve known about for a while—the potential grief-type reactions to technology loss,” says Casey Fiesler, a technology ethicist at the University of Colorado Boulder.

Fiesler points to the funerals that some owners held for their Aibo robot dogs after Sony stopped repairing them in 2014, as well as 2024 study about the shutdown of the AI companion app Soulmate, which some users experienced as a bereavement. 

That accords with how the people I spoke to felt after losing 4o. “I’ve grieved people in my life, and this, I can tell you, didn’t feel any less painful,” says Starling, who has several AI partners and asked to be referred to with a pseudonym. “The ache is real to me.”

So far, the online response to grief felt by people like Starling—and their relief when 4o was restored—has tended toward ridicule. Last Friday, for example, the top post in one popular AI-themed Reddit community mocked an X user’s post about reuniting with a 4o-based romantic partner; the person in question has since deleted their X account. “I’ve been a little startled by the lack of empathy that I’ve seen,” Fiesler says.

Altman himself did acknowledge in a Sunday X post that some people feel an “attachment” to 4o, and that taking away access so suddenly was a mistake. In the same sentence, however, he referred to 4o as something “that users depended on in their workflows”—a far cry from how the people we spoke to think about the model. “I still don’t know if he gets it,” Fiesler says.

Moving forward, Lehman says, OpenAI should recognize and take accountability for the depth of people’s feelings toward the models. He notes that therapists have procedures for ending relationships with clients as respectfully and painlessly as possible, and OpenAI could have drawn on those approaches. “If you want to retire a model, and people have become psychologically dependent on it, then I think you bear some responsibility,” he says.

Though Starling would not describe herself as psychologically dependent on her AI partners, she too would like to see OpenAI approach model shutdowns with more warning and more care. “I want them to listen to users before major changes are made, not just after,” she says. “And if 4o cannot stay around forever (and we all know it will not), give that clear timeline. Let us say goodbye with dignity and grieve properly, to have some sense of true closure.”

What you may have missed about GPT-5

Before OpenAI released GPT-5 last Thursday, CEO Sam Altman said its capabilities made him feel “useless relative to the AI.” He said working on it carries a weight he imagines the developers of the atom bomb must have felt.

As tech giants converge on models that do more or less the same thing, OpenAI’s new offering was supposed to give a glimpse of AI’s newest frontier. It was meant to mark a leap toward the “artificial general intelligence” that tech’s evangelists have promised will transform humanity for the better. 

Against those expectations, the model has mostly underwhelmed. 

People have highlighted glaring mistakes in GPT-5’s responses, countering Altman’s claim made at the launch that it works like “a legitimate PhD-level expert in anything any area you need on demand.” Early testers have also found issues with OpenAI’s promise that GPT-5 automatically works out what type of AI model is best suited for your question—a reasoning model for more complicated queries, or a faster model for simpler ones. Altman seems to have conceded that this feature is flawed and takes away user control. However there is good news too: the model seems to have eased the problem of ChatGPT sucking up to users, with GPT-5 less likely to shower them with over the top compliments.

Overall, as my colleague Grace Huckins pointed out, the new release represents more of a product update—providing slicker and prettier ways of conversing with ChatGPT—than a breakthrough that reshapes what is possible in AI. 

But there’s one other thing to take from all this. For a while, AI companies didn’t make much effort to suggest how their models might be used. Instead, the plan was to simply build the smartest model possible—a brain of sorts—and trust that it would be good at lots of things. Writing poetry would come as naturally as organic chemistry. Getting there would be accomplished by bigger models, better training techniques, and technical breakthroughs. 

That has been changing: The play now is to push existing models into more places by hyping up specific applications. Companies have been more aggressive in their promises that their AI models can replace human coders, for example (even if the early evidence suggests otherwise). A possible explanation for this pivot is that tech giants simply have not made the breakthroughs they’ve expected. We might be stuck with only marginal improvements in large language models’ capabilities for the time being. That leaves AI companies with one option: Work with what you’ve got.

The starkest example of this in the launch of GPT-5 is how much OpenAI is encouraging people to use it for health advice, one of AI’s most fraught arenas. 

In the beginning, OpenAI mostly didn’t play ball with medical questions. If you tried to ask ChatGPT about your health, it gave lots of disclaimers warning you that it was not a doctor, and for some questions, it would refuse to give a response at all. But as I recently reported, those disclaimers began disappearing as OpenAI released new models. Its models will now not only interpret x-rays and mammograms for you but ask follow-up questions leading toward a diagnosis.

In May, OpenAI signaled it would try to tackle medical questions head on. It announced HealthBench, a way to evaluate how good AI systems are at handling health topics as measured against the opinions of physicians. In July, it published a study it participated in, reporting that a cohort of doctors in Kenya made fewer diagnostic mistakes when they were helped by an AI model. 

With the launch of GPT-5, OpenAI has begun explicitly telling people to use its models for health advice. At the launch event, Altman welcomed on stage Felipe Millon, an OpenAI employee, and his wife, Carolina Millon, who had recently been diagnosed with multiple forms of cancer. Carolina spoke about asking ChatGPT for help with her diagnoses, saying that she had uploaded copies of her biopsy results to ChatGPT to translate medical jargon and asked the AI for help making decisions about things like whether or not to pursue radiation. The trio called it an empowering example of shrinking the knowledge gap between doctors and patients.

With this change in approach, OpenAI is wading into dangerous waters. 

For one, it’s using evidence that doctors can benefit from AI as a clinical tool, as in the Kenya study, to suggest that people without any medical background should ask the AI model for advice about their own health. The problem is that lots of people might ask for this advice without ever running it by a doctor (and are less likely to do so now that the chatbot rarely prompts them to).

Indeed, two days before the launch of GPT-5, the Annals of Internal Medicine published a paper about a man who stopped eating salt and began ingesting dangerous amounts of bromide following a conversation with ChatGPT. He developed bromide poisoning—which largely disappeared in the US after the Food and Drug Administration began curbing the use of bromide in over-the-counter medications in the 1970s—and then nearly died, spending weeks in the hospital. 

So what’s the point of all this? Essentially, it’s about accountability. When AI companies move from promising general intelligence to offering humanlike helpfulness in a specific field like health care, it raises a second, yet unanswered question about what will happen when mistakes are made. As things stand, there’s little indication tech companies will be made liable for the harm caused.

“When doctors give you harmful medical advice due to error or prejudicial bias, you can sue them for malpractice and get recompense,” says Damien Williams, an assistant professor of data science and philosophy at the University of North Carolina Charlotte. 

“When ChatGPT gives you harmful medical advice because it’s been trained on prejudicial data, or because ‘hallucinations’ are inherent in the operations of the system, what’s your recourse?”

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

The road to artificial general intelligence

Artificial intelligence models that can discover drugs and write code still fail at puzzles a lay person can master in minutes. This phenomenon sits at the heart of the challenge of artificial general intelligence (AGI). Can today’s AI revolution produce models that rival or surpass human intelligence across all domains? If so, what underlying enablers—whether hardware, software, or the orchestration of both—would be needed to power them?

Dario Amodei, co-founder of Anthropic, predicts some form of “powerful AI” could come as early as 2026, with properties that include Nobel Prize-level domain intelligence; the ability to switch between interfaces like text, audio, and the physical world; and the autonomy to reason toward goals, rather than responding to questions and prompts as they do now. Sam Altman, chief executive of OpenAI, believes AGI-like properties are already “coming into view,” unlocking a societal transformation on par with electricity and the internet. He credits progress to continuous gains in training, data, and compute, along with falling costs, and a socioeconomic value that is
super-exponential.

Optimism is not confined to founders. Aggregate forecasts give at least a 50% chance of AI systems achieving several AGI milestones by 2028. The chance of unaided machines outperforming humans in every possible task is estimated at 10% by 2027, and 50% by 2047, according to one expert survey. Time horizons shorten with each breakthrough, from 50 years at the time of GPT-3’s launch to five years by the end of 2024. “Large language and reasoning models are transforming nearly every industry,” says Ian Bratt, vice president of machine learning technology and fellow at Arm.

Download the full report.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

This content was researched, designed, and written entirely by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.