AI comes for the job market, security, and prosperity: The Debrief

When I picked up my daughter from summer camp, we settled in for an eight-hour drive through the Appalachian mountains, heading from North Carolina to her grandparents’ home in Kentucky. With little to no cell service for much of the drive, we enjoyed the rare opportunity to have a long, thoughtful conversation, uninterrupted by devices. The subject, naturally, turned to AI. 

Mat Honan

“No one my age wants AI. No one is excited about it,” she told me of her high-school-age peers. Why not? I asked. “Because,” she replied, “it seems like all the jobs we thought we wanted to do are going to go away.” 

I was struck by her pessimism, which she told me was shared by friends from California to Georgia to New Hampshire. In an already fragile world, one increasingly beset by climate change and the breakdown of the international order, AI looms in the background, threatening young people’s ability to secure a prosperous future.

It’s an understandable concern. Just a few days before our drive, OpenAI CEO Sam Altman was telling the US Federal Reserve’s board of governors that AI agents will leave entire job categories “just like totally, totally gone.” Anthropic CEO Dario Amodei told Axios he believes AI will wipe out half of all entry-level white-collar jobs in the next five years. Amazon CEO Andy Jassy said the company will eliminate jobs in favor of AI agents in the coming years. Shopify CEO Tobi Lütke told staff they had to prove that new roles couldn’t be done by AI before making a hire. And the view is not limited to tech. Jim Farley, the CEO of Ford, recently said he expects AI to replace half of all white-collar jobs in the US. 

These are no longer mere theoretical projections. There is already evidence that AI is affecting employment. Hiring of new grads is down, for example, in sectors like tech and finance. While that is not entirely due to AI, the technology is almost certainly playing a role. 

For Gen Z, the issue is broader than employment. It also touches on another massive generational challenge: climate change. AI is computationally intensive and requires massive data centers. Huge complexes have already been built all across the country, from Virginia in the east to Nevada in the west. That buildout is only going to accelerate as companies race to be first to create superintelligence. Meta and OpenAI have announced plans for data centers that will require five gigawatts of power just for their ­computing—enough to power the entire state of Maine in the summertime. 

It’s very likely that utilities will turn to natural gas to power these facilities; some already have. That means more carbon dioxide emissions for an already warming world. Data centers also require vast amounts of water. There are communities right now that are literally running out of water because it’s being taken by nearby data centers, even as climate change makes that resource more scarce. 

Proponents argue that AI will make the grid more efficient, that it will help us achieve technological breakthroughs leading to cleaner energy sources and, I don’t know, more butterflies and bumblebees? But xAI is belching CO2 into the Memphis skies from its methane-fueled generators right now. Google’s electricity demand and emissions are skyrocketing today

Things would be different, my daughter told me, if it were obviously useful. But for much of her generation, she argued, it’s a looming threat with ample costs and no obvious utility: “It’s not good for research because it’s not highly accurate. You can’t use it for writing because it’s banned—and people get zeros on papers who haven’t even used it because of AI detectors. And it seems like it’s going to take all the good jobs. One teacher told us we’re all going to be janitors.”  

It would be naïve to think we are going back to a world without AI. We’re not. And yet there are other urgent problems that we need to address to build security and prosperity for coming generations. This September/October issue is about our attempts to make the world more secure. From missiles. From asteroids. From the unknown. From threats both existential and trivial. 

We’re also introducing three new columns in this issue, from some of our leading writers: The Algorithm, which covers AI; The Checkup, on biotech; and The Spark, on energy and climate. You’ll see these in future issues, and you can also subscribe online to get them in your inbox every week. 

Stay safe out there. 

The AI Hype Index: AI-designed antibiotics show promise

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry.

Using AI to improve our health and well-being is one of the areas scientists and researchers are most excited about. The last month has seen an interesting leap forward: The technology has been put to work designing new antibiotics to fight hard-to-treat conditions, and OpenAI and Anthropic have both introduced new limiting features to curb potentially harmful conversations on their platforms. 

Unfortunately, not all the news has been positive. Doctors who overrely on AI to help them spot cancerous tumors found their detection skills dropped once they lost access to the tool, and a man fell ill after ChatGPT recommended he replace the salt in his diet with dangerous sodium bromide. These are yet more warning signs of how careful we have to be when it comes to using AI to make important decisions for our physical and mental states.

In a first, Google has released data on how much energy an AI prompt uses

Google has just released a technical report detailing how much energy its Gemini apps use for each query. In total, the median prompt—one that falls in the middle of the range of energy demand—consumes 0.24 watt-hours of electricity, the equivalent of running a standard microwave for about one second. The company also provided average estimates for the water consumption and carbon emissions associated with a text prompt to Gemini.

It’s the most transparent estimate yet from a Big Tech company with a popular AI product, and the report includes detailed information about how the company calculated its final estimate. As AI has become more widely adopted, there’s been a growing effort to understand its energy use. But public efforts attempting to directly measure the energy used by AI have been hampered by a lack of full access to the operations of a major tech company. 

Earlier this year, MIT Technology Review published a comprehensive series on AI and energy, at which time none of the major AI companies would reveal their per-prompt energy usage. Google’s new publication, at last, allows for a peek behind the curtain that researchers and analysts have long hoped for.

The study focuses on a broad look at energy demand, including not only the power used by the AI chips that run models but also by all the other infrastructure needed to support that hardware. 

“We wanted to be quite comprehensive in all the things we included,” said Jeff Dean, Google’s chief scientist, in an exclusive interview with MIT Technology Review about the new report.

That’s significant, because in this measurement, the AI chips—in this case, Google’s custom TPUs, the company’s proprietary equivalent of GPUs—account for just 58% of the total electricity demand of 0.24 watt-hours. 

Another large portion of the energy is used by equipment needed to support AI-specific hardware: The host machine’s CPU and memory account for another 25% of the total energy used. There’s also backup equipment needed in case something fails—these idle machines account for 10% of the total. The final 8% is from overhead associated with running a data center, including cooling and power conversion. 

This sort of report shows the value of industry input to energy and AI research, says Mosharaf Chowdhury, a professor at the University of Michigan and one of the heads of the ML.Energy leaderboard, which tracks energy consumption of AI models. 

Estimates like Google’s are generally something that only companies can produce, because they run at a larger scale than researchers are able to and have access to behind-the-scenes information. “I think this will be a keystone piece in the AI energy field,” says Jae-Won Chung, a PhD candidate at the University of Michigan and another leader of the ML.Energy effort. “It’s the most comprehensive analysis so far.”

Google’s figure, however, is not representative of all queries submitted to Gemini: The company handles a huge variety of requests, and this estimate is calculated from a median energy demand, one that falls in the middle of the range of possible queries.

So some Gemini prompts use much more energy than this: Dean gives the example of feeding dozens of books into Gemini and asking it to produce a detailed synopsis of their content. “That’s the kind of thing that will probably take more energy than the median prompt,” Dean says. Using a reasoning model could also have a higher associated energy demand because these models take more steps before producing an answer.

This report was also strictly limited to text prompts, so it doesn’t represent what’s needed to generate an image or a video. (Other analyses, including one in MIT Technology Review’s Power Hungry series earlier this year, show that these tasks can require much more energy.)

The report also finds that the total energy used to field a Gemini query has fallen dramatically over time. The median Gemini prompt used 33 times more energy in May 2024 than it did in May 2025, according to Google. The company points to advancements in its models and other software optimizations for the improvements.  

Google also estimates the greenhouse gas emissions associated with the median prompt, which they put at 0.03 grams of carbon dioxide. To get to this number, the company multiplied the total energy used to respond to a prompt by the average emissions per unit of electricity.

Rather than using an emissions estimate based on the US grid average, or the average of the grids where Google operates, the company instead uses a market-based estimate, which takes into account electricity purchases that the company makes from clean energy projects. The company has signed agreements to buy over 22 gigawatts of power from sources including solar, wind, geothermal, and advanced nuclear projects since 2010. Because of those purchases, Google’s emissions per unit of electricity on paper are roughly one-third of those on the average grid where it operates.

AI data centers also consume water for cooling, and Google estimates that each prompt consumes 0.26 milliliters of water, or about five drops. 

The goal of this work was to provide users a window into the energy use of their interactions with AI, Dean says. 

“People are using [AI tools] for all kinds of things, and they shouldn’t have major concerns about the energy usage or the water usage of Gemini models, because in our actual measurements, what we were able to show was that it’s actually equivalent to things you do without even thinking about it on a daily basis,” he says, “like watching a few seconds of TV or consuming five drops of water.”

The publication greatly expands what’s known about AI’s resource usage. It follows recent increasing pressure on companies to release more information about the energy toll of the technology. “I’m really happy that they put this out,” says Sasha Luccioni, an AI and climate researcher at Hugging Face. “People want to know what the cost is.”

This estimate and the supporting report contain more public information than has been available before, and it’s helpful to get more information about AI use in real life, at scale, by a major company, Luccioni adds. However, there are still details that the company isn’t sharing in this report. One major question mark is the total number of queries that Gemini gets each day, which would allow estimates of the AI tool’s total energy demand. 

And ultimately, it’s still the company deciding what details to share, and when and how. “We’ve been trying to push for a standardized AI energy score,” Luccioni says, a standard for AI similar to the Energy Star rating for appliances. “This is not a replacement or proxy for standardized comparisons.”

Why we should thank pigeons for our AI breakthroughs

In 1943, while the world’s brightest physicists split atoms for the Manhattan Project, the American psychologist B.F. Skinner led his own secret government project to win World War II. 

Skinner did not aim to build a new class of larger, more destructive weapons. Rather, he wanted to make conventional bombs more precise. The idea struck him as he gazed out the window of his train on the way to an academic conference. “I saw a flock of birds lifting and wheeling in formation as they flew alongside the train,” he wrote. “Suddenly I saw them as ‘devices’ with excellent vision and maneuverability. Could they not guide a missile?”

Skinner started his missile research with crows, but the brainy black birds proved intractable. So he went to a local shop that sold pigeons to Chinese restaurants, and “Project Pigeon” was born. Though ordinary pigeons, Columba livia, were no one’s idea of clever animals, they proved remarkably cooperative subjects in the lab. Skinner rewarded the birds with food for pecking at the right target on aerial photographs—and eventually planned to strap the birds into a device in the nose of a warhead, which they would steer by pecking at the target on a live image projected through a lens onto a screen. 

The military never deployed Skinner’s kamikaze pigeons, but his experiments convinced him that the pigeon was “an extremely reliable instrument” for studying the underlying processes of learning. “We have used pigeons, not because the pigeon is an intelligent bird, but because it is a practical one and can be made into a machine,” he said in 1944.

People looking for precursors to artificial intelligence often point to science fiction by authors like Isaac Asimov or thought experiments like the Turing test. But an equally important, if surprising and less appreciated, forerunner is Skinner’s research with pigeons in the middle of the 20th century. Skinner believed that association—learning, through trial and error, to link an action with a punishment or reward—was the building block of every behavior, not just in pigeons but in all living organisms, including human beings. His “behaviorist” theories fell out of favor with psychologists and animal researchers in the 1960s but were taken up by computer scientists who eventually provided the foundation for many of the artificial-intelligence tools from leading firms like Google and OpenAI.  

These companies’ programs are increasingly incorporating a kind of machine learning whose core concept—reinforcement—is taken directly from Skinner’s school of psychology and whose main architects, the computer scientists Richard Sutton and Andrew Barto, won the 2024 Turing Award, an honor widely considered to be the Nobel Prize of computer science. Reinforcement learning has helped enable computers to drive cars, solve complex math problems, and defeat grandmasters in games like chess and Go—but it has not done so by emulating the complex workings of the human mind. Rather, it has supercharged the simple associative processes of the pigeon brain. 

It’s a “bitter lesson” of 70 years of AI research, Sutton has written: that human intelligence has not worked as a model for machine learning—instead, the lowly principles of associative learning are what power the algorithms that can now simulate or outperform humans on a variety of tasks. If artificial intelligence really is close to throwing off the yoke of its creators, as many people fear, then our computer overlords may be less like ourselves than like “rats with wings”—and planet-size brains. And even if it’s not, the pigeon brain can at least help demystify a technology that many worry (or rejoice) is “becoming human.” 

In turn, the recent accomplishments of AI are now prompting some animal researchers to rethink the evolution of natural intelligence. Johan Lind, a biologist at Stockholm University, has written about the “associative learning paradox,” wherein the process is largely dismissed by biologists as too simplistic to produce complex behaviors in animals but celebrated for producing humanlike behaviors in computers. The research suggests not only a greater role for associative learning in the lives of intelligent animals like chimpanzees and crows, but also far greater complexity in the lives of animals we’ve long dismissed as simple-minded, like the ordinary Columba livia


When Sutton began working in AI, he felt as if he had a “secret weapon,” he told me: He had studied psychology as an undergrad. “I was mining the psychological literature for animals,” he says.

Skinner started his missile research with crows but switched to pigeons when the brainy black birds proved intractable.
B.F. SKINNER FOUNDATION

Ivan Pavlov began to uncover the mechanics of associative learning at the end of the 19th century in his famous experiments on “classical conditioning,” which showed that dogs would salivate at a neutral stimulus—like a bell or flashing light—if it was paired predictably with the presentation of food. In the middle of the 20th century, Skinner took Pavlov’s principles of conditioning and extended them from an animal’s involuntary reflexes to its overall behavior. 

Skinner wrote that “behavior is shaped and maintained by its consequences”—that a random action with desirable results, like pressing a lever that releases a food pellet, will be “reinforced” so that the animal is likely to repeat it. Skinner reinforced his lab animals’ behavior step by step, teaching rats to manipulate marbles and pigeons to play simple tunes on four-key pianos. The animals learned chains of behavior, through trial and error, in order to maximize long-term rewards. Skinner argued that this type of associative learning, which he called “operant conditioning” (and which other psychologists had called “instrumental learning”), was the building block of all behavior. He believed that psychology should study only behaviors that could be observed and measured without ever making reference to an “inner agent” in the mind.

When Richard Sutton began working in AI, he felt as if he had a “secret weapon”: He studied psychology as an undergrad. “I was mining the psychological literature for animals,” he says.

Skinner thought that even human language developed through operant conditioning, with children learning the meanings of words through reinforcement. But his 1957 book on the subject, Verbal Behavior, provoked a brutal review from Noam Chomsky, and psychology’s focus started to swing from observable behavior to innate “cognitive” abilities of the human mind, like logic and symbolic thinking. Biologists soon rebelled against behaviorism also, attacking psychologists’ quest to explain the diversity of animal behavior through an elementary and universal mechanism. They argued that each species evolved specific behaviors suited to its habitat and lifestyle, and that most behaviors were inherited, not learned. 

By the ’70s, when Sutton started reading about Skinner’s and similar experiments, many psychologists and researchers interested in intelligence had moved on from pea-brained pigeons, which learn mostly by association, to large-brained animals with more sophisticated behaviors that suggested potential cognitive abilities. “This was clearly old stuff that was not exciting to people anymore,” he told me. Still, Sutton found these old experiments instructive for machine learning: “I was coming to AI with an animal-learning-theorist mindset and seeing the big lack of anything like instrumental learning in engineering.” 


Many engineers in the second half of the 20th century tried to model AI on human intelligence, writing convoluted programs that attempted to mimic human thinking and implement rules that govern human response and behavior. This approach—commonly called “symbolic AI”—was severely limited; the programs stumbled over tasks that were easy for people, like recognizing objects and words. It just wasn’t possible to write into code the myriad classification rules human beings use to, say, separate apples from oranges or cats from dogs—and without pattern recognition, breakthroughs in more complex tasks like problem solving, game playing, and language translation seemed unlikely too. These computer scientists, the AI skeptic Hubert Dreyfus wrote in 1972, accomplished nothing more than “a small engineering triumph, an ad hoc solution of a specific problem, without general applicability.”

Pigeon research, however, suggested another route. A 1964 study showed that pigeons could learn to discriminate between photographs with people and photographs without people. Researchers simply presented the birds with a series of images and rewarded them with a food pellet for pecking an image showing a person. They pecked randomly at first but quickly learned to identify the right images, including photos where people were partially obscured. The results suggested that you didn’t need rules to sort objects; it was possible to learn concepts and use categories through associative learning alone. 

In another Skinner experiment, a pigeon receives food after correctly matching a colored light to a corresponding colored panel.
GETTY IMAGES

When Sutton began working with Barto on AI in the late ’70s, they wanted to create a “complete, interactive goal-seeking agent” that could explore and influence its environment like a pigeon or rat. “We always felt the problems we were studying were closer to what animals had to face in evolution to actually survive,” Barto told me. The agent needed two main functions: search, to try out and choose from many actions in a situation, and memory, to associate an action with the situation where it resulted in a reward. Sutton and Barto called their approach “reinforcement learning”; as Sutton said, “It’s basically instrumental learning.” In 1998, they published the definitive exploration of the concept in a book, Reinforcement Learning: An Introduction. 

Over the following two decades, as computing power grew exponentially, it became possible to train AI on increasingly complex tasks—that is, essentially, to run the AI “pigeon” through millions more trials. 

Programs trained with a mix of human input and reinforcement learning defeated human experts at chess and Atari. Then, in 2017, engineers at Google DeepMind built the AI program AlphaGo Zero entirely through reinforcement learning, giving it a numerical reward of +1 for every game of Go that it won and −1 for every game that it lost. Programmed to seek the maximum reward, it began without any knowledge of Go but improved over 40 days until it attained what its creators called “superhuman performance.” Not only could it defeat the world’s best human players at Go, a game considered even more complicated than chess, but it actually pioneered new strategies that professional players now use. 

“Humankind has accumulated Go knowledge from millions of games played over thousands of years,” the program’s builders wrote in Nature in 2017. “In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.” The team’s lead researcher was David Silver, who studied reinforcement learning under Sutton at the University of Alberta.

Today, more and more tech companies have turned to reinforcement learning in products such as consumer-facing chatbots and agents. The first generation of generative AI, including large language models like OpenAI’s GPT-2 and GPT-3, tapped into a simpler form of associative learning called “supervised learning,” which trained the model on data sets that had been labeled by people. Programmers often used reinforcement to fine-tune their results by asking people to rate a program’s performance and then giving these ratings back to the program as goals to pursue. (Researchers call this “reinforcement learning from feedback.”) 

Then, last fall, OpenAI revealed its o-series of large language models, which it classifies as “reasoning” models. The pioneering AI firm boasted that they are “trained with reinforcement learning to perform reasoning” and claimed they are capable of “a long internal chain of thought.” The Chinese startup DeepSeek also used reinforcement learning to train its attention-grabbing “reasoning” LLM, R1. “Rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-­solving strategies,” they explained.

These descriptions might impress users, but at least psychologically speaking, they are confused. A computer trained on reinforcement learning needs only search and memory, not reasoning or any other cognitive mechanism, in order to form associations and maximize rewards. Some computer scientists have criticized the tendency to anthropomorphize these models’ “thinking,” and a team of Apple engineers recently published a paper noting their failure at certain complex tasks and “raising crucial questions about their true reasoning capabilities.”

Sutton, too, dismissed the claims of reasoning as “marketing” in an email, adding that “no serious scholar of mind would use ‘reasoning’ to describe what is going on in LLMs.” Still, he has argued, with Silver and other coauthors, that the pigeons’ method—learning, through trial and error, which actions will yield rewards—is “enough to drive behavior that exhibits most if not all abilities that are studied in natural and artificial intelligence,” including human language “in its full richness.” 

In a paper published in April, Sutton and Silver stated that “today’s technology, with appropriately chosen algorithms, already provides a sufficiently powerful foundation to … rapidly progress AI towards truly superhuman agents.” The key, they argue, is building AI agents that depend less than LLMs on human dialogue and prejudgments to inform their behavior. 

“Powerful agents should have their own stream of experience that progresses, like humans, over a long time-scale,” they wrote. “Ultimately, experiential data will eclipse the scale and quality of human generated data. This paradigm shift, accompanied by algorithmic advancements in RL, will unlock in many domains new capabilities that surpass those possessed by any human.”


If computers can do all that with just a pigeonlike brain, some animal researchers are now wondering if actual pigeons deserve more credit than they’re commonly given. 

“When considered in light of the accomplishments of AI, the extension of associative learning to purportedly more complicated forms of cognitive performance offers fresh prospects for understanding how biological systems may have evolved,” Ed Wasserman, a psychologist at the University of Iowa, wrote in a recent study in the journal Current Biology

Wasserman trained pigeons to succeed at a complex categorization task, which several undergraduate students failed. The students tried to find a rule that would help them sort various discs; the pigeons simply developed a sense for the group to which any given disc belonged.

In one experiment, Wasserman trained pigeons to succeed at a complex categorization task, which several undergraduate students failed. The students tried, in vain, to find a rule that would help them sort various discs with parallel black lines of various widths and tilts; the pigeons simply developed a sense, through practice and association, for the group to which any given disc belonged. 

Like Sutton, Wasserman became interested in behaviorist psychology when Skinner’s theories were out of fashion. He didn’t switch to computer science, however: He stuck with pigeons. “The pigeon lives or dies by these really rudimentary learning rules,” Wasserman told me recently, “but they are powerful enough to have succeeded colossally in object recognition.” In his most famous experiments, Wasserman trained pigeons to detect cancerous tissue and symptoms of heart disease in medical scans as accurately as experienced doctors with framed diplomas behind their desks. Given his results, Wasserman found it odd that so many psychologists and ethologists regarded associative learning as a crude, mechanical mechanism, incapable of producing the intelligence of clever animals like apes, elephants, dolphins, parrots, and crows. 

Other researchers also started to reconsider the role of associative learning in animal behavior after AI started besting human professionals in complex games. “With the progress of artificial intelligence, which in essence is built upon associative processes, it is increasingly ironic that associative learning is considered too simple and insufficient for generating biological intelligence,” Lind, the biologist from Stockholm University, wrote in 2023. He often cites Sutton and Barto’s computer science in his biological research, and he believes it’s human beings’ symbolic language and cumulative cultures that really put them in a cognitive category of their own.

Ethologists generally propose cognitive mechanisms, like theory of mind (that is, the ability to attribute mental states to others), to explain remarkable animal behaviors like social learning and tool use. But Lind has built models showing that these flexible behaviors could have developed through associative learning, suggesting that there may be no need to invoke cognitive mechanisms at all. If animals learn to associate a behavior with a reward, then the behavior itself will come to approximate the value of the reward. A new behavior can then become associated with the first behavior, allowing the animal to learn chains of actions that ultimately lead to the reward. In Lind’s view, studies demonstrating self-control and planning in chimpanzees and ravens are probably describing behaviors acquired through experience rather than innate mechanisms of the mind.  

Lind has been frustrated with what he calls the “low standard that is accepted in animal cognition studies.” As he wrote in an email, “Many researchers in this field do not seem to worry about excluding alternative hypotheses and they seem happy to neglect a lot of current and historical knowledge.” There are some signs, though, that his arguments are catching on. A group of psychologists not affiliated with Lind referenced his “associative learning paradox” last year in a criticism of a Current Biology study, which purported to show that crows used “true statistical inference” and not “low-level associative learning strategies” in an experiment. The psychologists found that they could explain the crows’ performance with a simple reinforcement-­learning model—“exactly the kind of low-level associative learning process that [the original authors] ruled out.” 

Skinner might have felt vindicated by such arguments. He lamented psychology’s cognitive turn until his death in 1990, maintaining that it was scientifically irresponsible to probe the minds of living beings. After “Project Pigeon,” he became increasingly obsessed with “behaviorist” solutions to societal problems. He went from training pigeons for war to inventions like the “Air Crib,” which aimed to “simplify” baby care by keeping the infant behind glass in a climate-­controlled chamber and eliminating the need for clothing and bedding. Skinner rejected free will, arguing that human behavior is determined by environmental variables, and wrote a novel, Walden II, about a utopian community founded on his ideas.


People who care about animals might feel uneasy about a revival in behaviorist theory. The “cognitive revolution” broke with centuries of Western thinking, which had emphasized human supremacy over animals and treated other creatures like stimulus-response machines. But arguing that animals learn by association is not the same as arguing that they are simple-minded. Scientists like Lind and Wasserman do not deny that internal forces like instinct and emotion also influence animal behavior. Sutton, too, believes that animals develop models of the world through their experiences and use them to plan actions. Their point is not that intelligent animals are empty-headed but that associative learning is a much more powerful—indeed, “cognitive”—mechanism than many of their peers believe. The psychologists who recently criticized the study on crows and statistical inference did not conclude that the birds were stupid. Rather, they argued “that a reinforcement learning model can produce complex, flexible behaviour.”

This is largely in line with the work of another psychologist, Robert Rescorla, whose work in the ’70s and ’80s influenced both Wasserman and Sutton. Rescorla encouraged people to think of association not as a “low-level mechanical process” but as “the learning that results from exposure to relations among events in the environment” and “a primary means by which the organism represents the structure of its world.” 

This is true even of a laboratory pigeon pecking at screens and buttons in a small experimental box, where scientists carefully control and measure stimuli and rewards. But the pigeon’s learning extends outside the box. Wasserman’s students transport the birds between the aviary and the laboratory in buckets—and experienced pigeons jump immediately into the buckets whenever the students open the doors. Much as Rescorla suggested, they are learning the structure of their world inside the laboratory and the relation of its parts, like the bucket and the box, even though they do not always know the specific task they will face inside. 

Comparative psychologists and animal researchers have long grappled with a question that suddenly seems urgent because of AI: How do we attribute sentience to other living beings?

The same associative mechanisms through which the pigeon learns the structure of its world can open a window to the kind of inner life that Skinner and many earlier psychologists said did not exist. Pharmaceutical researchers have long used pigeons in drug-discrimination tasks, where they’re given, say, an amphetamine or a sedative and rewarded with a food pellet for correctly identifying which drug they took. The birds’ success suggests they both experience and discriminate between internal states. “Is that not tantamount to introspection?” Wasserman asked.

It is hard to imagine AI matching a pigeon on this specific task—a reminder that, though AI and animals share associative mechanisms, there is more to life than behavior and learning. A pigeon deserves ethical consideration as a living creature not because of how it learns but because of what it feels. A pigeon can experience pain and suffer, while an AI chatbot cannot—even if some large language models, trained on corpora that include descriptions of human suffering and sci-fi stories of sentient computers, can trick people into believing otherwise. 

a pigeon in a box facing a lit screen with colored rectangles on it.
Psychologist Ed Wasserman trained pigeons to detect cancerous tissue and symptoms of heart disease in medical scans as accurately as experienced physicians.
UNIVERSITY OF IOWA/WASSERMAN LAB

“The intensive public and private investments into AI research in recent years have resulted in the very technologies that are forcing us to confront the question of AI sentience today,” two philosophers of science wrote in Aeon in 2023. “To answer these current questions, we need a similar degree of investment into research on animal cognition and behavior.” Indeed, comparative psychologists and animal researchers have long grappled with questions that suddenly seem urgent because of AI: How do we attribute sentience to other living beings? How can we distinguish true sentience from a very convincing performance of sentience?

Such an undertaking would yield knowledge not only about technology and animals but also about ourselves. Most psychologists probably wouldn’t go as far as Sutton in arguing that reward is enough to explain most if not all human behavior, but no one would dispute that people often learn by association too. In fact, most of Wasserman’s undergraduate students eventually succeeded at his recent experiment with the striped discs, but only after they gave up searching for rules. They resorted, like the pigeons, to association and couldn’t easily explain afterwards what they’d learned. It was just that with enough practice, they started to get a feel for the categories. 

It is another irony about associative learning: What has long been considered the most complex form of intelligence—a cognitive ability like rule-based learning—may make us human, but we also call on it for the easiest of tasks, like sorting objects by color or size. Meanwhile, some of the most refined demonstrations of human learning—like, say, a sommelier learning to taste the difference between grapes—are learned not through rules, but only through experience. 

Learning through experience relies on ancient associative mechanisms that we share with pigeons and countless other creatures, from honeybees to fish. The laboratory pigeon is not only in our computers but in our brains—and the engine behind some of humankind’s most impressive feats. 

Ben Crair is a science and travel writer based in Berlin. 

The AI Hype Index: The White House’s war on “woke AI”

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry.

The Trump administration recently declared war on so-called “woke AI,” issuing an executive order aimed at preventing companies whose models exhibit a liberal bias from landing federal contracts. Simultaneously, the Pentagon inked a deal with Elon Musk’s xAI just days after its chatbot, Grok, spouted harmful antisemitic stereotypes on X, while the White House has partnered with an anti-DEI nonprofit to create AI slop videos of the Founding Fathers. What comes next is anyone’s guess.

The AI Hype Index: AI-powered toys are coming

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry.

AI agents might be the toast of the AI industry, but they’re still not that reliable. That’s why Yoshua Bengio, one of the world’s leading AI experts, is creating his own nonprofit dedicated to guarding against deceptive agents. Not only can they mislead you, but new research suggests that the weaker an AI model powering an agent is, the less likely it is to be able to negotiate you a good deal online. Elsewhere, OpenAI has inked a deal with toymaker Mattel to develop “age-appropriate” AI-infused products. What could possibly go wrong?

The AI Hype Index: College students are hooked on ChatGPT

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry.

Large language models confidently present their responses as accurate and reliable, even when they’re neither of those things. That’s why we’ve recently seen chatbots supercharge vulnerable people’s delusions, make citation mistakes in an important legal battle between music publishers and Anthropic, and (in the case of xAI’s Grok) rant irrationally about “white genocide.”

But it’s not all bad news—AI could also finally lead to a better battery life for your iPhone and solve tricky real-world problems that humans have been struggling to crack, if Google DeepMind’s new model is any indication. And perhaps most exciting of all, it could combine with brain implants to help people communicate when they have lost the ability to speak.

How to build a better AI benchmark

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 to evaluate an AI model’s coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects. 

In the months since then, it’s quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup between three different fine tunings of Anthropic’s Claude Sonnet model and Amazon’s Q developer agent. Auto Code Rover—one of the Claude modifications—nabbed the number two spot in November, and was acquired just three months later.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” As the benchmark has gained prominence, “you start to see that people really want that top spot,” says John Yang, a researcher on the team that developed SWE-Bench at Princeton University. As a result, entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement.

Developers of these coding agents aren’t necessarily doing anything as straightforward cheating, but they’re crafting approaches that are too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, Yang noticed that high-scoring models would fail completely when tested on different programming languages—revealing an approach to the test that he describes as “gilded.”

“It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart,” Yang says. “At that point, you’re not designing a software engineering agent. You’re designing to make a SWE-Bench agent, which is much less interesting.”

The SWE-Bench issue is a symptom of a more sweeping—and complicated—problem in AI evaluation, and one that’s increasingly sparking heated debate: The benchmarks the industry uses to guide development are drifting further and further away from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under heat for an alleged lack of transparency. Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring capabilities and no clear path to better ones. 

“Historically, benchmarks were the way we evaluated AI systems,” says Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI. “Is that the way we want to evaluate systems going forward? And if it’s not, what is the way?”

A growing group of academics and AI researchers are making the case that the answer is to go smaller, trading sweeping ambition for an approach inspired by the social sciences. Specifically, they want to focus more on testing validity, which for quantitative social scientists refers to how well a given questionnaire measures what it’s claiming to measure—and, more fundamentally, whether what it is measuring has a coherent definition. That could cause trouble for benchmarks assessing hazily defined concepts like “reasoning” or “scientific knowledge”—and for developers aiming to reach the muchhyped goal of artificial general intelligence—but it would put the industry on firmer ground as it looks to prove the worth of individual models.

“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Abigail Jacobs, a University of Michigan professor who is a central figure in the new push for validity. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”

The limits of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long. 

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition—but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly. 

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

In a paper from last July, Kapoor called out specific issues in how AI models were approaching the WebArena benchmark, designed by Carnegie Mellon University researchers in 2024 as a test of an AI agent’s ability to traverse the web. The benchmark consists of more than 800 tasks to be performed on a set of cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team identified an apparent hack in the winning model, called STeP. STeP included specific instructions about how Reddit structures URLs, allowing STeP models to jump directly to a given user’s profile page (a frequent element of WebArena tasks).

This shortcut wasn’t exactly cheating, but Kapoor sees it as “a serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time.” Because the technique was successful, though, a similar policy has since been adopted by OpenAI’s web agent Operator. (“Our evaluation setting is designed to assess how well an agent can solve tasks given some instruction about website structures and task execution,” an OpenAI representative said when reached for comment. “This approach is consistent with how others have used and reported results with WebArena.” STeP did not respond to a request for comment.)

Further highlighting the problem with AI benchmarks, late last month Kapoor and a team of researchers wrote a paper that revealed significant problems in Chatbot Arena, the popular crowdsourced evaluation system. According to the paper, the leaderboard was being manipulated; many top foundation models were conducting undisclosed private testing and releasing their scores selectively.

Today, even ImageNet itself, the mother of all benchmarks, has started to fall victim to validity problems. A 2023 study from researchers at the University of Washington and Google Research found that when ImageNet-winning algorithms were pitted against six real-world data sets, the architecture improvement “resulted in little to no progress,” suggesting that the external validity of the test had reached its limit.

Going smaller

For those who believe the main problem is validity, the best fix is reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers “have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can’t anticipate the downstream task anymore.” So what if there was a way to help the downstream consumers identify this gap?

In November 2024, Reuel launched a public ranking project called BetterBench, which rates benchmarks on dozens of different criteria, such as whether the code has been publicly documented. But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.

“You need to have a structural breakdown of the capabilities,” Reuel says. “What are the actual skills you care about, and how do you operationalize them into something we can measure?”

The results are surprising. One of the highest-scoring benchmarks is also the oldest: the Arcade Learning Environment (ALE), established in 2013 as a way to test models’ ability to learn how to play a library of Atari 2600 games. One of the lowest-scoring is the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills; by the standards of BetterBench, the connection between the questions and the underlying skill was too poorly defined.

BetterBench hasn’t meant much for the reputations of specific benchmarks, at least not yet; MMLU is still widely used, and ALE is still marginal. But the project has succeeded in pushing validity into the broader conversation about how to fix benchmarks. In April, Reuel quietly joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she’ll develop her ideas on validity and AI model evaluation with other figures in the field. (An official announcement is expected later this month.) 

Irene Solaiman, Hugging Face’s head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. “There’s just so much hunger for a good benchmark off the shelf that already works,” Solaiman says. “A lot of evaluations are trying to do too much.”

Increasingly, the rest of the industry seems to agree. In a paper in March, researchers from Google, Microsoft, Anthropic, and others laid out a new framework for improving evaluations—with validity as the first step. 

“AI evaluation science must,” the researchers argue, “move beyond coarse grained claims of ‘general intelligence’ towards more task-specific and real-world relevant measures of progress.” 

Measuring the “squishy” things

To help make this shift, some researchers are looking to the tools of social science. A February position paper argued that “evaluating GenAI systems is a social science measurement challenge,” specifically unpacking how the validity systems used in social measurements can be applied to AI benchmarking. 

The authors, largely employed by Microsoft’s research branch but joined by academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, those same procedures could offer a way to measure concepts like “reasoning” and “math proficiency” without slipping into hazy generalizations.

In the social science literature, it’s particularly important that metrics begin with a rigorous definition of the concept measured by the test. For instance, if the test is to measure how democratic a society is, it first needs to establish a definition for a “democratic society” and then establish questions that are relevant to that definition. 

To apply this to a benchmark like SWE-Bench, designers would need to set aside the classic machine learning approach, which is to collect programming problems from GitHub and create a scheme to validate answers as true or false. Instead, they’d first need to define what the benchmark aims to measure (“ability to resolve flagged issues in software,” for instance), break that into subskills (different types of problems or types of program that the AI model can successfully process), and then finally assemble questions that accurately cover the different subskills.

It’s a profound change from how AI researchers typically approach benchmarking—but for researchers like Jacobs, a coauthor on the February paper, that’s the whole point. “There’s a mismatch between what’s happening in the tech industry and these tools from social science,” she says. “We have decades and decades of thinking about how we want to measure these squishy things about humans.”

Even though the idea has made a real impact in the research world, it’s been slow to influence the way AI companies are actually using benchmarks. 

The last two months have seen new model releases from OpenAI, Anthropic, Google, and Meta, and all of them lean heavily on multiple-choice knowledge benchmarks like MMLU—the exact approach that validity researchers are trying to move past. After all, model releases are, for the most part, still about showing increases in general intelligence, and broad benchmarks continue to be used to back up those claims. 

For some observers, that’s good enough. Benchmarks, Wharton professor Ethan Mollick says, are “bad measures of things, but also they’re what we’ve got.” He adds: “At the same time, the models are getting better. A lot of sins are forgiven by fast progress.”

For now, the industry’s long-standing focus on artificial general intelligence seems to be crowding out a more focused validity-based approach. As long as AI models can keep growing in general intelligence, then specific applications don’t seem as compelling—even if that leaves practitioners relying on tools they no longer fully trust. 

“This is the tightrope we’re walking,” says Hugging Face’s Solaiman. “It’s too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations.”

Russell Brandom is a freelance writer covering artificial intelligence. He lives in Brooklyn with his wife and two cats.

This story was supported by a grant from the Tarbell Center for AI Journalism.

The AI Hype Index: AI agent cyberattacks, racing robots, and musical models

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry.

AI agents are the AI industry’s hypiest new product—intelligent assistants capable of completing tasks without human supervision. But while they can be theoretically useful—Simular AI’s S2 agent, for example, intelligently switches between models depending on what it’s been told to do—they could also be weaponized to execute cyberattacks. Elsewhere, OpenAI is reported to be throwing its hat into the social media arena, and AI models are getting more adept at making music. Oh, and if the results of the first half-marathon pitting humans against humanoid robots are anything to go by, we won’t have to worry about the robot uprising any time soon.

Seeing AI as a collaborator, not a creator

The reason you are reading this letter from me today is that I was bored 30 years ago. 

I was bored and curious about the world and so I wound up spending a lot of time in the university computer lab, screwing around on Usenet and the early World Wide Web, looking for interesting things to read. Soon enough I wasn’t content to just read stuff on the internet—I wanted to make it. So I learned HTML and made a basic web page, and then a better web page, and then a whole website full of web things. And then I just kept going from there. That amateurish collection of web pages led to a journalism internship with the online arm of a magazine that paid little attention to what we geeks were doing on the web. And that led to my first real journalism job, and then another, and, well, eventually this journalism job. 

But none of that would have been possible if I hadn’t been bored and curious. And more to the point: curious about tech. 

The university computer lab may seem at first like an unlikely center for creativity. We tend to think of creativity as happening more in the artist’s studio or writers’ workshop. But throughout history, very often our greatest creative leaps—and I would argue that the web and its descendants represent one such leap—have been due to advances in technology. 

There are the big easy examples, like photography or the printing press, but it’s also true of all sorts of creative inventions that we often take for granted. Oil paints. Theaters. Musical scores. Electric synthesizers! Almost anywhere you look in the arts, perhaps outside of pure vocalization, technology has played a role.  

But the key to artistic achievement has never been the technology itself. It has been the way artists have applied it to express our humanity. Think of the way we talk about the arts. We often compliment it with words that refer to our humanity, like soul, heart, and life; we often criticize it with descriptors such as sterile, clinical, or lifeless. (And sure, you can love a sterile piece of art, but typically that’s because the artist has leaned into sterility to make a point about humanity!)

All of which is to say I think that AI can be, will be, and already is a tool for creative expression, but that true art will always be something steered by human creativity, not machines. 

I could be wrong. I hope not. 

This issue, which was entirely produced by human beings using computers, explores creativity and the tension between the artist and technology. You can see it on our cover illustrated by Tom Humberstone, and read about it in stories from James O’Donnell, Will Douglas Heaven, Rebecca Ackermann, Michelle Kim, Bryan Gardiner, and Allison Arieff

Yet of course, creativity is about more than just the arts. All of human advancement stems from creativity, because creativity is how we solve problems. So it was important to us to bring you accounts of that as well. You’ll find those in stories from Carrie Klein, Carly Kay, Matthew Ponsford, and Robin George Andrews. (If you’ve ever wanted to know how we might nuke an asteroid, this is the issue for you!)  

We’re also trying to get a little more creative ourselves. Over the next few issues, you’ll notice some changes coming to this magazine with the addition of some new regular items (see Caiwei Chen’s “3 Things” for one such example). Among those changes, we are planning to solicit and publish more regular reader feedback and answer questions you may have about technology. We invite you to get creative and email us: newsroom@technologyreview.com.

As always, thanks for reading.