A US court just put ownership of CRISPR back in play

The CRISPR patents are back in play.

On Monday, the US Court of Appeals for the Federal Circuit said scientists Jennifer Doudna and Emmanuelle Charpentier will get another chance to show they ought to own the key patents on what many consider the defining biotechnology invention of the 21st century.

The pair shared the 2020 Nobel Prize in chemistry for developing the versatile gene-editing system, which is already being used to treat various genetic disorders, including sickle cell disease.

But when key US patent rights were granted in 2014 to researcher Feng Zhang of the Broad Institute of MIT and Harvard, the decision set off a bitter dispute in which hundreds of millions of dollars—as well as scientific bragging rights—are at stake.

The new decision is a boost for the Nobelists, who had previously faced a string of demoralizing reversals over the patent rights in both the US and Europe.

“This goes to who was the first to invent, who has priority, and who is entitled to the broadest patents,” says Jacob Sherkow, a law professor at the University of Illinois. 

He says there is now at least a chance that Doudna and Charpentier “could walk away as the clear winner.”

The CRISPR patent battle is among the most byzantine ever, placing the technology alongside the steam engine, the telephone, the lightbulb, and the laser on the list of history’s most hotly contested inventions.

In 2012, Doudna and Charpentier were first to publish a description of a CRISPR gene editor that could be programmed to precisely cut DNA in a test tube. There’s no dispute about that.

However, the patent fight relates to the use of CRISPR to edit inside animal cells—like those of human beings. That’s considered a distinct invention, and one both sides say they were first to come up with that very same year. 

In patent law, this moment is known as conception—the instant a lightbulb appears over an inventor’s head, revealing a definite and workable plan for how an invention is going to function.

In 2022, a specialized body called the Patent Trial and Appeal Board, or PTAB, decided that Doudna and Charpentier hadn’t fully conceived the invention because they initially encountered trouble getting their editor to work in fish and other species. Indeed, they had so much trouble that Zhang scooped them with a 2013 publication demonstrating he could use CRISPR to edit human cells.

The Nobelists appealed the finding, and yesterday the appeals court vacated it, saying the patent board applied the wrong standard and needs to reconsider the case. 

According to the court, Doudna and Charpentier didn’t have to “know their invention would work” to get credit for conceiving it. What could matter more, the court said, is that it actually did work in the end. 

In a statement, the University of California, Berkeley, applauded the call for a do-over.  

“Today’s decision creates an opportunity for the PTAB to reevaluate the evidence under the correct legal standard and confirm what the rest of the world has recognized: that the Doudna and Charpentier team were the first to develop this groundbreaking technology for the world to share,” Jeff Lamken, one of Berkeley’s attorneys, said in the statement.

The Broad Institute posted a statement saying it is “confident” the appeals board “will again confirm Broad’s patents, because the underlying facts have not changed.”

The decision is likely to reopen the investigation into what was written in 13-year-old lab notebooks and whether Zhang based his research, in part, on what he learned from Doudna and Charpentier’s publications. 

The case will now return to the patent board for a further look, although Sherkow says the court finding can also be appealed directly to the US Supreme Court. 

Police tech can sidestep facial recognition bans now

Six months ago I attended the largest gathering of chiefs of police in the US to see how they’re using AI. I found some big developments, like officers getting AI to write their police reports. Today, I published a new story that shows just how far AI for police has developed since then. 

It’s about a new method police departments and federal agencies have found to track people: an AI tool that uses attributes like body size, gender, hair color and style, clothing, and accessories instead of faces. It offers a way around laws curbing the use of facial recognition, which are on the rise. 

Advocates from the ACLU, after learning of the tool through MIT Technology Review, said it was the first instance they’d seen of such a tracking system used at scale in the US, and they say it has a high potential for abuse by federal agencies. They say the prospect that AI will enable more powerful surveillance is especially alarming at a time when the Trump administration is pushing for more monitoring of protesters, immigrants, and students. 

I hope you read the full story for the details, and to watch a demo video of how the system works. But first, let’s talk for a moment about what this tells us about the development of police tech and what rules, if any, these departments are subject to in the age of AI.

As I pointed out in my story six months ago, police departments in the US have extraordinary independence. There are more than 18,000 departments in the country, and they generally have lots of discretion over what technology they spend their budgets on. In recent years, that technology has increasingly become AI-centric. 

Companies like Flock and Axon sell suites of sensors—cameras, license plate readers, gunshot detectors, drones—and then offer AI tools to make sense of that ocean of data (at last year’s conference I saw schmoozing between countless AI-for-police startups and the chiefs they sell to on the expo floor). Departments say these technologies save time, ease officer shortages, and help cut down on response times. 

Those sound like fine goals, but this pace of adoption raises an obvious question: Who makes the rules here? When does the use of AI cross over from efficiency into surveillance, and what type of transparency is owed to the public?

In some cases, AI-powered police tech is already driving a wedge between departments and the communities they serve. When the police in Chula Vista, California, were the first in the country to get special waivers from the Federal Aviation Administration to fly their drones farther than normal, they said the drones would be deployed to solve crimes and get people help sooner in emergencies. They’ve had some successes.

But the department has also been sued by a local media outlet alleging it has reneged on its promise to make drone footage public, and residents have said the drones buzzing overhead feel like an invasion of privacy. An investigation found that these drones were deployed more often in poor neighborhoods, and for minor issues like loud music. 

Jay Stanley, a senior policy analyst at the ACLU, says there’s no overarching federal law that governs how local police departments adopt technologies like the tracking software I wrote about. Departments usually have the leeway to try it first and see how their communities react after the fact. (Veritone, which makes the tool I wrote about, said it couldn’t name or connect me with departments using it, so the details of how it’s being deployed by police are not yet clear.)

Sometimes communities take a firm stand; local laws against police use of facial recognition have been passed around the country. But departments—or the police tech companies they buy from—can find workarounds. Stanley says the new tracking software I wrote about poses lots of the same issues as facial recognition while escaping scrutiny because it doesn’t technically use biometric data.

“The community should be very skeptical of this kind of tech and, at a minimum, ask a lot of questions,” he says. He laid out a road map of what police departments should do before they adopt AI technologies: have hearings with the public, get community permission, and make promises about how the systems will and will not be used. He added that the companies making this tech should also allow it to be tested by independent parties. 

“This is all coming down the pike,” he says—and so quickly that policymakers and the public have little time to keep up. He adds, “Are these powers we want the police—the authorities that serve us—to have, and if so, under what conditions?”

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Why climate researchers are taking the temperature of mountain snow

On a crisp morning in early April, Dan McEvoy and Bjoern Bingham cut clean lines down a wide run at the Heavenly Ski Resort in South Lake Tahoe, then ducked under a rope line cordoning off a patch of untouched snow. 

They side-stepped up a small incline, poled past a row of Jeffrey pines, then dropped their packs. 

The pair of climate researchers from the Desert Research Institute (DRI) in Reno, Nevada, skied down to this research plot in the middle of the resort to test out a new way to take the temperature of the Sierra Nevada snowpack. They were equipped with an experimental infrared device that can take readings as it’s lowered down a hole in the snow to the ground.

The Sierra’s frozen reservoir provides about a third of California’s water and most of what comes out of the faucets, shower heads, and sprinklers in the towns and cities of northwestern Nevada. As it melts through the spring and summer, dam operators, water agencies, and communities have to manage the flow of billions of gallons of runoff, storing up enough to get through the inevitable dry summer months without allowing reservoirs and canals to flood.

The need for better snowpack temperature data has become increasingly critical for predicting when the water will flow down the mountains, as climate change fuels hotter weather, melts snow faster, and drives rapid swings between very wet and very dry periods. 

In the past, it has been arduous work to gather such snowpack observations. Now, a new generation of tools, techniques, and models promises to ease that process, improve water forecasts, and help California and other states safely manage one of their largest sources of water in the face of increasingly severe droughts and flooding.

Observers, however, fear that any such advances could be undercut by the Trump administration’s cutbacks across federal agencies, including the one that oversees federal snowpack monitoring and survey work. That could jeopardize ongoing efforts to produce the water data and forecasts on which Western communities rely.

“If we don’t have those measurements, it’s like driving your car around without a fuel gauge,” says Larry O’Neill, Oregon’s state climatologist. “We won’t know how much water is up in the mountains, and whether there’s enough to last through the summer.”

The birth of snow surveys

The snow survey program in the US was born near Lake Tahoe, the largest alpine lake in North America, around the turn of the 20th century. 

Without any reliable way of knowing how much water would flow down the mountain each spring, lakefront home and business owners, fearing floods, implored dam operators to release water early in the spring. Downstream communities and farmers pushed back, however, demanding that the dam be used to hold on to as much water as possible to avoid shortages later in the year. 

In 1908, James Church, a classics professor at the University of Nevada, Reno, whose passion for hiking around the mountains sparked an interest in the science of snow, invented a device that helped resolve the so-called Lake Tahoe Water Wars: the Mt. Rose snow sampler, named after the peak of a Sierra spur that juts into Nevada.

Professor James E. Church wearing goggles and snowshoes, standing on a snowy hillside
James Church, a professor of classics at the University of Nevada, Reno, became a pioneer in the field of snow surveys.
COURTESY OF UNIVERSITY OF NEVADA, RENO

It’s a simple enough device, with sections of tube that screw together, a sharpened end, and measurement ticks along the side. Snow surveyors measure the depth of the snow by plunging the sampler down to the ground. They then weigh the filled tube on a specialized scale to calculate the water content of the snow. 
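
The arithmetic behind that weighing step is simple. Here’s a rough sketch (not the surveyors’ exact procedure, and with a hypothetical tube size and core weight): the snow water equivalent is just the mass of the snow core divided by the density of water times the tube’s cross-sectional area, and dividing that by the snow depth gives the snowpack’s density.

```python
# Rough sketch of the arithmetic behind a snow-sampler reading (not an
# official survey procedure). The tube area and core weight are illustrative.

WATER_DENSITY_G_PER_CM3 = 1.0

def snow_water_equivalent_cm(core_mass_g: float, tube_area_cm2: float) -> float:
    """Depth of liquid water (cm) held in the snow core inside the tube."""
    return core_mass_g / (WATER_DENSITY_G_PER_CM3 * tube_area_cm2)

def snow_density_fraction(swe_cm: float, snow_depth_cm: float) -> float:
    """Fraction of the snow column that is water (e.g. 0.30 = 30% density)."""
    return swe_cm / snow_depth_cm

# Example: a 150 cm deep core weighing 500 g in a tube with a ~11 cm^2 opening.
swe = snow_water_equivalent_cm(core_mass_g=500, tube_area_cm2=11.2)
density = snow_density_fraction(swe, snow_depth_cm=150)
print(f"SWE ≈ {swe:.1f} cm of water, snow density ≈ {density:.0%}")
```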

Church used the device to take measurements at various points across the range, and calibrated his water forecasts by comparing his readings against the rising and falling levels of Lake Tahoe. 

It worked so well that the US began a federal snow survey program in the mid-1930s, which evolved into the one carried on today by the Department of Agriculture’s Natural Resources Conservation Service (NRCS). Throughout the winter, hundreds of snow surveyors across the American West head up to established locations on snowshoes, backcountry skis, or snowmobiles to deploy their Mt. Rose samplers, which have barely changed over more than a century. 

In the 1960s, the US government also began setting up a network of permanent monitoring sites across the mountains, now known as the SNOTEL network. There are more than 900 stations continuously transmitting readings from across Western states and Alaska. They’re equipped with sensors that measure air temperature, snow depth, and soil moisture, and include pressure-sensitive “snow pillows” that weigh the snow to determine the water content. 

The data from the snow surveys and SNOTEL sites all flows into snow depth and snow water content reports that the NRCS publishes, along with forecasts of the amount of water that will fill the streams and reservoirs through the spring and summer.

Taking the temperature

None of these survey and monitoring programs, however, provide the temperature throughout the snowpack. 

The Sierra Nevada snowpack can reach depths of more than 6 meters (20 feet), and the temperature within it may vary widely, especially toward the top. Readings taken at increments throughout can determine what’s known as the cold content, or the amount of energy required to shift the snowpack to a uniform temperature of 32˚F. 

Knowing the cold content of the snowpack helps researchers understand the conditions under which it will begin to rapidly melt, particularly as it warms up in the spring or after rain falls on top of the snow.

If the temperature of the snow, for example, is close to 32˚F even at several feet deep, a few warm days could easily set it melting. If, on the other hand, the temperature measurements show a colder profile throughout the middle, the snowpack is more stable and will hold up longer as the weather warms.
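
Cold content itself is a straightforward calculation once you have a temperature profile. Here’s a minimal sketch, using the standard approach of multiplying each layer’s density, thickness, and temperature deficit below freezing by the specific heat of ice; the layer densities and temperatures below are made-up illustrative values, not DRI data.

```python
# A minimal sketch of a cold-content calculation from a temperature profile,
# e.g. readings taken every 10 cm down a hole. Layer values are illustrative.

SPECIFIC_HEAT_ICE = 2100.0  # J per kg per degree C, approximate

def cold_content_j_per_m2(layers):
    """Energy (J/m^2) needed to warm the whole snowpack to 0 C (32 F).

    `layers` is a list of (thickness_m, density_kg_m3, temperature_c) tuples.
    """
    total = 0.0
    for thickness_m, density_kg_m3, temp_c in layers:
        # Only snow below freezing contributes; warming it to 0 C takes energy.
        total += density_kg_m3 * SPECIFIC_HEAT_ICE * thickness_m * max(0.0, -temp_c)
    return total

# Example profile: near freezing at top and bottom, colder in the middle.
profile = [
    (0.5, 300, -0.5),   # top 50 cm
    (1.0, 350, -4.0),   # mid-pack
    (0.5, 400, -0.2),   # near the ground
]
print(f"Cold content ≈ {cold_content_j_per_m2(profile)/1e6:.1f} MJ per square meter")
```

A colder, denser mid-pack like the one in the example stores more “cold”: more energy from warm days or rain has to go in before meltwater starts coming out.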

a person raising a snow shovel to head height
Bjoern Bingham, a research scientist at the Desert Research Institute, digs out a snow pit at a research plot within the Heavenly Ski Resort, near South Lake Tahoe, California.
JAMES TEMPLE

The problem is that taking the temperature of the entire snowpack has been, until now, tough and time-consuming work. When researchers do it at all, they mainly do so by digging snow pits down to the ground and then taking readings with probe thermometers along an inside wall.

There have been a variety of efforts to take continuous remote readings from sensors attached to fences, wires, or towers, which the snowpack eventually buries. But the movement and weight of the dense shifting snow tends to break the devices or snap the structures they’re assembled upon.

“They rarely last a season,” McEvoy says.

Anne Heggli, a professor of mountain hydrometeorology at DRI, happened upon the idea of using an infrared device to solve this problem during a tour of the institute’s campus in 2019, when she learned that researchers there were using an infrared meat thermometer to take contactless readings of the snow surface.

In 2021, Heggli began collaborating with RPM Systems, a gadget manufacturing company, to design an infrared device optimized for snowpack field conditions. The resulting snow temperature profiler is skinny enough to fit down a hole dug by snow surveyors and dangles on a cord marked off at 10-centimeter (4-inch) increments.

a researcher stands in a snowy trench taking notes, while a second researcher drops a yellow measure down from the surface level
Bingham and Daniel McEvoy, an associate research professor at the Desert Research Institute, work together to take temperature readings from inside the snowpit as well as from within the hole left behind by a snow sampler.
JAMES TEMPLE

At Heavenly on that April morning, Bingham, a staff scientist at DRI, slowly fed the device down a snow sampler hole, calling out temperature readings at each marking. McEvoy scribbled them down on a worksheet fastened to his clipboard as he used a probe thermometer to take readings of his own from within a snow pit the pair had dug down to the ground.

They were comparing the measurements to assess the reliability of the infrared device in the field, but the eventual aim is to eliminate the need to dig snow pits. The hope is that state and federal surveyors could simply carry along a snow temperature profiler and drop it into the snowpack survey holes they’re creating anyway, to gather regular snowpack temperature readings from across the mountains.

In 2023, the US Bureau of Reclamation, the federal agency that operates many of the nation’s dams, funded a three-year research project to explore the use of the infrared gadgets in determining snowpack temperatures. Through it, the DRI research team has now handed devices out to 20 snow survey teams across California, Colorado, Idaho, Montana, Nevada, and Utah to test their use in the field and supplement the snowpack data they’re collecting.

The Snow Lab

The DRI research project is one piece of a wider effort to obtain snowpack temperature data across the mountains of the West.

By early May, the snow depth had dropped from an April peak of 114 inches to 24 inches (2.9 meters to 0.6 meters) at the UC Berkeley Central Sierra Snow Lab, an aging wooden structure perched in the high mountains northwest of Lake Tahoe.

Megan Mason, a research scientist at the lab, used a backcountry ski shovel to dig out a trio of instruments from what was left of the pitted snowpack behind the building. Each one featured different types of temperature sensors, arrayed along a strong polymer beam meant to hold up under the weight and movement of the Sierra snowpack.  

She was pulling up the devices after running the last set of observations for the season, as part of an effort to develop a resilient system that can survive the winter and transmit hourly temperature readings.

The lab is working on the project, dubbed the California Cold Content Initiative, in collaboration with the state’s Department of Water Resources. California is the only western state that opted to maintain its own snow survey program and run its own permanent monitoring stations, all of which are managed by the water department. 

The plan is to determine which instruments held up and functioned best this winter. Then, they can begin testing the most promising approaches at several additional sites next season. Eventually, the goal is to attach the devices at more than 100 of California’s snow monitoring stations, says Andrew Schwartz, the director of the lab.

The NRCS is conducting a similar research effort at select SNOTEL sites equipped with a beaded temperature cable. One such cable is visible at the Heavenly SNOTEL station, next to where McEvoy and Bingham dug their snow pit, strung vertically between an arm extended from the main tower and the snow-covered ground. 

a gloved hand inserts a probe wire into a hole in the snow
DRI’s Bjoern Bingham feeds the snow temperature profiler, an infrared device, down a hole in the Sierra snowpack.
JAMES TEMPLE

Schwartz said that the different research groups are communicating and collaborating openly on the projects, all of which promise to provide complementary information, expanding the database of snowpack temperature readings across the West.

For decades, agencies and researchers generally produced water forecasts using relatively simple regression models that translated the amount of water in the snowpack into the amount of water that will flow down the mountain, based largely on the historic relationships between those variables. 

But these models are becoming less reliable as climate change alters temperatures, snow levels, melt rates, and evaporation, and otherwise pushes alpine weather outside of historical patterns.

“As we have years that scatter further and more frequently from the norm, our models aren’t prepared,” Heggli says.

Plugging direct temperature observations into more sophisticated models that have emerged in recent years, Schwartz says, promises to significantly improve the accuracy of water forecasts. That, in turn, should help communities manage through droughts and prevent dams from overtopping even as climate change fuels alternately wetter, drier, warmer, and weirder weather.

About a quarter of the world’s population relies on water stored in mountain snow and glaciers, and climate change is disrupting the hydrological cycles that sustain these natural frozen reservoirs in many parts of the world. So any advances in observations and modeling could deliver broader global benefits.

Ominous weather

There’s an obvious threat to this progress, though.

Even if these projects work as well as hoped, it’s not clear how widely these tools and techniques will be deployed at a time when the White House is gutting staff across federal agencies, terminating thousands of scientific grants, and striving to eliminate tens of billions of dollars in funding at research departments. 

The Trump administration has fired or put on administrative leave nearly 6,000 employees across the USDA, or 6% of the department’s workforce. Those cutbacks have reached regional NRCS offices, according to reporting by local and trade outlets.

That includes more than half of the roles at the Portland office, according to O’Neill, the state climatologist. Those reductions prompted a bipartisan group of legislators to call on the Secretary of Agriculture to restore the positions, warning the losses could impair water data and analyses that are crucial for the state’s “agriculture, wildland fire, hydropower, timber, and tourism sectors,” as the Statesman Journal reported.

There are more than 80 active SNOTEL stations in Oregon.

The fear is there won’t be enough people left to reach all the sites this summer to replace batteries, solar panels, and drifting or broken sensors, which could quickly undermine the reliability of the data or cut off the flow of information. 

“Staff and budget reductions at NRCS will make it impossible to maintain SNOTEL instruments and conduct routine manual observations, leading to inoperability of the network within a year,” the lawmakers warned.

The USDA and NRCS didn’t respond to inquiries from MIT Technology Review.

looking down at a researcher standing in a snowy trench with a clipboard of notes
DRI’s Daniel McEvoy scribbles down temperature readings at the Heavenly site.
JAMES TEMPLE

If the federal cutbacks deplete the data coming back from SNOTEL stations or federal snow survey work, the DRI infrared method could at least “still offer a simplistic way of measuring the snowpack temperatures” in places where state and regional agencies continue to carry out surveys, McEvoy says.

But most researchers stress the field needs more surveys, stations, sensors, and readings to understand how the climate and water cycles are changing from month to month and season to season. Heggli stresses that there should be broad bipartisan support for programs that collect snowpack data and provide the water forecasts that farmers and communities rely on. 

“This is how we account for one of, if not the, most valuable resource we have,” she says. “In the West, we go into a seasonal drought every summer; our snowpack is what trickles down and gets us through that drought. We need to know how much we have.”

The first US hub for experimental medical treatments is coming

A bill that allows medical clinics to sell unproven treatments has been passed in Montana. 

Under the legislation, doctors can apply for a license to open an experimental treatment clinic and recommend and sell therapies not approved by the Food and Drug Administration (FDA) to their patients. Once it’s signed by the governor, the law will be the most expansive in the country in allowing access to drugs that have not been fully tested. 

The bill allows for any drug produced in the state to be sold in it, provided it has been through phase I clinical trials—the initial, generally small, first-in-human studies that are designed to check that a new treatment is not harmful. These trials do not determine whether the drug is effective.

The bill, which was passed by the state legislature on April 29 and is expected to be signed by Governor Greg Gianforte, essentially expands on existing Right to Try legislation in the state. But while that law was originally designed to allow terminally ill people to access experimental drugs, the new bill was drafted and lobbied for by people interested in extending human lifespans—a group of longevity enthusiasts that includes scientists, libertarians, and influencers.  

These longevity enthusiasts are hoping Montana will serve as a test bed for opening up access to experimental drugs. “I see no reason why it couldn’t be adopted by most of the other states,” said Todd White, speaking to an audience of policymakers and others interested in longevity at an event late last month in Washington, DC. White, who helped develop the bill and directs a research organization focused on aging, added that “there are some things that can be done at the federal level to allow Right to Try laws to proliferate more readily.” 

Supporters of the bill say it gives individuals the freedom to make choices about their own bodies. At the same event, bioethicist Jessica Flanigan of the University of Richmond said she was “optimistic” about the measure, because “it’s great any time anybody is trying to give people back their medical autonomy.” 

Ultimately, they hope that the new law will enable people to try unproven drugs that might help them live longer, make it easier for Americans to try experimental treatments without having to travel abroad, and potentially turn Montana into a medical tourism hub.

But ethicists and legal scholars aren’t as optimistic. “I hate it,” bioethicist Alison Bateman-House of New York University says of the bill. She and others are worried about the ethics of promoting and selling unproven treatments—and the risks of harm should something go wrong.

Easy access?

No drugs have been approved to treat human aging. Some in the longevity field believe that regulation has held back the development of such drugs. In the US, federal law requires that drugs be shown to be both safe and effective before they can be sold. That requirement was made law in the 1960s following the thalidomide tragedy, in which women who took the drug for morning sickness had babies with sometimes severe disabilities. Since then, the FDA has been responsible for the approval of new drugs.  

Typically, new drugs are put through a series of human trials. Phase I trials generally involve between 20 and 100 volunteers and are designed to check that the drug is safe for humans. If it is, the drug is then tested in larger groups of hundreds, and then thousands, of volunteers to assess the dose and whether it actually works. Once a drug is approved, people who are prescribed it are monitored for side effects. The entire process is slow, and it can last more than a decade—a particular pain point for people who are acutely aware of their own aging. 

But some exceptions have been made for people who are terminally ill under Right to Try laws. Those laws allow certain individuals to apply for access to experimental treatments that have been through phase I clinical trials but have not received FDA approval.

Montana first passed a Right to Try law in 2015 (a federal law was passed around three years later). Then in 2023, the state expanded the law to include all patients there, not just those with terminal illnesses—meaning that any person in Montana could, in theory, take a drug that had been through only a phase I trial.

At the time, this was cheered by many longevity enthusiasts—some of whom had helped craft the expanded measure.

But practically, the change hasn’t worked out as they envisioned. “There was no licensing, no processing, no registration” for clinics that might want to offer those drugs, says White. “There needed to be another bill that provided regulatory clarity for service providers.” 

So the new legislation addresses “how clinics can set up shop in Montana,” says Dylan Livingston, founder and CEO of the Alliance for Longevity Initiatives, which hosted the DC event. Livingston built A4LI, as it’s known, a few years ago, as a lobbying group for the science of human aging and longevity.

Livingston, who is exploring multiple approaches to improve both funding for scientific research and to change drug regulation, helped develop and push the 2023 bill in Montana with the support of State Senator Kenneth Bogner, he says. “I gave [Bogner] a menu of things that could be done at the state level … and he loved the idea” of turning Montana into a medical tourism hub, he says. 

After all, as things stand, plenty of Americans travel abroad to receive experimental treatments that cannot legally be sold in the US, including expensive, unproven stem cell and gene therapies, says Livingston. 

“If you’re going to go and get an experimental gene therapy, you might as well keep it in the country,” he says. Livingston has suggested that others might be interested in trying a novel drug designed to clear aged “senescent” cells from the body, which is currently entering phase II trials for an eye condition caused by diabetes. “One: let’s keep the money in the country, and two: if I was a millionaire getting an experimental gene therapy, I’d rather be in Montana than Honduras.”

“Los Alamos for longevity”

Honduras, in particular, has become something of a home base for longevity experiments. The island of Roatán is home to the Global Alliance for Regenerative Medicine clinic, which, along with various stem cell products, sells a controversial unproven “anti-aging” gene therapy for around $20,000 to customers including wealthy longevity influencer Bryan Johnson.

Tech entrepreneur and longevity enthusiast Niklas Anzinger has also founded the city of Infinita in the region’s special economic zone of Próspera, a private city where residents are able to make their own suggestions for medical regulations. It’s the second time he’s built a community there as part of his effort to build a “Los Alamos for longevity” on the island, a place where biotech companies can develop therapies that slow or reverse human aging “at warp speed,” and where individuals are free to take those experimental treatments. (The first community, Vitalia, featured a biohacking lab, but came to an end following a disagreement between the two founders.) 

Anzinger collaborated with White, the longevity enthusiast who spoke at the A4LI event (and is an advisor to Infinita VC, Anzinger’s investment company), to help put together the new Montana bill. “He asked if I would help him try to advance the new bill, so that’s what we did for the last few months,” says White, who trained as an electrical engineer but left his career in telecommunications to work with an organization that uses blockchain to fund research into extending human lifespans. 

“Right to Try has always been this thing [for people] who are terminal[ly ill] and trying a Hail Mary approach to solving these things; now Right to Try laws are being used to allow you to access treatments earlier,” White told the audience at the A4LI event. “Making it so that people can use longevity medicines earlier is, I think, a very important thing.”

The new bill largely sets out the “infrastructure” for clinics that want to sell experimental treatments, says White. It states that clinics will need to have a license, for example, and that this must be renewed on an annual basis. 

“Now somebody who actually wants to deliver drugs under the Right to Try law will be able to do so,” he says. The new legislation also protects prescribing doctors from disciplinary action.

And it sets out requirements for informed consent that go further than those of existing Right to Try laws. Before taking an experimental drug under the new law, a person will be required to provide written consent that includes a list of approved alternative drugs and a description of the worst potential outcome.

On the safe side

“In the Montana law, we explicitly enhanced the requirements for informed consent,” Anzinger told an audience at the same A4LI event. This, along with the fact that the treatments will have been through phase I clinical trials, will help to keep people safe, he argued. “We have to treat this with a very large degree of responsibility,” he added.

“We obviously don’t want to be killing people,” says Livingston. 

But he also adds that he, personally, won’t be signing up for any experimental treatments. “I want to be the 10 millionth, or even the 50 millionth, person to get the gene therapy,” he says. “I’m not that adventurous … I’ll let other people go first.”

Others are indeed concerned that, for the “adventurous” people, these experimental treatments won’t necessarily be safe. Phase I trials are typically tiny, often involving fewer than 50 people, all of them in good health. A trial like that won’t tell you much about side effects that show up in only 5% of people, for example, or about interactions the drug might have with other medicines.

Around 90% of drug candidates in clinical trials fail. And around 17% of drugs fail late-stage clinical trials because of safety concerns. Even those that make it all the way through clinical trials and get approved by the FDA can still end up being withdrawn from the market when rare but serious side effects show up. Between 1992 and 2023, 23 drugs that were given accelerated approval for cancer indications were later withdrawn from the market. And between 1950 and 2013, the reason for the withdrawal of 95 drugs was “death.”

“It’s disturbing that they want to make drugs available after phase I testing,” says Sharona Hoffman, professor of law and bioethics at Case Western Reserve University in Cleveland, Ohio. “This could endanger patients.”

“Famously, the doctor’s first obligation is to first do no harm,” says Bateman-House. “If [a drug] has not been through clinical trials, how do you have any standing on which to think it isn’t going to do any harm?”

But supporters of the bill argue that individuals can make their own decisions about risk. When speaking at the A4LI event, Flanigan introduced herself as a bioethicist before adding “but don’t hold it against me; we’re not all so bad.” She argued that current drug regulations impose a “massive amount of restrictions on your bodily rights and your medical freedom.” Why should public officials be the ones making decisions about what’s safe for people? Individuals, she argued, should be empowered to make those judgments themselves.

Other ethicists counter that this isn’t an issue of people’s rights. There are lots of generally accepted laws about when we can access drugs, says Hoffman; people aren’t allowed to drink and drive because they might kill someone. “So, no, you don’t have a right to ingest everything you want if there are risks associated with it.”

The idea that individuals have a right to access experimental treatments has in fact failed in US courts in the past, says Carl Coleman, a bioethicist and legal scholar at Seton Hall in New Jersey. 

He points to a case from 20 years ago: In the early 2000s, Frank Burroughs founded the Abigail Alliance for Better Access to Developmental Drugs. His daughter, Abigail Burroughs, had head and neck cancer, and she had tried and failed to access experimental drugs. In 2003, about two years after Abigail’s death, the group sued the FDA, arguing that people with terminal cancer have a constitutionally protected right to access experimental, unapproved treatments, once those treatments have been through phase I trials. In 2007, however, a court rejected that argument, determining that terminally ill individuals do not have a constitutional right to experimental drugs.

Bateman-House also questions a provision in the Montana bill that claims to make treatments more equitable. It states that “experimental treatment centers” should allocate 2% of their net annual profits “to support access to experimental treatments and healthcare for qualifying Montana residents.” Bateman-House says she’s never seen that kind of language in a bill before. It may sound positive, but it could in practice introduce even more risk to the local community. “On the one hand, I like equity,” she says. “On the other hand, I don’t like equity to snake oil.”

After all, the doctors prescribing these drugs won’t know if they will work. It is never ethical to make somebody pay for a treatment when you don’t have any idea whether it will work, Bateman-House adds. “That’s how the US system has been structured: There’s no profit without evidence of safety and efficacy.”

The clinics are coming

Any clinics that offer experimental treatments in Montana will only be allowed to sell drugs that have been made within the state, says Coleman. “Federal law requires any drug that is going to be distributed in interstate commerce to have FDA approval,” he says.

White isn’t too worried about that. Montana already has manufacturing facilities for biotech and pharmaceutical companies, including Pfizer. “That was one of the specific advantages [of focusing] on Montana, because everything can be done in state,” he says. He also believes that the current administration is “predisposed” to change federal laws around interstate drug manufacturing. (FDA commissioner Marty Makary has been a vocal critic of the agency and the pace at which it approves new drugs.)

At any rate, the clinics are coming to Montana, says Livingston. “We have half a dozen that are interested, and maybe two or three that are definitively going to set up shop out there.” He won’t name names, but he says some of the interested clinicians already have clinics in the US, while others are abroad. 

Mac Davis—founder and CEO of Minicircle, the company that developed the controversial “anti-aging” gene therapy—told MIT Technology Review he was “looking into it.”

“I think this can be an opportunity for America and Montana to really kind of corner the market when it comes to medical tourism,” says Livingston. “There is no other place in the world with this sort of regulatory environment.”

Google DeepMind’s new AI agent uses large language models to crack real-world problems

Google DeepMind has once again used large language models to discover new solutions to long-standing problems in math and computer science. This time the firm has shown that its approach can not only tackle unsolved theoretical puzzles, but improve a range of important real-world processes as well.

Google DeepMind’s new tool, called AlphaEvolve, uses the Gemini 2.0 family of large language models (LLMs) to produce code for a wide range of different tasks. LLMs are known to be hit and miss at coding. The twist here is that AlphaEvolve scores each of Gemini’s suggestions, throwing out the bad and tweaking the good, in an iterative process, until it has produced the best algorithm it can. In many cases, the results are more efficient or more accurate than the best existing (human-written) solutions.

“You can see it as a sort of super coding agent,” says Pushmeet Kohli, a vice president at Google DeepMind who leads its AI for Science teams. “It doesn’t just propose a piece of code or an edit, it actually produces a result that maybe nobody was aware of.”

In particular, AlphaEvolve came up with a way to improve the software Google uses to allocate jobs to its many millions of servers around the world. Google DeepMind claims the company has been using this new software across all of its data centers for more than a year, freeing up 0.7% of Google’s total computing resources. That might not sound like much, but at Google’s scale it’s huge.

Jakob Moosbauer, a mathematician at the University of Warwick in the UK, is impressed. He says the way AlphaEvolve searches for algorithms that produce specific solutions—rather than searching for the solutions themselves—makes it especially powerful. “It makes the approach applicable to such a wide range of problems,” he says. “AI is becoming a tool that will be essential in mathematics and computer science.”

AlphaEvolve continues a line of work that Google DeepMind has been pursuing for years. Its vision is that AI can help to advance human knowledge across math and science. In 2022, it developed AlphaTensor, a model that found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years. In 2023, it revealed AlphaDev, which discovered faster ways to perform a number of basic calculations performed by computers trillions of times a day. AlphaTensor and AlphaDev both turn math problems into a kind of game, then search for a winning series of moves.

FunSearch, which arrived in late 2023, swapped out game-playing AI and replaced it with LLMs that can generate code. Because LLMs can carry out a range of tasks, FunSearch can take on a wider variety of problems than its predecessors, which were trained to play just one type of game. The tool was used to crack a famous unsolved problem in pure mathematics.

AlphaEvolve is the next generation of FunSearch. Instead of coming up with short snippets of code to solve a specific problem, as FunSearch did, it can produce programs that are hundreds of lines long. This makes it applicable to a much wider variety of problems.    

In theory, AlphaEvolve could be applied to any problem that can be described in code and that has solutions that can be evaluated by a computer. “Algorithms run the world around us, so the impact of that is huge,” says Matej Balog, a researcher at Google DeepMind who leads the algorithm discovery team.

Survival of the fittest

Here’s how it works: AlphaEvolve can be prompted like any LLM. Give it a description of the problem and any extra hints you want, such as previous solutions, and AlphaEvolve will get Gemini 2.0 Flash (the smallest, fastest version of Google DeepMind’s flagship LLM) to generate multiple blocks of code to solve the problem.

It then takes these candidate solutions, runs them to see how accurate or efficient they are, and scores them according to a range of relevant metrics. Does this code produce the correct result? Does it run faster than previous solutions? And so on.

AlphaEvolve then takes the best of the current batch of solutions and asks Gemini to improve them. Sometimes AlphaEvolve will throw a previous solution back into the mix to prevent Gemini from hitting a dead end.

When it gets stuck, AlphaEvolve can also call on Gemini 2.0 Pro, the most powerful of Google DeepMind’s LLMs. The idea is to generate many solutions with the faster Flash but add solutions from the slower Pro when needed.

These rounds of generation, scoring, and regeneration continue until Gemini fails to come up with anything better than what it already has.
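
In pseudocode, that loop is conceptually simple. The sketch below is a minimal illustration of the generate-score-regenerate cycle described above, not Google DeepMind’s actual implementation; `ask_flash`, `ask_pro`, and `score` are hypothetical stand-ins for calls to Gemini 2.0 Flash, Gemini 2.0 Pro, and a problem-specific evaluator.

```python
# A minimal sketch of the generate-score-regenerate loop described above --
# not Google DeepMind's implementation. `ask_flash` and `ask_pro` stand in for
# calls to the fast and strong models; `score` runs a candidate program and
# returns a number (accuracy, speed, resources saved, and so on).
import random

def evolve(problem, score, ask_flash, ask_pro, population=8, rounds=30):
    best_code, best_score = None, float("-inf")
    for _ in range(rounds):
        # Feed the best solution found so far back into the prompt as a hint.
        prompt = problem if best_code is None else (
            f"{problem}\n\nImprove on this solution (score {best_score}):\n{best_code}"
        )
        # Generate mostly with the fast model, occasionally with the stronger one.
        candidates = [
            (ask_pro if random.random() < 0.1 else ask_flash)(prompt)
            for _ in range(population)
        ]
        scored = [(score(code), code) for code in candidates]  # run and measure each
        round_score, round_code = max(scored, key=lambda pair: pair[0])
        if round_score <= best_score:
            break  # nothing beat the incumbent, so stop
        best_score, best_code = round_score, round_code
    return best_code, best_score
```

The real system is, of course, far more elaborate, but the evolutionary shape of the loop—generate, evaluate, keep the fittest, regenerate—is the point.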

Number games

The team tested AlphaEvolve on a range of different problems. For example, they looked at matrix multiplication again to see how a general-purpose tool like AlphaEvolve compared to the specialized AlphaTensor. Matrices are grids of numbers. Matrix multiplication is a basic computation that underpins many applications, from AI to computer graphics, yet nobody knows the fastest way to do it. “It’s kind of unbelievable that it’s still an open question,” says Balog.

The team gave AlphaEvolve a description of the problem and an example of a standard algorithm for solving it. The tool not only produced new algorithms that could calculate 14 different sizes of matrix faster than any existing approach, it also improved on AlphaTensor’s record-beating result for multiplying two four-by-four matrices.

AlphaEvolve scored 16,000 candidates suggested by Gemini to find the winning solution, but that’s still more efficient than AlphaTensor, says Balog. AlphaTensor’s solution also only worked when a matrix was filled with 0s and 1s. AlphaEvolve solves the problem with other numbers too.
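
To see what’s actually being optimized, it helps to count operations. The sketch below multiplies two 4×4 matrices the textbook way, which costs 4³ = 64 scalar multiplications; Strassen’s 1969 algorithm needs only 49, and the records traded between AlphaTensor and AlphaEvolve push that count lower still. This is a hedged illustration of the baseline, not of either system’s discovered algorithm.

```python
# Textbook matrix multiplication, with a counter showing the n^3 cost that
# faster algorithms try to beat (64 scalar multiplications for 4x4 matrices).

def matmul_naive(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    multiplications = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                multiplications += 1
    return C, multiplications

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # identity matrix
C, count = matmul_naive(A, B)
print(count)     # 64
print(C == A)    # multiplying by the identity returns A unchanged -> True
```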

“The result on matrix multiplication is very impressive,” says Moosbauer. “This new algorithm has the potential to speed up computations in practice.”

Manuel Kauers, a mathematician at Johannes Kepler University in Linz, Austria, agrees: “The improvement for matrices is likely to have practical relevance.”

By coincidence, Kauers and a colleague have just used a different computational technique to find some of the speedups AlphaEvolve came up with. The pair posted a paper online reporting their results last week.

“It is great to see that we are moving forward with the understanding of matrix multiplication,” says Kauers. “Every technique that helps is a welcome contribution to this effort.”

Real-world problems

Matrix multiplication was just one breakthrough. In total, Google DeepMind tested AlphaEvolve on more than 50 different types of well-known math puzzles, including problems in Fourier analysis (the math behind data compression, essential to applications such as video streaming), the minimum overlap problem (an open problem in number theory proposed by mathematician Paul Erdős in 1955), and kissing numbers (a problem introduced by Isaac Newton that has applications in materials science, chemistry, and cryptography). AlphaEvolve matched the best existing solutions in 75% of cases and found better solutions in 20% of cases.  

Google DeepMind then applied AlphaEvolve to a handful of real-world problems. As well as coming up with a more efficient algorithm for managing computational resources across data centers, the tool found a way to reduce the power consumption of Google’s specialized tensor processing unit chips.

AlphaEvolve even found a way to speed up the training of Gemini itself, by producing a more efficient algorithm for managing a certain type of computation used in the training process.

Google DeepMind plans to continue exploring potential applications of its tool. One limitation is that AlphaEvolve can’t be used for problems with solutions that need to be scored by a person, such as lab experiments that are subject to interpretation.   

Moosbauer also points out that while AlphaEvolve may produce impressive new results across a wide range of problems, it gives little theoretical insight into how it arrived at those solutions. That’s a drawback when it comes to advancing human understanding.  

Even so, tools like AlphaEvolve are set to change the way researchers work. “I don’t think we are finished,” says Kohli. “There is much further that we can go in terms of how powerful this type of approach is.”

How a new type of AI is helping police skirt facial recognition bans

Police and federal agencies have found a controversial new way to skirt the growing patchwork of laws that curb how they use facial recognition: an AI model that can track people using attributes like body size, gender, hair color and style, clothing, and accessories. 

The tool, called Track and built by the video analytics company Veritone, is used by 400 customers, including state and local police departments and universities all over the US. It is also expanding federally: US attorneys at the Department of Justice began using Track for criminal investigations last August. Veritone’s broader suite of AI tools, which includes bona fide facial recognition, is also used by the Department of Homeland Security—which houses immigration agencies—and the Department of Defense, according to the company. 

“The whole vision behind Track in the first place,” says Veritone CEO Ryan Steelberg, was “if we’re not allowed to track people’s faces, how do we assist in trying to potentially identify criminals or malicious behavior or activity?” In addition to tracking individuals where facial recognition isn’t legally allowed, Steelberg says, it allows for tracking when faces are obscured or not visible. 

The product has drawn criticism from the American Civil Liberties Union, which—after learning of the tool through MIT Technology Review—said it was the first instance they’d seen of a nonbiometric tracking system used at scale in the US. They warned that it raises many of the same privacy concerns as facial recognition but also introduces new ones at a time when the Trump administration is pushing federal agencies to ramp up monitoring of protesters, immigrants, and students.

Veritone gave us a demonstration of Track in which it analyzed people in footage from different environments, ranging from the January 6 riots to subway stations. You can use it to find people by specifying body size, gender, hair color and style, shoes, clothing, and various accessories. The tool can then assemble timelines, tracking a person across different locations and video feeds. It can be accessed through Amazon and Microsoft cloud platforms.

VERITONE; MIT TECHNOLOGY REVIEW (CAPTIONS)
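
To make the general idea concrete, here is a purely hypothetical sketch of attribute-based matching. It has nothing to do with Veritone’s actual code or models, which rely on learned visual features rather than hand-written tags, but it shows how detections can be linked into a timeline without ever touching a face.

```python
# A purely hypothetical illustration of attribute-based matching -- NOT
# Veritone's Track. Each detection is described by non-facial attributes;
# detections whose attributes agree across cameras are linked into a timeline.
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    camera: str
    timestamp: str
    attributes: frozenset  # e.g. {"red jacket", "backpack", "short hair"}

def matches(query: set, det: Detection, min_overlap: int = 3) -> bool:
    """Crude rule: a detection matches if enough queried attributes agree."""
    return len(query & det.attributes) >= min_overlap

def build_timeline(query: set, detections: list) -> list:
    hits = [d for d in detections if matches(query, d)]
    return sorted(hits, key=lambda d: d.timestamp)

detections = [
    Detection("subway_cam_2", "09:14", frozenset({"red jacket", "backpack", "short hair"})),
    Detection("street_cam_7", "09:32", frozenset({"red jacket", "backpack", "short hair", "white shoes"})),
    Detection("street_cam_7", "09:40", frozenset({"blue coat", "no bag"})),
]
for d in build_timeline({"red jacket", "backpack", "short hair"}, detections):
    print(d.camera, d.timestamp)
```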

In an interview, Steelberg said that the number of attributes Track uses to identify people will continue to grow. When asked if Track differentiates on the basis of skin tone, a company spokesperson said it’s one of the attributes the algorithm uses to tell people apart but that the software does not currently allow users to search for people by skin color. Track currently operates only on recorded video, but Steelberg claims the company is less than a year from being able to run it on live video feeds.

Agencies using Track can add footage from police body cameras, drones, public videos on YouTube, or so-called citizen upload footage (from Ring cameras or cell phones, for example) in response to police requests.

“We like to call this our Jason Bourne app,” Steelberg says. He expects the technology to come under scrutiny in court cases but says, “I hope we’re exonerating people as much as we’re helping police find the bad guys.” The public sector currently accounts for only 6% of Veritone’s business (most of its clients are media and entertainment companies), but the company says that’s its fastest-growing market, with clients in places including California, Washington, Colorado, New Jersey, and Illinois. 

That rapid expansion has started to cause alarm in certain quarters. Jay Stanley, a senior policy analyst at the ACLU, wrote in 2019 that artificial intelligence would someday expedite the tedious task of combing through surveillance footage, enabling automated analysis regardless of whether a crime has occurred. Since then, lots of police-tech companies have been building video analytics systems that can, for example, detect when a person enters a certain area. However, Stanley says, Track is the first product he’s seen make broad tracking of particular people technologically feasible at scale.

“This is a potentially authoritarian technology,” he says. “One that gives great powers to the police and the government that will make it easier for them, no doubt, to solve certain crimes, but will also make it easier for them to overuse this technology, and to potentially abuse it.”

Chances of such abusive surveillance, Stanley says, are particularly high right now in the federal agencies where Veritone has customers. The Department of Homeland Security said last month that it will monitor the social media activities of immigrants and use evidence it finds there to deny visas and green cards, and Immigration and Customs Enforcement has detained activists following pro-Palestinian statements or appearances at protests. 

In an interview, Jon Gacek, general manager of Veritone’s public-sector business, said that Track is a “culling tool” meant to speed up the task of identifying important parts of videos, not a general surveillance tool. Veritone did not specify which groups within the Department of Homeland Security or other federal agencies use Track. The Departments of Defense, Justice, and Homeland Security did not respond to requests for comment.

For police departments, the tool dramatically expands the amount of video that can be used in investigations. Whereas facial recognition requires footage in which faces are clearly visible, Track doesn’t have that limitation. Nathan Wessler, an attorney for the ACLU, says this means police might comb through videos they had no interest in before. 

“It creates a categorically new scale and nature of privacy invasion and potential for abuse that was literally not possible any time before in human history,” Wessler says. “You’re now talking about not speeding up what a cop could do, but creating a capability that no cop ever had before.”

Track’s expansion comes as laws limiting the use of facial recognition have spread, sparked by wrongful arrests in which officers have been overly confident in the judgments of algorithms. Numerous studies have shown that such algorithms are less accurate with nonwhite faces. Laws in Montana and Maine sharply limit when police can use facial recognition—it’s not allowed in real time with live video—while San Francisco and Oakland, California, have near-complete bans. Track provides an alternative. 

Though such laws often reference “biometric data,” Wessler says this phrase is far from clearly defined. It generally refers to immutable characteristics like faces, gait, and fingerprints rather than things that change, like clothing. But certain attributes, such as body size, blur this distinction. 

Consider also, Wessler says, someone in winter who frequently wears the same boots, coat, and backpack. “Their profile is going to be the same day after day,” Wessler says. “The potential to track somebody over time based on how they’re moving across a whole bunch of different saved video feeds is pretty equivalent to face recognition.”

In other words, Track might provide a way of following someone that raises many of the same concerns as facial recognition, but isn’t subject to laws restricting use of facial recognition because it does not technically involve biometric data. Steelberg said there are several ongoing cases that include video evidence from Track, but that he couldn’t name the cases or comment further. So for now, it’s unclear whether it’s being adopted in jurisdictions where facial recognition is banned. 

Did solar power cause Spain’s blackout?

At roughly midday on Monday, April 28, the lights went out in Spain. The grid blackout, which extended into parts of Portugal and France, affected tens of millions of people—flights were grounded, cell networks went down, and businesses closed for the day.

Over a week later, officials still aren’t entirely sure what happened, but some (including the US energy secretary, Chris Wright) have suggested that renewables may have played a role, because just before the outage happened, wind and solar accounted for about 70% of electricity generation. Others, including Spanish government officials, insisted that it’s too early to assign blame.

It’ll take weeks to get the full report, but we do know a few things about what happened. And even as we wait for the bigger picture, there are a few takeaways that could help our future grid.

Let’s start with what we know so far about what happened, according to the Spanish grid operator Red Eléctrica:

  • A disruption in electricity generation took place a little after 12:30 p.m. This may have been a power plant flipping off or some transmission equipment going down.
  • A little over a second later, the grid lost another bit of generation.
  • A few seconds after that, the main interconnector between Spain and southwestern France got disconnected as a result of grid instability.
  • Immediately after, virtually all of Spain’s electricity generation tripped offline.

One of the theories floating around is that things went wrong because the grid diverged from its normal frequency. (All power grids have a set frequency: In Europe the standard is 50 hertz, meaning the alternating current completes 50 cycles per second.) The frequency needs to be constant across the grid to keep things running smoothly.

There are signs that the outage could be frequency-related. Some experts pointed out that strange oscillations in the grid frequency occurred shortly before the blackout.

Normally, our grid can handle small problems like an oscillation in frequency or a drop that comes from a power plant going offline. But some of the grid’s ability to stabilize itself is tied up in old ways of generating electricity.

Power plants like those that run on coal and natural gas have massive rotating generators. If there are brief issues on the grid that upset the balance, those physical bits of equipment have inertia: They’ll keep moving at least for a few seconds, providing some time for other power sources to respond and pick up the slack. (I’m simplifying here—for more details I’d highly recommend this report from the National Renewable Energy Laboratory.)

Solar panels don’t have inertia—they rely on inverters to change electricity into a form that’s compatible with the grid and matches its frequency. Generally, these inverters are “grid-following,” meaning if frequency is dropping, they follow that drop.
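
To get a feel for why that inertia matters, here is a minimal sketch of the textbook aggregate “swing equation” for a grid that suddenly loses some generation. All of the numbers (the inertia constant H, the 10% generation loss, the 49 Hz trip threshold) are illustrative, not Spain’s actual figures; the only point is that more inertia buys more seconds before the frequency sags to a level where protective equipment starts disconnecting things.

```python
# Toy swing-equation sketch: how system inertia (H, in seconds) slows the
# frequency decline after the grid suddenly loses a chunk of generation.
# All numbers are illustrative, not Spain's actual grid parameters.

F0 = 50.0    # nominal European grid frequency, Hz
TRIP = 49.0  # illustrative under-frequency threshold, Hz
DT = 0.01    # simulation time step, seconds

def seconds_until_trip(inertia_h: float, lost_fraction: float = 0.10) -> float:
    """Integrate df/dt = -lost_fraction * F0 / (2 * H) until frequency hits TRIP."""
    f, t = F0, 0.0
    while f > TRIP and t < 30.0:
        f += -lost_fraction * F0 / (2.0 * inertia_h) * DT
        t += DT
    return t

for h in (6.0, 3.0, 1.0):  # high inertia -> low inertia
    print(f"H = {h:.0f} s  ->  about {seconds_until_trip(h):.1f} s before hitting {TRIP} Hz")
```

With a lot of spinning mass, the frequency takes a couple of seconds to sag by a full hertz; with very little, it happens in a fraction of a second, leaving almost no time for other power sources or protection schemes to respond.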

In the case of the blackout in Spain, it’s possible that having a lot of power on the grid coming from sources without inertia made it easier for a small problem to become a much bigger one.

Some key questions here are still unanswered. The order matters, for example. During that drop in generation, did wind and solar plants go offline first? Or did everything go down together?

Whether or not solar and wind contributed to the blackout as a root cause, we do know that wind and solar don’t contribute to grid stability in the same way that some other power sources do, says Seaver Wang, climate lead of the Breakthrough Institute, an environmental research organization. Regardless of whether renewables are to blame, more capability to stabilize the grid would only help, he adds.

It’s not that a renewable-heavy grid is doomed to fail. As Wang put it in an analysis he wrote last week: “This blackout is not the inevitable outcome of running an electricity system with substantial amounts of wind and solar power.”

One solution: We can make sure the grid includes enough equipment that does provide inertia, like nuclear power and hydropower. Reversing a plan to shut down Spain’s nuclear reactors beginning in 2027 would be helpful, Wang says. Other options include building massive machines that lend physical inertia and using inverters that are “grid-forming,” meaning they can actively help regulate frequency and provide a sort of synthetic inertia.

Inertia isn’t everything, though. Grid operators can also install large amounts of battery storage, which can respond quickly when problems arise. (Spain has much less grid storage than other places with a high level of renewable penetration, like Texas and California.)

Ultimately, if there’s one takeaway here, it’s that as the grid evolves, our methods to keep it reliable and stable will need to evolve too.

If you’re curious to hear more on this story, I’d recommend this Q&A from Carbon Brief about the event and its aftermath and this piece from Heatmap about inertia, renewables, and the blackout.

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

How to build a better AI benchmark

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 to evaluate an AI model’s coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects. 
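
Under the hood, the scoring idea is simple: take the model’s proposed patch for a real GitHub issue, apply it to the repository, and check whether the tests associated with that issue now pass. The sketch below is a simplified illustration of that loop, not the benchmark’s actual harness; the function names and the task format are assumptions.

```python
# Simplified sketch of an SWE-Bench-style scoring loop: apply the model's patch,
# then check whether the tests tied to the issue now pass. Function names and the
# task format here are illustrative, not the benchmark's real harness.
import subprocess

def resolved(repo_dir: str, model_patch: str, issue_tests: list[str]) -> bool:
    """Return True if the patch applies and the issue's tests pass afterwards."""
    subprocess.run(["git", "apply", "-"], input=model_patch, text=True,
                   cwd=repo_dir, check=True)
    result = subprocess.run(["python", "-m", "pytest", *issue_tests], cwd=repo_dir)
    return result.returncode == 0

def score(tasks: list[dict], patches: list[str]) -> float:
    """Fraction of tasks whose issue was resolved by the model's patch."""
    hits = sum(resolved(t["repo_dir"], p, t["issue_tests"]) for t, p in zip(tasks, patches))
    return hits / len(tasks)
```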

In the months since then, it’s quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup among three different fine-tunings of Anthropic’s Claude Sonnet model and Amazon’s Q developer agent. Auto Code Rover—one of the Claude modifications—nabbed the number two spot in November, and was acquired just three months later.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” As the benchmark has gained prominence, “you start to see that people really want that top spot,” says John Yang, a researcher on the team that developed SWE-Bench at Princeton University. As a result, entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement.

Developers of these coding agents aren’t necessarily doing anything as straightforward as cheating, but they’re crafting approaches that are too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, Yang noticed that high-scoring models would fail completely when tested on different programming languages—revealing an approach to the test that he describes as “gilded.”

“It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart,” Yang says. “At that point, you’re not designing a software engineering agent. You’re designing to make a SWE-Bench agent, which is much less interesting.”

The SWE-Bench issue is a symptom of a more sweeping—and complicated—problem in AI evaluation, and one that’s increasingly sparking heated debate: The benchmarks the industry uses to guide development are drifting further and further away from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring capabilities and no clear path to better ones. 

“Historically, benchmarks were the way we evaluated AI systems,” says Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI. “Is that the way we want to evaluate systems going forward? And if it’s not, what is the way?”

A growing group of academics and AI researchers are making the case that the answer is to go smaller, trading sweeping ambition for an approach inspired by the social sciences. Specifically, they want to focus more on testing validity, which for quantitative social scientists refers to how well a given questionnaire measures what it’s claiming to measure—and, more fundamentally, whether what it is measuring has a coherent definition. That could cause trouble for benchmarks assessing hazily defined concepts like “reasoning” or “scientific knowledge”—and for developers aiming to reach the much-hyped goal of artificial general intelligence—but it would put the industry on firmer ground as it looks to prove the worth of individual models.

“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Abigail Jacobs, a University of Michigan professor who is a central figure in the new push for validity. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”

The limits of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long. 

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition—but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly. 

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

In a paper from last July, Kapoor called out specific issues in how AI models were approaching the WebArena benchmark, designed by Carnegie Mellon University researchers in 2024 as a test of an AI agent’s ability to traverse the web. The benchmark consists of more than 800 tasks to be performed on a set of cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team identified an apparent hack in the winning model, called STeP. STeP included specific instructions about how Reddit structures URLs, allowing STeP models to jump directly to a given user’s profile page (a frequent element of WebArena tasks).
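
In spirit, the shortcut looks something like the snippet below, a hypothetical illustration rather than STeP’s actual code: the agent’s policy encodes how the cloned Reddit structures its profile URLs, so it can jump straight to the page instead of navigating to it.

```python
# Hypothetical illustration of the shortcut described above (not STeP's real code):
# the agent memorizes the cloned site's URL structure instead of navigating to it.
def go_to_user_profile(base_url: str, username: str) -> str:
    # Memorized structure: profile pages on the cloned Reddit live at /user/<name>
    return f"{base_url}/user/{username}"

# A more general web agent would have to search and click through the site to
# find the same page, which is closer to what the benchmark means to measure.
print(go_to_user_profile("http://webarena-reddit.example", "some_user"))
```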

This shortcut wasn’t exactly cheating, but Kapoor sees it as “a serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time.” Because the technique was successful, though, a similar policy has since been adopted by OpenAI’s web agent Operator. (“Our evaluation setting is designed to assess how well an agent can solve tasks given some instruction about website structures and task execution,” an OpenAI representative said when reached for comment. “This approach is consistent with how others have used and reported results with WebArena.” STeP did not respond to a request for comment.)

Further highlighting the problem with AI benchmarks, late last month Kapoor and a team of researchers wrote a paper that revealed significant problems in Chatbot Arena, the popular crowdsourced evaluation system. According to the paper, the leaderboard was being manipulated; the companies behind many top foundation models were conducting undisclosed private testing and releasing their scores selectively.

Today, even ImageNet itself, the mother of all benchmarks, has started to fall victim to validity problems. A 2023 study from researchers at the University of Washington and Google Research found that when ImageNet-winning algorithms were pitted against six real-world data sets, the architecture improvement “resulted in little to no progress,” suggesting that the external validity of the test had reached its limit.

Going smaller

For those who believe the main problem is validity, the best fix is reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers “have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can’t anticipate the downstream task anymore.” So what if there was a way to help the downstream consumers identify this gap?

In November 2024, Reuel launched a public ranking project called BetterBench, which rates benchmarks on dozens of different criteria, such as whether the code has been publicly documented. But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.

“You need to have a structural breakdown of the capabilities,” Reuel says. “What are the actual skills you care about, and how do you operationalize them into something we can measure?”

The results are surprising. One of the highest-scoring benchmarks is also the oldest: the Arcade Learning Environment (ALE), established in 2013 as a way to test models’ ability to learn how to play a library of Atari 2600 games. One of the lowest-scoring is the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills; by the standards of BetterBench, the connection between the questions and the underlying skill was too poorly defined.

BetterBench hasn’t meant much for the reputations of specific benchmarks, at least not yet; MMLU is still widely used, and ALE is still marginal. But the project has succeeded in pushing validity into the broader conversation about how to fix benchmarks. In April, Reuel quietly joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she’ll develop her ideas on validity and AI model evaluation with other figures in the field. (An official announcement is expected later this month.) 

Irene Solaiman, Hugging Face’s head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. “There’s just so much hunger for a good benchmark off the shelf that already works,” Solaiman says. “A lot of evaluations are trying to do too much.”

Increasingly, the rest of the industry seems to agree. In a paper in March, researchers from Google, Microsoft, Anthropic, and others laid out a new framework for improving evaluations—with validity as the first step. 

“AI evaluation science must,” the researchers argue, “move beyond coarse grained claims of ‘general intelligence’ towards more task-specific and real-world relevant measures of progress.” 

Measuring the “squishy” things

To help make this shift, some researchers are looking to the tools of social science. A February position paper argued that “evaluating GenAI systems is a social science measurement challenge,” specifically unpacking how the validity systems used in social measurements can be applied to AI benchmarking. 

The authors, largely employed by Microsoft’s research branch but joined by academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, those same procedures could offer a way to measure concepts like “reasoning” and “math proficiency” without slipping into hazy generalizations.

In the social science literature, it’s particularly important that metrics begin with a rigorous definition of the concept measured by the test. For instance, if the test is to measure how democratic a society is, it first needs to establish a definition for a “democratic society” and then establish questions that are relevant to that definition. 

To apply this to a benchmark like SWE-Bench, designers would need to set aside the classic machine learning approach, which is to collect programming problems from GitHub and create a scheme to validate answers as true or false. Instead, they’d first need to define what the benchmark aims to measure (“ability to resolve flagged issues in software,” for instance), break that into subskills (different types of problems or types of programs that the AI model can successfully process), and then finally assemble questions that accurately cover the different subskills.
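
A hedged sketch of what that validity-first structure might look like in practice: a spec that names the capability, lists the claimed subskills, and ties each one to concrete tasks, so it is at least visible which claims the benchmark can and cannot support. The subskill categories and task IDs here are invented for illustration.

```python
# Hypothetical validity-first benchmark spec: capability -> subskills -> tasks.
# The subskill names and task IDs are invented for illustration only.
benchmark_spec = {
    "capability": "ability to resolve flagged issues in software",
    "subskills": {
        "bug_localization":    ["task_017", "task_042", "task_118"],
        "api_misuse_fixes":    ["task_101"],
        "dependency_upgrades": ["task_205", "task_208"],
    },
}

def coverage_report(spec: dict) -> dict:
    """Show how many tasks back each claimed subskill, so gaps are visible."""
    return {skill: len(tasks) for skill, tasks in spec["subskills"].items()}

print(coverage_report(benchmark_spec))
# A subskill backed by only one task (or none) is a validity warning sign: the
# benchmark is claiming more than its questions actually cover.
```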

It’s a profound change from how AI researchers typically approach benchmarking—but for researchers like Jacobs, a coauthor on the February paper, that’s the whole point. “There’s a mismatch between what’s happening in the tech industry and these tools from social science,” she says. “We have decades and decades of thinking about how we want to measure these squishy things about humans.”

Even though the idea has made a real impact in the research world, it’s been slow to influence the way AI companies are actually using benchmarks. 

The last two months have seen new model releases from OpenAI, Anthropic, Google, and Meta, and all of them lean heavily on multiple-choice knowledge benchmarks like MMLU—the exact approach that validity researchers are trying to move past. After all, model releases are, for the most part, still about showing increases in general intelligence, and broad benchmarks continue to be used to back up those claims. 

For some observers, that’s good enough. Benchmarks, Wharton professor Ethan Mollick says, are “bad measures of things, but also they’re what we’ve got.” He adds: “At the same time, the models are getting better. A lot of sins are forgiven by fast progress.”

For now, the industry’s long-standing focus on artificial general intelligence seems to be crowding out a more focused validity-based approach. As long as AI models can keep growing in general intelligence, then specific applications don’t seem as compelling—even if that leaves practitioners relying on tools they no longer fully trust. 

“This is the tightrope we’re walking,” says Hugging Face’s Solaiman. “It’s too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations.”

Russell Brandom is a freelance writer covering artificial intelligence. He lives in Brooklyn with his wife and two cats.

This story was supported by a grant from the Tarbell Center for AI Journalism.

Your gut microbes might encourage criminal behavior

A few years ago, a Belgian man in his 30s drove into a lamppost. Twice. Local authorities found that his blood alcohol level was four times the legal limit. Over the space of a few years, the man was apprehended for drunk driving three times. And on all three occasions, he insisted he hadn’t been drinking.

He was telling the truth. A doctor later diagnosed auto-brewery syndrome—a rare condition in which the body makes its own alcohol. Microbes living inside the man’s body were fermenting the carbohydrates in his diet to create ethanol. Last year, he was acquitted of drunk driving.

His case, along with several other scientific studies, raises a fascinating question for microbiology, neuroscience, and the law: How much of our behavior can we blame on our microbes?

Each of us hosts vast communities of tiny bacteria, archaea (which are a bit like bacteria), fungi, and even viruses all over our bodies. The largest collection resides in our guts, which are home to trillions of them. You have more microbial cells than human cells in your body. In some ways, we’re more microbe than human.

Microbiologists are still getting to grips with what all these microbes do. Some seem to help us break down food. Others produce chemicals that are important for our health in some way. But the picture is extremely complicated, partly because of the myriad ways microbes can interact with each other.

But they also interact with the human nervous system. Microbes can produce compounds that affect the way neurons work. They also influence the functioning of the immune system, which can have knock-on effects on the brain. And they seem to be able to communicate with the brain via the vagus nerve.

If microbes can influence our brains, could they also explain some of our behavior, including the criminal sort? Some microbiologists think so, at least in theory. “Microbes control us more than we think they do,” says Emma Allen-Vercoe, a microbiologist at the University of Guelph in Canada.

Researchers have come up with a name for applications of microbiology to criminal law: the legalome. A better understanding of how microbes influence our behavior could not only affect legal proceedings but also shape crime prevention and rehabilitation efforts, argue Susan Prescott, a pediatrician and immunologist at the University of Western Australia, and her colleagues.

“For the person unaware that they have auto-brewery syndrome, we can argue that microbes are like a marionettist pulling the strings in what would otherwise be labeled as criminal behavior,” says Prescott.

Auto-brewery syndrome is a fairly straightforward example (it has been involved in the acquittal of at least two people so far), but other brain-microbe relationships are likely to be more complicated. We do know a little about one microbe that seems to influence behavior: Toxoplasma gondii, a parasite that reproduces in cats and spreads to other animals via cat feces.

The parasite is best known for changing the behavior of rodents in ways that make them easier prey—an infection seems to make mice permanently lose their fear of cats. Research in humans is nowhere near conclusive, but some studies have linked infections with the parasite to personality changes, increased aggression, and impulsivity.

“That’s an example of microbiology that we know affects the brain and could potentially affect the legal standpoint of someone who’s being tried for a crime,” says Allen-Vercoe. “They might say ‘My microbes made me do it,’ and I might believe them.”

There’s more evidence linking gut microbes to behavior in mice, which are some of the most well-studied creatures. One study involved fecal transplants—a procedure that involves inserting fecal matter from one animal into the intestines of another. Because feces contain so many gut bacteria, fecal transplants can go some way toward swapping out a gut microbiome. (Humans are doing this too—and it seems to be a remarkably effective way to treat persistent C. difficile infections in people.)

Back in 2013, scientists at McMaster University in Canada performed fecal transplants between two strains of mice, one that is known for being timid and another that tends to be rather gregarious. This swapping of gut microbes also seemed to swap their behavior—the timid mice became more gregarious, and vice versa.

Microbiologists have since held up this study as one of the clearest demonstrations of how changing gut microbes can change behavior—at least in mice. “But the question is: How much do they control you, and how much is the human part of you able to overcome that control?” says Allen-Vercoe. “And that’s a really tough question to answer.”

After all, our gut microbiomes, though relatively stable, can change. Your diet, exercise routine, environment, and even the people you live with can shape the communities of microbes that live on and in you. And the ways these communities shift and influence behavior might be slightly different for everyone. Pinning down precise links between certain microbes and criminal behaviors will be extremely difficult, if not impossible. 

“I don’t think you’re going to be able to take someone’s microbiome and say ‘Oh, look—you’ve got bug X, and that means you’re a serial killer,’” says Allen-Vercoe.

Either way, Prescott hopes that advances in microbiology and metabolomics might help us better understand the links between microbes, the chemicals they produce, and criminal behaviors—and potentially even treat those behaviors.

“We could get to a place where microbial interventions are a part of therapeutic programming,” she says.

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

A new AI translation system for headphones clones multiple voices simultaneously

Imagine going for dinner with a group of friends who switch in and out of different languages you don’t speak, but still being able to understand what they’re saying. This scenario is the inspiration for a new AI headphone system that translates the speech of multiple speakers simultaneously, in real time.

The system, called Spatial Speech Translation, tracks the direction and vocal characteristics of each speaker, helping the person wearing the headphones to identify who is saying what in a group setting. 

“There are so many smart people across the world, and the language barrier prevents them from having the confidence to communicate,” says Shyam Gollakota, a professor at the University of Washington, who worked on the project. “My mom has such incredible ideas when she’s speaking in Telugu, but it’s so hard for her to communicate with people in the US when she visits from India. We think this kind of system could be transformative for people like her.”

While there are plenty of other live AI translation systems out there, such as the one running on Meta’s Ray-Ban smart glasses, they focus on a single speaker, not multiple people speaking at once, and deliver robotic-sounding automated translations. The new system is designed to work with existing, off-the-shelf noise-canceling headphones that have microphones, plugged into a laptop powered by Apple’s M2 silicon chip, which can support neural networks. The same chip is also present in the Apple Vision Pro headset. The research was presented at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan, this month.

Over the past few years, large language models have driven big improvements in speech translation. As a result, translation between languages for which lots of training data is available (such as the four languages used in this study) is close to perfect on apps like Google Translate or in ChatGPT. But it’s still not seamless and instant across many languages. That’s a goal a lot of companies are working toward, says Alina Karakanta, an assistant professor at Leiden University in the Netherlands, who studies computational linguistics and was not involved in the project. “I feel that this is a useful application. It can help people,” she says. 

Spatial Speech Translation consists of two AI models, the first of which divides the space surrounding the person wearing the headphones into small regions and uses a neural network to search for potential speakers and pinpoint their direction. 

The second model then translates the speakers’ words from French, German, or Spanish into English text using publicly available data sets. The same model extracts the unique characteristics and emotional tone of each speaker’s voice, such as the pitch and the amplitude, and applies those properties to the text, essentially creating a “cloned” voice. This means that when the translated version of a speaker’s words is relayed to the headphone wearer a few seconds later, it sounds as if it’s coming from the speaker’s direction and the voice sounds a lot like the speaker’s own, not a robotic-sounding computer.
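
Conceptually, the system chains those two models together, and the sketch below is only a stand-in for that flow: every function body is a trivial placeholder, and none of the names correspond to the researchers’ actual code.

```python
# Conceptual stand-in for the two-model pipeline described above. Every function
# body is a placeholder; none of this is the researchers' actual code.
from dataclasses import dataclass

@dataclass
class Speaker:
    direction_deg: float  # estimated direction of the speaker around the wearer
    pitch: float          # vocal characteristic reused for the "cloned" voice
    audio: bytes          # that speaker's isolated speech

def locate_speakers(mic_frames: list) -> list[Speaker]:
    """Model 1 (placeholder): scan regions around the wearer for active speakers."""
    return [Speaker(direction_deg=40.0, pitch=180.0, audio=b"")]

def translate_and_revoice(speaker: Speaker) -> dict:
    """Model 2 (placeholder): speech -> English text, replayed in a cloned voice."""
    return {"text": "translated English text",
            "pitch": speaker.pitch,
            "direction_deg": speaker.direction_deg}

def pipeline(mic_frames: list) -> list[dict]:
    # The output is played back a few seconds later, spatialized so each translated
    # voice appears to come from its speaker's direction.
    return [translate_and_revoice(s) for s in locate_speakers(mic_frames)]

print(pipeline(mic_frames=[]))
```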

Given that separating out human voices is hard enough for AI systems, being able to incorporate that ability into a real-time translation system, map the distance between the wearer and the speaker, and achieve decent latency on a real device is impressive, says Samuele Cornell, a postdoc researcher at Carnegie Mellon University’s Language Technologies Institute, who did not work on the project.

“Real-time speech-to-speech translation is incredibly hard,” he says. “Their results are very good in the limited testing settings. But for a real product, one would need much more training data—possibly with noise and real-world recordings from the headset, rather than purely relying on synthetic data.”

Gollakota’s team is now focusing on reducing the amount of time it takes for the AI translation to kick in after a speaker says something, which will accommodate more natural-sounding conversations between people speaking different languages. “We want to really get down that latency significantly to less than a second, so that you can still have the conversational vibe,” Gollakota says.

This remains a major challenge, because the speed at which an AI system can translate one language into another depends on the languages’ structure. Of the three languages Spatial Speech Translation was trained on, the system was quickest to translate French into English, followed by Spanish and then German—reflecting how German, unlike the other languages, places a sentence’s verbs and much of its meaning at the end and not at the beginning, says Claudio Fantinuoli, a researcher at the Johannes Gutenberg University of Mainz in Germany, who did not work on the project. 

Reducing the latency could make the translations less accurate, he warns: “The longer you wait [before translating], the more context you have, and the better the translation will be. It’s a balancing act.”