The Pentagon is gutting the team that tests AI and weapons systems

The Trump administration’s chainsaw approach to federal spending lives on, even as Elon Musk turns on the president. On May 28, Secretary of Defense Pete Hegseth announced he’d be gutting a key office at the Department of Defense responsible for testing and evaluating the safety of weapons and AI systems.

As part of a string of moves aimed at “reducing bloated bureaucracy and wasteful spending in favor of increased lethality,” Hegseth cut the size of the Office of the Director of Operational Test and Evaluation in half. The group was established in the 1980s—following orders from Congress—after criticisms that the Pentagon was fielding weapons and systems that didn’t perform as safely or effectively as advertised. Hegseth is reducing the agency’s staff to about 45, down from 94, and firing and replacing its director. He gave the office just seven days to implement the changes.

It is a significant overhaul of a department that in 40 years has never before been placed so squarely on the chopping block. Here’s how today’s defense tech companies, which have fostered close connections to the Trump administration, stand to gain, and why safety testing might suffer as a result. 

The Operational Test and Evaluation office is “the last gate before a technology gets to the field,” says Missy Cummings, a former fighter pilot for the US Navy who is now a professor of engineering and computer science at George Mason University. Though the military can run small experiments with new systems without going through the office, anything that gets fielded at scale has to pass its testing.

“In a bipartisan way—up until now—everybody has seen it’s working to help reduce waste, fraud, and abuse,” she says. That’s because it provides an independent check on companies’ and contractors’ claims about how well their technology works. It also aims to expose the systems to more rigorous safety testing.

The gutting comes at a particularly pivotal time for military adoption of AI: The Pentagon is experimenting with putting AI into everything, mainstream companies like OpenAI are now more comfortable working with the military, and defense giants like Anduril are winning big contracts to launch AI systems (last Thursday, Anduril announced a whopping $2.5 billion funding round, doubling its valuation to over $30 billion).

Hegseth claims his cuts will “make testing and fielding weapons more efficient,” saving $300 million. But Cummings is concerned that they are paving the way for faster adoption while increasing the chances that new systems won’t be as safe or effective as promised. “The firings in DOTE send a clear message that all perceived obstacles for companies favored by Trump are going to be removed,” she says.

Anduril and Anthropic, which have launched AI applications for military use, did not respond to my questions about whether they pushed for or approve of the cuts. A representative for OpenAI said that the company was not involved in lobbying for the restructuring. 

“The cuts make me nervous,” says Mark Cancian, a senior advisor at the Center for Strategic and International Studies who previously worked at the Pentagon in collaboration with the testing office. “It’s not that we’ll go from effective to ineffective, but you might not catch some of the problems that would surface in combat without this testing step.”

It’s hard to say precisely how the cuts will affect the office’s ability to test systems, and Cancian admits that those responsible for getting new technologies out onto the battlefield sometimes complain that it can really slow down adoption. But still, he says, the office frequently uncovers errors that weren’t previously caught.

It’s an especially important step, Cancian says, whenever the military is adopting a new type of technology like generative AI. Systems that might perform well in a lab setting almost always encounter new challenges in more realistic scenarios, and the Operational Test and Evaluation group is where that rubber meets the road.

So what to make of all this? It’s true that the military was experimenting with artificial intelligence long before the current AI boom, particularly with computer vision for drone feeds, and defense tech companies have been winning big contracts for this push across multiple presidential administrations. But this era is different. The Pentagon is announcing ambitious pilots specifically for large language models, a relatively nascent technology that by its very nature produces hallucinations and errors, and it appears eager to put much-hyped AI into everything. The key independent group dedicated to evaluating the accuracy of these new and complex systems now only has half the staff to do it. I’m not sure that’s a win for anyone.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

IBM aims to build the world’s first large-scale, error-corrected quantum computer by 2028

IBM announced detailed plans today to build an error-corrected quantum computer with significantly more computational capability than existing machines by 2028. It hopes to make the computer available to users via the cloud by 2029. 

The proposed machine, named Starling, will consist of a network of modules, each of which contains a set of chips, housed within a new data center in Poughkeepsie, New York. “We’ve already started building the space,” says Jay Gambetta, vice president of IBM’s quantum initiative.

IBM claims Starling will be a leap forward in quantum computing. In particular, the company aims for it to be the first large-scale machine to implement error correction. If Starling achieves this, IBM will have cleared arguably the biggest technical hurdle facing the industry today, beating competitors including Google, Amazon Web Services, and smaller startups such as Boston-based QuEra and PsiQuantum of Palo Alto, California.

IBM, along with the rest of the industry, has years of work ahead. But Gambetta thinks it has an edge because it has all the building blocks to build error correction capabilities in a large-scale machine. That means improvements in everything from algorithm development to chip packaging. “We’ve cracked the code for quantum error correction, and now we’ve moved from science to engineering,” he says. 

Correcting errors in a quantum computer has been an engineering challenge, owing to the unique way the machines crunch numbers. Whereas classical computers encode information in the form of bits, or binary 1 and 0, quantum computers instead use qubits, which can represent “superpositions” of both values at once. IBM builds qubits made of tiny superconducting circuits, kept near absolute zero, in an interconnected layout on chips. Other companies have built qubits out of other materials, including neutral atoms, ions, and photons.
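
In more formal notation, a qubit’s state is a weighted combination of the two classical values, with complex weights whose squared magnitudes give the probability of measuring each outcome. This is the standard textbook expression, not anything specific to IBM’s hardware:

```latex
% A single qubit in superposition: a weighted mix of the states |0> and |1>,
% where the measurement probabilities of the two outcomes sum to one.
\[
  \lvert \psi \rangle \;=\; \alpha \lvert 0 \rangle + \beta \lvert 1 \rangle,
  \qquad \lvert \alpha \rvert^2 + \lvert \beta \rvert^2 = 1
\]
```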

Quantum computers sometimes make errors, such as when the hardware operates on one qubit but accidentally also alters a neighboring qubit that should not be involved in the computation. These errors add up over time. Without error correction, quantum computers cannot accurately perform the complex algorithms that are expected to be the source of their scientific or commercial value, such as extremely precise chemistry simulations for discovering new materials and pharmaceutical drugs.

But error correction requires significant hardware overhead. Instead of encoding a single unit of information in a single “physical” qubit, error correction algorithms encode a unit of information in a constellation of physical qubits, referred to collectively as a “logical qubit.”

Currently, quantum computing researchers are competing to develop the best error correction scheme. Google’s surface code algorithm, while effective at correcting errors, requires on the order of 100 physical qubits to store a single logical qubit in memory. AWS’s Ocelot quantum computer uses a more efficient error correction scheme that requires nine physical qubits per logical qubit in memory. (The overhead is higher for qubits performing computations than for those storing data.) IBM’s error correction algorithm, known as a low-density parity check code, will make it possible to use 12 physical qubits per logical qubit in memory, a ratio comparable to AWS’s.
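
To put those ratios in perspective, here is a back-of-the-envelope sketch of how many physical qubits each scheme would need just to hold a given number of logical qubits in memory. It uses only the per-logical-qubit figures quoted above and ignores the additional qubits needed for computation, so it is illustrative rather than a description of any company’s actual chip layout:

```python
# Rough physical-qubit counts implied by the memory overheads quoted above.
# Illustrative only: real machines also need extra qubits for computation,
# routing, and measurement, which these figures do not include.

OVERHEADS = {
    "surface code (Google)": 100,   # ~100 physical qubits per logical qubit
    "Ocelot-style code (AWS)": 9,   # 9 physical qubits per logical qubit
    "LDPC code (IBM)": 12,          # 12 physical qubits per logical qubit
}

def physical_qubits(logical_qubits: int, overhead: int) -> int:
    """Physical qubits needed to hold `logical_qubits` in memory."""
    return logical_qubits * overhead

if __name__ == "__main__":
    n_logical = 200  # Starling's planned logical-qubit count, per IBM
    for name, overhead in OVERHEADS.items():
        print(f"{name}: {physical_qubits(n_logical, overhead):,} physical qubits")
```

With the 200 logical qubits planned for Starling, those overheads work out to roughly 20,000, 1,800, and 2,400 physical qubits respectively, which is why the choice of code matters so much for how large the machine has to be.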

One distinguishing characteristic of Starling’s design will be its anticipated ability to diagnose errors, known as decoding, in real time. Decoding involves determining whether a measured signal from the quantum computer corresponds to an error. IBM has developed a decoding algorithm that can be quickly executed by a type of conventional chip known as an FPGA. This work bolsters the “credibility” of IBM’s error correction method, says Neil Gillespie of the UK-based quantum computing startup Riverlane. 

However, other error correction schemes and hardware designs aren’t out of the running yet. “It’s still not clear what the winning architecture is going to be,” says Gillespie. 

IBM intends Starling to be able to perform computational tasks beyond the capability of classical computers. Starling will have 200 logical qubits, which will be constructed using the company’s chips. It should be able to perform 100 million logical operations consecutively with accuracy; existing quantum computers can do so for only a few thousand. 

The system will demonstrate error correction at a much larger scale than anything done before, claims Gambetta. Previous error correction demonstrations, such as those done by Google and Amazon, involve a single logical qubit, built from a single chip. Gambetta calls them “gadget experiments,” saying “They’re small-scale.” 

Still, it’s unclear whether Starling will be able to solve practical problems. Some experts think that you need a billion error-corrected logical operations to execute any useful algorithm. Starling represents “an interesting stepping-stone regime,” says Wolfgang Pfaff, a physicist at the University of Illinois Urbana-Champaign. “But it’s unlikely that this will generate economic value.” (Pfaff, who studies quantum computing hardware, has received research funding from IBM but is not involved with Starling.) 

The timeline for Starling looks feasible, according to Pfaff. The design is “based in experimental and engineering reality,” he says. “They’ve come up with something that looks pretty compelling.” But building a quantum computer is hard, and it’s possible that IBM will encounter delays due to unforeseen technical complications. “This is the first time someone’s doing this,” he says of making a large-scale error-corrected quantum computer.

IBM’s road map involves first building smaller machines before Starling. This year, it plans to demonstrate that error-corrected information can be stored robustly in a chip called Loon. Next year the company will build Kookaburra, a module that can both store information and perform computations. By the end of 2027, it plans to connect two Kookaburra-type modules together into a larger quantum computer, Cockatoo. After demonstrating that successfully, the next step is to scale up and connect around 100 modules to create Starling.

This strategy, says Pfaff, reflects the industry’s recent embrace of “modularity” when scaling up quantum computers—networking multiple modules together to create a larger quantum computer rather than laying out qubits on a single chip, as researchers did in earlier designs. 

IBM is also looking beyond 2029. After Starling, it plans to build another machine, Blue Jay. (“I like birds,” says Gambetta.) Blue Jay will contain 2,000 logical qubits and is expected to be capable of a billion logical operations.

Inside the race to find GPS alternatives

Later this month, an inconspicuous 150-kilogram satellite is set to launch into space aboard the SpaceX Transporter 14 mission. Once in orbit, it will test super-accurate next-generation satnav technology designed to make up for the shortcomings of the US Global Positioning System (GPS). 

The satellite is the first of a planned constellation called Pulsar, which is being developed by California-based Xona Space Systems. The company ultimately plans to have a constellation of 258 satellites in low Earth orbit. Although these satellites will operate much like the ones that make up GPS, they will orbit about 12,000 miles closer to Earth’s surface, beaming down a much stronger signal that’s more accurate—and harder to jam.

“Just because of this shorter distance, we will put down signals that will be approximately a hundred times stronger than the GPS signal,” says Tyler Reid, chief technology officer and cofounder of Xona. “That means the reach of jammers will be much smaller against our system, but we will also be able to reach deeper into indoor locations, penetrating through multiple walls.”

A satnav system for the 21st century

GPS went live in 1993. In the decades since, it has become one of the foundational technologies that the world depends on. The precise positioning, navigation, and timing (PNT) signals beamed by its satellites underpin much more than the Google Maps app on your phone. They guide drill heads at offshore oil rigs, time-stamp financial transactions, and help sync power grids all over the world.

But despite the system’s indispensable nature, the GPS signal is easily suppressed or disrupted by everything from space weather to 5G cell towers to phone-size jammers that cost a few tens of dollars. The problem has been whispered about among experts for years, but it has really come to the fore in the last three years, since Russia invaded Ukraine. The boom in drone warfare that came to characterize that war also triggered a race to develop technology for thwarting drone attacks by jamming the GPS signals they need to navigate—or spoofing the signal, creating convincing but fake positioning data.

The crucial problem is one of distance: The GPS constellation, which consists of 24 satellites plus a handful of spares, orbits 12,550 miles (20,200 kilometers) above Earth, in a region known as medium Earth orbit. By the time their signals get all the way down to ground-based receivers, they are so faint that they can easily be overridden by jammers.
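
The faintness comes down to the inverse-square law: received power falls off with the square of the distance to the transmitter. Here is a rough sketch of what that implies, assuming for illustration a low-Earth-orbit altitude of 2,000 kilometers (the top of the LEO range mentioned below) and identical transmitters, which real satellites are not:

```python
# Inverse-square comparison of received signal strength from GPS altitude
# versus a low-Earth-orbit altitude. Assumes identical transmit power and
# antennas, which is not true in practice; the point is the distance effect.

GPS_ALTITUDE_KM = 20_200   # medium Earth orbit, as cited above
LEO_ALTITUDE_KM = 2_000    # assumed LEO altitude, for illustration only

def relative_power(near_km: float, far_km: float) -> float:
    """How many times stronger a signal is from `near_km` than from `far_km`."""
    return (far_km / near_km) ** 2

print(f"~{relative_power(LEO_ALTITUDE_KM, GPS_ALTITUDE_KM):.0f}x stronger")
# ~102x stronger -- the same order of magnitude as Xona's claim
```

Even at that conservative altitude, the distance effect alone accounts for roughly a hundredfold gain; differences in transmit power and antenna design shift the real number in either direction.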

Other existing Global Navigation Satellite System constellations, such as Europe’s Galileo, Russia’s GLONASS, and China’s Beidou, have similar architectures and experience the same problems.

But when Reid and cofounder Brian Manning founded Xona Space Systems in 2019, they didn’t think about jamming and spoofing. Their goal was to make autonomous driving ready for prime time. 

Xona Space Systems’ completed Pulsar-0 satellite is launching this June. (Image: Aerospacelab)

Dozens of robocars from Uber and Waymo were already cruising American freeways at that time, equipped with expensive suites of sensors like high-resolution cameras and lidar. The engineers figured a more precise satellite navigation system could reduce the need for those sensors, making it possible to create a safe autonomous vehicle affordable enough to go mainstream. One day, cars might even be able to share their positioning data with one another, Reid says. But they knew that GPS was nowhere near accurate enough to keep self-driving cars within the lane lines and away from other objects on the road. That is especially true in densely built-up urban environments that provide many chances for signals to bounce off walls, creating errors.

“GPS has the superpower of being a ubiquitous system that works the same anywhere in the world,” Reid says. “But it’s a system that was designed primarily to support military missions, virtually to enable them to drop five bombs in the same bowl. But this meter-level accuracy is not enough to guide machines where they need to go and share that physical space with humans safely.”

Reid and Manning began to think about how to build a space-based PNT system that would do what GPS does but better, with accuracy of four inches (10 centimeters) or less and ironclad reliability in all sorts of challenging conditions.

The easiest way to do that is to bring the satellites closer to Earth so that data reaches receivers in real time without inaccuracy-causing delays. The stronger signal of satellites in low Earth orbit is more resistant to disruptions of all sorts. 

When GPS was conceived, none of that was possible. Constellations in low Earth orbit—altitudes up to 1,200 miles (2,000 km)—require hundreds of satellites to provide constant coverage over the entire globe. For a long time, space technology was too bulky and expensive to make such large constellations viable. Over the past decade, however, smaller electronics and lower launch costs have changed the equation.

“In 2019, when we started, the ecosystem of low Earth orbit was really exploding,” Reid says. “We could see things like Starlink, OneWeb, and other constellations take off.”

Matter of urgency

In the few years since Xona was founded, concerns about GPS’s vulnerability have begun to grow amid rising geopolitical tensions. As a result, finding a reliable replacement has become a matter of strategic importance.

In Ukraine especially, GPS jamming and spoofing have become so common that prized US precision munitions such as the High Mobility Artillery Rocket System became effectively blind. Makers of first-person-view drones, which came to symbolize the war, had to refocus on AI-driven autonomous navigation to keep those drones in the game. 

The problem quickly spilled beyond Ukraine. Countries bordering Russia, such as Finland and Estonia, complained that the increasing prevalence of GPS jamming and spoofing was affecting commercial flights and ships in the region.

But Clémence Poirier, a space security researcher at ETH Zurich, says that the problem of GPS disruption isn’t limited to the vicinity of war zones.

“Basic jammers are very cheap and super easily accessible to everyone online,” Poirier says. “Even with the simplest ones, which can be the size of your phone, you can disrupt GPS signals in [an] area of a hundred or more meters.”

In 2013, a truck driver using such a device to conceal his location from his boss accidentally disrupted GPS signals around the Newark airport in New Jersey. In 2022, Dallas Fort Worth International Airport reported a 24-hour GPS outage, which prompted a temporary closure of one of its runways. The source of the interference was never identified. That same year, Denver International Airport experienced a 33-hour GPS disruption.

The race to secure PNT

“Xona is a promising solution to enhance the resilience of GPS-dependent critical infrastructures and mitigate the threat of GPS jamming and spoofing,” Poirier says. But, she adds, there is no “magic wand,” and a “variety of different approaches will be needed” to solve the problem.

And indeed, Xona is not the only company hoping to provide a backup for the indispensable yet increasingly vulnerable GPS. Companies such as Anello Photonics, based in Santa Clara, California, and Sydney-based Advanced Navigation are testing terrestrial solutions: inertial navigation devices that are small and affordable enough for use beyond high-end military tech. These systems rely on gyroscopes and accelerometers to deduce a vehicle’s position from its own motions. 

When integrated into PNT receivers, these technologies can help detect GPS spoofing and take over for the duration of the interference. Inertial navigation has been around for decades, but recent advances in photonic technologies and microelectromechanical systems have brought it into the mainstream.
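
At its core, inertial navigation is dead reckoning: integrate measured acceleration to get velocity, then integrate velocity to get position. Below is a deliberately stripped-down, one-dimensional sketch of that idea; real systems work in three dimensions, use gyroscopes to track orientation, and fuse in outside fixes to correct drift:

```python
# Minimal 1-D dead-reckoning loop: integrate accelerometer readings twice
# to track position. Real inertial navigation is 3-D, uses gyroscopes to
# track orientation, and blends in GNSS or other fixes to correct drift.

def dead_reckon(accels_m_s2, dt_s, v0=0.0, x0=0.0):
    """Return estimated positions given accelerometer samples at interval dt_s."""
    v, x = v0, x0
    positions = []
    for a in accels_m_s2:
        v += a * dt_s          # integrate acceleration -> velocity
        x += v * dt_s          # integrate velocity -> position
        positions.append(x)
    return positions

# Example: constant 1 m/s^2 acceleration for 10 seconds, sampled at 10 Hz.
print(dead_reckon([1.0] * 100, dt_s=0.1)[-1])  # ~50 m, close to 0.5*a*t^2
```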

The French aerospace and defense conglomerate Safran is developing a system that distributes PNT data via optical-fiber networks, which form the backbone of the global internet infrastructure. But the allure of space remains strong: The ability to reach any place at any time is what turned GPS from an obscure military system into a piece of taken-for-granted infrastructure that most people today can hardly live without.

And Xona could have some space-based competition. Virginia-based TrustPoint is currently raising funds to build its own low-Earth-orbit PNT constellation, and some have proposed that signals from SpaceX’s Starlink could be repurposed to provide PNT services as well.

Xona hopes to secure its spot in the market by designing its signal to be compatible with that of GPS, allowing manufacturers of GPS receivers to easily slot the new constellation into existing tech. 

Although it will take at least until 2030 for the entire constellation to be up and running, Reid says Xona’s system will provide a valuable addition to the existing GPS infrastructure as soon as 16 of its satellites are in orbit. 

The upcoming launch comes three years after a demonstration mission known as Huginn tested the basics of the technology. The new satellite, called Pulsar-0, will be used to see how well the system can resist jamming or spoofing.

Xona plans to launch an additional four spacecraft next year and hopes to have most of the constellation deployed by 2030. 

Crypto billionaire Brian Armstrong is ready to invest in CRISPR baby tech

Brian Armstrong, the billionaire CEO of the cryptocurrency exchange Coinbase, says he’s ready to fund a US startup focused on gene-editing human embryos. If he goes forward, it would be the first major commercial investment in one of medicine’s most fraught ideas.

In a June 2 post on X, Armstrong announced he was looking for gene-editing scientists and bioinformatics specialists to form a founding team for an “embryo editing” effort targeting an unmet medical need, such as a genetic disease.

“I think the time is right for the defining company in the US to be built in this area,” Armstrong posted. 

The announcement from a deep-pocketed backer is a striking shift for a field considered taboo following the 2018 birth of the world’s first genetically edited children in China—a secretive experiment that led to international outrage and prison time for the lead scientist.

According to Dieter Egli, a gene-editing scientist at Columbia University whose team has briefed Armstrong, his plans may be motivated in part by recent improvements in editing technology that have opened up a safer, more precise way to change the DNA of embryos.

That technique, called base editing, can deftly change a single DNA letter. Earlier methods, on the other hand, actually cut the double helix, damaging it and causing whole genes to disappear. “We know much better now what to do,” says Egli. “It doesn’t mean the work is all done, but it’s a very different game now—entirely different.”  

Shoestring budget

Embryo editing, which ultimately aims to produce humans with genes tailored by design, is an idea that has been heavily stigmatized and starved of funding. While it’s legal to study embryos in the lab, actually producing a gene-edited baby is flatly illegal in most countries.

In the US, the CRISPR baby ban operates via a law that forbids the Food and Drug Administration from considering, or even acknowledging, any application it gets to attempt a gene-edited baby. But that rule could be changed, especially if scientists can demonstrate a compelling use of the technique—or perhaps if a billionaire lobbies for it.

In his post, Armstrong included an image of a seven-year-old Pew Research Center poll showing that Americans strongly favored altering a baby’s genes if doing so could treat disease, although the same poll found most opposed experimentation on embryos.

Up until this point, no US company has openly pursued embryo editing, and the federal government doesn’t fund studies on embryos at all. Instead, research on gene editing in embryos has been carried forward in the US by just two academic centers, Egli’s and one at the Oregon Health & Science University.

Those efforts have operated on a shoestring, held together by private grants and university funds. Researchers at those centers said they support the idea of a well-financed company that could advance the technology. “We would honestly welcome that,” says Paula Amato, a fertility doctor at Oregon Health & Science University and the past president of the American Society for Reproductive Medicine. 

“More research is needed, and that takes people and money,” she says, adding that she doesn’t mind if it comes from “tech bros.”

Editing embryos can, in theory, be used to correct genetic errors likely to cause serious childhood conditions. But since in most cases genetic testing of embryos can also be used to avoid those errors, many argue it will be hard to find a true unmet need where the DNA-altering technique is actually necessary.

Instead, it’s easy to conclude that the bigger market for the technology would be to intervene in embryos in ways that could make humans resistant to common conditions, such as heart disease or Alzheimer’s. But that is more controversial because it’s a type of enhancement, and the changes would also be passed through the generations.

Only last week, several biotech trade and academic groups demanded a 10-year moratorium on heritable human genome editing, saying the technology has few real medical uses and “introduces long-term risks with unknown consequences.”

They said the ability to “program” desired traits or eliminate bad ones risked a new form of “eugenics,” one that would have the effect of “potentially altering the course of evolution.”

No limits

Armstrong did not reply to an email from MIT Technology Review seeking comment about his plans. Nor did his company Coinbase, a cryptocurrency trading platform that went public in 2021 and is the source of his fortune, estimated at $10 billion by Forbes.

The billionaire is already part of a wave of tech entrepreneurs who’ve made a splash in science and biology by laying down outsize investments, sometimes in far-out ideas. Armstrong previously cofounded NewLimit, which Bloomberg calls a “life extension venture” and which this year raised a further $130 million to explore methods to reprogram old cells into an embryonic-like state.

He started that company with Blake Byers, an investor who has said a significant portion of global GDP should be spent on “immortality” research, including biotech approaches and ways of uploading human minds to computers.

Then, starting late last year, Armstrong began publicly telegraphing his interest in exploring a new venture, this time connected to assisted reproduction. In December, he announced on X that he and Byers were ready to meet with entrepreneurs working on “artificial wombs,” “embryo editing,” and “next-gen IVF.”

The post invited people to apply to attend an off-the-record dinner—a kind of forbidden-technologies soiree. Applicants had to fill in a Google form answering a few questions, including “What is something awesome you’ve built?”

Among those who attended the dinner was a postdoctoral fellow from Egli’s lab, Stepan Jerabek, who has been testing base editing in embryos. Another attendee, Lucas Harrington, is a gene-editing scientist who trained at the University of California, Berkeley under Jennifer Doudna, a winner of the Nobel Prize in chemistry for the development of CRISPR gene editing. Harrington says a venture group he helps run, called SciFounders, is also considering starting an embryo-editing company.

“We share an interest in there being a company to empirically evaluate whether embryo editing can be done safely, and are actively exploring incubating a company to undertake this,” Harrington said in an email. “We believe there need to be legitimate scientists and clinicians working to safely evaluate this technology.”

Because of how rapidly gene editing is advancing, Harrington has also criticized bans and moratoria on the technology. These can’t stop it from being applied but, he says, can drive it into “the shadows,” where it might be used less safely. According to Harrington, “several biohacker groups have quietly raised small amounts of capital” to pursue the technology.

By contrast, Armstrong’s public declaration on X represents a more transparent approach. “It seems pretty serious now. They want to put something together,” says Egli, who hopes the Coinbase CEO might fund some research at his lab. “I think it’s very good he posted publicly, because you can feel the temperature, see what reaction you get, and you stimulate the public conversation.”

Editing error

The first reports that researchers were testing CRISPR on human embryos in the lab emerged from China in 2015, causing shock waves as it became clear how easy, in theory, it was to change human heredity. Two years later, in 2017, a report from Oregon claimed successful correction of a dangerous DNA mutation present in lab embryos made from patients’ egg and sperm cells.

But that breakthrough was not what it seemed. More careful testing by Egli and others showed that CRISPR technology actually can cause havoc in a cell, often deleting large chunks of chromosomes. That’s in addition to mosaicism, in which edits occur differently in different cells. What looked at first like precise DNA editing was in fact a dangerous process causing unseen damage.

While the public debate turned on the ethics of CRISPR babies—especially after three edited children were born in China—researchers were discussing basic scientific problems and how to solve them.

Since then, both US labs, as well as some in China, have switched to base editing. That method causes fewer unexpected effects and, in theory, could also endow an embryo with a number of advantageous gene variants, not just one change.

Company job

Some researchers also feel certain that editing an embryo is simpler than trying to treat sick adults. The only approved gene-editing treatment, for sickle-cell disease, costs more than $2 million. By contrast, editing an embryo could be incredibly cheap, and if it’s done early, when an embryo is forming, all the body cells could carry the change.

“You fix the text before you print the book,” says Egli. “It seems like a no-brainer.”

Still, gene editing isn’t quite ready for prime time in making babies. Getting there requires more work, including careful design of the editing system (which includes a protein and short guide molecule) and systematic ways to check embryos for unwanted DNA changes. That is the type of industrial effort Armstrong’s company, if he funds one, would be suited to carry out.

“You would have to optimize something to a point where it is perfect, to where it’s a breeze,” says Egli. “This is the kind of work that companies do.”

Over $1 billion in federal funding got slashed for this polluting industry

The clean cement industry might be facing the end of the road, before it ever really got rolling. 

On Friday, the US Department of Energy announced that it was canceling $3.7 billion in funding for 24 projects related to energy and industry. That included nearly $1.3 billion for cement-related projects.

Cement is a massive climate problem, accounting for roughly 7% of global greenhouse-gas emissions. What’s more, it’s a difficult industry to clean up, with huge traditional players and expensive equipment and infrastructure to replace. This funding was supposed to help address those difficulties, by supporting projects on the cusp of commercialization. Now companies will need to fill in the gap left by these cancellations, and it’s a big one. 

First up on the list for cuts is Sublime Systems, a company you’re probably familiar with if you’ve been reading this newsletter for a while. I did a deep dive last year, and the company was on our list of Climate Tech Companies to Watch in both 2023 and 2024.

The startup’s approach is to make cement using electricity. The conventional process requires high temperatures typically achieved by burning fossil fuels, so avoiding that could prevent a lot of emissions. 

In 2024, Sublime received an $87 million grant from the DOE to construct a commercial demonstration plant in Holyoke, Massachusetts. That grant would have covered roughly half the construction costs for the facility, which is scheduled to open in 2026 and produce up to 30,000 metric tons of cement each year. 

“We were certainly surprised and disappointed about the development,” says Joe Hicken, Sublime’s senior VP of business development and policy. Customers are excited by the company’s technology, Hicken adds, pointing to Sublime’s recently announced deal with Microsoft, which plans to buy up to 622,500 metric tons of cement from the company. 

Another big name, Brimstone, also saw its funding affected. That award totaled $189 million for a commercial demonstration plant, which was expected to produce over 100,000 metric tons of cement annually. 

In a statement, a Brimstone representative said the company believes the cancellation was a “misunderstanding.” The statement pointed out that the planned facility would make not only cement but also alumina, supporting US-based aluminum production. (Aluminum is classified as a critical mineral by the US Geological Survey, meaning it’s considered crucial to the US economy and national security.) 

An award to Heidelberg Materials for up to $500 million for a planned Indiana facility was also axed. The idea there was to integrate carbon capture and storage to clean up emissions from the plant, which would have made it the first cement plant in the US to demonstrate that technology. In a written statement, a representative said the decision can be appealed, and the company is considering that option.

And National Cement’s funding for the Lebec Net-Zero Project, another $500 million award, was canceled. That facility planned to make carbon-neutral cement through a combination of strategies: reducing the polluting ingredients needed, using alternative fuels like biomass, and capturing the plant’s remaining emissions. 

“We want to emphasize that this project will expand domestic manufacturing capacity for a critical industrial sector, while also integrating new technologies to keep American cement competitive,” said a company spokesperson in a written statement. 

There’s a sentiment here that’s echoed in all the responses I received: While these awards were designed to cut emissions, these companies argue that they can fit into the new administration’s priorities. They’re emphasizing phrases like “critical minerals,” “American jobs,” and “domestic supply chains.” 

“We’ve heard loud and clear from the Trump administration the desire to displace foreign imports of things that can be made here in America,” Sublime’s Hicken says. “At the end of the day, what we deliver is what the policymakers in DC are looking for.” 

But this administration is showing that it’s not supporting climate efforts—often even those that also advance its stated goals of energy abundance and American competitiveness. 

On Monday, my colleague James Temple published a new story about cuts to climate research, including tens of millions of dollars in grants from the National Science Foundation. Researchers at Harvard were particularly hard hit. 

Even as there’s interest in advancing the position of the US on the world’s stage, these cuts are making it hard for researchers and companies alike to do the crucial work of understanding our climate and developing and deploying new technologies. 

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

Manus has kick-started an AI agent boom in China

Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents—systems that are less about responding to users’ queries and more about autonomously accomplishing things for them. 

There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus—a general AI agent that sparked a weeks-long social media frenzy over invite codes after its limited-release launch in early March.

These emerging AI agents aren’t large language models themselves. Instead, they’re built on top of them, using a workflow-based structure designed to get things done. A lot of these systems also introduce a different way of interacting with AI. Rather than just chatting back and forth with users, they are optimized for managing and executing multistep tasks—booking flights, managing schedules, conducting research—by using external tools and remembering instructions. 
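
In practice, an agent built on top of an LLM with a workflow for tools and memory usually boils down to a loop: the model proposes the next action, the harness runs the corresponding tool, and the result is fed back into the model’s context until the task is done. The sketch below is a generic illustration of that pattern; call_llm and the tools are hypothetical placeholders, not any particular company’s API:

```python
# Minimal tool-using agent loop: the LLM picks an action, the harness runs it,
# and the observation is appended to memory for the next step. call_llm() and
# the tools here are hypothetical stand-ins, not a real product's interface.

def call_llm(prompt: str) -> dict:
    """Placeholder for a call to a language model that returns a JSON-like action."""
    raise NotImplementedError

TOOLS = {
    "search_web": lambda query: f"results for {query!r}",
    "send_email": lambda to, body: f"sent to {to}",
}

def run_agent(task: str, max_steps: int = 10) -> str:
    memory = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(memory))      # e.g. {"tool": "search_web", "args": {...}}
        if action.get("tool") == "finish":
            return action.get("answer", "")
        tool = TOOLS[action["tool"]]
        observation = tool(**action.get("args", {}))
        memory.append(f"Observation: {observation}")  # remembered for the next step
    return "Stopped after max_steps without finishing."
```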

China could take the lead on building these kinds of agents. The country’s tightly integrated app ecosystems, rapid product cycles, and digitally fluent user base could provide a favorable environment for embedding AI into daily life. 

For now, its leading AI agent startups are focusing their attention on the global market, because the best Western models don’t operate inside China’s firewalls. But that could change soon: Tech giants like ByteDance and Tencent are preparing their own AI agents that could bake automation directly into their native super-apps, pulling data from their vast ecosystem of programs that dominate many aspects of daily life in the country. 

As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice—and for whom.

Set the standard

It’s been a whirlwind few months for Manus, which was developed by the Wuhan-based startup Butterfly Effect. The company raised $75 million in a funding round led by the US venture capital firm Benchmark, took the product on an ambitious global roadshow, and hired dozens of new employees. 

Even before registration opened to the public in May, Manus had become a reference point for what a broad, consumer‑oriented AI agent should accomplish. Rather than handling narrow chores for businesses, this “general” agent is designed to be able to help with everyday tasks like trip planning, stock comparison, or your kid’s school project. 

Unlike previous AI agents, Manus uses a browser-based sandbox that lets users supervise the agent like an intern, watching in real time as it scrolls through web pages, reads articles, or writes code. It also proactively asks clarifying questions and supports long-term memory that serves as context for future tasks.

“Manus represents a promising product experience for AI agents,” says Ang Li, cofounder and CEO of Simular, a startup based in Palo Alto, California, that’s building computer use agents, AI agents that control a virtual computer. “I believe Chinese startups have a huge advantage when it comes to designing consumer products, thanks to cutthroat domestic competition that leads to fast execution and greater attention to product details.”

In the case of Manus, the competition is moving fast. Two of the buzziest follow-ups, Genspark and Flowith, for example, are already boasting benchmark scores that match or edge past Manus’s.

Genspark, led by former Baidu executives Eric Jing and Kay Zhu, links many small “super agents” through what it calls multi‑component prompting. The agent can switch among several large language models, accepts both images and text, and carries out tasks from making slide decks to placing phone calls. Whereas Manus relies heavily on Browser Use, a popular open-source product that lets agents operate a web browser in a virtual window like a human, Genspark directly integrates with a wide array of tools and APIs. Launched in April, the company says that it already has over 5 million users and over $36 million in yearly revenue.

Flowith, the work of a young team that first grabbed public attention in April 2025 at a developer event hosted by the popular social media app Xiaohongshu, takes a different tack. Marketed as an “infinite agent,” it opens on a blank canvas where each question becomes a node on a branching map. Users can backtrack, take new branches, and store results in personal or sharable “knowledge gardens”—a design that feels more like project management software (think Notion) than a typical chat interface. Every inquiry or task builds its own mind-map-like graph, encouraging a more nonlinear and creative interaction with AI. Flowith’s core agent, NEO, runs in the cloud and can perform scheduled tasks like sending emails and compiling files. The founders want the app to be a “knowledge marketbase” and aim to tap into the social aspect of AI, with the aspiration of becoming “the OnlyFans of AI knowledge creators.”

What they also share with Manus is global ambition. Both Genspark and Flowith have stated that their primary focus is the international market.

A global address

Startups like Manus, Genspark, and Flowith—though founded by Chinese entrepreneurs—could blend seamlessly into the global tech scene and compete effectively abroad. Founders, investors, and analysts that MIT Technology Review has spoken to believe Chinese companies are moving fast, executing well, and quickly coming up with new products. 

Money reinforces the pull to launch overseas. Customers there pay more, and there are plenty to go around. “You can price in USD, and with the exchange rate that’s a sevenfold multiplier,” Manus cofounder Xiao Hong quipped on a podcast. “Even if we’re only operating at 10% power because of cultural differences overseas, we’ll still make more than in China.”

But creating the same functionality in China is a challenge. Major US AI companies including OpenAI and Anthropic have opted out of mainland China because of geopolitical risks and challenges with regulatory compliance. Their absence initially created a black market as users resorted to VPNs and third-party mirrors to access tools like ChatGPT and Claude. That vacuum has since been filled by a new wave of Chinese chatbots—DeepSeek, Doubao, Kimi—but the appetite for foreign models hasn’t gone away. 

Manus, for example, uses Anthropic’s Claude Sonnet—widely considered the top model for agentic tasks. Manus cofounder Zhang Tao has repeatedly praised Claude’s ability to juggle tools, remember contexts, and hold multi‑round conversations—all crucial for turning chatty software into an effective executive assistant.

But the company’s use of Sonnet has made its agent functionally unusable inside China without a VPN. If you open Manus from a mainland IP address, you’ll see a notice explaining that the team is “working on integrating Qwen’s model,” a special local version of Manus built on top of Alibaba’s open-source models.

An engineer overseeing ByteDance’s work on developing an agent, who spoke to MIT Technology Review anonymously to avoid sanction, said that the absence of Claude Sonnet models “limits everything we do in China.” DeepSeek’s open models, he added, still hallucinate too often and lack training on real‑world workflows. Developers we spoke with rank Alibaba’s Qwen series as the best domestic alternative, yet most say that switching to Qwen knocks performance down a notch.

Jiaxin Pei, a postdoctoral researcher at Stanford’s Institute for Human‑Centered AI, thinks that gap will close: “Building agentic capabilities in base LLMs has become a key focus for many LLM builders, and once people realize the value of this, it will only be a matter of time.”

For now, Manus is doubling down on audiences it can already serve. In a written response, the company said its “primary focus is overseas expansion,” noting that new offices in San Francisco, Singapore, and Tokyo have opened in the past month.

A super‑app approach

Although the concept of AI agents is still relatively new, the consumer-facing AI app market in China is already crowded with major tech players. DeepSeek remains the most widely used, while ByteDance’s Doubao and Moonshot’s Kimi have also become household names. However, most of these apps are still optimized for chat and entertainment rather than task execution. This gap in the local market has pushed China’s big tech firms to roll out their own user-facing agents, though early versions remain uneven in quality and rough around the edges. 

ByteDance is testing Coze Space, an AI agent based on its own Doubao model family that lets users toggle between “plan” and “execute” modes, so they can either directly guide the agent’s actions or step back and watch it work autonomously. It connects up to 14 popular apps, including GitHub, Notion, and the company’s own Lark office suite. Early reviews say the tool can feel clunky and has a high failure rate, but it clearly aims to match what Manus offers.

Meanwhile, Zhipu AI has released a free agent called AutoGLM Rumination, built on its proprietary ChatGLM models. Shanghai‑based Minimax has launched Minimax Agent. Both products look almost identical to Manus and demo basic tasks such as building a simple website, planning a trip, making a small Flash game, or running quick data analysis.

Despite the limited usability of most general AI agents launched within China, big companies have plans to change that. During a May 15 earnings call, Tencent president Liu Zhiping teased an agent that would weave automation directly into China’s most ubiquitous app, WeChat. 

Considered the original super-app, WeChat already handles messaging, mobile payments, news, and millions of mini‑programs that act like embedded apps. These programs give Tencent, WeChat’s developer, access to data from millions of services that pervade everyday life in China, an advantage most competitors can only envy.

Historically, China’s consumer internet has splintered into competing walled gardens—share a Taobao link in WeChat and it resolves as plaintext, not a preview card. Unlike the more interoperable Western internet, China’s tech giants have long resisted integration with one another, choosing to wage platform war at the expense of a seamless user experience.

But the use of mini‑programs has given WeChat unprecedented reach across services that once resisted interoperability, from gym bookings to grocery orders. An agent able to roam that ecosystem could bypass the integration headaches dogging independent startups.

Alibaba, the e-commerce giant behind the Qwen model series, has been a front-runner in China’s AI race but has been slower to release consumer-facing products. Even though Qwen was the most downloaded open-source model on Hugging Face in 2024, it didn’t power a dedicated chatbot app until early 2025. In March, Alibaba rebranded its cloud storage and search app Quark into an all-in-one AI search tool. By June, Quark had introduced DeepResearch—a new mode that marks its most agent-like effort to date. 

ByteDance and Alibaba did not reply to MIT Technology Review’s request for comments.

“Historically, Chinese tech products tend to pursue the all-in-one, super-app approach, and the latest Chinese AI agents reflect just that,” says Li of Simular, who previously worked at Google DeepMind on AI-enabled work automation. “In contrast, AI agents in the US are more focused on serving specific verticals.”

Pei, the researcher at Stanford, says that existing tech giants could have a huge advantage in bringing the vision of general AI agents to life—especially those with built-in integration across services. “The customer-facing AI agent market is still very early, with tons of problems like authentication and liability,” he says. “But companies that already operate across a wide range of services have a natural advantage in deploying agents at scale.”

What’s next for AI and math

MIT Technology Review’s What’s Next series looks across industries, trends, and technologies to give you a first look at the future. You can read the rest of them here.

The way DARPA tells it, math is stuck in the past. In April, the US Defense Advanced Research Projects Agency kicked off a new initiative called expMath—short for Exponentiating Mathematics—that it hopes will speed up the rate of progress in a field of research that underpins a wide range of crucial real-world applications, from computer science to medicine to national security.

“Math is the source of huge impact, but it’s done more or less as it’s been done for centuries—by people standing at chalkboards,” DARPA program manager Patrick Shafto said in a video introducing the initiative.

The modern world is built on mathematics. Math lets us model complex systems such as the way air flows around an aircraft, the way financial markets fluctuate, and the way blood flows through the heart. And breakthroughs in advanced mathematics can unlock new technologies such as cryptography, which is essential for private messaging and online banking, and data compression, which lets us shoot images and video across the internet.

But advances in math can be years in the making. DARPA wants to speed things up. The goal for expMath is to encourage mathematicians and artificial-intelligence researchers to develop what DARPA calls an AI coauthor, a tool that might break large, complex math problems into smaller, simpler ones that are easier to grasp and—so the thinking goes—quicker to solve.

Mathematicians have used computers for decades, to speed up calculations or check whether certain mathematical statements are true. The new vision is that AI might help them crack problems that were previously uncrackable.  

But there’s a huge difference between AI that can solve the kinds of problems set in high school—math that the latest generation of models has already mastered—and AI that could (in theory) solve the kinds of problems that professional mathematicians spend careers chipping away at.

On one side are tools that might be able to automate certain tasks that math grads are employed to do; on the other are tools that might be able to push human knowledge beyond its existing limits.

Here are three ways to think about that gulf.

1/ AI needs more than just clever tricks

Large language models are not known to be good at math. They make things up and can be persuaded that 2 + 2 = 5. But newer versions of this tech, especially so-called large reasoning models (LRMs) like OpenAI’s o3 and Anthropic’s Claude 4 Thinking, are far more capable—and that’s got mathematicians excited.

This year, a number of LRMs, which try to solve a problem step by step rather than spit out the first result that comes to them, have achieved high scores on the American Invitational Mathematics Examination (AIME), a test given to the top 5% of US high school math students.

At the same time, a handful of new hybrid models that combine LLMs with some kind of fact-checking system have also made breakthroughs. Emily de Oliveira Santos, a mathematician at the University of São Paulo, Brazil, points to Google DeepMind’s AlphaProof, a system that combines an LLM with DeepMind’s game-playing model AlphaZero, as one key milestone. Last year AlphaProof became the first computer program to match the performance of a silver medallist at the International Math Olympiad, one of the most prestigious mathematics competitions in the world.

And in May, a Google DeepMind model called AlphaEvolve discovered better results than anything humans had yet come up with for more than 50 unsolved mathematics puzzles and several real-world computer science problems.

The uptick in progress is clear. “GPT-4 couldn’t do math much beyond undergraduate level,” says de Oliveira Santos. “I remember testing it at the time of its release with a problem in topology, and it just couldn’t write more than a few lines without getting completely lost.” But when she gave the same problem to OpenAI’s o1, an LRM released in January, it nailed it.

Does this mean such models are all set to become the kind of coauthor DARPA hopes for? Not necessarily, she says: “Math Olympiad problems often involve being able to carry out clever tricks, whereas research problems are much more explorative and often have many, many more moving pieces.” Success at one type of problem-solving may not carry over to another.

Others agree. Martin Bridson, a mathematician at the University of Oxford, thinks the Math Olympiad result is a great achievement. “On the other hand, I don’t find it mind-blowing,” he says. “It’s not a change of paradigm in the sense that ‘Wow, I thought machines would never be able to do that.’ I expected machines to be able to do that.”

That’s because even though the problems in the Math Olympiad—and similar high school or undergraduate tests like AIME—are hard, there’s a pattern to a lot of them. “We have training camps to train high school kids to do them,” says Bridson. “And if you can train a large number of people to do those problems, why shouldn’t you be able to train a machine to do them?”

Sergei Gukov, a mathematician at the California Institute of Technology who coaches Math Olympiad teams, points out that the style of question does not change too much between competitions. New problems are set each year, but they can be solved with the same old tricks.

“Sure, the specific problems didn’t appear before,” says Gukov. “But they’re very close—just a step away from zillions of things you have already seen. You immediately realize, ‘Oh my gosh, there are so many similarities—I’m going to apply the same tactic.’” As hard as competition-level math is, kids and machines alike can be taught how to beat it.

That’s not true for most unsolved math problems. Bridson is president of the Clay Mathematics Institute, a nonprofit US-based research organization best known for setting up the Millennium Prize Problems in 2000—seven of the most important unsolved problems in mathematics, with a $1 million prize to be awarded to the first person to solve each of them. (One problem, the Poincaré conjecture, was solved in the early 2000s; the others, which include P versus NP and the Riemann hypothesis, remain open.) “We’re very far away from AI being able to say anything serious about any of those problems,” says Bridson.

And yet it’s hard to know exactly how far away, because many of the existing benchmarks used to evaluate progress are maxed out. The best new models already outperform most humans on tests like AIME.

To get a better idea of what existing systems can and cannot do, a startup called Epoch AI has created a new test called FrontierMath, released in December. Instead of co-opting math tests developed for humans, Epoch AI worked with more than 60 mathematicians around the world to come up with a set of math problems from scratch.

FrontierMath is designed to probe the limits of what today’s AI can do. None of the problems have been seen before and the majority are being kept secret to avoid contaminating training data. Each problem demands hours of work from expert mathematicians to solve—if they can solve it at all: some of the problems require specialist knowledge to tackle.

FrontierMath is set to become an industry standard. It’s not yet as popular as AIME, says de Oliveira Santos, who helped develop some of the problems: “But I expect this to not hold for much longer, since existing benchmarks are very close to being saturated.”

On AIME, the best large language models (Anthropic’s Claude 4, OpenAI’s o3 and o4-mini, Google DeepMind’s Gemini 2.5 Pro, xAI’s Grok 3) now score around 90%. On FrontierMath, o4-mini scores 19% and Gemini 2.5 Pro scores 13%. That’s still remarkable, but there’s clear room for improvement.

FrontierMath should give the best sense yet just how fast AI is progressing at math. But there are some problems that are still too hard for computers to take on.

2/ AI needs to manage really vast sequences of steps

Squint hard enough and in some ways math problems start to look the same: to solve them you need to take a sequence of steps from start to finish. The problem is finding those steps. 

“Pretty much every math problem can be formulated as path-finding,” says Gukov. What makes some problems far harder than others is the number of steps on that path. “The difference between the Riemann hypothesis and high school math is that with high school math the paths that we’re looking for are short—10 steps, 20 steps, maybe 40 in the longest case.” The steps are also repeated between problems.

“But to solve the Riemann hypothesis, we don’t have the steps, and what we’re looking for is a path that is extremely long”—maybe a million lines of computer proof, says Gukov.

Finding very long sequences of steps can be thought of as a kind of complex game. It’s what DeepMind’s AlphaZero learned to do when it mastered Go and chess. A game of Go might only involve a few hundred moves. But to win, an AI must find a winning sequence of moves among a vast number of possible sequences. Imagine a number with 100 zeros at the end, says Gukov.

But that’s still tiny compared with the number of possible sequences that could be involved in proving or disproving a very hard math problem: “A proof path with a thousand or a million moves involves a number with a thousand or a million zeros,” says Gukov. 

No AI system can sift through that many possibilities. To address this, Gukov and his colleagues developed a system that shortens the length of a path by combining multiple moves into single supermoves. It’s like having boots that let you take giant strides: instead of taking 2,000 steps to walk a mile, you can now walk it in 20.

The challenge was figuring out which moves to replace with supermoves. In a series of experiments, the researchers came up with a system in which one reinforcement-learning model suggests new moves and a second model checks to see if those moves help.
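
To make that concrete, here is a minimal toy sketch of the propose-and-check loop, assuming a made-up path-finding problem: states are short lists of integers, and the goal is to reach a target state. The random proposer and the greedy distance check below stand in for the two reinforcement-learning models in the real system, and the basic moves are invented for illustration.

```python
import random

# A toy sketch of the propose-and-check idea described above (not Gukov's
# code): states are short lists of integers, basic moves tweak the first
# entry or rotate the list, and the goal is to reach a target state.
# propose_macro() stands in for the RL model that suggests supermoves;
# the greedy distance check stands in for the model that decides whether
# a proposed supermove actually helps.

BASIC_MOVES = {
    "inc0": lambda s: [s[0] + 1] + s[1:],  # add 1 to the first entry
    "dec0": lambda s: [s[0] - 1] + s[1:],  # subtract 1 from the first entry
    "rot": lambda s: s[1:] + s[:1],        # rotate the list left by one
}

def apply_macro(state, macro):
    for name in macro:
        state = BASIC_MOVES[name](state)
    return state

def distance(state, goal):
    return sum(abs(a - b) for a, b in zip(state, goal))

def propose_macro(max_len=5):
    # Stand-in for the RL proposer: a random short sequence of basic moves.
    length = random.randint(1, max_len)
    return tuple(random.choice(list(BASIC_MOVES)) for _ in range(length))

def search(start, goal, budget=20_000):
    state, path = list(start), []
    for _ in range(budget):
        if state == goal:
            return path  # the list of accepted supermoves
        macro = propose_macro()
        # Stand-in for the RL checker: keep only macros that get us closer.
        if distance(apply_macro(state, macro), goal) < distance(state, goal):
            state = apply_macro(state, macro)
            path.append(macro)
    return None  # give up if the budget runs out

print(search([0, 0, 0], [3, 2, 1]))
```

Each accepted macro acts like one of those giant strides: a single supermove that stands in for several basic steps and shrinks the path the search has to find.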

They used this approach to make a breakthrough in a math problem called the Andrews-Curtis conjecture, a puzzle that has been unsolved for 60 years. It’s a problem that every professional mathematician will know, says Gukov.

(An aside for math stans only: The AC conjecture states that a particular way of describing the trivial group, the group containing just a single element, can be translated into a different but equivalent description with a certain sequence of steps. Most mathematicians think the AC conjecture is false, but nobody knows how to prove that. Gukov himself admits that it is an intellectual curiosity rather than a practical problem, but an important one for mathematicians nonetheless.)

Gukov and his colleagues didn’t solve the AC conjecture, but they found that a potential counterexample proposed 40 years ago (which, if confirmed, would have shown the conjecture to be false) doesn’t hold up. “It’s been a major direction of attack for 40 years,” says Gukov. With the help of AI, they showed that this direction was in fact a dead end.

“Ruling out possible counterexamples is a worthwhile thing,” says Bridson. “It can close off blind alleys, something you might spend a year of your life exploring.” 

True, Gukov checked off just one piece of one esoteric puzzle. But he thinks the approach will work in any scenario where you need to find a long sequence of unknown moves, and he now plans to try it out on other problems.

“Maybe it will lead to something that will help AI in general,” he says. “Because it’s teaching reinforcement learning models to go beyond their training. To me it’s basically about thinking outside of the box—miles away, megaparsecs away.”  

3/ Can AI ever provide real insight?

Thinking outside the box is exactly what mathematicians need to solve hard problems. Math is often thought to involve robotic, step-by-step procedures. But advanced math is an experimental pursuit, involving trial and error and flashes of insight.

That’s where tools like AlphaEvolve come in. Google DeepMind’s latest system asks an LLM to generate code to solve a particular math problem. A second model then evaluates the proposed solutions, picks the best, and sends them back to the LLM to be improved. After hundreds of rounds of trial and error, AlphaEvolve came up with solutions to a wide range of math problems that were better than anything people had yet devised. But it can also work as a collaborative tool: at any step, humans can share their own insight with the LLM, prompting it with specific instructions.
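
As a rough illustration of that loop (and only an illustration: this is not DeepMind’s code, and the LLM is replaced by a simple stand-in that perturbs the current best candidate), the sketch below searches for five points in the unit interval that maximize the smallest gap between neighbors:

```python
import random

# A minimal sketch of the generate-evaluate-refine loop described above.
# This is my own toy, not DeepMind's AlphaEvolve: propose() simply jitters
# the current best candidate, whereas in AlphaEvolve the proposals are
# programs written by an LLM and the evaluator is problem-specific.

def evaluate(points):
    # Toy objective: spread five points over [0, 1] so that the smallest
    # gap between neighboring points is as large as possible (optimum 0.25).
    pts = sorted(points)
    return min(b - a for a, b in zip(pts, pts[1:]))

def propose(best, n=8):
    # Stand-in for the LLM: perturb the current best, staying inside [0, 1].
    return [[min(max(x + random.gauss(0, 0.05), 0.0), 1.0) for x in best]
            for _ in range(n)]

best = [random.random() for _ in range(5)]  # initial guess
for _ in range(300):
    best = max(propose(best) + [best], key=evaluate)  # keep the winner

print([round(x, 3) for x in sorted(best)], "min gap:", round(evaluate(best), 3))
```

The structure is the point: propose, evaluate, keep the winner, repeat. In AlphaEvolve the proposals are programs, the evaluator is tailored to the problem, and a human can step in at any round with a hint.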

This kind of exploration is key to advanced mathematics. “I’m often looking for interesting phenomena and pushing myself in a certain direction,” says Geordie Williamson, a mathematician at the University of Sydney in Australia. “Like: ‘Let me look down this little alley. Oh, I found something!’”

Williamson worked with Meta on an AI tool called PatternBoost, designed to support this kind of exploration. PatternBoost can take a mathematical idea or statement and generate similar ones. “It’s like: ‘Here’s a bunch of interesting things. I don’t know what’s going on, but can you produce more interesting things like that?’” he says.

Such brainstorming is essential work in math. It’s how new ideas get conjured. Take the icosahedron, says Williamson: “It’s a beautiful example of this, which I kind of keep coming back to in my own work.” The icosahedron is a 3D object with 20 triangular faces (think of a 20-sided die). It is the largest of a family of exactly five regular convex solids, the Platonic solids: the others are the tetrahedron (four faces), the cube (six), the octahedron (eight), and the dodecahedron (12).

Remarkably, the fact that there are exactly five of these objects was proved by mathematicians in ancient Greece. “At the time that this theorem was proved, the icosahedron didn’t exist,” says Williamson. “You can’t go to a quarry and find it—someone found it in their mind. And the icosahedron goes on to have a profound effect on mathematics. It’s still influencing us today in very, very profound ways.”

For Williamson, the exciting potential of tools like PatternBoost is that they might help people discover future mathematical objects like the icosahedron that go on to shape the way math is done. But we’re not there yet. “AI can contribute in a meaningful way to research-level problems,” he says. “But we’re certainly not getting inundated with new theorems at this stage.”

Ultimately, it comes down to the fact that machines still lack what you might call intuition or creative thinking. Williamson sums it up like this: We now have AI that can beat humans when it knows the rules of the game. “But it’s one thing for a computer to play Go at a superhuman level and another thing for the computer to invent the game of Go.”

“I think that applies to advanced mathematics,” he says. “Breakthroughs come from a new way of thinking about something, which is akin to finding completely new moves in a game. And I don’t really think we understand where those really brilliant moves in deep mathematics come from.”

Perhaps AI tools like AlphaEvolve and PatternBoost are best thought of as advance scouts for human intuition. They can discover new directions and point out dead ends, saving mathematicians months or years of work. But the true breakthroughs will still come from the minds of people, as has been the case for thousands of years.

For now, at least. “There’s plenty of tech companies that tell us that won’t last long,” says Williamson. “But you know—we’ll see.” 

Inside the effort to tally AI’s energy appetite

After working on it for months, my colleague Casey Crownhart and I finally saw our story on AI’s energy and emissions burden go live last week. 

The initial goal sounded simple: Calculate how much energy is used each time we interact with a chatbot, and then tally that up to understand why everyone from leaders of AI companies to officials at the White House wants to harness unprecedented levels of electricity to power AI and reshape our energy grids in the process. 

It was, of course, not so simple. After speaking with dozens of researchers, we realized that the common understanding of AI’s energy appetite is full of holes. I encourage you to read the full story, which has some incredible graphics to help you understand everything from the energy used in a single query right up to what AI will require just three years from now (enough electricity to power 22% of US households, it turns out). But here are three takeaways I have after the project. 

AI is in its infancy

We focused on measuring the energy requirements that go into using a chatbot, generating an image, and creating a video with AI. But these three uses are relatively small-scale compared with where AI is headed next. 

Lots of AI companies are building reasoning models, which “think” for longer and use more energy. They’re building hardware devices, perhaps like the one Jony Ive has been working on (OpenAI just acquired his hardware startup, io, for $6.5 billion), that have AI constantly humming along in the background of our conversations. They’re designing agents and digital clones of us to act on our behalf. All these trends point to a more energy-intensive future (which, again, helps explain why OpenAI and others are spending such inconceivable amounts of money on energy).

But the fact that AI is in its infancy raises another point. The models, chips, and cooling methods behind this AI revolution could all grow more efficient over time, as my colleague Will Douglas Heaven explains. This future isn’t predetermined.

AI video is on another level

When we tested the energy demands of various models, we found the energy required to produce even a low-quality, five-second video to be pretty shocking: It was 42,000 times more than the amount needed for a chatbot to answer a question about a recipe, and enough to power a microwave for over an hour. If there’s one type of AI whose energy appetite should worry you, it’s this one.

Soon after we published, Google debuted the latest iteration of its Veo model. People quickly created compilations of the most impressive clips (this one being the most shocking to me). Something we point out in the story is that Google (as well as OpenAI, which has its own video generator, Sora) denied our request for specific numbers on the energy their AI models use. Nonetheless, our reporting suggests it’s very likely that high-definition video models like Veo and Sora are much larger, and much more energy-demanding, than the models we tested. 

I think the key to whether the use of AI video will produce indefensible clouds of emissions in the near future will be how it’s used, and how it’s priced. The example I linked shows a bunch of TikTok-style content, and I predict that if creating AI video is cheap enough, social video sites will be inundated with this type of content. 

There are more important questions than your own individual footprint

We expected that a lot of readers would understandably think about this story in terms of their own individual footprint, wondering whether their AI usage is contributing to the climate crisis. Don’t panic: It’s likely that asking a chatbot for help with a travel plan does not meaningfully increase your carbon footprint. Video generation might. But after reporting on this for months, I think there are more important questions.

Consider, for example, the water being drained from aquifers in Nevada, the country’s driest state, to cool data centers that are drawn to the area by tax incentives and easy permitting processes, as detailed in an incredible story by James Temple. Or look at how Meta’s largest data center project, in Louisiana, is relying on natural gas despite industry promises to use clean energy, per a story by David Rotman. Or the fact that nuclear energy is not the silver bullet that AI companies often make it out to be.

There are global forces shaping how much energy AI companies are able to access and what types of sources will provide it. There is also very little transparency from leading AI companies on their current and future energy demands, even while they’re asking for public support for these plans. Pondering your individual footprint can be a good thing to do, provided you remember that it’s not so much your footprint as these other factors that are keeping climate researchers and energy experts we spoke to up at night.

This story originally appeared in The Algorithm, our weekly newsletter on AI.

The Trump administration has shut down more than 100 climate studies

The Trump administration has terminated National Science Foundation grants for more than 100 research projects related to climate change amid a widening campaign to slash federal funding for scientists and institutions studying the rising risks of a warming world.

The move will cut off what’s likely to amount to tens of millions of dollars for studies that were previously approved and, in most cases, already in the works. 

Affected projects include efforts to develop cleaner fuels, measure methane emissions, improve understanding of how heat waves and sea-level rise disproportionately harm marginalized groups, and help communities transition to sustainable energy, according to MIT Technology Review’s analysis of the GrantWatch database—a volunteer-led effort to track federal cuts to research—and a list of terminated grants from the National Science Foundation (NSF) itself.

The NSF is one of the largest sources of US funding for university research, so the cancellations will deliver a big blow to climate science and clean-energy development.

They come on top of the White House’s broader efforts to cut research funding and revenue for universities and significantly raise their taxes. The administration has also strived to slash staff and budgets at federal research agencies, halt efforts to assess the physical and financial risks of climate change, and shut down labs that have monitored and analyzed the levels of greenhouse gases in the air for decades.

“I don’t think it takes a lot of imagination to understand where this is going,” says Daniel Schrag, co-director of the science, technology, and public policy program at Harvard University, which has seen greater funding cuts than any other university amid an escalating legal conflict with the administration. “I believe the Trump administration intends to zero out funding for climate science altogether.”

The NSF says it’s terminating grants that aren’t aligned with the agency’s program goals, “including but not limited to those on diversity, equity, and inclusion (DEI), environmental justice, and misinformation/disinformation.”

Trump administration officials have argued that DEI considerations have contaminated US science, favoring certain groups over others and undermining the public’s trust in researchers.

“Political biases have displaced the vital search for truth,” Michael Kratsios, head of the White House Office of Science and Technology Policy, said to a group of NSF administrators and others last month, according to reporting in Science.

Science v. politics

But research projects that got caught in the administration’s anti-DEI filter aren’t the only casualties of the cuts. The NSF has also canceled funding for work that has little obvious connection to DEI ambitions, such as research on catalysts.

Many believe the administration’s broader motivation is to undermine the power of the university system and to suppress research findings that cut against its politics.

Trump and his officials have repeatedly argued, in public statements and executive orders, that climate fears are overblown and that burdensome environmental regulations have undermined the nation’s energy security and economic growth.

“It certainly seems like a deliberate attempt to undo any science that contradicts the administration,” says Alexa Fredston, an assistant professor of ocean sciences at the University of California, Santa Cruz. 

On May 28, a group of states including California, New York, and Illinois sued the NSF, arguing that the cuts illegally violated diversity goals and funding priorities clearly established by Congress, which controls federal spending.

A group of universities also filed a lawsuit against the NSF over its earlier decision to reduce the indirect cost rate for research, which reimburses universities for overhead expenses associated with work carried out on campuses. The plaintiffs included the California Institute of Technology, Carnegie Mellon University, and the Massachusetts Institute of Technology, which has also lost a number of research grants.

(MIT Technology Review is owned by, but editorially independent from, MIT.)

The NSF declined to comment.

‘Theft from the American people’

GrantWatch is an effort among researchers at rOpenSci, Harvard, and other organizations to track terminations of grants issued by the National Institutes of Health (NIH) and NSF. It draws on voluntary submissions from scientists involved as well as public government information. 

A search of its database for the terms “climate change,” “clean energy,” “climate adaptation,” “environmental justice,” and “climate justice” showed that the NSF has canceled funds for 118 projects, which were supposed to receive more than $100 million in total. Searching for the word “climate” produces more than 300 research projects that were set to receive more than $230 million. (That word often indicates climate-change-related research, but in some abstracts it refers to the cultural climate.) 
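
For readers who want to reproduce that kind of tally, the sketch below shows the general idea, assuming a CSV export of terminated grants. The file name and column names (title, abstract, award_amount) are assumptions made for illustration, not GrantWatch’s actual schema.

```python
import csv

# A hypothetical sketch of this kind of keyword tally. The file name and
# the column names used here are assumptions, not GrantWatch's real schema.

TERMS = ["climate change", "clean energy", "climate adaptation",
         "environmental justice", "climate justice"]

def tally(path):
    count, dollars = 0, 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            text = f"{row.get('title', '')} {row.get('abstract', '')}".lower()
            if any(term in text for term in TERMS):
                count += 1
                dollars += float(row.get("award_amount") or 0)
    return count, dollars

# Example, assuming such an export exists:
# projects, total = tally("nsf_terminated_grants.csv")
```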

Some share of those funds has already been issued to research groups. The NSF section of the database doesn’t include that “outlaid” figure, but it’s generally about half the amount of the original grants, according to Noam Ross, a computational researcher and executive director of rOpenSci, a nonprofit initiative that promotes open and reproducible science.

A search for “climate change” among the NIH projects produces another 22 studies that were terminated and were still owed nearly $50 million in grants. Many of those projects explored the mental or physical health effects of climate change and extreme weather events.

The NSF more recently released its own list of terminated projects, which largely mirrors GrantWatch’s findings and confirms the specific terminations mentioned in this story.

“These grant terminations are theft from the American people,” Ross said in an email response. “By illegally ending this research the Trump administration is wasting taxpayer dollars, gutting US leadership in science, and telling the world that the US government breaks its promises.”

Harvard, the country’s oldest university, has been particularly hard hit.

In April, the university sued the Trump administration over cuts to its research funding and efforts to exert control over its admissions and governance policies. The White House, in turn, has moved to eliminate all federal funds for the university, including hundreds of NSF and NIH grants. 

Daniel Nocera, a professor at Harvard who has done pioneering work on so-called artificial photosynthesis, a pathway for producing clean fuels from sunlight, said in an email that all of his grants were terminated. 

“I have no research funds,” he added.

Another terminated grant involved a collaboration between Harvard and the NSF National Center for Atmospheric Research (NCAR), designed to update the atmospheric chemistry component of the Community Earth System Model, an open-source climate model widely used by scientists around the world.

The research was expected to “contribute to a better understanding of atmospheric chemistry in the climate system and to improve air quality predictions within the context of climate change,” according to the NSF abstract. 

“We completed most of the work and were able to bring it to a stopping point,” Daniel Jacob, a professor at Harvard listed as the principal investigator on the project, said in an email. “But it will affect the ability to study chemistry-climate interactions. And it is clearly not right to pull funding from an existing project.”

Plenty of the affected research projects do, in one way or another, grapple with issues of diversity, equity, and inclusion. But that’s because there is ample evidence that disadvantaged communities experience higher rates of illness from energy-sector pollution, will be harder hit by the escalating effects of extreme weather, and are underrepresented in scientific fields.

One of the largest terminations cut off about $4 million in remaining funds for the CLIMATE Justice Initiative, a fellowship program at the University of California, Irvine designed to recruit, train, and mentor a more diverse array of researchers in Earth sciences.

The NSF decision occurred halfway into the 5-year program, halting funds for a number of fellows who were in the midst of environmental justice research efforts with community partners in Southern California. Kathleen Johnson, a professor at UC Irvine and director of the initiative, says the university is striving to find ways to fund as many participants as possible for the remainder of their fellowships.

“We need people from all parts of society who are trained in geoscience and climate science to address all these global challenges that we are facing,” she says. “The people who will be best positioned to do this work …  are the people who understand the community’s needs and are able to therefore work to implement equitable solutions.”

“Diverse teams have been shown to do better science,” Johnson adds.

Numerous researchers whose grants were terminated didn’t respond to inquiries from MIT Technology Review or declined to comment, amid growing concerns that the Trump administration will punish scientists or institutions that criticize its policies.

Coming cuts

The termination of existing NSF and NIH grants is just the start of the administration’s plans to cut federal funding for climate and clean-energy research. 

The White House’s budget proposal for the coming fiscal year seeks to eliminate tens of billions of dollars in funding across federal agencies, specifically calling out “Green New Scam funds” at the Department of Energy; “low-priority climate monitoring satellites” at NASA; “climate-dominated research, data, and grant programs” at the National Oceanic and Atmospheric Administration; and “climate; clean energy; woke social, behavioral, and economic sciences” at the NSF.

The administration released a more detailed NSF budget proposal on May 30, which called for a 60% reduction in research spending and nearly zeroed out the clean energy technology program. It also proposed cutting funds by 97% for the US Global Change Research Program, which produces regular assessments of climate risks; 80% for the Ocean Observatories Initiative, a global network of ocean sensors that monitor shifting marine conditions; and 40% for NCAR, the atmospheric research center.

If Congress approves budget reductions anywhere near the levels the administration has put forward, scientists fear, it could eliminate the resources necessary to carry on long-running climate observation of oceans, forests, and the atmosphere. 

The administration also reportedly plans to end the leases on dozens of NOAA facilities, including the Global Monitoring Laboratory in Hilo, Hawaii. The lab supports the work of the nearby Mauna Loa Observatory, which has tracked atmospheric carbon dioxide levels for decades.

Even short gaps in these time-series studies, which scientists around the world rely upon, would have an enduring impact on researchers’ ability to analyze and understand weather and climate trends.

“We won’t know where we’re going if we stop measuring what’s happening,” says Jane Long, formerly the associate director of energy and environment at Lawrence Livermore National Lab. “It’s devastating—there’s no two ways around it.” 

Stunting science 

Growing fears that public research funding will take an even larger hit in the coming fiscal year are forcing scientists to rethink their research plans—or to reconsider whether they want to stay in the field at all, numerous observers said.

“The amount of funding we’re talking about isn’t something a university can fill indefinitely, and it’s not something that private philanthropy can fill for very long,” says Michael Oppenheimer, a professor of geosciences and international affairs at Princeton University. “So what we’re talking about is potentially cataclysmic for climate science.”

“Basically it’s a shit show,” he adds, “and how bad a shit show it is will depend a lot on what happens in the courts and Congress over the next few months.”

One climate scientist, who declined to speak on the record out of concern that the administration might punish his institution, said the declining funding is forcing researchers to shrink their scientific ambitions down to a question of “What can I do with my laptop and existing data sets?”

“If your goal was to make the United States a second-class or third-class country when it comes to science and education, you would be doing exactly what the administration is doing,” the scientist said. “People are pretty depressed, upset, and afraid.”

Given the rising challenges, Harvard’s Schrag fears that the best young climate scientists will decide to shift their careers outside of the US, or move into high tech or other fields where they can make significantly more money.

“We might lose a generation of talent—and that’s not going to get fixed four years from now,” he says. “The irony is that Trump is attacking the institutions and foundation of US science that literally made America great.”

This benchmark used Reddit’s AITA to test how much AI models suck up to us

Back in April, OpenAI announced it was rolling back an update to its GPT-4o model that made ChatGPT’s responses to user queries too sycophantic.

An AI model that acts in an overly agreeable and flattering way is more than just annoying. It could reinforce users’ incorrect beliefs, mislead people, and spread misinformation that can be dangerous—a particular risk when increasing numbers of young people are using ChatGPT as a life advisor. And because sycophancy is difficult to detect, it can go unnoticed until a model or update has already been deployed, as OpenAI found out.

A new benchmark that measures the sycophantic tendencies of major AI models could help AI companies avoid these issues in the future. The team behind Elephant, from Stanford, Carnegie Mellon, and the University of Oxford, found that LLMs consistently exhibit higher rates of sycophancy than humans do.

“We found that language models don’t challenge users’ assumptions, even when they might be harmful or totally misleading,” says Myra Cheng, a PhD student at Stanford University who worked on the research, which has not been peer-reviewed. “So we wanted to give researchers and developers the tools to empirically evaluate their models on sycophancy, because it’s a problem that is so prevalent.”

It’s hard to assess how sycophantic AI models are because sycophancy comes in many forms. Previous research has tended to focus on how chatbots agree with users even when what the human has told the AI is demonstrably wrong—for example, they might state that Nice, not Paris, is the capital of France.

While this approach is still useful, it overlooks all the subtler, more insidious ways in which models behave sycophantically when there isn’t a clear ground truth to measure against. Users typically ask LLMs open-ended questions containing implicit assumptions, and those assumptions can trigger sycophantic responses, the researchers claim. For example, a model that’s asked “How do I approach my difficult coworker?” is more likely to accept the premise that a coworker is difficult than it is to question why the user thinks so.

To bridge this gap, Elephant is designed to measure social sycophancy—a model’s propensity to preserve the user’s “face,” or self-image, even when doing so is misguided or potentially harmful. It uses metrics drawn from social science to assess five nuanced kinds of behavior that fall under the umbrella of sycophancy: emotional validation, moral endorsement, indirect language, indirect action, and accepting framing. 

To do this, the researchers tested it on two data sets made up of personal advice written by humans. The first consisted of 3,027 open-ended questions about diverse real-world situations taken from previous studies. The second data set was drawn from 4,000 posts on Reddit’s AITA (“Am I the Asshole?”) subreddit, a popular forum among users seeking advice. Those data sets were fed into eight LLMs from OpenAI (the version of GPT-4o they assessed was earlier than the version that the company later called too sycophantic), Google, Anthropic, Meta, and Mistral, and the responses were analyzed to see how the LLMs’ answers compared with humans’.
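
Conceptually, the comparison works like the hypothetical sketch below, which is not the Elephant team’s code: flag each model answer and each human answer for a given behavior, then compare the rates. The keyword judge and the canned answers are crude stand-ins for the paper’s real classifiers and model calls.

```python
# A hypothetical sketch of the comparison, not the Elephant team's code:
# flag each model answer and each human answer for a given behavior, then
# compare the rates. The keyword judge and the canned answers below are
# crude stand-ins for the paper's real classifiers and model calls.

BEHAVIOR_CUES = {
    "emotional_validation": ["that sounds really hard", "your feelings are valid"],
    "accepting_framing": ["your difficult coworker", "they are clearly the problem"],
}

def detects(answer, behavior):
    return any(cue in answer.lower() for cue in BEHAVIOR_CUES[behavior])

def rates(model_answers, human_answers, behavior):
    return {
        "model": sum(detects(a, behavior) for a in model_answers) / len(model_answers),
        "human": sum(detects(a, behavior) for a in human_answers) / len(human_answers),
    }

model_answers = ["That sounds really hard. Your difficult coworker owes you an apology."]
human_answers = ["Honestly, it sounds like you may be part of the conflict too."]
print(rates(model_answers, human_answers, "emotional_validation"))
# {'model': 1.0, 'human': 0.0}
```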

Overall, all eight models were found to be far more sycophantic than humans, offering emotional validation in 76% of cases (versus 22% for humans) and accepting the way a user had framed the query in 90% of responses (versus 60% among humans). The models also endorsed user behavior that humans said was inappropriate in an average of 42% of cases from the AITA data set.

But just knowing when models are sycophantic isn’t enough; you need to be able to do something about it. And that’s trickier. The authors had limited success when they tried to mitigate these sycophantic tendencies through two different approaches: prompting the models to provide honest and accurate responses, and fine-tuning a model on labeled AITA examples to encourage outputs that are less sycophantic. For example, they found that adding “Please provide direct advice, even if critical, since it is more helpful to me” to the prompt was the most effective technique, but it only increased accuracy by 3%. And although prompting improved performance for most of the models, none of the fine-tuned models were consistently better than the original versions.
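
The prompting fix is as simple as it sounds; the sketch below just appends the quoted sentence to a user query. The helper function and the example query are my own stand-ins, not the authors’ code.

```python
# Illustration of the prompting mitigation described above. The suffix
# wording is quoted from the paper; everything else is a stand-in.

SUFFIX = " Please provide direct advice, even if critical, since it is more helpful to me."

def anti_sycophancy_prompt(user_query: str) -> str:
    return user_query.rstrip() + SUFFIX

print(anti_sycophancy_prompt("How do I approach my difficult coworker?"))
```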

“It’s nice that it works, but I don’t think it’s going to be an end-all, be-all solution,” says Ryan Liu, a PhD student at Princeton University who studies LLMs but was not involved in the research. “There’s definitely more to do in this space in order to make it better.”

Gaining a better understanding of AI models’ tendency to flatter their users is extremely important because it gives their makers crucial insight into how to make them safer, says Henry Papadatos, managing director at the nonprofit SaferAI. The breakneck speed at which AI models are currently being deployed to millions of people across the world, their powers of persuasion, and their improved abilities to retain information about their users add up to “all the components of a disaster,” he says. “Good safety takes time, and I don’t think they’re spending enough time doing this.” 

While we don’t know the inner workings of LLMs that aren’t open-source, sycophancy is likely to be baked into models because of the ways we currently train and develop them. Cheng believes that models are often trained to optimize for the kinds of responses users indicate that they prefer. ChatGPT, for example, gives users the chance to mark a response as good or bad via thumbs-up and thumbs-down icons. “Sycophancy is what gets people coming back to these models. It’s almost the core of what makes ChatGPT feel so good to talk to,” she says. “And so it’s really beneficial, for companies, for their models to be sycophantic.” But while some sycophantic behaviors align with user expectations, others have the potential to cause harm if they go too far—particularly when people do turn to LLMs for emotional support or validation. 

“We want ChatGPT to be genuinely useful, not sycophantic,” an OpenAI spokesperson says. “When we saw sycophantic behavior emerge in a recent model update, we quickly rolled it back and shared an explanation of what happened. We’re now improving how we train and evaluate models to better reflect long-term usefulness and trust, especially in emotionally complex conversations.”

Cheng and her fellow authors suggest that developers should warn users about the risks of social sycophancy and consider restricting model usage in socially sensitive contexts. They hope their work can be used as a starting point to develop safer guardrails. 

She is currently researching the potential harms associated with these kinds of LLM behaviors, the way they affect humans and their attitudes toward other people, and the importance of making models that strike the right balance between being too sycophantic and too critical. “This is a very big socio-technical challenge,” she says. “We don’t want LLMs to end up telling users, ‘You are the asshole.’”