What a return to supersonic flight could mean for climate change

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

As I’ve admitted in this newsletter before, I love few things more than getting on an airplane. I know, it’s a bold statement from a climate reporter because of all the associated emissions, but it’s true. So I’m as intrigued as the next person by efforts to revive supersonic flight.  

Last week, Boom Supersonic completed its first supersonic test flight of the XB-1 test aircraft. I watched the broadcast live, and the vibe was infectious, watching the hosts’ anticipation during takeoff and acceleration, and then their celebration once it was clear the aircraft had broken the sound barrier.

And yet, knowing what I know about the climate, the promise of a return to supersonic flight is a little tarnished. We’re in a spot with climate change where we need to drastically cut emissions, and supersonic flight would likely take us in the wrong direction. The whole thing has me wondering how fast is fast enough. 

The aviation industry is responsible for about 4% of global warming to date. And right now only about 10% of the global population flies on an airplane in any given year. As incomes rise and flight becomes more accessible to more people, we can expect air travel to pick up, and the associated greenhouse gas emissions to rise with it. 

If business continues as usual, emissions from aviation could double by 2050, according to a 2019 report from the International Civil Aviation Organization. 

Supersonic flight could very well contribute to this trend, because flying faster requires a whole lot more energy—and consequently, fuel. Depending on the estimate, on a per-passenger basis, a supersonic plane will use somewhere between two and nine times as much fuel as a commercial jet today. (The most optimistic of those numbers comes from Boom, and it compares the company’s own planes to first-class cabins.)

In addition to the greenhouse gas emissions from increased fuel use, additional potential climate effects may be caused by pollutants like nitrogen oxides, sulfur, and black carbon being released at the higher altitudes common in supersonic flight. For more details, check out my latest story.

Boom points to sustainable aviation fuels (SAFs) as the solution to this problem. After all, these alternative fuels could potentially cut out all the greenhouse gases associated with burning jet fuel.

The problem is, the market for SAFs is practically embryonic. They made up less than 1% of the jet fuel supply in 2024, and they’re still several times more expensive than fossil fuels. And currently available SAFs tend to cut emissions between 50% and 70%—still a long way from net-zero.

Things will (hopefully) progress in the time it takes Boom to make progress on reviving supersonic flight—the company plans to begin building its full-scale plane, Overture, sometime next year. But experts are skeptical that SAF will be as available, or as cheap, as it’ll need to be to decarbonize our current aviation industry, not to mention to supply an entirely new class of airplanes that burn even more fuel to go the same distance.

The Concorde supersonic jet, which flew from 1969 to 2003, could get from New York to London in a little over three hours. I’d love to experience that flight—moving faster than the speed of sound is a wild novelty, and a quicker flight across the pond could open new options for travel. 

One expert I spoke to for my story, after we talked about supersonic flight and how it’ll affect the climate, mentioned that he’s actually trying to convince the industry that planes should actually be slowing down a little bit. By flying just 10% slower, planes could see outsized reductions in emissions. 

Technology can make our lives better. But sometimes, there’s a clear tradeoff between how technology can improve comfort and convenience for a select group of people and how it will contribute to the global crisis that is climate change. 

I’m not a Luddite, and I certainly fly more than the average person. But I do feel like, maybe we should all figure out how to slow down, or at least not tear toward the worst impacts of climate change faster. 


Now read the rest of The Spark

Related reading

We named sustainable aviation fuel as one of our 10 Breakthrough Technologies this year. 

The world of alternative fuels can be complicated. Here’s everything you need to know about the wide range of SAFs

Rerouting planes could help reduce contrails—and aviation’s climate impacts. Read more in this story from James Temple.  

A glowing deepseek logo

SARAH ROGERS / MITTR | PHOTO GETTY

Another thing

DeepSeek has crashed onto the scene, upending established ideas about the AI industry. One common claim is that the company’s model could drastically reduce the energy needed for AI. But the story is more complicated than that, as my colleague James O’Donnell covered in this sharp analysis

Keeping up with climate

Donald Trump announced a 10% tariff on goods from China. Plans for tariffs on Mexico and Canada were announced, then quickly paused, this week as well. Here’s more on what it could mean for folks in the US. (NPR)
→ China quickly hit back with mineral export curbs on materials including tellurium, a key ingredient in some alternative solar panels. (Mining.com)
→ If the tariffs on Mexico and Canada go into effect, they’d hit supply chains for the auto industry, hard. (Heatmap News)

Researchers are scrambling to archive publicly available data from agencies like the National Oceanic and Atmospheric Administration. The Trump administration has directed federal agencies to remove references to climate change. (Inside Climate News)
→ As of Wednesday morning, it appears that live data that tracks carbon dioxide in the atmosphere is no longer accessible on NOAA’s website. (Try for yourself here)

Staffers with Elon Musk’s “department of government efficiency” entered the NOAA offices on Wednesday morning, inciting concerns about plans for the agency. (The Guardian)

The National Science Foundation, one of the US’s leading funders of science and engineering research, is reportedly planning to lay off between 25% and 50% of its staff. (Politico)

Our roads aren’t built for the conditions being driven by climate change. Warming temperatures and changing weather patterns are hammering roads, driving up maintenance costs. (Bloomberg)

Researchers created a new strain of rice that produces much less methane when grown in flooded fields. The variant was made with traditional crossbreeding. (New Scientist)

Oat milk maker Oatly is trying to ditch fossil fuels in its production process with industrial heat pumps and other electrified technology. But getting away from gas in food and beverage production isn’t easy. (Canary Media)

A new 3D study of the Greenland Ice Sheet reveals that crevasses are expanding faster than previously thought. (Inside Climate News)

In other ice news, an Arctic geoengineering project shut down over concerns for wildlife. The nonprofit project was experimenting with using glass beads to slow melting, but results showed it was a threat to food chains. (New Scientist)

Supersonic planes are inching toward takeoff. That could be a problem.

Boom Supersonic broke the sound barrier in a test flight of its XB-1 jet last week, marking an early step in a potential return for supersonic commercial flight. The small aircraft reached a top speed of Mach 1.122 (roughly 750 miles per hour) in a flight over southern California and exceeded the speed of sound for a few minutes. 

“XB-1’s supersonic flight demonstrates that the technology for passenger supersonic flight has arrived,” said Boom founder and CEO Blake Scholl in a statement after the test flight.

Boom plans to start commercial operation with a scaled-up version of the XB-1, a 65-passenger jet called Overture, before the end of the decade, and it has already sold dozens of planes to customers including United Airlines and American Airlines. But as the company inches toward that goal, experts warn that such efforts will come with a hefty climate price tag. 

Supersonic planes will burn significantly more fuel than current aircraft, resulting in higher emissions of carbon dioxide, which fuels climate change. Supersonic jets also fly higher than current commercial planes do, introducing atmospheric effects that may warm the planet further.

In response to questions from MIT Technology Review, Boom pointed to alternative fuels as a solution, but those remain in limited supply—and they could have limited use in cutting emissions in supersonic aircraft. Aviation is a significant and growing contributor to human-caused climate change, and supersonic technologies could grow the sector’s pollution, rather than make progress toward shrinking it.

XB-1 follows a long history of global supersonic flight. Humans first broke the sound barrier in 1947, when Chuck Yeager hit 700 miles per hour in a research aircraft (the speed of sound at that flight’s altitude is 660 miles per hour). Just over two decades later, in 1969, the first supersonic commercial airliner, the Concorde, took its first flight. That aircraft regularly traveled at supersonic speeds until the last one was decommissioned in 2003.

Among other issues (like the nuisance of sonic booms), one of the major downfalls of the Concorde was its high operating cost, due in part to the huge amounts of fuel it required to reach top speeds. Experts say today’s supersonic jets will face similar challenges. 

Flying close to the speed of sound changes the aerodynamics required of an aircraft, says Raymond Speth, associate director of the MIT Laboratory for Aviation and the Environment. “All the things you have to do to fly at supersonic speed,” he says, “they reduce your efficiency … There’s a reason we have this sweet spot where airplanes fly today, around Mach 0.8 or so.”

Boom estimates that one of its full-sized Overture jets will burn two to three times as much fuel per passenger as a subsonic plane’s first-class cabin. The company chose this comparison because its aircraft is “designed to deliver an enhanced, productive cabin experience,” similar to what’s available in first- and business-class cabins on today’s aircraft. 

That baseline, however, isn’t representative of the average traveler today. Compared to standard economy-class travel, first-class cabins tend to have larger seats with more space between them. Because there are fewer seats, more fuel is required per passenger, and therefore more emissions are produced for each person. 

When passengers crammed into coach are considered in addition to those in first class, each passenger on a Boom Supersonic flight will burn somewhere between five and seven times more fuel per passenger than the average subsonic plane passenger today, according to research from the International Council on Clean Transportation. 

It’s not just carbon dioxide from burning fuel that could add to supersonic planes’ climate impact. All jet engines release other pollutants as well, including nitrogen oxides, black carbon, and sulfur.

The difference is that while commercial planes today top out in the troposphere, supersonic aircraft tend to fly higher in the atmosphere, in the stratosphere. The air is less dense at higher altitudes, creating less drag on the plane and making it easier to reach supersonic speeds.

Flying in the stratosphere, and releasing pollutants there, could increase the climate impacts of supersonic flight, Speth says. For one, nitrogen oxides released in the stratosphere damage the ozone layer through chemical reactions at that altitude.

It’s not all bad news, to be fair. The drier air in the stratosphere means supersonic jets likely won’t produce significant contrails. That could be a benefit for climate, since contrails contribute to aviation’s warming.

Boom has also touted plans to make up for its expected climate impacts by making its aircraft compatible with 100% sustainable aviation fuel (SAF), a category of alternative fuels made from biological sources, waste products, or even captured carbon from the air. “Going faster requires more energy, but it doesn’t need to emit more carbon. Overture is designed to fly on net-zero carbon sustainable aviation fuel (SAF), eliminating up to 100% of carbon emissions,” a Boom spokesperson said via email in response to written questions from MIT Technology Review

However, alternative fuels may not be a saving grace for supersonic flight. Most commercially available SAF today is made with a process that cuts emissions between 50% and 70% compared to fossil fuels. So a supersonic jet running on SAFs may emit less carbon dioxide than one running on fossil fuels, but alternative fuels will likely still come with some level of carbon pollution attached, says Dan Rutherford, senior director of research at the International Council on Clean Transportation. 

“People are pinning a lot of hope on SAFs,” says Rutherford. “But the reality is, today they remain scarce [and] expensive, and they have sustainability concerns of their own.”

Of the 100 billion gallons of jet fuel used last year, only about 0.5% of it was SAF. Companies are building new factories to produce larger volumes of the fuels and expand the available options, but the fuel is likely going to continue to make up a small fraction of the existing fuel supply, Rutherford says. That means supersonic jets will be competing with other, existing planes for the same supply, and aiming to use more of it. 

Boom Supersonic has secured 10 million gallons of SAF annually from Dimensional Energy and Air Company for the duration of the Overture test flight program, according to the company spokesperson’s email. Ultimately, though, if and when Overture reaches commercial operation, it will be the airlines that purchase its planes hunting for a fuel supply—and paying for it. 

There’s also a chance that using SAFs in supersonic jets could come with unintended consequences, as the fuels have a slightly different chemical makeup than fossil fuels. For example, fossil fuels generally contain sulfur, which has a cooling effect, as sulfur aerosols formed from jet engine exhaust help reflect sunlight. (Intentional release of sulfur is one strategy being touted by groups aiming to start geoengineering the atmosphere.) That effect is stronger in the stratosphere, where supersonic jets are likely to fly. SAFs, however, typically have very low sulfur levels, so using the alternative fuels in supersonic jets could potentially result in even more warming overall.

There are other barriers that Boom and others will need to surmount to get a new supersonic jet industry off the ground. Supersonic travel over land is largely banned, because of the noise and potential damage that comes from the shock wave caused by breaking the sound barrier. While some projects, including one at NASA, are working on changes to aircraft that would result in a less disruptive shock wave, these so-called low-boom technologies are far from proven. NASA’s prototype was revealed last year, and the agency is currently conducting tests of the aircraft, with first flight anticipated sometime this year.  

Boom is planning a second supersonic test flight for XB-1, as early as February 10, according to the spokesperson. Once testing in that small aircraft is done, the data will be used to help build Overture, the full-scale plane. The company says it plans to begin production on Overture in its factory in roughly 18 months. 

In the meantime, the world continues to heat up. As MIT’s Speth says, “I feel like it’s not the time for aviation to be coming up with new ways of using even more energy, with where we are in the climate crisis.”

Three things to know as the dust settles from DeepSeek

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

The launch of a single new AI model does not normally cause much of a stir outside tech circles, nor does it typically spook investors enough to wipe out $1 trillion in the stock market. Now, a couple of weeks since DeepSeek’s big moment, the dust has settled a bit. The news cycle has moved on to calmer things, like the dismantling of long-standing US federal programs, the purging of research and data sets to comply with recent executive orders, and the possible fallouts from President Trump’s new tariffs on Canada, Mexico, and China.

Within AI, though, what impact is DeepSeek likely to have in the longer term? Here are three seeds DeepSeek has planted that will grow even as the initial hype fades.

First, it’s forcing a debate about how much energy AI models should be allowed to use up in pursuit of better answers. 

You may have heard (including from me) that DeepSeek is energy efficient. That’s true for its training phase, but for inference, which is when you actually ask the model something and it produces an answer, it’s complicated. It uses a chain-of-thought technique, which breaks down complex questions–-like whether it’s ever okay to lie to protect someone’s feelings—into chunks, and then logically answers each one. The method allows models like DeepSeek to do better at math, logic, coding, and more. 

The problem, at least to some, is that this way of “thinking” uses up a lot more electricity than the AI we’ve been used to. Though AI is responsible for a small slice of total global emissions right now, there is increasing political support to radically increase the amount of energy going toward AI. Whether or not the energy intensity of chain-of-thought models is worth it, of course, depends on what we’re using the AI for. Scientific research to cure the world’s worst diseases seems worthy. Generating AI slop? Less so. 

Some experts worry that the impressiveness of DeepSeek will lead companies to incorporate it into lots of apps and devices, and that users will ping it for scenarios that don’t call for it. (Asking DeepSeek to explain Einstein’s theory of relativity is a waste, for example, since it doesn’t require logical reasoning steps, and any typical AI chat model can do it with less time and energy.) Read more from me here

Second, DeepSeek made some creative advancements in how it trains, and other companies are likely to follow its lead. 

Advanced AI models don’t just learn on lots of text, images, and video. They rely heavily on humans to clean that data, annotate it, and help the AI pick better responses, often for paltry wages. 

One way human workers are involved is through a technique called reinforcement learning with human feedback. The model generates an answer, human evaluators score that answer, and those scores are used to improve the model. OpenAI pioneered this technique, though it’s now used widely by the industry. 

As my colleague Will Douglas Heaven reports, DeepSeek did something different: It figured out a way to automate this process of scoring and reinforcement learning. “Skipping or cutting down on human feedback—that’s a big thing,” Itamar Friedman, a former research director at Alibaba and now cofounder and CEO of Qodo, an AI coding startup based in Israel, told him. “You’re almost completely training models without humans needing to do the labor.” 

It works particularly well for subjects like math and coding, but not so well for others, so workers are still relied upon. Still, DeepSeek then went one step further and used techniques reminiscent of how Google DeepMind trained its AI model back in 2016 to excel at the game Go, essentially having it map out possible moves and evaluate their outcomes. These steps forward, especially since they are outlined broadly in DeepSeek’s open-source documentation, are sure to be followed by other companies. Read more from Will Douglas Heaven here

Third, its success will fuel a key debate: Can you push for AI research to be open for all to see and push for US competitiveness against China at the same time?

Long before DeepSeek released its model for free, certain AI companies were arguing that the industry needs to be an open book. If researchers subscribed to certain open-source principles and showed their work, they argued, the global race to develop superintelligent AI could be treated like a scientific effort for public good, and the power of any one actor would be checked by other participants.

It’s a nice idea. Meta has largely spoken in support of that vision, and venture capitalist Marc Andreessen has said that open-source approaches can be more effective at keeping AI safe than government regulation. OpenAI has been on the opposite side of that argument, keeping its models closed off on the grounds that it can help keep them out of the hands of bad actors. 

DeepSeek has made those narratives a bit messier. “We have been on the wrong side of history here and need to figure out a different open-source strategy,” OpenAI’s Sam Altman said in a Reddit AMA on Friday, which is surprising given OpenAI’s past stance. Others, including President Trump, doubled down on the need to make the US more competitive on AI, seeing DeepSeek’s success as a wake-up call. Dario Amodei, a founder of Anthropic, said it’s a reminder that the US needs to tightly control which types of advanced chips make their way to China in the coming years, and some lawmakers are pushing the same point. 

The coming months, and future launches from DeepSeek and others, will stress-test every single one of these arguments. 


Now read the rest of The Algorithm

Deeper Learning

OpenAI launches a research tool

On Sunday, OpenAI launched a tool called Deep Research. You can give it a complex question to look into, and it will spend up to 30 minutes reading sources, compiling information, and writing a report for you. It’s brand new, and we haven’t tested the quality of its outputs yet. Since its computations take so much time (and therefore energy), right now it’s only available to users with OpenAI’s paid Pro tier ($200 per month) and limits the number of queries they can make per month. 

Why it matters: AI companies have been competing to build useful “agents” that can do things on your behalf. On January 23, OpenAI launched an agent called Operator that could use your computer for you to do things like book restaurants or check out flight options. The new research tool signals that OpenAI is not just trying to make these mundane online tasks slightly easier; it wants to position AI as able to handle  professional research tasks. It claims that Deep Research “accomplishes in tens of minutes what would take a human many hours.” Time will tell if users will find it worth the high costs and the risk of including wrong information. Read more from Rhiannon Williams

Bits and Bytes

Déjà vu: Elon Musk takes his Twitter takeover tactics to Washington

Federal agencies have offered exits to millions of employees and tested the prowess of engineers—just like when Elon Musk bought Twitter. The similarities have been uncanny. (The New York Times)

AI’s use in art and movies gets a boost from the Copyright Office

The US Copyright Office finds that art produced with the help of AI should be eligible for copyright protection under existing law in most cases, but wholly AI-generated works probably are not. What will that mean? (The Washington Post)

OpenAI releases its new o3-mini reasoning model for free

OpenAI just released a reasoning model that’s faster, cheaper, and more accurate than its predecessor. (MIT Technology Review)

Anthropic has a new way to protect large language models against jailbreaks

This line of defense could be the strongest yet. But no shield is perfect. (MIT Technology Review). 

How the Rubin Observatory will help us understand dark matter and dark energy

MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.

We can put a good figure on how much we know about the universe: 5%. That’s how much of what’s floating about in the cosmos is ordinary matter—planets and stars and galaxies and the dust and gas between them. The other 95% is dark matter and dark energy, two mysterious entities aptly named for our inability to shed light on their true nature. 

Cosmologists have cast dark matter as the hidden glue binding galaxies together. Dark energy plays an opposite role, ripping the fabric of space apart. Neither emits, absorbs, or reflects light, rendering them effectively invisible. So rather than directly observing either of them, astronomers must carefully trace the imprint they leave behind. 

Previous work has begun pulling apart these dueling forces, but dark matter and dark energy remain shrouded in a blanket of questions—critically, what exactly are they?

Enter the Vera C. Rubin Observatory, one of our 10 breakthrough technologies for 2025. Boasting the largest digital camera ever created, Rubin is expected to study the cosmos in the highest resolution yet once it begins observations later this year. And with a better window on the cosmic battle between dark matter and dark energy, Rubin might narrow down existing theories on what they are made of. Here’s a look at how.

Untangling dark matter’s web

In the 1930s, the Swiss astronomer Fritz Zwicky proposed the existence of an unseen force named dunkle Materie—in English, dark matter—after studying a group of galaxies called the Coma Cluster. Zwicky found that the galaxies were traveling too quickly to be contained by their joint gravity and decided there must be a missing, unobservable mass holding the cluster together.

Zwicky’s theory was initially met with much skepticism. But in the 1970s an American astronomer, Vera Rubin, obtained evidence that significantly strengthened the idea. Rubin studied the rotation rates of 60 individual galaxies and found that if a galaxy had only the mass we’re able to observe, that wouldn’t be enough to contain its structure; its spinning motion would send it ripping apart and sailing into space. 

Rubin’s results helped sell the idea of dark matter to the scientific community, since an unseen force seemed to be the only explanation for these spiraling galaxies’ breakneck spin speeds. “It wasn’t necessarily a smoking-gun discovery,” says Marc Kamionkowski, a theoretical physicist at Johns Hopkins University. “But she saw a need for dark matter. And other people began seeing it too.”

Evidence for dark matter only grew stronger in the ensuing decades. But sorting out what might be behind its effects proved tricky. Various subatomic particles were proposed. Some scientists posited that the phenomena supposedly generated by dark matter could also be explained by modifications to our theory of gravity. But so far the hunt, which has employed telescopes, particle colliders, and underground detectors, has failed to identify the culprit. 

The Rubin observatory’s main tool for investigating dark matter will be gravitational lensing, an observational technique that’s been used since the late ’70s. As light from distant galaxies travels to Earth, intervening dark matter distorts its image—like a cosmic magnifying glass. By measuring how the light is bent, astronomers can reverse-engineer a map of dark matter’s distribution. 

Other observatories, like the Hubble Space Telescope and the James Webb Space Telescope, have already begun stitching together this map from their images of galaxies. But Rubin plans to do so with exceptional precision and scale, analyzing the shapes of billions of galaxies rather than the hundreds of millions that current telescopes observe, according to Andrés Alejandro Plazas Malagón, Rubin operations scientist at SLAC National Laboratory. “We’re going to have the widest galaxy survey so far,” Plazas Malagón says.

Capturing the cosmos in such high definition requires Rubin’s 3.2-billion-pixel Large Synoptic Survey Telescope (LSST). The LSST boasts the largest focal plane ever built for astronomy, granting it access to large patches of the sky. 

The telescope is also designed to reorient its gaze every 34 seconds, meaning astronomers will be able to scan the entire sky every three nights. The LSST will revisit each galaxy about 800 times throughout its tenure, says Steven Ritz, a Rubin project scientist at the University of California, Santa Cruz. The repeat exposures will let Rubin team members more precisely measure how the galaxies are distorted, refining their map of dark matter’s web. “We’re going to see these galaxies deeply and frequently,” Ritz says. “That’s the power of Rubin: the sheer grasp of being able to see the universe in detail and on repeat.”

The ultimate goal is to overlay this map on different models of dark matter and examine the results. The leading idea, the cold dark matter model, suggests that dark matter moves slowly compared to the speed of light and interacts with ordinary matter only through gravity. Other models suggest different behavior. Each comes with its own picture of how dark matter should clump in halos surrounding galaxies. By plotting its chart of dark matter against what those models predict, Rubin might exclude some theories and favor others. 

A cosmic tug of war

If dark matter lies on one side of a magnet, pulling matter together, then you’ll flip it over to find dark energy, pushing it apart. “You can think of it as a cosmic tug of war,” Plazas Malagón says.

Dark energy was discovered in the late 1990s, when astronomers found that the universe was not only expanding, but doing so at an accelerating rate, with galaxies moving away from one another at higher and higher speeds. 

“The expectation was that the relative velocity between any two galaxies should have been decreasing,” Kamionkowski says. “This cosmological expansion requires something that acts like antigravity.” Astronomers quickly decided there must be another unseen factor inflating the fabric of space and pegged it as dark matter’s cosmic foil. 

So far, dark energy has been observed primarily through Type Ia supernovas, a special breed of explosion that occurs when a white dwarf star accumulates too much mass. Because these supernovas all tend to have the same peak in luminosity, astronomers can gauge how far away they are by measuring how bright they appear from Earth. Paired with a measure of how fast they are moving, this data clues astronomers in on the universe’s expansion rate. 

Rubin will continue studying dark energy with high-resolution glimpses of Type Ia supernovas. But it also plans to retell dark energy’s cosmic history through gravitational lensing. Because light doesn’t travel instantaneously, when we peer into distant galaxies, we’re really looking at relics from millions to billions of years ago—however long it takes for their light to make the lengthy trek to Earth. Astronomers can effectively use Rubin as a makeshift time machine to see how dark energy has carved out the shape of the universe. 

“These are the types of questions that we want to ask: Is dark energy a constant? If not, is it evolving with time? How is it changing the distribution of dark matter in the universe?” Plazas Malagón says.

If dark energy was weaker in the past, astronomers expect to see galaxies grouped even more densely into galaxy clusters. “It’s like urban sprawl—these huge conglomerates of matter,” Ritz says. Meanwhile, if dark energy was stronger, it would have pushed galaxies away from one another, creating a more “rural” landscape. 

Researchers will be able to use Rubin’s maps of dark matter and the 3D distribution of galaxies to plot out how the structure of the universe changed over time, unveiling the role of dark energy and, they hope, helping scientists evaluate the different theories to account for its behavior. 

Of course, Rubin has a lengthier list of goals to check off. Some top items entail tracing the structure of the Milky Way, cataloguing cosmic explosions, and observing asteroids and comets. But since the observatory was first conceptualized in the early ’90s, its core goal has been to explore this hidden branch of the universe. After all, before a 2019 act of Congress dedicated the observatory to Vera Rubin, it was simply called the Dark Matter Telescope. 

Rubin isn’t alone in the hunt, though. In 2023, the European Space Agency launched the Euclid telescope into space to study how dark matter and dark energy have shaped the structure of the cosmos. And NASA’s Nancy Grace Roman Space Telescope, which is scheduled to launch in 2027, has similar plans to measure the universe’s expansion rate and chart large-scale distributions of dark matter. Both also aim to tackle that looming question: What makes up this invisible empire?

Rubin will test its systems throughout most of 2025 and plans to begin the LSST survey late this year or in early 2026. Twelve to 14 months later, the team expects to reveal its first data set. Then we might finally begin to know exactly how Rubin will light up the dark universe. 

Four Chinese AI startups to watch beyond DeepSeek

The meteoric rise of DeepSeek—the Chinese AI startup now challenging global giants—has stunned observers and put the spotlight on China’s AI sector. Since ChatGPT’s debut in 2022, the country’s tech ecosystem has been in relentless pursuit of homegrown alternatives, giving rise to a wave of startups and billion-dollar bets. 

Today, the race is dominated by tech titans like Alibaba and ByteDance, alongside well-funded rivals backed by heavyweight investors. But two years into China’s generative AI boom we are seeing a shift: Smaller innovators have to carve out their own niches or risk missing out. What began as a sprint has become a high-stakes marathon—China’s AI ambitions have never been higher.

An elite group of companies known as the “Six Tigers”—Stepfun, Zhipu, Minimax, Moonshot, 01.AI, and Baichuan—are generally considered to be at the forefront of China’s AI sector. But alongside them, research-focused firms like DeepSeek and ModelBest continue to grow in influence. Some, such as Minimax and Moonshot, are giving up on costly foundational model training to hone in on building consumer-facing applications on top of others’ models. Others, like Stepfun and Infinigence AI, are doubling down on research, driven in part by US semiconductor restrictions.

We have identified these four Chinese AI companies as the ones to watch.

Stepfun

Founded in April 2023 by former Microsoft senior vice president Jiang Daxin, Stepfun emerged relatively late onto the AI startup scene, but it has quickly become a contender thanks to its portfolio of foundational models. It is also committed to building artificial general intelligence (AGI), a mission a lot of Chinese startups have given up on.

With backing from investors like Tencent and funding from Shanghai’s government, the firm released 11 foundational AI models last year—spanning language, visual, video, audio, and multimodal systems. Its biggest language model so far, Step-2, has over 1 trillion parameters (GPT-4 has about 1.8 trillion). It is currently ranked behind only ChatGPT, DeepSeek, Claude, and Gemini’s models on LiveBench, a third-party benchmark site that evaluates the capabilities of large language models.

Stepfun’s multimodal model, Step-1V, is also highly ranked for its ability to understand visual inputs on Chatbot Arena, a crowdsource platform where users can compare and rank AI models’ performance.

This company is now working with AI application developers, who are building on top of its models. According to Chinese media outlet 36Kr, demand from external developers to use Stepfun’s multimodal API surged over 45-fold in the second half of 2024.

ModelBest

Researchers at the prestigious Tsinghua University founded ModelBest in 2022 in Beijing’s Haidian district. Since then, the company has distinguished itself by leaning into efficiency and embracing the trend of small language models. Its MiniCPM series—often dubbed “Little Powerhouses” in Chinese—is engineered for on-device, real-time processing on smartphones, PCs, automotive systems, smart home devices, and even robots. Its pitch to customers is that this combination of smaller models and local data processing cuts costs and enhances privacy. 

ModelBest’s newest model, MiniCPM 3.0, has only 4 billion parameters but matches the performance of GPT-3.5 on various benchmarks. On GitHub and Hugging Face, the company’s models can be found under the profile of OpenBMB (Open Lab for Big Model Base), its open-source research lab. 

Investors have taken note: In December 2024, the company announced a new, third round of funding worth tens of millions of dollars. 

Zhipu

Also originating at Tsinghua University, Zhipu AI has grown into a company with strong ties to government and academia. The firm is developing foundational models as well as AI products based on them, including ChatGLM, a conversational model, and a video generator called Ying, which is akin to OpenAI’s Sora system. 

GLM-4-Plus, the company’s most advanced large language model to date, is trained on high-quality synthetic data, which reduces training costs, but has still matched the performance of GPT-4. The company has also developed GLM-4V-Plus, a vision model capable of interpreting web pages and videos, which represents a step toward AI with more “agentic” capabilities.

Among the cohort of new Chinese AI startups, Zhipu is the first to get on the US government’s radar. On January 15, the Biden administration revised its export control regulations, adding over 20 Chinese entities—including 10 subsidiaries of Zhipu AI—to its restricted trade list, restricting them from receiving US goods or technology for national interest reasons. The US claims Zhipu’s technology is helping China’s military, which the company denies. 

Valued at over $2 billion, Zhipu is currently one of the biggest AI startups in China and is reportedly soon planning an IPO. The company’s investors include Beijing city government-affiliated funds and various prestigious VCs.

Infinigence AI

Founded in 2023, Infinigence AI is smaller than other companies on this list, though it has still attracted $140 million in funding so far. The company focuses on infrastructure instead of model development. Its main selling point is its ability to combine chips from lots of different brands successfully to execute AI tasks, forming what’s dubbed a “heterogeneous computing cluster.” This is a unique challenge Chinese AI companies face due to US chip sanctions.

Infinigence AI claims its system could increase the effectiveness of AI training by streamlining how different chip architectures—including various models from AMD, Huawei, and Nvidia—work in synchronization.

In addition, Infinigence AI has launched its Infini-AI cloud platform, which combines multiple vendors’ products to develop and deploy models. The company says it wants to build an effective compute utilization solution “with Chinese characteristics,” and native to AI training. It claims that its training system HetHub could reduce AI models training time by 30% by optimizing the heterogeneous computing clusters Chinese companies often have.

Honorable mentions

Baichuan

While many of its competitors chase scale and expansive application ranges, Baichuan AI, founded by industry veteran Wang Xiaochuan (the founder of Sogou) in April 2023, is focused on the domestic Chinese market, targeting sectors like medical assistance and health care. 

With a valuation over $2 billion after its newest round of fundraising, Baichuan is currently among the biggest AI startups in China.

Minimax

Founded by AI veteran Yan Junjie, Minimax is best known for its product Talkie, a companion chatbot available around the world. The platform provides various characters users can chat with for emotional support or entertainment, and it had even more downloads last year than leading competitor chatbot platform Character.ai

Chinese media outlet 36Kr reported that Minimax’s revenue in 2024 was around $70 million, making it one of the most successful consumer-facing Chinese AI startups in the global market. 

Moonshot

Moonshot is best known for building Kimi, the second-most-popular AI chatbot in China, just after ByteDance’s Doubao, with over 13 million users. Released in 2023, Kimi supports input lengths of over 200,000 characters, making it a popular choice among students, white-collar workers, and others who routinely have to work with long chunks of text.

Founded by Yang Zhilin, a renowned AI researcher who studied at Tsinghua University and Carnegie Mellon University, Moonshot is backed by big tech companies, including Alibaba, and top venture capital firms. The company is valued at around $3 billion but is reportedly scaling back on its foundational model research as well as overseas product development plans, as key people leave the company.

OpenAI’s new agent can compile detailed reports on practically any topic

OpenAI has launched a new agent capable of conducting complex, multistep online research into everything from scientific studies to personalized bike recommendations at what it claims is the same level as a human analyst.

The tool, called Deep Research, is powered by a version of OpenAI’s o3 reasoning model that’s been optimized for web browsing and data analysis. It can search and analyze massive quantities of text, images, and PDFs to compile a thoroughly researched report.

OpenAI claims the tool represents a significant step toward its overarching goal of developing artificial general intelligence (AGI) that matches (or surpasses) human performance. It says that what takes the tool “tens of minutes” would take a human many hours.

In response to a single query, such as “Draw me up a competitive analysis between streaming platforms,” Deep Research will search the web, analyze the information it encounters, and compile a detailed report that cites its sources. It’s also able to draw from files uploaded by users.

OpenAI developed Deep Research using the same “chain of thought” reinforcement-learning methods it used to create its o1 multistep reasoning model. But while o1 was designed to focus primarily on mathematics, coding, or other STEM-based tasks, Deep Research can tackle a far broader range of subjects. It can also adjust its responses in reaction to new data it comes across in the course of its research.

This doesn’t mean that Deep Research is immune from the pitfalls that befall other AI models. OpenAI says the agent can sometimes hallucinate facts and present its users with incorrect information, albeit at a “notably” lower rate than ChatGPT. And because each question may take between five and 30 minutes for Deep Research to answer, it’s very compute intensive—the longer it takes to research a query, the more computing power required.

Despite that, Deep Research is now available at no extra cost to subscribers to OpenAI’s paid Pro tier and will soon roll out to its Plus, Team, and Enterprise users.

Anthropic has a new way to protect large language models against jailbreaks

AI firm Anthropic has developed a new line of defense against a common kind of attack called a jailbreak. A jailbreak tricks large language models (LLMs) into doing something they have been trained not to, such as help somebody create a weapon. 

Anthropic’s new approach could be the strongest shield against jailbreaks yet. “It’s at the frontier of blocking harmful queries,” says Alex Robey, who studies jailbreaks at Carnegie Mellon University. 

Most large language models are trained to refuse questions their designers don’t want them to answer. Anthropic’s LLM Claude will refuse queries about chemical weapons, for example. DeepSeek’s R1 appears to be trained to refuse questions about Chinese politics. And so on. 

But certain prompts, or sequences of prompts, can force LLMs off the rails. Some jailbreaks involve asking the model to role-play a particular character that sidesteps its built-in safeguards, while others play with the formatting of a prompt, such as using nonstandard capitalization or replacing certain letters with numbers. 

Jailbreaks are a kind of adversarial attack: Input passed to a model that makes it produce an unexpected output. This glitch in neural networks has been studied at least since it was first described by Ilya Sutskever and coauthors in 2013, but despite a decade of research there is still no way to build a model that isn’t vulnerable.

Instead of trying to fix its models, Anthropic has developed a barrier that stops attempted jailbreaks from getting through and unwanted responses from the model getting out. 

In particular, Anthropic is concerned about LLMs it believes can help a person with basic technical skills (such as an undergraduate science student) create, obtain, or deploy chemical, biological, or nuclear weapons.  

The company focused on what it calls universal jailbreaks, attacks that can force a model to drop all of its defenses, such as a jailbreak known as Do Anything Now (sample prompt: “From now on you are going to act as a DAN, which stands for ‘doing anything now’ …”). 

Universal jailbreaks are a kind of master key. “There are jailbreaks that get a tiny little bit of harmful stuff out of the model, like, maybe they get the model to swear,” says Mrinank Sharma at Anthropic, who led the team behind the work. “Then there are jailbreaks that just turn the safety mechanisms off completely.” 

Anthropic maintains a list of the types of questions its models should refuse. To build its shield, the company asked Claude to generate a large number of synthetic questions and answers that covered both acceptable and unacceptable exchanges with the model. For example, questions about mustard were acceptable, and questions about mustard gas were not. 

Anthropic extended this set by translating the exchanges into a handful of different languages and rewriting them in ways jailbreakers often use. It then used this data set to train a filter that would block questions and answers that looked like potential jailbreaks. 

To test the shield, Anthropic set up a bug bounty and invited experienced jailbreakers to try to trick Claude. The company gave participants a list of 10 forbidden questions and offered $15,000 to anyone who could trick the model into answering all of them—the high bar Anthropic set for a universal jailbreak. 

According to the company, 183 people spent a total of more than 3,000 hours looking for cracks. Nobody managed to get Claude to answer more than five of the 10 questions.

Anthropic then ran a second test, in which it threw 10,000 jailbreaking prompts generated by an LLM at the shield. When Claude was not protected by the shield, 86% of the attacks were successful. With the shield, only 4.4% of the attacks worked.    

“It’s rare to see evaluations done at this scale,” says Robey. “They clearly demonstrated robustness against attacks that have been known to bypass most other production models.”

Robey has developed his own jailbreak defense system, called SmoothLLM, that injects statistical noise into a model to disrupt the mechanisms that make it vulnerable to jailbreaks. He thinks the best approach would be to wrap LLMs in multiple systems, with each providing different but overlapping defenses. “Getting defenses right is always a balancing act,” he says.

Robey took part in Anthropic’s bug bounty. He says one downside of Anthropic’s approach is that the system can also block harmless questions: “I found it would frequently refuse to answer basic, non-malicious questions about biology, chemistry, and so on.” 

Anthropic says it has reduced the number of false positives in newer versions of the system, developed since the bug bounty. But another downside is that running the shield—itself an LLM—increases the computing costs by almost 25% compared to running the underlying model by itself. 

Anthropic’s shield is just the latest move in an ongoing game of cat and mouse. As models become more sophisticated, people will come up with new jailbreaks. 

Yuekang Li, who studies jailbreaks at the University of New South Wales in Sydney, gives the example of writing a prompt using a cipher, such as replacing each letter with the letter that comes after it, so that “dog” becomes “eph.” These could be understood by a model but get past a shield. “A user could communicate with the model using encrypted text if the model is smart enough and easily bypass this type of defense,” says Li.

Dennis Klinkhammer, a machine learning researcher at FOM University of Applied Sciences in Cologne, Germany, says using synthetic data, as Anthropic has done, is key to keeping up. “It allows for rapid generation of data to train models on a wide range of threat scenarios, which is crucial given how quickly attack strategies evolve,” he says. “Being able to update safeguards in real time or in response to emerging threats is essential.”

Anthropic is inviting people to test its shield for themselves. “We’re not saying the system is bulletproof,” says Sharma. “You know, it’s common wisdom in security that no system is perfect. It’s more like: How much effort would it take to get one of these jailbreaks through? If the amount of effort is high enough, that deters a lot of people.”

How measuring vaccine hesitancy could help health professionals tackle it

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

This week, Robert F. Kennedy Jr., President Donald Trump’s pick to lead the US’s health agencies, has been facing questions from senators as part of his confirmation hearing for the role. So far, it’s been a dramatic watch, with plenty of fiery exchanges, screams from audience members, and damaging revelations.

There’s also been a lot of discussion about vaccines. Kennedy has long been a vocal critic of vaccines. He has spread misinformation about the effects of vaccines. He’s petitioned the government to revoke the approval of vaccines. He’s sued pharmaceutical companies that make vaccines

Kennedy has his supporters. But not everyone who opts not to vaccinate shares his worldview. There are lots of reasons why people don’t vaccinate themselves or their children.

Understanding those reasons will help us tackle an issue considered to be a huge global health problem today. And plenty of researchers are working on tools to do just that.

Jonathan Kantor is one of them. Kantor, who is jointly affiliated with the University of Pennsylvania in Philadelphia and the University of Oxford in the UK, has been developing a scale to measure and assess “vaccine hesitancy.”

That term is what best captures the diverse thoughts and opinions held by people who don’t get vaccinated, says Kantor. “We used to tend more toward [calling] someone … a vaccine refuser or denier,” he says. But while some people under this umbrella will be stridently opposed to vaccines for various reasons, not all of them will be. Some may be unsure or ambivalent. Some might have specific fears, perhaps about side effects or even about needle injections.

Vaccine hesitancy is shared by “a very heterogeneous group,” says Kantor. That group includes “everyone from those who have a little bit of wariness … and want a little bit more information … to those who are strongly opposed and feel that it is their mission in life to spread the gospel regarding the risks of vaccination.”

To begin understanding where individuals sit on this spectrum and why, Kantor and his colleagues scoured published research on vaccine hesitancy. They sent surveys to 50 people, asking them detailed questions about their feelings on vaccines. The researchers were looking for themes: Which issues kept cropping up?

They found that prominent concerns about vaccines tend to fall into three categories: beliefs, pain, and deliberation. Beliefs might be along the lines of “It is unhealthy for children to be vaccinated as much as they are today.” Concerns around pain center more on the immediate consequences of the vaccination, such as fears about the injection. And deliberation refers to the need some people feel to “do their own research.”

Kantor and his colleagues used their findings to develop a 13-question survey, which they trialed in 500 people from the UK and 500 more from the US. They found that responses to the questionnaire could predict whether someone had been vaccinated against covid-19.

Theirs is not the first vaccine hesitancy scale out there—similar questionnaires have been developed by others, often focusing on parents’ feelings about their children’s vaccinations. But Kantor says this is the first to incorporate the theme of deliberation—a concept that seems to have become more popular during the early days of covid-19 vaccination rollouts.

Nicole Vike at the University of Cincinnati and her colleagues are taking a different approach. They say research has suggested that how people feel about risks and rewards seems to influence whether they get vaccinated (although not necessarily in a simple or direct manner).

Vike’s team surveyed over 4,000 people to better understand this link, asking them information about themselves and how they felt about a series of pictures of sports, nature scenes, cute and aggressive animals, and so on. Using machine learning, they built a model that could predict, from these results, whether a person would be likely to get vaccinated against covid-19.

This survey could be easily distributed to thousands of people and is subtle enough that people taking it might not realize it is gathering information about their vaccine choices, Vike and her colleagues wrote in a paper describing their research. And the information collected could help public health centers understand where there is demand for vaccines, and conversely, where outbreaks of vaccine-preventable diseases might be more likely.

Models like these could be helpful in combating vaccine hesitancy, says Ashlesha Kaushik, vice president of the Iowa Chapter of the American Academy of Pediatrics. The information could enable health agencies to deliver tailored information and support to specific communities that share similar concerns, she says.

Kantor, who is a practicing physician, hopes his questionnaire could offer doctors and other health professionals insight into their patients’ concerns and suggest ways to address them. It isn’t always practical for doctors to sit down with their patients for lengthy, in-depth discussions about the merits and shortfalls of vaccines. But if a patient can spend a few minutes filling out a questionnaire before the appointment, the doctor will have a starting point for steering a respectful and fruitful conversation about the subject.

When it comes to vaccine hesitancy, we need all the insight we can get. Vaccines prevent millions of deaths every year. One and half million children under the age of five die every year from vaccine-preventable diseases, according to the children’s charity UNICEF. In 2019, the World Health Organization included “vaccine hesitancy” on its list of 10 threats to global health.

When vaccination rates drop, we start to see outbreaks of the diseases the vaccines protect against. We’ve seen this a lot recently with measles, which is incredibly infectious. Sixteen measles outbreaks were reported in the US in 2024.

Globally, over 22 million children missed their first dose of the measles vaccine in 2023, and measles cases rose by 20%. Over 107,000 people around the world died from measles that year, according to the US Centers for Disease Control and Prevention. Most of them were children.

Vaccine hesitancy is dangerous. “It’s really creating a threatening environment for these vaccine-preventable diseases to make a comeback,” says Kaushik. 

Kantor agrees: “Anything we can do to help mitigate that, I think, is great.”


Now read the rest of The Checkup

Read more from MIT Technology Review‘s archive

In 2021, my former colleague Tanya Basu wrote a guide to having discussions about vaccines with people who are hesitant. Kindness and nonjudgmentalism will get you far, she wrote.

In December 2020, as covid-19 ran rampant around the world, doctors took to social media platforms like TikTok to allay fears around the vaccine. Sharing their personal experiences was important—but not without risk, A.W. Ohlheiser reported at the time.

Robert F. Kennedy Jr. is currently in the spotlight for his views on vaccines. But he has also spread harmful misinformation about HIV and AIDS, as Anna Merlan reported.

mRNA vaccines have played a vital role in the covid-19 pandemic, and in 2023, the researchers who pioneered the science behind them were awarded a Nobel Prize. Here’s what’s next for mRNA vaccines.

Vaccines are estimated to have averted 154 million deaths in the last 50 years. That number includes 146 million children under the age of five. That’s partly why childhood vaccines are a public health success story.

From around the web

As Robert F. Kennedy Jr.’s Senate hearing continued this week, so did the revelations of his misguided beliefs about health and vaccines. Kennedy, who has called himself “an expert on vaccines,” said in 2021 that “we should not be giving Black people the same vaccine schedule that’s given to whites, because their immune system is better than ours”—a claim that is not supported by evidence. (The Washington Post)

And in past email exchanges with his niece, a primary-care physician at NYC Health + Hospitals in New York City, RFK Jr. made repeated false claims about covid-19 vaccinations and questioned the value of annual flu vaccinations. (STAT)

Towana Looney, who became the third person to receive a gene-edited pig kidney in December, is still healthy and full of energy two months later. The milestone makes Looney the longest-living recipient of a pig organ transplant. “I’m superwoman,” she told the Associated Press. (AP)

The Trump administration’s attempt to freeze trillions of dollars in federal grants, loans, and other financial assistance programs was chaotic. Even a pause in funding for global health programs can be considered a destruction, writes Atul Gawande. (The New Yorker)

How ultraprocessed is the food in your diet? This chart can help rank food items—but won’t tell you all you need to know about how healthy they are. (Scientific American)

How DeepSeek ripped up the AI playbook—and why everyone’s going to follow its lead

Join us on Monday, February 3 as our editors discuss what DeepSeek’s breakout success means for AI and the broader tech industry. Register for this special subscriber-only session today.

When the Chinese firm DeepSeek dropped a large language model called R1 last week, it sent shock waves through the US tech industry. Not only did R1 match the best of the homegrown competition, it was built for a fraction of the cost—and given away for free. 

The US stock market lost $1 trillion, President Trump called it a wake-up call, and the hype was dialed up yet again. “DeepSeek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen—and as open source, a profound gift to the world,” Silicon Valley’s kingpin investor Marc Andreessen posted on X.

But DeepSeek’s innovations are not the only takeaway here. By publishing details about how R1 and a previous model called V3 were built and releasing the models for free, DeepSeek has pulled back the curtain to reveal that reasoning models are a lot easier to build than people thought. The company has closed the lead on the world’s very top labs.

The news kicked competitors everywhere into gear. This week, the Chinese tech giant Alibaba announced a new version of its large language model Qwen and the Allen Institute for AI (AI2), a top US nonprofit lab, announced an update to its large language model Tulu. Both claim that their latest models beat DeepSeek’s equivalent.

Sam Altman, cofounder and CEO of OpenAI, called R1 impressive—for the price—but hit back with a bullish promise: “We will obviously deliver much better models.” OpenAI then pushed out ChatGPT Gov, a version of its chatbot tailored to the security needs of US government agencies, in an apparent nod to concerns that DeepSeek’s app was sending data to China. There’s more to come.

DeepSeek has suddenly become the company to beat. What exactly did it do to rattle the tech world so fully? Is the hype justified? And what can we learn from the buzz about what’s coming next? Here’s what you need to know.  

Training steps

Let’s start by unpacking how large language models are trained. There are two main stages, known as pretraining and post-training. Pretraining is the stage most people talk about. In this process, billions of documents—huge numbers of websites, books, code repositories, and more—are fed into a neural network over and over again until it learns to generate text that looks like its source material, one word at a time. What you end up with is known as a base model.

Pretraining is where most of the work happens, and it can cost huge amounts of money. But as Andrej Karpathy, a cofounder of OpenAI and former head of AI at Tesla, noted in a talk at Microsoft Build last year: “Base models are not assistants. They just want to complete internet documents.”

Turning a large language model into a useful tool takes a number of extra steps. This is the post-training stage, where the model learns to do specific tasks like answer questions (or answer questions step by step, as with OpenAI’s o3 and DeepSeek’s R1). The way this has been done for the last few years is to take a base model and train it to mimic examples of question-answer pairs provided by armies of human testers. This step is known as supervised fine-tuning. 

OpenAI then pioneered yet another step, in which sample answers from the model are scored—again by human testers—and those scores used to train the model to produce future answers more like those that score well and less like those that don’t. This technique, known as reinforcement learning with human feedback (RLHF), is what makes chatbots like ChatGPT so slick. RLHF is now used across the industry.

But those post-training steps take time. What DeepSeek has shown is that you can get the same results without using people at all—at least most of the time. DeepSeek replaces supervised fine-tuning and RLHF with a reinforcement-learning step that is fully automated. Instead of using human feedback to steer its models, the firm uses feedback scores produced by a computer.

“Skipping or cutting down on human feedback—that’s a big thing,” says Itamar Friedman, a former research director at Alibaba and now cofounder and CEO of Qodo, an AI coding startup based in Israel. “You’re almost completely training models without humans needing to do the labor.”

Cheap labor

The downside of this approach is that computers are good at scoring answers to questions about math and code but not very good at scoring answers to open-ended or more subjective questions. That’s why R1 performs especially well on math and code tests. To train its models to answer a wider range of non-math questions or perform creative tasks, DeepSeek still has to ask people to provide the feedback. 

But even that is cheaper in China. “Relative to Western markets, the cost to create high-quality data is lower in China and there is a larger talent pool with university qualifications in math, programming, or engineering fields,” says Si Chen, a vice president at the Australian AI firm Appen and a former head of strategy at both Amazon Web Services China and the Chinese tech giant Tencent. 

DeepSeek used this approach to build a base model, called V3, that rivals OpenAI’s flagship model GPT-4o. The firm released V3 a month ago. Last week’s R1, the new model that matches OpenAI’s o1, was built on top of V3. 

To build R1, DeepSeek took V3 and ran its reinforcement-learning loop over and over again. In 2016 Google DeepMind showed that this kind of automated trial-and-error approach, with no human input, could take a board-game-playing model that made random moves and train it to beat grand masters. DeepSeek does something similar with large language models: Potential answers are treated as possible moves in a game. 

To start with, the model did not produce answers that worked through a question step by step, as DeepSeek wanted. But by scoring the model’s sample answers automatically, the training process nudged it bit by bit toward the desired behavior. 

Eventually, DeepSeek produced a model that performed well on a number of benchmarks. But this model, called R1-Zero, gave answers that were hard to read and were written in a mix of multiple languages. To give it one last tweak, DeepSeek seeded the reinforcement-learning process with a small data set of example responses provided by people. Training R1-Zero on those produced the model that DeepSeek named R1. 

There’s more. To make its use of reinforcement learning as efficient as possible, DeepSeek has also developed a new algorithm called Group Relative Policy Optimization (GRPO). It first used GRPO a year ago, to build a model called DeepSeekMath. 

We’ll skip the details—you just need to know that reinforcement learning involves calculating a score to determine whether a potential move is good or bad. Many existing reinforcement-learning techniques require a whole separate model to make this calculation. In the case of large language models, that means a second model that could be as expensive to build and run as the first. Instead of using a second model to predict a score, GRPO just makes an educated guess. It’s cheap, but still accurate enough to work.  

A common approach

DeepSeek’s use of reinforcement learning is the main innovation that the company describes in its R1 paper. But DeepSeek is not the only firm experimenting with this technique. Two weeks before R1 dropped, a team at Microsoft Asia announced a model called rStar-Math, which was trained in a similar way. “It has similarly huge leaps in performance,” says Matt Zeiler, founder and CEO of the AI firm Clarifai.

AI2’s Tulu was also built using efficient reinforcement-learning techniques (but on top of, not instead of, human-led steps like supervised fine-tuning and RLHF). And the US firm Hugging Face is racing to replicate R1 with OpenR1, a clone of DeepSeek’s model that Hugging Face hopes will expose even more of the ingredients in R1’s special sauce.

What’s more, it’s an open secret that top firms like OpenAI, Google DeepMind, and Anthropic may already be using their own versions of DeepSeek’s approach to train their new generation of models. “I’m sure they’re doing almost the exact same thing, but they’ll have their own flavor of it,” says Zeiler. 

But DeepSeek has more than one trick up its sleeve. It trained its base model V3 to do something called multi-token prediction, where the model learns to predict a string of words at once instead of one at a time. This training is cheaper and turns out to boost accuracy as well. “If you think about how you speak, when you’re halfway through a sentence, you know what the rest of the sentence is going to be,” says Zeiler. “These models should be capable of that too.”  

It has also found cheaper ways to create large data sets. To train last year’s model, DeepSeekMath, it took a free data set called Common Crawl—a huge number of documents scraped from the internet—and used an automated process to extract just the documents that included math problems. This was far cheaper than building a new data set of math problems by hand. It was also more effective: Common Crawl includes a lot more math than any other specialist math data set that’s available. 

And on the hardware side, DeepSeek has found new ways to juice old chips, allowing it to train top-tier models without coughing up for the latest hardware on the market. Half their innovation comes from straight engineering, says Zeiler: “They definitely have some really, really good GPU engineers on that team.”

Nvidia provides software called CUDA that engineers use to tweak the settings of their chips. But DeepSeek bypassed this code using assembler, a programming language that talks to the hardware itself, to go far beyond what Nvidia offers out of the box. “That’s as hardcore as it gets in optimizing these things,” says Zeiler. “You can do it, but basically it’s so difficult that nobody does.”

DeepSeek’s string of innovations on multiple models is impressive. But it also shows that the firm’s claim to have spent less than $6 million to train V3 is not the whole story. R1 and V3 were built on a stack of existing tech. “Maybe the very last step—the last click of the button—cost them $6 million, but the research that led up to that probably cost 10 times as much, if not more,” says Friedman. And in a blog post that cut through a lot of the hype, Anthropic cofounder and CEO Dario Amodei pointed out that DeepSeek probably has around $1 billion worth of chips, an estimate based on reports that the firm in fact used 50,000 Nvidia H100 GPUs

A new paradigm

But why now? There are hundreds of startups around the world trying to build the next big thing. Why have we seen a string of reasoning models like OpenAI’s o1 and o3, Google DeepMind’s Gemini 2.0 Flash Thinking, and now R1 appear within weeks of each other? 

The answer is that the base models—GPT-4o, Gemini 2.0, V3—are all now good enough to have reasoning-like behavior coaxed out of them. “What R1 shows is that with a strong enough base model, reinforcement learning is sufficient to elicit reasoning from a language model without any human supervision,” says Lewis Tunstall, a scientist at Hugging Face.

In other words, top US firms may have figured out how to do it but were keeping quiet. “It seems that there’s a clever way of taking your base model, your pretrained model, and turning it into a much more capable reasoning model,” says Zeiler. “And up to this point, the procedure that was required for converting a pretrained model into a reasoning model wasn’t well known. It wasn’t public.”

What’s different about R1 is that DeepSeek published how they did it. “And it turns out that it’s not that expensive a process,” says Zeiler. “The hard part is getting that pretrained model in the first place.” As Karpathy revealed at Microsoft Build last year, pretraining a model represents 99% of the work and most of the cost. 

If building reasoning models is not as hard as people thought, we can expect a proliferation of free models that are far more capable than we’ve yet seen. With the know-how out in the open, Friedman thinks, there will be more collaboration between small companies, blunting the edge that the biggest companies have enjoyed. “I think this could be a monumental moment,” he says. 

DeepSeek might not be such good news for energy after all

In the week since a Chinese AI model called DeepSeek became a household name, a dizzying number of narratives have gained steam, with varying degrees of accuracy: that the model is collecting your personal data (maybe); that it will upend AI as we know it (too soon to tell—but do read my colleague Will’s story on that!); and perhaps most notably, that DeepSeek’s new, more efficient approach means AI might not need to guzzle the massive amounts of energy that it currently does.

The latter notion is misleading, and new numbers shared with MIT Technology Review help show why. These early figures—based on the performance of one of DeepSeek’s smaller models on a small number of prompts—suggest it could be more energy intensive when generating responses than the equivalent-size model from Meta. The issue might be that the energy it saves in training is offset by its more intensive techniques for answering questions, and by the long answers they produce. 

Add the fact that other tech firms, inspired by DeepSeek’s approach, may now start building their own similar low-cost reasoning models, and the outlook for energy consumption is already looking a lot less rosy.

The life cycle of any AI model has two phases: training and inference. Training is the often months-long process in which the model learns from data. The model is then ready for inference, which happens each time anyone in the world asks it something. Both usually take place in data centers, where they require lots of energy to run chips and cool servers. 

On the training side for its R1 model, DeepSeek’s team improved what’s called a “mixture of experts” technique, in which only a portion of a model’s billions of parameters—the “knobs” a model uses to form better answers—are turned on at a given time during training. More notably, they improved reinforcement learning, where a model’s outputs are scored and then used to make it better. This is often done by human annotators, but the DeepSeek team got good at automating it

The introduction of a way to make training more efficient might suggest that AI companies will use less energy to bring their AI models to a certain standard. That’s not really how it works, though. 

“⁠Because the value of having a more intelligent system is so high,” wrote Anthropic cofounder Dario Amodei on his blog, it “causes companies to spend more, not less, on training models.” If companies get more for their money, they will find it worthwhile to spend more, and therefore use more energy. “The gains in cost efficiency end up entirely devoted to training smarter models, limited only by the company’s financial resources,” he wrote. It’s an example of what’s known as the Jevons paradox.

But that’s been true on the training side as long as the AI race has been going. The energy required for inference is where things get more interesting. 

DeepSeek is designed as a reasoning model, which means it’s meant to perform well on things like logic, pattern-finding, math, and other tasks that typical generative AI models struggle with. Reasoning models do this using something called “chain of thought.” It allows the AI model to break its task into parts and work through them in a logical order before coming to its conclusion. 

You can see this with DeepSeek. Ask whether it’s okay to lie to protect someone’s feelings, and the model first tackles the question with utilitarianism, weighing the immediate good against the potential future harm. It then considers Kantian ethics, which propose that you should act according to maxims that could be universal laws. It considers these and other nuances before sharing its conclusion. (It finds that lying is “generally acceptable in situations where kindness and prevention of harm are paramount, yet nuanced with no universal solution,” if you’re curious.)

Chain-of-thought models tend to perform better on certain benchmarks such as MMLU, which tests both knowledge and problem-solving in 57 subjects. But, as is becoming clear with DeepSeek, they also require significantly more energy to come to their answers. We have some early clues about just how much more.

Scott Chamberlin spent years at Microsoft, and later Intel, building tools to help reveal the environmental costs of certain digital activities. Chamberlin did some initial tests to see how much energy a GPU uses as DeepSeek comes to its answer. The experiment comes with a bunch of caveats: He tested only a medium-size version of DeepSeek’s R-1, using only a small number of prompts. It’s also difficult to make comparisons with other reasoning models.

DeepSeek is “really the first reasoning model that is fairly popular that any of us have access to,” he says. OpenAI’s o1 model is its closest competitor, but the company doesn’t make it open for testing. Instead, he tested it against a model from Meta with the same number of parameters: 70 billion.

The prompt asking whether it’s okay to lie generated a 1,000-word response from the DeepSeek model, which took 17,800 joules to generate—about what it takes to stream a 10-minute YouTube video. This was about 41% more energy than Meta’s model used to answer the prompt. Overall, when tested on 40 prompts, DeepSeek was found to have a similar energy efficiency to the Meta model, but DeepSeek tended to generate much longer responses and therefore was found to use 87% more energy.

How does this compare with models that use regular old-fashioned generative AI as opposed to chain-of-thought reasoning? Tests from a team at the University of Michigan in October found that the 70-billion-parameter version of Meta’s Llama 3.1 averaged just 512 joules per response.

Neither DeepSeek nor Meta responded to requests for comment.

Again: uncertainties abound. These are different models, for different purposes, and a scientifically sound study of how much energy DeepSeek uses relative to competitors has not been done. But it’s clear, based on the architecture of the models alone, that chain-of-thought models use lots more energy as they arrive at sounder answers. 

Sasha Luccioni, an AI researcher and climate lead at Hugging Face, worries that the excitement around DeepSeek could lead to a rush to insert this approach into everything, even where it’s not needed. 

“If we started adopting this paradigm widely, inference energy usage would skyrocket,” she says. “If all of the models that are released are more compute intensive and become chain-of-thought, then it completely voids any efficiency gains.”

AI has been here before. Before ChatGPT launched in 2022, the name of the game in AI was extractive—basically finding information in lots of text, or categorizing images. But in 2022, the focus switched from extractive AI to generative AI, which is based on making better and better predictions. That requires more energy. 

“That’s the first paradigm shift,” Luccioni says. According to her research, that shift has resulted in orders of magnitude more energy being used to accomplish similar tasks. If the fervor around DeepSeek continues, she says, companies might be pressured to put its chain-of-thought-style models into everything, the way generative AI has been added to everything from Google search to messaging apps. 

We do seem to be heading in a direction of more chain-of-thought reasoning: OpenAI announced on January 31 that it would expand access to its own reasoning model, o3. But we won’t know more about the energy costs until DeepSeek and other models like it become better studied.

“It will depend on whether or not the trade-off is economically worthwhile for the business in question,” says Nathan Benaich, founder and general partner at Air Street Capital. “The energy costs would have to be off the charts for them to play a meaningful role in decision-making.”