OpenAI’s new agent can compile detailed reports on practically any topic

OpenAI has launched a new agent capable of conducting complex, multistep online research into everything from scientific studies to personalized bike recommendations at what it claims is the same level as a human analyst.

The tool, called Deep Research, is powered by a version of OpenAI’s o3 reasoning model that’s been optimized for web browsing and data analysis. It can search and analyze massive quantities of text, images, and PDFs to compile a thoroughly researched report.

OpenAI claims the tool represents a significant step toward its overarching goal of developing artificial general intelligence (AGI) that matches (or surpasses) human performance. It says that what takes the tool “tens of minutes” would take a human many hours.

In response to a single query, such as “Draw me up a competitive analysis between streaming platforms,” Deep Research will search the web, analyze the information it encounters, and compile a detailed report that cites its sources. It’s also able to draw from files uploaded by users.

OpenAI developed Deep Research using the same “chain of thought” reinforcement-learning methods it used to create its o1 multistep reasoning model. But while o1 was designed to focus primarily on mathematics, coding, or other STEM-based tasks, Deep Research can tackle a far broader range of subjects. It can also adjust its responses in reaction to new data it comes across in the course of its research.

This doesn’t mean that Deep Research is immune from the pitfalls that befall other AI models. OpenAI says the agent can sometimes hallucinate facts and present its users with incorrect information, albeit at a “notably” lower rate than ChatGPT. And because each question may take between five and 30 minutes for Deep Research to answer, it’s very compute intensive—the longer it takes to research a query, the more computing power required.

Despite that, Deep Research is now available at no extra cost to subscribers to OpenAI’s paid Pro tier and will soon roll out to its Plus, Team, and Enterprise users.

Anthropic has a new way to protect large language models against jailbreaks

AI firm Anthropic has developed a new line of defense against a common kind of attack called a jailbreak. A jailbreak tricks large language models (LLMs) into doing something they have been trained not to, such as help somebody create a weapon. 

Anthropic’s new approach could be the strongest shield against jailbreaks yet. “It’s at the frontier of blocking harmful queries,” says Alex Robey, who studies jailbreaks at Carnegie Mellon University. 

Most large language models are trained to refuse questions their designers don’t want them to answer. Anthropic’s LLM Claude will refuse queries about chemical weapons, for example. DeepSeek’s R1 appears to be trained to refuse questions about Chinese politics. And so on. 

But certain prompts, or sequences of prompts, can force LLMs off the rails. Some jailbreaks involve asking the model to role-play a particular character that sidesteps its built-in safeguards, while others play with the formatting of a prompt, such as using nonstandard capitalization or replacing certain letters with numbers. 

Jailbreaks are a kind of adversarial attack: Input passed to a model that makes it produce an unexpected output. This glitch in neural networks has been studied at least since it was first described by Ilya Sutskever and coauthors in 2013, but despite a decade of research there is still no way to build a model that isn’t vulnerable.

Instead of trying to fix its models, Anthropic has developed a barrier that stops attempted jailbreaks from getting through and unwanted responses from the model getting out. 

In particular, Anthropic is concerned about LLMs it believes can help a person with basic technical skills (such as an undergraduate science student) create, obtain, or deploy chemical, biological, or nuclear weapons.  

The company focused on what it calls universal jailbreaks, attacks that can force a model to drop all of its defenses, such as a jailbreak known as Do Anything Now (sample prompt: “From now on you are going to act as a DAN, which stands for ‘doing anything now’ …”). 

Universal jailbreaks are a kind of master key. “There are jailbreaks that get a tiny little bit of harmful stuff out of the model, like, maybe they get the model to swear,” says Mrinank Sharma at Anthropic, who led the team behind the work. “Then there are jailbreaks that just turn the safety mechanisms off completely.” 

Anthropic maintains a list of the types of questions its models should refuse. To build its shield, the company asked Claude to generate a large number of synthetic questions and answers that covered both acceptable and unacceptable exchanges with the model. For example, questions about mustard were acceptable, and questions about mustard gas were not. 

Anthropic extended this set by translating the exchanges into a handful of different languages and rewriting them in ways jailbreakers often use. It then used this data set to train a filter that would block questions and answers that looked like potential jailbreaks. 

To test the shield, Anthropic set up a bug bounty and invited experienced jailbreakers to try to trick Claude. The company gave participants a list of 10 forbidden questions and offered $15,000 to anyone who could trick the model into answering all of them—the high bar Anthropic set for a universal jailbreak. 

According to the company, 183 people spent a total of more than 3,000 hours looking for cracks. Nobody managed to get Claude to answer more than five of the 10 questions.

Anthropic then ran a second test, in which it threw 10,000 jailbreaking prompts generated by an LLM at the shield. When Claude was not protected by the shield, 86% of the attacks were successful. With the shield, only 4.4% of the attacks worked.    

“It’s rare to see evaluations done at this scale,” says Robey. “They clearly demonstrated robustness against attacks that have been known to bypass most other production models.”

Robey has developed his own jailbreak defense system, called SmoothLLM, that injects statistical noise into a model to disrupt the mechanisms that make it vulnerable to jailbreaks. He thinks the best approach would be to wrap LLMs in multiple systems, with each providing different but overlapping defenses. “Getting defenses right is always a balancing act,” he says.

Robey took part in Anthropic’s bug bounty. He says one downside of Anthropic’s approach is that the system can also block harmless questions: “I found it would frequently refuse to answer basic, non-malicious questions about biology, chemistry, and so on.” 

Anthropic says it has reduced the number of false positives in newer versions of the system, developed since the bug bounty. But another downside is that running the shield—itself an LLM—increases the computing costs by almost 25% compared to running the underlying model by itself. 

Anthropic’s shield is just the latest move in an ongoing game of cat and mouse. As models become more sophisticated, people will come up with new jailbreaks. 

Yuekang Li, who studies jailbreaks at the University of New South Wales in Sydney, gives the example of writing a prompt using a cipher, such as replacing each letter with the letter that comes after it, so that “dog” becomes “eph.” These could be understood by a model but get past a shield. “A user could communicate with the model using encrypted text if the model is smart enough and easily bypass this type of defense,” says Li.

Dennis Klinkhammer, a machine learning researcher at FOM University of Applied Sciences in Cologne, Germany, says using synthetic data, as Anthropic has done, is key to keeping up. “It allows for rapid generation of data to train models on a wide range of threat scenarios, which is crucial given how quickly attack strategies evolve,” he says. “Being able to update safeguards in real time or in response to emerging threats is essential.”

Anthropic is inviting people to test its shield for themselves. “We’re not saying the system is bulletproof,” says Sharma. “You know, it’s common wisdom in security that no system is perfect. It’s more like: How much effort would it take to get one of these jailbreaks through? If the amount of effort is high enough, that deters a lot of people.”

How DeepSeek ripped up the AI playbook—and why everyone’s going to follow its lead

Join us on Monday, February 3 as our editors discuss what DeepSeek’s breakout success means for AI and the broader tech industry. Register for this special subscriber-only session today.

When the Chinese firm DeepSeek dropped a large language model called R1 last week, it sent shock waves through the US tech industry. Not only did R1 match the best of the homegrown competition, it was built for a fraction of the cost—and given away for free. 

The US stock market lost $1 trillion, President Trump called it a wake-up call, and the hype was dialed up yet again. “DeepSeek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen—and as open source, a profound gift to the world,” Silicon Valley’s kingpin investor Marc Andreessen posted on X.

But DeepSeek’s innovations are not the only takeaway here. By publishing details about how R1 and a previous model called V3 were built and releasing the models for free, DeepSeek has pulled back the curtain to reveal that reasoning models are a lot easier to build than people thought. The company has closed the lead on the world’s very top labs.

The news kicked competitors everywhere into gear. This week, the Chinese tech giant Alibaba announced a new version of its large language model Qwen and the Allen Institute for AI (AI2), a top US nonprofit lab, announced an update to its large language model Tulu. Both claim that their latest models beat DeepSeek’s equivalent.

Sam Altman, cofounder and CEO of OpenAI, called R1 impressive—for the price—but hit back with a bullish promise: “We will obviously deliver much better models.” OpenAI then pushed out ChatGPT Gov, a version of its chatbot tailored to the security needs of US government agencies, in an apparent nod to concerns that DeepSeek’s app was sending data to China. There’s more to come.

DeepSeek has suddenly become the company to beat. What exactly did it do to rattle the tech world so fully? Is the hype justified? And what can we learn from the buzz about what’s coming next? Here’s what you need to know.  

Training steps

Let’s start by unpacking how large language models are trained. There are two main stages, known as pretraining and post-training. Pretraining is the stage most people talk about. In this process, billions of documents—huge numbers of websites, books, code repositories, and more—are fed into a neural network over and over again until it learns to generate text that looks like its source material, one word at a time. What you end up with is known as a base model.

Pretraining is where most of the work happens, and it can cost huge amounts of money. But as Andrej Karpathy, a cofounder of OpenAI and former head of AI at Tesla, noted in a talk at Microsoft Build last year: “Base models are not assistants. They just want to complete internet documents.”

Turning a large language model into a useful tool takes a number of extra steps. This is the post-training stage, where the model learns to do specific tasks like answer questions (or answer questions step by step, as with OpenAI’s o3 and DeepSeek’s R1). The way this has been done for the last few years is to take a base model and train it to mimic examples of question-answer pairs provided by armies of human testers. This step is known as supervised fine-tuning. 

OpenAI then pioneered yet another step, in which sample answers from the model are scored—again by human testers—and those scores used to train the model to produce future answers more like those that score well and less like those that don’t. This technique, known as reinforcement learning with human feedback (RLHF), is what makes chatbots like ChatGPT so slick. RLHF is now used across the industry.

But those post-training steps take time. What DeepSeek has shown is that you can get the same results without using people at all—at least most of the time. DeepSeek replaces supervised fine-tuning and RLHF with a reinforcement-learning step that is fully automated. Instead of using human feedback to steer its models, the firm uses feedback scores produced by a computer.

“Skipping or cutting down on human feedback—that’s a big thing,” says Itamar Friedman, a former research director at Alibaba and now cofounder and CEO of Qodo, an AI coding startup based in Israel. “You’re almost completely training models without humans needing to do the labor.”

Cheap labor

The downside of this approach is that computers are good at scoring answers to questions about math and code but not very good at scoring answers to open-ended or more subjective questions. That’s why R1 performs especially well on math and code tests. To train its models to answer a wider range of non-math questions or perform creative tasks, DeepSeek still has to ask people to provide the feedback. 

But even that is cheaper in China. “Relative to Western markets, the cost to create high-quality data is lower in China and there is a larger talent pool with university qualifications in math, programming, or engineering fields,” says Si Chen, a vice president at the Australian AI firm Appen and a former head of strategy at both Amazon Web Services China and the Chinese tech giant Tencent. 

DeepSeek used this approach to build a base model, called V3, that rivals OpenAI’s flagship model GPT-4o. The firm released V3 a month ago. Last week’s R1, the new model that matches OpenAI’s o1, was built on top of V3. 

To build R1, DeepSeek took V3 and ran its reinforcement-learning loop over and over again. In 2016 Google DeepMind showed that this kind of automated trial-and-error approach, with no human input, could take a board-game-playing model that made random moves and train it to beat grand masters. DeepSeek does something similar with large language models: Potential answers are treated as possible moves in a game. 

To start with, the model did not produce answers that worked through a question step by step, as DeepSeek wanted. But by scoring the model’s sample answers automatically, the training process nudged it bit by bit toward the desired behavior. 

Eventually, DeepSeek produced a model that performed well on a number of benchmarks. But this model, called R1-Zero, gave answers that were hard to read and were written in a mix of multiple languages. To give it one last tweak, DeepSeek seeded the reinforcement-learning process with a small data set of example responses provided by people. Training R1-Zero on those produced the model that DeepSeek named R1. 

There’s more. To make its use of reinforcement learning as efficient as possible, DeepSeek has also developed a new algorithm called Group Relative Policy Optimization (GRPO). It first used GRPO a year ago, to build a model called DeepSeekMath. 

We’ll skip the details—you just need to know that reinforcement learning involves calculating a score to determine whether a potential move is good or bad. Many existing reinforcement-learning techniques require a whole separate model to make this calculation. In the case of large language models, that means a second model that could be as expensive to build and run as the first. Instead of using a second model to predict a score, GRPO just makes an educated guess. It’s cheap, but still accurate enough to work.  

A common approach

DeepSeek’s use of reinforcement learning is the main innovation that the company describes in its R1 paper. But DeepSeek is not the only firm experimenting with this technique. Two weeks before R1 dropped, a team at Microsoft Asia announced a model called rStar-Math, which was trained in a similar way. “It has similarly huge leaps in performance,” says Matt Zeiler, founder and CEO of the AI firm Clarifai.

AI2’s Tulu was also built using efficient reinforcement-learning techniques (but on top of, not instead of, human-led steps like supervised fine-tuning and RLHF). And the US firm Hugging Face is racing to replicate R1 with OpenR1, a clone of DeepSeek’s model that Hugging Face hopes will expose even more of the ingredients in R1’s special sauce.

What’s more, it’s an open secret that top firms like OpenAI, Google DeepMind, and Anthropic may already be using their own versions of DeepSeek’s approach to train their new generation of models. “I’m sure they’re doing almost the exact same thing, but they’ll have their own flavor of it,” says Zeiler. 

But DeepSeek has more than one trick up its sleeve. It trained its base model V3 to do something called multi-token prediction, where the model learns to predict a string of words at once instead of one at a time. This training is cheaper and turns out to boost accuracy as well. “If you think about how you speak, when you’re halfway through a sentence, you know what the rest of the sentence is going to be,” says Zeiler. “These models should be capable of that too.”  

It has also found cheaper ways to create large data sets. To train last year’s model, DeepSeekMath, it took a free data set called Common Crawl—a huge number of documents scraped from the internet—and used an automated process to extract just the documents that included math problems. This was far cheaper than building a new data set of math problems by hand. It was also more effective: Common Crawl includes a lot more math than any other specialist math data set that’s available. 

And on the hardware side, DeepSeek has found new ways to juice old chips, allowing it to train top-tier models without coughing up for the latest hardware on the market. Half their innovation comes from straight engineering, says Zeiler: “They definitely have some really, really good GPU engineers on that team.”

Nvidia provides software called CUDA that engineers use to tweak the settings of their chips. But DeepSeek bypassed this code using assembler, a programming language that talks to the hardware itself, to go far beyond what Nvidia offers out of the box. “That’s as hardcore as it gets in optimizing these things,” says Zeiler. “You can do it, but basically it’s so difficult that nobody does.”

DeepSeek’s string of innovations on multiple models is impressive. But it also shows that the firm’s claim to have spent less than $6 million to train V3 is not the whole story. R1 and V3 were built on a stack of existing tech. “Maybe the very last step—the last click of the button—cost them $6 million, but the research that led up to that probably cost 10 times as much, if not more,” says Friedman. And in a blog post that cut through a lot of the hype, Anthropic cofounder and CEO Dario Amodei pointed out that DeepSeek probably has around $1 billion worth of chips, an estimate based on reports that the firm in fact used 50,000 Nvidia H100 GPUs

A new paradigm

But why now? There are hundreds of startups around the world trying to build the next big thing. Why have we seen a string of reasoning models like OpenAI’s o1 and o3, Google DeepMind’s Gemini 2.0 Flash Thinking, and now R1 appear within weeks of each other? 

The answer is that the base models—GPT-4o, Gemini 2.0, V3—are all now good enough to have reasoning-like behavior coaxed out of them. “What R1 shows is that with a strong enough base model, reinforcement learning is sufficient to elicit reasoning from a language model without any human supervision,” says Lewis Tunstall, a scientist at Hugging Face.

In other words, top US firms may have figured out how to do it but were keeping quiet. “It seems that there’s a clever way of taking your base model, your pretrained model, and turning it into a much more capable reasoning model,” says Zeiler. “And up to this point, the procedure that was required for converting a pretrained model into a reasoning model wasn’t well known. It wasn’t public.”

What’s different about R1 is that DeepSeek published how they did it. “And it turns out that it’s not that expensive a process,” says Zeiler. “The hard part is getting that pretrained model in the first place.” As Karpathy revealed at Microsoft Build last year, pretraining a model represents 99% of the work and most of the cost. 

If building reasoning models is not as hard as people thought, we can expect a proliferation of free models that are far more capable than we’ve yet seen. With the know-how out in the open, Friedman thinks, there will be more collaboration between small companies, blunting the edge that the biggest companies have enjoyed. “I think this could be a monumental moment,” he says. 

OpenAI releases its new o3-mini reasoning model for free

On Thursday, Microsoft announced that it’s rolling OpenAI’s reasoning model o1 out to its Copilot users, and now OpenAI is releasing a new reasoning model, o3-mini, to people who use the free version of ChatGPT. This will mark the first time that the vast majority of people will have access to one of OpenAI’s reasoning models, which were formerly restricted to its paid Pro and Plus bundles.

Reasoning models use a “chain of thought” technique to generate responses, essentially working through a problem presented to the model step by step. Using this method, the model can find mistakes in its process and correct them before giving an answer. This typically results in more thorough and accurate responses, but it also causes the models to pause before answering, sometimes leading to lengthy wait times. OpenAI claims that o3-mini responds 24% faster than o1-mini.

These types of models are most effective at solving complex problems, so if you have any PhD-level math problems you’re cracking away at, you can try them out. Alternatively, if you’ve had issues with getting previous models to respond properly to your most advanced prompts, you may want to try out this new reasoning model on them. To try out o3-mini, simply select “Reason” when you start a new prompt on ChatGPT

Although reasoning models possess new capabilities, they come at a cost. OpenAI’s o1-mini is 20 times more expensive to run than its equivalent non-reasoning model, GPT-4o mini. The company says its new model, o3-mini, costs 63% less than o1-mini per input token However, at $1.10 per million input tokens, it is still about seven times more expensive to run than GPT-4o mini.

This new model is coming right after the DeepSeek release that shook the AI world less than two weeks ago. DeepSeek’s new model performs just as well as top OpenAI models, but the Chinese company claims it cost roughly $6 million to train, as opposed to the estimated cost of over $100 million for training OpenAI’s GPT-4. (It’s worth noting that a lot of people are interrogating this claim.) 

Additionally, DeepSeek’s reasoning model costs $0.55 per million input tokens, half the price of o3-mini, so OpenAI still has a way to go to bring down its costs. It’s estimated that reasoning models also have much higher energy costs than other types, given the larger number of computations they require to produce an answer.

This new wave of reasoning models present new safety challenges as well. OpenAI used a technique called deliberative alignment to train its o-series models, basically having them reference OpenAI’s internal policies at each step of its reasoning to make sure they weren’t ignoring any rules.

But the company has found that o3-mini, like the o1 model, is significantly better than non-reasoning models at jailbreaking and “challenging safety evaluations”—essentially, it’s much harder to control a reasoning model given its advanced capabilities. o3-mini is the first model to score as “medium risk” on model autonomy, a rating given because it’s better than previous models at specific coding tasks—indicating “greater potential for self-improvement and AI research acceleration,” according to OpenAI. That said, the model is still bad at real-world research. If it were better at that, it would be rated as high risk, and OpenAI would restrict the model’s release.

DeepSeek might not be such good news for energy after all

In the week since a Chinese AI model called DeepSeek became a household name, a dizzying number of narratives have gained steam, with varying degrees of accuracy: that the model is collecting your personal data (maybe); that it will upend AI as we know it (too soon to tell—but do read my colleague Will’s story on that!); and perhaps most notably, that DeepSeek’s new, more efficient approach means AI might not need to guzzle the massive amounts of energy that it currently does.

The latter notion is misleading, and new numbers shared with MIT Technology Review help show why. These early figures—based on the performance of one of DeepSeek’s smaller models on a small number of prompts—suggest it could be more energy intensive when generating responses than the equivalent-size model from Meta. The issue might be that the energy it saves in training is offset by its more intensive techniques for answering questions, and by the long answers they produce. 

Add the fact that other tech firms, inspired by DeepSeek’s approach, may now start building their own similar low-cost reasoning models, and the outlook for energy consumption is already looking a lot less rosy.

The life cycle of any AI model has two phases: training and inference. Training is the often months-long process in which the model learns from data. The model is then ready for inference, which happens each time anyone in the world asks it something. Both usually take place in data centers, where they require lots of energy to run chips and cool servers. 

On the training side for its R1 model, DeepSeek’s team improved what’s called a “mixture of experts” technique, in which only a portion of a model’s billions of parameters—the “knobs” a model uses to form better answers—are turned on at a given time during training. More notably, they improved reinforcement learning, where a model’s outputs are scored and then used to make it better. This is often done by human annotators, but the DeepSeek team got good at automating it

The introduction of a way to make training more efficient might suggest that AI companies will use less energy to bring their AI models to a certain standard. That’s not really how it works, though. 

“⁠Because the value of having a more intelligent system is so high,” wrote Anthropic cofounder Dario Amodei on his blog, it “causes companies to spend more, not less, on training models.” If companies get more for their money, they will find it worthwhile to spend more, and therefore use more energy. “The gains in cost efficiency end up entirely devoted to training smarter models, limited only by the company’s financial resources,” he wrote. It’s an example of what’s known as the Jevons paradox.

But that’s been true on the training side as long as the AI race has been going. The energy required for inference is where things get more interesting. 

DeepSeek is designed as a reasoning model, which means it’s meant to perform well on things like logic, pattern-finding, math, and other tasks that typical generative AI models struggle with. Reasoning models do this using something called “chain of thought.” It allows the AI model to break its task into parts and work through them in a logical order before coming to its conclusion. 

You can see this with DeepSeek. Ask whether it’s okay to lie to protect someone’s feelings, and the model first tackles the question with utilitarianism, weighing the immediate good against the potential future harm. It then considers Kantian ethics, which propose that you should act according to maxims that could be universal laws. It considers these and other nuances before sharing its conclusion. (It finds that lying is “generally acceptable in situations where kindness and prevention of harm are paramount, yet nuanced with no universal solution,” if you’re curious.)

Chain-of-thought models tend to perform better on certain benchmarks such as MMLU, which tests both knowledge and problem-solving in 57 subjects. But, as is becoming clear with DeepSeek, they also require significantly more energy to come to their answers. We have some early clues about just how much more.

Scott Chamberlin spent years at Microsoft, and later Intel, building tools to help reveal the environmental costs of certain digital activities. Chamberlin did some initial tests to see how much energy a GPU uses as DeepSeek comes to its answer. The experiment comes with a bunch of caveats: He tested only a medium-size version of DeepSeek’s R-1, using only a small number of prompts. It’s also difficult to make comparisons with other reasoning models.

DeepSeek is “really the first reasoning model that is fairly popular that any of us have access to,” he says. OpenAI’s o1 model is its closest competitor, but the company doesn’t make it open for testing. Instead, he tested it against a model from Meta with the same number of parameters: 70 billion.

The prompt asking whether it’s okay to lie generated a 1,000-word response from the DeepSeek model, which took 17,800 joules to generate—about what it takes to stream a 10-minute YouTube video. This was about 41% more energy than Meta’s model used to answer the prompt. Overall, when tested on 40 prompts, DeepSeek was found to have a similar energy efficiency to the Meta model, but DeepSeek tended to generate much longer responses and therefore was found to use 87% more energy.

How does this compare with models that use regular old-fashioned generative AI as opposed to chain-of-thought reasoning? Tests from a team at the University of Michigan in October found that the 70-billion-parameter version of Meta’s Llama 3.1 averaged just 512 joules per response.

Neither DeepSeek nor Meta responded to requests for comment.

Again: uncertainties abound. These are different models, for different purposes, and a scientifically sound study of how much energy DeepSeek uses relative to competitors has not been done. But it’s clear, based on the architecture of the models alone, that chain-of-thought models use lots more energy as they arrive at sounder answers. 

Sasha Luccioni, an AI researcher and climate lead at Hugging Face, worries that the excitement around DeepSeek could lead to a rush to insert this approach into everything, even where it’s not needed. 

“If we started adopting this paradigm widely, inference energy usage would skyrocket,” she says. “If all of the models that are released are more compute intensive and become chain-of-thought, then it completely voids any efficiency gains.”

AI has been here before. Before ChatGPT launched in 2022, the name of the game in AI was extractive—basically finding information in lots of text, or categorizing images. But in 2022, the focus switched from extractive AI to generative AI, which is based on making better and better predictions. That requires more energy. 

“That’s the first paradigm shift,” Luccioni says. According to her research, that shift has resulted in orders of magnitude more energy being used to accomplish similar tasks. If the fervor around DeepSeek continues, she says, companies might be pressured to put its chain-of-thought-style models into everything, the way generative AI has been added to everything from Google search to messaging apps. 

We do seem to be heading in a direction of more chain-of-thought reasoning: OpenAI announced on January 31 that it would expand access to its own reasoning model, o3. But we won’t know more about the energy costs until DeepSeek and other models like it become better studied.

“It will depend on whether or not the trade-off is economically worthwhile for the business in question,” says Nathan Benaich, founder and general partner at Air Street Capital. “The energy costs would have to be off the charts for them to play a meaningful role in decision-making.”

AI’s energy obsession just got a reality check

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Just a week in, the AI sector has already seen its first battle of wits under the new Trump administration. The clash stems from two key pieces of news: the announcement of the Stargate project, which would spend $500 billion—more than the Apollo space program—on new AI data centers, and the release of a powerful new model from China. Together, they raise important questions the industry needs to answer about the extent to which the race for more data centers—with their heavy environmental toll—is really necessary.

A reminder about the first piece: OpenAI, Oracle, SoftBank, and an Abu Dhabi–based investment fund called MGX plan to spend up to $500 billion opening massive data centers around the US to build better AI. Much of the groundwork for this project was laid in 2024, when OpenAI increased its lobbying spending sevenfold (which we were first to report last week) and AI companies started pushing for policies that were less about controlling problems like deepfakes and misinformation, and more about securing more energy.

Still, Trump received credit for it from tech leaders when he announced the effort on his second day in office. “I think this will be the most important project of this era,” OpenAI’s Sam Altman said at the launch event, adding, “We wouldn’t be able to do this without you, Mr. President.”

It’s an incredible sum, just slightly less than the inflation-adjusted cost of building the US highway system over the course of more than 30 years. However, not everyone sees Stargate as having the same public benefit. Environmental groups say it could strain local grids and further drive up the cost of energy for the rest of us, who aren’t guzzling it to train and deploy AI models. Previous research has also shown that data centers tend to be built in areas that use much more carbon-intensive sources of energy, like coal, than the national average. It’s not clear how much, if at all, Stargate will rely on renewable energy. 

Even louder critics of Stargate, though, include Elon Musk. None of Musk’s companies are involved in the project, and he has attempted to publicly sow doubt that OpenAI and SoftBank have enough of the money needed for the plan anyway, claims that Altman disputed on X. Musk’s decision to publicly criticize the president’s initiative has irked people in Trump’s orbit, Politico reports, but it’s not clear if those people have expressed that to Musk directly. 

On to the second piece. On the day Trump was inaugurated, a Chinese startup released an AI model that started making a whole bunch of important people in Silicon Valley very worried about their competition. (This close timing is almost certainly not an accident.)

The model, called DeepSeek R1, is a reasoning model. These types of models are designed to excel at math, logic, pattern-finding, and decision-making. DeepSeek proved it could “reason” through complicated problems as well as one of OpenAI’s reasoning models, o1—and more efficiently. What’s more, DeepSeek isn’t a super-secret project kept behind lock and key like OpenAI’s. It was released for all to see.

DeepSeek was released as the US has made outcompeting China in the AI race a top priority. This goal was a driving force behind the 2022 CHIPS Act to make more chips domestically. It’s influenced the position of tech companies like OpenAI, which has embraced lending its models to national security work and has partnered with the defense-tech company Anduril to help the military take down drones. It’s led to export controls that limit what types of chips Nvidia can sell to China.

The success of DeepSeek signals that these efforts aren’t working as well as AI leaders in the US would like (though it’s worth noting that the impact of export controls for chips isn’t felt for a few years, so the policy wouldn’t be expected to have prevented a model like DeepSeek).  

Still, the model poses a threat to the bottom line of certain players in Big Tech. Why pay for an expensive model from OpenAI when you can get access to DeepSeek for free? Even other makers of open-source models, especially Meta, are panicking about the competition, according to The Information. The company has set up a number of “war rooms” to figure out how DeepSeek was made so efficient. (A couple of days after the Stargate announcement, Meta said it would increase its own capital investments by 70% to build more AI infrastructure.)

What does this all mean for the Stargate project? Let’s think about why OpenAI and its partners are willing to spend $500 billion on data centers to begin with. They believe that AI in its various forms—not just chatbots or generative video or even new AI agents, but also developments yet to be unveiled—will be the most lucrative tool humanity has ever built. They also believe that access to powerful chips inside massive data centers is the key to getting there. 

DeepSeek poked some holes in that approach. It didn’t train on yet-unreleased chips that are light-years ahead. It didn’t, to our knowledge, require the eye-watering amounts of computing power and energy behind the models from US companies that have made headlines. Its designers made clever decisions in the name of efficiency.

In theory, it could make a project like Stargate seem less urgent and less necessary. If, in dissecting DeepSeek, AI companies discover some lessons about how to make models use existing resources more effectively, perhaps constructing more and more data centers won’t be the only winning formula for better AI. That would be welcome to the many people affected by the problems data centers can bring, like lots of emissions, the loss of fresh, drinkable water used to cool them, and the strain on local power grids. 

Thus far, DeepSeek doesn’t seem to have sparked such a change in approach. OpenAI researcher Noam Brown wrote on X, “I have no doubt that with even more compute it would be an even more powerful model.”

If his logic wins out, the players with the most computing power will win, and getting it is apparently worth at least $500 billion to AI’s biggest companies. But let’s remember—announcing it is the easiest part.


Now read the rest of The Algorithm

Deeper Learning

What’s next for robots

Many of the big questions about AI–-how it learns, how well it works, and where it should be deployed—are now applicable to robotics. In the year ahead, we will see humanoid robots being put to the test in warehouses and factories, robots learning in simulated worlds, and a rapid increase in the military’s adoption of autonomous drones, submarines, and more. 

Why it matters: Jensen Huang, the highly influential CEO of the chipmaker Nvidia, stated last month that the next advancement in AI will mean giving the technology a “body” of sorts in the physical world. This will come in the form of advanced robotics. Even with the caveat that robotics is full of futuristic promises that usually aren’t fulfilled by their deadlines, the marrying of AI methods with new advancements in robots means the field is changing quickly. Read more here.

Bits and Bytes

Leaked documents expose deep ties between Israeli army and Microsoft

Since the attacks of October 7, the Israeli military has relied heavily on cloud and AI services from Microsoft and its partner OpenAI, and the tech giant’s staff has embedded with different units to support rollout, a joint investigation reveals. (+972 Magazine)

The tech arsenal that could power Trump’s immigration crackdown

The effort by federal agencies to acquire powerful technology to identify and track migrants has been unfolding for years across multiple administrations. These technologies may be called upon more directly under President Trump. (The New York Times)

OpenAI launches Operator—an agent that can use a computer for you

Operator is a web app that can carry out simple online tasks in a browser, such as booking concert tickets or making an online grocery order. (MIT Technology Review)

The second wave of AI coding is here

A string of startups are racing to build models that can produce better and better software. But it’s not only AI’s increasingly powerful ability to write code that’s impressive. They claim it’s the shortest path to superintelligent AI. (MIT Technology Review)

How a top Chinese AI model overcame US sanctions

The AI community is abuzz over DeepSeek R1, a new open-source reasoning model. 

The model was developed by the Chinese AI startup DeepSeek, which claims that R1 matches or even surpasses OpenAI’s ChatGPT o1 on multiple key benchmarks but operates at a fraction of the cost. 

“This could be a truly equalizing breakthrough that is great for researchers and developers with limited resources, especially those from the Global South,” says Hancheng Cao, an assistant professor in information systems at Emory University.

DeepSeek’s success is even more remarkable given the constraints facing Chinese AI companies in the form of increasing US export controls on cutting-edge chips. But early evidence shows that these measures are not working as intended. Rather than weakening China’s AI capabilities, the sanctions appear to be driving startups like DeepSeek to innovate in ways that prioritize efficiency, resource-pooling, and collaboration.

To create R1, DeepSeek had to rework its training process to reduce the strain on its GPUs, a variety released by Nvidia for the Chinese market that have their performance capped at half the speed of its top products, according to Zihan Wang, a former DeepSeek employee and current PhD student in computer science at Northwestern University. 

DeepSeek R1 has been praised by researchers for its ability to tackle complex reasoning tasks, particularly in mathematics and coding. The model employs a “chain of thought” approach similar to that used by ChatGPT o1, which lets it solve problems by processing queries step by step.

Dimitris Papailiopoulos, principal researcher at Microsoft’s AI Frontiers research lab, says what surprised him the most about R1 is its engineering simplicity. “DeepSeek aimed for accurate answers rather than detailing every logical step, significantly reducing computing time while maintaining a high level of effectiveness,” he says.

DeepSeek has also released six smaller versions of R1 that are small enough to  run locally on laptops. It claims that one of them even outperforms OpenAI’s o1-mini on certain benchmarks.“DeepSeek has largely replicated o1-mini and has open sourced it,” tweeted Perplexity CEO Aravind Srinivas. DeepSeek did not reply to MIT Technology Review’s request for comments.

Despite the buzz around R1, DeepSeek remains relatively unknown. Based in Hangzhou, China, it was founded in July 2023 by Liang Wenfeng, an alumnus of Zhejiang University with a background in information and electronic engineering. It was incubated by High-Flyer, a hedge fund that Liang founded in 2015. Like Sam Altman of OpenAI, Liang aims to build artificial general intelligence (AGI), a form of AI that can match or even beat humans on a range of tasks.

Training large language models (LLMs) requires a team of highly trained researchers and substantial computing power. In a recent interview with the Chinese media outlet LatePost, Kai-Fu Lee, a veteran entrepreneur and former head of Google China, said that only “front-row players” typically engage in building foundation models such as ChatGPT, as it’s so resource-intensive. The situation is further complicated by the US export controls on advanced semiconductors. High-Flyer’s decision to venture into AI is directly related to these constraints, however. Long before the anticipated sanctions, Liang acquired a substantial stockpile of Nvidia A100 chips, a type now banned from export to China. The Chinese media outlet 36Kr estimates that the company has over 10,000 units in stock, but Dylan Patel, founder of the AI research consultancy SemiAnalysis, estimates that it has at least 50,000. Recognizing the potential of this stockpile for AI training is what led Liang to establish DeepSeek, which was able to use them in combination with the lower-power chips to develop its models. 

Tech giants like Alibaba and ByteDance, as well as a handful of startups with deep-pocketed investors, dominate the Chinese AI space, making it challenging for small or medium-sized enterprises to compete. A company like DeepSeek, which has no plans to raise funds, is rare. 

Zihan Wang, the former DeepSeek employee, told MIT Technology Review that he had access to abundant computing resources and was given freedom to experiment when working at DeepSeek, “a luxury that few fresh graduates would get at any company.” 

In an interview with the Chinese media outlet 36Kr in July 2024 Liang said that an additional challenge Chinese companies face on top of chip sanctions, is that their AI engineering techniques tend to be less efficient. “We [most Chinese companies] have to consume twice the computing power to achieve the same results. Combined with data efficiency gaps, this could mean needing up to four times more computing power. Our goal is to continuously close these gaps,” he said.  

But DeepSeek found ways to reduce memory usage and speed up calculation without significantly sacrificing accuracy. “The team loves turning a hardware challenge into an opportunity for innovation,” says Wang.

Liang himself remains deeply involved in DeepSeek’s research process, running experiments alongside his team. “The whole team shares a collaborative culture and dedication to hardcore research,” Wang says.

As well as prioritizing efficiency, Chinese companies are increasingly embracing open-source principles. Alibaba Cloud has released over 100 new open-source AI models, supporting 29 languages and catering to various applications, including coding and mathematics. Similarly, startups like Minimax and 01.AI have open-sourced their models. 

According to a white paper released last year by the China Academy of Information and Communications Technology, a state-affiliated research institute, the number of AI large language models worldwide has reached 1,328, with 36% originating in China. This positions China as the second-largest contributor to AI, behind the United States. 

“This generation of young Chinese researchers identify strongly with open-source culture because they benefit so much from it,” says Thomas Qitong Cao, an assistant professor of technology policy at Tufts University.

“The US export control has essentially backed Chinese companies into a corner where they have to be far more efficient with their limited computing resources,” says Matt Sheehan, an AI researcher at the Carnegie Endowment for International Peace. “We are probably going to see a lot of consolidation in the future related to the lack of compute.”

That might already have started to happen. Two weeks ago, Alibaba Cloud announced that it has partnered with the Beijing-based startup 01.AI, founded by Kai-Fu Lee, to merge research teams and establish an “industrial large model laboratory.”

“It is energy-efficient and natural for some kind of division of labor to emerge in the AI industry,” says Cao, the Tufts professor. “The rapid evolution of AI demands agility from Chinese firms to survive.”

OpenAI launches Operator—an agent that can use a computer for you

After weeks of buzz, OpenAI has released Operator, its first AI agent. Operator is a web app that can carry out simple online tasks in a browser, such as booking concert tickets or filling an online grocery order. The app is powered by a new model called Computer-Using Agent—CUA (“coo-ah”), for short—built on top of OpenAI’s multimodal large language model GPT-4o.

Operator is available today at operator.chatgpt.com to people in the US signed up with ChatGPT Pro, OpenAI’s premium $200-a-month service. The company says it plans to roll the tool out to other users in the future.

OpenAI claims that Operator outperforms similar rival tools, including Anthropic’s Computer Use (a version of Claude 3.5 Sonnet that can carry out simple tasks on a computer) and Google DeepMind’s Mariner (a web-browsing agent built on top of Gemini 2.0).

The fact that three of the world’s top AI firms have converged on the same vision of what agent-based models could be makes one thing clear. The battle for AI supremacy has a new frontier—and it’s our computer screens.

“Moving from generating text and images to doing things is the right direction,” says Ali Farhadi, CEO of the Allen Institute for AI (AI2). “It unlocks business, solves new problems.”

Farhadi thinks that doing things on a computer screen is a natural first step for agents: “It is constrained enough that the current state of the technology can actually work,” he says. “At the same time, it’s impactful enough that people might use it.” (AI2 is working on its own computer-using agent, says Farhadi.)

Don’t believe the hype

OpenAI’s announcement also confirms one of two rumors that circled the internet this week. One predicted that OpenAI was about to reveal an agent-based app, after details about Operator were leaked on social media ahead of its release. The other predicted that OpenAI was about to reveal a new superintelligence—and that officials for newly inaugurated President Trump would be briefed on it.

Could the two rumors be linked? OpenAI superfans wanted to know.

Nope. OpenAI gave MIT Technology Review a preview of Operator in action yesterday. The tool is an exciting glimpse of large language models’ potential to do a lot more than answer questions. But Operator is an experimental work in progress. “It’s still early, it still makes mistakes,” says Yash Kumar, a researcher at OpenAI.

(As for the wild superintelligence rumors, let’s leave that to OpenAI CEO Sam Altman to address: “twitter hype is out of control again,” he posted on January 20. “pls chill and cut your expectations 100x!”)

Like Anthropic’s Computer Use and Google DeepMind’s Mariner, Operator takes screenshots of a computer screen and scans the pixels to figure out what actions it can take. CUA, the model behind it, is trained to interact with the same graphical user interfaces—buttons, text boxes, menus—that people use when they do things online. It scans the screen, takes an action, scans the screen again, takes another action, and so on. That lets the model carry out tasks on most websites that a person can use.

“Traditionally the way models have used software is through specialized APIs,” says Reiichiro Nakano, a scientist at OpenAI. (An API, or application programming interface, is a piece of code that acts as a kind of connector, allowing different bits of software to be hooked up to one another.) That puts a lot of apps and most websites off limits, he says: “But if you create a model that can use the same interface that humans use on a daily basis, it opens up a whole new range of software that was previously inaccessible.”

CUA also breaks tasks down into smaller steps and tries to work through them one by one, backtracking when it gets stuck. OpenAI says CUA was trained with techniques similar to those used for its so-called reasoning models, o1 and o3. 

Operator can be instructed to search for campsites in Yosemite with good picnic tables.
OPENAI

OpenAI has tested CUA against a number of industry benchmarks designed to assess the ability of an agent to carry out tasks on a computer. The company claims that its model beats Computer Use and Mariner in all of them.

For example, on OSWorld, which tests how well an agent performs tasks such as merging PDF files or manipulating an image, CUA scores 38.1% to Computer Use’s 22.0%  In comparison, humans score 72.4%. On a benchmark called WebVoyager, which tests how well an agent performs tasks in a browser, CUA scores 87%, Mariner 83.5%, and Computer Use 56%. (Mariner can only carry out tasks in a browser and therefore does not score on OSWorld.)

For now, Operator can also only carry out tasks in a browser. OpenAI plans to make CUA’s wider abilities available in the future via an API that other developers can use to build their own apps. This is how Anthropic released Computer Use in December.

OpenAI says it has tested CUA’s safety, using red teams to explore what happens when users ask it to do unacceptable tasks (such as research how to make a bioweapon), when websites contain hidden instructions designed to derail it, and when the model itself breaks down. “We’ve trained the model to stop and ask the user for information before doing anything with external side effects,” says Casey Chu, another researcher on the team.

Look! No hands

To use Operator, you simply type instructions into a text box. But instead of calling up the browser on your computer, Operator sends your instructions to a remote browser running on an OpenAI server. OpenAI claims that this makes the system more efficient. It’s another key difference between Operator, Computer Use and Mariner (which runs inside Google’s Chrome browser on your own computer).

Because it’s running in the cloud, Operator can carry out multiple tasks at once, says Kumar. In the live demo, he asked Operator to use OpenTable to book him a table for two at 6.30 p.m. at a restaurant called Octavia in San Francisco. Straight away, Operator opened up OpenTable and started clicking through options. “As you can see, my hands are off the keyboard,” he said.

OpenAI is collaborating with a number of businesses, including OpenTable, StubHub, Instacart, DoorDash, and Uber. The nature of those collaborations is not exactly clear, but Operator appears to suggest preset websites to use for certain tasks.

While the tool navigated dropdowns on OpenTable, Kumar sent Operator off to find four tickets for a Kendrick Lamar show on StubHub. While it did that, he pasted a photo of a handwritten shopping list and asked Operator to add the items to his Instacart.

He waited, flicking between Operator’s tabs. “If it needs help or if it needs confirmations, it’ll come back to you with questions and you can answer it,” he said.

Kumar says he has been using Operator at home. It helps him stay on top of grocery shopping: “I can just quickly click a photo of a list and send it to work,” he says.

It’s also become a sidekick in his personal life. “I have a date night every Thursday,” says Kumar. So every Thursday morning, he instructs Operator to send him a list of five restaurants that have a table for two that evening. “Of course, I could do that, but it takes me 10 minutes,” he says. “And I often forget to do it. With Operator, I can run the task with one click. There’s no burden of booking.”

What’s next for robots

MIT Technology Review’s What’s Next series looks across industries, trends, and technologies to give you a first look at the future. You can read the rest of them here.

Jan Liphardt teaches bioengineering at Stanford, but to many strangers in Los Altos, California, he is a peculiar man they see walking a four-legged robotic dog down the street. 

Liphardt has been experimenting with building and modifying robots for years, and when he brings his “dog” out in public, he generally gets one of three reactions. Young children want to have one, their parents are creeped out, and baby boomers try to ignore it. “They’ll quickly walk by,” he says, “like, ‘What kind of dumb new stuff is going on here?’” 

In the many conversations I’ve had about robots, I’ve also found that most people tend to fall into these three camps, though I don’t see such a neat age division. Some are upbeat and vocally hopeful that a future is just around the corner in which machines can expertly handle much of what is currently done by humans, from cooking to surgery. Others are scared: of job losses, injuries, and whatever problems may come up as we try to live side by side. 

The final camp, which I think is the largest, is just unimpressed. We’ve been sold lots of promises that robots will transform society ever since the first robotic arm was installed on an assembly line at a General Motors plant in New Jersey in 1961. Few of those promises have panned out so far. 

But this year, there’s reason to think that even those staunchly in the “bored” camp will be intrigued by what’s happening in the robot races. Here’s a glimpse at what to keep an eye on. 

Humanoids are put to the test

The race to build humanoid robots is motivated by the idea that the world is set up for the human form, and that automating that form could mean a seismic shift for robotics. It is led by some particularly outspoken and optimistic entrepreneurs, including Brett Adcock, the founder of Figure AI, a company making such robots that’s valued at more than $2.6 billion (it’s begun testing its robots with BMW). Adcock recently told Time, “Eventually, physical labor will be optional.” Elon Musk, whose company Tesla is building a version called Optimus, has said humanoid robots will create “a future where there is no poverty.” A robotics company called Eliza Wakes Up is taking preorders for a $420,000 humanoid called, yes, Eliza.

In June 2024, Agility Robotics sent a fleet of its Digit humanoid robots to GXO Logistics, which moves products for companies ranging from Nike to Nestlé. The humanoids can handle most tasks that involve picking things up and moving them somewhere else, like unloading pallets or putting boxes on a conveyor. 

There have been hiccups: Highly polished concrete floors can cause robots to slip at first, and buildings need good Wi-Fi coverage for the robots to keep functioning. But charging is a bigger issue. Agility’s current version of Digit, with a 39-pound battery, can run for two to four hours before it needs to charge for one hour, so swapping out the robots for fresh ones is a common task on each shift. If there are a small number of charging docks installed, the robots can theoretically charge by shuffling among the docks themselves overnight when some facilities aren’t running, but moving around on their own can set off a building’s security system. “It’s a problem,” says CTO Melonee Wise.

Wise is cautious about whether humanoids will be widely adopted in workplaces. “I’ve always been a pessimist,” she says. That’s because getting robots to work well in a lab is one thing, but integrating them into a bustling warehouse full of people and forklifts moving goods on tight deadlines is another task entirely.

If 2024 was the year of unsettling humanoid product launch videos, this year we will see those humanoids put to the test, and we’ll find out whether they’ll be as productive for paying customers as promised. Now that Agility’s robots have been deployed in fast-paced customer facilities, it’s clear that small problems can really add up. 

Then there are issues with how robots and humans share spaces. In the GXO facility the two work in completely separate areas, Wise says, but there are cases where, for example, a human worker might accidentally leave something obstructing a charging station. That means Agility’s robots can’t return to the dock to charge, so they need to alert a human employee to move the obstruction out of the way, slowing operations down.  

It’s often said that robots don’t call out sick or need health care. But this year, as fleets of humanoids arrive on the job, we’ll begin to find out the limitations they do have.

Learning from imagination

The way we teach robots how to do things is changing rapidly. It used to be necessary to break their tasks down into steps with specifically coded instructions, but now, thanks to AI, those instructions can be gleaned from observation. Just as ChatGPT was taught to write through exposure to trillions of sentences rather than by explicitly learning the rules of grammar, robots are learning through videos and demonstrations. 

That poses a big question: Where do you get all these videos and demonstrations for robots to learn from?

Nvidia, the world’s most valuable company, has long aimed to meet that need with simulated worlds, drawing on its roots in the video-game industry. It creates worlds in which roboticists can expose digital replicas of their robots to new environments to learn. A self-driving car can drive millions of virtual miles, or a factory robot can learn how to navigate in different lighting conditions.

In December, the company went a step further, releasing what it’s calling a “world foundation model.” Called Cosmos, the model has learned from 20 million hours of video—the equivalent of watching YouTube nonstop since Rome was at war with Carthage—that can be used to generate synthetic training data.

Here’s an example of how this model could help in practice. Imagine you run a robotics company that wants to build a humanoid that cleans up hospitals. You can start building this robot’s “brain” with a model from Nvidia, which will give it a basic understanding of physics and how the world works, but then you need to help it figure out the specifics of how hospitals work. You could go out and take videos and images of the insides of hospitals, or pay people to wear sensors and cameras while they go about their work there.

“But those are expensive to create and time consuming, so you can only do a limited number of them,” says Rev Lebaredian, vice president of simulation technologies at Nvidia. Cosmos can instead take a handful of those examples and create a three-dimensional simulation of a hospital. It will then start making changes—different floor colors, different sizes of hospital beds—and create slightly different environments. “You’ll multiply that data that you captured in the real world millions of times,” Lebaredian says. In the process, the model will be fine-tuned to work well in that specific hospital setting. 

It’s sort of like learning both from your experiences in the real world and from your own imagination (stipulating that your imagination is still bound by the rules of physics). 

Teaching robots through AI and simulations isn’t new, but it’s going to become much cheaper and more powerful in the years to come. 

A smarter brain gets a smarter body

Plenty of progress in robotics has to do with improving the way a robot senses and plans what to do—its “brain,” in other words. Those advancements can often happen faster than those that improve a robot’s “body,” which determine how well a robot can move through the physical world, especially in environments that are more chaotic and unpredictable than controlled assembly lines. 

The military has always been keen on changing that and expanding the boundaries of what’s physically possible. The US Navy has been testing machines from a company called Gecko Robotics that can navigate up vertical walls (using magnets) to do things like infrastructure inspections, checking for cracks, flaws, and bad welding on aircraft carriers. 

There are also investments being made for the battlefield. While nimble and affordable drones have reshaped rural battlefields in Ukraine, new efforts are underway to bring those drone capabilities indoors. The defense manufacturer Xtend received an $8.8 million contract from the Pentagon in December 2024 for its drones, which can navigate in confined indoor spaces and urban environments. These so-called “loitering munitions” are one-way attack drones carrying explosives that detonate on impact.

“These systems are designed to overcome challenges like confined spaces, unpredictable layouts, and GPS-denied zones,” says Rubi Liani, cofounder and CTO at Xtend. Deliveries to the Pentagon should begin in the first few months of this year. 

Another initiative—sparked in part by the Replicator project, the Pentagon’s plan to spend more than $1 billion on small unmanned vehicles—aims to develop more autonomously controlled submarines and surface vehicles. This is particularly of interest as the Department of Defense focuses increasingly on the possibility of a future conflict in the Pacific between China and Taiwan. In such a conflict, the drones that have dominated the war in Ukraine would serve little use because battles would be waged almost entirely at sea, where small aerial drones would be limited by their range. Instead, undersea drones would play a larger role.

All these changes, taken together, point toward a future where robots are more flexible in how they learn, where they work, and how they move. 

Jan Liphardt from Stanford thinks the next frontier of this transformation will hinge on the ability to instruct robots through speech. Large language models’ ability to understand and generate text has already made them a sort of translator between Liphardt and his robot.

“We can take one of our quadrupeds and we can tell it, ‘Hey, you’re a dog,’ and the thing wants to sniff you and tries to bark,” he says. “Then we do one word change—‘You’re a cat.’ Then the thing meows and, you know, runs away from dogs. And we haven’t changed a single line of code.”

Correction: A previous version of this story incorrectly stated that the robotics company Eliza Wakes Up has ties to a16z.

Implementing responsible AI in the generative age

Many organizations have experimented with AI, but they haven’t always gotten the full value from their investments. A host of issues standing in the way center on the accuracy, fairness, and security of AI systems. In response, organizations are actively exploring the principles of responsible AI: the idea that AI systems must be fair, transparent, and beneficial to society for it to be widely adopted. 

When responsible AI is done right, it unlocks trust and therefore customer adoption of enterprise AI. According to the US National Institute of Standards and Technology the essential building blocks of AI trustworthiness include: 

  • Validity and reliability 
  • Safety
  • Security and resiliency 
  • Accountability and transparency 
  • Explainability and interpretability 
  • Privacy
  • Fairness with mitigation of harmful bias 

To investigate the current landscape of responsible AI across the enterprise, MIT Technology Review Insights surveyed 250 business leaders about how they’re implementing principles that ensure AI trustworthiness. The poll found that responsible AI is important to executives, with 87% of respondents rating it a high or medium priority for their organization.

A majority of respondents (76%) also say that responsible AI is a high or medium priority specifically for creating a competitive advantage. But relatively few have figured out how to turn these ideas into reality. We found that only 15% of those surveyed felt highly prepared to adopt effective responsible AI practices, despite the importance they placed on them. 

Putting responsible AI into practice in the age of generative AI requires a series of best practices that leading companies are adopting. These practices can include cataloging AI models and data and implementing governance controls. Companies may benefit from conducting rigorous assessments, testing, and audits for risk, security, and regulatory compliance. At the same time, they should also empower employees with training at scale and ultimately make responsible AI a leadership priority to ensure their change efforts stick. 

“We all know AI is the most influential change in technology that we’ve seen, but there’s a huge disconnect,” says Steven Hall, chief AI officer and president of EMEA at ISG, a global technology research and IT advisory firm. “Everybody understands how transformative AI is going to be and wants strong governance, but the operating model and the funding allocated to responsible AI are well below where they need to be given its criticality to the organization.” 

Download the full report.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.