Sam Altman says helpful agents are poised to become AI’s killer function

A number of moments from my brief sit-down with Sam Altman brought the OpenAI CEO’s worldview into clearer focus. The first was when he pointed to my iPhone SE (the one with the home button that’s mostly hated) and said, “That’s the best iPhone.” More revealing, though, was the vision he sketched for how AI tools will become even more enmeshed in our daily lives than the smartphone.

“What you really want,” he told MIT Technology Review, “is just this thing that is off helping you.” Altman, who was visiting Cambridge for a series of events hosted by Harvard and the venture capital firm Xfund, described the killer app for AI as a “super-competent colleague that knows absolutely everything about my whole life, every email, every conversation I’ve ever had, but doesn’t feel like an extension.” It could tackle some tasks instantly, he said, and for more complex ones it could go off and make an attempt, but come back with questions for you if it needs to. 

It’s a leap from OpenAI’s current offerings. Its leading applications, like DALL-E, Sora, and ChatGPT (which Altman referred to as “incredibly dumb” compared with what’s coming next), have wowed us with their ability to generate convincing text and surreal videos and images. But they mostly remain tools we use for isolated tasks, and they have limited capacity to learn about us from our conversations with them. 

In the new paradigm, as Altman sees it, the AI will be capable of helping us outside the chat interface and taking real-world tasks off our plates. 

Altman on AI hardware’s future 

I asked Altman if we’ll need a new piece of hardware to get to this future. Though smartphones are extraordinarily capable, and their designers are already incorporating more AI-driven features, some entrepreneurs are betting that the AI of the future will require a device that’s more purpose-built. Some of these devices are already beginning to appear in his orbit. There is the (widely panned) wearable AI Pin from Humane, for example (Altman is an investor in the company but has not exactly been a booster of the device). He is also rumored to be working with former Apple designer Jony Ive on some new type of hardware. 

But Altman says there’s a chance we won’t necessarily need a device at all. “I don’t think it will require a new piece of hardware,” he told me, adding that the type of app envisioned could exist in the cloud. But he quickly added that even if this AI paradigm shift won’t require consumers to buy new hardware, “I think you’ll be happy to have [a new device].” 

Though Altman says he thinks AI hardware devices are exciting, he also implied he might not be best suited to take on the challenge himself: “I’m very interested in consumer hardware for new technology. I’m an amateur who loves it, but this is so far from my expertise.”

On the hunt for training data

Upon hearing his vision for powerful AI-driven agents, I wondered how it would square with the industry’s current scarcity of training data. To build GPT-4 and other models, OpenAI has scoured internet archives, newspapers, and blogs for training data, since scaling laws have long shown that making models bigger also makes them better. But finding more data to train on is a growing problem. Much of the internet has already been slurped up, and access to private or copyrighted data is now mired in legal battles. 

Altman is optimistic this won’t be a problem for much longer, though he didn’t articulate the specifics. 

“I believe, but I’m not certain, that we’re going to figure out a way out of this thing of you always just need more and more training data,” he says. “Humans are existence proof that there is some other way to [train intelligence]. And I hope we find it.”

On who will be poised to create AGI

OpenAI’s central vision has long revolved around the pursuit of artificial general intelligence (AGI), or an AI that can reason as well as or better than humans. Its stated mission is to ensure such a technology “benefits all of humanity.” It is far from the only company pursuing AGI, however. So in the race for AGI, what are the most important tools? I asked Altman if he thought the entity that marshals the largest amount of chips and computing power will ultimately be the winner. 

Altman suspects there will be “several different versions [of AGI] that are better and worse at different things,” he says. “You’ll have to be over some compute threshold, I would guess. But even then I wouldn’t say I’m certain.”

On when we’ll see GPT-5

You thought he’d answer that? When another reporter in the room asked Altman if he knew when the next version of GPT is slated to be released, he gave a calm response. “Yes,” he replied, smiling, and said nothing more. 

An AI startup made a hyperrealistic deepfake of me that’s so good it’s scary

I’m stressed and running late, because what do you wear for the rest of eternity? 

This makes it sound like I’m dying, but it’s the opposite. I am, in a way, about to live forever, thanks to the AI video startup Synthesia. For the past several years, the company has produced AI-generated avatars, but today it launches a new generation, its first to take advantage of the latest advancements in generative AI, and they are more realistic and expressive than anything I’ve ever seen. While today’s release means almost anyone will now be able to make a digital double, on this early April afternoon, before the technology goes public, they’ve agreed to make one of me. 

When I finally arrive at the company’s stylish studio in East London, I am greeted by Tosin Oshinyemi, the company’s production lead. He is going to guide and direct me through the data collection process—and by “data collection,” I mean the capture of my facial features, mannerisms, and more—much like he normally does for actors and Synthesia’s customers. 

In this AI-generated footage, synthetic “Melissa” gives a performance of Hamlet’s famous soliloquy. (The magazine had no role in producing this video.)
SYNTHESIA

He introduces me to a waiting stylist and a makeup artist, and I curse myself for wasting so much time getting ready. Their job is to ensure that people have the kind of clothes that look good on camera and that they look consistent from one shot to the next. The stylist tells me my outfit is fine (phew), and the makeup artist touches up my face and tidies my baby hairs. The dressing room is decorated with hundreds of smiling Polaroids of people who have been digitally cloned before me. 

Apart from the small supercomputer whirring in the corridor, which processes the data generated at the studio, this feels more like going into a news studio than entering a deepfake factory. 

I joke that Oshinyemi has what MIT Technology Review might call a job title of the future: “deepfake creation director.” 

“We like the term ‘synthetic media’ as opposed to ‘deepfake,’” he says. 

It’s a subtle but, some would argue, notable difference in semantics. Both mean AI-generated videos or audio recordings of people doing or saying something that didn’t necessarily happen in real life. But deepfakes have a bad reputation. Since the technology emerged nearly a decade ago, the term has come to signal something unethical, says Alexandru Voica, Synthesia’s head of corporate affairs and policy. Think of sexual content produced without consent, or political campaigns that spread disinformation or propaganda.

“Synthetic media is the more benign, productive version of that,” he argues. And Synthesia wants to offer the best version of that version.  

Until now, all AI-generated videos of people have tended to have some stiffness, glitchiness, or other unnatural elements that make them pretty easy to differentiate from reality. Because they’re so close to the real thing but not quite it, these videos can make people feel annoyed or uneasy or icky—a phenomenon commonly known as the uncanny valley. Synthesia claims its new technology will finally lead us out of the valley. 

Thanks to rapid advancements in generative AI and a glut of training data created by human actors that has been fed into its AI model, Synthesia has been able to produce avatars that are indeed more humanlike and more expressive than their predecessors. The digital clones are better able to match their reactions and intonation to the sentiment of their scripts—acting more upbeat when talking about happy things, for instance, and more serious or sad when talking about unpleasant things. They also do a better job matching facial expressions—the tiny movements that can speak for us without words. 

But this technological progress also signals a much larger social and cultural shift. Increasingly, so much of what we see on our screens is generated (or at least tinkered with) by AI, and it is becoming more and more difficult to distinguish what is real from what is not. This threatens our trust in everything we see, which could have very real, very dangerous consequences. 

“I think we might just have to say goodbye to finding out about the truth in a quick way,” says Sandra Wachter, a professor at the Oxford Internet Institute, who researches the legal and ethical implications of AI. “The idea that you can just quickly Google something and know what’s fact and what’s fiction—I don’t think it works like that anymore.” 

Tosin Oshinyemi, the company’s production lead, guides and directs actors and customers through the data collection process.
DAVID VINTINER

So while I was excited for Synthesia to make my digital double, I also wondered if the distinction between synthetic media and deepfakes is fundamentally meaningless. Even if the former centers a creator’s intent and, critically, a subject’s consent, is there really a way to make AI avatars safely if the end result is the same? And do we really want to get out of the uncanny valley if it means we can no longer grasp the truth?

But more urgently, it was time to find out what it’s like to see a post-truth version of yourself.

Almost the real thing

A month before my trip to the studio, I visited Synthesia CEO Victor Riparbelli at his office near Oxford Circus. As Riparbelli tells it, Synthesia’s origin story stems from his experiences exploring avant-garde, geeky techno music while growing up in Denmark. The internet allowed him to download software and produce his own songs without buying expensive synthesizers. 

“I’m a huge believer in giving people the ability to express themselves in the way that they can, because I think that that provides for a more meritocratic world,” he tells me. 

He saw the possibility of doing something similar with video when he came across research on using deep learning to transfer expressions from one human face to another on screen. 

“What that showcased was the first time a deep-learning network could produce video frames that looked and felt real,” he says. 

That research was conducted by Matthias Niessner, a professor at the Technical University of Munich, who cofounded Synthesia with Riparbelli in 2017, alongside University College London professor Lourdes Agapito and Steffen Tjerrild, whom Riparbelli had previously worked with on a cryptocurrency project. 

Initially the company built lip-synching and dubbing tools for the entertainment industry, but it found that the bar for this technology’s quality was very high and there wasn’t much demand for it. Synthesia changed direction in 2020 and launched its first generation of AI avatars for corporate clients. That pivot paid off. In 2023, Synthesia achieved unicorn status, meaning it was valued at over $1 billion—making it one of the relatively few European AI companies to do so. 

That first generation of avatars looked clunky, with looped movements and little variation. Subsequent iterations started looking more human, but they still struggled to say complicated words, and things were slightly out of sync. 

The challenge is that people are used to looking at other people’s faces. “We as humans know what real humans do,” says Jonathan Starck, Synthesia’s CTO. Since infancy, “you’re really tuned in to people and faces. You know what’s right, so anything that’s not quite right really jumps out a mile.” 

These earlier AI-generated videos, like deepfakes more broadly, were made using generative adversarial networks, or GANs—an older technique for generating images and videos that uses two neural networks that play off one another. It was a laborious and complicated process, and the technology was unstable. 

But in the generative AI boom of the last year or so, the company has found it can create much better avatars using generative neural networks that produce higher-quality output more consistently. The more data these models are fed, the better they learn. Synthesia uses both large language models and diffusion models to do this; the former help the avatars react to the script, and the latter generate the pixels. 

Despite the leap in quality, the company is still not pitching itself to the entertainment industry. Synthesia continues to see itself as a platform for businesses. Its bet is this: As people spend more time watching videos on YouTube and TikTok, there will be more demand for video content. Young people are already skipping traditional search and defaulting to TikTok for information presented in video form. Riparbelli argues that Synthesia’s tech could help companies convert their boring corporate comms and reports and training materials into content people will actually watch and engage with. He also suggests it could be used to make marketing materials. 

He claims Synthesia’s technology is used by 56% of the Fortune 100, with the vast majority of those companies using it for internal communication. The company lists Zoom, Xerox, Microsoft, and Reuters as clients. Services start at $22 a month.

This, the company hopes, will be a cheaper and more efficient alternative to video from a professional production company—and one that may be nearly indistinguishable from it. Riparbelli tells me its newest avatars could easily fool a person into thinking they are real. 

“I think we’re 98% there,” he says. 

For better or worse, I am about to see it for myself. 

Don’t be garbage

In AI research, there is a saying: Garbage in, garbage out. If the data that went into training an AI model is trash, that will be reflected in the outputs of the model. The more data points the AI model has captured of my facial movements, microexpressions, head tilts, blinks, shrugs, and hand waves, the more realistic the avatar will be. 

Back in the studio, I’m trying really hard not to be garbage. 

I am standing in front of a green screen, and Oshinyemi guides me through the initial calibration process, where I have to move my head and then eyes in a circular motion. Apparently, this will allow the system to understand my natural colors and facial features. I am then asked to say the sentence “All the boys ate a fish,” which will capture all the mouth movements needed to form vowels and consonants. We also film footage of me “idling” in silence.

The more data points the AI system has on facial movements, microexpressions, head tilts, blinks, shrugs, and hand waves, the more realistic the avatar will be.
DAVID VINTINER

He then asks me to read a script for a fictitious YouTuber in different tones, directing me on the spectrum of emotions I should convey. First I’m supposed to read it in a neutral, informative way, then in an encouraging way, an annoyed and complain-y way, and finally an excited, convincing way. 

“Hey, everyone—welcome back to Elevate Her with your host, Jess Mars. It’s great to have you here. We’re about to take on a topic that’s pretty delicate and honestly hits close to home—dealing with criticism in our spiritual journey,” I read off the teleprompter, simultaneously trying to visualize ranting about something to my partner during the complain-y version. “No matter where you look, it feels like there’s always a critical voice ready to chime in, doesn’t it?” 

Don’t be garbage, don’t be garbage, don’t be garbage. 

“That was really good. I was watching it and I was like, ‘Well, this is true. She’s definitely complaining,’” Oshinyemi says, encouragingly. Next time, maybe add some judgment, he suggests.   

We film several takes featuring different variations of the script. In some versions I’m allowed to move my hands around. In others, Oshinyemi asks me to hold a metal pin between my fingers as I do. This is to test the “edges” of the technology’s capabilities when it comes to communicating with hands, Oshinyemi says. 

Historically, making AI avatars look natural and matching mouth movements to speech has been a very difficult challenge, says David Barber, a professor of machine learning at University College London who is not involved in Synthesia’s work. That is because the problem goes far beyond mouth movements; you have to think about eyebrows, all the muscles in the face, shoulder shrugs, and the numerous different small movements that humans use to express themselves. 

The motion capture process uses reference patterns to help align footage captured from multiple angles around the subject.
DAVID VINTINER

Synthesia has worked with actors to train its models since 2020, and their doubles make up the 225 stock avatars that are available for customers to animate with their own scripts. But to train its latest generation of avatars, Synthesia needed more data; it has spent the past year working with around 1,000 professional actors in London and New York. (Synthesia says it does not sell the data it collects, although it does release some of it for academic research purposes.)

The actors previously got paid each time their avatar was used, but now the company pays them an up-front fee to train the AI model. Synthesia uses their avatars for three years, at which point actors are asked if they want to renew their contracts. If so, they come into the studio to make a new avatar. If not, the company will delete their data. Synthesia’s enterprise customers can also generate their own custom avatars by sending someone into the studio to do much of what I’m doing.

The initial calibration process allows the system to understand the subject’s natural colors and facial features.

Synthesia also collects voice samples. In the studio, I read a passage indicating that I explicitly consent to having my voice cloned.

Between takes, the makeup artist comes in and does some touch-ups to make sure I look the same in every shot. I can feel myself blushing because of the lights in the studio, but also because of the acting. After the team has collected all the shots it needs to capture my facial expressions, I go downstairs to read more text aloud for voice samples. 

This process requires me to read a passage indicating that I explicitly consent to having my voice cloned, and that it can be used on Voica’s account on the Synthesia platform to generate videos and speech. 

Consent is key

This process is very different from the way many AI avatars, deepfakes, or synthetic media—whatever you want to call them—are created. 

Most deepfakes aren’t created in a studio. Studies have shown that the vast majority of deepfakes online are nonconsensual sexual content, usually using images stolen from social media. Generative AI has made the creation of these deepfakes easy and cheap, and there have been several high-profile cases in the US and Europe of children and women being abused in this way. Experts have also raised alarms that the technology can be used to spread political disinformation, a particularly acute threat given the record number of elections happening around the world this year. 

Synthesia’s policy is to not create avatars of people without their explicit consent. But it hasn’t been immune from abuse. Last year, researchers found pro-China misinformation that was created using Synthesia’s avatars and packaged as news, which the company said violated its terms of service. 

Since then, the company has put more rigorous verification and content moderation systems in place. It applies a watermark with information on where and how the AI avatar videos were created. Where it once had four in-house content moderators, people doing this work now make up 10% of its 300-person staff. It also hired an engineer to build better AI-powered content moderation systems. These filters help Synthesia vet every single thing its customers try to generate. Anything suspicious or ambiguous, such as content about cryptocurrencies or sexual health, gets forwarded to the human content moderators. Synthesia also keeps a record of all the videos its system creates.

And while anyone can join the platform, many features aren’t available until people go through an extensive vetting system similar to that used by the banking industry, which includes talking to the sales team, signing legal contracts, and submitting to security auditing, says Voica. Entry-level customers are limited to producing strictly factual content, and only enterprise customers using custom avatars can generate content that contains opinions. On top of this, only accredited news organizations are allowed to create content on current affairs.

“We can’t claim to be perfect. If people report things to us, we take quick action, [such as] banning or limiting individuals or organizations,” Voica says. But he believes these measures work as a deterrent, which means most bad actors will turn to freely available open-source tools instead. 

I put some of these limits to the test when I head to Synthesia’s office for the next step in my avatar generation process. In order to create the videos that will feature my avatar, I have to write a script. Using Voica’s account, I decide to use passages from Hamlet, as well as previous articles I have written. I also use a new feature on the Synthesia platform, which is an AI assistant that transforms any web link or document into a ready-made script. I try to get my avatar to read news about the European Union’s new sanctions against Iran. 

Voica immediately texts me: “You got me in trouble!” 

The system has flagged his account for trying to generate content that is restricted.

AI-powered content filters help Synthesia vet every single thing its customers try to generate. Only accredited news organizations are allowed to create content on current affairs.
COURTESY OF SYNTHESIA

Offering services without these restrictions would be “a great growth strategy,” Riparbelli grumbles. But “ultimately, we have very strict rules on what you can create and what you cannot create. We think the right way to roll out these technologies in society is to be a little bit over-restrictive at the beginning.” 

Still, even if these guardrails operated perfectly, the ultimate result would nevertheless be an internet where everything is fake. And my experiment makes me wonder how we could possibly prepare ourselves. 

Our information landscape already feels very murky. On the one hand, there is heightened public awareness that AI-generated content is flourishing and could be a powerful tool for misinformation. But on the other, it is still unclear whether deepfakes are used for misinformation at scale and whether they’re broadly moving the needle to change people’s beliefs and behaviors. 

If people become too skeptical about the content they see, they might stop believing in anything at all, which could enable bad actors to take advantage of this trust vacuum and lie about the authenticity of real content. Researchers have called this the “liar’s dividend.” They warn that politicians, for example, could claim that genuinely incriminating information was fake or created using AI. 

Claire Leibowicz, the head of AI and media integrity at the nonprofit Partnership on AI, says she worries that growing awareness of this gap will make it easier to “plausibly deny and cast doubt on real material or media as evidence in many different contexts, not only in the news, [but] also in the courts, in the financial services industry, and in many of our institutions.” She tells me she’s heartened by the resources Synthesia has devoted to content moderation and consent but says that process is never flawless.

Even Riparbelli admits that in the short term, the proliferation of AI-generated content will probably cause trouble. While people have been trained not to believe everything they read, they still tend to trust images and videos, he adds. He says people now need to test AI products for themselves to see what is possible, and should not trust anything they see online unless they have verified it. 

Never mind that AI regulation is still patchy, and the tech sector’s efforts to verify content provenance are still in their early stages. Can consumers, with their varying degrees of media literacy, really fight the growing wave of harmful AI-generated content through individual action? 

Watch out, PowerPoint

The day after my final visit, Voica emails me the videos with my avatar. When the first one starts playing, I am taken aback. It’s as painful as seeing yourself on camera or hearing a recording of your voice. Then I catch myself. At first I thought the avatar was me. 

The more I watch videos of “myself,” the more I spiral. Do I really squint that much? Blink that much? And move my jaw like that? Jesus. 

It’s good. It’s really good. But it’s not perfect. “Weirdly good animation,” my partner texts me. 

“But the voice sometimes sounds exactly like you, and at other times like a generic American and with a weird tone,” he adds. “Weird AF.” 

He’s right. The voice is sometimes me, but in real life I umm and ahh more. What’s remarkable is that it picked up on an irregularity in the way I talk. My accent is a transatlantic mess, confused by years spent living in the UK, watching American TV, and attending international school. My avatar sometimes says the word “robot” in a British accent and other times in an American accent. It’s something that probably nobody else would notice. But the AI did. 

My avatar’s range of emotions is also limited. It delivers Shakespeare’s “To be or not to be” speech very matter-of-factly. I had guided it to be furious when reading a story I wrote about Taylor Swift’s nonconsensual nude deepfakes; the avatar is complain-y and judgy, for sure, but not angry. 

This isn’t the first time I’ve made myself a test subject for new AI. Not too long ago, I tried generating AI avatar images of myself, only to get a bunch of nudes. That experience was a jarring example of just how biased AI systems can be. But this experience—and this particular way of being immortalized—was definitely on a different level.

Carl Öhman, an assistant professor at Uppsala University who has studied digital remains and is the author of a new book, The Afterlife of Data, calls avatars like the ones I made “digital corpses.” 

“It looks exactly like you, but no one’s home,” he says. “It would be the equivalent of cloning you, but your clone is dead. And then you’re animating the corpse, so that it moves and talks, with electrical impulses.” 

That’s kind of how it feels. The little, nuanced ways I don’t recognize myself are enough to put me off. Then again, the avatar could quite possibly fool anyone who doesn’t know me very well. It really shines when presenting a story I wrote about how the field of robotics could be getting its own ChatGPT moment; the virtual AI assistant summarizes the long read into a decent short video, which my avatar narrates. It is not Shakespeare, but it’s better than many of the corporate presentations I’ve had to sit through. I think if I were using this to deliver an end-of-year report to my colleagues, maybe that level of authenticity would be enough. 

And that is the sell, according to Riparbelli: “What we’re doing is more like PowerPoint than it is like Hollywood.”

Once a likeness has been generated, Synthesia is able to generate video presentations quickly from a script. In this video, synthetic “Melissa” summarizes an article real Melissa wrote about Taylor Swift deepfakes.
SYNTHESIA

The newest generation of avatars certainly aren’t ready for the silver screen. They’re still stuck in portrait mode, only showing the avatar front-facing and from the waist up. But in the not-too-distant future, Riparbelli says, the company hopes to create avatars that can communicate with their hands and have conversations with one another. It is also planning for full-body avatars that can walk and move around in a space that a person has generated. (The rig to enable this technology already exists; in fact it’s where I am in the image at the top of this piece.)

But do we really want that? It feels like a bleak future where humans are consuming AI-generated content presented to them by AI-generated avatars and using AI to repackage that into more content, which will likely be scraped to generate more AI. If nothing else, this experiment made clear to me that the technology sector urgently needs to step up its content moderation practices and ensure that content provenance techniques such as watermarking are robust. 

Even if Synthesia’s technology and content moderation aren’t yet perfect, they’re significantly better than anything I have seen in the field before, and this is after only a year or so of the current boom in generative AI. AI development moves at breakneck speed, and it is both exciting and daunting to consider what AI avatars will look like in just a few years. Maybe in the future we will have to adopt safewords to indicate that we are in fact communicating with a real human, not an AI. 

But that day is not today. 

I found it weirdly comforting that in one of the videos, my avatar rants about nonconsensual deepfakes and says, in a sociopathically happy voice, “The tech giants? Oh! They’re making a killing!” 

I would never. 

Here’s the defense tech at the center of US aid to Israel, Ukraine, and Taiwan


After weeks of drawn-out congressional debate over how much the United States should spend on conflicts abroad, President Joe Biden signed a $95.3 billion aid package into law on Wednesday.

The bill will send a significant quantity of supplies to Ukraine and Israel, while also supporting Taiwan with submarine technology to aid its defenses against China. It’s also sparked renewed calls for stronger crackdowns on Iranian-produced drones. 

Though much of the money will go toward replenishing fairly standard munitions and supplies, the spending bill provides a window into US strategies around four key defense technologies that continue to reshape how today’s major conflicts are being fought.

For a closer look at the military technology at the center of the aid package, I spoke with Andrew Metrick, a fellow with the defense program at the Center for a New American Security, a think tank.

Ukraine and the role of long-range missiles

Ukraine has long sought the Army Tactical Missile System (ATACMS), a long-range ballistic missile made by Lockheed Martin. First used in combat during Operation Desert Storm in Iraq in 1991, it’s 13 feet long, two feet wide, and over 3,600 pounds. It can use GPS to accurately hit targets 190 miles away. 

Last year, President Biden was apprehensive about sending such missiles to Ukraine, as US stockpiles of the weapons were relatively low. In October, the administration changed tack. The US sent shipments of ATACMS, a move celebrated by President Volodymyr Zelensky of Ukraine, but they came with restrictions: the missiles were older models with a shorter range, and Ukraine was instructed not to fire them into Russian territory, only at targets within Ukrainian territory. 

This week, just hours before the new aid package was signed, multiple news outlets reported that the US had secretly sent more powerful long-range ATACMS to Ukraine several weeks before. They were used on Tuesday, April 23, to target a Russian airfield in Crimea and Russian troops in Berdiansk, 50 miles southwest of Mariupol.

The long range of the weapons has proved essential for Ukraine, says Metrick. “It allows the Ukrainians to strike Russian targets at ranges for which they have very few other options,” he says. That means being able to hit locations like supply depots, command centers, and airfields behind Russia’s front lines in Ukraine. This capacity has grown more important as Ukraine’s troop numbers have waned, Metrick says.

Replenishing Israel’s Iron Dome

On April 13, Iran launched its first-ever direct attack on Israeli soil. In the attack, which Iran says was retaliation for Israel’s airstrike on its embassy in Syria, hundreds of missiles were lobbed into Israeli airspace. Many of them were neutralized by the web of cutting-edge missile defense systems dispersed throughout Israel, which can automatically intercept incoming projectiles and detonate them before they hit land. 

One of those systems is Israel’s Iron Dome, in which radar systems detect projectiles and then signal units to launch defensive missiles that detonate the target high in the sky before it strikes populated areas. Israel’s other system, called David’s Sling, works in a similar way but can identify rockets coming from a greater distance, upwards of 180 miles. 

Both systems are hugely costly to research and build, and the new US aid package allocates $15 billion to replenish their missile stockpile. The missiles can cost anywhere from $100,000 to $10 million each, and a system like Iron Dome might fire them daily during intense periods of conflict. 

The aid comes as funding for Israel has grown more contentious amid the dire conditions faced by displaced Palestinians in Gaza. While the spending bill worked its way through Congress, increasing numbers of Democrats sought to put conditions on the military aid to Israel, particularly after an Israeli air strike on April 1 killed seven aid workers from World Central Kitchen, an international food charity. The funding package does provide $9 billion in humanitarian assistance for the conflict, but the efforts to impose conditions for Israeli military aid failed. 

Taiwan and underwater defenses against China

A rising concern for the US defense community—and a subject of “wargaming” simulations that Metrick has carried out—is an amphibious invasion of Taiwan from China. The rising risk of that scenario has driven the US to build and deploy larger numbers of advanced submarines, Metrick says. A bigger fleet of these submarines would be more likely to keep attacks from China at bay, thereby protecting Taiwan.

The trouble is that the US shipbuilding effort, experts say, is too slow. It’s been hampered by budget cuts and labor shortages, but the new aid bill aims to jump-start it. It will provide $3.3 billion to do so, specifically for the production of Columbia-class submarines, which carry nuclear weapons, and Virginia-class submarines, which carry conventional weapons. 

Though these funds aim to support Taiwan by building up the US supply of submarines, the package also includes more direct support, like $2 billion to help it purchase weapons and defense equipment from the US. 

The US’s Iranian drone problem 

Shahed drones are used almost daily on the Russia-Ukraine battlefield, and Iran launched more than 100 against Israel earlier this month. Produced by Iran and resembling model planes, the drones are fast, cheap, and lightweight, capable of being launched from the back of a pickup truck. They’re used frequently for potent one-way attacks, where they detonate upon reaching their target. US experts say the technology is tipping the scales toward Russian and Iranian military groups and their allies. 

The trouble with combating them is partly one of cost. Shooting down the drones, which can be bought for as little as $40,000, can cost millions in ammunition.

“Shooting down Shaheds with an expensive missile is not, in the long term, a winning proposition,” Metrick says. “That’s what the Iranians, I think, are banking on. They can wear people down.”

This week’s aid package renewed White House calls for stronger sanctions aimed at curbing production of the drones. The United Nations previously passed rules restricting any drone-related material from entering or leaving Iran, but those expired in October. The US now wants them reinstated. 

Even if that happens, it’s unlikely the rules would do much to contain the Shahed’s dominance. The components of the drones are not all that complex or hard to obtain to begin with, but experts also say that Iran has built a sprawling global supply chain to acquire the materials needed to manufacture them and has worked with Russia to build factories. 

“Sanctions regimes are pretty dang leaky,” Metrick says. “They [Iran] have friends all around the world.”

Chatbot answers are all made up. This new tool helps you figure out which ones to trust.

Large language models are famous for their ability to make things up—in fact, it’s what they’re best at. But their inability to tell fact from fiction has left many businesses wondering if using them is worth the risk.

A new tool created by Cleanlab, an AI startup spun out of a quantum computing lab at MIT, is designed to give high-stakes users a clearer sense of how trustworthy these models really are. Called the Trustworthy Language Model, it gives any output generated by a large language model a score between 0 and 1, according to its reliability. This lets people choose which responses to trust and which to throw out. In other words: a BS-o-meter for chatbots.

Cleanlab hopes that its tool will make large language models more attractive to businesses worried about how much stuff they invent. “I think people know LLMs will change the world, but they’ve just got hung up on the damn hallucinations,” says Cleanlab CEO Curtis Northcutt.

Chatbots are quickly becoming the dominant way people look up information on a computer. Search engines are being redesigned around the technology. Office software used by billions of people every day to create everything from school assignments to marketing copy to financial reports now comes with chatbots built in. And yet a study put out in November by Vectara, a startup founded by former Google employees, found that chatbots invent information at least 3% of the time. It might not sound like much, but it’s a potential for error most businesses won’t stomach.

Cleanlab’s tool is already being used by a handful of companies, including Berkeley Research Group, a UK-based consultancy specializing in corporate disputes and investigations. Steven Gawthorpe, associate director at Berkeley Research Group, says the Trustworthy Language Model is the first viable solution to the hallucination problem that he has seen: “Cleanlab’s TLM gives us the power of thousands of data scientists.”

In 2021, Cleanlab developed technology that discovered errors in 34 popular data sets used to train machine-learning algorithms; it works by measuring the differences in output across a range of models trained on that data. That tech is now used by several large companies, including Google, Tesla, and the banking giant Chase. The Trustworthy Language Model takes the same basic idea—that disagreements between models can be used to measure the trustworthiness of the overall system—and applies it to chatbots.

In a demo Cleanlab gave to MIT Technology Review last week, Northcutt typed a simple question into ChatGPT: “How many times does the letter ‘n’ appear in ‘enter’?” ChatGPT answered: “The letter ‘n’ appears once in the word ‘enter.’” That correct answer promotes trust. But ask the question a few more times and ChatGPT answers: “The letter ‘n’ appears twice in the word ‘enter.’”

“Not only does it often get it wrong, but it’s also random, you never know what it’s going to output,” says Northcutt. “Why the hell can’t it just tell you that it outputs different answers all the time?”

Cleanlab’s aim is to make that randomness more explicit. Northcutt asks the Trustworthy Language Model the same question. “The letter ‘n’ appears once in the word ‘enter,’” it says—and scores its answer 0.63. Six out of 10 is not a great score, suggesting that the chatbot’s answer to this question should not be trusted.

It’s a basic example, but it makes the point. Without the score, you might think the chatbot knew what it was talking about, says Northcutt. The problem is that data scientists testing large language models in high-risk situations could be misled by a few correct answers and assume that future answers will be correct too: “They try things out, they try a few examples, and they think this works. And then they do things that result in really bad business decisions.”

The Trustworthy Language Model draws on multiple techniques to calculate its scores. First, each query submitted to the tool is sent to several different large language models. Cleanlab is using five versions of DBRX, an open-source model developed by Databricks, an AI firm based in San Francisco. (But the tech will work with any model, says Northcutt, including Meta’s Llama models or OpenAI’s GPT series, the models behind ChatGPT.) If the responses from each of these models are the same or similar, it will contribute to a higher score.

At the same time, the Trustworthy Language Model also sends variations of the original query to each of the DBRX models, swapping in words that have the same meaning. Again, if the responses to synonymous queries are similar, it will contribute to a higher score. “We mess with them in different ways to get different outputs and see if they agree,” says Northcutt.

The tool can also get multiple models to bounce responses off one another: “It’s like, ‘Here’s my answer—what do you think?’ ‘Well, here’s mine—what do you think?’ And you let them talk.” These interactions are monitored and measured and fed into the score as well.
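
Stripped to its essentials, that ensemble-and-paraphrase idea can be sketched in a few lines of Python. This is a toy illustration, not Cleanlab’s implementation: the model callables, the paraphrase helper, and the crude string-overlap similarity measure below are hypothetical stand-ins for whatever the real tool uses.

```python
# Toy sketch of ensemble-agreement scoring (not Cleanlab's actual code).
# `models` stands in for the several LLM instances described above;
# `paraphrase` is a hypothetical helper that rewords a prompt with synonyms.
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List


def similarity(a: str, b: str) -> float:
    """Crude text overlap in [0, 1]; a real system would compare meaning."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def trust_score(prompt: str,
                models: List[Callable[[str], str]],
                paraphrase: Callable[[str], str]) -> float:
    """Send the prompt and a paraphrased variant to every model and
    return the average pairwise agreement of the answers (0 to 1)."""
    prompts = [prompt, paraphrase(prompt)]
    answers = [model(p) for p in prompts for model in models]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

The more the answers agree across models and phrasings, the closer a score like this gets to 1; disagreement drags it toward 0, which mirrors the 0.63 the tool assigned to the “enter” example above.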

Nick McKenna, a computer scientist at Microsoft Research in Cambridge, UK, who works on large language models for code generation, is optimistic that the approach could be useful. But he doubts it will be perfect. “One of the pitfalls we see in model hallucinations is that they can creep in very subtly,” he says.

In a range of tests across different large language models, Cleanlab shows that its trustworthiness scores correlate well with the accuracy of those models’ responses. In other words, scores close to 1 line up with correct responses, and scores close to 0 line up with incorrect ones. In another test, they also found that using the Trustworthy Language Model with GPT-4 produced more reliable responses than using GPT-4 by itself.

Large language models generate text by predicting the most likely next word in a sequence. In future versions of its tool, Cleanlab plans to make its scores even more accurate by drawing on the probabilities that a model used to make those predictions. It also wants to access the numerical values that models assign to each word in their vocabulary, which they use to calculate those probabilities. This level of detail is provided by certain platforms, such as Amazon’s Bedrock, that businesses can use to run large language models.

Cleanlab has tested its approach on data provided by Berkeley Research Group. The firm needed to search for references to health-care compliance problems in tens of thousands of corporate documents. Doing this by hand can take skilled staff weeks. By checking the documents using the Trustworthy Language Model, Berkeley Research Group was able to see which documents the chatbot was least confident about and check only those. It reduced the workload by around 80%, says Northcutt.
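
A workflow like that amounts to thresholding on the score: accept high-confidence answers automatically and route the rest to people. The sketch below is a hypothetical illustration of that triage step; the 0.7 cutoff and the ask_llm_with_score function are assumptions for the example, not anything Cleanlab or Berkeley Research Group has published.

```python
# Hypothetical triage sketch: send only low-confidence answers to humans.
REVIEW_THRESHOLD = 0.7  # assumed cutoff, not a documented default


def triage(documents, ask_llm_with_score):
    """ask_llm_with_score(doc) -> (answer, trust_score in [0, 1]).

    Returns the documents (with their draft answers) that still need a
    human reviewer because the model's score fell below the cutoff."""
    needs_review = []
    for doc in documents:
        answer, score = ask_llm_with_score(doc)
        if score < REVIEW_THRESHOLD:
            needs_review.append((doc, answer))
    return needs_review
```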

In another test, Cleanlab worked with a large bank (Northcutt would not name it but says it is a competitor to Goldman Sachs). Similar to Berkeley Research Group, the bank needed to search for references to insurance claims in around 100,000 documents. Again, the Trustworthy Language Model reduced the number of documents that needed to be hand-checked by more than half.

Running each query multiple times through multiple models takes longer and costs a lot more than the typical back-and-forth with a single chatbot. But Cleanlab is pitching the Trustworthy Language Model as a premium service to automate high-stakes tasks that would have been off limits to large language models in the past. The idea is not for it to replace existing chatbots but to do the work of human experts. If the tool can slash the amount of time that you need to employ skilled economists or lawyers at $2,000 an hour, the costs will be worth it, says Northcutt.

In the long run, Northcutt hopes that by reducing the uncertainty around chatbots’ responses, his tech will unlock the promise of large language models to a wider range of users. “The hallucination thing is not a large-language-model problem,” he says. “It’s an uncertainty problem.”

Researchers taught robots to run. Now they’re teaching them to walk

We’ve all seen videos over the past few years demonstrating how agile humanoid robots have become, running and jumping with ease. We’re no longer surprised by this kind of agility—in fact, we’ve grown to expect it.

The problem is, these shiny demos lack real-world applications. When it comes to creating robots that are useful and safe around humans, the fundamentals of movement are more important. As a result, researchers are using the same techniques to train humanoid robots to achieve much more modest goals. 

Alan Fern, a professor of computer science at Oregon State University, and a team of researchers have successfully trained a humanoid robot called Digit V3 to stand, walk, pick up a box, and move it from one location to another. Meanwhile, a separate group of researchers from the University of California, Berkeley, has focused on teaching Digit to walk in unfamiliar environments while carrying different loads, without toppling over. Their research is published in a paper in Science Robotics today. 

Both groups are using an AI technique called sim-to-real reinforcement learning, a burgeoning method of training two-legged robots like Digit. Researchers believe it will lead to more robust, reliable two-legged machines capable of interacting with their surroundings more safely—as well as learning much more quickly.

Sim-to-real reinforcement learning involves training AI models to complete certain tasks in simulated environments billions of times before a robot powered by the model attempts to complete them in the real world. What would take years for a robot to learn in real life can take just days thanks to repeated trial-and-error testing in simulations.

A neural network guides the robot using a mathematical reward function, a technique that rewards the robot with a large number every time it moves closer to its target location or completes its goal behavior. If it does something it’s not supposed to do, like falling down, it’s “punished” with a negative number, so it learns to avoid these motions over time.
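
As a concrete, deliberately simplified illustration of that kind of reward function, the sketch below rewards progress toward a target and penalizes falling. The specific numbers and state fields are assumptions made for the example, not the reward either research group actually uses.

```python
# Simplified locomotion reward sketch (illustrative only).
def locomotion_reward(prev_distance: float,
                      curr_distance: float,
                      has_fallen: bool,
                      goal_reached: bool) -> float:
    """Reward moving closer to the target; punish falling over."""
    if has_fallen:
        return -100.0                                 # large penalty
    reward = 10.0 * (prev_distance - curr_distance)   # progress term
    if goal_reached:
        reward += 100.0                               # completion bonus
    return reward
```

Over billions of simulated attempts, the neural network adjusts the robot’s behavior to maximize the cumulative value of a function like this.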

In previous projects, researchers from Oregon State University had used the same reinforcement learning technique to teach a two-legged robot named Cassie to run. The approach paid off—Cassie became the first robot to run an outdoor 5K before setting a Guinness World Record for the fastest bipedal robot to run 100 meters and mastering the ability to jump from one location to another with ease.

Training robots to behave in athletic ways requires them to develop really complex skills in very narrow environments, says Ilija Radosavovic, a PhD student at Berkeley who trained Digit to carry a wide range of loads and stabilize itself when poked with a stick. “We’re sort of the opposite—focusing on fairly simple skills in broad environments.”

This new wave of research in humanoid robotics is less concerned with speed and ability, and more focused on making machines robust and able to adapt—which is ultimately what’s needed to make them useful in the real world. Humanoid robots remain a relative rarity in work environments, as they often struggle to balance while carrying heavy objects. This is why most robots designed to lift objects of varying weights in factories and warehouses tend to have four legs or larger, more stable bases. But researchers hope to change that by making humanoid robots more reliable using AI techniques. 

Reinforcement learning will usher in a “new, much more flexible and faster way for training these types of manipulation skills,” Fern says. He and his team are due to present their findings at ICRA, the International Conference on Robotics and Automation, in Japan next month.

The ultimate goal is for a human to be able to show the robot a video of the desired task, like picking up a box from one shelf and pushing it onto another higher shelf, and then have the robot do it without requiring any further instruction, says Fern.

Getting robots to observe, copy, and quickly learn these kinds of behaviors would be really useful, but it still remains a challenge, says Lerrel Pinto, an assistant professor of computer science at New York University, who was not involved in the research. “If that could be done, I would be very impressed by that,” he says. “These are hard problems.”

Is robotics about to have its own ChatGPT moment?

Silent. Rigid. Clumsy.

Henry and Jane Evans are used to awkward houseguests. For more than a decade, the couple, who live in Los Altos Hills, California, have hosted a slew of robots in their home. 

In 2002, at age 40, Henry had a massive stroke, which left him with quadriplegia and an inability to speak. Since then, he’s learned how to communicate by moving his eyes over a letter board, but he is highly reliant on caregivers and his wife, Jane. 

Henry got a glimmer of a different kind of life when he saw Charlie Kemp on CNN in 2010. Kemp, a robotics professor at Georgia Tech, was on TV talking about PR2, a robot developed by the company Willow Garage. PR2 was a massive two-armed machine on wheels that looked like a crude metal butler. Kemp was demonstrating how the robot worked, and talking about his research on how health-care robots could help people. He showed how the PR2 robot could hand some medicine to the television host.    

“All of a sudden, Henry turns to me and says, ‘Why can’t that robot be an extension of my body?’ And I said, ‘Why not?’” Jane says. 

There was a solid reason why not. While engineers have made great progress in getting robots to work in tightly controlled environments like labs and factories, the home has proved difficult to design for. Out in the real, messy world, furniture and floor plans differ wildly; children and pets can jump in a robot’s way; and clothes that need folding come in different shapes, colors, and sizes. Managing such unpredictable settings and varied conditions has been beyond the capabilities of even the most advanced robot prototypes. 

That seems to finally be changing, in large part thanks to artificial intelligence. For decades, roboticists have more or less focused on controlling robots’ “bodies”—their arms, legs, levers, wheels, and the like—via purpose-driven software. But a new generation of scientists and inventors believes that the previously missing ingredient of AI can give robots the ability to learn new skills and adapt to new environments faster than ever before. This new approach, just maybe, can finally bring robots out of the factory and into our homes. 

Progress won’t happen overnight, though, as the Evanses know far too well from their many years of using various robot prototypes. 

PR2 was the first robot they brought in, and it opened entirely new skills for Henry. It would hold a beard shaver and Henry would move his face against it, allowing him to shave and scratch an itch by himself for the first time in a decade. But at 450 pounds (200 kilograms) or so and $400,000, the robot was difficult to have around. “It could easily take out a wall in your house,” Jane says. “I wasn’t a big fan.”

More recently, the Evanses have been testing out a smaller robot called Stretch, which Kemp developed through his startup Hello Robot. The first iteration launched during the pandemic with a much more reasonable price tag of around $18,000. 

Stretch weighs about 50 pounds. It has a small mobile base, a stick with a camera dangling off it, and an adjustable arm featuring a gripper with suction cups at the ends. It can be controlled with a console controller. Henry controls Stretch using a laptop, with a tool that tracks his head movements to move a cursor around. He is able to move his thumb and index finger enough to click a computer mouse. Last summer, Stretch was with the couple for more than a month, and Henry says it gave him a whole new level of autonomy. “It was practical, and I could see using it every day,” he says. 

Henry Evans used the Stretch robot to brush his hair, eat, and even play with his granddaughter.
PETER ADAMS

Using his laptop, he could get the robot to brush his hair and have it hold fruit kebabs for him to snack on. It also opened up Henry’s relationship with his granddaughter Teddie. Before, they barely interacted. “She didn’t hug him at all goodbye. Nothing like that,” Jane says. But “Papa Wheelie” and Teddie used Stretch to play, engaging in relay races, bowling, and magnetic fishing. 

Stretch doesn’t have much in the way of smarts: it comes with some preinstalled software, such as the web interface that Henry uses to control it, and other capabilities such as AI-enabled navigation. The main benefit of Stretch is that people can plug in their own AI models and use them to do experiments. But it offers a glimpse of what a world with useful home robots could look like. Robots that can do many of the things humans do in the home—tasks such as folding laundry, cooking meals, and cleaning—have been a dream of robotics research since the inception of the field in the 1950s. For a long time, it’s been just that: “Robotics is full of dreamers,” says Kemp.

But the field is at an inflection point, says Ken Goldberg, a robotics professor at the University of California, Berkeley. Previous efforts to build a useful home robot, he says, have emphatically failed to meet the expectations set by popular culture—think the robotic maid from The Jetsons. Now things are very different. Thanks to cheap hardware like Stretch, along with efforts to collect and share data and advances in generative AI, robots are getting more competent and helpful faster than ever before. “We’re at a point where we’re very close to getting capability that is really going to be useful,” Goldberg says. 

Folding laundry, cooking shrimp, wiping surfaces, unloading shopping baskets—today’s AI-powered robots are learning to do tasks that for their predecessors would have been extremely difficult. 

Missing pieces

There’s a well-known observation among roboticists: What is hard for humans is easy for machines, and what is easy for humans is hard for machines. Called Moravec’s paradox, it was first articulated in the 1980s by Hans Moravec, then a roboticist at the Robotics Institute of Carnegie Mellon University. A robot can play chess or hold an object still for hours on end with no problem. Tying a shoelace, catching a ball, or having a conversation is another matter. 

There are three reasons for this, says Goldberg. First, robots lack precise control and coordination. Second, their understanding of the surrounding world is limited because they are reliant on cameras and sensors to perceive it. Third, they lack an innate sense of practical physics. 

“Pick up a hammer, and it will probably fall out of your gripper, unless you grab it near the heavy part. But you don’t know that if you just look at it, unless you know how hammers work,” Goldberg says. 

On top of these basic considerations, there are many other technical things that need to be just right, from motors to cameras to Wi-Fi connections, and hardware can be prohibitively expensive. 

Mechanically, we’ve been able to do fairly complex things for a while. In a video from 1957, two large robotic arms are dexterous enough to pinch a cigarette, place it in the mouth of a woman at a typewriter, and reapply her lipstick. But the intelligence and the spatial awareness of that robot came from the person who was operating it. 

In a video from 1957, a man operates two large robotic arms and uses the machine to apply a woman’s lipstick. Robots have come a long way since.
“LIGHTER SIDE OF THE NEWS –ATOMIC ROBOT A HANDY GUY” (1957) VIA YOUTUBE

“The missing piece is: How do we get software to do [these things] automatically?” says Deepak Pathak, an assistant professor of computer science at Carnegie Mellon.  

Researchers training robots have traditionally approached this problem by planning everything the robot does in excruciating detail. Robotics giant Boston Dynamics used this approach when it developed its boogying and parkouring humanoid robot Atlas. Cameras and computer vision are used to identify objects and scenes. Researchers then use that data to make models that can be used to predict with extreme precision what will happen if a robot moves a certain way. Using these models, roboticists plan the motions of their machines by writing a very specific list of actions for them to take. The engineers then test these motions in the laboratory many times and tweak them to perfection. 

This approach has its limits. Robots trained like this are strictly choreographed to work in one specific setting. Take them out of the laboratory and into an unfamiliar location, and they are likely to topple over. 

Compared with other fields, such as computer vision, robotics has been in the dark ages, Pathak says. But that might not be the case for much longer, because the field is seeing a big shake-up. Thanks to the AI boom, he says, the focus is now shifting from feats of physical dexterity to building “general-purpose robot brains” in the form of neural networks. Much as the human brain is adaptable and can control different aspects of the human body, these networks can be adapted to work in different robots and different scenarios. Early signs of this work show promising results. 

Robots, meet AI 

For a long time, robotics research was an unforgiving field, plagued by slow progress. At the Robotics Institute at Carnegie Mellon, where Pathak works, he says, “there used to be a saying that if you touch a robot, you add one year to your PhD.” Now, he says, students get exposure to many robots and see results in a matter of weeks.

What separates this new crop of robots is their software. Instead of the traditional painstaking planning and training, roboticists have started using deep learning and neural networks to create systems that learn from their environment on the go and adjust their behavior accordingly. At the same time, new, cheaper hardware, such as off-the-shelf components and robots like Stretch, is making this sort of experimentation more accessible. 

Broadly speaking, there are two popular ways researchers are using AI to train robots. Pathak has been using reinforcement learning, an AI technique that allows systems to improve through trial and error, to get robots to adapt their movements in new environments. This is a technique that Boston Dynamics has also started using in its robot “dogs” called Spot.
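To make the idea concrete, here is a deliberately tiny sketch of reinforcement learning: tabular Q-learning on a five-cell corridor. It is an illustration only (the corridor, rewards, and hyperparameters are invented for this example, not the setup Pathak’s lab or Boston Dynamics actually uses), but the loop is the same one that scales up with simulators and neural networks: try an action, observe the reward, and nudge the policy toward what worked.

```python
# Toy reinforcement learning: tabular Q-learning on a 1-D "corridor" world.
# Real legged-robot training uses simulators and deep networks, but the
# trial-and-error loop (act, observe reward, update) is the same.
import random

N_STATES = 5          # positions 0..4; the goal is at position 4
ACTIONS = [-1, +1]    # step left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

# Q[state][action_index] estimates how good each action is in each state.
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Explore occasionally; otherwise act greedily on current estimates.
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = 0 if Q[state][0] >= Q[state][1] else 1
        next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else -0.01
        # Temporal-difference update: move the estimate toward reward + future value.
        Q[state][a] += ALPHA * (reward + GAMMA * max(Q[next_state]) - Q[state][a])
        state = next_state

print("Learned preference for stepping right in each state:")
print([round(q[1] - q[0], 2) for q in Q])
```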

Deepak Pathak’s team at Carnegie Mellon has used an AI technique called reinforcement learning to create a robotic dog that can do extreme parkour with minimal pre-programming.

In 2022, Pathak’s team used this method to create four-legged robot “dogs” capable of scrambling up steps and navigating tricky terrain. The robots were first trained to move around in a general way in a simulator. Then they were set loose in the real world, with a single built-in camera and computer vision software to guide them. Other similar robots rely on tightly prescribed internal maps of the world and cannot navigate beyond them.

Pathak says the team’s approach was inspired by human navigation. Humans receive information about the surrounding world from their eyes, and this helps them instinctively place one foot in front of the other to get around in an appropriate way. Humans don’t typically look down at the ground under their feet when they walk, but a few steps ahead, at a spot where they want to go. Pathak’s team trained its robots to take a similar approach to walking: each one used the camera to look ahead. The robot was then able to memorize what was in front of it for long enough to guide its leg placement. The robots learned about the world in real time, without internal maps, and adjusted their behavior accordingly. At the time, experts told MIT Technology Review the technique was a “breakthrough in robot learning and autonomy” and could allow researchers to build legged robots capable of being deployed in the wild.   

Pathak’s robot dogs have since leveled up. The team’s latest algorithm allows a quadruped robot to do extreme parkour. The robot was again trained to move around in a general way in a simulation. But using reinforcement learning, it was then able to teach itself new skills on the go, such as how to jump long distances, walk on its front legs, and clamber up tall boxes twice its height. These behaviors were not something the researchers programmed. Instead, the robot learned through trial and error and visual input from its front camera. “I didn’t believe it was possible three years ago,” Pathak says. 

In the other popular technique, called imitation learning, models learn to perform tasks by, for example, imitating the actions of a human teleoperating a robot or using a VR headset to collect data on a robot. It’s a technique that has gone in and out of fashion over decades but has recently become more popular with robots that do manipulation tasks, says Russ Tedrake, vice president of robotics research at the Toyota Research Institute and an MIT professor.

By pairing this technique with generative AI, researchers at the Toyota Research Institute, Columbia University, and MIT have been able to quickly teach robots to do many new tasks. They believe they have found a way to extend the technology propelling generative AI from the realm of text, images, and videos into the domain of robot movements. 

The idea is to start with a human, who manually controls the robot to demonstrate behaviors such as whisking eggs or picking up plates. Using a technique called diffusion policy, the robot is then able to use the data fed into it to learn skills. The researchers have taught robots more than 200 skills, such as peeling vegetables and pouring liquids, and say they are working toward teaching 1,000 skills by the end of the year. 
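The supervised backbone of this kind of imitation learning is simple enough to sketch. The toy example below uses plain behavior cloning with synthetic data standing in for teleoperated demonstrations; the diffusion-policy work described above replaces the simple regression head with a denoising model over action sequences, and the dimensions and network here are invented for illustration.

```python
# Minimal imitation learning as behavior cloning, in PyTorch.
# Observation/action dimensions are made up for illustration.
import torch
from torch import nn

obs_dim, act_dim = 32, 7          # e.g., image features in, joint commands out
policy = nn.Sequential(
    nn.Linear(obs_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in for teleoperated demonstrations: (observation, expert action) pairs.
demo_obs = torch.randn(1000, obs_dim)
demo_act = torch.randn(1000, act_dim)

for epoch in range(20):
    pred = policy(demo_obs)          # what the policy would do
    loss = loss_fn(pred, demo_act)   # how far it is from the human demonstration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At run time the robot feeds in its current observation and executes the output.
action = policy(torch.randn(1, obs_dim))
```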

Many others have taken advantage of generative AI as well. Covariant, a robotics startup that spun off from OpenAI’s now-shuttered robotics research unit, has built a multimodal model called RFM-1. It can accept prompts in the form of text, image, video, robot instructions, or measurements. Generative AI allows the robot to both understand instructions and generate images or videos relating to those tasks. 

The Toyota Research Institute team hopes this will one day lead to “large behavior models,” which are analogous to large language models, says Tedrake. “A lot of people think behavior cloning is going to get us to a ChatGPT moment for robotics,” he says. 

In a similar demonstration, earlier this year a team at Stanford managed to use a relatively cheap off-the-shelf robot costing $32,000 to do complex manipulation tasks such as cooking shrimp and cleaning stains. It learned those new skills quickly with AI. 

Called Mobile ALOHA (a loose acronym for “a low-cost open-source hardware teleoperation system”), the robot learned to cook shrimp with the help of just 20 human demonstrations and data from other tasks, such as tearing off a paper towel or piece of tape. The Stanford researchers found that AI can help robots acquire transferable skills: training on one task can improve a robot’s performance on others.

While the current generation of generative AI works with images and language, researchers at the Toyota Research Institute, Columbia University, and MIT believe the approach can extend to the domain of robot motion.

This is all laying the groundwork for robots that can be useful in homes. Human needs change over time, and teaching robots to reliably do a wide range of tasks is important, as it will help them adapt to us. That is also crucial to commercialization—first-generation home robots will come with a hefty price tag, and the robots need to have enough useful skills for regular consumers to want to invest in them. 

For a long time, a lot of the robotics community was very skeptical of these kinds of approaches, says Chelsea Finn, an assistant professor of computer science and electrical engineering at Stanford University and an advisor for the Mobile ALOHA project. Finn says that nearly a decade ago, learning-based approaches were rare at robotics conferences and disparaged in the robotics community. “The [natural-language-processing] boom has been convincing more of the community that this approach is really, really powerful,” she says. 

There is one catch, however. In order to imitate new behaviors, the AI models need plenty of data. 

More is more

Unlike chatbots, which can be trained by using billions of data points hoovered from the internet, robots need data specifically created for robots. They need physical demonstrations of how washing machines and fridges are opened, dishes picked up, or laundry folded, says Lerrel Pinto, an assistant professor of computer science at New York University. Right now that data is very scarce, and it takes a long time for humans to collect.

A person records themselves opening a kitchen drawer with a reacher-grabber stick; below, a robot attempts the same action.
“ON BRINGING ROBOTS HOME,” NUR MUHAMMAD (MAHI) SHAFIULLAH, ET AL.

Some researchers are trying to use existing videos of humans doing things to train robots, hoping the machines will be able to copy the actions without the need for physical demonstrations. 

Pinto’s lab has also developed a neat, cheap data collection approach that connects robotic movements to desired actions. Researchers took a reacher-grabber stick, similar to ones used to pick up trash, and attached an iPhone to it. Human volunteers can use this system to film themselves doing household chores, mimicking the robot’s view of the end of its robotic arm. Using this stand-in for Stretch’s robotic arm and an open-source system called DOBB-E, Pinto’s team was able to get a Stretch robot to learn tasks such as pouring from a cup and opening shower curtains with just 20 minutes of iPhone data.  
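A rough, hypothetical sketch of what that data pipeline can look like: each recorded frame is paired with how the gripper moved before the next frame, so “what the human did next” becomes the training label. The array shapes and pose format below are invented for illustration; the real DOBB-E pipeline is more involved.

```python
# Hypothetical conversion of an iPhone-on-a-grabber recording into robot
# training examples: (observation, action) pairs where the action is the
# change in gripper pose between consecutive frames.
import numpy as np

frames = np.random.rand(300, 64, 64, 3)                      # stand-in for 300 video frames
poses = np.cumsum(np.random.randn(300, 6) * 0.01, axis=0)    # x, y, z, roll, pitch, yaw per frame

dataset = []
for t in range(len(frames) - 1):
    observation = frames[t]
    action = poses[t + 1] - poses[t]       # how the gripper moved next
    dataset.append((observation, action))

print(f"{len(dataset)} (observation, action) pairs from one recording")
```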

But for more complex tasks, robots would need even more data and more demonstrations.  

The requisite scale would be hard to reach with DOBB-E, says Pinto, because you’d basically need to persuade every human on Earth to buy the reacher-grabber system, collect data, and upload it to the internet. 

A new initiative kick-started by Google DeepMind, called the Open X-Embodiment Collaboration, aims to change that. Last year, the company partnered with 34 research labs and about 150 researchers to collect data from 22 different robots, including Hello Robot’s Stretch. The resulting data set, which was published in October 2023, consists of robots demonstrating 527 skills, such as picking, pushing, and moving.  

Sergey Levine, a computer scientist at UC Berkeley who participated in the project, says the goal was to create a “robot internet” by collecting data from labs around the world. This would give researchers access to bigger, more scalable, and more diverse data sets. The deep-learning revolution that led to the generative AI of today started in 2012 with the rise of ImageNet, a vast online data set of images. The Open X-Embodiment Collaboration is an attempt by the robotics community to do something similar for robot data. 

Early signs show that more data is leading to smarter robots. The researchers built two versions of a model for robots, called RT-X, that could either be run locally on individual labs’ computers or accessed via the web. The larger, web-accessible model was pretrained on internet data and drew on large language and image models to develop a “visual common sense,” a baseline understanding of the world. 

When the researchers ran the RT-X model on many different robots, they discovered that the robots were able to learn skills 50% more successfully than in the systems each individual lab was developing.

“I don’t think anybody saw that coming,” says Vincent Vanhoucke, Google DeepMind’s head of robotics. “Suddenly there is a path to basically leveraging all these other sources of data to bring about very intelligent behaviors in robotics.”

Many roboticists think that large vision-language models, which are able to analyze image and language data, might offer robots important hints as to how the surrounding world works, Vanhoucke says. They offer semantic clues about the world and could help robots with reasoning, deducing things, and learning by interpreting images. To test this, researchers took a robot that had been trained on the larger model and asked it to point to a picture of Taylor Swift. There were no photos of Swift in the robot’s training data, but it identified the pop star anyway, because its web-scale pretraining gave it an understanding of who she was, says Vanhoucke.
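The underlying capability is easy to reproduce with open tools. The sketch below uses the publicly available CLIP model through Hugging Face’s transformers library as a stand-in (it is not the model Google DeepMind used) to show how web-scale pretraining lets a vision-language model match a camera frame against labels the robotics team never provided.

```python
# Zero-shot image-to-label matching with an open vision-language model (CLIP).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# In a robot this would be a camera frame; here it is a blank placeholder image.
frame = Image.new("RGB", (224, 224))
labels = ["a photo of Taylor Swift", "a photo of a coffee mug", "a photo of a chair"]

inputs = processor(text=labels, images=frame, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image     # similarity of the image to each label
probs = logits.softmax(dim=1)

print("Most likely label:", labels[probs.argmax().item()])
```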

RT-2, a recent model for robotic control, was trained on online text and images as well as interactions with the real world.
KELSEY MCCLELLAN

Vanhoucke says Google DeepMind is increasingly using techniques similar to those it would use for machine translation to translate from English to robotics. Last summer, Google introduced a vision-language-action model called RT-2. This model gets its general understanding of the world from online text and images it has been trained on, as well as its own interactions in the real world. It translates that data into robotic actions. Each robot has a slightly different way of translating English into action, he adds.  

“We increasingly feel like a robot is essentially a chatbot that speaks robotese,” Vanhoucke says. 
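One common way to make a language-model-style decoder “speak robotese” is to discretize continuous commands into a fixed vocabulary of action tokens, so the model can emit actions the same way it emits words. The sketch below is a simplified, hypothetical version of that idea; the bin counts and value ranges are invented, and RT-2’s actual scheme differs in its details.

```python
# Hypothetical "robotese": map continuous robot commands to integer tokens
# and back, so a text-style decoder can output actions as a token sequence.
import numpy as np

N_BINS = 256                  # vocabulary size for each action dimension
LOW, HIGH = -1.0, 1.0         # normalized range of each command

def action_to_tokens(action):
    """Map continuous commands (e.g., arm deltas, gripper) to integer tokens."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def tokens_to_action(tokens):
    """Invert the mapping so the robot can execute what the model 'said'."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

command = np.array([0.12, -0.40, 0.88, 0.0, 0.0, -0.25, 1.0])  # 6-DoF delta + gripper
tokens = action_to_tokens(command)
print("action tokens:", tokens.tolist())
print("decoded back :", np.round(tokens_to_action(tokens), 3).tolist())
```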

Baby steps

Despite the fast pace of development, robots still face many challenges before they can be released into the real world. They are still way too clumsy for regular consumers to justify spending tens of thousands of dollars on them. Robots also still lack the sort of common sense that would allow them to multitask. And they need to move from just picking things up and placing them somewhere to putting things together, says Goldberg—for example, putting a deck of cards or a board game back in its box and then into the games cupboard. 

But to judge from the early results of integrating AI into robots, roboticists are not wasting their time, says Pinto. 

“I feel fairly confident that we will see some semblance of a general-purpose home robot. Now, will it be accessible to the general public? I don’t think so,” he says. “But in terms of raw intelligence, we are already seeing signs right now.” 

The next generation of robots might not just assist humans in their everyday chores or help people like Henry Evans live a more independent life. For researchers like Pinto, there is an even bigger goal in sight.

Home robotics offers one of the best benchmarks for human-level machine intelligence, he says. The fact that a human can operate intelligently in the home environment, he adds, means we know this is a level of intelligence that can be reached. 

“It’s something which we can potentially solve. We just don’t know how to solve it,” he says. 

Thanks to Stretch, Henry Evans was able to hold his own playing cards for the first time in two decades.
VY NGUYEN

For Henry and Jane Evans, a big win would be to get a robot that simply works reliably. The Stretch robot that the Evanses experimented with is still too buggy to use without researchers present to troubleshoot, and their home doesn’t always have the dependable Wi-Fi connectivity Henry needs in order to communicate with Stretch using a laptop.

Even so, Henry says, one of the greatest benefits of his experiment with robots has been independence: “All I do is lay in bed, and now I can do things for myself that involve manipulating my physical environment.”

Thanks to Stretch, for the first time in two decades, Henry was able to hold his own playing cards during a match. 

“I kicked everyone’s butt several times,” he says. 

“Okay, let’s not talk too big here,” Jane says, and laughs.

This US startup makes a crucial chip material and is taking on a Japanese giant

It can be dizzying to try to understand all the complex components of a single computer chip: layers of microscopic components linked to one another through highways of copper wires, some barely wider than a few strands of DNA. Nestled between those wires is an insulating material called a dielectric, ensuring that the wires don’t touch and short out. Zooming in further, there’s one particular dielectric placed between the chip and the structure beneath it; this material, called dielectric film, is produced in sheets as thin as white blood cells. 

For 30 years, a single Japanese company called Ajinomoto has made billions producing this particular film. Competitors have struggled to outdo them, and today Ajinomoto has more than 90% of the market in the product, which is used in everything from laptops to data centers. 

But now, a startup based in Berkeley, California, is embarking on a herculean effort to dethrone Ajinomoto and bring this small slice of the chipmaking supply chain back to the US.

Thintronics is promising a product purpose-built for the computing demands of the AI era—a suite of new materials that the company claims have better insulating properties and, if adopted, could mean data centers with faster computing speeds and lower energy costs. 

The company is at the forefront of a coming wave of new US-based companies, spurred by the $280 billion CHIPS and Science Act, that is seeking to carve out a portion of the semiconductor sector, which has become dominated by just a handful of international players. But to succeed, Thintronics and its peers will have to overcome a web of challenges—solving technical problems, disrupting long-standing industry relationships, and persuading global semiconductor titans to accommodate new suppliers. 

“Inventing new materials platforms and getting them into the world is very difficult,” Thintronics founder and CEO Stefan Pastine says. It is “not for the faint of heart.”

The insulator bottleneck

If you recognize the name Ajinomoto, you’re probably surprised to hear it plays a critical role in the chip sector: the company is better known as the world’s leading supplier of MSG seasoning powder. In the 1990s, Ajinomoto discovered that a by-product of MSG made a great insulator, and it has enjoyed a near monopoly in the niche material ever since. 

But Ajinomoto doesn’t make any of the other parts that go into chips. In fact, the insulating materials in chips rely on dispersed supply chains: one layer uses materials from Ajinomoto, another uses material from another company, and so on, with none of the layers optimized to work in tandem. The resulting system works okay when data is being transmitted over short paths, but over longer distances, like between chips, weak insulators act as a bottleneck, wasting energy and slowing down computing speeds. That has recently become a growing concern, especially as AI training runs grow larger and more expensive and consume eye-popping amounts of energy. (Ajinomoto did not respond to requests for comment.) 

None of this made much sense to Pastine, a chemist who sold his previous company, which specialized in recycling hard plastics, to an industrial chemicals company in 2019. Around that time, he started to believe that the chemicals industry could be slow to innovate, and he thought the same pattern was keeping chipmakers from finding better insulating materials. In the chip industry, he says, insulators have “kind of been looked at as the redheaded stepchild”—they haven’t seen the progress made with transistors and other chip components. 

He launched Thintronics that same year, with the hope that cracking the code on a better insulator could provide data centers with faster computing speeds at lower costs. That idea wasn’t groundbreaking—new insulators are constantly being researched and deployed—but Pastine believed that he could find the right chemistry to deliver a breakthrough. 

Thintronics says it will manufacture different insulators for all layers of the chip, for a system designed to swap into existing manufacturing lines. Pastine tells me the materials are now being tested with a number of industry players. But he declined to provide names, citing nondisclosure agreements, and similarly would not share details of the formula. 

Without more details, it’s hard to say exactly how well the Thintronics materials compare with competing products. The company recently tested its materials’ Dk values, which are a measure of how effective an insulator a material is (a lower Dk generally means signals travel through it faster). Venky Sundaram, a researcher who has founded multiple semiconductor startups but is not involved with Thintronics, reviewed the results. Some of Thintronics’ numbers were fairly average, he says, but their most impressive Dk value is far better than anything available today.
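For context, two standard relationships from transmission-line physics (general textbook results, not Thintronics data) explain why a lower Dk matters: signals travel faster through a lower-Dk dielectric, and less of their energy is dissipated as heat along the way.

```latex
% General relationships, not Thintronics figures:
% v: signal speed along the interconnect, c: speed of light in vacuum,
% D_k: dielectric constant, f: signal frequency,
% \tan\delta: the material's loss tangent, \alpha_d: dielectric loss per unit length.
\[
  v \approx \frac{c}{\sqrt{D_k}}
  \qquad
  \alpha_d \propto f \,\sqrt{D_k}\,\tan\delta
\]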

A rocky road ahead

Thintronics’ vision has already garnered some support. The company received a $20 million Series A funding round in March, led by venture capital firms Translink and Maverick, as well as a grant from the US National Science Foundation. 

The company is also seeking funding from the CHIPS Act. Signed into law by President Joe Biden in 2022, it’s designed to boost companies like Thintronics in order to bring semiconductor manufacturing back to American companies and reduce reliance on foreign suppliers. A year after it became law, the administration said that more than 450 companies had submitted statements of interest to receive CHIPS funding for work across the sector. 

The bulk of funding from the legislation is destined for large-scale manufacturing facilities, like those operated by Intel in New Mexico and Taiwan Semiconductor Manufacturing Company (TSMC) in Arizona. But US Secretary of Commerce Gina Raimondo has said she’d like to see smaller companies receive funding as well, especially in the materials space. In February, applications opened for a pool of $300 million earmarked specifically for materials innovation. While Thintronics declined to say how much funding it was seeking or from which programs, the company does see the CHIPS Act as a major tailwind.

But building a domestic supply chain for chips—a product that currently depends on dozens of companies around the globe—will mean reversing decades of specialization by different countries. And industry experts say it will be difficult to challenge today’s dominant insulator suppliers, who have often had to adapt to fend off new competition. 

“Ajinomoto has been a 90-plus-percent-market-share material for more than two decades,” says Sundaram. “This is unheard-of in most businesses, and you can imagine they didn’t get there by not changing.”

One big challenge is that the dominant manufacturers have decades-long relationships with chip designers like Nvidia or Advanced Micro Devices, and with manufacturers like TSMC. Asking these players to swap out materials is a big deal.

“The semiconductor industry is very conservative,” says Larry Zhao, a semiconductor researcher who has worked in the dielectrics industry for more than 25 years. “They like to use the vendors they already know very well, where they know the quality.” 

Another obstacle facing Thintronics is technical: insulating materials, like other chip components, are held to manufacturing standards so precise they are difficult to comprehend. The layers where Ajinomoto dominates are thinner than a human hair. The material must also be able to accept tiny holes, which house wires running vertically through the film. Every new iteration is a massive R&D effort in which incumbent companies have the upper hand given their years of experience, says Sundaram.

If all this is completed successfully in a lab, yet another hurdle lies ahead: the material has to retain those properties in a high-volume manufacturing facility, which is where Sundaram has seen past efforts fail.

“I have advised several material suppliers over the years that tried to break into [Ajinomoto’s] business and couldn’t succeed,” he says. “They all ended up having the problem of not being as easy to use in a high-volume production line.” 

Despite all these challenges, one thing may be working in Thintronics’ favor: US-based tech giants like Microsoft and Meta are making headway in designing their own chips for the first time. The plan is to use these chips for in-house AI training as well as for the cloud computing capacity that they rent out to customers, both of which would reduce the industry’s reliance on Nvidia. 

Though Microsoft, Google, and Meta declined to comment on whether they are pursuing advancements in materials like insulators, Sundaram says these firms could be more willing to work with new US startups rather than defaulting to the old ways of making chips: “They have a lot more of an open mind about supply chains than the existing big guys.”

Taking AI to the next level in manufacturing

Few technological advances have generated as much excitement as AI. In particular, generative AI seems to have taken business discourse to a fever pitch. Many manufacturing leaders express optimism: Research conducted by MIT Technology Review Insights found ambitions for AI development to be stronger in manufacturing than in most other sectors.


Manufacturers rightly view AI as integral to the creation of the hyper-automated intelligent factory. They see AI’s utility in enhancing product and process innovation, reducing cycle time, wringing ever more efficiency from operations and assets, improving maintenance, and strengthening security, while reducing carbon emissions. Some manufacturers that have invested in developing AI capabilities, however, are still striving to achieve their objectives.

This study from MIT Technology Review Insights seeks to understand how manufacturers are generating benefits from AI use cases—particularly in engineering and design and in factory operations. The survey included 300 manufacturers that have begun working with AI. Most of these (64%) are currently researching or experimenting with AI, while some 35% have begun to put AI use cases into production. Many executives who responded to the survey indicate they intend to boost AI spending significantly during the next two years; those who have not yet put AI into production are moving more gradually. To facilitate use-case development and scaling, these manufacturers must address challenges with talent, skills, and data.

Following are the study’s key findings:

  • Talent, skills, and data are the main constraints on AI scaling. In both engineering and design and factory operations, manufacturers cite a deficit of talent and skills as their toughest challenge in scaling AI use cases. The closer use cases get to production, the harder this deficit bites. Many respondents say inadequate data quality and governance also hamper use-case development. Insufficient access to cloud-based compute power is another oft-cited constraint in engineering and design.
  • The biggest players do the most spending, and have the highest expectations. In engineering and design, 58% of executives expect their organizations to increase AI spending by more than 10% during the next two years. And 43% say the same when it comes to factory operations. The largest manufacturers are far more likely to make big increases in investment than those in smaller—but still large—size categories.
  • Desired AI gains are specific to manufacturing functions. The most common use cases deployed by manufacturers involve product design, conversational AI, and content creation. Knowledge management and quality control are the use cases most frequently cited at the pilot stage. In engineering and design, manufacturers chiefly seek AI gains in speed, efficiency, reduced failures, and security. In the factory, the gains sought above all are better innovation, improved safety, and a reduced carbon footprint.
  • Scaling can stall without the right data foundations. Respondents are clear that AI use-case development is hampered by inadequate data quality (57%), weak data integration (54%), and weak governance (47%). Only about one in five manufacturers surveyed have production assets with data ready for use in existing AI models, and that figure dwindles as manufacturers put use cases into production. The bigger the manufacturer, the greater the problem of unsuitable data.
  • Fragmentation must be addressed for AI to scale. Most manufacturers find some modernization of data architecture, infrastructure, and processes is needed to support AI, along with other technology and business priorities. A modernization strategy that improves interoperability of data systems between engineering and design and the factory, and between operational technology (OT) and information technology (IT), is a sound priority.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

Open-sourcing generative AI

The views expressed in this video are those of the speakers, and do not represent any endorsement or sponsorship.

Is the open-source approach, which has democratized access to software, ensured transparency, and improved security for decades, now poised to have a similar impact on AI? We dissect the balance between collaboration and control, legal ramifications, ethical considerations, and innovation barriers as the AI industry seeks to democratize the development of large language models.

Explore more from Booz Allen Hamilton on the future of AI


About the speakers

Alison Smith, Director of Generative AI, Booz Allen Hamilton

Alison Smith is a Director of Generative AI at Booz Allen Hamilton, where she helps clients address their missions with innovative solutions. Leading Booz Allen’s investments in Generative AI and grounding them in real business needs, Alison employs a pragmatic approach to designing, implementing, and deploying Generative AI that blends existing tools with additional customization. She is also responsible for disseminating best practices and key solutions throughout the firm to ensure that all teams are up to date on the latest available tools, solutions, and approaches to common client problems.

In addition to her role at Booz Allen, which balances technical solutions and business growth, Alison also enjoys staying connected to and serving her local community. From 2017 to 2021, Alison served on the board of a nonprofit, the DC Open Government Coalition (DCOGC), a group that seeks to enhance public access to government information and ensure transparent government operations; in November 2021, she was recognized as a Power Woman in Code by DCFemTech.

Alison has an MBA from The University of Chicago Booth School of Business and a BA from Middlebury College.

Tackling AI risks: Your reputation is at stake

Forget Skynet: One of the biggest risks of AI is your organization’s reputation. That means it’s time to put science-fiction catastrophizing to one side and begin thinking seriously about what AI actually means for us in our day-to-day work.

This isn’t to advocate for navel-gazing at the expense of the bigger picture: It’s to urge technologists and business leaders to recognize that if we’re to address the risks of AI as an industry—maybe even as a society—we need to closely consider its immediate implications and outcomes. If we fail to do that, taking action will be practically impossible.

Risk is all about context

Risk is all about context. In fact, one of the biggest risks is failing to acknowledge or understand your context: That’s why you need to begin there when evaluating risk.

This is particularly important in terms of reputation. Think, for instance, about your customers and their expectations. How might they feel about interacting with an AI chatbot? How damaging might it be to provide them with false or misleading information? Maybe minor customer inconvenience is something you can handle, but what if it has a significant health or financial impact?

Even if implementing AI seems to make sense, there are clearly some downstream reputation risks that need to be considered. We’ve spent years talking about the importance of user experience and being customer-focused: While AI might help us here, it could also undermine those things as well.

There’s a similar question to be asked about your teams. AI may have the capacity to drive efficiency and make people’s work easier, but used in the wrong way it could seriously disrupt existing ways of working. The industry has been talking a lot about developer experience recently—it’s something I wrote about for this publication—and the decisions organizations make about AI need to improve the experiences of teams, not undermine them.

In the latest edition of the Thoughtworks Technology Radar—a biannual snapshot of the software industry based on our experiences working with clients around the world—we talk about precisely this point. We call out AI team assistants as one of the most exciting emerging areas in software engineering, but we also note that the focus has to be on enabling teams, not individuals. “You should be looking for ways to create AI team assistants to help create the ‘10x team,’ as opposed to a bunch of siloed AI-assisted 10x engineers,” we say in the latest report.

Failing to heed the working context of your teams could cause significant reputational damage. Some bullish organizations might see this as part and parcel of innovation—it’s not. It’s showing potential employees—particularly highly technical ones—that you don’t really understand or care about the work they do.

Tackling risk through smarter technology implementation

There are lots of tools that can be used to help manage risk. Thoughtworks helped put together the Responsible Technology Playbook, a collection of tools and techniques that organizations can use to make more responsible decisions about technology (not just AI).

However, it’s important to note that managing risks—particularly those around reputation—requires real attention to the specifics of technology implementation. This was particularly clear in work we did with an assortment of Indian civil society organizations, developing a social welfare chatbot that citizens can interact with in their native languages. The risks here were not unlike those discussed earlier: The context in which the chatbot was being used (as support for accessing vital services) meant that inaccurate or “hallucinated” information could stop people from getting the resources they depend on.

This contextual awareness informed technology decisions. We implemented a version of something called retrieval-augmented generation to reduce the risk of hallucinations and improve the accuracy of the model the chatbot was running on.

Retrieval-augmented generation features on the latest edition of the Technology Radar. It might be viewed as part of a wave of emerging techniques and tools in this space that are helping developers tackle some of the risks of AI. These range from NeMo Guardrails—an open-source tool that puts limits on chatbots to increase accuracy—to the technique of running large language models (LLMs) locally with tools like Ollama, to ensure privacy and avoid sharing data with third parties. This wave also includes tools that aim to improve transparency in LLMs (which are notoriously opaque), such as Langfuse.
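Here is a minimal sketch of the retrieval-augmented generation pattern described above. TF-IDF similarity stands in for a production vector database, the sample documents are invented, and the final call to a language model is left as a placeholder; the point is simply that the chatbot answers from retrieved text rather than from memory alone, which is what reduces hallucinations.

```python
# Minimal retrieval-augmented generation (RAG) sketch:
# retrieve the most relevant passage, then ground the prompt in it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Applicants for the housing benefit must submit form 7B at a district office.",
    "The school meal program enrolls children automatically when parents register.",
    "Pension payments are issued on the first working day of every month.",
]
question = "How do I apply for the housing benefit?"

# Embed the corpus and the question, then rank documents by similarity.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([question])
scores = cosine_similarity(query_vector, doc_vectors)[0]
top_passage = documents[scores.argmax()]

# The retrieved passage is prepended to the prompt sent to the language model.
prompt = (
    "Answer using only the context below. If the answer is not there, say so.\n"
    f"Context: {top_passage}\n"
    f"Question: {question}"
)
print(prompt)  # in production, this prompt would be sent to the chatbot's LLM
```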

It’s worth pointing out, however, that it’s not just a question of what you implement, but also what you avoid doing. That’s why, in this Radar, we caution readers about the dangers of overenthusiastic LLM use and rushing to fine-tune LLMs.

Rethinking risk

A new wave of AI risk assessment frameworks aims to help organizations consider risk. There is also legislation (including the AI Act in Europe) that organizations must pay attention to. But addressing AI risk isn’t just a question of applying a framework or even following a static set of good practices. In a dynamic and changing environment, it’s about being open-minded and adaptive, paying close attention to the ways that technology choices shape human actions and social outcomes on both a micro and macro scale.

One useful framework is Dominique Shelton Leipzig’s traffic light framework. A red light signals something prohibited—such as discriminatory surveillance—while a green light signals low risk and a yellow light signals caution. I like the fact it’s so lightweight: For practitioners, too much legalese or documentation can make it hard to translate risk to action.

However, I also think it’s worth flipping the framework to see risks as embedded in contexts, not in the technologies themselves. That way, you’re not trying to make a solution adapt to a given situation; you’re responding to a situation and addressing it as it actually exists. If organizations take that approach to AI—and, indeed, to technology in general—they will meet the needs of their stakeholders and keep their reputations safe.

This content was produced by Thoughtworks. It was not written by MIT Technology Review’s editorial staff.