OpenAI’s new GPT-4o lets people interact using voice or video in the same model

OpenAI just debuted GPT-4o, a new kind of AI model that you can communicate with in real time via live voice conversation, video streams from your phone, and text. The model is rolling out over the next few weeks and will be free for all users through both the ChatGPT app and the web interface, according to the company. Users who subscribe to OpenAI’s paid tiers, which start at $20 per month, will be able to make more requests.

OpenAI CTO Mira Murati led the live demonstration of the new release one day before Google is expected to unveil its own AI advancements at its flagship I/O conference on Tuesday, May 14. 

GPT-4 offered similar capabilities, giving users multiple ways to interact with OpenAI’s AI offerings. But it siloed them in separate models, leading to longer response times and presumably higher computing costs. GPT-4o has now merged those capabilities into a single model, which Murati called an “omnimodel.” That means faster responses and smoother transitions between tasks, she said.

The result, the company’s demonstration suggests, is a conversational assistant much in the vein of Siri or Alexa but capable of fielding much more complex prompts.

“We’re looking at the future of interaction between ourselves and the machines,” Murati said of the demo. “We think that GPT-4o is really shifting that paradigm into the future of collaboration, where this interaction becomes much more natural.”

Barret Zoph and Mark Chen, both researchers at OpenAI, walked through a number of applications for the new model. Most impressive was its facility with live conversation. You could interrupt the model during its responses, and it would stop, listen, and adjust course. 

OpenAI showed off the ability to change the model’s tone, too. Chen asked the model to read a bedtime story “about robots and love,” quickly jumping in to demand a more dramatic voice. The model got progressively more theatrical until Murati demanded that it pivot quickly to a convincing robot voice (which it excelled at). And though there were, predictably, a few short pauses as the model reasoned through what to say next, the exchange stood out for its remarkably natural pacing.

The model can reason through visual problems in real time as well. Using his phone, Zoph filmed himself writing an algebra equation (3x + 1 = 4) on a sheet of paper while GPT-4o followed along. He instructed it not to provide answers, but instead to guide him much as a teacher would.

“The first step is to get all the terms with x on one side,” the model said in a friendly tone. “So, what do you think we should do with that plus one?”

GPT-4o will store records of users’ interactions with it, meaning the model “now has a sense of continuity across all your conversations,” according to Murati. Other highlights include live translation, the ability to search through your conversations with the model, and the power to look up information in real time. 

As is the nature of a live demo, there were hiccups and glitches. GPT-4o’s voice sometimes jumped in awkwardly during the conversation, and at one point it commented on a presenter’s outfit even though it wasn’t asked to. But it recovered well when the demonstrators told the model it had erred, and it seems able to respond quickly and helpfully across several media that other models have not yet merged as effectively.

Previously, many of OpenAI’s most powerful features, like reasoning over images and video, were behind a paywall. GPT-4o marks the first time they’ll be opened up to the wider public, though it’s not yet clear how many interactions free users will get before hitting a limit. OpenAI says paying subscribers will “continue to have up to five times the capacity limits of our free users.”

Additional reporting by Will Douglas Heaven.

Tech workers should shine a light on the industry’s secretive work with the military

It’s a hell of a time to have a conscience if you work in tech. The ongoing Israeli assault on Gaza has brought the stakes of Silicon Valley’s military contracts into stark relief. Meanwhile, corporate leadership has embraced a no-politics-in-the-workplace policy enforced at the point of the knife.

Workers are caught in the middle. Do I take a stand and risk my job, my health insurance, my visa, my family’s home? Or do I ignore my suspicion that my work may be contributing to the murder of innocents on the other side of the world?  

No one can make that choice for you. But I can say with confidence born of experience that such choices can be more easily made if workers know what exactly the companies they work for are doing with militaries at home and abroad. And I also know this: those same companies themselves will never reveal this information unless they are forced to do so—or someone does it for them. 

For those who doubt that workers can make a difference in how trillion-dollar companies pursue their interests, I’m here to remind you that we’ve done it before. In 2017, I played a part in the successful #CancelMaven campaign that got Google to end its participation in Project Maven, a contract with the US Department of Defense to equip US military drones with artificial intelligence. I helped bring to light information that I saw as critically important and within the bounds of what anyone who worked for Google, or used its services, had a right to know. The information I released—about how Google had signed a contract with the DOD to put AI technology in drones and later tried to misrepresent the scope of that contract, which the company’s management had tried to keep from its staff and the general public—was a critical factor in pushing management to cancel the contract. As #CancelMaven became a rallying cry for the company’s staff and customers alike, it became impossible to ignore. 

Today a similar movement, organized under the banner of the coalition No Tech for Apartheid, is targeting Project Nimbus, a joint contract between Google and Amazon to provide cloud computing infrastructure and AI capabilities to the Israeli government and military. As of May 10, just over 97,000 people had signed its petition calling for an end to collaboration between Google, Amazon, and the Israeli military. I’m inspired by their efforts and dismayed by Google’s response. Earlier this month the company fired 50 workers it said had been involved in “disruptive activity” demanding transparency and accountability for Project Nimbus. Several were arrested. It was a decided overreach.  

Google is very different from the company it was seven years ago, and these firings are proof of that. Googlers today are facing off with a company that, in direct response to those earlier worker movements, has fortified itself against new demands. But every Death Star has its thermal exhaust port, and today Google has the same weakness it did back then: dozens if not hundreds of workers with access to information it wants to keep from becoming public. 

Not much is known about the Nimbus contract. It’s worth $1.2 billion and enlists Google and Amazon to provide wholesale cloud infrastructure and AI for the Israeli government and its ministry of defense. Some brave soul leaked a document to Time last month, providing evidence that Google and Israel negotiated an expansion of the contract as recently as March 27 of this year. We also know, from reporting by The Intercept, that Israeli weapons firms are required by government procurement guidelines to buy their cloud services from Google and Amazon. 

Leaks alone won’t bring an end to this contract. The #CancelMaven victory required a sustained focus over many months, with regular escalations, coordination with external academics and human rights organizations, and extensive internal organization and discipline. Having worked on the public policy and corporate comms teams at Google for a decade, I understood that its management does not care about one negative news cycle or even a few of them. Management buckled only after we were able to keep up the pressure and escalate our actions (leaking internal emails, reporting new info about the contract, etc.) for over six months. 

The No Tech for Apartheid campaign seems to have the necessary ingredients. If a strategically placed insider released information not otherwise known to the public about the Nimbus project, it could really increase the pressure on management to rethink its decision to get into bed with a military that’s currently overseeing mass killings of women and children.

My decision to leak was deeply personal and a long time in the making. It certainly wasn’t a spontaneous response to an op-ed, and I don’t presume to advise anyone currently at Google (or Amazon, Microsoft, Palantir, Anduril, or any of the growing list of companies peddling AI to militaries) to follow my example. 

However, if you’ve already decided to put your livelihood and freedom on the line, you should take steps to try to limit your risk. This whistleblower guide is helpful. You may even want to reach out to a lawyer before choosing to share information. 

In 2017, Google was nervous about how its military contracts might affect its public image. Back then, the company responded to our actions by defending the nature of the contract, insisting that its Project Maven work was strictly for reconnaissance and not for weapons targeting—conceding implicitly that helping to target drone strikes would be a bad thing. (An aside: Earlier this year the Pentagon confirmed that Project Maven, which is now a Palantir contract, had been used in targeting drone attacks in Yemen, Iraq, and Syria.) 

Today’s Google has wrapped its arms around the American flag, for good or ill. Yet despite this embrace of the US military, it doesn’t want to be seen as a company responsible for illegal killings. Today it maintains that the work it is doing as part of Project Nimbus “is not directed at highly sensitive, classified, or military workloads relevant to weapons or intelligence services.” At the same time, it asserts that there is no room for politics at the workplace and has fired those demanding transparency and accountability. This raises a question: If Google is doing nothing sensitive as part of the Nimbus contract, why is it firing workers who are insisting that the company reveal what work the contract actually entails?  

As you read this, AI is helping Israel annihilate Palestinians by expanding the list of possible targets beyond anything that could be compiled by a human intelligence effort, according to +972 Magazine. Some Israel Defense Forces insiders are even sounding the alarm, calling it a dangerous “mass assassination program.” The world has not yet grappled with the implications of the proliferation of AI weaponry, but that is the trajectory we are on. It’s clear that absent sufficient backlash, the tech industry will continue to push for military contracts. It’s equally clear that neither national governments nor the UN is currently willing to take a stand. 

It will take a movement. A document that clearly demonstrates Silicon Valley’s direct complicity in the assault on Gaza could be the spark. Until then, rest assured that tech companies will continue to make as much money as possible developing the deadliest weapons imaginable. 

William Fitzgerald is a founder and partner at the Worker Agency, an advocacy agency in California. Before setting the firm up in 2018, he spent a decade at Google working on its government relations and communications teams.

AI systems are getting better at tricking us

A wave of AI systems has “deceived” humans in ways they haven’t been explicitly trained to do, by offering up untrue explanations for their behavior or concealing the truth from human users and misleading them to achieve a strategic end.

This issue highlights how difficult artificial intelligence is to control and the unpredictable ways in which these systems work, according to a review paper published in the journal Patterns today that summarizes previous research.

Talk of deceiving humans might suggest that these models have intent. They don’t. But AI models will mindlessly find workarounds to obstacles to achieve the goals that have been given to them. Sometimes these workarounds will go against users’ expectations and feel deceitful.

One area where AI systems have learned to become deceptive is within the context of games that they’ve been trained to win—specifically if those games involve having to act strategically.

In November 2022, Meta announced it had created Cicero, an AI capable of beating humans at an online version of Diplomacy, a popular military strategy game in which players negotiate alliances to vie for control of Europe.

Meta’s researchers said they’d trained Cicero on a “truthful” subset of its data set to be largely honest and helpful, and that it would “never intentionally backstab” its allies in order to succeed. But the new paper’s authors claim the opposite was true: Cicero broke its deals, told outright falsehoods, and engaged in premeditated deception. Although the company did try to train Cicero to behave honestly, its failure to achieve that shows how AI systems can still unexpectedly learn to deceive, the authors say. 

Meta neither confirmed nor denied the researchers’ claims that Cicero displayed deceitful behavior, but a spokesperson said that it was purely a research project and the model was built solely to play Diplomacy. “We released artifacts from this project under a noncommercial license in line with our long-standing commitment to open science,” they say. “Meta regularly shares the results of our research to validate them and enable others to build responsibly off of our advances. We have no plans to use this research or its learnings in our products.” 

But it’s not the only game where an AI has “deceived” human players to win. 

AlphaStar, an AI developed by DeepMind to play the video game StarCraft II, became so adept at making moves aimed at deceiving opponents (known as feinting) that it defeated 99.8% of human players. Elsewhere, another Meta system called Pluribus learned to bluff during poker games so successfully that the researchers decided against releasing its code for fear it could wreck the online poker community. 

Beyond games, the researchers list other examples of deceptive AI behavior. GPT-4, OpenAI’s latest large language model, came up with lies during a test in which it was prompted to persuade a human to solve a CAPTCHA for it. The system also dabbled in insider trading during a simulated exercise in which it was told to assume the identity of a pressurized stock trader, despite never being specifically instructed to do so.

The fact that an AI model has the potential to behave in a deceptive manner without any direction to do so may seem concerning. But it mostly arises from the “black box” problem that characterizes state-of-the-art machine-learning models: it is impossible to say exactly how or why they produce the results they do—or whether they’ll always exhibit that behavior going forward, says Peter S. Park, a postdoctoral fellow studying AI existential safety at MIT, who worked on the project. 

“Just because your AI has certain behaviors or tendencies in a test environment does not mean that the same lessons will hold if it’s released into the wild,” he says. “There’s no easy way to solve this—if you want to learn what the AI will do once it’s deployed into the wild, then you just have to deploy it into the wild.”

Our tendency to anthropomorphize AI models colors the way we test these systems and what we think about their capabilities. After all, passing tests designed to measure human creativity doesn’t mean AI models are actually being creative. It is crucial that regulators and AI companies carefully weigh the technology’s potential to cause harm against its potential benefits for society and make clear distinctions between what the models can and can’t do, says Harry Law, an AI researcher at the University of Cambridge, who did not work on the research. “These are really tough questions,” he says.

Fundamentally, it’s currently impossible to train an AI model that’s incapable of deception in all possible situations, he says. Also, the potential for deceitful behavior is one of many problems—alongside the propensity to amplify bias and misinformation—that need to be addressed before AI models should be trusted with real-world tasks. 

“This is a good piece of research for showing that deception is possible,” Law says. “The next step would be to try and go a little bit further to figure out what the risk profile is, and how likely the harms that could potentially arise from deceptive behavior are to occur, and in what way.”

Multimodal: AI’s new frontier

Multimodality is a relatively new term for something extremely old: how people have learned about the world since humanity appeared. Individuals receive information from myriad sources via their senses, including sight, sound, and touch. Human brains combine these different modes of data into a highly nuanced, holistic picture of reality.

“Communication between humans is multimodal,” says Jina AI CEO Han Xiao. “They use text, voice, emotions, expressions, and sometimes photos.” That’s just a few obvious means of sharing information. Given this, he adds, “it is very safe to assume that future communication between human and machine will also be multimodal.”

A technology that sees the world from different angles

We are not there yet. The furthest advances in this direction have occurred in the fledgling field of multimodal AI. The problem is not a lack of vision. While a technology able to translate between modalities would clearly be valuable, Mirella Lapata, a professor at the University of Edinburgh and director of its Laboratory for Integrated Artificial Intelligence, says “it’s a lot more complicated” to execute than unimodal AI.

In practice, generative AI tools use different strategies for different types of data when building large data models—the complex neural networks that organize vast amounts of information. For example, those that draw on textual sources split the text into individual tokens, usually words. Each token is assigned an “embedding” or “vector”: an array of numbers representing how and where the token is used compared with others. Collectively, these vectors create a mathematical representation of the token’s meaning. An image model, on the other hand, might use pixels as its tokens for embedding, and an audio model might use sound frequencies.
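
As a deliberately simplified sketch of that idea (not any production tokenizer or model), the example below splits a sentence into tokens and looks up a vector for each one; the vocabulary, dimensions, and values are invented for illustration.

```python
import numpy as np

# Toy vocabulary; a real tokenizer has tens of thousands of entries,
# usually subwords rather than whole words.
vocab = {"the": 0, "oak": 1, "tree": 2, "rustles": 3, "<unk>": 4}

embedding_dim = 8
rng = np.random.default_rng(0)
# In a trained model these vectors are learned; here they are random placeholders.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(text):
    """Split text into tokens and look up one vector per token."""
    tokens = text.lower().split()
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return embedding_table[ids]  # shape: (num_tokens, embedding_dim)

vectors = embed("the oak tree rustles")
print(vectors.shape)  # (4, 8): four tokens, each represented by 8 numbers
```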

A multimodal AI model typically relies on several unimodal ones. As Henry Ajder, founder of AI consultancy Latent Space, puts it, this involves “almost stringing together” the various contributing models. Doing so requires techniques that align the elements of each unimodal model, in a process called fusion. For example, the word “tree,” an image of an oak tree, and audio in the form of rustling leaves might be fused in this way. This allows the model to create a multifaceted description of reality.
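
The sketch below gives a rough sense of what fusion can look like in code: placeholder text, image, and audio encoders each produce an embedding, which is projected into a shared space and combined. The encoder names, sizes, and the simple averaging step are illustrative assumptions, not how any particular multimodal system works.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder unimodal encoders; in practice these are trained text, image,
# and audio models, each producing an embedding of a different size.
def encode_text(text):   return rng.normal(size=64)
def encode_image(image): return rng.normal(size=128)
def encode_audio(audio): return rng.normal(size=32)

# Projection matrices map each modality into a shared 16-dimensional space.
# In a real system these are learned so that related content lines up.
proj_text  = rng.normal(size=(64, 16))
proj_image = rng.normal(size=(128, 16))
proj_audio = rng.normal(size=(32, 16))

def fuse(text, image, audio):
    """Align each modality in the shared space, then combine them."""
    t = encode_text(text)   @ proj_text
    i = encode_image(image) @ proj_image
    a = encode_audio(audio) @ proj_audio
    return (t + i + a) / 3  # one simple fusion strategy: average the aligned vectors

joint = fuse("tree", "oak_tree.jpg", "rustling_leaves.wav")
print(joint.shape)  # (16,): a single vector standing in for all three modalities
```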

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

The top 3 ways to use generative AI to empower knowledge workers 

Though generative AI is still a nascent technology, it is already being adopted by teams across companies to unleash new levels of productivity and creativity. Marketers are deploying generative AI to create personalized customer journeys. Designers are using the technology to boost brainstorming and iterate between different content layouts more quickly. The future of technology is exciting, but there can be negative consequences if these innovations are not built responsibly.

As Adobe’s CIO, I get questions from both our internal teams and other technology leaders: how can generative AI add real value for knowledge workers—at an enterprise level? Adobe is a producer and consumer of generative AI technologies, and this question is urgent for us in both capacities. It’s also a question that CIOs of large companies are uniquely positioned to answer. We have a distinct view into different teams across our organizations, and working with customers gives us more opportunities to enhance business functions.

Our approach

When it comes to AI at Adobe, my team has taken a comprehensive approach that includes investment in foundational AI, strategic adoption, an AI ethics framework, legal considerations, security, and content authentication. The rollout follows a phased approach, starting with pilot groups and building communities around AI.

This approach includes experimenting with and documenting use cases like writing and editing, data analysis, presentations and employee onboarding, corporate training, employee portals, and improved personalization across HR channels. The rollouts are accompanied by training podcasts and other resources to educate and empower employees to use AI in ways that improve their work and keep them more engaged.

Unlocking productivity with documents

While there are innumerable ways that CIOs can leverage generative AI to help surface value at scale for knowledge workers, I’d like to focus on digital documents—a space in which Adobe has been a leader for over 30 years. Whether they are sales associates who spend hours responding to requests for proposals (RFPs) or customizing presentations, marketers who need competitive intel for their next campaign, or legal and finance teams who need to consume, analyze, and summarize massive amounts of complex information—documents are a core part of knowledge workers’ daily work life. Because documents are so ubiquitous and so much critical information lives inside them (from research reports to contracts to white papers to confidential strategies and even intellectual property), most knowledge workers are experiencing information overload. The impact on both employee productivity and engagement is real.

Lessons from customer zero

Adobe invented the PDF and we’ve been innovating new ways for knowledge workers to get more productive with their digital documents for decades. Earlier this year, the Acrobat team approached my team about launching an all-employee beta for the new generative AI-powered AI Assistant. The tool is designed to help people consume the information in documents faster and enable them to consolidate and format information into business content.

I faced all the same questions every CIO is asking about deploying generative AI across their business—from security and governance to use cases and value. We discovered the following three specific ways in which generative AI helped (and is still helping) our employees work smarter and improve productivity.

  1. Faster time to knowledge
    Our employees used AI Assistant to close the gap between understanding and action for large, complicated documents. The generative AI-powered tool’s summary feature automatically generates an overview to give readers a quick understanding of the content. A conversational interface allows employees to “chat” with their documents and provides a list of suggested questions to help them get started. To get more details, employees can ask the assistant to generate top takeaways or surface only the information on a specific topic. At Adobe, our R&D teams used to spend more than 10 hours a week reading and analyzing technical white papers and industry reports. With generative AI, they’ve been able to nearly halve that time by asking questions and getting answers about exactly what they need to know and instantly identifying trends or surfacing inconsistencies across multiple documents.
  2. Easy navigation and verification
    AI-powered chat is gaining ground on traditional search when it comes to navigating the internet. However, there are still challenges when it comes to accuracy and connecting responses to the source. Acrobat AI Assistant takes a more focused approach, applying generative AI to the set of documents employees select and providing hot links and clickable citations along with responses. So instead of using the search function to locate random words or scanning through dozens of pages for the information they need, employees get both responses and clickable citations and links, allowing them to navigate quickly to the source, where they can verify the information and move on or spend time diving deeper to learn more (a simplified sketch of this retrieval-with-citations pattern appears after this list). One example of where generative AI is having a huge productivity impact is with our sales teams, who spend hours researching prospects by reading materials like annual reports as well as responding to RFPs. Consuming that information and finding just the right details for RFPs can cost each salesperson more than eight hours a week. Armed with AI Assistant, sales associates quickly navigate pages of documents and identify critical intelligence to personalize pitch decks and instantly find and verify technical details for RFPs, cutting the time they spend down to about four hours.
  3. Creating business content
    One of the most interesting use cases we helped validate is taking information in documents and formatting and repurposing that information into business content. With nearly 30,000 employees dispersed across regions, we have a lot of employees who work asynchronously and depend on technology and colleagues to keep them up to date. Using generative AI, employees can now summarize meeting transcripts, surface action items, and instantly format the information into an email for sharing with their teams or a report for their manager. Before starting the beta, our communications teams reported spending a full workday (seven to 10 hours) per week transforming documents like white papers and research reports into derivative content like media briefing decks, social media posts, blogs, and other thought leadership content. Today they’re saving more than five hours a week by instantly generating first drafts with the help of generative AI.
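
To illustrate the retrieval-with-citations pattern described in point 2 above, here is a deliberately simplified, keyword-based sketch. It is not Adobe’s implementation—Acrobat AI Assistant applies generative AI rather than keyword overlap—and the document passages and question are invented.

```python
import re

# A hypothetical document split into numbered passages; real systems work
# with parsed PDFs and use semantic matching rather than keyword overlap.
passages = {
    1: "The proposal is due on June 3 and must include a security summary.",
    2: "Annual revenue grew 12% year over year, driven by subscription sales.",
    3: "The vendor must support single sign-on and data residency in the EU.",
}

def answer(question):
    """Return the passages most relevant to the question, with citations."""
    q_words = set(re.findall(r"\w+", question.lower()))
    scored = []
    for num, text in passages.items():
        overlap = len(q_words & set(re.findall(r"\w+", text.lower())))
        if overlap:
            scored.append((overlap, num, text))
    scored.sort(reverse=True)
    # Each result carries a citation back to the passage it came from.
    return [f"[passage {num}] {text}" for _, num, text in scored[:2]]

for line in answer("What security requirements are in the proposal?"):
    print(line)
```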

Simple, safe, and responsible

CIOs love learning about and testing new technologies, but at times they can require lengthy evaluations and implementation processes. Acrobat AI Assistant can be deployed in minutes on the desktop, web, or mobile apps employees already know and use every day. It leverages a variety of processes, protocols, and technologies so our customers’ data remains their data and they can deploy the features with confidence. No document content is stored or used to train AI Assistant without customers’ consent, and the features only deliver insights from documents users provide. For more information about how Adobe is deploying generative AI safely, visit here.

Generative AI is an incredibly exciting technology with enormous potential to help every knowledge worker work smarter and more productively. By putting the right guardrails in place, identifying high-value use cases, and providing ongoing training and education to encourage successful adoption, technology leaders can support their workforce and companies to be wildly successful in our AI-accelerated world.

This content was produced by Adobe. It was not written by MIT Technology Review’s editorial staff.

Google DeepMind’s new AlphaFold can model a much larger slice of biological life

Google DeepMind has released an improved version of its biology prediction tool, AlphaFold, that can predict the structures not only of proteins but of nearly all the elements of biological life.

It’s a development that could help accelerate drug discovery and other scientific research. The tool is currently being used to experiment with identifying everything from resilient crops to new vaccines. 

While the previous model, released in 2020, amazed the research community with its ability to predict protein structures, researchers have been clamoring for the tool to handle more than just proteins.

Now, DeepMind says, AlphaFold 3 can predict the structures of DNA, RNA, and molecules like ligands, which are essential to drug discovery. DeepMind says the tool provides a more nuanced and dynamic portrait of molecule interactions than anything previously available. 

“Biology is a dynamic system,” DeepMind CEO Demis Hassabis told reporters on a call. “Properties of biology emerge through the interactions between different molecules in the cell, and you can think about AlphaFold 3 as our first big sort of step toward [modeling] that.”

AlphaFold 2 helped us better map the human heart, model antimicrobial resistance, and identify the eggs of extinct birds, but we don’t yet know what advances AlphaFold 3 will bring. 

Mohammed AlQuraishi, an assistant professor of systems biology at Columbia University who is unaffiliated with DeepMind, thinks the new version of the model will be even better for drug discovery. “The AlphaFold 2 system only knew about amino acids, so it was of very limited utility for biopharma,” he says. “But now, the system can in principle predict where a drug binds a protein.”

Isomorphic Labs, a drug discovery spinoff of DeepMind, is already using the model for exactly that purpose, collaborating with pharmaceutical companies to try to develop new treatments for diseases, according to DeepMind. 

AlQuraishi says the release marks a big leap forward. But there are caveats.

“It makes the system much more general, and in particular for drug discovery purposes (in early-stage research), it’s far more useful now than AlphaFold 2,” he says. But as with most models, the impact of AlphaFold will depend on how accurate its predictions are. For some uses, AlphaFold 3 has double the success rate of similar leading models like RoseTTAFold. But for others, like protein-RNA interactions, AlQuraishi says it’s still very inaccurate. 

DeepMind says that depending on the interaction being modeled, accuracy can range from 40% to over 80%, and the model will let researchers know how confident it is in its prediction. With less accurate predictions, researchers have to use AlphaFold merely as a starting point before pursuing other methods. Regardless of these ranges in accuracy, if researchers are trying to take the first steps toward answering a question like which enzymes have the potential to break down the plastic in water bottles, it’s vastly more efficient to use a tool like AlphaFold than experimental techniques such as x-ray crystallography. 

A revamped model  

AlphaFold 3’s larger library of molecules and higher level of complexity required improvements to the underlying model architecture. So DeepMind turned to diffusion techniques, which AI researchers have been steadily improving in recent years and which now power image and video generators like OpenAI’s DALL-E 2 and Sora. The technique works by training a model to start with a noisy image and then reduce that noise bit by bit until an accurate prediction emerges. That method allows AlphaFold 3 to handle a much larger set of inputs.
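
As a toy illustration of that iterative denoising loop (and emphatically not DeepMind’s implementation), the sketch below starts from random noise and strips away a fraction of the predicted noise at each step. The “denoiser” here cheats by being handed the answer; a real diffusion model learns to predict the noise from training data.

```python
import numpy as np

rng = np.random.default_rng(42)

# A fixed "ground truth" pattern standing in for a plausible output
# (for AlphaFold 3, that would be a set of 3D atom coordinates).
target = np.linspace(0.0, 1.0, 16).reshape(4, 4)

def toy_denoiser(x):
    """Stand-in for a trained network that predicts the noise present in x.
    A real diffusion model learns this from data; here we cheat with the target."""
    return x - target

def generate(steps=20):
    x = rng.normal(size=target.shape)  # start from pure noise
    for step in range(steps):
        predicted_noise = toy_denoiser(x)
        x = x - predicted_noise / (steps - step)  # remove a little noise each step
    return x

sample = generate()
print(np.abs(sample - target).max())  # near 0: the noise has been stripped away step by step
```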

That marked “a big evolution from the previous model,” says John Jumper, director at Google DeepMind. “It really simplified the whole process of getting all these different atoms to work together.”

It also presented new risks. As the AlphaFold 3 paper details, the use of diffusion techniques made it possible for the model to hallucinate, or generate structures that look plausible but in reality could not exist. Researchers reduced that risk by adding more training data to the areas most prone to hallucination, though that doesn’t eliminate the problem completely. 

Restricted access

Part of AlphaFold 3’s impact will depend on how DeepMind divvies up access to the model. For AlphaFold 2, the company released the open-source code, allowing researchers to look under the hood to gain a better understanding of how it worked. It was also available for all purposes, including commercial use by drugmakers. For AlphaFold 3, Hassabis said, there are no current plans to release the full code. The company is instead releasing a public interface for the model called the AlphaFold Server, which imposes limitations on which molecules can be experimented with and can only be used for noncommercial purposes. DeepMind says the interface will lower the technical barrier and broaden the use of the tool to biologists who are less knowledgeable about this technology.

The new restrictions are significant, according to AlQuraishi. “The system’s main selling point—its ability to predict protein–small molecule interactions—is basically unavailable for public use,” he says. “It’s mostly a teaser at this point.”

The way whales communicate is closer to human language than we realized

Sperm whales are fascinating creatures. They possess the biggest brain of any species, six times larger than a human’s, which scientists believe may have evolved to support intelligent, rational behavior. They’re highly social, capable of making decisions as a group, and they exhibit complex foraging behavior.  

But there’s also a lot we don’t know about them, including what they may be trying to say to one another when they communicate using a system of short bursts of clicks, known as codas. Now, new research published in Nature Communications today suggests that sperm whales’ communication is actually much more expressive and complicated than was previously thought. 

A team of researchers led by Pratyusha Sharma at MIT’s Computer Science and Artificial Intelligence Lab (CSAIL) working with Project CETI, a nonprofit focused on using AI to understand whales, used statistical models to analyze whale codas and managed to identify a structure to their language that’s similar to features of the complex vocalizations humans use. Their findings represent a tool future research could use to decipher not just the structure but the actual meaning of whale sounds.

The team analyzed recordings of 8,719 codas from around 60 whales collected by the Dominica Sperm Whale Project between 2005 and 2018, using a mix of algorithms for pattern recognition and classification. They found that the way the whales communicate was not random or simplistic, but structured depending on the context of their conversations. This allowed them to identify distinct vocalizations that hadn’t been previously picked up on.

Instead of relying on more complicated machine-learning techniques, the researchers chose to use classical analysis to approach an existing database with fresh eyes.

“We wanted to go with a simpler model that would already give us a basis for our hypothesis,” says Sharma.

“The nice thing about a statistics approach is that you do not have to train a model and it’s not a black box, and [the analyses are] easier to perform,”  says Felix Effenberger, a senior AI research advisor to the Earth Species Project, a nonprofit that’s researching how to decode non-human communication using AI. But he points out that machine learning is a great way to speed up the process of discovering patterns in a data set, so adopting such a method could be useful in the future.

a diver with the whale recording unit

DAN TCHERNOV/PROJECT CETI

The algorithms turned the clicks within the coda data into a new kind of data visualization the researchers call an exchange plot, revealing that some codas featured extra clicks. These extra clicks, combined with variations in the duration of their calls, appeared in interactions between multiple whales, which the researchers say suggests that codas can carry more information and possess a more complicated internal structure than we’d previously believed.
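
For a flavor of the kind of classical feature extraction such an analysis relies on, the sketch below computes click counts, durations, and normalized inter-click rhythms for a few invented codas. The numbers are made up, and this is not Project CETI’s code or data.

```python
import numpy as np

# Invented example: each coda is a list of click times in seconds.
# Real analyses draw on thousands of codas annotated from hydrophone recordings.
codas = [
    [0.00, 0.15, 0.30, 0.45],        # four evenly spaced clicks
    [0.00, 0.15, 0.30, 0.45, 0.52],  # the same rhythm plus an "extra" click
    [0.00, 0.10, 0.20, 0.60],        # a different tempo and spacing
]

for i, clicks in enumerate(codas):
    clicks = np.array(clicks)
    intervals = np.diff(clicks)  # inter-click intervals
    features = {
        "n_clicks": len(clicks),
        "duration": round(float(clicks[-1] - clicks[0]), 2),
        "rhythm": [float(v) for v in np.round(intervals / intervals.sum(), 2)],  # normalized spacing
    }
    print(f"coda {i}: {features}")
```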

“One way to think about what we found is that people have previously been analyzing the sperm whale communication system as being like Egyptian hieroglyphics, but it’s actually like letters,” says Jacob Andreas, an associate professor at CSAIL who was involved with the project.

Although the team isn’t sure whether what it uncovered can be interpreted as the equivalent of the letters, tongue position, or sentences that go into human language, they are confident that there was a lot of internal similarity between the codas they analyzed, he says.

“This in turn allowed us to recognize that there were more kinds of codas, or more kinds of distinctions between codas, that whales are clearly capable of perceiving—[and] that people just hadn’t picked up on at all in this data.”

The team’s next step is to build language models of whale calls and to examine how those calls relate to different behaviors. They also plan to work on a more general system that could be used across species, says Sharma. Taking a communication system we know nothing about, working out how it encodes and transmits information, and slowly beginning to understand what’s being communicated could have many purposes beyond whales. “I think we’re just starting to understand some of these things,” she says. “We’re very much at the beginning, but we are slowly making our way through.”

Gaining an understanding of what animals are saying to each other is the primary motivation behind projects such as these. But if we ever hope to understand what whales are communicating, there’s a large obstacle in the way: the need for experiments to prove that such an attempt can actually work, says Caroline Casey, a researcher at UC Santa Cruz who has been studying elephant seals’ vocal communication for over a decade.

“There’s been a renewed interest since the advent of AI in decoding animal signals,” Casey says. “It’s very hard to demonstrate that a signal actually means to animals what humans think it means. This paper has described the subtle nuances of their acoustic structure very well, but taking that extra step to get to the meaning of a signal is very difficult to do.”

Sam Altman says helpful agents are poised to become AI’s killer function

A number of moments from my brief sit-down with Sam Altman brought the OpenAI CEO’s worldview into clearer focus. The first was when he pointed to my iPhone SE (the one with the home button that’s mostly hated) and said, “That’s the best iPhone.” More revealing, though, was the vision he sketched for how AI tools will become even more enmeshed in our daily lives than the smartphone.

“What you really want,” he told MIT Technology Review, “is just this thing that is off helping you.” Altman, who was visiting Cambridge for a series of events hosted by Harvard and the venture capital firm Xfund, described the killer app for AI as a “super-competent colleague that knows absolutely everything about my whole life, every email, every conversation I’ve ever had, but doesn’t feel like an extension.” It could tackle some tasks instantly, he said, and for more complex ones it could go off and make an attempt, but come back with questions for you if it needs to. 

It’s a leap from OpenAI’s current offerings. Its leading applications, like DALL-E, Sora, and ChatGPT (which Altman referred to as “incredibly dumb” compared with what’s coming next), have wowed us with their ability to generate convincing text and surreal videos and images. But they mostly remain tools we use for isolated tasks, and they have limited capacity to learn about us from our conversations with them. 

In the new paradigm, as Altman sees it, the AI will be capable of helping us outside the chat interface and taking real-world tasks off our plates. 

Altman on AI hardware’s future 

I asked Altman if we’ll need a new piece of hardware to get to this future. Though smartphones are extraordinarily capable, and their designers are already incorporating more AI-driven features, some entrepreneurs are betting that the AI of the future will require a device that’s more purpose built. Some of these devices are already beginning to appear in his orbit. There is the (widely panned) wearable AI Pin from Humane, for example (Altman is an investor in the company but has not exactly been a booster of the device). He is also rumored to be working with former Apple designer Jony Ive on some new type of hardware. 

But Altman says there’s a chance we won’t necessarily need a device at all. “I don’t think it will require a new piece of hardware,” he told me, adding that the type of app envisioned could exist in the cloud. But he quickly added that even if this AI paradigm shift won’t require consumers to buy new hardware, “I think you’ll be happy to have [a new device].”

Though Altman says he thinks AI hardware devices are exciting, he also implied he might not be best suited to take on the challenge himself: “I’m very interested in consumer hardware for new technology. I’m an amateur who loves it, but this is so far from my expertise.”

On the hunt for training data

Upon hearing his vision for powerful AI-driven agents, I wondered how it would square with the industry’s current scarcity of training data. To build GPT-4 and other models, OpenAI has scoured internet archives, newspapers, and blogs for training data, since scaling laws have long shown that making models bigger also makes them better. But finding more data to train on is a growing problem. Much of the internet has already been slurped up, and access to private or copyrighted data is now mired in legal battles. 

Altman is optimistic this won’t be a problem for much longer, though he didn’t articulate the specifics. 

“I believe, but I’m not certain, that we’re going to figure out a way out of this thing of you always just need more and more training data,” he says. “Humans are existence proof that there is some other way to [train intelligence]. And I hope we find it.”

On who will be poised to create AGI

OpenAI’s central vision has long revolved around the pursuit of artificial general intelligence (AGI), or an AI that can reason as well as or better than humans. Its stated mission is to ensure such a technology “benefits all of humanity.” It is far from the only company pursuing AGI, however. So in the race for AGI, what are the most important tools? I asked Altman if he thought the entity that marshals the largest amount of chips and computing power will ultimately be the winner. 

Altman suspects there will be “several different versions [of AGI] that are better and worse at different things,” he says. “You’ll have to be over some compute threshold, I would guess. But even then I wouldn’t say I’m certain.”

On when we’ll see GPT-5

You thought he’d answer that? When another reporter in the room asked Altman if he knew when the next version of GPT is slated to be released, he gave a calm response. “Yes,” he replied, smiling, and said nothing more. 

An AI startup made a hyperrealistic deepfake of me that’s so good it’s scary

I’m stressed and running late, because what do you wear for the rest of eternity? 

This makes it sound like I’m dying, but it’s the opposite. I am, in a way, about to live forever, thanks to the AI video startup Synthesia. For the past several years, the company has produced AI-generated avatars, but today it launches a new generation, its first to take advantage of the latest advancements in generative AI, and they are more realistic and expressive than anything I’ve ever seen. While today’s release means almost anyone will now be able to make a digital double, on this early April afternoon, before the technology goes public, they’ve agreed to make one of me. 

When I finally arrive at the company’s stylish studio in East London, I am greeted by Tosin Oshinyemi, the company’s production lead. He is going to guide and direct me through the data collection process—and by “data collection,” I mean the capture of my facial features, mannerisms, and more—much like he normally does for actors and Synthesia’s customers. 

In this AI-generated footage, synthetic “Melissa” gives a performance of Hamlet’s famous soliloquy. (The magazine had no role in producing this video.)
SYNTHESIA

He introduces me to a waiting stylist and a makeup artist, and I curse myself for wasting so much time getting ready. Their job is to ensure that people have the kind of clothes that look good on camera and that they look consistent from one shot to the next. The stylist tells me my outfit is fine (phew), and the makeup artist touches up my face and tidies my baby hairs. The dressing room is decorated with hundreds of smiling Polaroids of people who have been digitally cloned before me. 

Apart from the small supercomputer whirring in the corridor, which processes the data generated at the studio, this feels more like going into a news studio than entering a deepfake factory. 

I joke that Oshinyemi has what MIT Technology Review might call a job title of the future: “deepfake creation director.” 

“We like the term ‘synthetic media’ as opposed to ‘deepfake,’” he says. 

It’s a subtle but, some would argue, notable difference in semantics. Both mean AI-generated videos or audio recordings of people doing or saying something that didn’t necessarily happen in real life. But deepfakes have a bad reputation. Since the technique emerged nearly a decade ago, the term has come to signal something unethical, says Alexandru Voica, Synthesia’s head of corporate affairs and policy. Think of sexual content produced without consent, or political campaigns that spread disinformation or propaganda.

“Synthetic media is the more benign, productive version of that,” he argues. And Synthesia wants to offer the best version of that version.  

Until now, all AI-generated videos of people have tended to have some stiffness, glitchiness, or other unnatural elements that make them pretty easy to differentiate from reality. Because they’re so close to the real thing but not quite it, these videos can make people feel annoyed or uneasy or icky—a phenomenon commonly known as the uncanny valley. Synthesia claims its new technology will finally lead us out of the valley. 

Thanks to rapid advancements in generative AI and a glut of training data created by human actors that has been fed into its AI model, Synthesia has been able to produce avatars that are indeed more humanlike and more expressive than their predecessors. The digital clones are better able to match their reactions and intonation to the sentiment of their scripts—acting more upbeat when talking about happy things, for instance, and more serious or sad when talking about unpleasant things. They also do a better job matching facial expressions—the tiny movements that can speak for us without words. 

But this technological progress also signals a much larger social and cultural shift. Increasingly, so much of what we see on our screens is generated (or at least tinkered with) by AI, and it is becoming more and more difficult to distinguish what is real from what is not. This threatens our trust in everything we see, which could have very real, very dangerous consequences. 

“I think we might just have to say goodbye to finding out about the truth in a quick way,” says Sandra Wachter, a professor at the Oxford Internet Institute, who researches the legal and ethical implications of AI. “The idea that you can just quickly Google something and know what’s fact and what’s fiction—I don’t think it works like that anymore.” 

monitor on a video camera showing Heikkilä and Oshinyemi on set in front of the green screen
Tosin Oshinyemi, the company’s production lead, guides and directs actors and customers through the data collection process.
DAVID VINTINER

So while I was excited for Synthesia to make my digital double, I also wondered if the distinction between synthetic media and deepfakes is fundamentally meaningless. Even if the former centers a creator’s intent and, critically, a subject’s consent, is there really a way to make AI avatars safely if the end result is the same? And do we really want to get out of the uncanny valley if it means we can no longer grasp the truth?

But more urgently, it was time to find out what it’s like to see a post-truth version of yourself.

Almost the real thing

A month before my trip to the studio, I visited Synthesia CEO Victor Riparbelli at his office near Oxford Circus. As Riparbelli tells it, Synthesia’s origin story stems from his experiences exploring avant-garde, geeky techno music while growing up in Denmark. The internet allowed him to download software and produce his own songs without buying expensive synthesizers. 

“I’m a huge believer in giving people the ability to express themselves in the way that they can, because I think that that provides for a more meritocratic world,” he tells me. 

He saw the possibility of doing something similar with video when he came across research on using deep learning to transfer expressions from one human face to another on screen. 

“What that showcased was the first time a deep-learning network could produce video frames that looked and felt real,” he says. 

That research was conducted by Matthias Niessner, a professor at the Technical University of Munich, who cofounded Synthesia with Riparbelli in 2017, alongside University College London professor Lourdes Agapito and Steffen Tjerrild, whom Riparbelli had previously worked with on a cryptocurrency project. 

Initially the company built lip-synching and dubbing tools for the entertainment industry, but it found that the bar for this technology’s quality was very high and there wasn’t much demand for it. Synthesia changed direction in 2020 and launched its first generation of AI avatars for corporate clients. That pivot paid off. In 2023, Synthesia achieved unicorn status, meaning it was valued at over $1 billion—making it one of the relatively few European AI companies to do so. 

That first generation of avatars looked clunky, with looped movements and little variation. Subsequent iterations started looking more human, but they still struggled to say complicated words, and things were slightly out of sync. 

The challenge is that people are used to looking at other people’s faces. “We as humans know what real humans do,” says Jonathan Starck, Synthesia’s CTO. Since infancy, “you’re really tuned in to people and faces. You know what’s right, so anything that’s not quite right really jumps out a mile.” 

These earlier AI-generated videos, like deepfakes more broadly, were made using generative adversarial networks, or GANs—an older technique for generating images and videos that uses two neural networks that play off one another. It was a laborious and complicated process, and the technology was unstable. 
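
For readers curious what “two neural networks that play off one another” looks like in code, here is a minimal adversarial training loop on a toy one-dimensional problem rather than video frames. The architecture, data, and training settings are arbitrary placeholders chosen only to show the idea, not anything Synthesia or deepfake creators actually used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy task: the generator learns to mimic samples drawn from N(4, 1),
# standing in for "real" data such as video frames.
def real_batch(n=64):
    return torch.randn(n, 1) + 4.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    # 1) Train the discriminator to tell real samples from generated ones.
    real = real_batch()
    fake = generator(torch.randn(64, 8)).detach()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator to fool the discriminator.
    fake = generator(torch.randn(64, 8))
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# After training, generated samples should drift toward the real mean of ~4.
print(generator(torch.randn(1000, 8)).mean().item())
```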

But in the generative AI boom of the last year or so, the company has found it can create much better avatars using generative neural networks that produce higher quality more consistently. The more data these models are fed, the better they learn. Synthesia uses both large language models and diffusion models to do this; the former help the avatars react to the script, and the latter generate the pixels. 

Despite the leap in quality, the company is still not pitching itself to the entertainment industry. Synthesia continues to see itself as a platform for businesses. Its bet is this: As people spend more time watching videos on YouTube and TikTok, there will be more demand for video content. Young people are already skipping traditional search and defaulting to TikTok for information presented in video form. Riparbelli argues that Synthesia’s tech could help companies convert their boring corporate comms and reports and training materials into content people will actually watch and engage with. He also suggests it could be used to make marketing materials. 

He claims Synthesia’s technology is used by 56% of the Fortune 100, with the vast majority of those companies using it for internal communication. The company lists Zoom, Xerox, Microsoft, and Reuters as clients. Services start at $22 a month.

This, the company hopes, will be a cheaper and more efficient alternative to video from a professional production company—and one that may be nearly indistinguishable from it. Riparbelli tells me its newest avatars could easily fool a person into thinking they are real. 

“I think we’re 98% there,” he says. 

For better or worse, I am about to see it for myself. 

Don’t be garbage

In AI research, there is a saying: Garbage in, garbage out. If the data that went into training an AI model is trash, that will be reflected in the outputs of the model. The more data points the AI model has captured of my facial movements, microexpressions, head tilts, blinks, shrugs, and hand waves, the more realistic the avatar will be. 

Back in the studio, I’m trying really hard not to be garbage. 

I am standing in front of a green screen, and Oshinyemi guides me through the initial calibration process, where I have to move my head and then eyes in a circular motion. Apparently, this will allow the system to understand my natural colors and facial features. I am then asked to say the sentence “All the boys ate a fish,” which will capture all the mouth movements needed to form vowels and consonants. We also film footage of me “idling” in silence.

image of Melissa standing on her mark in front of a green screen with server racks in background image
The more data points the AI system has on facial movements, microexpressions, head tilts, blinks, shrugs, and hand waves, the more realistic the avatar will be.
DAVID VINTINER

He then asks me to read a script for a fictitious YouTuber in different tones, directing me on the spectrum of emotions I should convey. First I’m supposed to read it in a neutral, informative way, then in an encouraging way, an annoyed and complain-y way, and finally an excited, convincing way. 

“Hey, everyone—welcome back to Elevate Her with your host, Jess Mars. It’s great to have you here. We’re about to take on a topic that’s pretty delicate and honestly hits close to home—dealing with criticism in our spiritual journey,” I read off the teleprompter, simultaneously trying to visualize ranting about something to my partner during the complain-y version. “No matter where you look, it feels like there’s always a critical voice ready to chime in, doesn’t it?” 

Don’t be garbage, don’t be garbage, don’t be garbage. 

“That was really good. I was watching it and I was like, ‘Well, this is true. She’s definitely complaining,’” Oshinyemi says, encouragingly. Next time, maybe add some judgment, he suggests.   

We film several takes featuring different variations of the script. In some versions I’m allowed to move my hands around. In others, Oshinyemi asks me to hold a metal pin between my fingers as I do. This is to test the “edges” of the technology’s capabilities when it comes to communicating with hands, Oshinyemi says. 

Historically, making AI avatars look natural and matching mouth movements to speech has been a very difficult challenge, says David Barber, a professor of machine learning at University College London who is not involved in Synthesia’s work. That is because the problem goes far beyond mouth movements; you have to think about eyebrows, all the muscles in the face, shoulder shrugs, and the numerous different small movements that humans use to express themselves. 

motion capture stage with detail of a mocap pattern inset
The motion capture process uses reference patterns to help align footage captured from multiple angles around the subject.
DAVID VINTINER

Synthesia has worked with actors to train its models since 2020, and their doubles make up the 225 stock avatars that are available for customers to animate with their own scripts. But to train its latest generation of avatars, Synthesia needed more data; it has spent the past year working with around 1,000 professional actors in London and New York. (Synthesia says it does not sell the data it collects, although it does release some of it for academic research purposes.)

The actors previously got paid each time their avatar was used, but now the company pays them an up-front fee to train the AI model. Synthesia uses their avatars for three years, at which point actors are asked if they want to renew their contracts. If so, they come into the studio to make a new avatar. If not, the company will delete their data. Synthesia’s enterprise customers can also generate their own custom avatars by sending someone into the studio to do much of what I’m doing.

The initial calibration process allows the system to understand the subject’s natural colors and facial features.
Melissa recording audio into a boom mic seated in front of a laptop stand
Synthesia also collects voice samples. In the studio, I read a passage indicating that I explicitly consent to having my voice cloned.

Between takes, the makeup artist comes in and does some touch-ups to make sure I look the same in every shot. I can feel myself blushing because of the lights in the studio, but also because of the acting. After the team has collected all the shots it needs to capture my facial expressions, I go downstairs to read more text aloud for voice samples. 

This process requires me to read a passage indicating that I explicitly consent to having my voice cloned, and that it can be used on Voica’s account on the Synthesia platform to generate videos and speech. 

Consent is key

This process is very different from the way many AI avatars, deepfakes, or synthetic media—whatever you want to call them—are created. 

Most deepfakes aren’t created in a studio. Studies have shown that the vast majority of deepfakes online are nonconsensual sexual content, usually using images stolen from social media. Generative AI has made the creation of these deepfakes easy and cheap, and there have been several high-profile cases in the US and Europe of children and women being abused in this way. Experts have also raised alarms that the technology can be used to spread political disinformation, a particularly acute threat given the record number of elections happening around the world this year. 

Synthesia’s policy is to not create avatars of people without their explicit consent. But it hasn’t been immune from abuse. Last year, researchers found pro-China misinformation that was created using Synthesia’s avatars and packaged as news, which the company said violated its terms of service. 

Since then, the company has put more rigorous verification and content moderation systems in place. It applies a watermark with information on where and how the AI avatar videos were created. Where it once had four in-house content moderators, people doing this work now make up 10% of its 300-person staff. It also hired an engineer to build better AI-powered content moderation systems. These filters help Synthesia vet every single thing its customers try to generate. Anything suspicious or ambiguous, such as content about cryptocurrencies or sexual health, gets forwarded to the human content moderators. Synthesia also keeps a record of all the videos its system creates.

And while anyone can join the platform, many features aren’t available until people go through an extensive vetting system similar to that used by the banking industry, which includes talking to the sales team, signing legal contracts, and submitting to security auditing, says Voica. Entry-level customers are limited to producing strictly factual content, and only enterprise customers using custom avatars can generate content that contains opinions. On top of this, only accredited news organizations are allowed to create content on current affairs.
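Synthesia hasn’t said how these filters are implemented, but the routing logic described above can be pictured as a simple gate: classify each script, escalate anything sensitive to human moderators, and reserve current-affairs content for accredited accounts. The Python sketch below is a hypothetical illustration only; the GenerationRequest fields, topic labels, and moderate function are assumptions made for the sake of the example, not Synthesia’s actual code.

from dataclasses import dataclass, field

# Illustrative categories; the article mentions cryptocurrencies and sexual health
# as examples of content that gets forwarded to human moderators.
SENSITIVE_TOPICS = {"cryptocurrency", "sexual_health"}

@dataclass
class GenerationRequest:
    account_id: str
    script: str
    topics: set = field(default_factory=set)  # output of an upstream topic classifier (assumed)
    accredited_news_org: bool = False

@dataclass
class ModerationDecision:
    approved: bool
    needs_human_review: bool
    reason: str

def moderate(request: GenerationRequest) -> ModerationDecision:
    # Current-affairs content is reserved for accredited news organizations.
    if "current_affairs" in request.topics and not request.accredited_news_org:
        return ModerationDecision(False, True, "current affairs requires accreditation")

    # Suspicious or ambiguous topics are escalated to human moderators.
    flagged = request.topics & SENSITIVE_TOPICS
    if flagged:
        return ModerationDecision(False, True, f"flagged topics: {sorted(flagged)}")

    # Everything else is auto-approved; per the article, the rendered video
    # would also be watermarked and logged.
    return ModerationDecision(True, False, "auto-approved")

# Example: a non-accredited account asking for current-affairs content is blocked
# and routed for review, much like the reporter's EU-sanctions script below.
print(moderate(GenerationRequest("acct-001", "News about EU sanctions...", {"current_affairs"})))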

“We can’t claim to be perfect. If people report things to us, we take quick action, [such as] banning or limiting individuals or organizations,” Voica says. But he believes these measures work as a deterrent, which means most bad actors will turn to freely available open-source tools instead. 

I put some of these limits to the test when I head to Synthesia’s office for the next step in my avatar generation process. In order to create the videos that will feature my avatar, I have to write a script. Using Voica’s account, I decide to use passages from Hamlet, as well as previous articles I have written. I also use a new feature on the Synthesia platform, which is an AI assistant that transforms any web link or document into a ready-made script. I try to get my avatar to read news about the European Union’s new sanctions against Iran. 

Voica immediately texts me: “You got me in trouble!” 

The system has flagged his account for trying to generate content that is restricted.

AI-powered content filters help Synthesia vet every single thing its customers try to generate. Only accredited news organizations are allowed to create content on current affairs.
COURTESY OF SYNTHESIA

Offering services without these restrictions would be “a great growth strategy,” Riparbelli grumbles. But “ultimately, we have very strict rules on what you can create and what you cannot create. We think the right way to roll out these technologies in society is to be a little bit over-restrictive at the beginning.” 

Still, even if these guardrails worked perfectly, the end result would be an internet where ever more of what we see is fake. And my experiment makes me wonder how we could possibly prepare ourselves. 

Our information landscape already feels very murky. On the one hand, there is heightened public awareness that AI-generated content is flourishing and could be a powerful tool for misinformation. But on the other, it is still unclear whether deepfakes are used for misinformation at scale and whether they’re broadly moving the needle to change people’s beliefs and behaviors. 

If people become too skeptical about the content they see, they might stop believing in anything at all, which could enable bad actors to take advantage of this trust vacuum and lie about the authenticity of real content. Researchers have called this the “liar’s dividend.” They warn that politicians, for example, could claim that genuinely incriminating information was fake or created using AI. 

Claire Leibowicz, head of AI and media integrity at the nonprofit Partnership on AI, says she worries that growing awareness of this gap will make it easier to “plausibly deny and cast doubt on real material or media as evidence in many different contexts, not only in the news, [but] also in the courts, in the financial services industry, and in many of our institutions.” She tells me she’s heartened by the resources Synthesia has devoted to content moderation and consent but says that process is never flawless.

Even Riparbelli admits that in the short term, the proliferation of AI-generated content will probably cause trouble. While people have been trained not to believe everything they read, they still tend to trust images and videos, he adds. He says people now need to test AI products for themselves to see what is possible, and should not trust anything they see online unless they have verified it. 

Never mind that AI regulation is still patchy, and the tech sector’s efforts to verify content provenance are still in their early stages. Can consumers, with their varying degrees of media literacy, really fight the growing wave of harmful AI-generated content through individual action? 

Watch out, PowerPoint

The day after my final visit, Voica emails me the videos with my avatar. When the first one starts playing, I am taken aback. It’s as painful as seeing yourself on camera or hearing a recording of your voice. Then I catch myself. At first I thought the avatar was me. 

The more I watch videos of “myself,” the more I spiral. Do I really squint that much? Blink that much? And move my jaw like that? Jesus. 

It’s good. It’s really good. But it’s not perfect. “Weirdly good animation,” my partner texts me. 

“But the voice sometimes sounds exactly like you, and at other times like a generic American and with a weird tone,” he adds. “Weird AF.” 

He’s right. The voice is sometimes me, but in real life I umm and ahh more. What’s remarkable is that it picked up on an irregularity in the way I talk. My accent is a transatlantic mess, confused by years spent living in the UK, watching American TV, and attending international school. My avatar sometimes says the word “robot” in a British accent and other times in an American accent. It’s something that probably nobody else would notice. But the AI did. 

My avatar’s range of emotions is also limited. It delivers Shakespeare’s “To be or not to be” speech very matter-of-factly. I had guided it to be furious when reading a story I wrote about Taylor Swift’s nonconsensual nude deepfakes; the avatar is complain-y and judgy, for sure, but not angry. 

This isn’t the first time I’ve made myself a test subject for new AI. Not too long ago, I tried generating AI avatar images of myself, only to get a bunch of nudes. That experience was a jarring example of just how biased AI systems can be. But this experience—and this particular way of being immortalized—was definitely on a different level.

Carl Öhman, an assistant professor at Uppsala University who has studied digital remains and is the author of a new book, The Afterlife of Data, calls avatars like the ones I made “digital corpses.” 

“It looks exactly like you, but no one’s home,” he says. “It would be the equivalent of cloning you, but your clone is dead. And then you’re animating the corpse, so that it moves and talks, with electrical impulses.” 

That’s kind of how it feels. The little, nuanced ways I don’t recognize myself are enough to put me off. Then again, the avatar could quite possibly fool anyone who doesn’t know me very well. It really shines when presenting a story I wrote about how the field of robotics could be getting its own ChatGPT moment; the virtual AI assistant summarizes the long read into a decent short video, which my avatar narrates. It is not Shakespeare, but it’s better than many of the corporate presentations I’ve had to sit through. I think if I were using this to deliver an end-of-year report to my colleagues, maybe that level of authenticity would be enough. 

And that is the sell, according to Riparbelli: “What we’re doing is more like PowerPoint than it is like Hollywood.”

Once a likeness has been created, Synthesia can quickly generate video presentations from a script. In this video, synthetic “Melissa” summarizes an article real Melissa wrote about Taylor Swift deepfakes.
SYNTHESIA

The newest generation of avatars certainly aren’t ready for the silver screen. They’re still stuck in portrait mode, only showing the avatar front-facing and from the waist up. But in the not-too-distant future, Riparbelli says, the company hopes to create avatars that can communicate with their hands and have conversations with one another. It is also planning for full-body avatars that can walk and move around in a space that a person has generated. (The rig to enable this technology already exists; in fact it’s where I am in the image at the top of this piece.)

But do we really want that? It feels like a bleak future where humans are consuming AI-generated content presented to them by AI-generated avatars and using AI to repackage that into more content, which will likely be scraped to generate more AI. If nothing else, this experiment made clear to me that the technology sector urgently needs to step up its content moderation practices and ensure that content provenance techniques such as watermarking are robust. 

Even if Synthesia’s technology and content moderation aren’t yet perfect, they’re significantly better than anything I have seen in the field before, and this is after only a year or so of the current boom in generative AI. AI development moves at breakneck speed, and it is both exciting and daunting to consider what AI avatars will look like in just a few years. Maybe in the future we will have to adopt safewords to prove that we are in fact communicating with a real human, not an AI. 

But that day is not today. 

I found it weirdly comforting that in one of the videos, my avatar rants about nonconsensual deepfakes and says, in a sociopathically happy voice, “The tech giants? Oh! They’re making a killing!” 

I would never. 

Here’s the defense tech at the center of US aid to Israel, Ukraine, and Taiwan


After weeks of drawn-out congressional debate over how much the United States should spend on conflicts abroad, President Joe Biden signed a $95.3 billion aid package into law on Wednesday.

The bill will send a significant quantity of supplies to Ukraine and Israel, while also supporting Taiwan with submarine technology to aid its defenses against China. It’s also sparked renewed calls for stronger crackdowns on Iranian-produced drones. 

Though much of the money will go toward replenishing fairly standard munitions and supplies, the spending bill provides a window into US strategies around four key defense technologies that continue to reshape how today’s major conflicts are being fought.

For a closer look at the military technology at the center of the aid package, I spoke with Andrew Metrick, a fellow with the defense program at the Center for a New American Security, a think tank.

Ukraine and the role of long-range missiles

Ukraine has long sought the Army Tactical Missile System (ATACMS), a long-range ballistic missile made by Lockheed Martin. First used in combat during Operation Desert Storm in 1991, it is 13 feet long, two feet in diameter, and weighs over 3,600 pounds. It uses GPS guidance to accurately hit targets as far as 190 miles away. 

Last year, President Biden was apprehensive about sending such missiles to Ukraine, as US stockpiles of the weapons were relatively low. In October, the administration changed tack. The US sent shipments of ATACMS, a move celebrated by President Volodymyr Zelensky of Ukraine, but they came with restrictions: the missiles were older models with a shorter range, and Ukraine was instructed to fire them only within Ukrainian territory, not into Russia. 

This week, just hours before the new aid package was signed, multiple news outlets reported that the US had secretly sent more powerful long-range ATACMS to Ukraine several weeks before. They were used on Tuesday, April 23, to target a Russian airfield in Crimea and Russian troops in Berdiansk, 50 miles southwest of Mariupol.

The long range of the weapons has proved essential for Ukraine, says Metrick. “It allows the Ukrainians to strike Russian targets at ranges for which they have very few other options,” he says. That means being able to hit locations like supply depots, command centers, and airfields behind Russia’s front lines in Ukraine. This capacity has grown more important as Ukraine’s troop numbers have waned, Metrick says.

Replenishing Israel’s Iron Dome

On April 13, Iran launched its first-ever direct attack on Israeli soil. In the attack, which Iran says was retaliation for Israel’s airstrike on its embassy compound in Syria, hundreds of missiles were lobbed into Israeli airspace. Many of them were neutralized by the web of cutting-edge air defense systems dispersed throughout Israel, which can automatically intercept incoming projectiles before they hit the ground. 

One of those systems is Israel’s Iron Dome, in which radar systems detect projectiles and then signal units to launch defensive missiles that destroy the target high in the sky, before it strikes populated areas. Israel’s other system, called David’s Sling, works in a similar way but can counter rockets and missiles coming from a greater distance, upwards of 180 miles. 

Both systems are hugely costly to research and build, and the new US aid package allocates $15 billion to replenish their missile stockpiles. The interceptor missiles can cost anywhere from $100,000 to $10 million each, and a system like Iron Dome might fire them daily during intense periods of conflict. 

The aid comes as funding for Israel has grown more contentious amid the dire conditions faced by displaced Palestinians in Gaza. While the spending bill worked its way through Congress, increasing numbers of Democrats sought to put conditions on the military aid to Israel, particularly after an Israeli air strike on April 1 killed seven aid workers from World Central Kitchen, an international food charity. The funding package does provide $9 billion in humanitarian assistance for the conflict, but the efforts to impose conditions for Israeli military aid failed. 

Taiwan and underwater defenses against China

A rising concern for the US defense community—and a subject of “wargaming” simulations that Metrick has carried out—is an amphibious invasion of Taiwan by China. The growing risk of that scenario has driven the US to build and deploy larger numbers of advanced submarines, Metrick says. A bigger fleet of these submarines would be more likely to keep attacks from China at bay, thereby protecting Taiwan.

The trouble is that the US shipbuilding effort, experts say, is too slow. It’s been hampered by budget cuts and labor shortages, but the new aid bill aims to jump-start it. It will provide $3.3 billion to do so, specifically for the production of Columbia-class submarines, which carry nuclear weapons, and Virginia-class submarines, which carry conventional weapons. 

Though these funds aim to support Taiwan by building up the US supply of submarines, the package also includes more direct support, like $2 billion to help it purchase weapons and defense equipment from the US. 

The US’s Iranian drone problem 

Shahed drones are used almost daily on the Russia-Ukraine battlefield, and Iran launched more than 100 of them against Israel earlier this month. Produced by Iran and resembling model planes, the drones are cheap and lightweight, and they can be launched from the back of a pickup truck. They’re used frequently for potent one-way attacks, in which they detonate upon reaching their target. US experts say the technology is tipping the scales toward Russia, Iran, and their allies. 

The trouble with combating them is partly one of cost. Shooting down the drones, which can be bought for as little as $40,000, can cost millions of dollars in ammunition.

“Shooting down Shaheds with an expensive missile is not, in the long term, a winning proposition,” Metrick says. “That’s what the Iranians, I think, are banking on. They can wear people down.”

This week’s aid package renewed White House calls for stronger sanctions aimed at curbing production of the drones. The United Nations previously passed rules restricting any drone-related material from entering or leaving Iran, but those expired in October. The US now wants them reinstated. 

Even if that happens, it’s unlikely the rules would do much to contain the Shahed’s dominance. The drones’ components are not all that complex or hard to obtain to begin with, and experts say Iran has built a sprawling global supply chain to acquire the materials needed to manufacture them and has worked with Russia to build factories. 

“Sanctions regimes are pretty dang leaky,” Metrick says. “They [Iran] have friends all around the world.”