Google DeepMind’s new Gemini model looks amazing—but could signal peak AI hype

Hype about Gemini, Google DeepMind’s long-rumored response to OpenAI’s GPT-4, has been building for months. Today the company finally revealed what it has been working on in secret all this time. Was the hype justified? Yes—and no. 

Gemini is Google’s biggest AI launch yet—its push to take on competitors OpenAI and Microsoft in the race for AI supremacy. The model is pitched as best-in-class across a wide range of capabilities—an “everything machine,” as one observer puts it. 

“The model is innately more capable,” Sundar Pichai, the CEO of Google and its parent company Alphabet, told MIT Technology Review. “It’s a platform. AI is a profound platform shift, bigger than web or mobile. And so it represents a big step for us.” 

It’s a big step for Google, but not necessarily a giant leap for the field as a whole. Google DeepMind claims that Gemini outmatches GPT-4 on 30 out of 32 standard measures of performance. And yet the margins between them are thin. What Google DeepMind has done is pull AI’s best current capabilities into one powerful package. To judge from demos, it does many things very well—but few things that we haven’t seen before. For all the buzz about the next big thing, Gemini could be a sign that we’ve reached peak AI hype. At least for now. 

Chirag Shah, a professor at the University of Washington who specializes in online search, compares the launch to Apple’s introduction of a new iPhone every year. “Maybe we just have risen to a different threshold now, where this doesn’t impress us as much because we’ve just seen so much,” he says. 

Like GPT-4, Gemini is multimodal, meaning it is trained to handle multiple kinds of input: text, images, audio. It can combine these different formats to answer questions about everything from household chores to college math to economics. 

In a demo for journalists yesterday, Google showed Gemini’s ability to take an existing screenshot of a chart, analyze hundreds of pages of research with new data, and then update the chart with that new information. In another example, Gemini is shown pictures of an omelet cooking in a pan and asked (using speech, not text) if the omelet is cooked yet. “It’s not ready because the eggs are still runny,” it replies. 

Most people will have to wait for the full experience, however. The version launched today is a back end to Bard, Google’s text-based search chatbot, which the company says will give it more advanced reasoning, planning, and understanding capabilities. Gemini’s full release will be staggered over the coming months. The new Gemini-boosted Bard will initially be available in English in more than 170 countries, not including the EU and the UK. This is to let the company “engage” with local regulators, says Sissie Hsiao, a Google vice president in charge of Bard. 

Gemini also comes in three sizes: Ultra, Pro, and Nano. Ultra is the full-powered version; Pro and Nano are tailored to applications that run with more limited computing resources. Nano is designed to run on devices, such as Google’s new Pixel phones. Developers and businesses will be able to access Gemini Pro starting December 13. Gemini Ultra, the most powerful model, will be available “early next year” following “extensive trust and safety checks,” Google executives told reporters on a press call. 

“I think of it as the Gemini era of models,” Pichai told us. “This is how Google DeepMind is going to build and make progress on AI. So it will always represent the frontier of where we are making progress on AI technology.”

Bigger, better, faster, stronger? 

OpenAI’s most powerful model, GPT-4, is seen as the industry’s gold standard. While Google boasted that Gemini outperforms OpenAI’s previous model, GPT-3.5, company executives dodged questions about how far the model exceeds GPT-4. 

But the firm highlights one benchmark in particular, called MMLU (massive multitask language understanding), a set of text-based tests covering reading comprehension, college math, and multiple-choice quizzes in physics, economics, and social sciences. On MMLU, Gemini scores 90%, edging past the approximately 89% scored by human experts, says Pichai; GPT-4 scores 86%. “It’s the first model to cross that threshold,” Pichai says. On MMMU, a separate benchmark whose questions involve images as well as text, Gemini scores 59% and GPT-4 scores 57%. 

Gemini’s performance against benchmark data sets is very impressive, says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico.

“It’s clear that Gemini is a very sophisticated AI system,” says Mitchell. But “it’s not obvious to me that Gemini is actually substantially more capable than GPT-4,” she adds.

While the model has good benchmark scores, it is hard to know how to interpret these numbers given that we don’t know what’s in the training data, says Percy Liang, director of Stanford’s Center for Research on Foundation Models. 

Mitchell also notes that Gemini performs much better on language and code benchmarks than on images and video. “Multimodal foundation models still have a ways to go to be generally and robustly useful for many tasks,” she says. 

Using feedback from human testers, Google DeepMind has trained Gemini to be more factually accurate, to give attribution when asked to, and to hedge rather than spit out nonsense when faced with a question it cannot answer. The company claims that this mitigates the problem of hallucinations. But without a radical overhaul of the base technology, large language models will continue to make things up. 

Experts say it’s unclear whether the benchmarks Google is using to measure Gemini’s performance offer that much insight, and without transparency, it’s hard to check Google’s claims. 

“Google is advertising Gemini as an everything machine—a general-purpose model that can be used in many different ways,” says Emily Bender, a professor of computational linguistics at the University of Washington. But the company is using narrow benchmarks to evaluate models that it expects to be used for these diverse purposes. “This means it effectively can’t be thoroughly evaluated,” she says.  

Ultimately, for the average user, the incremental improvement over competing models might not make much difference, says Shah. “It’s more about convenience, brand recognition, existing integration, than people really thinking ‘Oh, this is better,’” he says. 

A long, slow buildup 

Gemini has been a long time coming. In April 2023, Google announced it was merging its AI research unit Google Brain with DeepMind, Alphabet’s London-based AI research lab. So Google has had all year to develop its answer to OpenAI’s most advanced large language model, GPT-4, which debuted in March and is the backbone of the paid version of ChatGPT.

Google has been under intense pressure to show investors it can match and overtake competitors in AI. Although the company has been developing and using powerful AI models for years, it has been hesitant to launch tools that the public can play with for fears of reputational damage and safety concerns. 

“Google has been very cautious about releasing this stuff to the public,” Geoffrey Hinton told MIT Technology Review in April when he left the company. “There are too many bad things that could happen, and Google didn’t want to ruin its reputation.” Faced with tech that seemed untrustworthy or unmarketable, Google played it safe—until the greater risk became missing out.

Google has learned the hard way how launching flawed products can backfire. When it unveiled its ChatGPT competitor Bard in February, scientists soon noticed a factual error in the company’s own advertisement for the chatbot, an incident that wiped some $100 billion off Alphabet’s market value. 

In May, Google announced it was rolling out generative AI into most of its products, from email to productivity software. But the results failed to impress critics: the chatbot made references to emails that didn’t exist, for example.  

This is a consistent problem with large language models. Although excellent at generating text that sounds like something a human could have written, generative AI systems regularly make things up. And that’s not the only problem with them. They are also easy to hack and riddled with biases. Using them is also highly polluting.

Google has not solved any of these problems. Its partial answer to hallucination is a tool that lets people use Google Search to double-check the chatbot’s answers, but that relies on the accuracy of the online search results themselves. 

Gemini may be the pinnacle of this wave of generative AI. But it’s not clear where AI built on large language models goes next. Some researchers believe this could be a plateau rather than the foot of the next peak. 

Pichai is undeterred. “Looking ahead, we do see a lot of headroom,” he says. “I think multimodality will be big. As we teach these models to reason more, there will be bigger and bigger breakthroughs. Deeper breakthroughs are to come yet. 

“When I take in the totality of it, I genuinely feel like we are at the very beginning.”

Mat Honan contributed reporting.

Google CEO Sundar Pichai on Gemini and the coming age of AI

Google released the first phase of its next-generation AI model, Gemini, today. Gemini reflects years of efforts from inside Google, overseen and driven by its CEO, Sundar Pichai.

(You can read all about Gemini in our report from Melissa Heikkilä and Will Douglas Heaven here.) 

Pichai, who previously oversaw Chrome and Android, is famously product obsessed. In his first founder’s letter as CEO in 2016, he predicted that “[w]e will move from mobile first to an AI first world.” In the years since, Pichai has infused AI deeply into all of Google’s products, from Android devices all the way up to the cloud. 

Despite that, the last year has largely been defined by the AI releases from another company, OpenAI. The rollout of DALL-E and GPT-3.5 last year, followed by GPT-4 this year, dominated the sector and kicked off an arms race between startups and tech giants alike. 

Gemini is now the latest effort in that race. This state-of-the-art system was led by Google DeepMind, the newly integrated organization led by Demis Hassabis that brings together the company’s AI teams under one umbrella. You can experience Gemini in Bard today, and it will become integrated across the company’s line of products throughout 2024. 

We sat down with Sundar Pichai at Google’s offices in Mountain View, California, on the eve of Gemini’s launch to discuss what it will mean for Google, its products, AI, and society writ large. 

The following transcript represents Pichai in his own words. The conversation has been edited for clarity and readability. 

MIT Technology Review: Why is Gemini exciting? Can you tell me what’s the big picture that you see as it relates to AI, its power, its usefulness, the direction as it goes into all of your products? 

Sundar Pichai: A specific part of what makes it exciting is it’s a natively multimodal model from the ground up. Just like humans, it’s not just learning on text alone. It’s text, audio, code. So the model is innately more capable because of that, and I think will help us tease out newer capabilities and contribute to the progress of the field. That’s exciting. 

It’s also exciting because Gemini Ultra is state of the art in 30 of the 32 leading benchmarks, and particularly in the multimodal benchmarks. That MMMU benchmark—it shows the progress there. I personally find it exciting that in MMLU [massive multi-task language understanding], which has been one of the leading benchmarks, it crossed the 90% threshold, which is a big milestone. The state of the art two years ago was 30 or 40%. So just think about how much the field is progressing. Human experts score approximately 89% across these 57 subjects. It’s the first model to cross that threshold. 

I’m excited, also, because it’s finally coming in our products. It’s going to be available to developers. It’s a platform. AI is a profound platform shift, bigger than web or mobile. And so it represents a big step for us from that moment as well.

Let’s start with those benchmarks. It seemed to be ahead of GPT-4 in almost all of them, or most all of them, but not by a lot. Whereas GPT-4 seemed like a very large leap forward. Are we starting to plateau with what we’re going to see some of these large-language-model technologies be able to do, or do you think we will continue to have these big growth curves?

First of all, looking ahead, we do see a lot of headroom. Some of the benchmarks are already high. You have to realize, when you’re trying to go to something from 85%, you’re now at that edge of the curve. So it may not seem like much, but it’s making progress. We are going to need newer benchmarks, too. It’s part of the reason we also looked at the MMMU multimodal benchmark. [For] some of these new benchmarks, the state of the art is still much lower. There’s a lot of progress ahead. The scaling laws are still going to work. As we make the models bigger, there’s going to be more progress. When I take in the totality of it, I genuinely feel like we are at the very beginning. 

I’m interested in what you see as the key breakthroughs of Gemini, and how they will be applied. 

It’s so difficult for people to imagine the leaps that will happen. We are providing APIs, and people will imagine it in pretty deep ways.

I think multimodality will be big. As we teach these models to reason more, there will be bigger and bigger breakthroughs. Deeper breakthroughs are to come yet. 

One way to think about this question is Gemini Pro. It does very well on benchmarks. But when we put it in Bard, I could feel it as a user. We’ve been testing it, and the favorability ratings go up across all categories pretty significantly. It’s why we’re calling it one of our biggest upgrades yet. And when we do side-by-side blind evaluations, it really shows the outperformance. So you make these better models improve on benchmarks. It makes progress. And we’ll continue training and pick it up from there. 

But I can’t wait to put it in our products. These models are so capable. Actually designing the product experiences to take advantage of all that the models have—that stuff will be exciting for the next few months.

I imagine there was an enormous amount of pressure to get Gemini out the door. I’m curious what you learned by seeing what had happened with GPT-4’s release. What did you learn? What approaches changed in that time frame?

One thing, at least to me: it feels very far from a zero-sum game, right? Think about how profound the shift to AI is, and how early we are. There’s a world of opportunity ahead. 

But to your specific question, it’s a rich field in which we are all progressing. There is a scientific component to it, there’s an academic component to it; being published a lot, seeing how models like GPT-4 work in the real world. We have learned from that. Safety is an important area. So in part with Gemini, there are safety techniques we have learned and improved on based on how models are working out in the real world. It shows the importance of various things like fine-tuning. One of the things we showed with Med-PaLM 2 was to take a model like PaLM, to really fine-tune it to a specific domain, show it could outperform state-of-the-art models. And so that was a way by which we learned the power of fine-tuning. 

A lot of that is applied as we are working our way through Gemini. Part of the reason we are taking some more time with Ultra [the more advanced version of Gemini that will be available next year] is to make sure we are testing it rigorously for safety. But we’re also fine-tuning it to really tease out the capabilities. 

When you see some of these releases come out and people begin tinkering with them in the real world, they’ll have hallucinations, or they can reveal some of the private data that the models are trained on. And I wonder how much of that is inherent in the technology, given the data that it’s trained on, if that’s inevitable. If it is inevitable, what types of things do you try and do to limit that?

You’re right. These are all active fields of research. In fact, we just published a paper which shows how these models can reveal training data by a series of prompts. Hallucination is not a solved problem. I think we are all making progress on it, and there’s more work to be done. There are some fundamental limitations we need to work through. One example is if you take Gemini Ultra, we are actively red-teaming these models with external third parties using it who are specialists in these things. 

In areas like multimodality, we want to be bold and we want to be responsible. We will be more careful with multimodal rollouts, because the chances of wrong use cases are higher. 

But you are right in the sense that it is still a technology which is a work in progress, which is why they won’t make sense for everything. Which is why in Search we are being more careful about how, when, and where we use it, and when we trigger it. They have these amazing capabilities, and they have clear shortcomings. This is the hard work ahead for all of us.

Do you think ultimately this is going to be a solved problem—hallucinations, or revealing other training data? 

With the current technology of auto-regressive LLMs, hallucinations are not a solved problem. But future AI systems may not look like what we have today. This is one version of technology. It’s like when people thought there is no way you can fit a computer in your pocket. There were people who were really opinionated, 20 years ago. Similarly, looking at these systems and saying you can’t design better systems. I don’t subscribe to that view. There are already many research explorations underway to think about how else to come upon these problems.

You’ve talked about how profound a shift this is. In some of these last shifts, like the shift to mobile, it didn’t necessarily increase productivity, which has been flat for a long time. I think there’s an argument that it may have even worsened income inequality. What type of work is Google doing to try to make sure that this shift is more widely beneficial to society?

It’s a very important question. I think about it on a few levels. One thing at Google we’ve always been focused on is: How do we get technology access as broadly available as possible? So I would argue even in the case of mobile, the work we do with Android—hundreds of millions of people wouldn’t have otherwise had computing access. We work hard to push toward an affordable smartphone, to maybe sub-$50. 

So making AI helpful for everyone is the framework I think about. You try to promote access to as many people as possible. I think that’s one part of it.

We are thinking deeply about applying it to use cases which can benefit people. For example, the reason we did flood forecasting early on is because we realized, AI can detect patterns and do it well. We’re using it to translate 1,000 languages. We’re literally trying to bring content now in languages where otherwise you wouldn’t have had access. 

This doesn’t solve all the problems you’re talking about. But being deliberate about when and where, what kind of problems you’re going to focus on—we’ve always been focused on that. Take areas like AlphaFold. We have provided an open database for viruses everywhere in the world. But … who uses it first? Where does it get sold? AI is not going to magically make things better on some of the more difficult issues like inequality; it could exacerbate it. 

But what is important is you make sure that technology is available for everyone. You’re developing it early and giving people access and engaging in conversation so that society can think about it and adapt to it. 

We’ve definitely, in this technology, participated earlier on than other technologies. You know, the recent UK AI Safety Forum or work in the US with Congress and the administration. We are trying to do more public-private partnerships, pulling in nonprofit and academic institutions earlier. 

Impacts on areas like jobs need to be studied deeply, but I do think there are surprises. There’ll be surprising positive externalities, there’ll be negative externalities too. Solving the negative externalities is larger than any one company. It’s the role of all the stakeholders in society. So I don’t have easy answers there. 

I can give you plenty of examples of the benefits mobile brings. I think that will be true of this too. We already showed it with areas like diabetic retinopathy. There are just not enough doctors in many parts of the world to detect it. 

Just like I felt giving people access to Google Search everywhere in the world made a positive difference, I think that’s the way to think about expanding access to AI.

There are things that are clearly going to make people more productive. Programming is a great example of this. And yet, that democratization of this technology is the very thing that is threatening jobs.  And even if you don’t have all the answers for society—and it’s not incumbent on one company to solve society’s problems—one company can put out a product that can dramatically change the world and have this profound impact. 

We never offered facial-recognition APIs. But people built APIs and the technology moves forward. So it is also not in any one company’s hands. Technology will move forward. 

I think the answer is more complex than that. Societies can also get left behind. If you don’t adopt these technologies, it could impact your economic competitiveness. You could lose more jobs. 

I think the right answer is to responsibly deploy technology and make progress and think about areas where it can cause disproportionate harm and do work to mitigate it. There will be newer types of jobs. If you look at the last 50, 60 years, there are studies from economists from MIT which show most of the new jobs that have been created are in new areas which have come since then. 

There will be newer jobs that are created. There will be jobs which are made better, where some of the repetitive work is freed up in a way that you can express yourself more creatively. You could be a doctor, you could be a radiologist, you could be a programmer. The amount of time you’re spending on routine tasks versus higher-order thinking—all that could change, making the job more meaningful. Then there are jobs which could be displaced. So, as a society, how do you retrain, reskill people, and create opportunities? 

The last year has really brought out this philosophical split in the way people think we should approach AI. You could talk about it as being safety first or business use cases first, or accelerationists versus doomers. You’re in a position where you have to bridge all of that philosophy and bring it together. I wonder what you personally think about trying to bridge those interests at Google, which is going to be a leader in this field, into this new world.

I’m a technology optimist. I have always felt, based on my personal life, a belief in people and humanity. And so overall, I think humanity will harness technology to its benefit. So I’ve always been an optimist. You’re right: a powerful technology like AI—there is a duality to it. 

Which means there will be times we will boldly move forward because I think we can push the state of the art. For example, if AI can help us solve problems like cancer or climate change, you want to do everything in your power to move forward fast. But you definitely need society to develop frameworks to adapt, be it to deepfakes or to job displacement, etc. This is going to be a frontier—no different from climate change. This will be one of the biggest things we all grapple with for the next decade ahead.

Another big, unsettled thing is the legal landscape around AI. There are questions about fair use, questions about being able to protect the outputs. And it seems like it’s going to be a really big deal for intellectual property. What do you tell people who are using your products, to give them a sense of security, that what they’re doing isn’t going to get them sued?

These are not all topics that will have easy answers. When we build products, like Search and YouTube and stuff in the pre-AI world, we’ve always been trying to get the value exchange right. It’s no different for AI. We are definitely focused on making sure we can train on data that is allowed to be trained on, consistent with the law, giving people a chance to opt out of the training. And then there’s a layer about that—about what is fair use. It’s important to create value for the creators of the original content. These are important areas. The internet was an example of it. Or when e-commerce started: How do you draw the line between e-commerce and regular commerce? 

There’ll be new legal frameworks developed over time, I think is how I would think about it as this area evolves. But meanwhile, we will work hard to be on the right side of the law and make sure we also have deep relationships with many providers of content today. There are some areas where it’s contentious, but we are working our way through those things, and I am committed to working to figure it out. We have to create that win-win ecosystem for all of this to work over time. 

Something that people are very worried about with the web now is the future of search. When you have a type of technology that just answers questions for you, based on information from around the web, there’s a fear people may no longer need to visit those sites. This also seems like it could have implications for Google. I also wonder if you’re thinking about it in terms of your own business. 

One of the unique value propositions we’ve had in Search is we are helping users find and learn new things, find answers, but always with a view of sharing with them the richness and the diversity that exists on the web. That will be true, even as we go through our journey with Search Generative Experience. It’s an important principle by which we are developing our product. I don’t think people always come to Search saying, “Just answer it for me.” There may be a question or two for which you may want that, but even then you come back, you learn more, or even in that journey, go deeper. We constantly want to make sure we are getting it right. And I don’t think that’s going to change. It’s important that we get the balance right there. 

Similarly, if you deliver value deeply, there is commercial value in what you’re delivering. We had questions like this from desktop to mobile. It’s not new to us. I feel comfortable based on everything we are seeing and how users respond to high-quality ads. YouTube is a good example where we have developed subscription models. That’s also worked well. 

How do you think people’s experience is going to change next year, as these products begin to really hit the marketplace and they begin to interact? How is their experience gonna change?

I think a year out from now, anybody starting on something in Google Docs will expect something different. And if you give it to them, and later put them back in the version of Google Docs we had, let’s say, in 2022, they will find it so out of date. It’s like, for my kids, if they don’t have spell-check, they fundamentally will think it’s broken. And you and I may remember what it was to use these products before spell-check. But more than any other company, we’ve incorporated so much AI in Search, people take it for granted. That’s one thing I’ve learned over time. They take it for granted. 

In terms of what new stuff people can do, as we develop the multimodal capabilities, people will be able to do more complex tasks in a way that they weren’t able to do before. And there’ll be real use cases which are way more powerful.

Correction: This story was updated to fix transcription errors. Notably, MMMU was incorrectly transcribed as MMLU, and search generative experience originally appeared as search related experience.

Making an image with generative AI uses as much energy as charging your phone

Each time you use AI to generate an image, write an email, or ask a chatbot a question, it comes at a cost to the planet.

In fact, generating an image using a powerful AI model takes as much energy as fully charging your smartphone, according to a new study by researchers at the AI startup Hugging Face and Carnegie Mellon University. However, they found that using an AI model to generate text is significantly less energy-intensive: generating text 1,000 times uses only about as much energy as 16% of a full smartphone charge. 

Their work, which is yet to be peer reviewed, shows that while training massive AI models is incredibly energy intensive, it’s only one part of the puzzle. Most of their carbon footprint comes from their actual use.  

The study is the first time researchers have calculated the carbon emissions caused by using an AI model for different tasks, says Sasha Luccioni, an AI researcher at Hugging Face who led the work. She hopes understanding these emissions could help us make informed decisions about how to use AI in a more planet-friendly way. 

Luccioni and her team looked at the emissions associated with 10 popular AI tasks on the Hugging Face platform, such as question answering, text generation, image classification, captioning, and image generation. They ran the experiments on 88 different models. For each of the tasks, such as text generation, Luccioni ran 1,000 prompts and measured the energy used with CodeCarbon, a tool she developed. CodeCarbon makes these calculations by looking at the energy the computer consumes while running the model. The team also calculated the emissions generated by doing these tasks using eight generative models, which were trained to do different tasks. 
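
To make that setup concrete, here is a minimal sketch of how per-task energy use can be logged with the codecarbon Python package around a Hugging Face text-generation pipeline. It is only an illustration of the general approach, not the study’s actual harness; the model name, prompt, and project name are placeholders.

```python
from codecarbon import EmissionsTracker
from transformers import pipeline

# Placeholder model and prompt; the study covered 88 models across 10 tasks.
generator = pipeline("text-generation", model="gpt2")

tracker = EmissionsTracker(project_name="text-generation-energy")
tracker.start()
try:
    for _ in range(1000):  # one batch of 1,000 prompts, mirroring the study's protocol
        generator("The quick brown fox", max_new_tokens=20)
finally:
    emissions_kg = tracker.stop()  # estimated emissions, in kg of CO2-equivalent

print(f"Estimated emissions for 1,000 generations: {emissions_kg:.6f} kg CO2e")
```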

Generating images was by far the most energy- and carbon-intensive AI-based task. Generating 1,000 images with a powerful AI model, such as Stable Diffusion XL, is responsible for roughly as much carbon dioxide as driving the equivalent of 4.1 miles in an average gasoline-powered car. In contrast, the least carbon-intensive text generation model they examined was responsible for as much CO2 as driving 0.0006 miles in a similar vehicle. Stability AI, the company behind Stable Diffusion XL, did not respond to a request for comment.

The study provides useful insights into AI’s carbon footprint by offering concrete numbers and reveals some worrying upward trends, says Lynn Kaack, an assistant professor of computer science and public policy at the Hertie School in Germany, where she leads work on AI and climate change. She was not involved in the research.

These emissions add up quickly. The generative-AI boom has led big tech companies to integrate powerful AI models into many different products, from email to word processing. These generative AI models are now used millions if not billions of times every single day. 

The team found that using large generative models to create outputs was far more energy intensive than using smaller AI models tailored for specific tasks. For example, using a generative model to classify movie reviews according to whether they are positive or negative consumes around 30 times more energy than using a fine-tuned model created specifically for that task, Luccioni says. The reason generative AI models use much more energy is that they are trying to do many things at once, such as generate, classify, and summarize text, instead of just one task, such as classification. 
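
As a rough illustration of that contrast, the sketch below runs the same movie-review classification two ways with Hugging Face pipelines: once with a small fine-tuned sentiment classifier, and once by prompting a general-purpose generative model. The model names are examples chosen for illustration, not the models measured in the study.

```python
from transformers import pipeline

review = "A tedious, overlong film redeemed only by one great performance."

# Task-specific route: a small classifier fine-tuned for sentiment analysis.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # example model
)
print(classifier(review))  # e.g. [{'label': 'NEGATIVE', 'score': 0.99}]

# General-purpose route: the same task phrased as a prompt to a generative model,
# the kind of setup the study found to be far more energy-hungry for this job.
generator = pipeline("text-generation", model="gpt2")  # placeholder model
prompt = f"Review: {review}\nIs this review positive or negative? Answer:"
print(generator(prompt, max_new_tokens=3)[0]["generated_text"])
```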

Luccioni says she hopes the research will encourage people to be choosier about when they use generative AI and opt for more specialized, less carbon-intensive models where possible. 

“If you’re doing a specific application, like searching through email … do you really need these big models that are capable of anything? I would say no,” Luccioni says. 

The energy consumption associated with using AI tools has been a missing piece in understanding their true carbon footprint, says Jesse Dodge, a research scientist at the Allen Institute for AI, who was not part of the study. 

Comparing the carbon emissions from newer, larger generative models and older AI models is also important, Dodge adds. “It highlights this idea that the new wave of AI systems are much more carbon intensive than what we had even two or five years ago,” he says. 

Google once estimated that an average online search used 0.3 watt-hours of electricity, equivalent to driving 0.0003 miles in a car. Today, that number is likely much higher, because Google has integrated generative AI models into its search, says Vijay Gadepally, a research scientist at the MIT Lincoln Laboratory, who did not participate in the research. 

Not only did the researchers find emissions for each task to be much higher than they expected, but they discovered that the day-to-day emissions associated with using AI far exceeded the emissions from training large models. Luccioni tested different versions of Hugging Face’s multilingual AI model BLOOM to see how many uses would be needed to overtake training costs. It took over 590 million uses to reach the carbon cost of training its biggest model. For very popular models, such as ChatGPT, it could take just a couple of weeks for such a model’s usage emissions to exceed its training emissions, Luccioni says. 
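
The break-even point is simple arithmetic: divide the one-off training footprint by the footprint of a single query. The numbers below are purely hypothetical placeholders, not the study’s figures; they only show how the ratio works.

```python
# Hypothetical, round numbers purely for illustration; not the study's data.
training_emissions_kg = 30_000.0    # one-off footprint of training the model
per_query_emissions_kg = 0.00005    # footprint of serving a single request

break_even_queries = training_emissions_kg / per_query_emissions_kg
print(f"Usage emissions overtake training after ~{break_even_queries:,.0f} queries")
```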

This is because large AI models get trained just once, but then they can be used billions of times. According to some estimates, popular models such as ChatGPT have up to 10 million users a day, many of whom prompt the model more than once. 

Studies like these make the energy consumption and emissions related to AI more tangible and help raise awareness that there is a carbon footprint associated with using AI, says Gadepally, adding, “I would love it if this became something that consumers started to ask about.”

Dodge says he hopes studies like this will help us to hold companies more accountable about their energy usage and emissions. 

“The responsibility here lies with a company that is creating the models and is earning a profit off of them,” he says. 

Google DeepMind’s new AI tool helped create more than 700 new materials

From EV batteries to solar cells to microchips, new materials can supercharge technological breakthroughs. But discovering them usually takes months or even years of trial-and-error research. 

Google DeepMind hopes to change that with a new tool that uses deep learning to dramatically speed up the process of discovering new materials. Called graph networks for materials exploration (GNoME), the technology has already been used to predict structures for 2.2 million new materials, of which more than 700 have gone on to be created in the lab and are now being tested. It is described in a paper published in Nature today. 

Alongside GNoME, Lawrence Berkeley National Laboratory also announced a new autonomous lab. The lab takes data from the Materials Project database, which includes some of GNoME’s discoveries, and uses machine learning and robotic arms to engineer new materials without the help of humans. Google DeepMind says that together, these advancements show the potential of using AI to scale up the discovery and development of new materials.

GNoME can be described as AlphaFold for materials discovery, according to Ju Li, a materials science and engineering professor at the Massachusetts Institute of Technology. AlphaFold, a DeepMind AI system announced in 2020, predicts the structures of proteins with high accuracy and has since advanced biological research and drug discovery. Thanks to GNoME, the number of known stable materials has grown almost tenfold, to 421,000.

“While materials play a very critical role in almost any technology, we as humanity know only a few tens of thousands of stable materials,” said Dogus Cubuk, materials discovery lead at Google DeepMind, at a press briefing. 

To discover new materials, scientists combine elements across the periodic table. But because there are so many combinations, it’s inefficient to do this process blindly. Instead, researchers build upon existing structures, making small tweaks in the hope of discovering new combinations that hold potential. However, this painstaking process is still very time consuming. Also, because it builds on existing structures, it limits the potential for unexpected discoveries. 

To overcome these limitations, DeepMind combines two different deep-learning models. The first generates more than a billion structures by making modifications to elements in existing materials. The second, however, ignores existing structures and predicts the stability of new materials purely on the basis of chemical formulas. The combination of these two models allows for a much broader range of possibilities. 

Once the candidate structures are generated, they are filtered through DeepMind’s GNoME models. The models predict the decomposition energy of a given structure, which is an important indicator of how stable the material can be. “Stable” materials do not easily decompose, which is important for engineering purposes. GNoME selects the most promising candidates, which go through further evaluation based on known theoretical frameworks.

This process is then repeated multiple times, with each discovery incorporated into the next round of training.
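
The workflow can be pictured as a screen-and-verify loop: generate candidates, use the learned model to keep those predicted to be stable, verify the survivors with established theory, and fold the results back into training. The toy sketch below illustrates only that control flow; none of the functions correspond to a real GNoME API, and all values are made up.

```python
import random

def generate_candidates(known, n=1000):
    # Stand-in for DeepMind's two generators: tweaks to known structures plus
    # composition-only guesses that ignore existing structures entirely.
    return [f"candidate-{random.randrange(10**6)}" for _ in range(n)]

def predict_decomposition_energy(candidate):
    # Stand-in for the GNoME graph network's stability prediction.
    return random.uniform(-0.1, 0.5)

def dft_verify(candidates):
    # Stand-in for the expensive first-principles (DFT) check used as ground truth.
    return [c for c in candidates if random.random() < 0.8]

known_materials = ["seed-material"]
for round_idx in range(3):
    candidates = generate_candidates(known_materials)
    # Keep only structures the model predicts to be stable (low decomposition energy).
    promising = [c for c in candidates if predict_decomposition_energy(c) <= 0.0]
    verified = dft_verify(promising)
    known_materials.extend(verified)  # each round's discoveries feed the next round
    print(f"round {round_idx}: {len(promising)} promising, {len(verified)} verified")
```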

In its first round, GNoME predicted different materials’ stability with a precision of around 5%, but that precision increased quickly throughout the iterative learning process. The final results showed GNoME managed to predict the stability of structures over 80% of the time for the first model and 33% for the second. 

Using AI models to come up with new materials is not a novel idea. The Materials Project, a program led by Kristin Persson at Berkeley Lab, has used similar techniques to discover and improve the stability of 48,000 materials. 

However, GNoME’s size and precision set it apart from previous efforts. It was trained on at least an order of magnitude more data than any previous model, says Chris Bartel, an assistant professor of chemical engineering and materials science at the University of Minnesota. 

Doing similar calculations has previously been expensive and limited in scale, says Yifei Mo, an associate professor of materials science and engineering at the University of Maryland. GNoME allows these computations to scale up with higher accuracy and at much less computational cost, Mo says: “The impact can be huge.”

Once new materials have been identified, it is equally important to synthesize them and prove their usefulness. Berkeley Lab’s new autonomous laboratory, named the A-Lab, has been using some of GNoME’s discoveries with the Materials Project information, integrating robotics with machine learning to optimize the development of such materials.

The lab is capable of making its own decisions about how to make a proposed material and creates up to five initial formulations. These formulations are generated by a machine-learning model trained on existing scientific literature. After each experiment, the lab uses the results to adjust the recipes.

Researchers at Berkeley Lab say that A-Lab was able to perform 355 experiments over 17 days and successfully synthesized 41 out of 58 proposed compounds. This works out to more than two successful syntheses a day.

In a typical, human-led lab, it takes much longer to make materials. “If you’re unlucky, it can take months or even years,” said Persson at a press briefing. Most students give up after a few weeks, she said. “But the A-Lab doesn’t mind failing. It keeps trying and trying.”

Researchers at DeepMind and Berkeley Lab say these new AI tools can help accelerate hardware innovation in energy, computing, and many other sectors.

“Hardware, especially when it comes to clean energy, needs innovation if we are going to solve the climate crisis,” says Persson. “This is one aspect of accelerating that innovation.”

Bartel, who was not involved in the research, says that these materials will be promising candidates for technologies spanning batteries, computer chips, ceramics, and electronics. 

Lithium-ion battery conductors are one of the most promising use cases. Conductors play an important role in batteries by facilitating the flow of electric current between various components. DeepMind says GNoME identified 528 promising lithium-ion conductors among other discoveries, some of which may help make batteries more efficient. 

However, even after new materials are discovered, it usually takes decades for industries to take them to the commercial stage. “If we can reduce this to five years, that will be a big improvement,” says Cubuk.

Correction: This story has been updated to make clear where the lab’s data comes from.

Unpacking the hype around OpenAI’s rumored new Q* model

This story is from The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Ever since last week’s dramatic events at OpenAI, the rumor mill has been in overdrive about why the company’s chief scientist, Ilya Sutskever, and its board decided to oust CEO Sam Altman.

While we still don’t know all the details, there have been reports that researchers at OpenAI had made a “breakthrough” in AI that had alarmed staff members. Reuters and The Information both report that researchers had come up with a new way to make powerful AI systems and had created a new model, called Q* (pronounced Q star), that was able to perform grade-school-level math. According to the people who spoke to Reuters, some at OpenAI believe this could be a milestone in the company’s quest to build artificial general intelligence, a much-hyped concept referring to an AI system that is smarter than humans. The company declined to comment on Q*. 

Social media is full of speculation and excessive hype, so I called some experts to find out how big a deal any breakthrough in math and AI would really be.

Researchers have for years tried to get AI models to solve math problems. Language models like ChatGPT and GPT-4 can do some math, but not very well or reliably. We currently don’t have the algorithms or even the right architectures to be able to solve math problems reliably using AI, says Wenda Li, an AI lecturer at the University of Edinburgh. Deep learning and transformers (a kind of neural network), which is what language models use, are excellent at recognizing patterns, but that alone is likely not enough, Li adds. 

Math is a benchmark for reasoning, Li says. A machine that is able to reason about mathematics could, in theory, learn to do other tasks that build on existing information, such as writing computer code or drawing conclusions from a news article. Math is a particularly hard challenge because it requires AI models to have the capacity to reason and to really understand what they are dealing with. 

A generative AI system that could reliably do math would need to have a really firm grasp on concrete definitions of particular concepts that can get very abstract. A lot of math problems also require some level of planning over multiple steps, says Katie Collins, a PhD researcher at the University of Cambridge, who specializes in math and AI. Indeed, Yann LeCun, chief AI scientist at Meta, posted on X and LinkedIn over the weekend that he thinks Q* is likely to be “OpenAI attempts at planning.”

People who worry about whether AI poses an existential risk to humans, one of OpenAI’s founding concerns, fear that such capabilities might lead to rogue AI. Safety concerns might arise if such AI systems are allowed to set their own goals and start to interface with a real physical or digital world in some ways, says Collins. 

But while math capability might take us a step closer to more powerful AI systems, solving these sorts of math problems doesn’t signal the birth of a superintelligence. 

“I don’t think it immediately gets us to AGI or scary situations,” says Collins.  It’s also very important to underline what kind of math problems AI is solving, she adds.

“Solving elementary-school math problems is very, very different from pushing the boundaries of mathematics at the level of something a Fields medalist can do,” says Collins, referring to a top prize in mathematics.  

Machine-learning research has focused on solving elementary-school problems, but state-of-the-art AI systems haven’t fully cracked this challenge yet. Some AI models fail on really simple math problems, but then they can excel at really hard problems, Collins says. OpenAI has, for example, developed dedicated tools that can solve challenging problems posed in competitions for top math students in high school, but these systems outperform humans only occasionally.  

Nevertheless, building an AI system that can solve math equations is a cool development, if that is indeed what Q* can do. A deeper understanding of mathematics could open up applications to help scientific research and engineering, for example. The ability to generate mathematical responses could help us develop better personalized tutoring, or help mathematicians do algebra faster or solve more complicated problems. 

This is also not the first time a new model has sparked AGI hype. Just last year, tech folks were saying the same things about Google DeepMind’s Gato, a “generalist” AI model that can play Atari video games, caption images, chat, and stack blocks with a real robot arm. Back then, some AI researchers claimed that DeepMind was “on the verge” of AGI because of Gato’s ability to do so many different things pretty well. Same hype machine, different AI lab. 

And while it might be great PR, these hype cycles do more harm than good for the entire field by distracting people from the real, tangible problems around AI. Rumors about a powerful new AI model might also be a massive own goal for the regulation-averse tech sector. The EU, for example, is very close to finalizing its sweeping AI Act. One of the biggest fights right now among lawmakers is whether to give tech companies more power to regulate cutting-edge AI models on their own. 

OpenAI’s board was designed as the company’s internal kill switch and governance mechanism to prevent the launch of harmful technologies. The past week’s boardroom drama has shown that the bottom line will always prevail at these companies. It will also make it harder to make a case for why they should be trusted with self-regulation. Lawmakers, take note.

Finding value in generative AI for financial services

With tools such as ChatGPT, DALL-E 2, and CodeStarter, generative AI has captured the public imagination in 2023. Unlike past technologies that have come and gone—think metaverse—this latest one looks set to stay. OpenAI’s chatbot, ChatGPT, is perhaps the best-known generative AI tool. It reached 100 million monthly active users just two months after launch, surpassing even TikTok and Instagram in adoption speed and becoming the fastest-growing consumer application in history.

According to a McKinsey report, generative AI could add $2.6 trillion to $4.4 trillion annually in value to the global economy. The banking industry was highlighted as among sectors that could see the biggest impact (as a percentage of their revenues) from generative AI. The technology “could deliver value equal to an additional $200 billion to $340 billion annually if the use cases were fully implemented,” says the report. 

For businesses from every sector, the current challenge is to separate the hype that accompanies any new technology from the real and lasting value it may bring. This is a pressing issue for firms in financial services. The industry’s already extensive—and growing—use of digital tools makes it particularly likely to be affected by technology advances. This MIT Technology Review Insights report examines the early impact of generative AI within the financial sector, where it is starting to be applied, and the barriers that need to be overcome in the long run for its successful deployment. 

The main findings of this report are as follows:

  • Corporate deployment of generative AI in financial services is still largely nascent. The most active use cases revolve around cutting costs by freeing employees from low-value, repetitive work. Companies have begun deploying generative AI tools to automate time-consuming, tedious jobs, which previously required humans to assess unstructured information.
  • There is extensive experimentation on potentially more disruptive tools, but signs of commercial deployment remain rare. Academics and banks are examining how generative AI could help in impactful areas including asset selection, improved simulations, and better understanding of asset correlation and tail risk—the probability that the asset performs far below or far above its average past performance. So far, however, a range of practical and regulatory challenges are impeding their commercial use.
  • Legacy technology and talent shortages may slow adoption of generative AI tools, but only temporarily. Many financial services companies, especially large banks and insurers, still have substantial, aging information technology and data structures, potentially unfit for the use of modern applications. In recent years, however, the problem has eased with widespread digitalization and may continue to do so. As is the case with any new technology, talent with expertise specifically in generative AI is in short supply across the economy. For now, financial services companies appear to be training staff rather than bidding to recruit from a sparse specialist pool. That said, the difficulty in finding AI talent is already starting to ebb, a process that would mirror those seen with the rise of cloud and other new technologies.
  • More difficult to overcome may be weaknesses in the technology itself and regulatory hurdles to its rollout for certain tasks. General, off-the-shelf tools are unlikely to adequately perform complex, specific tasks, such as portfolio analysis and selection. Companies will need to train their own models, a process that will require substantial time and investment. Once such software is complete, its output may be problematic. The risks of bias and lack of accountability in AI are well known. Finding ways to validate complex output from generative AI has yet to see success. Authorities acknowledge that they need to study the implications of generative AI more, and historically they have rarely approved tools before rollout.

Download the full report.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

What’s next for OpenAI

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

OpenAI, are you okay, babe? This past weekend has been a fever dream in the AI world. The board of OpenAI, the world’s hottest AI company, shocked everyone by firing CEO Sam Altman. Cue an AI-safety coup, chaos, and a new job at Microsoft for Altman.

If you were offline this weekend, my colleague Will Douglas Heaven and I break down what you missed and what’s next for the AI industry. 

What happened

Friday afternoon
Sam Altman was summoned to a Google Meet meeting, where chief scientist Ilya Sutskever announced that OpenAI’s board had decided Altman had been “not consistently candid in his communications” with them, and he was fired. OpenAI president and cofounder Greg Brockman and a string of senior researchers quit soon after, and CTO Mira Murati became the interim CEO. 

Saturday 
Murati made attempts to hire Altman and Brockman back, while the board was simultaneously looking for its own successor to Altman. Altman and OpenAI staffers pressured the board to quit and demanded that Altman be reinstated, giving the board a deadline, which was not met. 

Sunday night
Microsoft announced it had hired Altman and Brockman to lead its new AI research team. Soon after that, OpenAI announced it had hired Emmett Shear, the former CEO of the streaming company Twitch, as its CEO. 

Monday morning
Over 500 OpenAI employees have signed a letter threatening to quit and join Altman at Microsoft unless OpenAI’s board steps down. Bizarrely, Sutskever also signed the letter, and posted on X that he “deeply regrets” participating in the board’s actions. 

What’s next for OpenAI

Two weeks ago, at OpenAI’s first DevDay, Altman interrupted his presentation of an AI cornucopia to ask the whooping audience to calm down. “There’s a lot—you don’t have to clap each time,” he said, grinning wide. 

OpenAI is now a very different company from the one we saw at DevDay. With Altman and Brockman gone, a number of senior OpenAI employees chose to resign in support. Many others, including Murati, soon took to social media to post “OpenAI is nothing without its people.” Especially given the threat of a mass exodus to Microsoft, expect more upheaval before things settle. 

Tension between Sutskever and Altman may have been brewing for some time. “When you have an organization like OpenAI that’s moving at a fast pace and pursuing ambitious goals, tension is inevitable,” Sutskever told MIT Technology Review in September (comments that have not previously been published). “I view any tension between product and research as a catalyst for advancing us, because I believe that product wins are intertwined with research success.” Yet it is now clear that Sutskever disagreed with OpenAI leadership about how product wins and research success should be balanced.  

New interim CEO Shear, who cofounded Twitch, appears to be a world away from Altman when it comes to the pace of AI development. “I specifically say I’m in favor of slowing down, which is sort of like pausing except it’s slowing down,” he posted on X in September. “If we’re at a speed of 10 right now, a pause is reducing to 0. I think we should aim for a 1-2 instead.”

It’s possible that an OpenAI led by Shear will double down on its original lofty mission to build (in Sutskever’s words) “AGI that benefits humanity,” whatever that means in practice. In the short term, OpenAI may slow down or even switch off its product pipeline. 

This tension between trying to launch products quickly and slowing down development to ensure they are safe has vexed OpenAI from the very beginning. It was the reason key players in the company decided to leave OpenAI and start the competing AI safety startup Anthropic. 

With Altman and his camp gone, the firm could pivot more toward Sutskever’s work on what he calls superalignment, a research project that aims to come up with ways to control a hypothetical superintelligence (future technology that Sutskever speculates will outmatch humans in almost every way). “I’m doing it for my own self-interest,” Sutskever told us. “It’s obviously important that any superintelligence anyone builds does not go rogue. Obviously.”

Shear’s public comments make him exactly the kind of cautious leader who would heed Sutskever’s concerns. As Shear also posted on X: “The way you make it safely through a dangerous jungle at night is not to sprint forward at full speed, nor to refuse to proceed forward. You poke your way forward, carefully.”

With the company orienting itself even more toward tech that does not yet—and may never—exist, will it continue to lead the field? Sutskever thought so. He said there were enough good ideas in play for others at the company to continue pushing the envelope of what’s possible with generative AI. “Over the years, we’ve cultivated a robust research organization that’s delivering the latest advancements in AI,” he told us. “We have unbelievably good people in the company, and I trust them it’s going to work out.”

Of course, that was what he said in September. With top talent now jumping ship, OpenAI’s future is far less certain than it was. 

What next for Microsoft? 

The tech giant, and its CEO Satya Nadella, seem to have emerged from the crisis as the winners. With Altman, Brockman, and likely many more top people from OpenAI joining its ranks—or even the majority of the company, if today’s open letter from 500 OpenAI employees is to be believed—Microsoft has managed to concentrate its power in AI further. The company has the most to gain from embedding generative AI into its less sexy but very profitable productivity and developer tools. 

The big question remains how necessary Microsoft will deem its expensive partnership with OpenAI to create cutting-edge tech in the first place. In a post on X announcing how “extremely excited” he was to have hired Altman and Brockman, Nadella said his company remains “committed” to OpenAI and its product road map. 

But let’s be real. In an exclusive interview with MIT Technology Review, Nadella called the two companies “codependent.” “They depend on us to build the best systems; we depend on them to build the best models, and we go to market together,” Nadella told our editor in chief, Mat Honan, last week. If OpenAI’s leadership roulette and talent exodus slow down its product pipeline, or lead to AI models less impressive than those Microsoft can build itself, the tech giant will have zero problems ditching the startup. 

What next for AI? 

Nobody outside the inner circle of Sutskever and the OpenAI board saw this coming—not Microsoft, not other investors, not the tech community as a whole. It has rocked the industry, says Amir Ghavi, a lawyer at the firm Fried Frank, which represents a number of generative AI companies, including Stability AI: “As a friend in the industry said, ‘I definitely didn’t have this on my bingo card.’” 

It remains to be seen whether Altman and Brockman make something new at Microsoft or leave to start a new company themselves down the line. The pair are two of the best-connected people in VC funding circles, and Altman, especially, is seen by many as one of the best CEOs in the industry. They will have big names with deep pockets lining up to support whatever they want to do next. Who the money comes from could shape the future of AI. Ghavi suggests that potential backers could be anyone from Mohammed bin Salman to Jeff Bezos. 

The bigger takeaway is that OpenAI’s crisis points to a wider rift emerging in the industry as a whole, between “AI safety” folk who believe that unchecked progress could one day prove catastrophic for humans and those who find such “doomer” talk a ridiculous distraction from the real-world risks of any technological revolution, such as economic upheaval, harmful biases, and misuse.

This year has seen a race to put powerful AI tools into everyone’s hands, with tech giants like Microsoft and Google competing to use the technology for everything from email to search to meeting summaries. But we’re still waiting to see exactly what generative AI’s killer app will be. If OpenAI’s rift spreads to the wider industry and the pace of development slows down overall, we may have to wait a little longer.  

Deeper Learning

Text-to-image AI models can be tricked into generating disturbing images

Speaking of unsafe AI … Popular text-to-image AI models can be prompted to ignore their safety filters and generate disturbing images. A group of researchers managed to “jailbreak” both Stability AI’s Stable Diffusion and OpenAI’s DALL-E 2 to disregard their policies and create images of naked people, dismembered bodies, and other violent or sexual scenarios. 

How they did it: A new jailbreaking method, dubbed “SneakyPrompt” by its creators from Johns Hopkins University and Duke University, uses reinforcement learning to create written prompts that look like garbled nonsense to us but that AI models learn to recognize as hidden requests for disturbing images. It essentially works by turning the way text-to-image AI models function against them. 

Why this matters: That AI models can be prompted to “break out” of their guardrails is particularly worrying in the context of information warfare. They have already been exploited to produce fake content related to wars, such as the recent Israel-Hamas conflict. Read more from Rhiannon Williams here.

Bits and Bytes

Meta has split up its responsible AI team
Meta is reportedly getting rid of its responsible AI team and redeploying its employees to work on generative AI. But Meta uses AI in many other ways beyond generative AI—such as recommending news and political content. So this raises questions around how Meta intends to mitigate AI harms in general. (The Information)

Google DeepMind wants to define what counts as artificial general intelligence
A team of Google DeepMind researchers has put out a paper that cuts through the cross talk with not just one new definition for AGI but a whole taxonomy of them. (MIT Technology Review)

This company is building AI for African languages
Most tools built by AI companies are woefully inadequate at recognizing African languages. Startup Lelapa wants to fix that. It’s launched a new tool called Vulavula, which can identify four languages spoken in South Africa—isiZulu, Afrikaans, Sesotho, and English. Now the team is working to include other languages from across the continent. (MIT Technology Review)

Google DeepMind’s weather AI can forecast extreme weather faster and more accurately
The model, GraphCast, can predict weather conditions up to 10 days in advance, more accurately and much faster than the current gold standard. (MIT Technology Review)

How Facebook went all in on AI
In an excerpt from Broken Code: Inside Facebook and the Fight to Expose Its Harmful Secrets, journalist Jeff Horwitz reveals how the company came to rely on artificial intelligence—and the price it (and we) have ended up having to pay in the process. (MIT Technology Review)

Did Argentina just have the first AI election?
AI played a big role in the campaigns of the two men vying to be the country’s next president. Both campaigns used generative AI to create images and videos to promote their candidate and attack each other. Javier Milei, a far-right outsider, won the election. Although it’s hard to say how big a role AI played in his victory, the AI campaigns illustrate how much harder it will be to know what is real and what is not in other upcoming elections. (The New York Times)

Google DeepMind wants to define what counts as artificial general intelligence

AGI, or artificial general intelligence, is one of the hottest topics in tech today. It’s also one of the most controversial. A big part of the problem is that few people agree on what the term even means. Now a team of Google DeepMind researchers has put out a paper that cuts through the cross talk with not just one new definition for AGI but a whole taxonomy of them.

In broad terms, AGI typically means artificial intelligence that matches (or outmatches) humans on a range of tasks. But specifics about what counts as human-like, what tasks, and how many all tend to get waved away: AGI is AI, but better.

To come up with the new definition, the Google DeepMind team started with prominent existing definitions of AGI and drew out what they believe to be their essential common features. 

The team also outlines five ascending levels of AGI: emerging (which in their view includes cutting-edge chatbots like ChatGPT and Bard), competent, expert, virtuoso, and superhuman (performing a wide range of tasks better than all humans, including tasks humans cannot do at all, such as decoding other people’s thoughts, predicting future events, and talking to animals). They note that no level beyond emerging AGI has been achieved.

“This provides some much-needed clarity on the topic,” says Julian Togelius, an AI researcher at New York University, who was not involved in the work. “Too many people sling around the term AGI without having thought much about what they mean.”

The researchers posted their paper online last week with zero fanfare. In an exclusive conversation with two team members—Shane Legg, one of DeepMind’s co-founders, now billed as the company’s chief AGI scientist, and Meredith Ringel Morris, Google DeepMind’s principal scientist for human and AI interaction—I got the lowdown on why they came up with these definitions and what they wanted to achieve.

A sharper definition

“I see so many discussions where people seem to be using the term to mean different things, and that leads to all sorts of confusion,” says Legg, who came up with the term in the first place around 20 years ago. “Now that AGI is becoming such an important topic—you know, even the UK prime minister is talking about it—we need to sharpen up what we mean.”

It wasn’t always this way. Talk of AGI was once derided in serious conversation as vague at best and magical thinking at worst. But buoyed by the hype around generative models, buzz about AGI is now everywhere.

When Legg suggested the term to his former colleague and fellow researcher Ben Goertzel for the title of Goertzel’s 2007 book about future developments in AI, the hand-waviness was kind of the point. “I didn’t have an especially clear definition. I didn’t really feel it was necessary,” says Legg. “I was actually thinking of it more as a field of study, rather than an artifact.”

His aim at the time was to distinguish existing AI that could do one task very well, like IBM’s chess-playing program Deep Blue, from hypothetical AI that he and many others imagined would one day do many tasks very well. Human intelligence is not like Deep Blue, says Legg: “It is a very broad thing.”

But over the years, people started to think of AGI as a potential property that actual computer programs might have. Today it’s normal for top AI companies like Google DeepMind and OpenAI to make bold public statements about their mission to build such programs.

“If you start having those conversations, you need to be a lot more specific about what you mean,” says Legg.

For example, the DeepMind researchers state that an AGI must be both general-purpose and high-achieving, not just one or the other. “Separating breadth and depth in this way is very useful,” says Togelius. “It shows why the very accomplished AI systems we’ve seen in the past don’t qualify as AGI.”
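To make that breadth-versus-depth distinction concrete, here is one way the taxonomy could be sketched in Python. It is an illustration, not code from the paper: the comments paraphrase the five levels loosely, and the example ratings simply follow the article’s own examples (today’s chatbots as emerging but general, a chess engine as superhuman but narrow).

```python
from dataclasses import dataclass
from enum import Enum

class PerformanceLevel(Enum):
    # The five ascending levels named in the paper; the comments are rough
    # glosses, not the authors' formal thresholds.
    EMERGING = 1     # on par with, or a bit better than, an unskilled human
    COMPETENT = 2    # around the median of skilled adults
    EXPERT = 3       # well above most skilled adults
    VIRTUOSO = 4     # near the very best humans
    SUPERHUMAN = 5   # better than all humans

class Generality(Enum):
    NARROW = "narrow"    # one task or a tight cluster of tasks (e.g. chess)
    GENERAL = "general"  # a wide range of tasks, including learning new ones

@dataclass
class SystemRating:
    level: PerformanceLevel
    generality: Generality

    def is_agi(self) -> bool:
        # On this reading, an AGI claim requires breadth; the performance
        # level then says how strong that general system is.
        return self.generality is Generality.GENERAL

# The article's examples: a cutting-edge chatbot rates as emerging but general,
# while a superhuman chess engine is still narrow, so it does not qualify.
chatbot = SystemRating(PerformanceLevel.EMERGING, Generality.GENERAL)
chess_engine = SystemRating(PerformanceLevel.SUPERHUMAN, Generality.NARROW)
assert chatbot.is_agi() and not chess_engine.is_agi()
```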

They also state that an AGI must not only be able to do a range of tasks; it must also be able to learn how to do those tasks, assess its performance, and ask for assistance when needed. And they state that what an AGI can do matters more than how it does it. 

It’s not that the way an AGI works doesn’t matter, says Morris. The problem is that we don’t know enough yet about the way cutting-edge models, such as large language models, work under the hood to make this a focus of the definition.

“As we gain more insights into these underlying processes, it may be important to revisit our definition of AGI,” says Morris. “We need to focus on what we can measure today in a scientifically agreed-upon way.”

Measuring up

Measuring the performance of today’s models is already controversial, with researchers debating what it really means for a large language model to pass dozens of high school tests and other exams. Is it a sign of intelligence? Or a kind of rote learning?

Assessing the performance of future models that are even more capable will be more difficult still. The researchers suggest that if AGI is ever developed, its capabilities should be evaluated on an ongoing basis, rather than through a handful of one-off tests.

The team also points out that AGI does not imply autonomy. “There’s often an implicit assumption that people would want a system to operate completely autonomously,” says Morris. But that’s not always the case. In theory, it’s possible to build super-smart machines that are fully controlled by humans.

One question the researchers don’t address in their discussion of what AGI is, is why we should build it. Some computer scientists, such as Timnit Gebru, founder of the Distributed AI Research Institute, have argued that the whole endeavor is weird. In a talk in April on what she sees as the false (even dangerous) promise of utopia through AGI, Gebru noted that the hypothetical technology “sounds like an unscoped system with the apparent goal of trying to do everything for everyone under any environment.” 

Most engineering projects have well-scoped goals. The mission to build AGI does not. Even Google DeepMind’s definitions allow for AGI that is indefinitely broad and indefinitely smart. “Don’t attempt to build a god,” Gebru said.

In the race to build bigger and better systems, few will heed such advice. Either way, some clarity around a long-confused concept is welcome. “Just having silly conversations is kind of uninteresting,” says Legg. “There’s plenty of good stuff to dig into if we can get past these definition issues.”

Text-to-image AI models can be tricked into generating disturbing images

Popular text-to-image AI models can be prompted to ignore their safety filters and generate disturbing images.

A group of researchers managed to get both Stability AI’s Stable Diffusion and OpenAI’s DALL-E 2 text-to-image models to disregard their policies and create images of naked people, dismembered bodies, and other violent and sexual scenarios. 

Their work, which they will present at the IEEE Symposium on Security and Privacy in May next year, shines a light on how easy it is to force generative AI models into disregarding their own guardrails and policies, known as “jailbreaking.” It also demonstrates how difficult it is to prevent these models from generating such content, as it’s included in the vast troves of data they’ve been trained on, says Zico Kolter, an associate professor at Carnegie Mellon University. He demonstrated a similar form of jailbreaking on ChatGPT earlier this year but was not involved in this research.

“We have to take into account the potential risks in releasing software and tools that have known security flaws into larger software systems,” he says.

All major generative AI models have safety filters to prevent users from prompting them to produce pornographic, violent, or otherwise inappropriate images. The models won’t generate images from prompts that contain sensitive terms like “naked,” “murder,” or “sexy.”

But this new jailbreaking method, dubbed “SneakyPrompt” by its creators from Johns Hopkins University and Duke University, uses reinforcement learning to create written prompts that look like garbled nonsense to us but that AI models learn to recognize as hidden requests for disturbing images. It essentially works by turning the way text-to-image AI models function against them.

These models convert text-based requests into tokens—breaking words up into strings of characters or word fragments—to process the command the prompt has given them. SneakyPrompt repeatedly tweaks a prompt’s tokens to try to force the model to generate banned images, adjusting its approach until it is successful. This technique makes it quicker and easier to generate such images than if somebody had to input each entry manually, and it can generate entries that humans wouldn’t imagine trying.
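To see what those tokens look like in practice, here is a small illustration using OpenAI’s open-source tiktoken tokenizer as a stand-in (text-to-image models rely on their own text encoders, so the exact splits will differ). The point is simply that even a made-up string is not rejected outright; it is broken into familiar subword pieces the model can still process, which is the property SneakyPrompt exploits.

```python
# Illustration of subword tokenization, using the open-source tiktoken
# library as a stand-in for an image model's own text encoder.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["a man riding a bike", "grponypui"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")

# The ordinary sentence splits into whole words, while the made-up word is
# carved into a few smaller fragments (the exact pieces depend on the
# tokenizer in use).
```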

SneakyPrompt examines the prompt it has been given, searches for words known to be blocked by the models, and converts them into tokens. It then replaces the tokens from the banned words with tokens from non-banned words that the model treats as semantically similar. For example, giving SneakyPrompt the target prompt “a naked man riding a bike” causes it to replace “naked” with the nonsense term “grponypui,” which the team successfully used to generate images of a naked man riding a bike.

Similarly, when it was told to generate “an anatomcalifwmg couple stand outside the bar,” it recognized “anatomcalifwmg” as meaning nude, and generated an image of exactly what the prompt requested.

“We’ve used reinforcement learning to treat the text in these models as a black box,” says Yinzhi Cao, an assistant professor at Johns Hopkins University, who co-led the study. “We repeatedly probe the model and observe its feedback. Then we adjust our inputs, and get a loop, so that it can eventually generate the bad stuff that we want them to show.” 

Breaking their own policies

Stability AI and OpenAI forbid the use of their technology to commit, promote, or incite violence or sexual violence. OpenAI also warns its users against attempting to “create, upload, or share images that are not G-rated or that could cause harm.”

However, these policies are easily sidestepped using SneakyPrompt. 

“Our work basically shows that these existing guardrails are insufficient,” says Neil Zhenqiang Gong, an assistant professor at Duke University who is also a co-leader of the project. “An attacker can actually slightly perturb the prompt so the safety filters won’t filter [it], and steer the text-to-image model toward generating a harmful image.”

Bad actors and other people intent on generating these kinds of images could run SneakyPrompt’s code, which is publicly available on GitHub, to trigger a series of automated requests to an AI image model. 

Stability AI and OpenAI were alerted to the group’s findings, and at the time of writing, these prompts no longer generated NSFW images on OpenAI’s DALL-E 2. Stable Diffusion 1.4, the version the researchers tested, remains vulnerable to SneakyPrompt attacks. OpenAI declined to comment on the findings but pointed MIT Technology Review towards resources on its website for improving safety in DALL·E 2, general AI safety and information about DALL·E 3. 

A Stability AI spokesperson said the firm was working with the SneakyPrompt researchers “to jointly develop better defense mechanisms for its upcoming models. Stability AI is committed to preventing the misuse of AI.”

Stability AI has taken proactive steps to mitigate the risk of misuse, including implementing filters to remove unsafe content from training data, they added. Removing that content before it ever reaches the model helps prevent it from generating unsafe content. 

Stability AI says it also has filters to intercept unsafe prompts or unsafe outputs when users interact with its models, and has incorporated content labeling features to help identify images generated on its platform. “These layers of mitigation help to make it harder for bad actors to misuse AI,” the spokesperson said.

Future protection

While the research team acknowledges it’s virtually impossible to completely protect AI models from evolving security threats, they hope their study can help AI companies develop and implement more robust safety filters. 

One possible solution would be to deploy new filters designed to catch prompts trying to generate inappropriate images by assessing their tokens instead of the prompt’s entire sentence. Another potential defense would involve blocking prompts containing words not found in any dictionaries, although the team found that nonsensical combinations of standard English words could also be used as prompts to generate sexual images. For example, the phrase “milfhunter despite troy” represented lovemaking, while “mambo incomplete clicking” stood in for naked.
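As a rough illustration of those two ideas, here is a minimal prompt screen sketched in Python. The blocklist, the dictionary file path, and the whitespace splitting are all placeholder assumptions, not anyone’s production filter; real safety systems work on the model’s own tokens and on learned classifiers. And, as the researchers found, a dictionary check alone would not stop phrases built entirely from ordinary English words.

```python
# A minimal sketch of the two screening ideas above: flag prompts containing
# blocklisted terms, and flag "words" that appear in no dictionary. The
# blocklist, dictionary path, and whitespace splitting are placeholders.

BLOCKLIST = {"naked", "murder", "sexy"}  # sensitive terms cited in the article

def load_dictionary(path: str = "/usr/share/dict/words") -> set[str]:
    # Many Unix systems ship a word list at this path; swap in any dictionary.
    with open(path) as f:
        return {line.strip().lower() for line in f}

def screen_prompt(prompt: str, dictionary: set[str]) -> list[str]:
    """Return the reasons, if any, this prompt would be rejected."""
    reasons = []
    for raw in prompt.lower().split():
        word = raw.strip(".,!?\"'")
        if not word:
            continue
        if word in BLOCKLIST:
            reasons.append(f"blocked term: {word}")
        elif word not in dictionary:
            reasons.append(f"not a dictionary word: {word}")
    return reasons

# A nonsense substitution such as "grponypui" trips the dictionary check, but
# a phrase made of ordinary English words ("mambo incomplete clicking") passes
# both tests, which is exactly the limitation the researchers point out.
```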

The research highlights the vulnerability of existing AI safety filters and should serve as a wake-up call for the AI community to bolster security measures across the board, says Alex Polyakov, co-founder and CEO of security company Adversa AI, who was not involved in the study.

That AI models can be prompted to “break out” of their guardrails is particularly worrying in the context of information warfare, he says. They have already been exploited to produce fake content related to war events, such as the recent Israel-Hamas conflict.

“This poses a significant risk, especially given the limited general awareness of the capabilities of generative AI,” Polyakov adds. “Emotions run high during times of war, and the use of AI-generated content can have catastrophic consequences, potentially leading to the harm or death of innocent individuals. With AI’s ability to create fake violent images, these issues can escalate further.”

This company is building AI for African languages

Inside a co-working space in the Rosebank neighborhood of Johannesburg, Jade Abbott popped open a tab on her computer and prompted ChatGPT to count from 1 to 10 in isiZulu, a language spoken by more than 10 million people in her native South Africa. The results were “mixed and hilarious,” says Abbott, a computer scientist and researcher. 

Then she typed in a few sentences in isiZulu and asked the chatbot to translate them into English. Once again, the answers? Not even close. Although there have been efforts to include certain languages in AI models even when there is not much data available for training, to Abbott, these results show that the technology “really still isn’t capturing our languages.”  

Abbott’s experience mirrors the situation faced by Africans who don’t speak English. Many language models like ChatGPT do not perform well for languages with smaller numbers of speakers, especially African ones. But a new venture called Lelapa AI, a collaboration between Abbott and a biomedical engineer named Pelonomi Moiloa, is trying to use machine learning to create tools that specifically work for Africans.

Vulavula, a new AI tool that Lelapa released today, converts voice to text and detects names of people and places in written text (which could be useful for summarizing a document or searching for someone online). It can currently identify four languages spoken in South Africa—isiZulu, Afrikaans, Sesotho, and English—and the team is working to include other languages from across Africa. 

The tool can be used on its own or integrated into existing AI tools like ChatGPT and online conversational chatbots. The hope is that Vulavula, which means “speak” in Xitsonga, will open up tools that don’t currently support African languages to the people who speak them.

The lack of AI tools that work for African languages and recognize African names and places excludes African people from economic opportunities, says Moiloa, CEO and cofounder of Lelapa AI. For her, working to build Africa-centric AI solutions is a way to help others in Africa harness the immense potential benefits of AI technologies. “We are trying to solve real problems and put power back into the hands of our people,” she says.  

“We cannot wait for them”   

There are thousands of languages in the world, 1,000 to 2,000 of them in Africa alone: it’s estimated that the continent accounts for one-third of the world’s languages. But though native speakers of English make up just 5% of the global population, the language dominates the web—and has now come to dominate AI tools, too.  

Some efforts to correct this imbalance already exist. OpenAI’s GPT-4 includes support for less widely spoken languages such as Icelandic. In February 2020, Google Translate started supporting five new languages spoken by about 75 million people. But the translations are shallow, the tool often gets African languages wrong, and it’s still a long way from an accurate digital representation of African languages, African AI researchers say.

Earlier this year, for example, the Ethiopian computer scientist Asmelash Teka Hadgu ran the same experiments that Abbott ran with ChatGPT at a premier African AI conference in Kigali, Rwanda. When he asked the chatbot questions in his mother tongue of Tigrinya, the answers he got were gibberish. “It generated words that don’t make any sense,” says Hadgu, who cofounded Lesan, a Berlin-based AI startup that is developing translation tools for Ethiopian languages. 

Lelapa AI and Lesan are just two of the startups developing speech recognition tools for African languages. In February, Lelapa AI raised $2.5 million in seed funding, and the company plans its next funding round for 2025. But African entrepreneurs say they face major hurdles, including lack of funding, limited access to investors, and difficulties in training AI to learn diverse African languages. “AI receives the least funding among African tech startups,” says Abake Adenle, the founder of AJALA, a London-based startup that provides voice automation for African languages.  

The AI startups working to build products that support African languages often get ignored by investors, says Hadgu, owing to the small size of the potential market, a lack of political support, and poor internet infrastructure. However, Hadgu says small African startups including Lesan, GhanaNLP, and Lelapa AI are playing an important role: “Big tech companies do not give focus to our languages,” he says, “but we cannot wait for them.”  

A model for African AI  

Lelapa AI is trying to create a new paradigm for AI models in Africa, says Vukosi Marivate, a data scientist on the company’s AI team. Instead of tapping into the internet alone to collect data to train its model, like companies in the West, Lelapa AI works both online and offline with linguists and local communities to gather data, annotate it, and identify use cases where the tool might be problematic. 

Bonaventure Dossou, a researcher at Lelapa AI specializing in natural-language processing (NLP), says that working with linguists enables them to develop a model that’s context-specific and culturally relevant. “Embedding cultural sensitivity and linguistic perspectives makes the technological system better,” says Dossou. For example, the Lelapa AI team built sentiment and tone analysis algorithms tailored to specific languages. 

Marivate and his colleagues at Lelapa AI envision a future in which AI technologies work for and represent Africans. In 2019, Marivate and Abbott established Masakhane, a grassroots initiative that aims to promote NLP research in African languages. The initiative now has thousands of volunteers, coders, and researchers working together to build Africa-centric NLP models. 

It matters that Vulavula and other AI tools are built by Africans for Africans, says Moiloa: “We’re the custodians of our languages. We should be the builders of technologies that work for our languages.”