Google DeepMind is using Gemini to train agents inside Goat Simulator 3

Google DeepMind has built a new video-game-playing agent called SIMA 2 that can navigate and solve problems in a wide range of 3D virtual worlds. The company claims it’s a big step toward more general-purpose agents and better real-world robots.   

Google DeepMind first demoed SIMA (which stands for “scalable instructable multiworld agent”) last year. But SIMA 2 has been built on top of Gemini, the firm’s flagship large language model, which gives the agent a huge boost in capability.

The researchers claim that SIMA 2 can carry out a range of more complex tasks inside virtual worlds, figure out how to solve certain challenges by itself, and chat with its users. It can also improve itself by tackling harder tasks multiple times and learning through trial and error.

“Games have been a driving force behind agent research for quite a while,” Joe Marino, a research scientist at Google DeepMind, said in a press conference this week. He noted that even a simple action in a game, such as lighting a lantern, can involve multiple steps: “It’s a really complex set of tasks you need to solve to progress.”

The ultimate aim is to develop next-generation agents that are able to follow instructions and carry out open-ended tasks inside more complex environments than a web browser. In the long run, Google DeepMind wants to use such agents to drive real-world robots. Marino claimed that the skills SIMA 2 has learned, such as navigating an environment, using tools, and collaborating with humans to solve problems, are essential building blocks for future robot companions.

Unlike previous work on game-playing agents such as AlphaGo, which beat a Go grandmaster in 2016, or AlphaStar, which beat 99.8% of ranked human players at the video game StarCraft 2 in 2019, the idea behind SIMA is to train an agent to play an open-ended game without preset goals. Instead, the agent learns to carry out instructions given to it by people.

Humans control SIMA 2 via text chat, by talking to it out loud, or by drawing on the game’s screen. The agent takes in a video game’s pixels frame by frame and figures out what actions it needs to take to carry out its tasks.
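The article doesn’t detail SIMA 2’s internals, but the interface it describes (raw screen pixels plus a natural-language instruction in, keyboard and mouse actions out) can be sketched roughly as follows. Every name below is a hypothetical illustration, not Google DeepMind’s actual API.

```python
# Rough sketch of the interface described above: pixels and an instruction in,
# keyboard/mouse actions out. All names here are hypothetical illustrations,
# not Google DeepMind's actual code.
from dataclasses import dataclass, field

@dataclass
class Action:
    keys: list[str] = field(default_factory=list)  # e.g. ["w"] to walk forward
    mouse_dx: float = 0.0                           # horizontal camera movement
    mouse_dy: float = 0.0                           # vertical camera movement
    click: bool = False

class GameAgent:
    def __init__(self, policy):
        self.policy = policy   # the learned model; just a stand-in here
        self.instruction = ""

    def set_instruction(self, text: str) -> None:
        # Instructions arrive as chat text, transcribed speech, or a
        # description of something drawn on the screen.
        self.instruction = text

    def step(self, frame) -> Action:
        # One tick of play: a single video frame in, one action out.
        return self.policy(frame, self.instruction)

# Example with a stand-in policy that just walks forward:
agent = GameAgent(policy=lambda frame, instruction: Action(keys=["w"]))
agent.set_instruction("find the red house and light the lantern")
action = agent.step(frame=None)  # a real frame would be an array of pixels
```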

Like its predecessor, SIMA 2 was trained on footage of humans playing eight commercial video games, including No Man’s Sky and Goat Simulator 3, as well as three virtual worlds created by the company. The agent learned to match keyboard and mouse inputs to actions.

Hooked up to Gemini, the researchers claim, SIMA 2 is far better at following instructions (asking questions and providing updates as it goes) and figuring out for itself how to perform certain more complex tasks.  

Google DeepMind tested the agent inside environments it had never seen before. In one set of experiments, researchers asked Genie 3, the latest version of the firm’s world model, to produce environments from scratch and dropped SIMA 2 into them. They found that the agent was able to navigate and carry out instructions there.

The researchers also used Gemini to generate new tasks for SIMA 2. If the agent failed at first, Gemini generated tips that SIMA 2 took on board when it tried again. Repeating a task multiple times in this way often allowed SIMA 2 to improve by trial and error until it succeeded, Marino said.
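Pieced together from that description, the loop looks something like the sketch below; generate_task, attempt, and generate_tips are hypothetical stand-ins for Gemini proposing tasks and feedback and for SIMA 2 acting on them, not a published API.

```python
# Illustrative sketch of the trial-and-error loop described above. The objects
# and methods (gemini.generate_task, agent.attempt, gemini.generate_tips) are
# hypothetical stand-ins, not a published API.
MAX_ATTEMPTS = 5

def self_improvement_round(agent, gemini, environment) -> bool:
    task = gemini.generate_task(environment)       # Gemini proposes a new task
    tips = []
    for _ in range(MAX_ATTEMPTS):
        result = agent.attempt(task, hints=tips)   # SIMA 2 tries, using any tips so far
        if result.success:
            agent.learn_from(result)               # a successful run becomes training data
            return True
        tips = gemini.generate_tips(task, result)  # Gemini critiques the failed attempt
    return False
```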

Git gud

SIMA 2 is still an experiment. The agent struggles with complex tasks that require multiple steps and more time to complete. It also remembers only its most recent interactions (to make SIMA 2 more responsive, the team cut its long-term memory). It’s also still nowhere near as good as people at using a mouse and keyboard to interact with a virtual world.

Julian Togelius, an AI researcher at New York University who works on creativity and video games, thinks it’s an interesting result. Previous attempts at training a single system to play multiple games haven’t gone too well, he says. That’s because training models to control multiple games just by watching the screen isn’t easy: “Playing in real time from visual input only is ‘hard mode,’” he says.

In particular, Togelius calls out GATO, a previous system from Google DeepMind, which—despite being hyped at the time—could not transfer skills across a significant number of virtual environments.  

Still, he is open-minded about whether or not SIMA 2 could lead to better robots. “The real world is both harder and easier than video games,” he says. It’s harder because you can’t just press A to open a door. At the same time, a robot in the real world will know exactly what its body can and can’t do at any time. That’s not the case in video games, where the rules inside each virtual world can differ.

Others are more skeptical. Matthew Guzdial, an AI researcher at the University of Alberta, isn’t too surprised that SIMA 2 can play many different video games. He notes that most games have very similar keyboard and mouse controls: Learn one and you learn them all. “If you put a game with weird input in front of it, I don’t think it’d be able to perform well,” he says.

Guzdial also questions how much of what SIMA 2 has learned would really carry over to robots. “It’s much harder to understand visuals from cameras in the real world compared to games, which are designed with easily parsable visuals for human players,” he says.

Still, Marino and his colleagues hope to continue their work with Genie 3 to allow the agent to improve inside a kind of endless virtual training dojo, where Genie generates worlds for SIMA to learn in via trial and error guided by Gemini’s feedback. “We’ve kind of just scratched the surface of what’s possible,” he said at the press conference.  

OpenAI’s new LLM exposes the secrets of how AI really works

ChatGPT maker OpenAI has built an experimental large language model that is far easier to understand than typical models.

That’s a big deal, because today’s LLMs are black boxes: Nobody fully understands how they do what they do. Building a model that is more transparent sheds light on how LLMs work in general, helping researchers figure out why models hallucinate, why they go off the rails, and just how far we should trust them with critical tasks.

“As these AI systems get more powerful, they’re going to get integrated more and more into very important domains,” Leo Gao, a research scientist at OpenAI, told MIT Technology Review in an exclusive preview of the new work. “It’s very important to make sure they’re safe.”

This is still early research. The new model, called a weight-sparse transformer, is far smaller and far less capable than top-tier mass-market models like the firm’s GPT-5, Anthropic’s Claude, and Google DeepMind’s Gemini. At most it’s as capable as GPT-1, a model that OpenAI developed back in 2018, says Gao (though he and his colleagues haven’t done a direct comparison).    

But the aim isn’t to compete with the best in class (at least, not yet). Instead, by looking at how this experimental model works, OpenAI hopes to learn about the hidden mechanisms inside those bigger and better versions of the technology.

It’s interesting research, says Elisenda Grigsby, a mathematician at Boston College who studies how LLMs work and who was not involved in the project: “I’m sure the methods it introduces will have a significant impact.” 

Lee Sharkey, a research scientist at AI startup Goodfire, agrees. “This work aims at the right target and seems well executed,” he says.

Why models are so hard to understand

OpenAI’s work is part of a hot new field of research known as mechanistic interpretability, which is trying to map the internal mechanisms that models use when they carry out different tasks.

That’s harder than it sounds. LLMs are built from neural networks, which consist of nodes, called neurons, arranged in layers. In most networks, each neuron is connected to every neuron in the adjacent layers. Such a network is known as a dense network.

Dense networks are relatively efficient to train and run, but they spread what they learn across a vast knot of connections. The result is that simple concepts or functions can be split up between neurons in different parts of a model. At the same time, specific neurons can also end up representing multiple different features, a phenomenon known as superposition (a term borrowed from quantum physics). The upshot is that you can’t relate specific parts of a model to specific concepts.

“Neural networks are big and complicated and tangled up and very difficult to understand,” says Dan Mossing, who leads the mechanistic interpretability team at OpenAI. “We’ve sort of said: ‘Okay, what if we tried to make that not the case?’”

Instead of building a model using a dense network, OpenAI started with a type of neural network known as a weight-sparse transformer, in which each neuron is connected to only a few other neurons. This forced the model to represent features in localized clusters rather than spread them out.
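As a toy illustration of the difference (not OpenAI’s architecture or code): a dense layer connects every input unit to every output unit, while a weight-sparse layer fixes most of those connections to zero, so each unit touches only a handful of others.

```python
# Toy contrast between a dense layer and a weight-sparse layer, for intuition
# only; this is not OpenAI's model or training setup.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 8

# Dense layer: every input neuron connects to every output neuron.
W_dense = rng.normal(size=(n_out, n_in))

# Weight-sparse layer: a fixed mask keeps roughly 10% of the connections and
# forces the rest of the weights to exactly zero.
keep_fraction = 0.1
mask = rng.random((n_out, n_in)) < keep_fraction
W_sparse = W_dense * mask

x = rng.normal(size=n_in)
print("dense output: ", W_dense @ x)
print("sparse output:", W_sparse @ x)
print("nonzero weights:", np.count_nonzero(W_dense), "vs", np.count_nonzero(W_sparse))
```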

Their model is far slower than any LLM on the market. But it is easier to relate its neurons or groups of neurons to specific concepts and functions. “There’s a really drastic difference in how interpretable the model is,” says Gao.

Gao and his colleagues have tested the new model with very simple tasks. For example, they asked it to complete a block of text that opens with quotation marks by adding matching marks at the end.  

It’s a trivial request for an LLM. The point is that figuring out how a model does even a straightforward task like that involves unpicking a complicated tangle of neurons and connections, says Gao. But with the new model, they were able to follow the exact steps the model took.

“We actually found a circuit that’s exactly the algorithm you would think to implement by hand, but it’s fully learned by the model,” he says. “I think this is really cool and exciting.”
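The learned circuit itself isn’t reproduced in the article, but the hand-written version of the task Gao is comparing it to, or at least one guess at “the algorithm you would think to implement by hand,” is only a few lines: note which quotation mark opened the text and emit its counterpart at the end.

```python
# One guess at the hand-written version of the quotation-mark task; the learned
# circuit itself isn't published here, so this is only a point of comparison.
MATCHING = {'"': '"', "'": "'", "\u201c": "\u201d", "\u2018": "\u2019"}

def close_quote(text: str) -> str:
    # If the text opens with a quotation mark, append the matching closing mark.
    if text and text[0] in MATCHING:
        return text + MATCHING[text[0]]
    return text

print(close_quote('"Hello, world'))       # -> "Hello, world"
print(close_quote("\u201cHello, world"))  # -> curly-quoted text, closed with its curly counterpart
```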

Where will the research go next? Grigsby is not convinced the technique would scale up to larger models that have to handle a variety of more difficult tasks.    

Gao and Mossing acknowledge that this is a big limitation of the model they have built so far and agree that the approach will never lead to models that match the performance of cutting-edge products like GPT-5. And yet OpenAI thinks it might be able to improve the technique enough to build a transparent model on a par with GPT-3, the firm’s breakthrough 2020 LLM.

“Maybe within a few years, we could have a fully interpretable GPT-3, so that you could go inside every single part of it and you could understand how it does every single thing,” says Gao. “If we had such a system, we would learn so much.”