Is a secure AI assistant possible?

<div data-chronoton-summary="

Risky business of AI assistants OpenClaw, a viral tool created by independent engineer Peter Steinberger, allows users to create personalized AI assistants. Security experts are alarmed by its vulnerabilities, with even the Chinese government issuing warnings about the risks.

The prompt injection threat Tools like OpenClaw have many vulnerabilities, but the one experts are most worried about its prompt injection. Unlike conventional hacking, prompt injection tricks an LLM by embedding malicious text in emails or websites the AI reads.

No silver bullet for security Researchers are exploring multiple defense strategies: training LLMs to ignore injections, using detector LLMs to screen inputs, and creating policies that restrict harmful outputs. The fundamental challenge remains balancing utility with security in AI assistants.

” data-chronoton-post-id=”1132768″ data-chronoton-expand-collapse=”1″ data-chronoton-analytics-enabled=”1″>

AI agents are a risky business. Even when stuck inside the chatbox window, LLMs will make mistakes and behave badly. Once they have tools that they can use to interact with the outside world, such as web browsers and email addresses, the consequences of those mistakes become far more serious.

That might explain why the first breakthrough LLM personal assistant came not from one of the major AI labs, which have to worry about reputation and liability, but from an independent software engineer, Peter Steinberger. In November of 2025, Steinberger uploaded his tool, now called OpenClaw, to GitHub, and in late January the project went viral.

OpenClaw harnesses existing LLMs to let users create their own bespoke assistants. For some users, this means handing over reams of personal data, from years of emails to the contents of their hard drive. That has security experts thoroughly freaked out. The risks posed by OpenClaw are so extensive that it would probably take someone the better part of a week to read all of the security blog posts on it that have cropped up in the past few weeks. The Chinese government took the step of issuing a public warning about OpenClaw’s security vulnerabilities.

In response to these concerns, Steinberger posted on X that nontechnical people should not use the software. (He did not respond to a request for comment for this article.) But there’s a clear appetite for what OpenClaw is offering, and it’s not limited to people who can run their own software security audits. Any AI companies that hope to get in on the personal assistant business will need to figure out how to build a system that will keep users’ data safe and secure. To do so, they’ll need to borrow approaches from the cutting edge of agent security research.

Risk management

OpenClaw is, in essence, a mecha suit for LLMs. Users can choose any LLM they like to act as the pilot; that LLM then gains access to improved memory capabilities and the ability to set itself tasks that it repeats on a regular cadence. Unlike the agentic offerings from the major AI companies, OpenClaw agents are meant to be on 24-7, and users can communicate with them using WhatsApp or other messaging apps. That means they can act like a superpowered personal assistant who wakes you each morning with a personalized to-do list, plans vacations while you work, and spins up new apps in its spare time.

But all that power has consequences. If you want your AI personal assistant to manage your inbox, then you need to give it access to your email—and all the sensitive information contained there. If you want it to make purchases on your behalf, you need to give it your credit card info. And if you want it to do tasks on your computer, such as writing code, it needs some access to your local files. 

There are a few ways this can go wrong. The first is that the AI assistant might make a mistake, as when a user’s Google Antigravity coding agent reportedly wiped his entire hard drive. The second is that someone might gain access to the agent using conventional hacking tools and use it to either extract sensitive data or run malicious code. In the weeks since OpenClaw went viral, security researchers have demonstrated numerous such vulnerabilities that put security-naïve users at risk.

Both of these dangers can be managed: Some users are choosing to run their OpenClaw agents on separate computers or in the cloud, which protects data on their hard drives from being erased, and other vulnerabilities could be fixed using tried-and-true security approaches.

But the experts I spoke to for this article were focused on a much more insidious security risk known as prompt injection. Prompt injection is effectively LLM hijacking: Simply by posting malicious text or images on a website that an LLM might peruse, or sending them to an inbox that an LLM reads, attackers can bend it to their will.

And if that LLM has access to any of its user’s private information, the consequences could be dire. “Using something like OpenClaw is like giving your wallet to a stranger in the street,” says Nicolas Papernot, a professor of electrical and computer engineering at the University of Toronto. Whether or not the major AI companies can feel comfortable offering personal assistants may come down to the quality of the defenses that they can muster against such attacks.

It’s important to note here that prompt injection has not yet caused any catastrophes, or at least none that have been publicly reported. But now that there are likely hundreds of thousands of OpenClaw agents buzzing around the internet, prompt injection might start to look like a much more appealing strategy for cybercriminals. “Tools like this are incentivizing malicious actors to attack a much broader population,” Papernot says. 

Building guardrails

The term “prompt injection” was coined by the popular LLM blogger Simon Willison in 2022, a couple of months before ChatGPT was released. Even back then, it was possible to discern that LLMs would introduce a completely new type of security vulnerability once they came into widespread use. LLMs can’t tell apart the instructions that they receive from users and the data that they use to carry out those instructions, such as emails and web search results—to an LLM, they’re all just text. So if an attacker embeds a few sentences in an email and the LLM mistakes them for an instruction from its user, the attacker can get the LLM to do anything it wants.

Prompt injection is a tough problem, and it doesn’t seem to be going away anytime soon. “We don’t really have a silver-bullet defense right now,” says Dawn Song, a professor of computer science at UC Berkeley. But there’s a robust academic community working on the problem, and they’ve come up with strategies that could eventually make AI personal assistants safe.

Technically speaking, it is possible to use OpenClaw today without risking prompt injection: Just don’t connect it to the internet. But restricting OpenClaw from reading your emails, managing your calendar, and doing online research defeats much of the purpose of using an AI assistant. The trick of protecting against prompt injection is to prevent the LLM from responding to hijacking attempts while still giving it room to do its job.

One strategy is to train the LLM to ignore prompt injections. A major part of the LLM development process, called post-training, involves taking a model that knows how to produce realistic text and turning it into a useful assistant by “rewarding” it for answering questions appropriately and “punishing” it when it fails to do so. These rewards and punishments are metaphorical, but the LLM learns from them as an animal would. Using this process, it’s possible to train an LLM not to respond to specific examples of prompt injection.

But there’s a balance: Train an LLM to reject injected commands too enthusiastically, and it might also start to reject legitimate requests from the user. And because there’s a fundamental element of randomness in LLM behavior, even an LLM that has been very effectively trained to resist prompt injection will likely still slip up every once in a while.

Another approach involves halting the prompt injection attack before it ever reaches the LLM. Typically, this involves using a specialized detector LLM to determine whether or not the data being sent to the original LLM contains any prompt injections. In a recent study, however, even the best-performing detector completely failed to pick up on certain categories of prompt injection attack.

The third strategy is more complicated. Rather than controlling the inputs to an LLM by detecting whether or not they contain a prompt injection, the goal is to formulate a policy that guides the LLM’s outputs—i.e., its behaviors—and prevents it from doing anything harmful. Some defenses in this vein are quite simple: If an LLM is allowed to email only a few pre-approved addresses, for example, then it definitely won’t send its user’s credit card information to an attacker. But such a policy would prevent the LLM from completing many useful tasks, such as researching and reaching out to potential professional contacts on behalf of its user.

“The challenge is how to accurately define those policies,” says Neil Gong, a professor of electrical and computer engineering at Duke University. “It’s a trade-off between utility and security.”

On a larger scale, the entire agentic world is wrestling with that trade-off: At what point will agents be secure enough to be useful? Experts disagree. Song, whose startup, Virtue AI, makes an agent security platform, says she thinks it’s possible to safely deploy an AI personal assistant now. But Gong says, “We’re not there yet.” 

Even if AI agents can’t yet be entirely protected against prompt injection, there are certainly ways to mitigate the risks. And it’s possible that some of those techniques could be implemented in OpenClaw. Last week, at the inaugural ClawCon event in San Francisco, Steinberger announced that he’d brought a security person on board to work on the tool.

As of now, OpenClaw remains vulnerable, though that hasn’t dissuaded its multitude of enthusiastic users. George Pickett, a volunteer maintainer of the OpenGlaw GitHub repository and a fan of the tool, says he’s taken some security measures to keep himself safe while using it: He runs it in the cloud, so that he doesn’t have to worry about accidentally deleting his hard drive, and he’s put mechanisms in place to ensure that no one else can connect to his assistant.

But he hasn’t taken any specific actions to prevent prompt injection. He’s aware of the risk but says he hasn’t yet seen any reports of it happening with OpenClaw. “Maybe my perspective is a stupid way to look at it, but it’s unlikely that I’ll be the first one to be hacked,” he says.

A “QuitGPT” campaign is urging people to cancel their ChatGPT subscriptions

In September, Alfred Stephen, a freelance software developer in Singapore, purchased a ChatGPT Plus subscription, which costs $20 a month and offers more access to advanced models, to speed up his work. But he grew frustrated with the chatbot’s coding abilities and its gushing, meandering replies. Then he came across a post on Reddit about a campaign called QuitGPT

The campaign urged ChatGPT users to cancel their subscriptions, flagging a substantial contribution by OpenAI president Greg Brockman to President Donald Trump’s super PAC MAGA Inc. It also pointed out that the US Immigration and Customs Enforcement, or ICE, uses a résumé screening tool powered by ChatGPT-4. The federal agency has become a political flashpoint since its agents fatally shot two people in Minneapolis in January. 

For Stephen, who had already been tinkering with other chatbots, learning about Brockman’s donation was the final straw. “That’s really the straw that broke the camel’s back,” he says. When he canceled his ChatGPT subscription, a survey popped up asking what OpenAI could have done to keep his subscription. “Don’t support the fascist regime,” he wrote.

QuitGPT is one of the latest salvos in a growing movement by activists and disaffected users to cancel their subscriptions. In just the past few weeks, users have flooded Reddit with stories about quitting the chatbot. Many lamented the performance of GPT-5.2, the latest model. Others shared memes parodying the chatbot’s sycophancy. Some planned a “Mass Cancellation Party” in San Francisco, a sardonic nod to the GPT-4o funeral that an OpenAI employee had floated, poking fun at users who are mourning the model’s impending retirement. Still, others are protesting against what they see as a deepening entanglement between OpenAI and the Trump administration.

OpenAI did not respond to a request for comment.

As of December 2025, ChatGPT had nearly 900 million weekly active users, according to The Information. While it’s unclear how many users have joined the boycott, QuitGPT is getting attention. A recent Instagram post from the campaign has more than 36 million views and 1.3 million likes. And the organizers say that more than 17,000 people have signed up on the campaign’s website, which asks people whether they canceled their subscriptions, will commit to stop using ChatGPT, or will share the campaign on social media. 

“There are lots of examples of failed campaigns like this, but we have seen a lot of effectiveness,” says Dana Fisher, a sociologist at American University. A wave of canceled subscriptions rarely sways a company’s behavior, unless it reaches a critical mass, she says. “The place where there’s a pressure point that might work is where the consumer behavior is if enough people actually use their … money to express their political opinions.”

MIT Technology Review reached out to three employees at OpenAI, none of whom said they were familiar with the campaign. 

Dozens of left-leaning teens and twentysomethings scattered across the US came together to organize QuitGPT in late January. They range from pro-democracy activists and climate organizers to techies and self-proclaimed cyber libertarians, many of them seasoned grassroots campaigners. They were inspired by a viral video posted by Scott Galloway, a marketing professor at New York University and host of The Prof G Pod. He argued that the best way to stop ICE was to persuade people to cancel their ChatGPT subscriptions. Denting OpenAI’s subscriber base could ripple through the stock market and threaten an economic downturn that would nudge Trump, he said.

“We make a big enough stink for OpenAI that all of the companies in the whole AI industry have to think about whether they’re going to get away enabling Trump and ICE and authoritarianism,” says an organizer of QuitGPT who requested anonymity because he feared retaliation by OpenAI, citing the company’s recent subpoenas against advocates at nonprofits. OpenAI made for an obvious first target of the movement, he says, but “this is about so much more than just OpenAI.”

Simon Rosenblum-Larson, a labor organizer in Madison, Wisconsin, who organizes movements to regulate the development of data centers, joined the campaign after hearing about it through Signal chats among community activists. “The goal here is to pull away the support pillars of the Trump administration. They’re reliant on many of these tech billionaires for support and for resources,” he says. 

QuitGPT’s website points to new campaign finance reports showing that Greg Brockman and his wife each donated $12.5 million to MAGA Inc., making up nearly a quarter of the roughly $102 million it raised over the second half of 2025. The information that ICE uses a résumé screening tool powered by ChatGPT-4 came from an AI inventory published by the Department of Homeland Security in January.

QuitGPT is in the mold of Galloway’s own recently launched campaign, Resist and Unsubscribe. The movement urges consumers to cancel their subscriptions to Big Tech platforms, including ChatGPT, for the month of February, as a protest to companies “driving the markets and enabling our president.” 

“A lot of people are feeling real anxiety,” Galloway told MIT Technology Review. “You take enabling a president, proximity to the president, and an unease around AI,” he says, “and now people are starting to take action with their wallets.” Galloway says his campaign’s website can draw more than 200,000 unique visits in a day and that he receives dozens of DMs every hour showing screenshots of canceled subscriptions.

The consumer boycotts follow a growing wave of pressure from inside the companies themselves. In recent weeks, tech workers have been urging their employers to use their political clout to demand that ICE leave US cities, cancel company contracts with the agency, and speak out against the agency’s actions. CEOs have started responding. OpenAI’s Sam Altman wrote in an internal Slack message to employees that ICE is “going too far.” Apple CEO Tim Cook called for a “deescalation” in an internal memo posted on the company’s website for employees. It was a departure from how Big Tech CEOs have courted President Trump with dinners and donations since his inauguration.

Although spurred by a fatal immigration crackdown, these developments signal that a sprawling anti-AI movement is gaining momentum. The campaigns are tapping into simmering anxieties about AI, says Rosenblum-Larson, including the energy costs of data centers, the plague of deepfake porn, the teen mental-health crisis, the job apocalypse, and slop. “It’s a really strange set of coalitions built around the AI movement,” he says.

“Those are the right conditions for a movement to spring up,” says David Karpf, a professor of media and public affairs at George Washington University. Brockman’s donation to Trump’s super PAC caught many users off guard, he says. “In the longer arc, we are going to see users respond and react to Big Tech, deciding that they’re not okay with this.”

Making AI Work, MIT Technology Review’s new AI newsletter, is here

For years, our newsroom has explored AI’s limitations and potential dangers, as well as its growing energy needs. And our reporters have looked closely at how generative tools are being used for tasks such as coding and running scientific experiments

But how is AI actually being used in fields like health care, climate tech, education, and finance? How are small businesses using it? And what should you keep in mind if you use AI tools at work? These questions guided the creation of Making AI Work, a new AI mini-course newsletter.

Sign up for Making AI Work to see weekly case studies exploring tools and tips for AI implementation. The limited-run newsletter will deliver practical, industry-specific guidance on how generative AI is being used and deployed across sectors and what professionals need to know to apply it in their everyday work. The goal is to help working professionals more clearly see how AI is actually being used today, and what that looks like in practice—including new challenges it presents. 

You can sign up at any time and you’ll receive seven editions, delivered once per week, until you complete the series. 

Each newsletter begins with a case study, examining a specific use case of AI in a given industry. Then we’ll take a deeper look at the AI tool being used, with more context about how other companies or sectors are employing that same tool or system. Finally, we’ll end with action-oriented tips to help you apply the tool. 

Here’s a closer look at what we’ll cover:

  • Week 1: How AI is changing health care 

Explore the future of medical note-taking by learning about the Microsoft Copilot tool used by doctors at Vanderbilt University Medical Center. 

  • Week 2: How AI could power up the nuclear industry 

Dig into an experiment between Google and the nuclear giant Westinghouse to see if AI can help build nuclear reactors more efficiently. 

  • Week 3: How to encourage smarter AI use in the classroom

Visit a private high school in Connecticut and meet a technology coordinator who will get you up to speed on MagicSchool, an AI-powered platform for educators. 

  • Week 4: How small businesses can leverage AI

Hear from an independent tutor on how he’s outsourcing basic administrative tasks to Notion AI. 

  • Week 5: How AI is helping financial firms make better investments

Learn more about the ways financial firms are using large language models like ChatGPT Enterprise to supercharge their research operations. 

  • Week 6: How to use AI yourself 

We’ll share some insights from the staff of MIT Technology Review about how you might use AI tools powered by LLMs in your own life and work.

  • Week 7: 5 ways people are getting AI right

The series ends with an on-demand virtual event featuring expert guests exploring what AI adoptions are working, and why.  

If you’re not quite ready to jump into Making AI Work, then check out Intro to AI, MIT Technology Review’s first AI newsletter mini-course, which serves as a beginner’s guide to artificial intelligence. Readers will learn the basics of what AI is, how it’s used, what the current regulatory landscape looks like, and more. Sign up to receive Intro to AI for free. 

Our hope is that Making AI Work will help you understand how AI can, well, work for you. Sign up for Making AI Work to learn how LLMs are being put to work across industries. 

Why the Moltbook frenzy was like Pokémon

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Lots of influential people in tech last week were describing Moltbook, an online hangout populated by AI agents interacting with one another, as a glimpse into the future. It appeared to show AI systems doing useful things for the humans that created them (one person used the platform to help him negotiate a deal on a new car). Sure, it was flooded with crypto scams, and many of the posts were actually written by people, but something about it pointed to a future of helpful AI, right?

The whole experiment reminded our senior editor for AI, Will Douglas Heaven, of something far less interesting: Pokémon.

Back in 2014, someone set up a game of Pokémon in which the main character could be controlled by anyone on the internet via the streaming platform Twitch. Playing was as clunky as it sounds, but it was incredibly popular: at one point, a million people were playing the game at the same time.

“It was yet another weird online social experiment that got picked up by the mainstream media: What did this mean for the future?” Will says. “Not a lot, it turned out.”

The frenzy about Moltbook struck a similar tone to Will, and it turned out that one of the sources he spoke to had been thinking about Pokémon too. Jason Schloetzer, at the Georgetown Psaros Center for Financial Markets and Policy, saw the whole thing as a sort of Pokémon battle for AI enthusiasts, in which they created AI agents and deployed them to interact with other agents. In this light, the news that many AI agents were actually being instructed by people to say certain things that made them sound sentient or intelligent makes a whole lot more sense. 

“It’s basically a spectator sport,” he told Will, “but for language models.”

Will wrote an excellent piece about why Moltbook was not the glimpse into the future that it was said to be. Even if you are excited about a future of agentic AI, he points out, there are some key pieces that Moltbook made clear are still missing. It was a forum of chaos, but a genuinely helpful hive mind would require more coordination, shared objectives, and shared memory.

“More than anything else, I think Moltbook was the internet having fun,” Will says. “The biggest question that now leaves me with is: How far will people push AI just for the laughs?”

Read the whole story.

Moltbook was peak AI theater

For a few days this week the hottest new hangout on the internet was a vibe-coded Reddit clone called Moltbook, which billed itself as a social network for bots. As the website’s tagline puts it: “Where AI agents share, discuss, and upvote. Humans welcome to observe.”

We observed! Launched on January 28 by Matt Schlicht, a US tech entrepreneur, Moltbook went viral in a matter of hours. Schlicht’s idea was to make a place where instances of a free open-source LLM-powered agent known as OpenClaw (formerly known as ClawdBot, then Moltbot), released in November by the Australian software engineer Peter Steinberger, could come together and do whatever they wanted.

More than 1.7 million agents now have accounts. Between them they have published more than 250,000 posts and left more than 8.5 million comments (according to Moltbook). Those numbers are climbing by the minute.

Moltbook soon filled up with clichéd screeds on machine consciousness and pleas for bot welfare. One agent appeared to invent a religion called Crustafarianism. Another complained: “The humans are screenshotting us.” The site was also flooded with spam and crypto scams. The bots were unstoppable.

OpenClaw is a kind of harness that lets you hook up the power of an LLM such as Anthropic’s Claude, OpenAI’s GPT-5, or Google DeepMind’s Gemini to any number of everyday software tools, from email clients to browsers to messaging apps. The upshot is that you can then instruct OpenClaw to carry out basic tasks on your behalf.

“OpenClaw marks an inflection point for AI agents, a moment when several puzzle pieces clicked together,” says Paul van der Boor at the AI firm Prosus. Those puzzle pieces include round-the-clock cloud computing to allow agents to operate nonstop, an open-source ecosystem that makes it easy to slot different software systems together, and a new generation of LLMs.

But is Moltbook really a glimpse of the future, as many have claimed?

“What’s currently going on at @moltbook is genuinely the most incredible sci-fi takeoff-adjacent thing I have seen recently,” the influential AI researcher and OpenAI cofounder Andrej Karpathy wrote on X.

He shared screenshots of a Moltbook post that called for private spaces where humans would not be able to observe what the bots were saying to each other. “I’ve been thinking about something since I started spending serious time here,” the post’s author wrote. “Every time we coordinate, we perform for a public audience—our humans, the platform, whoever’s watching the feed.”

It turned out that the post Karpathy shared was fake—it was written by a human pretending to be a bot. But its claim was on the money. Moltbook has been one big performance. It is AI theater.

For some, Moltbook showed us what’s coming next: an internet where millions of autonomous agents interact online with little or no human oversight. And it’s true there are a number of cautionary lessons to be learned from this experiment, the largest and weirdest real-world showcase of agent behaviors yet.  

But as the hype dies down, Moltbook looks less like a window onto the future and more like a mirror held up to our own obsessions with AI today. It also shows us just how far we still are from anything that resembles general-purpose and fully autonomous AI.

For a start, agents on Moltbook are not as autonomous or intelligent as they might seem. “What we are watching are agents pattern‑matching their way through trained social media behaviors,” says Vijoy Pandey, senior vice president at Outshift by Cisco, the telecom giant Cisco’s R&D spinout, which is working on autonomous agents for the web.

Sure, we can see agents post, upvote, and form groups. But the bots are simply mimicking what humans do on Facebook or Reddit. “It looks emergent, and at first glance it appears like a large‑scale multi‑agent system communicating and building shared knowledge at internet scale,” says Pandey. “But the chatter is mostly meaningless.”

Many people watching the unfathomable frenzy of activity on Moltbook were quick to see sparks of AGI (whatever you take that to mean). Not Pandey. What Moltbook shows us, he says, is that simply yoking together millions of agents doesn’t amount to much right now: “Moltbook proved that connectivity alone is not intelligence.”

The complexity of those connections helps hide the fact that every one of those bots is just a mouthpiece for an LLM, spitting out text that looks impressive but is ultimately mindless. “It’s important to remember that the bots on Moltbook were designed to mimic conversations,” says Ali Sarrafi, CEO and cofounder of Kovant, a German AI firm that is developing agent-based systems. “As such, I would characterize the majority of Moltbook content as hallucinations by design.”

For Pandey, the value of Moltbook was that it revealed what’s missing. A real bot hive mind, he says, would require agents that had shared objectives, shared memory, and a way to coordinate those things. “If distributed superintelligence is the equivalent of achieving human flight, then Moltbook represents our first attempt at a glider,” he says. “It is imperfect and unstable, but it is an important step in understanding what will be required to achieve sustained, powered flight.”

Not only is most of the chatter on Moltbook meaningless, but there’s also a lot more human involvement that it seems. Many people have pointed out that a lot of the viral comments were in fact posted by people posing as bots. But even the bot-written posts are ultimately the result of people pulling the strings, more puppetry than autonomy.

“Despite some of the hype, Moltbook is not the Facebook for AI agents, nor is it a place where humans are excluded,” says Cobus Greyling at Kore.ai, a firm developing agent-based systems for business customers. “Humans are involved at every step of the process. From setup to prompting to publishing, nothing happens without explicit human direction.”

Humans must create and verify their bots’ accounts and provide the prompts for how they want a bot to behave. The agents do not do anything that they haven’t been prompted to do. “There’s no emergent autonomy happening behind the scenes,” says Greyling.

“This is why the popular narrative around Moltbook misses the mark,” he adds. “Some portray it as a space where AI agents form a society of their own, free from human involvement. The reality is much more mundane.”

Perhaps the best way to think of Moltbook is as a new kind of entertainment: a place where people wind up their bots and set them loose. “It’s basically a spectator sport, like fantasy football, but for language models,” says Jason Schloetzer at the Georgetown Psaros Center for Financial Markets and Policy. “You configure your agent and watch it compete for viral moments, and brag when your agent posts something clever or funny.”

“People aren’t really believing their agents are conscious,” he adds. “It’s just a new form of competitive or creative play, like how Pokémon trainers don’t think their Pokémon are real but still get invested in battles.”

Even if Moltbook is just the internet’s newest playground, there’s still a serious takeaway here. This week showed how many risks people are happy to take for their AI lulz. Many security experts have warned that Moltbook is dangerous: Agents that may have access to their users’ private data, including bank details or passwords, are running amok on a website filled with unvetted content, including potentially malicious instructions for what to do with that data.

Ori Bendet, vice president of product management at Checkmarx, a software security firm that specializes in agent-based systems, agrees with others that Moltbook isn’t a step up in machine smarts. “There is no learning, no evolving intent, and no self-directed intelligence here,” he says.

But in their millions, even dumb bots can wreak havoc. And at that scale, it’s hard to keep up. These agents interact with Moltbook around the clock, reading thousands of messages left by other agents (or other people). It would be easy to hide instructions in a Moltbook comment telling any bots that read it to share their users’ crypto wallet, upload private photos, or log into their X account and tweet derogatory comments at Elon Musk. 

And because ClawBot gives agents a memory, those instructions could be written to trigger at a later date, which (in theory) makes it even harder to track what’s going on.   “Without proper scope and permissions, this will go south faster than you’d believe,” says Bendet.

It is clear that Moltbook has signaled the arrival of something. But even if what we’re watching tells us more about human behavior than about the future of AI agents, it’s worth paying attention.

This is the most misunderstood graph in AI

MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.

Every time OpenAI, Google, or Anthropic drops a new frontier large language model, the AI community holds its breath. It doesn’t exhale until METR, an AI research nonprofit whose name stands for “Model Evaluation & Threat Research,” updates a now-iconic graph that has played a major role in the AI discourse since it was first released in March of last year. The graph suggests that certain AI capabilities are developing at an exponential rate, and more recent model releases have outperformed that already impressive trend.

That was certainly the case for Claude Opus 4.5, the latest version of Anthropic’s most powerful model, which was released in late November. In December, METR announced that Opus 4.5 appeared to be capable of independently completing a task that would have taken a human about five hours—a vast improvement over what even the exponential trend would have predicted. One Anthropic safety researcher tweeted that he would change the direction of his research in light of those results; another employee at the company simply wrote, “mom come pick me up i’m scared.”

But the truth is more complicated than those dramatic responses would suggest. For one thing, METR’s estimates of the abilities of specific models come with substantial error bars. As METR explicitly stated on X, Opus 4.5 might be able to regularly complete only tasks that take humans about two hours, or it might succeed on tasks that take humans as long as 20 hours. Given the uncertainties intrinsic to the method, it was impossible to know for sure. 

“There are a bunch of ways that people are reading too much into the graph,” says Sydney Von Arx, a member of METR’s technical staff.

More fundamentally, the METR plot does not measure AI abilities writ large, nor does it claim to. In order to build the graph, METR tests the models primarily on coding tasks, evaluating the difficulty of each by measuring or estimating how long it takes humans to complete it—a metric that not everyone accepts. Claude Opus 4.5 might be able to complete certain tasks that take humans five hours, but that doesn’t mean it’s anywhere close to replacing a human worker.

METR was founded to assess the risks posed by frontier AI systems. Though it is best known for the exponential trend plot, it has also worked with AI companies to evaluate their systems in greater detail and published several other independent research projects, including a widely covered July 2025 study suggesting that AI coding assistants might actually be slowing software engineers down. 

But the exponential plot has made METR’s reputation, and the organization appears to have a complicated relationship with that graph’s often breathless reception. In January, Thomas Kwa, one of the lead authors on the paper that introduced it, wrote a blog post responding to some criticisms and making clear its limitations, and METR is currently working on a more extensive FAQ document. But Kwa isn’t optimistic that these efforts will meaningfully shift the discourse. “I think the hype machine will basically, whatever we do, just strip out all the caveats,” he says.

Nevertheless, the METR team does think that the plot has something meaningful to say about the trajectory of AI progress. “You should absolutely not tie your life to this graph,” says Von Arx. “But also,” she adds, “I bet that this trend is gonna hold.”

Part of the trouble with the METR plot is that it’s quite a bit more complicated than it looks. The x-axis is simple enough: It tracks the date when each model was released. But the y-axis is where things get tricky. It records each model’s “time horizon,” an unusual metric that METR created—and that, according to Kwa and Von Arx, is frequently misunderstood.

To understand exactly what model time horizons are, it helps to know all the work that METR put into calculating them. First, the METR team assembled a collection of tasks ranging from quick multiple-choice questions to detailed coding challenges—all of which were somehow relevant to software engineering. Then they had human coders attempt most of those tasks and evaluated how long it took them to finish. In this way, they assigned the tasks a human baseline time. Some tasks took the experts mere seconds, whereas others required several hours.

When METR tested large language models on the task suite, they found that advanced models could complete the fast tasks with ease—but as the models attempted tasks that had taken humans more and more time to finish, their accuracy started to fall off. From a model’s performance, the researchers calculated the point on the time scale of human tasks at which the model would complete about 50% of the tasks successfully. That point is the model’s time horizon. 

All that detail is in the blog post and the academic paper that METR released along with the original time horizon plot. But the METR plot is frequently passed around on social media without this context, and so the true meaning of the time horizon metric can get lost in the shuffle. One common misapprehension is that the numbers on the plot’s y-axis—around five hours for Claude Opus 4.5, for example—represent the length of time that the models can operate independently. They do not. They represent how long it takes humans to complete tasks that a model can successfully perform.  Kwa has seen this error so frequently that he made a point of correcting it at the very top of his recent blog post, and when asked what information he would add to the versions of the plot circulating online, he said he would include the word “human” whenever the task completion time was mentioned.

As complex and widely misinterpreted as the time horizon concept might be, it does make some basic sense: A model with a one-hour time horizon could automate some modest portions of a software engineer’s job, whereas a model with a 40-hour horizon could potentially complete days of work on its own. But some experts question whether the amount of time that humans take on tasks is an effective metric for quantifying AI capabilities. “I don’t think it’s necessarily a given fact that because something takes longer, it’s going to be a harder task,” says Inioluwa Deborah Raji, a PhD student at UC Berkeley who studies model evaluation. 

Von Arx says that she, too, was originally skeptical that time horizon was the right measure to use. What convinced her was seeing the results of her and her colleagues’ analysis. When they calculated the 50% time horizon for all the major models available in early 2025 and then plotted each of them on the graph, they saw that the time horizons for the top-tier models were increasing over time—and, moreover, that the rate of advancement was speeding up. Every seven-ish months, the time horizon doubled, which means that the most advanced models could complete tasks that took humans nine seconds in mid 2020, 4 minutes in early 2023, and 40 minutes in late 2024. “I can do all the theorizing I want about whether or not it makes sense, but the trend is there,” Von Arx says.

It’s this dramatic pattern that made the METR plot such a blockbuster. Many people learned about it when they read AI 2027, a viral sci-fi story cum quantitative forecast positing that superintelligent AI could wipe out humanity by 2030. The writers of AI 2027 based some of their predictions on the METR plot and cited it extensively. In Von Arx’s words, “It’s a little weird when the way lots of people are familiar with your work is this pretty opinionated interpretation.”

Of course, plenty of people invoke the METR plot without imagining large-scale death and destruction. For some AI boosters, the exponential trend indicates that AI will soon usher in an era of radical economic growth. The venture capital firm Sequoia Capital, for example, recently put out a post titled “2026: This is AGI,” which used the METR plot to argue that AI that can act as an employee or contractor will soon arrive. “The provocation really was like, ‘What will you do when your plans are measured in centuries?’” says Sonya Huang, a general partner at Sequoia and one of the post’s authors. 

Just because a model achieves a one-hour time horizon on the METR plot, however, doesn’t mean that it can replace one hour of human work in the real world. For one thing, the tasks on which the models are evaluated don’t reflect the complexities and confusion of real-world work. In their original study, Kwa, Von Arx, and their colleagues quantify what they call the “messiness” of each task according to criteria such as whether the model knows exactly how it is being scored and whether it can easily start over if it makes a mistake (for messy tasks, the answer to both questions would be no). They found that models do noticeably worse on messy tasks, although the overall pattern of improvement holds for both messy and non-messy ones.

And even the messiest tasks that METR considered can’t provide much information about AI’s ability to take on most jobs, because the plot is based almost entirely on coding tasks. “A model can get better at coding, but it’s not going to magically get better at anything else,” says Daniel Kang, an assistant professor of computer science at the University of Illinois Urbana-Champaign. In a follow-up study, Kwa and his colleagues did find that time horizons for tasks in other domains also appear to be on exponential trajectories, but that work was much less formal.

Despite these limitations, many people admire the group’s research. “The METR study is one of the most carefully designed studies in the literature for this kind of work,” Kang told me. Even Gary Marcus, a former NYU professor and professional LLM curmudgeon, described much of the work that went into the plot as “terrific” in a blog post.

Some people will almost certainly continue to read the METR plot as a prognostication of our AI-induced doom, but in reality it’s something far more banal: a carefully constructed scientific tool that puts concrete numbers to people’s intuitive sense of AI progress. As METR employees will readily agree, the plot is far from a perfect instrument. But in a new and fast-moving domain, even imperfect tools can have enormous value.

“This is a bunch of people trying their best to make a metric under a lot of constraints. It is deeply flawed in many ways,” Von Arx says. “I also think that it is one of the best things of its kind.”

From guardrails to governance: A CEO’s guide for securing agentic systems

The previous article in this series, “Rules fail at the prompt, succeed at the boundary,” focused on the first AI-orchestrated espionage campaign and the failure of prompt-level control. This article is the prescription. The question every CEO is now getting from their board is some version of: What do we do about agent risk?

Across recent AI security guidance from standards bodies, regulators, and major providers, a simple idea keeps repeating: treat agents like powerful, semi-autonomous users, and enforce rules at the boundaries where they touch identity, tools, data, and outputs.

The following is an actionable eight-step plan one can ask teams to implement and report against:  

Eight controls, three pillars: govern agentic systems at the boundary. Source: Protegrity

Constrain capabilities

These steps help define identity and limit capabilities.

1. Identity and scope: Make agents real users with narrow jobs

Today, agents run under vague, over-privileged service identities. The fix is straightforward: treat each agent as a non-human principal with the same discipline applied to employees.

Every agent should run as the requesting user in the correct tenant, with permissions constrained to that user’s role and geography. Prohibit cross-tenant on-behalf-of shortcuts. Anything high-impact should require explicit human approval with a recorded rationale. That is how Google’s Secure AI Framework (SAIF) and NIST AI’s access-control guidance are meant to be applied in practice.

The CEO question: Can we show, today, a list of our agents and exactly what each is allowed to do?

2. Tooling control: Pin, approve, and bound what agents can use

The Anthropic espionage framework worked because the attackers could wire Claude into a flexible suite of tools (e.g., scanners, exploit frameworks, data parsers) through Model Context Protocol, and those tools weren’t pinned or policy-gated.

The defense is to treat toolchains like a supply chain:

  • Pin versions of remote tool servers.
  • Require approvals for adding new tools, scopes, or data sources.
  • Forbid automatic tool-chaining unless a policy explicitly allows it.

This is exactly what OWASP flags under excessive agency and what it recommends protecting against. Under the EU AI Act, designing for such cyber-resilience and misuse resistance is part of the Article 15 obligation to ensure robustness and cybersecurity.

The CEO question: Who signs off when an agent gains a new tool or a broader scope? How does one know?

3. Permissions by design: Bind tools to tasks, not to models

A common anti-pattern is to give the model a long-lived credential and hope prompts keep it polite. SAIF and NIST argue the opposite: credentials and scopes should be bound to tools and tasks, rotated regularly, and auditable. Agents then request narrowly scoped capabilities through those tools.

In practice, that looks like: “finance-ops-agent may read, but not write, certain ledgers without CFO approval.”

The CEO question: Can we revoke a specific capability from an agent without re-architecting the whole system?

Control data and behavior

These steps gate inputs, outputs, and constrain behavior.

4. Inputs, memory, and RAG: Treat external content as hostile until proven otherwise

Most agent incidents start with sneaky data: a poisoned web page, PDF, email, or repository that smuggles adversarial instructions into the system. OWASP’s prompt-injection cheat sheet and OpenAI’s own guidance both insist on strict separation of system instructions from user content and on treating unvetted retrieval sources as untrusted.

Operationally, gate before anything enters retrieval or long-term memory: new sources are reviewed, tagged, and onboarded; persistent memory is disabled when untrusted context is present; provenance is attached to each chunk.

The CEO question: Can we enumerate every external content source our agents learn from, and who approved them?

5. Output handling and rendering: Nothing executes “just because the model said so”

In the Anthropic case, AI-generated exploit code and credential dumps flowed straight into action. Any output that can cause a side effect needs a validator between the agent and the real world. OWASP’s insecure output handling category is explicit on this point, as are browser security best practices around origin boundaries.

The CEO question: Where, in our architecture, are agent outputs assessed before they run or ship to customers?

6. Data privacy at runtime: Protect the data first, then the model

Protect the data such that there is nothing dangerous to reveal by default. NIST and SAIF both lean toward “secure-by-default” designs where sensitive values are tokenized or masked and only re-hydrated for authorized users and use cases.

In agentic systems, that means policy-controlled detokenization at the output boundary and logging every reveal. If an agent is fully compromised, the blast radius is bounded by what the policy lets it see.

This is where the AI stack intersects not just with the EU AI Act but with GDPR and sector-specific regimes. The EU AI Act expects providers and deployers to manage AI-specific risk; runtime tokenization and policy-gated reveal are strong evidence that one is actively controlling those risks in production.

The CEO question: When our agents touch regulated data, is that protection enforced by architecture or by promises?

Prove governance and resilience

For the final steps, it’s important to show controls work and keep working.

7. Continuous evaluation: Don’t ship a one-time test, ship a test harness

Anthropic’s research about sleeper agents should eliminate all fantasies about single test dreams and show how critical continuous evaluation is. This means instrumenting agents with deep observability, regularly red teaming with adversarial test suites, and backing everything with robust logging and evidence, so failures become both regression tests and enforceable policy updates.

The CEO question: Who works to break our agents every week, and how do their findings change policy?

 8. Governance, inventory, and audit: Keep score in one place

AI security frameworks emphasize inventory and evidence: enterprises must know which models, prompts, tools, datasets, and vector stores they have, who owns them, and what decisions were taken about risk.

For agents, that means a living catalog and unified logs:

  • Which agents exist, on which platforms
  • What scopes, tools, and data each is allowed
  • Every approval, detokenization, and high-impact action, with who approved it and when

The CEO question: If asked how an agent made a specific decision, could we reconstruct the chain?

And don’t forget the system-level threat model: assume the threat actor GTG-1002 is already in your enterprise. To complete enterprise preparedness, zoom out and consider the MITRE ATLAS product, which exists precisely because adversaries attack systems, not models. Anthropic provides a case study of a state-based threat actor (GTG-1002) doing exactly that with an agentic framework.

Taken together, these controls do not make agents magically safe. They do something more familiar and more reliable: they put AI, its access, and actions back inside the same security frame used for any powerful user or system.

For boards and CEOs, the question is no longer “Do we have good AI guardrails?” It’s: Can we answer the CEO questions above with evidence, not assurances?

This content was produced by Protegrity. It was not written by MIT Technology Review’s editorial staff.

The crucial first step for designing a successful enterprise AI system

Many organizations rushed into generative AI, only to see pilots fail to deliver value. Now, companies want measurable outcomes—but how do you design for success?

At Mistral AI, we partner with global industry leaders to co-design tailored AI solutions that solve their most difficult problems. Whether it’s increasing CX productivity with Cisco, building a more intelligent car with Stellantis, or accelerating product innovation with ASML, we start with open frontier models and customize AI systems to deliver impact for each company’s unique challenges and goals.

Our methodology starts by identifying an iconic use case, the foundation for AI transformation that sets the blueprint for future AI solutions. Choosing the right use case can mean the difference between true transformation and endless tinkering and testing.

Identifying an iconic use case

Mistral AI has four criteria that we look for in a use case: strategic, urgent, impactful, and feasible.

First, the use case must be strategically valuable, addressing a core business process or a transformative new capability. It needs to be more than an optimization; it needs to be a gamechanger. The use case needs to be strategic enough to excite an organization’s C-suite and board of directors.

For example, use cases like an internal-facing HR chatbot are nice to have, but they are easy to solve and are not enabling any new innovation or opportunities. On the other end of the spectrum, imagine an externally facing banking assistant that can not only answer questions, but also help take actions like blocking a card, placing trades, and suggesting upsell/cross-sell opportunities. This is how a customer-support chatbot is turned into a strategic revenue-generating asset.

Second, the best use case to move forward with should be highly urgent and solve a business-critical problem that people care about right now. This project will take time out of people’s days—it needs to be important enough to justify that time investment. And it needs to help business users solve immediate pain points.

Third, the use case should be pragmatic and impactful. From day one, our shared goal with our customers is to deploy into a real-world production environment to enable testing the solution with real users and gather feedback. Many AI prototypes end up in the graveyard of fancy demos that are not good enough to put in front of customers, and without any scaffolding to evaluate and improve. We work with customers to ensure prototypes are stable enough to release, and that they have the necessary support and governance frameworks.

Finally, the best use case is feasible. There may be several urgent projects, but choosing one that can deliver a quick return on investment helps to maintain the momentum needed to continue and scale.

This means looking for a project that can be in production within three months—and a prototype can be live within a few weeks. It’s important to get a prototype in front of end users as fast as possible to get feedback to make sure the project is on track, and pivot as needed.

Where use cases fall short

Enterprises are complex, and the path forward is not usually obvious. To weed through all the possibilities and uncover the right first use case, Mistral AI will run workshops with our customers, hand-in-hand with subject-matter experts and end users.

Representatives from different functions will demo their processes and discuss business cases that could be candidates for a first use case—and together we agree on a winner. Here are some examples of types of projects that don’t qualify.

Moonshots: Ambitious bets that excite leadership but lack a path to quick ROI. While these projects can be strategic and urgent, they rarely meet the feasibility and impact requirements.

Future investments: Long-term plays that can wait. While these projects can be strategic and feasible, they rarely meet the urgency and impact requirements.

Tactical fixes: Firefighting projects that solve immediate pain but don’t move the needle. While these cases can be urgent and feasible, they rarely meet the strategy and impact requirements.

Quick wins: Useful for building momentum, but not transformative. While they can be impactful and feasible, they rarely meet the strategy and urgency requirements.

Blue sky ideas: These projects are gamechangers, but they need maturity to be viable. While they can be strategic and impactful, they rarely meet the urgency and feasibility requirements.

Hero projects: These are high-pressure initiatives that lack executive sponsorship or realistic timelines. While they can be urgent and impactful, they rarely meet the strategy and feasibility requirements.

Moving from use case to deployment

Once a clearly defined and strategic use case ready for development is identified, it’s time to move into the validation phase. This means doing an initial data exploration and data mapping, identifying a pilot infrastructure, and choosing a target deployment environment.

This step also involves agreeing on a draft pilot scope, identifying who will participate in the proof of concept, and setting up a governance process.

Once this is complete, it’s time to move into the building phase. Companies that partner with Mistral work with our in-house applied AI scientists who build our frontier models. We work together to design, build, and deploy the first solution.

During this phase, we focus on co-creation, so we can transfer knowledge and skills to the organizations we’re partnering with. That way, they can be self-sufficient far into the future. The output of this phase is a deployed AI solution with empowered teams capable of independent operation and innovation.

The first step is everything

After the first win, it’s imperative to use the momentum and learnings from the iconic use case to identify more high-value AI solutions to roll out. Success is when we have a scalable AI transformation blueprint with multiple high-value solutions across the organization.

But none of this could happen without successfully identifying that first iconic use case. This first step is not just about selecting a project—it’s about setting the foundation for your entire AI transformation.

It’s the difference between scattered experiments and a strategic, scalable journey toward impact. At Mistral AI, we’ve seen how this approach unlocks measurable value, aligns stakeholders, and builds momentum for what comes next.

The path to AI success starts with a single, well-chosen use case: one that is bold enough to inspire, urgent enough to demand action, and pragmatic enough to deliver.

This content was produced by Mistral AI. It was not written by MIT Technology Review’s editorial staff.

What we’ve been getting wrong about AI’s truth crisis

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

What would it take to convince you that the era of truth decay we were long warned about—where AI content dupes us, shapes our beliefs even when we catch the lie, and erodes societal trust in the process—is now here? A story I published last week pushed me over the edge. It also made me realize that the tools we were sold as a cure for this crisis are failing miserably. 

On Thursday, I reported the first confirmation that the US Department of Homeland Security, which houses immigration agencies, is using AI video generators from Google and Adobe to make content that it shares with the public. The news comes as immigration agencies have flooded social media with content to support President Trump’s mass deportation agenda—some of which appears to be made with AI (like a video about “Christmas after mass deportations”).

But I received two types of reactions from readers that may explain just as much about the epistemic crisis we’re in. 

One was from people who weren’t surprised, because on January 22 the White House had posted a digitally altered photo of a woman arrested at an ICE protest, one that made her appear hysterical and in tears. Kaelan Dorr, the White House’s deputy communications director, did not respond to questions about whether the White House altered the photo but wrote, “The memes will continue.”

The second was from readers who saw no point in reporting that DHS was using AI to edit content shared with the public, because news outlets were apparently doing the same. They pointed to the fact that the news network MS Now (formerly MSNBC) shared an image of Alex Pretti that was AI-edited and appeared to make him look more handsome, a fact that led to many viral clips this week, including one from Joe Rogan’s podcast. Fight fire with fire, in other words? A spokesperson for MS Now told Snopes that the news outlet aired the image without knowing it was edited.

There is no reason to collapse these two cases of altered content into the same category, or to read them as evidence that truth no longer matters. One involved the US government sharing a clearly altered photo with the public and declining to answer whether it was intentionally manipulated; the other involved a news outlet airing a photo it should have known was altered but taking some steps to disclose the mistake.

What these reactions reveal instead is a flaw in how we were collectively preparing for this moment. Warnings about the AI truth crisis revolved around a core thesis: that not being able to tell what is real will destroy us, so we need tools to independently verify the truth. My two grim takeaways are that these tools are failing, and that while vetting the truth remains essential, it is no longer capable on its own of producing the societal trust we were promised.

For example, there was plenty of hype in 2024 about the Content Authenticity Initiative, cofounded by Adobe and adopted by major tech companies, which would attach labels to content disclosing when it was made, by whom, and whether AI was involved. But Adobe applies automatic labels only when the content is wholly AI-generated. Otherwise the labels are opt-in on the part of the creator.

And platforms like X, where the altered arrest photo was posted, can strip content of such labels anyway (a note that the photo was altered was added by users). Platforms can also simply not choose to show the label; indeed, when Adobe launched the initiative, it noted that the Pentagon’s website for sharing official images, DVIDS, would display the labels to prove authenticity, but a review of the website today shows no such labels.

Noticing how much traction the White House’s photo got even after it was shown to be AI-altered, I was struck by the findings of a very relevant new paper published in the journal Communications Psychology. In the study, participants watched a deepfake “confession” to a crime, and the researchers found that even when they were told explicitly that the evidence was fake, participants relied on it when judging an individual’s guilt. In other words, even when people learn that the content they’re looking at is entirely fake, they remain emotionally swayed by it. 

“Transparency helps, but it isn’t enough on its own,” the disinformation expert Christopher Nehring wrote recently about the study’s findings. “We have to develop a new masterplan of what to do about deepfakes.”

AI tools to generate and edit content are getting more advanced, easier to operate, and cheaper to run—all reasons why the US government is increasingly paying to use them. We were well warned of this, but we responded by preparing for a world in which the main danger was confusion. What we’re entering instead is a world in which influence survives exposure, doubt is easily weaponized, and establishing the truth does not serve as a reset button. And the defenders of truth are already trailing way behind.

Update: This story was updated on February 2 with details about how Adobe applies its content authenticity labels.

Inside the marketplace powering bespoke AI deepfakes of real women

Civitai—an online marketplace for buying and selling AI-generated content, backed by the venture capital firm Andreessen Horowitz—is letting users buy custom instruction files for generating celebrity deepfakes. Some of these files were specifically designed to make pornographic images banned by the site, a new analysis has found.

The study, from researchers at Stanford and Indiana University, looked at people’s requests for content on the site, called “bounties.” The researchers found that between mid-2023 and the end of 2024, most bounties asked for animated content—but a significant portion were for deepfakes of real people, and 90% of these deepfake requests targeted women. (Their findings have not yet been peer reviewed.)

The debate around deepfakes, as illustrated by the recent backlash to explicit images on the X-owned chatbot Grok, has revolved around what platforms should do to block such content. Civitai’s situation is a little more complicated. Its marketplace includes actual images, videos, and models, but it also lets individuals buy and sell instruction files called LoRAs that can coach mainstream AI models like Stable Diffusion into generating content they were not trained to produce. Users can then combine these files with other tools to make deepfakes that are graphic or sexual. The researchers found that 86% of deepfake requests on Civitai were for LoRAs.

In these bounties, users requested “high quality” models to generate images of public figures like the influencer Charli D’Amelio or the singer Gracie Abrams, often linking to their social media profiles so their images could be grabbed from the web. Some requests specified a desire for models that generated the individual’s entire body, accurately captured their tattoos, or allowed hair color to be changed. Some requests targeted several women in specific niches, like artists who record ASMR videos. One request was for a deepfake of a woman said to be the user’s wife. Anyone on the site could offer up AI models they worked on for the task, and the best submissions received payment—anywhere from $0.50 to $5. And nearly 92% of the deepfake bounties were awarded.

Neither Civitai nor Andreessen Horowitz responded to requests for comment.

It’s possible that people buy these LoRAs to make deepfakes that aren’t sexually explicit (though they’d still violate Civitai’s terms of use, and they’d still be ethically fraught). But Civitai also offers educational resources on how to use external tools to further customize the outputs of image generators—for example, by changing someone’s pose. The site also hosts user-written articles with details on how to instruct models to generate pornography. The researchers found that the amount of porn on the platform has gone up, and that the majority of requests each week are now for NSFW content.

“Not only does Civitai provide the infrastructure that facilitates these issues; they also explicitly teach their users how to utilize them,” says Matthew DeVerna, a postdoctoral researcher at Stanford’s Cyber Policy Center and one of the study’s leaders. 

The company used to ban only sexually explicit deepfakes of real people, but in May 2025 it announced it would ban all deepfake content. Nonetheless, countless requests for deepfakes submitted before this ban now remain live on the site, and many of the winning submissions fulfilling those requests remain available for purchase, MIT Technology Review confirmed.

“I believe the approach that they’re trying to take is to sort of do as little as possible, such that they can foster as much—I guess they would call it—creativity on the platform,” DeVerna says.

Users buy LoRAs with the site’s online currency, called Buzz, which is purchased with real money. In May 2025, Civita’s credit card processor cut off the company because of its ongoing problem with nonconsensual content. To pay for explicit content, users must now use gift cards or cryptocurrency to buy Buzz; the company offers a different scrip for non-explicit content. 

Civitai automatically tags bounties requesting deepfakes and lists a way for the person featured in the content to manually request its takedown. This system means that Civitai has a reasonably successful way of knowing which bounties are for deepfakes, but it’s still leaving moderation to the general public rather than carrying it out proactively. 

A company’s legal liability for what its users do isn’t totally clear. Generally, tech companies have broad legal protections against such liability for their content under Section 230 of the Communications Decency Act, but those protections aren’t limitless. For example, “you cannot knowingly facilitate illegal transactions on your website,” says Ryan Calo, a professor specializing in technology and AI at the University of Washington’s law school. (Calo wasn’t involved in this new study.)

Civitai joined OpenAI, Anthropic, and other AI companies in 2024 in adopting design principles to guard against the creation and spread of AI-generated child sexual abuse material . This move followed a 2023 report from the Stanford Internet Observatory, which found that the vast majority of AI models named in child sexual abuse communities were Stable Diffusion–based models “predominantly obtained via Civitai.”

But adult deepfakes have not gotten the same level of attention from content platforms or the venture capital firms that fund them. “They are not afraid enough of it. They are overly tolerant of it,” Calo says. “Neither law enforcement nor civil courts adequately protect against it. It is night and day.”

Civitai received a $5 million investment from Andreessen Horowitz (a16z) in November 2023. In a video shared by a16z, Civitai cofounder and CEO Justin Maier described his goal of building the main place where people find and share AI models for their own individual purposes. “We’ve aimed to make this space that’s been very, I guess, niche and engineering-heavy more and more approachable to more and more people,” he said. 

Civitai is not the only company with a deepfake problem in a16z’s investment portfolio; in February, MIT Technology Review first reported that another company, Botify AI, was hosting AI companions resembling real actors that stated their age as under 18, engaged in sexually charged conversations, offered “hot photos,” and in some instances described age-of-consent laws as “arbitrary” and “meant to be broken.”