Chinese tech workers are starting to train their AI doubles–and pushing back

Tech workers in China are being instructed by their bosses to train AI agents to replace them—and it’s prompting a wave of soul-searching among otherwise enthusiastic early adopters. 

Earlier this month a GitHub project called Colleague Skill, which claimed workers could use it to “distill” their colleagues’ skills and personality traits and replicate them with an AI agent, went viral on Chinese social media. Though the project was created as a spoof, it struck a nerve among tech workers, a number of whom told MIT Technology Review that their bosses are encouraging them to document their workflows in order to automate specific tasks and processes using AI agent tools like OpenClaw or Claude Code. 

To set up Colleague Skill, a user names the coworker whose tasks they want to replicate and adds basic profile details. The tool then automatically imports chat history and files from Lark and DingTalk, both popular workplace apps in China, and generates reusable manuals describing that coworker’s duties—and even their unique quirks—for an AI agent to replicate. 

Colleague Skill was created by Tianyi Zhou, who works as an engineer at the Shanghai Artificial Intelligence Laboratory. Earlier this week he told Chinese outlet Southern Metropolis Daily that the project was started as a stunt, prompted by AI-related layoffs and by the growing tendency of companies to ask employees to automate themselves. He didn’t respond to requests for further comment.

Internet users have found humor in the idea behind the tool, joking about automating their coworkers before themselves. However, Colleague Skill’s virality has sparked a lot of debate about workers’ dignity and individuality in the age of AI.

After seeing Colleague Skill on social media, Amber Li, 27, a tech worker in Shanghai, used it to recreate a former coworker as a personal experiment. Within minutes, the tool created a file detailing how that person did their job. “It is surprisingly good,” Li says. “It even captures the person’s little quirks, like how they react and their punctuation habits.” With this skill, Li can use an AI agent as a new “coworker” that helps debug her code and replies instantly. It felt uncanny and uncomfortable, Li says. 

Even so,  replacing coworkers with agents could become a norm. Since OpenClaw became a national craze, bosses in China have been pushing tech workers to experiment with agents. 

Although AI agents can take control of your computer, read and summarize news, reply to emails, and book restaurant reservations for you, tech workers on the ground say their utility has so far proven to be limited in business contexts. Asking employees to make manuals describing the minutiae of their day-to-day jobs the way Colleague Skill does is one way to help bridge that gap. 

Hancheng Cao, an assistant professor at Emory University who studies AI and work, believes that companies have good reasons to push employees to create work blueprints like these, beyond simply following a trend. “Firms gain not only internal experience with the tools, but also richer data on employee know-how, workflows, and decision patterns. That helps companies see which parts of work can be standardized or codified into systems, and which still depend on human judgment,” he says.

To employees, though, making agents or even blueprints for them can feel strange and alienating. One software engineer, who spoke with MIT Technology Review anonymously because of concerns about their job security, trained an AI (not Colleague Skill) on their workflow and found that the process felt reductive—as if their work had been flattened into modules in a way that made them easier to replace. On social media, workers have turned to bleak humor to express similar feelings. In one comment on Rednote, a user wrote that “a cold farewell can be turned into warm tokens,” quipping that if they use Colleague Skill to distill their coworkers into tasks first, they themselves might survive a little longer.

The push for creating agents has also spurred clever countermeasures. Irritated by the idea of reducing a person to a skill, Koki Xu, 26 an AI product manager in Beijing, published an “anti-distillation” skill on GitHub on April 4. The tool, which took Xu about an hour to build, is designed to sabotage the process of creating workflows for agents. Users can choose between light, medium, and heavy sabotage modes depending on how closely their boss is observing the process, and the agent rewrites the material into generic, non-actionable language that would produce a less useful AI stand-in. A video Xu posted about the project went viral, drawing more than 5 million likes across platforms.

Xu told MIT Technology Review that she has been following the Colleague Skill trend from the start and that it has made her think about alienation, disempowerment, and broader implications for labor. “I originally wanted to write an op-ed, but decided it would be more useful to make something that pushes back against it,” she says.

Xu, who has undergraduate and master’s degrees in law, said the trend also raises legal questions. While a company may be able to argue that work chat histories and materials created on a work laptop are corporate property, a skill like this can also capture elements of personality, tone, and judgment, making ownership much less clear. She said she hopes Colleague Skill prompts more discussion about how to protect workers’ dignity and identity in the age of AI. “I believe it’s important to keep up with these trends so we (employees) can participate in shaping how they are used,” she says. Xu herself is an avid AI adopter, with seven OpenClaw agents set up across her personal and work devices.

Li, the tech worker in Shanghai, says her company has not yet found a way to replace actual workers with AI tools, largely because they remain unreliable and require constant supervision. “I don’t feel like my job is immediately at risk,” she says. “But I do feel that my value is being cheapened, and I don’t know what to do about it.”

How robots learn: A brief, contemporary history

Roboticists used to dream big but build small. They’d hope to match or exceed the extraordinary complexity of the human body, and then they’d spend their career refining robotic arms for auto plants. Aim for C-3P0; end up with the Roomba. 

The real ambition for many of these researchers was the robot of science fiction—one that could move through the world, adapt to different environments, and interact safely and helpfully with people. For the socially minded, such a machine could help those with mobility issues, ease loneliness, or do work too dangerous for humans. For the more financially inclined, it would mean a bottomless source of wage-free labor. Either way, a long history of failure left most of Silicon Valley hesitant to bet on helpful robots.

That has changed. The machines are yet unbuilt, but the money is flowing: Companies and investors put $6.1 billion into humanoid robots in 2025 alone, four times what was invested in 2024. 

What happened? A revolution in how machines have learned to interact with the world. 

Imagine you’d like a pair of robot arms installed in your home purely to do one thing: fold clothes. How would it learn to do that? You could start by writing rules. Check the fabric to figure out how much deformation it can tolerate before tearing. Identify a shirt’s collar. Move the gripper to the left sleeve, lift it, and fold it inward by exactly this distance. Repeat for the right sleeve. If the shirt is rotated, turn the plan accordingly. If the sleeve is twisted, correct it. Very quickly the number of rules explodes, but a complete accounting of them could produce reliable results. This was the original craft of robotics: anticipating every possibility and encoding it in advance.

Around 2015, the cutting edge started to do things differently: Build a digital simulation of the robotic arms and the clothes, and give the program a reward signal every time it folds successfully and a ding every time it fails. This way, it gets better by trying all sorts of techniques through trial and error, with millions of iterations—the same way AI got good at playing games.

The arrival of ChatGPT in 2022 catalyzed the current boom. Trained on vast amounts of text, large language models work not through trial and error but by learning to predict what word should come next in a sentence. Similar models adapted to robotics were soon able to absorb pictures, sensor readings, and the position of a robot’s joints and predict the next action the machine should take, issuing dozens of motor commands every second.

This conceptual shift—to reliance on AI models that ingest large amounts of data—seems to work whether that helpful robot is supposed to talk to people, move through an environment, or even do complicated tasks. And it was paired with other ideas about how to accomplish this new way of learning, like deploying robots even if they aren’t yet perfect so they can learn from the environment they’re meant to work in. Today, Silicon Valley roboticists are dreaming big again. Here’s how that happened. 


Jibo

A movable social robot carried out conversations long before the age of LLMs.

An MIT robotics researcher named Cynthia Breazeal introduced an armless, legless, faceless robot called Jibo to the world in 2014. It looked, in fact, like a lamp. Breazeal’s aim was to create a social robot for families, and the idea pulled in $3.7 million in a crowdsourced funding campaign. Early preorders cost $749.

The early Jibo could introduce itself and dance to entertain kids, but that was about it. The vision was always for it to become a sort of embodied assistant that could handle everything from scheduling and emails to telling stories. It earned a number of devoted users, but ultimately the company shut down in 2019.

A crowdfunding campaign started in 2014 and drew 4,800 Jibo preorders.
COURTESY OF MIT MEDIA LAB

In retrospect, one thing that Jibo really needed was better language capabilities. It was competing against Apple’s Siri and Amazon’s Alexa, and all those technologies at the time relied on heavy scripting. In broad terms, when you spoke to them, software would translate your speech into text, analyze what you wanted, and create a response pulled from preapproved snippets. Those snippets could be charming, but they were also repetitive and simply boringdownright robotic. That was especially a challenge for a robot that was supposed to be social and family oriented. 

What has happened since, of course, is a revolution in how machines can generate language. Voice mode from any leading AI provider is now engaging and impressive, and multiple hardware startups are trying (and failing) to build products that take advantage of it. 

But that comes with a new risk: While scripted conversations can’t really go off the rails, ones generated by AI certainly can. Some popular AI toys have, for example, talked to kids about how to find matches and knives. 


Dactyl

A robot hand trained with simulations tries to model the unpredictability and variation of the real world.

By 2018, every leading robotics lab was trying to scrap the old scripted rules and train robots through trial and error. OpenAI tried to train its robotic hand, Dactyl, virtuallywith digital models of the hand and of the palm-size cubes Dactyl was supposed to manipulate. The cubes had letters and numbers on their faces; the model might set a task like “Rotate the cube so the red side with the letter O faces upward.”

Here’s the problem: A robotic hand might get really good at doing this in its simulated world, but when you take that program and ask it to work on a real version in the real world, the slight differences between the two can cause things to go awry. Colors might be slightly different, or the deformable rubber in the robot’s fingertips could turn out to be stretchier than it was in simulation.

a Dactyl robot hand holds a Rubix cube
Dactyl, part of OpenAI’s first attempt at robotics, was trained in simulation to solve Rubik’s Cubes.
COURTESY OF OPENAI

The solution is called domain randomization. You essentially create millions of simulated worlds that all vary slightly and randomly from one another. In each one the friction might be less, or the lighting more harsh, or the colors darkened. Exposure to enough of this variation means the robots will be better able to manipulate the cube in the real world. The approach worked on Dactyl, and one year later it was able to use the same core techniques to do something harder: solving Rubik’s Cubes (though it worked only 60% of the time, and just 20% when the scrambles were particularly hard). 

Still, the limits of simulation mean that this technique plays a far smaller role today than it did in 2018. OpenAI shuttered its robotics effort in 2021 but has recently started the division up againreportedly focusing on humanoids. 


RT-2

Training on images from across the internet helps robots translate language into action.

Around 2022, Google’s robotics team was up to some strange things. It spent 17 months handing people robot controllers and filming them doing everything from picking up bags of chips to opening jars. The team ended up cataloguing 700 different tasks.

The point was to build and test one of the first large-scale foundation models for robotics. As with large language models, the idea was to input lots of text, tokenize it into a format an algorithm could work with, and then generate an output. Google’s RT-1 received input about what the robot was looking at and how the many parts of the robotic arm were positioned; then it took an instruction and translated it into motor commands to move the robot. When it had seen tasks before, it carried out 97% of them successfully; it succeeded at 76% of the instructions it hadn’t seen before. 

a robot at a table of small toys
The model RT-2, for Robotic Transformer 2, incorporated internet data to help robots process what they were seeing.
COURTESY OF GOOGLE DEEPMIND

The second iteration, RT-2, came out the following year and went even further. Instead of training on data specific to robotics, it went broad: It trained on more general images from across the internet, like the vision-language models lots of researchers were working on at the time. That allowed the robot to interpret where certain objects were in the scene.

“All these other things were unlocked,” says Kanishka Rao, a roboticist at Google DeepMind who led work on both iterations. “We could do things now like ‘Put the Coke can near the picture of Taylor Swift.’” 

In 2025, Google DeepMind further fused the worlds of large language models and robotics, releasing a Gemini Robotics model with improved ability to understand commands in natural language. 


RFM-1

An AI model that allows robotic arms to act like coworkers.

In 2017, before OpenAI shuttered its first robotics team, a group of its engineers spun out a project called Covariant, aiming to build not sci-fi humanoids but the most pragmatic of all robots: an arm that could pick up and move things in warehouses. After building a system based on foundation models similar to Google’s, Covariant deployed this platform in warehouses like those operated by Crate & Barrel and treated it as a data collection pipeline. 

By 2024, Covariant had released a robotics model, RFM-1, that you could interact with like a coworker. If you showed an arm many sleeves of tennis balls, for example, you could then instruct it to move each sleeve to a separate area. And the robot could respondperhaps predicting that it wouldn’t be able to get a good grip on the item and then asking for advice on which particular suction cups it should use. 

This sort of thing had been done in experiments, but Covariant was launching it at significant scale. The company now had cameras and data collection machines in every customer location, feeding back even more data for the model to train on.

a warehouse robot arm lifts object with many suckers to place in a bin
A Covariant robot demonstrates “induction”—the common warehouse task of placing objects on sorters or conveyors.
COURTESY OF COVARIANT

It wasn’t perfect. In a demo in March 2024 with an array of kitchen items, the robot struggled when it was asked to “return the banana” to its original location. It picked up a sponge, then an apple, then a host of other items before it finally accomplished the task. 

It “doesn’t understand the new concept” of retracing its steps, cofounder Peter Chen told me at the time. “But it’s a good exampleit might not work well yet in the places where you don’t have good training data.”

Chen and fellow founder Pieter Abbeel were soon hired by Amazon, which is currently licensing Covariant’s robotics model (Amazon did not respond to questions about how it’s being used, but the company runs an estimated 1,300 warehouses in the US alone). 


Digit

Companies are putting this humanoid to the test in real-world settings.

The new investment dollars flowing to robotics startups are aimed largely at robots shaped not like lamps or arms but like people. Humanoid robots are supposed to be able to seamlessly enter the spaces and jobs where humans currently work, avoiding the need to retool assembly lines to accommodate new shapes such as giant arms. 

It’s easier said than done. In the rare cases where humanoids appear in real warehouses, they’re often confined to test zones and pilot programs. 

Digit humanoid robot putting a plastic bin on a conveyor belt
Amazon and other companies are using Digit to help move shipping totes.
COURTESY OF AGILITY ROBOTICS

That said, Agility’s humanoid Digit appears to be doing some real work. The designwith exposed joints and a distinctly unhuman headis driven more by function than by sci-fi aesthetics. Amazon, Toyota, and GXO (a logistics giant with customers like Apple and Nike) have all deployed itmaking it one of the first examples of a humanoid robot that companies see as providing actual cost savings rather than novelty. Their Digits spend their days picking up, moving, and stacking shipping totes.

The current Digit is still a long way from the humanlike helper Silicon Valley is betting on, though. It can lift only 35 pounds, for exampleand every time Agility makes Digit stronger, its battery gets heavier and it has to recharge more often. And standards organizations say humanoids need stricter safety rules than most industrial robots, because they’re designed to be mobile and spend time in proximity to people. 

But Digit shows that this revolution in robot training isn’t converging on a single method. Agility relies on simulation techniques like those OpenAI used to train its hand, and the company has worked with Google’s Gemini models to help its robots adapt to new environments. That’s where more than a decade of experiments have gotten the industry: Now it’s building big.

Why having “humans in the loop” in an AI war is an illusion

The availability of artificial intelligence for use in warfare is at the center of a legal battle between Anthropic and the Pentagon. This debate has become urgent, with AI playing a bigger role than ever before in the current conflict with Iran. AI is no longer just helping humans analyze intelligence. It is now an active player—generating targets in real time, controlling and coordinating missile interceptions, and guiding lethal swarms of autonomous drones.

Most of the public conversation regarding the use of AI-driven autonomous lethal weapons centers on how much humans should remain “in the loop.” Under the Pentagon’s current guidelines, human oversight supposedly provides accountability, context, and nuance while reducing the risk of hacking.

AI systems are opaque “black boxes”

But the debate over “humans in the loop” is a comforting distraction. The immediate danger is not that machines will act without human oversight; it is that human overseers have no idea what the machines are actually “thinking.” The Pentagon’s guidelines are fundamentally flawed because they rest on the dangerous assumption that humans understand how AI systems work.

Having studied intentions in the human brain for decades and in AI systems more recently, I can attest that state-of-the-art AI systems are essentially “black boxes.” We know the inputs and outputs, but the artificial “brain” processing them remains opaque. Even their creators cannot fully interpret them or understand how they work. And when AIs do provide reasons, they are not always trustworthy.

The illusion of human oversight in autonomous systems

In the debate over human oversight, a fundamental question is going unasked: Can we understand what an AI system intends to do before it acts?

Imagine an autonomous drone tasked with destroying an enemy munitions factory. The automated command and control system determines that the optimal target is a munitions storage building. It reports a 92% probability of mission success because secondary explosions of the munitions in the building will thoroughly destroy the facility. A human operator reviews the legitimate military objective, sees the high success rate, and approves the strike.

But what the operator does not know is that the AI system’s calculation included a hidden factor: Beyond devastating the munitions factory, the secondary explosions would also severely damage a nearby children’s hospital. The emergency response would then focus on the hospital, ensuring the factory burns down. To the AI, maximizing disruption in this way meets its given objective. But to a human, it is potentially committing a war crime by violating the rules regarding civilian life. 

Keeping a human in the loop may not provide the safeguard people imagine, because the human cannot know the AI’s intention before it acts. Advanced AI systems do not simply execute instructions; they interpret them. If operators fail to define their objectives carefully enough—a highly likely scenario in high-pressure situations—the “black box” system could be doing exactly what it was told and still not acting as humans intended.

This “intention gap” between AI systems and human operators is precisely why we hesitate to deploy frontier black-box AI in civilian health care or air traffic control, and why its integration into the workplace remains fraught—yet we are rushing to deploy it on the battlefield.

To make matters worse, if one side in a conflict deploys fully autonomous weapons, which operate at machine speed and scale, the pressure to remain competitive would push the other side to rely on such weapons too. This means the use of increasingly autonomous—and opaque—AI decision-making in war is only likely to grow.

The solution: Advance the science of AI intentions

The science of AI must comprise both building highly capable AI technology and understanding how this technology works. Huge advances have been made in developing and building more capable models, driven by record investments—forecast by Gartner to grow to around $2.5 trillion in 2026 alone. In contrast, the investment in understanding how the technology works has been minuscule.

We need a massive paradigm shift. Engineers are building increasingly capable systems. But understanding how these systems work is not just an engineering problem—it requires an interdisciplinary effort. We must build the tools to characterize, measure, and intervene in the intentions of AI agents before they act. We need to map the internal pathways of the neural networks that drive these agents so that we can build a true causal understanding of their decision-making, moving beyond merely observing inputs and outputs. 

A promising way forward is to combine techniques from mechanistic interpretability (breaking neural networks down into human-understandable components) with insights, tools, and models from the neuroscience of intentions. Another idea is to develop transparent, interpretable “auditor” AIs designed to monitor the behavior and emergent goals of more capable black-box systems in real time.  

Developing a better understanding of how AI functions will enable us to rely on AI systems for mission-critical applications. It will also make it easier to build more efficient, more capable, and safer systems.

Colleagues and I are exploring how ideas from neuroscience, cognitive science, and philosophy—fields that study how intentions arise in human decision-making—might help us understand the intentions of artificial systems. We must prioritize these kinds of interdisciplinary efforts, including collaborations between academia, government, and industry.

However, we need more than just academic exploration. The tech industry—and the philanthropists funding AI alignment, which strives to encode human values and goals into these models—must direct substantial investments toward interdisciplinary interpretability research. Furthermore, as the Pentagon pursues increasingly autonomous systems, Congress must mandate rigorous testing of AI systems’ intentions, not just their performance.

Until we achieve that, human oversight over AI may be more illusion than safeguard.

Uri Maoz is a cognitive and computational neuroscientist specializing in how the brain transforms intentions into actions. A professor at Chapman University with appointments at UCLA and Caltech, he leads an interdisciplinary initiative focused on understanding and measuring intentions in artificial intelligence systems (ai-intentions.org).

Making AI operational in constrained public sector environments

The AI boom has hit across industries, and public sector organizations are facing pressure to accelerate adoption. At the same time, government institutions face distinct constraints around security, governance, and operations that set them apart from their business counterparts. For this reason, purpose-built small language models (SLMs) offer a promising path to operationalize AI in these environments.  

A Capgemini study found that 79 percent of public sector executives globally are wary about AI’s data security, an understandable figure given the heightened sensitivity of government data and the legal obligations surrounding its use. As Han Xiao, vice president of AI at Elastic, says, “Government agencies must be very restricted about what kind of data they send to the network. This sets a lot of boundaries on how they think about and manage their data.”

The fundamental need for control over sensitive information is one of many factors complicating AI deployment, particularly when compared against the private sector’s standard operational assumptions.

Unique operational challenges

When private-sector entities expand AI, they typically assume certain conditions will be in place, including continuous connectivity to the cloud, reliance on centralized infrastructure, acceptance of incomplete model transparency, and limited restrictions on data movement. For many state institutions, however, accepting these conditions could be anything from dangerous to impossible. 

Government agencies must ensure that their data stays under their control, that information can be checked and verified, and that operational disruptions are kept to an absolute minimum. At the same time, they often have to run their systems in environments where internet connectivity is limited, unreliable, or unavailable. These complexities prevent many promising public sector AI pilots from moving beyond experimentation. “Many people undervalue the operating challenge of AI,” Xiao says. “The public sector needs AI to perform reliably on all kinds of data, and then to be able to grow without breaking. Continuity of operations is often underestimated.” An Elastic survey of public sector leaders found that 65 percent struggle to use data continuously in real time and at scale. 

Infrastructure constraints compound the problem. Government organizations may also struggle to obtain the graphics processing units (GPUs) used to train and access complex AI models. As Xiao points out, “Government doesn’t often purchase GPUs, unlike the private sector—they’re not used to managing GPU infrastructure. So accessing a GPU to run the model is a bottleneck for much of the public sector.” 

A smaller, more practical model

The many nonnegotiable requirements in the public sector make large language models (LLMs) untenable. But SLMs can be housed locally, offering greater security and control. SLMs are specialized AI models that typically use billions rather than hundreds of billions of parameters, making them far less computationally demanding than the largest LLMs.

The public sector does not need to build ever-larger models housed in offsite, centralized locations. An empirical study found that SLMs performed as well or better than LLMs. SLMs allow sensitive information to be used effectively and efficiently while avoiding the operational complexity of maintaining large models. Xiao puts it this way: “It is easy to use ChatGPT to do proofreading. It’s very difficult to run your own large language models just as smoothly in an environment with no network access.” 

SLMs are purpose-built for the needs of the department or agency that will use them. The data is stored securely outside the model, and is only accessed when queried. Carefully engineered prompts ensure that only the most relevant information is retrieved, providing more accurate responses. Using methods such as smart retrieval, vector search, and verifiable source grounding, AI systems can be built that cater to public sector needs. 

Thus, the next phase of AI adoption in the public sector may be to bring the AI tool to the data, rather than sending the data out into the cloud. Gartner predicts that by 2027, small, specialized AI models will be used three times more than LLMs.

Superior search capabilities

“When people in the public sector hear AI, they probably think about ChatGPT. But we can be much more ambitious,” says Xiao. “AI can revolutionize how the government searches and manages the large amounts of data they have.”

Looking beyond chatbots reveals one of AI’s most immediate opportunities: dramatically improved search. Like many organizations, the public sector has mountains of unstructured data—including technical reports, procurement documents, minutes, and invoices. Today’s AI, however, can deliver results sourced from mixed media, like readable PDFs, scans, images, spreadsheets, and recordings, and in multiple languages. All of this can be indexed by SLM-powered systems to provide tailored responses and to draft complex texts in any language, while ensuring outputs are legally compliant. “The public sector has a lot of data, and they don’t always know how to use this data. They don’t know what the possibilities are,” says Xiao.

Even more powerful, AI can help government employees interpret the data they access. “Today’s AI can provide you with a completely new view of how to harness that data,” says Xiao. A well-trained SLM can interpret legal norms, extract insights from public consultations, support data-driven executive decision-making, and improve public access to services and administrative information. This can contribute to dramatic improvements in how the public sector conducts its operations.

The small-language promise

Focusing on SLMs shifts the conversation from how comprehensive the model can be to how efficient it is. LLMs incur significant performance and computational costs and require specialized hardware that many public entities cannot afford. Despite requiring some capital expenses, SLMs are less resource-intensive than LLMs, so they tend to be cheaper and reduce environmental impact. 

Public sector agencies often face stringent audit requirements, and SLM algorithms can be documented and certified as transparent. Some countries, particularly in Europe, also have privacy regulations such as GDPR that SLMs can be designed to meet.

Tailored training data produces more targeted results, reducing errors, bias, and hallucinations that AI is prone to. As Xiao puts it, “Large language models generate text based on what they were trained on, so there is a cut-off date when they were trained. If you ask about anything after that, it will hallucinate. We can solve this by forcing the model to work from verified sources.”

Risks are also minimized by keeping data on local servers, or even on a specific device. This isn’t about isolation but about strategic autonomy to enable trust, resilience, and relevance.

By prioritizing task-specific models designed for environments that process data locally, and by continuously monitoring performance and impact, public sector organizations can build lasting AI capabilities that support real-world decisions. “Do not start with a chatbot; start with search,” Xiao advises. “Much of what we think of as AI intelligence is really about finding the right information.”

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff. It was researched, designed, and written by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.

Treating enterprise AI as an operating layer

There’s a fault line running through enterprise AI, and it’s not the one getting the most attention. The public conversation still tracks foundation models and benchmarks—GPT versus Gemini, reasoning scores, and marginal capability gains. But in practice, the more durable advantage is structural: who owns the operating layer where intelligence is applied, governed, and improved. One model treats AI as an on-demand utility; the other embeds it as an operating layer—the combination of operation software, data capture, feedback loops and governance that sits between models and real work—that compounds with use.

Model providers like OpenAI and Anthropic sell intelligence as a service: you have a problem, you call an API, you get an answer. That intelligence is general-purpose, largely stateless, and only loosely connected to the day-to-day operations where decisions are made. It’s highly capable and increasingly interchangeable. The distinction that matters is whether intelligence resets on every prompt or accumulates over time.

Incumbent organizations, by contrast, can treat AI as an operating layer: instrumentation across operations, feedback loops from human decisions, and governance that turns individual tasks into reusable policy. In that setup, every exception, correction, and approval becomes a chance to learn—and intelligence can improve as the platform absorbs more of the organization’s work. The organizations most likely to shape the enterprise AI era are those that can embed intelligence directly into operational platforms and instrument those platforms so work generates usable signals.

The prevailing narrative says nimble startups will out-innovate incumbents by building AI-native from scratch. If AI is primarily a model problem, that story holds. But in many enterprise domains, AI is a systems problem—integrations, permissions, evaluation, and change management—where advantage accrues to whomever already sits inside high-volume, high-stakes operations and converts that position into learning and automation.

The inversion: AI executes, humans adjudicate

Traditional services organizations are built on a simple architecture: humans use software to do expert work. Operators log into systems, navigate operations, make decisions, and process cases. Technology is the medium. Human judgment is the product.

An AI-native platform inverts this. It ingests a problem, applies accumulated domain knowledge, executes autonomously what it can with high confidence, and routes targeted sub-tasks to human experts when the situation demands judgment that the system can’t yet reliably provide.

But inverting human-AI interaction isn’t just a UI redesign—it requires raw material. It’s only possible when the platform is built on a foundation of domain expertise, behavioral data, and operational knowledge accumulated over years.

The three compounding assets incumbents already own

AI-native startups begin with a clean architectural slate and can move quickly. What they can’t easily manufacture is the raw material that makes domain AI defensible at scale:

  • Proprietary operational data
  • A large workforce of domain experts whose day-to-day decisions generate training signals
  • Accumulated tacit knowledge about how complex work actually gets done

Services companies already have all three. But these ingredients aren’t moats on their own. They become an advantage only when a company can systematically convert messy operations into AI-ready signals and institutional knowledge—then feed the results back into operations so the system keeps improving.

Codifying expertise into reusable signals

In most services organizations, expertise is tacit and perishable. The best operators know things they cannot easily articulate: heuristics developed over the years, edge-case intuitions, and pattern recognition that operate below the level of conscious reasoning.

At Ensemble, the strategy for addressing this challenge is knowledge distillation. The systematic conversion of expert judgment and operational decisions into machine-readable training signals.

In health-care revenue cycle management, for example, systems can be seeded with explicit domain knowledge and then deepen their coverage through structured daily interaction with operators. In Ensemble’s implementation, the system identifies gaps, formulates targeted questions, and cross-checks answers across multiple experts to capture both consensus and edge-case nuance. It then synthesizes these inputs into a living knowledge base that reflects the situational reasoning behind expert-level performance.

Turning decisions into a learning flywheel

Once a system is constrained enough to be trusted, the next question is how it gets better without waiting for annual model upgrades. Every time a skilled operator makes a decision, they generate more than a completed task. They generate a potential labeled example—context paired with an expert action (and sometimes an outcome). At scale, across thousands of operators and millions of decisions, that stream can power supervised learning, evaluation, and targeted forms of reinforcement—teaching systems to behave more like experts in real conditions.

For example, if an organization processes 50,000 cases a week and captures just three high-quality decision points per case, that’s 150,000 labeled examples every week without creating a separate data-collection program.

A more advanced human-in-the-loop design places experts inside the decision process, so systems learn not just what the right answer was, but how ambiguity gets resolved. Practically, humans intervene at branch points—selecting from AI-generated options, correcting assumptions, and redirecting operations. Each intervention becomes a high-value training signal. When the platform detects an edge case or a deviation from the expected process, it can prompt for a brief, structured rationale, capturing decision factors without requiring lengthy free-form reasoning logs.

Building toward expertise amplification

The goal is to permanently embed the accumulated expertise of thousands of domain experts—their knowledge, decisions, and reasoning—into an AI platform that amplifies what every operator can accomplish. Done well, this produces a quality of execution that neither humans nor AI achieve independently: higher consistency, improved throughput, and measurable operational gains. Operators can focus on more consequential work, supported by an AI that has already completed the analytical groundwork across thousands of analogous prior cases.

The broader implication for enterprise leaders is straightforward. Advantages in AI won’t be determined by access to general-purpose models alone. It will come from an organization’s ability to capture, refine, and compound what it knows, its data, decisions, and operational judgment, while building the controls required for high-stakes environments. As AI shifts from experimentation to infrastructure, the most durable edge may belong to the companies that understand the work well enough to instrument it and can turn that understanding into systems that improve with use.

This content was produced by Ensemble. It was not written by MIT Technology Review’s editorial staff.

Building trust in the AI era with privacy-led UX

The practice of privacy-led user experience (UX) is a design philosophy that treats transparency around data collection and usage as an integral part of the customer relationship. An undertapped opportunity in digital marketing, privacy-led UX treats user consent not as a tick-box compliance exercise, but rather as the first overture in an ongoing customer relationship. For the companies that get it right, the payoff can bring something more intangible, valuable, and durable than simple consent rates: consumer trust.

The opportunities of privacy-led UX have only recently come into focus. Adelina Peltea, the chief marketing officer at Usercentrics, has seen enterprise sentiment shift: “Even just a few years ago, this space was viewed more as a trade-off between growth and compliance,” she says. “But as the market has matured, there’s been a greater focus on how to tie well-designed privacy experiences to business growth.”

And it turns out that well-designed, value-forward consent experiences routinely outperform initial estimates.
Touchpoints for privacy-led UX often include consent management platforms, terms and conditions, privacy policies, data subject access request (DSAR) tools, and, increasingly, AI data use disclosures.

This report examines how data transparency builds trust with customers; how this, in turn, can support business performance; and how organizations can maintain this trust even as AI systems add complexity to consent processes.

Key findings include the following:

  • Privacy is evolving from a one-time consent transaction into an ongoing data relationship. Rather than asking users for broad permissions up front, leading organizations are introducing data-sharing decisions gradually, matching the depth of the ask to the stage of the customer relationship. Companies that take this tack tend to gather both a larger quantity and higher quality of consumer data, the value of which often compounds over time.
  • Privacy-led UX is a prerequisite for AI growth. The consumer data that organizations gather is rapidly becoming a core foundation upon which AI-powered personalization is built. Organizations that establish clear, enforceable privacy and data transparency policies now are better positioned to deploy AI responsibly and at scale in the future. This starts with correctly configured consent mode across ad platforms.
  • Agentic AI introduces new levels of both complexity and opportunity. As AI systems begin acting on users’ behalf, the traditional consent moment may never occur. Governing agent-generated data flows requires privacy infrastructure that goes well beyond the cookie banner.
  • Realizing the advantages of privacy-led UX requires cross-functional collaboration and clear leadership. Privacy-led UX touches marketing, product, legal, and data teams—but someone must own the strategy and weave the threads together. Chief marketing officers
  • (CMOs) are often best positioned for that role, given their visibility across brand, data, and customer experience.
  • A practical framework can support businesses in getting it right. Organizations must define their data collection and usage strategies and ensure their UX incorporates data consent, including a focus on banner design. Following a blueprint for evaluating and improving privacy-led UX supports consistency at every consent touchpoint.

Download the report.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff. It was researched, designed, and written by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.

Coming soon: 10 Things That Matter in AI Right Now

Each year we compile our 10 Breakthrough Technologies list, featuring our educated predictions for which technologies will have the biggest impact on how we live and work.

This year, however, we had a dilemma. While our final picks encompass all our core coverage areas (energy, AI, and biotech, plus a few more), our 2026 list was harder to wrangle than normal. Why? We had so many worthy AI candidates we couldn’t fit them all in! (The ones that made it were AI companions, mechanistic interpretability, generative coding, and hyperscale data centers.) Many great ideas fell by the wayside to keep the list as wide-ranging as possible.

Well, that got us thinking: What if we made an entirely new list that was all about AI? We got excited about that idea—and before we knew it we had the beginnings of what we’re calling 10 Things That Matter in AI Right Now. It’s an entirely new annual list that we’re proud to be publishing for the first time on April 21, 2026. We’ll unveil it on stage for attendees at our signature AI conference, EmTech AI, held on MIT’s campus (it’s not too late to get tickets), and then publish the list online later that day.

The process for coming up with the list was similar to the way we pick our 10 Breakthrough Technologies. We petitioned our AI team of reporters and editors to propose ideas, put them all in a document, and engaged in some robust discussion. Eventually, we voted for our favorites and whittled the long list down to a final 10.

But there’s a slight difference between this list and our 10 Breakthrough Technologies. AI is already such a big part of our lives that we didn’t want to restrict ourselves to nominating only technologies. Instead, we wanted to put together a definitive annual list that highlights what we believe are the biggest ideas, topics, and research directions in AI right now. So yes, it will include cutting-edge AI technologies, but it will also feature other trends and developments in AI that we want to bring to our subscribers’ attention.

Think of it as a sneak peek inside the collective brain of our crack AI reporting team: These are the things that our reporters will be watching this year. We intend to follow the items on this list really closely, and you will see it reflected in the news and feature stories we publish in 2026.

For us, 10 Things That Matter in AI Right Now is a guide to how we view the current AI landscape. It will be a source of discussion, debate, and maybe some arguments! We are so excited to share it with you on April 21. If you want to be among the first to see it—join us at EmTech AI or become a subscriber to livestream the announcement.

Redefining the future of software engineering

Software engineering has experienced two seismic shifts this century. First was the rise of the open source movement, which gradually made code accessible to developers and engineers everywhere. Second, the adoption of development operations (DevOps) and agile methodologies took software from siloed to collaborative development and from batch to continuous delivery. Now, a third such shift looks to be taking shape with the adoption of agentic AI in software engineering.

Thus far, engineering teams have mainly used AI to assist with coding, testing, and other individual tasks, within tightly designed parameters. But with agentic capabilities, AI agents become reasoning, self-directing entities that can manage not just discrete tasks but entire software projects—and do so largely autonomously. If adopted and fully embraced by engineering teams, agentic AI will usher in end-to-end software process automation and, ultimately, agent-managed development and product lifecycle automation.

This report, which is based on a survey of 300 engineering and technology executives, finds that software engineering teams are seeing the potential in agentic AI and are beginning to put it to use, but so far in a mainly limited fashion. Their ambitions for it are high, but most realize it will take time and effort to reduce the barriers to its full diffusion in software operations. As with DevOps and agile, reaping the full benefits of agentic AI in engineering will require sometimes difficult organizational and process change to accompany technology adoption. But the gains to be won in speed, efficiency, and quality promise to make any such pain well worthwhile.

Key findings include the following:

Adoption momentum is building. While half of organizations deem agentic AI a top investment priority for software engineering today, it will be a leading investment for over four-fifths in two years. That spending is driving accelerated adoption. Agentic AI is in (mostly limited) use by 51% of software teams today, and 45% have plans to adopt it within the next 12 months.

Early gains will be incremental. It will take time for software teams’ investments in agentic AI to start bearing fruit. Over the next two years, most expect the improvements from agent use to be slight (14%) or at best moderate (52%). But around one-third (32%) have higher expectations, and 9% think the improvements will be game changing.

Agents will accelerate time-to-market. The chief gains from agentic AI use over that two-year time frame will come from greater speed. Nearly all respondents (98%) expect their teams’ delivery of software projects from pilot to production to accelerate, with the anticipated increase in speed averaging 37% across the group.

The goal for most is full agentic lifecycle management. Teams’ ambitions for scaling agentic AI are high. Most aim for AI agents to be managing the product development and software development lifecycles (PDLC and SDLC) end to end relatively quickly. At 41% of organizations, teams aim to achieve this for most or all products in 18 months. That figure will rise to 72% two years from now, if expectations are met.

Compute costs and integration pose key early challenges. For all survey respondents—but especially in early-adopter verticals such as media and entertainment and technology hardware—integrating agents with existing applications and the cost of computing resources are the main challenges they face with agentic AI in software engineering. The experts we interviewed, meanwhile, emphasize the bigger change management difficulties teams will face in changing workflows.

Download the report

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff. It was researched, designed, and written by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.

Want to understand the current state of AI? Check out these charts.

<div data-chronoton-summary="

  • The US-China AI race is closer than you think: Chinese models from DeepSeek and Alibaba now trail American ones by razor-thin margins. Meanwhile, the US has more data centers and capital, while China leads in research publications and robotics.
  • AI benchmarks are badly broken: One popular math benchmark has a 42% error rate, and models can game tests by training on the answers. Strong test scores increasingly fail to predict how AI actually performs in the real world.
  • Jobs and anxiety are both rising: Software developer employment for workers aged 22–25 has dropped nearly 20% since 2022, with AI likely a factor. Globally, 59% of people think AI will do more good than harm—but 52% say it still makes them nervous.
  • Regulation is losing the race: The EU banned predictive policing AI, and US states passed a record 150 AI-related bills, but experts say lawmakers don’t yet understand the technology well enough to govern it effectively.

” data-chronoton-post-id=”1135675″ data-chronoton-expand-collapse=”1″ data-chronoton-analytics-enabled=”1″>

If you’re following AI news, you’re probably getting whiplash. AI is a gold rush. AI is a bubble. AI is taking your job. AI can’t even read a clock. The 2026 AI Index from Stanford University’s Institute for Human-Centered Artificial Intelligence, AI’s annual report card, comes out today and cuts through some of that noise. 

Despite predictions that AI development may hit a wall, the report says that the top models just keep getting better. People are adopting AI faster than they picked up the personal computer or the internet. AI companies are generating revenue faster than companies in any previous technology boom, but they’re also spending hundreds of billions of dollars on data centers and chips. The benchmarks designed to measure AI, the policies meant to govern it, and the job market are struggling to keep up. AI is sprinting, and the rest of us are trying to find our shoes.

All that speed comes at a cost. AI data centers around the world can now draw 29.6 gigawatts of power, enough to run the entire state of New York at peak demand. Annual water use from running OpenAI’s GPT-4o alone may exceed the drinking water needs of 12 million people. At the same time, the supply chain for chips is alarmingly fragile. The US hosts most of the world’s AI data centers, and one company in Taiwan, TSMC, fabricates almost every leading AI chip. 

The data reveals a technology evolving faster than we can manage. Here’s a look at some of the key points from this year’s report. 

The US and China are nearly tied

In a long, heated race with immense geopolitical stakes, the US and China are almost neck and neck on AI model performance, according to Arena, a community-driven ranking platform that allows users to compare the outputs of large language models on identical prompts. In early 2023, OpenAI had a lead with ChatGPT, but this gap narrowed in 2024 as Google and Anthropic released their own models. In February 2025, R1, an AI model built by the Chinese lab DeepSeek, briefly matched the top US model, ChatGPT. As of March 2026, Anthropic leads, trailed closely by xAI, Google, and OpenAI. Chinese models like DeepSeek and Alibaba lag only modestly. With the best AI models separated in the rankings by razor-thin margins, they’re now competing on cost, reliability, and real-world usefulness. 

Chart of the performance of top models on the Arena by select providers, showing the Arena score from May 2023 to Jan 2026 with the models all trending upward.  The scores are tightly packed by US based Anthropic, xAI, Google and OpenAI lead Alibaba, DeepSeek and Mistral (in that order.) Meta trails the pack.

The index notes that the US and China have different AI advantages. While the US has more powerful AI models, more capital, and an estimated 5,427 data centers (more than 10 times as many as any other country), China leads in AI research publications, patents, and robotics. 

As competition intensifies, companies like OpenAI, Anthropic, and Google no longer disclose their training code, parameter counts, or data-set sizes. “We don’t know a lot of things about predicting model behaviors,” says Yolanda Gil, a computer scientist at the University of Southern California who coauthored the report. This lack of transparency makes it difficult for independent researchers to study how to make AI models safer, she says.

AI models are advancing super fast

Despite predictions that development will plateau, AI models keep getting better and better. By some measures, they now meet or exceed the performance of human experts on tests that aim to measure PhD-level science, math, and language understanding. SWE-bench Verified, a software engineering benchmark for AI models, saw top scores jump from around 60% in 2024 to almost 100% in 2025. In 2025, an AI system produced a weather forecast on its own.  

“I am stunned that this technology continues to improve, and it’s just not plateauing in any way,” says Gil.

line chart of Select AI Index technical performance benchmarks vs human performance, showing that skills such as image classification, English language understanding, multitask language understanding, visual reasoning, medium level reading comprehension, multimodal understanding and reasoning have surpassed the human baseline at or before 2025, with autonomous software engineering, mathmatical reasoning and agent multimodal computer use trending towards meeting the human baseline by 2026.

However, AI still struggles in plenty of other areas. Because the models learn by processing enormous amounts of text and images rather than by experiencing the physical world, AI exhibits “jagged intelligence.” Robots are still in their early days and succeed in only 12% of household tasks. Self-driving cars are farther along: Waymos are now roaming across five US cities, and Baidu’s Apollo Go vehicles are shuttling riders around in China. AI is also expanding into professional domains like law and finance, but no model dominates the field yet. 

But the way we test AI is broken

These reports of progress should be taken with a grain of salt. The benchmarks designed to track AI progress are struggling to keep up as models quickly blow past their ceilings, the Stanford report says. Some are poorly constructed—a popular benchmark that tests a model’s math abilities has a 42% error rate. Others can be gamed: when models are trained on benchmark test data, for example, they can learn to score well without getting smarter. 

Because AI is rarely used the same way it’s tested, strong benchmark performance doesn’t always translate to real-world performance. And for complex, interactive technologies such as AI agents and robots, benchmarks barely exist yet. 

AI companies are also sharing less about how their models are trained, and independent testing sometimes tells a different story from what they report. “A lot of companies are not releasing how their models do in certain benchmarks, particularly the responsible-AI benchmarks,” says Gil. “The absence of how your model is doing on a benchmark maybe says something.” 

AI is starting to affect jobs

Within three years of going mainstream, AI is now used by more than half of people around the world, a rate of adoption faster than the personal computer or the internet. An estimated 88% of organizations now use AI, and four in five university students use it. 

It’s early days for deployment, and AI’s impact on jobs is hard to measure. Still, some studies suggest AI is beginning to affect young workers in certain professions. According to a 2025 study by economists at Stanford, employment for software developers aged 22 to 25 has fallen nearly 20% since 2022. The decline might not be pinned on AI alone, as broader macroeconomic conditions could be to blame, but AI appears to be playing a part.

two line charts showing the normalized headcount trends by age group from 2021 through 2025. On the left for software developers the early career (age 22-25) cohort drops rapidly after a peak in September 2022, with other ages still rising albeit less steeply.  On the right, customer support agents see a similar trend, although the decline for the early career group is less steep than for software developers.

Employers say that hiring may continue to tighten. According to a 2025 survey conducted by McKinsey & Company, a third of organizations expect AI to shrink their workforce in the coming year, particularly in service and supply chain operations and software engineering. AI is boosting productivity by 14% in customer service and 26% in software development, according to research cited by the index, but such gains are not seen in tasks requiring more judgment. Overall, it’s still too early to understand the bigger economic impact of AI. 

People have complicated feelings about AI 

Around the world, people feel both optimistic and anxious about AI: 59% of people think that it will provide more benefits than drawbacks, while 52% say that it makes them nervous, according to an Ipsos survey cited in the index. 

Notably, experts and the public see the future of AI very differently, according to a Pew survey. The biggest gap is around the future of work: While 73% of experts think that AI will have a positive impact on how people do their jobs, only 23% of the American public thinks so. Experts are also more optimistic than the public about AI’s impact on education and medical care, but they agree that AI will hurt elections and personal relationships.

Bar chart of US perceptions of AI's societal impact contrasting US adults with AI experts, with the percentage of AI experts saying that AI will have a positive impact in the next 20 years is 2-3 times higher than the US adults.  The most optimistic AI experts are in the field of medical care with 84% predicting a positive outcome (versus 44% of US adults.) The greatest difference is for jobs with experts polling at 73% and US adults  polling at 23%.  Both groups have a similar (11% for experts and 9% of adults.) expectation for a positive outcome for AI in elections.

Among all countries surveyed, Americans trust their government least to regulate AI appropriately, according to another Ipsos survey. More Americans worry federal AI regulation won’t go far enough than worry it will go too far. 

Governments are struggling to regulate AI

Governments around the world are struggling to regulate AI, but there were some minor successes last year. The EU AI Act’s first prohibitions, which ban the use of AI in predictive policing and emotion recognition, took effect. Japan, South Korea, and Italy also passed national AI laws. Meanwhile, the US federal government moved toward deregulation, with President Trump issuing an executive order seeking to handcuff states from regulating AI. 

Despite this federal action, state legislatures in the US passed a record 150 AI-related bills. California enacted landmark legislation, including SB 53, which mandates safety disclosures and whistleblower protections for developers of AI models. New York passed the RAISE Act, requiring AI companies to publish safety protocols and report critical safety incidents.

line chart showing the number of AI-related bills passed into law by all US states from 2016-2025, which increases sharply in 2023 and peaks with 150 bills in 2025.

But for all the legislative activity, Gil says, regulation is running behind the technology because we don’t really understand how it works. “Governments are cautious to regulate AI because … we don’t understand many things very well,” she says. “We don’t have a good handle on those systems.”

Why opinion on AI is so divided

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

In an industry that doesn’t stand still, Stanford’s AI Index, an annual roundup of key results and trends, is a chance to take a breath. (It’s a marathon, not a sprint, after all.)

This year’s report, which dropped today, is full of striking stats. A lot of the value comes from having numbers to back up gut feelings you might already have, such as the sense that the US is gunning harder for AI than everyone else: It hosts 5,427 data centers (and counting). That’s more than 10 times as many as any other country.  

There’s also a reminder that the hardware supply chain the AI industry relies on has some major choke points. Here’s perhaps the most remarkable fact: “A single company, TSMC, fabricates almost every leading AI chip, making the global AI hardware supply chain dependent on one foundry in Taiwan.” One foundry! That’s just wild.

But the main takeaway I have from the 2026 AI Index is that the state of AI right now is shot through with inconsistencies. As my colleague Michelle Kim put it today in her piece about the report: “If you’re following AI news, you’re probably getting whiplash. AI is a gold rush. AI is a bubble. AI is taking your job. AI can’t even read a clock.” (The Stanford report notes that Google DeepMind’s top reasoning model, Gemini Deep Think, scored a gold medal in the International Math Olympiad but is unable to read analog clocks half the time.)

Michelle does a great job covering the report’s highlights. But I wanted to dwell on a question that I can’t shake. Why is it so hard to know exactly what’s going on in AI right now?  

The widest gap seems to be between experts and non-experts. “AI experts and the general public view the technology’s trajectory very differently,” the authors of the AI Index write. “Assessing AI’s impact on jobs, 73% of U.S. experts are positive, compared with only 23% of the public, a 50 percentage point gap. Similar divides emerge with respect to the economy and medical care.”

That’s a huge gap. What’s going on? What do experts know that the public doesn’t? (“Experts” here means US-based researchers who took part in AI conferences in 2023 and 2024.)

I suspect part of what’s going on is that experts and non-experts base their views on very different experiences. “The degree to which you are awed by AI is perfectly correlated with how much you use AI to code,” a software developer posted on X the other day. Maybe that’s tongue-in-cheek, but there’s definitely something to it.

The latest models from the top labs are now better than ever at producing code. Because technical tasks like coding have right or wrong results, it is easier to train models to do them, compared with tasks that are more open-ended. What’s more, models that can code are proving to be profitable, so model makers are throwing resources at improving them.

This means that people who use those tools for coding or other technical work are experiencing this technology at its best. Outside of those use cases, you get more of a mixed bag. LLMs still make dumb mistakes. This phenomenon has become known as the “jagged frontier”: Models are very good at doing some things and less good at others.

The influential AI researcher Andrej Karpathy also had some thoughts. “Judging by my [timeline] there is a growing gap in understanding of AI capability,” he wrote in reply to that X post. He noted that power users (read: people who use LLMs for coding, math, or research) not only keep up to date with the latest models but will often pay $200 a month for the best versions. “The recent improvements in these domains as of this year have been nothing short of staggering,” he continued.

Because LLMs are still improving fast, someone who pays to use Claude Code will in effect be using a different technology from someone who tried using the free version of Claude to plan a wedding six months ago. Those two groups are speaking past each other.

Where does that leave us? I think there are two realities. Yes, AI is far better than a lot of people realize. And yes, it is still pretty bad at a lot of stuff that a lot of people care about (and it may stay that way). Anyone making bets about the future on either side should bear that in mind.