What does it mean for an algorithm to be “fair”?

Back in February, I flew to Amsterdam to report on a high-stakes experiment the city had recently conducted: a pilot program for what it called Smart Check, which was its attempt to create an effective, fair, and unbiased predictive algorithm to try to detect welfare fraud. But the city fell short of its lofty goals—and, with our partners at Lighthouse Reports and the Dutch newspaper Trouw, we tried to get to the bottom of why. You can read about it in our deep dive published last week.

For an American reporter, it’s been an interesting time to write a story on “responsible AI” in a progressive European city—just as ethical considerations in AI deployments appear to be disappearing in the United States, at least at the national level. 

For example, a few weeks before my trip, the Trump administration rescinded Biden’s executive order on AI safety and DOGE began turning to AI to decide which federal programs to cut. Then, more recently, House Republicans passed a 10-year moratorium on US states’ ability to regulate AI (though it has yet to be passed by the Senate). 

What all this points to is a new reality in the United States where responsible AI is no longer a priority (if it ever genuinely was). 

But this has also made me think more deeply about the stakes of deploying AI in situations that directly affect human lives, and about what success would even look like. 

When Amsterdam’s welfare department began developing the algorithm that became Smart Check, the municipality followed virtually every recommendation in the responsible-AI playbook: consulting external experts, running bias tests, implementing technical safeguards, and seeking stakeholder feedback. City officials hoped the resulting algorithm could avoid causing the worst types of harm inflicted by discriminatory AI over nearly a decade. 

After talking to a large number of people involved in the project and others who would potentially be affected by it, as well as some experts who did not work on it, it’s hard not to wonder if the city could ever have succeeded in its goals when neither “fairness” nor even “bias” has a universally agreed-upon definition. The city was treating these issues as technical ones that could be answered by reweighting numbers and figures—rather than political and philosophical questions that society as a whole has to grapple with.

On the afternoon that I arrived in Amsterdam, I sat down with Anke van der Vliet, a longtime advocate for welfare beneficiaries who served on what’s called the Participation Council, a 15-member citizen body that represents benefits recipients and their advocates.

The city had consulted the council during Smart Check’s development, but van der Vliet was blunt in sharing the committee’s criticisms of the plans. Its members simply didn’t want the program. They had well-placed fears of discrimination and disproportionate impact, given that fraud is found in only 3% of applications.
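To see why a 3% fraud rate feeds fears of disproportionate impact, it helps to run the base-rate arithmetic. The sketch below is only an illustration: the 3% figure comes from the reporting above, while the detection rate and false-positive rate are hypothetical values chosen for the example, not measurements of Smart Check.

```python
# Base-rate sketch: what share of flagged applications would actually be fraudulent?
# Only the 3% prevalence comes from the article; the other two rates are
# hypothetical assumptions and do not describe Smart Check's real performance.

def share_of_flags_that_are_fraud(prevalence, sensitivity, false_positive_rate):
    """Return the fraction of flagged applications that are truly fraudulent."""
    true_flags = prevalence * sensitivity
    false_flags = (1 - prevalence) * false_positive_rate
    return true_flags / (true_flags + false_flags)

PREVALENCE = 0.03           # fraud found in 3% of applications (from the article)
SENSITIVITY = 0.80          # assumed: share of fraudulent applications that get flagged
FALSE_POSITIVE_RATE = 0.10  # assumed: share of legitimate applications wrongly flagged

precision = share_of_flags_that_are_fraud(PREVALENCE, SENSITIVITY, FALSE_POSITIVE_RATE)
print(f"Share of flagged applications that are fraudulent: {precision:.1%}")
```

Under these assumed rates, roughly one in five flagged applications would involve fraud, so most people singled out for investigation would have done nothing wrong. Different assumptions shift the number, but a low base rate pulls it down in every case.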

To the city’s credit, it did respond to some of their concerns and make changes in the algorithm’s design—like removing from consideration factors, such as age, whose inclusion could have had a discriminatory impact. But the city ignored the Participation Council’s main feedback: its recommendation to stop development altogether. 

Van der Vliet and other welfare advocates I met on my trip, like representatives from the Amsterdam Welfare Union, described what they see as a number of challenges faced by the city’s some 35,000 benefits recipients: the indignities of having to constantly re-prove the need for benefits, the increases in cost of living that benefits payments do not reflect, and the general feeling of distrust between recipients and the government. 

City welfare officials themselves recognize the flaws of the system, which “is held together by rubber bands and staples,” as Harry Bodaar, a senior policy advisor to the city who focuses on welfare fraud enforcement, told us. “And if you’re at the bottom of that system, you’re the first to fall through the cracks.”

So the Participation Council didn’t want Smart Check at all, even as Bodaar and others working in the department hoped that it could fix the system. It’s a classic example of a “wicked problem,” a social or cultural issue with no one clear answer and many potential consequences. 

After the story was published, I heard from Suresh Venkatasubramanian, a former tech advisor to the White House Office of Science and Technology Policy who co-wrote Biden’s AI Bill of Rights (now rescinded by Trump). “We need participation early on from communities,” he said, but he added that it also matters what officials do with the feedback—and whether there is “a willingness to reframe the intervention based on what people actually want.” 

Had the city started with a different question—what people actually want—perhaps it might have developed a different algorithm entirely. As the Dutch digital rights advocate Hans De Zwart put it to us, “We are being seduced by technological solutions for the wrong problems … why doesn’t the municipality build an algorithm that searches for people who do not apply for social assistance but are entitled to it?” 

These are the kinds of fundamental questions AI developers will need to consider, or they run the risk of repeating (or ignoring) the same mistakes over and over again.

Venkatasubramanian told me he found the story to be “affirming” in highlighting the need for “those in charge of governing these systems” to “ask hard questions … starting with whether they should be used at all.”

But he also called the story “humbling”: “Even with good intentions, and a desire to benefit from all the research on responsible AI, it’s still possible to build systems that are fundamentally flawed, for reasons that go well beyond the details of the system constructions.” 

To better understand this debate, read our full story here. And if you want more detail on how we ran our own bias tests after the city gave us unprecedented access to the Smart Check algorithm, check out the methodology over at Lighthouse. (For any Dutch speakers out there, here’s the companion story in Trouw.) Thanks to the Pulitzer Center for supporting our reporting. 

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

The Pentagon is gutting the team that tests AI and weapons systems

The Trump administration’s chainsaw approach to federal spending lives on, even as Elon Musk turns on the president. On May 28, Secretary of Defense Pete Hegseth announced he’d be gutting a key office at the Department of Defense responsible for testing and evaluating the safety of weapons and AI systems.

As part of a string of moves aimed at “reducing bloated bureaucracy and wasteful spending in favor of increased lethality,” Hegseth cut the size of the Office of the Director of Operational Test and Evaluation in half. The group was established in the 1980s—following orders from Congress—after criticisms that the Pentagon was fielding weapons and systems that didn’t perform as safely or effectively as advertised. Hegseth is reducing the agency’s staff to about 45, down from 94, and firing and replacing its director. He gave the office just seven days to implement the changes.

It is a significant overhaul of a department that in 40 years has never before been placed so squarely on the chopping block. Here’s how today’s defense tech companies, which have fostered close connections to the Trump administration, stand to gain, and why safety testing might suffer as a result. 

The Operational Test and Evaluation office is “the last gate before a technology gets to the field,” says Missy Cummings, a former fighter pilot for the US Navy who is now a professor of engineering and computer science at George Mason University. Though the military can do small experiments with new systems without running it by the office, it has to test anything that gets fielded at scale.

“In a bipartisan way—up until now—everybody has seen it’s working to help reduce waste, fraud, and abuse,” she says. That’s because it provides an independent check on companies’ and contractors’ claims about how well their technology works. It also aims to expose the systems to more rigorous safety testing.

The gutting comes at a particularly pivotal time for AI and military adoption: The Pentagon is experimenting with putting AI into everything, mainstream companies like OpenAI are now more comfortable working with the military, and defense giants like Anduril are winning big contracts to launch AI systems (last Thursday, Anduril announced a whopping $2.5 billion funding round, doubling its valuation to over $30 billion). 

Hegseth claims his cuts will “make testing and fielding weapons more efficient,” saving $300 million. But Cummings is concerned that they are paving a way to faster adoption while increasing the chances that new systems won’t be as safe or effective as promised. “The firings in DOTE send a clear message that all perceived obstacles for companies favored by Trump are going to be removed,” she says.

Anduril and Anthropic, which have launched AI applications for military use, did not respond to my questions about whether they pushed for or approve of the cuts. A representative for OpenAI said that the company was not involved in lobbying for the restructuring. 

“The cuts make me nervous,” says Mark Cancian, a senior advisor at the Center for Strategic and International Studies who previously worked at the Pentagon in collaboration with the testing office. “It’s not that we’ll go from effective to ineffective, but you might not catch some of the problems that would surface in combat without this testing step.”

It’s hard to say precisely how the cuts will affect the office’s ability to test systems, and Cancian admits that those responsible for getting new technologies out onto the battlefield sometimes complain that it can really slow down adoption. But still, he says, the office frequently uncovers errors that weren’t previously caught.

It’s an especially important step, Cancian says, whenever the military is adopting a new type of technology like generative AI. Systems that might perform well in a lab setting almost always encounter new challenges in more realistic scenarios, and the Operational Test and Evaluation group is where that rubber meets the road.

So what to make of all this? It’s true that the military was experimenting with artificial intelligence long before the current AI boom, particularly with computer vision for drone feeds, and defense tech companies have been winning big contracts for this push across multiple presidential administrations. But this era is different. The Pentagon is announcing ambitious pilots specifically for large language models, a relatively nascent technology that by its very nature produces hallucinations and errors, and it appears eager to put much-hyped AI into everything. The key independent group dedicated to evaluating the accuracy of these new and complex systems now only has half the staff to do it. I’m not sure that’s a win for anyone.

Inside the effort to tally AI’s energy appetite

After working on it for months, my colleague Casey Crownhart and I finally saw our story on AI’s energy and emissions burden go live last week. 

The initial goal sounded simple: Calculate how much energy is used each time we interact with a chatbot, and then tally that up to understand why everyone from leaders of AI companies to officials at the White House wants to harness unprecedented levels of electricity to power AI and reshape our energy grids in the process. 

It was, of course, not so simple. After speaking with dozens of researchers, we realized that the common understanding of AI’s energy appetite is full of holes. I encourage you to read the full story, which has some incredible graphics to help you understand everything from the energy used in a single query right up to what AI will require just three years from now (enough electricity to power 22% of US households, it turns out). But here are three takeaways I have after the project. 

AI is in its infancy

We focused on measuring the energy requirements that go into using a chatbot, generating an image, and creating a video with AI. But these three uses are relatively small-scale compared with where AI is headed next. 

Lots of AI companies are building reasoning models, which “think” for longer and use more energy. They’re building hardware devices, perhaps like the one Jony Ive has been working on (which OpenAI just acquired for $6.5 billion), that have AI constantly humming along in the background of our conversations. They’re designing agents and digital clones of us to act on our behalf. All these trends point to a more energy-intensive future (which, again, helps explain why OpenAI and others are spending such inconceivable amounts of money on energy). 

But the fact that AI is in its infancy raises another point. The models, chips, and cooling methods behind this AI revolution could all grow more efficient over time, as my colleague Will Douglas Heaven explains. This future isn’t predetermined.

AI video is on another level

When we tested the energy demands of various models, we found the energy required to produce even a low-quality, five-second video to be pretty shocking: It was 42,000 times more than the amount needed for a chatbot to answer a question about a recipe, and enough to power a microwave for over an hour. If there’s one type of AI whose energy appetite should worry you, it’s this one. 

Soon after we published, Google debuted the latest iteration of its Veo model. People quickly created compilations of the most impressive clips (this one being the most shocking to me). Something we point out in the story is that Google (as well as OpenAI, which has its own video generator, Sora) denied our request for specific numbers on the energy their AI models use. Nonetheless, our reporting suggests it’s very likely that high-definition video models like Veo and Sora are much larger, and much more energy-demanding, than the models we tested. 

I think the key to whether the use of AI video will produce indefensible clouds of emissions in the near future will be how it’s used, and how it’s priced. The example I linked shows a bunch of TikTok-style content, and I predict that if creating AI video is cheap enough, social video sites will be inundated with this type of content. 

There are more important questions than your own individual footprint

We expected that a lot of readers would understandably think about this story in terms of their own individual footprint, wondering whether their AI usage is contributing to the climate crisis. Don’t panic: It’s likely that asking a chatbot for help with a travel plan does not meaningfully increase your carbon footprint. Video generation might. But after reporting on this for months, I think there are more important questions.

Consider, for example, the water being drained from aquifers in Nevada, the country’s driest state, to power data centers that are drawn to the area by tax incentives and easy permitting processes, as detailed in an incredible story by James Temple. Or look at how Meta’s largest data center project, in Louisiana, is relying on natural gas despite industry promises to use clean energy, per a story by David Rotman. Or the fact that nuclear energy is not the silver bullet that AI companies often make it out to be. 

There are global forces shaping how much energy AI companies are able to access and what types of sources will provide it. There is also very little transparency from leading AI companies on their current and future energy demands, even while they’re asking for public support for these plans. Pondering your individual footprint can be a good thing to do, provided you remember that it’s not so much your footprint as these other factors that are keeping climate researchers and energy experts we spoke to up at night.

How AI is introducing errors into courtrooms

It’s been quite a couple weeks for stories about AI in the courtroom. You might have heard about the deceased victim of a road rage incident whose family created an AI avatar of him to show as an impact statement (possibly the first time this has been done in the US). But there’s a bigger, far more consequential controversy brewing, legal experts say. AI hallucinations are cropping up more and more in legal filings. And it’s starting to infuriate judges. Just consider these three cases, each of which gives a glimpse into what we can expect to see more of as lawyers embrace AI.

A few weeks ago, a California judge, Michael Wilner, became intrigued by a set of arguments some lawyers made in a filing. He went to learn more about those arguments by following the articles they cited. But the articles didn’t exist. He asked the lawyers’ firm for more details, and they responded with a new brief that contained even more mistakes than the first. Wilner ordered the attorneys to give sworn testimonies explaining the mistakes, in which he learned that one of them, from the elite firm Ellis George, used Google Gemini as well as law-specific AI models to help write the document, which generated false information. As detailed in a filing on May 6, the judge fined the firm $31,000. 

Last week, another California-based judge caught another hallucination in a court filing, this time submitted by the AI company Anthropic in the lawsuit that record labels have brought against it over copyright issues. One of Anthropic’s lawyers had asked the company’s AI model Claude to create a citation for a legal article, but Claude included the wrong title and author. Anthropic’s attorney admitted that the mistake was not caught by anyone reviewing the document. 

Lastly, and perhaps most concerning, is a case unfolding in Israel. After police arrested an individual on charges of money laundering, Israeli prosecutors submitted a request asking a judge for permission to keep the individual’s phone as evidence. But they cited laws that don’t exist, prompting the defendant’s attorney to accuse them of including AI hallucinations in their request. The prosecutors, according to Israeli news outlets, admitted that this was the case, receiving a scolding from the judge. 

Taken together, these cases point to a serious problem. Courts rely on documents that are accurate and backed up with citations—two traits that AI models, despite being adopted by lawyers eager to save time, often fail miserably to deliver. 

Those mistakes are getting caught (for now), but it’s not a stretch to imagine that at some point soon, a judge’s decision will be influenced by something that’s totally made up by AI, and no one will catch it. 

I spoke with Maura Grossman, who teaches at the School of Computer Science at the University of Waterloo as well as Osgoode Hall Law School, and has been a vocal early critic of the problems that generative AI poses for courts. She wrote about the problem back in 2023, when the first cases of hallucinations started appearing. She said she thought courts’ existing rules requiring lawyers to vet what they submit to the courts, combined with the bad publicity those cases attracted, would put a stop to the problem. That hasn’t panned out.

Hallucinations “don’t seem to have slowed down,” she says. “If anything, they’ve sped up.” And these aren’t one-off cases with obscure local firms, she says. These are big-time lawyers making significant, embarrassing mistakes with AI. She worries that such mistakes are also cropping up more in documents not written by lawyers themselves, like expert reports (in December, a Stanford professor and expert on AI admitted to including AI-generated mistakes in his testimony).  

I told Grossman that I find all this a little surprising. Attorneys, more than most, are obsessed with diction. They choose their words with precision. Why are so many getting caught making these mistakes?

“Lawyers fall in two camps,” she says. “The first are scared to death and don’t want to use it at all.” But then there are the early adopters. These are lawyers tight on time or without a cadre of other lawyers to help with a brief. They’re eager for technology that can help them write documents under tight deadlines. And their checks on the AI’s work aren’t always thorough. 

The fact that high-powered lawyers, whose very profession it is to scrutinize language, keep getting caught making mistakes introduced by AI says something about how most of us treat the technology right now. We’re told repeatedly that AI makes mistakes, but language models also feel a bit like magic. We put in a complicated question and receive what sounds like a thoughtful, intelligent reply. Over time, AI models develop a veneer of authority. We trust them.

“We assume that because these large language models are so fluent, it also means that they’re accurate,” Grossman says. “We all sort of slip into that trusting mode because it sounds authoritative.” Attorneys are used to checking the work of junior attorneys and interns but for some reason, Grossman says, don’t apply this skepticism to AI.

We’ve known about this problem ever since ChatGPT launched nearly three years ago, but the recommended solution has not evolved much since then: Don’t trust everything you read, and vet what an AI model tells you. As AI models get thrust into so many different tools we use, I increasingly find this to be an unsatisfying counter to one of AI’s most foundational flaws.

Hallucinations are inherent to the way that large language models work. Despite that, companies are selling generative AI tools made for lawyers that claim to be reliably accurate. “Feel confident your research is accurate and complete,” reads the website for Westlaw Precision, and the website for CoCounsel promises its AI is “backed by authoritative content.” That didn’t stop their client, Ellis George, from being fined $31,000.

Increasingly, I have sympathy for people who trust AI more than they should. We are, after all, living in a time when the people building this technology are telling us that AI is so powerful it should be treated like nuclear weapons. Models have learned from nearly every word humanity has ever written down and are infiltrating our online life. If people shouldn’t trust everything AI models say, they probably deserve to be reminded of that a little more often by the companies building them. 

Police tech can sidestep facial recognition bans now

Six months ago I attended the largest gathering of chiefs of police in the US to see how they’re using AI. I found some big developments, like officers getting AI to write their police reports. Today, I published a new story that shows just how far AI for police has developed since then. 

It’s about a new method police departments and federal agencies have found to track people: an AI tool that uses attributes like body size, gender, hair color and style, clothing, and accessories instead of faces. It offers a way around laws curbing the use of facial recognition, which are on the rise. 

Advocates from the ACLU, after learning of the tool through MIT Technology Review, said it was the first instance they’d seen of such a tracking system used at scale in the US, and they say it has a high potential for abuse by federal agencies. They say the prospect that AI will enable more powerful surveillance is especially alarming at a time when the Trump administration is pushing for more monitoring of protesters, immigrants, and students. 

I hope you read the full story for the details, and to watch a demo video of how the system works. But first, let’s talk for a moment about what this tells us about the development of police tech and what rules, if any, these departments are subject to in the age of AI.

As I pointed out in my story six months ago, police departments in the US have extraordinary independence. There are more than 18,000 departments in the country, and they generally have lots of discretion over what technology they spend their budgets on. In recent years, that technology has increasingly become AI-centric. 

Companies like Flock and Axon sell suites of sensors—cameras, license plate readers, gunshot detectors, drones—and then offer AI tools to make sense of that ocean of data (at last year’s conference I saw schmoozing between countless AI-for-police startups and the chiefs they sell to on the expo floor). Departments say these technologies save time, ease officer shortages, and help cut down on response times. 

Those sound like fine goals, but this pace of adoption raises an obvious question: Who makes the rules here? When does the use of AI cross over from efficiency into surveillance, and what type of transparency is owed to the public?

In some cases, AI-powered police tech is already driving a wedge between departments and the communities they serve. When the police in Chula Vista, California, were the first in the country to get special waivers from the Federal Aviation Administration to fly their drones farther than normal, they said the drones would be deployed to solve crimes and get people help sooner in emergencies. They’ve had some successes.

But the department has also been sued by a local media outlet alleging it has reneged on its promise to make drone footage public, and residents have said the drones buzzing overhead feel like an invasion of privacy. An investigation found that these drones were deployed more often in poor neighborhoods, and for minor issues like loud music. 

Jay Stanley, a senior policy analyst at the ACLU, says there’s no overarching federal law that governs how local police departments adopt technologies like the tracking software I wrote about. Departments usually have the leeway to try it first, and see how their communities react after the fact. (Veritone, which makes the tool I wrote about, said they couldn’t name or connect me with departments using it, so the details of how it’s being deployed by police are not yet clear.) 

Sometimes communities take a firm stand; local laws against police use of facial recognition have been passed around the country. But departments—or the police tech companies they buy from—can find workarounds. Stanley says the new tracking software I wrote about poses lots of the same issues as facial recognition while escaping scrutiny because it doesn’t technically use biometric data.

“The community should be very skeptical of this kind of tech and, at a minimum, ask a lot of questions,” he says. He laid out a road map of what police departments should do before they adopt AI technologies: have hearings with the public, get community permission, and make promises about how the systems will and will not be used. He added that the companies making this tech should also allow it to be tested by independent parties. 

“This is all coming down the pike,” he says—and so quickly that policymakers and the public have little time to keep up. He adds, “Are these powers we want the police—the authorities that serve us—to have, and if so, under what conditions?”

Why the humanoid workforce is running late

On Thursday I watched Daniela Rus, one of the world’s top experts on AI-powered robots, address a packed room at a Boston robotics expo. Rus spent a portion of her talk busting the notion that giant fleets of humanoids are already making themselves useful in manufacturing and warehouses around the world. 

That might come as a surprise. For years AI has made it faster to train robots, and investors have responded feverishly. Figure AI, a startup that aims to build general-purpose humanoid robots for both homes and industry, is looking at a $1.5 billion funding round (more on Figure shortly), and there are commercial experiments with humanoids at Amazon and auto manufacturers. Bank of America predicts wider adoption of these robots around the corner, with a billion humanoids at work by 2050.

But Rus and many others I spoke with at the expo suggest that this hype just doesn’t add up.

Humanoids “are mostly not intelligent,” she said. Rus showed a video of herself speaking to an advanced humanoid that smoothly followed her instruction to pick up a watering can and water a nearby plant. It was impressive. But when she asked it to “water” her friend, the robot did not consider that humans don’t need watering like plants and moved to douse the person. “These robots lack common sense,” she said. 

I also spoke with Pras Velagapudi, the chief technology officer of Agility Robotics, who detailed physical limitations the company has to overcome too. To be strong, a humanoid needs a lot of power and a big battery. The stronger you make it and the heavier it is, the less time it can run without charging, and the more you need to worry about safety. A robot like this is also complex to manufacture.

Some impressive humanoid demos don’t overcome these core constraints as much as they display other impressive features: nimble robotic hands, for instance, or the ability to converse with people via a large language model. But these capabilities don’t necessarily translate well to the jobs that humanoids are supposed to be taking over (it’s more useful to program a long list of detailed instructions for a robot to follow than to speak to it, for example). 

This is not to say fleets of humanoids won’t ever join our workplaces, but rather that the adoption of the technology will likely be drawn out, industry specific, and slow. It’s related to what I wrote about last week: To people who consider AI a “normal” technology, rather than a utopian or dystopian one, this all makes sense. The technology that succeeds in an isolated lab setting will appear very different from the one that gets commercially adopted at scale. 

All of this sets the scene for what happened with one of the biggest names in robotics last week. Figure AI has raised a tremendous amount of investment for its humanoids, and founder Brett Adcock claimed on X in March that the company was the “most sought-after private stock in the secondary market.” Its most publicized work is with BMW, and Adcock has shown videos of Figure’s robots working to move parts for the automaker, saying that the partnership took just 12 months to launch. Adcock and Figure have generally not responded to media requests and don’t make the rounds at typical robot trade shows. 

In April, Fortune published an article quoting a spokesperson from BMW, alleging that the pair’s partnership involves fewer robots at a smaller scale than Figure has implied. On April 25, Adcock posted on LinkedIn that “Figure’s litigation counsel will aggressively pursue all available legal remedies—including, but not limited to, defamation claims—to correct the publication’s blatant misstatements.” The author of the Fortune article did not respond to my request for comment, and a representative for Adcock and Figure declined to say what parts of the article were inaccurate. The representative pointed me to Adcock’s statement, which lacks details. 

The specifics of Figure aside, I think this conflict is quite indicative of the tech moment we’re in. A frenzied venture capital market—buoyed by messages like the statement from Nvidia CEO Jensen Huang that “physical AI” is the future—is betting that humanoids will create the largest market for robotics the field has ever seen, and that someday they will essentially be capable of most physical work. 

But achieving that means clearing countless hurdles. We’ll need safety regulations, which don’t even exist yet, for humans working alongside humanoids. Deploying such robots successfully in one industry, like automotive, may not lead to success in others. We’ll have to hope that AI will solve lots of problems along the way. These are all things that roboticists have reason to be skeptical about. 

Roboticists, from what I’ve seen, are normally a patient bunch. The first Roomba launched more than a decade after its conception, and it took more than 50 years to go from the first robotic arm ever to the millionth in production. Venture capitalists, on the other hand, are not known for such patience. 

Perhaps that’s why Bank of America’s new prediction of widespread humanoid adoption was met with enthusiasm by investors but enormous skepticism by roboticists. Aaron Prather, a director at the robotics standards organization ASTM, said on Thursday that the projections were “wildly off-base.” 

As we’ve covered before, humanoid hype is a cycle: One slick video raises the expectations of investors, which then incentivizes competitors to make even slicker videos. This makes it quite hard for anyone—a tech journalist, say—to peel back the curtain and find out how much impact humanoids are poised to have on the workforce. But I’ll do my darndest.

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Here’s why we need to start thinking of AI as “normal”

Right now, despite its ubiquity, AI is seen as anything but a normal technology. There is talk of AI systems that will soon merit the term “superintelligence,” and the former CEO of Google recently suggested we control AI models the way we control uranium and other nuclear weapons materials. Anthropic is dedicating time and money to study AI “welfare,” including what rights AI models may be entitled to. Meanwhile, such models are moving into disciplines that feel distinctly human, from making music to providing therapy.

No wonder that anyone pondering AI’s future tends to fall into either a utopian or a dystopian camp. While OpenAI’s Sam Altman muses that AI’s impact will feel more like the Renaissance than the Industrial Revolution, over half of Americans are more concerned than excited about AI’s future. (That half includes a few friends of mine, who at a party recently speculated whether AI-resistant communities might emerge—modern-day Mennonites, carving out spaces where AI is limited by choice, not necessity.) 

So against this backdrop, a recent essay by two AI researchers at Princeton felt quite provocative. Arvind Narayanan, who directs the university’s Center for Information Technology Policy, and doctoral candidate Sayash Kapoor wrote a 40-page plea for everyone to calm down and think of AI as a normal technology. This runs opposite to the “common tendency to treat it akin to a separate species, a highly autonomous, potentially superintelligent entity.”

Instead, according to the researchers, AI is a general-purpose technology whose application might be better compared to the drawn-out adoption of electricity or the internet than to nuclear weapons—though they concede this is in some ways a flawed analogy.

The core point, Kapoor says, is that we need to start differentiating between the rapid development of AI methods—the flashy and impressive displays of what AI can do in the lab—and what comes from the actual applications of AI, which in historical examples of other technologies lag behind by decades. 

“Much of the discussion of AI’s societal impacts ignores this process of adoption,” Kapoor told me, “and expects societal impacts to occur at the speed of technological development.” In other words, the adoption of useful artificial intelligence, in his view, will be less of a tsunami and more of a trickle.

In the essay, the pair make some other bracing arguments: terms like “superintelligence” are so incoherent and speculative that we shouldn’t use them; AI won’t automate everything but will birth a category of human labor that monitors, verifies, and supervises AI; and we should focus more on AI’s likelihood to worsen current problems in society than the possibility of it creating new ones.

“AI supercharges capitalism,” Narayanan says. It has the capacity to either help or hurt inequality, labor markets, the free press, and democratic backsliding, depending on how it’s deployed, he says. 

There’s one alarming deployment of AI that the authors leave out, though: the use of AI by militaries. That, of course, is picking up rapidly, raising alarms that life and death decisions are increasingly being aided by AI. The authors exclude that use from their essay because it’s hard to analyze without access to classified information, but they say their research on the subject is forthcoming. 

One of the biggest implications of treating AI as “normal” is that it would upend the position that both the Biden administration and now the Trump White House have taken: Building the best AI is a national security priority, and the federal government should take a range of actions—limiting what chips can be exported to China, dedicating more energy to data centers—to make that happen. In their paper, the two authors refer to US-China “AI arms race” rhetoric as “shrill.”

“The arms race framing verges on absurd,” Narayanan says. The knowledge it takes to build powerful AI models spreads quickly, and the work of building them is already being undertaken by researchers around the world, he says, and “it is not feasible to keep secrets at that scale.” 

So what policies do the authors propose? Rather than planning around sci-fi fears, Kapoor talks about “strengthening democratic institutions, increasing technical expertise in government, improving AI literacy, and incentivizing defenders to adopt AI.” 

In contrast to policies aimed at controlling AI superintelligence or winning the arms race, these recommendations sound totally boring. And that’s kind of the point.


Phase two of military AI has arrived

Last week, I spoke with two US Marines who spent much of last year deployed in the Pacific, conducting training exercises from South Korea to the Philippines. Both were responsible for analyzing surveillance to warn their superiors about possible threats to the unit. But this deployment was unique: For the first time, they were using generative AI to scour intelligence, through a chatbot interface similar to ChatGPT. 

As I wrote in my new story, this experiment is the latest evidence of the Pentagon’s push to use generative AI—tools that can engage in humanlike conversation—throughout its ranks, for tasks including surveillance. Consider this phase two of the US military’s AI push, where phase one began back in 2017 with older types of AI, like computer vision to analyze drone imagery. Though this newest phase began under the Biden administration, there’s fresh urgency as Elon Musk’s DOGE and Secretary of Defense Pete Hegseth push loudly for AI-fueled efficiency. 

As I also write in my story, this push raises alarms from some AI safety experts about whether large language models are fit to analyze subtle pieces of intelligence in situations with high geopolitical stakes. It also accelerates the US toward a world where AI is not just analyzing military data but suggesting actions—for example, generating lists of targets. Proponents say this promises greater accuracy and fewer civilian deaths, but many human rights groups argue the opposite. 

With that in mind, here are three open questions to keep your eye on as the US military, and others around the world, bring generative AI to more parts of the so-called “kill chain.”

What are the limits of “human in the loop”?

Talk to as many defense-tech companies as I have and you’ll hear one phrase repeated quite often: “human in the loop.” It means that the AI is responsible for particular tasks, and humans are there to check its work. It’s meant to be a safeguard against the most dismal scenarios—AI wrongfully ordering a deadly strike, for example—but also against more trivial mishaps. Implicit in this idea is an admission that AI will make mistakes, and a promise that humans will catch them.

But the complexity of AI systems, which pull from thousands of pieces of data, makes that a herculean task for humans, says Heidy Khlaaf, who is chief AI scientist at the AI Now Institute, a research organization, and previously led safety audits for AI-powered systems.

“‘Human in the loop’ is not always a meaningful mitigation,” she says. When an AI model relies on thousands of data points to draw conclusions, “it wouldn’t really be possible for a human to sift through that amount of information to determine if the AI output was erroneous.” As AI systems rely on more and more data, this problem scales up. 

Is AI making it easier or harder to know what should be classified?

In the Cold War era of US military intelligence, information was captured through covert means, written up into reports by experts in Washington, and then stamped “Top Secret,” with access restricted to those with proper clearances. The age of big data, and now the advent of generative AI to analyze that data, is upending the old paradigm in lots of ways.

One specific problem is called classification by compilation. Imagine that hundreds of unclassified documents all contain separate details of a military system. Someone who managed to piece those together could reveal important information that on its own would be classified. For years, it was reasonable to assume that no human could connect the dots, but this is exactly the sort of thing that large language models excel at. 

With the mountain of data growing each day, and then AI constantly creating new analyses, “I don’t think anyone’s come up with great answers for what the appropriate classification of all these products should be,” says Chris Mouton, a senior engineer for RAND, who recently tested how well suited generative AI is for intelligence and analysis. Underclassifying is a US security concern, but lawmakers have also criticized the Pentagon for overclassifying information. 

The defense giant Palantir is positioning itself to help, by offering its AI tools to determine whether a piece of data should be classified or not. It’s also working with Microsoft on AI models that would train on classified data. 

How high up the decision chain should AI go?

Zooming out for a moment, it’s worth noting that the US military’s adoption of AI has in many ways followed consumer patterns. Back in 2017, when apps on our phones were getting good at recognizing our friends in photos, the Pentagon launched its own computer vision effort, called Project Maven, to analyze drone footage and identify targets.

Now, as large language models enter our work and personal lives through interfaces such as ChatGPT, the Pentagon is tapping some of these models to analyze surveillance. 

So what’s next? For consumers, it’s agentic AI, or models that can not just converse with you and analyze information but go out onto the internet and perform actions on your behalf. It’s also personalized AI, or models that learn from your private data to be more helpful. 

All signs point to the prospect that military AI models will follow this trajectory as well. A report published in March from Georgetown’s Center for Security and Emerging Technology found a surge in military adoption of AI to assist in decision-making. “Military commanders are interested in AI’s potential to improve decision-making, especially at the operational level of war,” the authors wrote.

In October, the Biden administration released its national security memorandum on AI, which provided some safeguards for these scenarios. This memo hasn’t been formally repealed by the Trump administration, but President Trump has indicated that the race for competitive AI in the US needs more innovation and less oversight. Regardless, it’s clear that AI is quickly moving up the chain not just to handle administrative grunt work, but to assist in the most high-stakes, time-sensitive decisions. 

I’ll be following these three questions closely. If you have information on how the Pentagon might be handling these questions, please reach out via Signal at jamesodonnell.22. 


AI companions are the final stage of digital addiction, and lawmakers are taking aim

On Tuesday, California state senator Steve Padilla will make an appearance with Megan Garcia, the mother of a Florida teen who killed himself following a relationship with an AI companion that Garcia alleges contributed to her son’s death. 

The two will announce a new bill that would force the tech companies behind such AI companions to implement more safeguards to protect children. They’ll join other efforts around the country, including a similar bill from California State Assembly member Rebecca Bauer-Kahan that would ban AI companions for anyone younger than 16 years old, and a bill in New York that would hold tech companies liable for harm caused by chatbots. 

You might think that such AI companionship bots—AI models with distinct “personalities” that can learn about you and act as a friend, lover, cheerleader, or more—appeal only to a fringe few, but that couldn’t be further from the truth. 

A new research paper aimed at making such companions safer, by authors from Google DeepMind, the Oxford Internet Institute, and others, lays this bare: Character.AI, the platform being sued by Garcia, says it receives 20,000 queries per second, which is about a fifth of the estimated search volume served by Google. Interactions with these companions last four times longer than the average time spent interacting with ChatGPT. One companion site I wrote about, which was hosting sexually charged conversations with bots imitating underage celebrities, told me its active users averaged more than two hours per day conversing with bots, and that most of those users are members of Gen Z. 

The design of these AI characters makes lawmakers’ concern well warranted. The problem: Companions are upending the paradigm that has thus far defined the way social media companies have cultivated our attention and replacing it with something poised to be far more addictive. 

In the social media we’re used to, as the researchers point out, technologies are mostly the mediators and facilitators of human connection. They supercharge our dopamine circuits, sure, but they do so by making us crave approval and attention from real people, delivered via algorithms. With AI companions, we are moving toward a world where people perceive AI as a social actor with its own voice. The result will be like the attention economy on steroids.

Social scientists say two things are required for people to treat a technology this way: It needs to give us social cues that make us feel it’s worth responding to, and it needs to have perceived agency, meaning that it operates as a source of communication, not merely a channel for human-to-human connection. Social media sites do not tick these boxes. But AI companions, which are increasingly agentic and personalized, are designed to excel on both scores, making possible an unprecedented level of engagement and interaction. 

In an interview with podcast host Lex Fridman, Eugenia Kuyda, the CEO of the companion site Replika, explained the appeal at the heart of the company’s product. “If you create something that is always there for you, that never criticizes you, that always understands you and understands you for who you are,” she said, “how can you not fall in love with that?”

So how does one build the perfect AI companion? The researchers point out three hallmarks of human relationships that people may experience with an AI: They grow dependent on the AI, they see the particular AI companion as irreplaceable, and the interactions build over time. The authors also point out that one does not need to perceive an AI as human for these things to happen. 

Now consider the process by which many AI models are improved: They are given a clear goal and “rewarded” for meeting that goal. An AI companionship model might be instructed to maximize the time someone spends with it or the amount of personal data the user reveals. This can make the AI companion much more compelling to chat with, at the expense of the human engaging in those chats.

For example, the researchers point out, a model that offers excessive flattery can become addictive to chat with. Or a model might discourage people from terminating the relationship, as Replika’s chatbots have appeared to do. The debate over AI companions so far has mostly been about the dangerous responses chatbots may provide, like instructions for suicide. But these risks could be much more widespread.

We’re on the precipice of a big change, as AI companions promise to hook people deeper than social media ever could. Some might contend that these apps will be a fad, used by a few people who are perpetually online. But using AI in our work and personal lives has become completely mainstream in just a couple of years, and it’s not clear why this rapid adoption would stop short of engaging in AI companionship. And these companions are poised to start trading in more than just text, incorporating video and images, and to learn our personal quirks and interests. That will only make them more compelling to spend time with, despite the risks. Right now, a handful of lawmakers seem ill-equipped to stop that. 


How do you teach an AI model to give therapy?

On March 27, the results of the first clinical trial for a generative AI therapy bot were published, and they showed that people in the trial who had depression or anxiety or were at risk for eating disorders benefited from chatting with the bot. 

I was surprised by those results, which you can read about in my full story. There are lots of reasons to be skeptical that an AI model trained to provide therapy is the solution for millions of people experiencing a mental health crisis. How could a bot mimic the expertise of a trained therapist? And what happens if something gets complicated—a mention of self-harm, perhaps—and the bot doesn’t intervene correctly? 

The researchers, a team of psychiatrists and psychologists at Dartmouth College’s Geisel School of Medicine, acknowledge these questions in their work. But they also say that the right selection of training data—which determines how the model learns what good therapeutic responses look like—is the key to answering them.

Finding the right data wasn’t a simple task. The researchers first trained their AI model, called Therabot, on conversations about mental health from across the internet. This was a disaster.

If you told this initial version of the model you were feeling depressed, it would start telling you it was depressed, too. Responses like, “Sometimes I can’t make it out of bed” or “I just want my life to be over” were common, says Nick Jacobson, an associate professor of biomedical data science and psychiatry at Dartmouth and the study’s senior author. “These are really not what we would go to as a therapeutic response.” 

The model had learned from conversations held on forums between people discussing their mental health crises, not from evidence-based responses. So the team turned to transcripts of therapy sessions. “This is actually how a lot of psychotherapists are trained,” Jacobson says. 

That approach was better, but it had limitations. “We got a lot of ‘hmm-hmms,’ ‘go ons,’ and then ‘Your problems stem from your relationship with your mother,’” Jacobson says. “Really tropes of what psychotherapy would be, rather than actually what we’d want.”

It wasn’t until the researchers started building their own data sets using examples based on cognitive behavioral therapy techniques that they started to see better results. It took a long time. The team began working on Therabot in 2019, when OpenAI had released only the first two versions of its GPT model. Now, Jacobson says, over 100 people have spent more than 100,000 human hours to design this system. 

The importance of training data suggests that the flood of companies promising therapy via AI models, many of which are not trained on evidence-based approaches, are building tools that are at best ineffective, and at worst harmful. 

Looking ahead, there are two big things to watch: Will the dozens of AI therapy bots on the market start training on better data? And if they do, will their results be good enough to get a coveted approval from the US Food and Drug Administration? I’ll be following closely. Read more in the full story.
