AI is changing how small online sellers decide what to make

For years Mike McClary sold the Guardian LTE Flashlight, a heavy-duty black model, online through his small outdoor brand. The product, designed for brightness and durability, became one of his most popular items ever. Even after he stopped offering it around 2017, customers kept sending him emails asking where they could buy it. 

When McClary decided to revisit the Guardian flashlight in 2025, he didn’t begin the way he might have in the past, by combing through supplier listings and sending inquiries to factories. Instead, he opened Accio, an AI sourcing and research tool on Alibaba.com.

For small entrepreneurs in the US, deciding what to sell and where to make it has traditionally been a slow, labor-intensive process that can take months. Now that work is increasingly being done by AI tools like Accio, which help connect businesses with manufacturers in countries including China and India. Business owners and e-commerce experts told MIT Technology Review that these AI tools are making sourcing more accessible and significantly shortening the time it takes to go from product idea to launch. 

McClary, 51, who runs his business from his Illinois living room, has sold products ranging from leather conditioner to camping lights, including one rechargeable lantern that brought in half a million dollars. Like many small online merchants, he built his business by being extremely scrappy—spotting demand for a product, tweaking existing designs, finding a factory, doing modest marketing, and getting the goods in front of customers fast. 

This time, though, he began by telling Accio about the flashlight’s original design, production cost, and profit margin. Then Accio suggested several changes, making it smaller and slightly less bright and switching its power source to batteries. It also identified a manufacturer in Ningbo, China, that McClary said could cut the manufacturing cost from $17 to about $2.50 per unit.

McClary took the process from there, contacting the supplier himself to discuss the revised design. Within a month, the new version of the Guardian flashlight was back up for sale on Amazon and on his brand’s website.

The new factory hunt

Although Alibaba is better known for owning Taobao, the biggest shopping site in China, its first business was Alibaba.com, the primary website that lists Chinese factories open for bulk orders. Placing an order with a manufacturer usually requires far more than clicking “Buy.” Sellers often spend days or weeks browsing listings, comparing suppliers’ reviews and manufacturing capacities, asking about minimum order quantities, requesting samples, and negotiating timelines and customization options. 

But Accio has gained significant momentum by changing how that sourcing gets done. Launched in 2024, Accio exceeded 10 million monthly active users in March 2026, according to the company. That means about one in five Alibaba users consults with AI about product sourcing.

Accio’s interface looks a lot like ChatGPT or Claude: Users type a question into an empty box and choose between “fast” and “thinking” modes. But when asked about products, the tool returns more than text, offering charts, links, and visuals and asking follow-up questions to clarify the buyer’s needs. It then narrows the field to one or a handful of suppliers that appear capable of delivering. After that, the human work begins: Users still have to reach out to suppliers themselves and negotiate the details.

Zhang Kuo, the president of Alibaba.com, told MIT Technology Review that the tool is built on multiple frontier models, including the company’s own Qwen series, a popular family of open-source large language models. The system is able to pull from the site’s millions of supplier profiles and is trained on 26 years of proprietary transaction data.

For tasks like product research and sourcing analysis, the tool “blows it away” compared with general AI tools like ChatGPT, says Richard Kostick, CEO of the beauty brand 100% Pure.

Many websites have tried using AI to assist shopping, but Alibaba has been one of the most aggressive. In March, Eddie Wu, CEO of the site’s parent company Alibaba Group, told managers that integrating the company’s core services with Qwen’s AI capabilities is a top priority. During a Chinese New Year promotion of Qwen’s personal shopping AI agent, where the company gave away cash, customers placed 200 million orders, the firm says.

Vincenzo Toscano, an e-commerce seller and consultant, recommended Accio to his clients before deciding to try it himself for a new sunglasses brand. He came in with a rough vision: a brand shaped by his Italian heritage, his personal style, and a boutique aesthetic. He says the AI helped turn that concept into something more concrete, suggesting materials, refining the look, and pointing to design ideas that felt current.

But the tool has clear limits. McClary, who uses AI tools regularly, says Accio is strongest when it comes to product ideation, but less helpful on marketing questions such as advertising and social media outreach. To use it well, he says, buyers still need to challenge its recommendations, since some can be generic.

The rest of the business

As platforms become more AI-driven, manufacturers are adjusting too. Sally Li, a representative at a makeup packaging company in Wuhan, China, says her firm has started writing more detailed product descriptions and adding information about its equipment and manufacturing experience on Alibaba.com because it suspects those details make its listings more likely to be surfaced by AI.

Li says manufacturers cannot tell whether an inquiry from a customer was generated or guided by AI, and that her firm is not using AI to negotiate pricing or product details.

“AI agents are increasingly used by people to assist purchase decisions and even directly making transactions, and with clear guardrails, they can become extremely useful,” says Jiaxin Pei, a research scientist at the Stanford Institute for Human-Centered AI, “but agents need to act transparently, securely, and in the customer’s best interest.” Pei says developers of these tools should disclose the data they collect and the incentives built into them to ensure that the marketplace remains fair.

Zhang, of Alibaba.com, says Accio currently does not include advertising. Suppliers can pay for higher placement in Alibaba.com’s regular search results, but Zhang says Accio is “not integrated” with that system. “We haven’t had a clear answer in terms of how to monetize this tool,” he says. For now, users can pay for additional tokens to continue chatting with the agent after their free queries run out.

Sellers say that while AI tools have made it easier to come up with ideas and get a business off the ground, they do not replace the core skills that make someone good at e-commerce. McClary believes that even when sellers have access to the same market information, some are still better at making decisions, acting quickly, and actually delivering on orders. Those differences, he says, still go a long way.

Toscano, the brand founder and e-commerce consultant, feels good about officially launching his new brand of sunglasses in just a few months: “We [small business owners] always have to bootstrap a lot of decisions. Deciding what to sell often comes down to an educated guess,” he says. “And we’re now in an era when making those decisions is easier than ever.”

The one piece of data that could actually shed light on your job and AI

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Within Silicon Valley’s orbit, an AI-fueled jobs apocalypse is spoken about as a given. The mood is so grim that a societal impacts researcher at Anthropic, responding Wednesday to a call for more optimistic visions of AI’s future, said there might be a recession in the near term and a “breakdown of the early-career ladder.” Her less-measured colleague Dario Amodei, the company’s CEO, has called AI “a general labor substitute for humans” that could do all jobs in less than five years. And those ideas are not just coming from Anthropic, of course. 

These conversations have unsurprisingly left many workers in a panic (and are probably contributing to support for efforts to entirely pause the construction of data centers, some of which gained steam last week). The panic isn’t being helped by lawmakers, none of whom have articulated a coherent plan for what comes next.

Even economists who have cautioned that AI has not yet cut jobs and may not result in a cliff ahead are coming around to the idea that it could have a unique and unprecedented impact on how we work. 

Alex Imas, based at the University of Chicago, is one of those economists. He shared two things with me when we spoke on Friday morning: a blunt assessment that our tools for predicting what this will look like are pretty abysmal, and a “call to arms” for economists to start collecting the one type of data that could make a plan to address AI in the workforce possible at all. 

On our abysmal tools: consider the fact that any job is made up of individual tasks. One part of a real estate agent’s job, for example, is to ask clients what sort of property they want to buy. The US government chronicled thousands of these tasks in a massive catalogue first launched in 1998 and updated regularly since then. This was the data that researchers at OpenAI used in December to judge how “exposed” a job is to AI (they found a real estate agent to be 28% exposed, for example). Then in February, Anthropic used this data in its analysis of millions of Claude conversations to see which tasks people are actually using its AI to complete and where the two lists overlapped.

But knowing the AI exposure of tasks leads to an illusory understanding of how much a given job is at risk, Imas says. “Exposure alone is a completely meaningless tool for predicting displacement,” he told me.

Sure, it is illustrative in the gloomiest case—for a job in which literally every task could be done by AI with no human direction. If it costs less for an AI model to do all those tasks than what you’re paid—which is not a given, since reasoning models and agentic AI can rack up quite a bill—and it can do them well, the job likely disappears, Imas says. This is the oft-mentioned case of the elevator operator from decades ago; maybe today’s parallel is a customer service agent solely doing phone call triage. 

But for the vast majority of jobs, the case is not so simple. And the specifics matter, too: Some jobs are likely to have dark days ahead, but how and when that will play out is hard to answer by looking at exposure alone.

Take writing code, for example. Someone who builds premium dating apps, let’s say, might use AI coding tools to create in one day what used to take three days. That means the worker is more productive. The worker’s employer, spending the same amount of money, can now get more output. So then will the employer want more employees or fewer? 

This is the question that Imas says should keep any policymaker up at night, because the answer will change depending on the industry. And we are operating in the dark. 

In this coder’s case, these efficiencies make it possible for dating apps to lower prices. (A skeptic might expect companies to simply pocket the gains, but in a competitive market, they risk being undercut if they do.) These lower prices will always drive some increase in demand for the apps. But how much? If millions more people want it, the company might grow and ultimately hire more engineers to meet this demand. But if demand barely ticks up—maybe the people who don’t use premium dating apps still won’t want them even at a lower price—fewer coders are needed, and layoffs will happen.

Repeat this hypothetical across every job with tasks that AI can do, and you have the most pressing economic question of our time: the specifics of price elasticity, or how much demand for something changes when its price changes. And this is the second part of what Imas emphasized last week: We don’t currently have this data across the economy. But we could.
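
To make the dating-app hypothetical concrete, here is a minimal sketch of the arc-elasticity calculation that this kind of data would feed. Every number in it is invented for illustration; only the logic (elastic demand favors hiring, inelastic demand favors layoffs) comes from the argument above.

```python
def arc_elasticity(q0, q1, p0, p1):
    """Percent change in quantity over percent change in price,
    computed against midpoints so the result is direction-agnostic."""
    pct_q = (q1 - q0) / ((q0 + q1) / 2)
    pct_p = (p1 - p0) / ((p0 + p1) / 2)
    return pct_q / pct_p

# Hypothetical: AI coding tools let a dating-app maker cut its price
# from $30 to $24 a month. If subscriptions jump from 100,000 to
# 130,000, demand is elastic (|e| > 1) and the firm may hire more
# coders; if they inch up to only 105,000, demand is inelastic and
# layoffs are the likelier outcome.
print(arc_elasticity(100_000, 130_000, 30, 24))  # ≈ -1.17, elastic
print(arc_elasticity(100_000, 105_000, 30, 24))  # ≈ -0.22, inelastic
```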

We do have the numbers for grocery items like cereal and milk, Imas says, because the University of Chicago partners with supermarkets to get data from their price scanners. But we don’t have such figures for tutors or web developers or dietitians (all jobs found to have “exposure” to AI, by the way). Or at least not in a way that’s been widely compiled or made accessible to researchers; sometimes it’s scattered across private companies or consultancies. 

“We need, like, a Manhattan Project to collect this,” Imas says. And we don’t need it just for jobs that could obviously be affected by AI now: “Fields that are not exposed now will become exposed in the future, so you just want to track these statistics across the entire economy.”

Getting all this information would take time and money, but Imas makes the case that it’s worth it; it would give economists the first realistic look at how our AI-enabled future could unfold and give policymakers a shot at making a plan for it.

Four things we’d need to put data centers in space

MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.

In January, Elon Musk’s SpaceX filed an application with the US Federal Communications Commission to launch up to one million data centers into Earth’s orbit. The goal? To fully unleash the potential of AI without triggering an environmental crisis on Earth. But could it work?

SpaceX is the latest in a string of high-tech companies extolling the potential of orbital computing infrastructure. Last year, Amazon founder Jeff Bezos said that the tech industry will move toward large-scale computing in space. Google has plans to loft data-crunching satellites, aiming to launch a test constellation of 80 as early as next year. And last November Starcloud, a startup based in Washington State, launched a satellite fitted with a high-performance Nvidia H100 GPU, marking the first orbital test of an advanced AI chip. The company envisions orbiting data centers as large as those on Earth by 2030.

Proponents believe that putting data centers in space makes sense. The current AI boom is straining energy grids and adding to the demand for water, which is needed to cool the computers. Communities in the vicinity of large-scale data centers worry about increasing prices for those resources as a result of the growing demand, among other issues.

In space, advocates say, the water and energy problems would be solved. In constantly illuminated sun-synchronous orbits, space-borne data centers would have uninterrupted access to solar power. At the same time, the excess heat they produce would be easily expelled into the cold vacuum of space. And with the cost of space launches decreasing, and mega-rockets such as SpaceX’s Starship promising to push prices even lower, there could be a point at which moving the world’s data centers into space makes sound business sense. Detractors, on the other hand, tell a different story and point to a variety of technological hurdles, though some say it’s possible they may be surmountable in the not-so-distant future. Here are four of the must-haves we’d need to make space-based data centers a reality. 

A way to carry away heat 

AI data centers produce a lot of heat. Space might seem like a great place to dispel that heat without using up massive amounts of water. But it’s not so simple. To get the power needed to run 24-7, a space-based data center would have to be in a constantly illuminated orbit, circling the planet from pole to pole and never passing through Earth’s shadow. And in that orbit, the temperature of the equipment would never drop below 80 °C, which is way too hot for electronics to operate safely in the long term. 

Getting the heat out of such a system is surprisingly challenging. “Thermal management and cooling in space is generally a huge problem,” says Lilly Eichinger, CEO of the Austrian space tech startup Satellives.

On Earth, heat dissipates mostly through the natural process of convection, which relies on the movement of gases and liquids like air and water. In the vacuum of space, heat has to be removed through the far less efficient process of radiation. Safely removing the heat produced by the computers, as well as what’s absorbed from the sun, requires large radiative surfaces. The bulkier the satellite, the harder it is to send all the heat inside it out into space.
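
For a sense of scale, here is a back-of-envelope radiator sizing using the Stefan-Boltzmann law. The emissivity and temperature figures are assumptions chosen for illustration, solar heating of the radiator is ignored, and none of the numbers come from the studies cited in this piece.

```python
# Radiated power follows the Stefan-Boltzmann law: P = e * sigma * A * T^4.
SIGMA = 5.67e-8       # Stefan-Boltzmann constant, W/(m^2 K^4)
EMISSIVITY = 0.9      # assumed value for a good radiator coating
T = 80 + 273.15       # radiator temperature in kelvin (80 °C)

flux = EMISSIVITY * SIGMA * T**4   # watts shed per square meter
area_per_mw = 1e6 / flux           # radiator area to reject 1 MW of heat

print(f"{flux:.0f} W/m^2 -> {area_per_mw:,.0f} m^2 per megawatt")
# ~794 W/m^2, so roughly 1,260 m^2 of radiator per megawatt of waste
# heat; a gigawatt-scale facility would need on the order of a square
# kilometer of radiating surface.
```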

But Yves Durand, former director of technology at the European aerospace giant Thales Alenia Space, says that technology already exists to tackle the problem.

The company previously developed a system for large telecommunications satellites that can pipe refrigerant fluid through a network of tubing using a mechanical pump, ultimately transferring heat from within a spacecraft to radiators on the exterior. Durand led a 2024 feasibility study on space-based data centers, which found that although challenges exist, it should be possible for Europe to put gigawatt-scale data centers (on par with the largest Earthbound facilities) into orbit before 2050. These would be considerably larger than those envisioned by SpaceX, featuring solar arrays hundreds of meters in size—larger than the International Space Station.

Computer chips that can withstand a radiation onslaught

The space around Earth is constantly battered by cosmic particles and lashed by solar radiation. On Earth’s surface, humans and their electronic devices are protected from this corrosive soup of charged particles by the planet’s atmosphere and magnetosphere. But the farther away from Earth you venture, the weaker that protection becomes. Studies show that aircraft crews have a higher risk of developing cancer because of their frequent exposure to high radiation at cruising altitude, where the atmosphere is thin and less protective.

Electronics in space are at risk of three types of problems caused by high radiation levels, says Ken Mai, a principal systems scientist in electrical and computer engineering at Carnegie Mellon University. Phenomena known as single-event upsets can cause bit flips and corrupt stored data when charged particles hit chips and memory devices. Over time, electronics in space accumulate damage from ionizing radiation that degrades their performance. And sometimes a charged particle can strike the component in a way that physically displaces atoms on the chip, creating permanent damage, Mai explains.

Traditionally, computers launched to space had to undergo years of testing and were specifically designed to withstand the intense radiation present in Earth’s orbit. These space-hardened electronics are much more expensive, though, and their performance is also years behind the state-of-the-art devices for Earth-based computing. Launching conventional chips is a gamble. But Durand says cutting-edge computer chips use technologies that are by default more resistant to radiation than past systems. And in mid-March, Nvidia touted hardware, including a new GPU, that is “bringing AI compute to orbital data centers.” 

Nvidia’s head of edge AI marketing, Chen Su, told MIT Technology Review that “Nvidia systems are inherently commercial off the shelf, with radiation resilience achieved at the system level rather than through radiation‑hardened silicon alone.” He added that satellite makers increase the chips’ resiliency with the help of shielding, advanced software for error detection, and architectures that combine the consumer-grade devices with bespoke, hardened technologies.
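
To illustrate what error masking at the system level can look like in software, here is a minimal sketch of triple modular redundancy, a classic textbook scheme in which the same computation runs three times and a bitwise majority vote hides a flipped bit in any single copy. It is a generic illustration, not a description of Nvidia’s actual fault-tolerance stack.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant integer results."""
    return (a & b) | (a & c) | (b & c)

correct = 0b1011_0010
upset = correct ^ (1 << 5)  # a single-event upset flips bit 5 in one copy

# The vote recovers the correct value despite the corrupted copy.
assert majority(correct, upset, correct) == correct
```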

Still, Mai says that the data-crunching chips are only one issue. The data centers would also need memory and storage devices, both of which are vulnerable to damage by excessive radiation. And operators would need the ability to swap things out or adapt when issues arise. The feasibility and affordability of using robots or astronaut missions for maintenance is a major question mark hanging over the idea of large-scale orbiting data centers.

“You not only need to throw up a data center to space that meets your current needs; you need redundancy, extra parts, and reconfigurability, so when stuff breaks, you can just change your configuration and continue working,” says Mai. “It’s a very challenging problem because on one hand you have free energy and power in space, but there are a lot of disadvantages. It’s quite possible that those problems will outweigh the advantages that you get from putting a data center into space.”

In addition to the need for regular maintenance, there’s also the potential for catastrophic loss. During periods of intense space weather, satellites can be flooded with enough radiation to kill all their electronics. The sun has just passed the most active phase of its 11-year cycle with relatively little impact on satellites. Still, experts warn that since the space age began, the planet has not experienced the worst the sun is capable of. Many doubt whether the low-cost new space systems that dominate Earth’s orbits today are prepared for that.

A plan to dodge space debris

Both large-scale orbiting data centers such as those envisioned by Thales Alenia Space and the mega-constellations of smaller satellites as proposed by SpaceX give a headache to space sustainability experts. The space around Earth is already quite crowded with satellites. Starlink satellites alone perform hundreds of thousands of collision avoidance maneuvers every year to dodge debris and other spacecraft. The more stuff in space, the higher the likelihood of a devastating collision that would clutter the orbit with thousands of dangerous fragments.

Large structures with hundreds of square meters of solar arrays would quickly suffer damage from small pieces of space debris and meteorites, which would over time degrade the performance of their solar panels and create more debris in orbit. Operating one million satellites in low Earth orbit, the region of space at altitudes up to 2,000 kilometers, might be impossible to do safely unless all satellites in that area are part of the same network so they can communicate effectively to maneuver around each other, Greg Vialle, the founder of the orbital recycling startup Lunexus Space, told MIT Technology Review.

“You can fit roughly four to five thousand satellites in one orbital shell,” Vialle says. “If you count all the shells in low Earth orbit, you get to a number of around 240,000 satellites maximum.”

And spacecraft must be able to pass each other at a safe distance to avoid collisions, he says. 

“You also need to be able to get stuff up to higher orbits and back down to de-orbit,” he adds. “So you need to have gaps of at least 10 kilometers between the satellites to do that safely. Mega-constellations like Starlink can be packed more tightly because the satellites communicate with each other. But you can’t have one million satellites around Earth unless it’s a monopoly.”

On top of that, Starlink would likely want to regularly upgrade its orbiting data centers with more modern technology. Replacing a million satellites perhaps every five years would mean even more orbital traffic—and it could increase the rate of debris reentry into Earth’s atmosphere from around three or four pieces of junk a day to about one every three minutes, according to a group of astronomers who filed objections against SpaceX’s FCC application. Some scientists are concerned that reentering debris could damage the ozone layer and alter Earth’s thermal balance.
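
The astronomers’ figure survives a napkin check. Taking the article’s one million satellites and a five-year replacement cycle at face value (the arithmetic below is only a sanity check, not taken from their filing):

```python
satellites = 1_000_000
lifetime_years = 5  # assumed replacement cycle

reentries_per_day = satellites / lifetime_years / 365
minutes_between = 24 * 60 / reentries_per_day

print(f"{reentries_per_day:.0f} reentries a day, one every "
      f"{minutes_between:.1f} minutes")
# ~548 a day, or one roughly every 2.6 minutes, consistent with
# "about one every three minutes."
```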

Economical launch and assembly

The longer hardware survives in orbit, the better the return on investment. But for orbital data centers to make economic sense, companies will have to find a relatively cheap way to get that hardware in orbit. SpaceX is betting on its upcoming Starship mega-rocket, which will be able to carry up to six times as much payload as the current workhorse, Falcon 9. The Thales Alenia Space study concluded that if Europe were to build its own orbital data centers, it would have to develop a similarly potent launcher. 

But launch is only part of the equation. A large-scale orbital data center won’t fit in a rocket—even a mega-rocket. It will need to be assembled in orbit. And that will likely require advanced robotic systems that do not exist yet. Various companies have conducted Earth-based tests with precursors of such systems, but they are still far from real-world use.

Durand says that in the short term, smaller-scale data centers are likely to establish themselves as an integral part of the orbital infrastructure, by processing images from Earth-observing satellites directly in space without having to send them to Earth. That would be a huge help for companies selling insights from space, as many of these data sets are extremely large, and competition for opportunities to downlink them to Earth for processing via ground stations is growing.

“The good thing with orbital data centers is that you can start with small servers and gradually increase and build up larger data centers,” says Durand. “You can use modularity. You can learn little by little and gradually develop industrial capacity in space. We have all the technology, and the demand for space-based data processing infrastructure is huge, so it makes sense to think about it.”

Smaller facilities probably won’t do much to offset the strain that terrestrial data centers are placing on the planet’s water and electricity, though. That vision of the future might take decades to come to fruition, some critics think—if it even gets off the ground at all. 

Fuel prices are soaring. Plastic could be next.

As the war in Iran continues to engulf the Middle East and the Strait of Hormuz stays closed, one of the most visible global economic ripple effects has been fossil-fuel prices. In particular, you can’t get away from news about the price of gasoline, which just topped an average of $4 a gallon in the US, its highest level since 2022.

But looking ahead, further consequences for the global economy could be looming in plastics. Plastics are made using petrochemicals, and the supply chain impacts of the oil bottleneck near Iran are starting to build up. 

Plastic production accounts for roughly 5% of global carbon dioxide emissions today. And our current moment shows just how embedded oil and gas products are in our lives. It goes far beyond their use for energy. 

As I write this, I’m wearing clothes that contain plastic fibers, typing on a plastic keyboard, and looking through the plastic lenses of my glasses. It’s hard to imagine what our world looks like without plastic. And in some ways, moving away from fossil-derived plastic could prove even more complicated than decarbonizing our energy system. 

Crude oil prices have been on a roller coaster in recent weeks and have now topped $100 a barrel.

Crude oil contains a huge range of hydrocarbons, and it’s typically refined by putting it through a distillation unit that separates the raw material into different fractions according to their boiling point. Those fractions then go on to be further processed into everything from jet fuel to asphalt binder. We’ve already seen the price spikes for some materials pulled out of crude oil, like gasoline and jet fuel.

Let’s zoom in on another component, naphtha. It can be added to gasoline and jet fuel to improve performance. It can also be used as a solvent or as a raw material to make plastics.

The Middle East currently accounts for about 20% of global naphtha production and supplies about 40% of the market in Asia, where prices are already up by 50% over the last month.

We’re starting to see these effects trickle down already. The price of polypropylene (which is made from naphtha and used for food containers, bottle caps, and even automotive parts) is climbing, especially in Asia.  

Typically, manufacturers have a bit of stock built up, but that’ll be exhausted soon, likely in the coming weeks. The largest supplier of water bottles in India recently announced that it would raise prices by 11% after its packaging costs went up by over 70%, according to reporting from Reuters. Toys could be more expensive this holiday season as manufacturers grapple with supply chain concerns.

Americans will likely feel these ripples especially hard if disruptions continue. The average US resident used over 250 kilograms of new plastics in 2019, according to a 2022 report from the Organization for Economic Cooperation and Development. That’s an absolutely massive number—the global average is just 60 kilograms.

The effects of higher prices for both fuels and feedstocks could compound and multiply, and alternatives aren’t widely available. Bio-based plastics made with materials like plant sugars exist, but they still make up a vanishingly tiny portion of the market. As of 2025, global plastics production totaled over 431 million metric tons per year. Bio-based and biodegradable plastics made up about 0.5% of that, a share that could reach 1% by 2030.

Bio-based plastics are much more expensive than their fossil-derived counterparts. And many are made using agricultural raw materials, so scaling them up too much could be harmful for the environment and might compete with other industries like food production.

Recycling isn’t the easy answer either. Mechanical recycling is the current standard method used for materials like the plastics that make up water bottles and disposable coffee cups. But that degrades the materials over time, so they can’t be used infinitely. Chemical recycling has its own host of issues—the facilities that do it can be highly polluting, and today most of the plastic that goes into advanced recycling plants doesn’t actually end up in new plastics.

There’s been a lot of talk in recent weeks about how this energy crisis is going to push the world more toward renewable energy. Solar panels, electric vehicles, and batteries could suddenly become more attractive as we face the drastic consequences of a disruption in the global fossil-fuel supply.

But when it comes to plastic, the future looks far more complicated. Even though the plastics industry is facing much the same disruptions as the energy sector, there aren’t the same obvious alternatives available for a transition. Our lives are tied up in plastic, with uses ranging from the essential (like medical equipment) to the mundane (my to-go coffee cup). Soon, our economy could feel the effects of just how much we rely on fossil-derived plastics, and how hard it’s going to be to replace them. 

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

The gig workers who are training humanoid robots at home

When Zeus, a medical student living in a hilltop city in central Nigeria, returns to his studio apartment from a long day at the hospital, he turns on his ring light, straps his iPhone to his forehead, and starts recording himself. He raises his hands in front of him like a sleepwalker and puts a sheet on his bed. He moves slowly and carefully to make sure his hands stay within the camera frame. 

Zeus is a data recorder for Micro1, a US company based in Palo Alto, California, that collects real-world data to sell to robotics companies. As companies like Tesla, Figure AI, and Agility Robotics race to build humanoids—robots designed to resemble and move like humans in factories and homes—videos recorded by gig workers like Zeus are becoming the hottest new way to train them. 

Micro1 has hired thousands of contract workers in more than 50 countries, including India, Nigeria, and Argentina, where swathes of tech-savvy young people are looking for jobs. They’re mounting iPhones on their heads and recording themselves folding laundry, washing dishes, and cooking. The job pays well by local standards and is boosting local economies, but it raises thorny questions around privacy and informed consent. And the work can be challenging at times—and weird.

Zeus found the job in November, when people started talking about it everywhere on LinkedIn and YouTube. “This would be a real nice opportunity to set a mark and give data that will be used to train robots in the future,” he thought. 

Zeus is paid $15 an hour, a good income in Nigeria’s strained economy, where unemployment runs high. But as a bright-eyed student dreaming of becoming a doctor, he finds ironing his clothes for hours every day boring. 

“I really [do] not like it so much,” he says. “I’m the kind of person that requires … a technical job that requires me to think.” 

Zeus, and all the workers interviewed by MIT Technology Review, asked to be referred to only by pseudonyms because they were not authorized to talk about their work.

Humanoid robots are notoriously hard to build because manipulating physical objects is a difficult skill to master. But the rise of large language models underlying chatbots like ChatGPT has inspired a paradigm shift in robotics. Just as large language models learned to generate words by being trained on vast troves of text scraped from the internet, many researchers believe that humanoid robots can learn to interact with the world by being trained on massive amounts of movement data. 

Editor’s note: In a recent poll, MIT Technology Review readers selected humanoid robots as the 11th breakthrough for our 2026 list of 10 Breakthrough Technologies.

Robotics requires far more complex data about the physical world, though, and that is much harder to find. Virtual simulations can train robots to perform acrobatics, but not how to grasp and move objects, because simulations struggle to model physics with perfect accuracy. For robots to work in factories and serve as housekeepers, real-world data, however time-consuming and expensive to collect, may be what we need. 

Investors are pouring money feverishly into solving this challenge, spending over $6 billion on humanoid robots in 2025. And at-home data recording is becoming a booming gig economy around the world. Data companies like Scale AI and Encord are recruiting their own armies of data recorders, while DoorDash pays delivery drivers to film themselves doing chores. And in China, workers in dozens of state-owned robot training centers wear virtual-reality headsets and exoskeletons to teach humanoid robots how to open a microwave and wipe down the table. 

“There is a lot of demand, and it’s increasing really fast,” says Ali Ansari, CEO of Micro1. He estimates that robotics companies are now spending more than $100 million each year to buy real-world data from his company and others like it.

A day in the life

Workers at Micro1 are vetted by an AI agent named Zara that conducts interviews and reviews samples of chore videos. Every week, they submit videos of themselves doing chores around their homes, following a list of instructions about things like keeping their hands visible and moving at natural speed. The videos are reviewed by both AI and a human and are either accepted or rejected. They’re then annotated by AI and a team of hundreds of humans who label the actions in the footage.

Because this approach to training robots is in its infancy, it’s not clear yet what makes good training data. Still, “you need to give lots and lots of variations for the robot to generalize well for basic navigation and manipulation of the world,” says Ansari.

But many workers say that creating a variety of “chore content” in their tiny homes is a challenge. Zeus, a scrappy student living in a humble studio, struggles to record anything beyond ironing his clothes every day. Arjun, a tutor in Delhi, India, takes an hour to make a 15-minute video because he spends so much time brainstorming new chores.

“How much content [can be made] in the home? How much content?” he says. 

There’s also the sticky question of privacy. Micro1 asks workers not to show their faces to the camera or reveal personal information such as names, phone numbers, and birth dates. Then it uses AI and human reviewers to remove anything that slips through. 

But even without faces, the videos capture an intimate slice of workers’ lives: the interiors of their homes, their possessions, their routines. And understanding what kind of personal information they might be recording while they’re busy doing chores on camera can be tricky. Reviews of such footage might not filter out sensitive information beyond the most obvious identifiers.

For workers with families, keeping private life off camera is a constant negotiation. Arjun, a father of two daughters, has to wrangle his chaotic two-year-old out of frame. “Sometimes it’s very difficult to work because my daughter is small,” he says. 

Sasha, a banker turned data recorder in Nigeria, tiptoes around when she hangs her laundry outside in a shared residential compound so she won’t record her neighbors, who watch her in bewilderment.

While the workers interviewed by MIT Technology Review understand that their data is being used to train robots, none of them know how exactly their data will be used, stored, and shared with third parties, including the robotics companies that Micro1 is selling the data to. For confidentiality reasons, says Ansari, Micro1 doesn’t name its clients or disclose to workers the specific nature of the projects they are contributing to.

“It is important that if workers are engaging in this, that they are informed by the companies themselves of the intention … where this kind of technology might go and how that might affect them longer term,” says Yasmine Kotturi, a professor of human-centered computing at the University of Maryland.

Occasionally, some workers say, they’ve seen other workers asking on the company Slack channel if the company could delete their data. Micro1 declined to comment on whether such data is deleted.

“People are opting into doing this,” says Ansari. “They could stop the work at any time.”

Hungry for data

With thousands of workers doing their chores differently in different homes, some roboticists wonder if the data collected from them is reliable enough to train robots safely. 

“How we conduct our lives in our homes is not always right from a safety point of view,” says Aaron Prather, a roboticist at ASTM International. “If those folks are teaching those bad habits that could lead to an incident, then that’s not good data.” And the sheer volume of data being collected makes reviewing it for quality control challenging. But Ansari says the company rejects videos showing unsafe ways of performing a task, while clumsy movements can be useful to teach robots what not to do.

Then there’s the question of how much of this data we need. Micro1 says it has tens of thousands of hours of footage, while Scale AI announced it had gathered more than 100,000 hours.

“It’s going to take a long time to get there,” says Ken Goldberg, a roboticist at the University of California, Berkeley. Large language models were trained on text and images that would take a human 100,000 years to read, and humanoid robots may need even more data, because controlling robotic joints is even more complicated than generating text. “It’s going to take longer than people think,” he says.

When Dattu, an engineering student living in a bustling tech hub in India, comes home after a full day of classes at his university, he skips dinner and dashes to his tiny balcony, crammed with potted plants and dumbbells. He straps his iPhone to his forehead and records himself folding the same set of clothes over and over again. 

His family stares at him quizzically. “It’s like some space technology for them,” he says. When he tells his friends about his job, “they just get astounded by the idea that they can get paid by recording chores.”

Juggling his university studies with data recording, as well as other data annotation gigs, takes a toll on him. Still, “it feels like you’re doing something different than the whole world,” he says. 

AI benchmarks are broken. Here’s what we need instead.

For decades, artificial intelligence has been evaluated through the question of whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks. 

This framing is seductive: An AI vs. human comparison on isolated problems with clear right or wrong answers is easy to standardize, compare, and optimize. It generates rankings and headlines. 

But there’s a problem: AI is almost never used in the way it is benchmarked. Although researchers and industry have started to improve benchmarking by moving beyond static tests to more dynamic evaluation methods, these innovations resolve only part of the issue. That’s because they still evaluate AI’s performance outside the human teams and organizational workflows where its real-world performance ultimately unfolds. 

While AI is evaluated at the task level in a vacuum, it is used in messy, complex environments where it usually interacts with more than one person. Its performance (or lack thereof) emerges only over extended periods of use. This misalignment leaves us misunderstanding AI’s capabilities, overlooking systemic risks, and misjudging its economic and social consequences.

To mitigate this, it’s time to shift from narrow methods to benchmarks that assess how AI systems perform over longer time horizons within human teams, workflows, and organizations. I have studied real-world AI deployment since 2022 in small businesses and health, humanitarian, nonprofit, and higher-education organizations in the UK, the United States, and Asia, as well as within leading AI design ecosystems in London and Silicon Valley. I propose a different approach, which I call HAIC benchmarks: Human–AI, Context-Specific Evaluation.

What happens when AI fails 

For governments and businesses, AI benchmark scores appear more objective than vendor claims. They’re a critical part of determining whether an AI model or application is “good enough” for real-world deployment. Imagine an AI model that achieves impressive technical scores on the most cutting-edge benchmarks—98% accuracy, groundbreaking speed, compelling outputs. On the strength of these results, organizations may decide to adopt the model, committing sizable financial and technical resources to purchasing and integrating it. 

But then, once it’s adopted, the gap between benchmark and real-world performance quickly becomes visible. For example, take the swathe of FDA-approved AI models that can read medical scans faster and more accurately than an expert radiologist. In the radiology units of hospitals from the heart of California to the outskirts of London, I witnessed staff using highly ranked radiology AI applications. Repeatedly, it took them extra time to interpret AI’s outputs alongside hospital-specific reporting standards and nation-specific regulatory requirements. What appeared as a productivity-enhancing AI tool when tested in a vacuum introduced delays in practice. 

It soon became clear that the benchmark tests on which medical AI models are assessed do not capture how medical decisions are actually made. Hospitals rely on multidisciplinary teams—radiologists, oncologists, physicists, nurses—who jointly review patients. Treatment planning rarely hinges on a static decision; it evolves as new information emerges over days or weeks. Decisions often arise through constructive debate and trade-offs between professional standards, patient preferences, and the shared goal of long-term patient well-being. No wonder even highly scored AI models struggle to deliver the promised performance once they encounter the complex, collaborative processes of real clinical care.

The same pattern emerges in my research across other sectors: When embedded within real-world work environments, even AI models that perform brilliantly on standardized tests don’t perform as promised. 

When high benchmark scores fail to translate into real-world performance, even the most highly scored AI is soon abandoned to what I call the “AI graveyard.” The costs are significant: Time, effort and money end up being wasted. And over time, repeated experiences like this erode organizational confidence in AI and—in critical settings such as health—may erode broader public trust in the technology as well. 

When current benchmarks provide only a partial and potentially misleading signal of an AI model’s readiness for real-world use, this creates regulatory blind spots: Oversight is shaped by metrics that do not reflect reality. It also leaves organizations and governments to shoulder the risks of testing AI in sensitive real-world settings, often with limited resources and support. 

How to build better tests 

To close the gap between benchmark and real-world performance, we must pay attention to the actual conditions in which AI models will be used. The critical questions: Can AI function as a productive participant within human teams? And can it generate sustained, collective value? 

Through my research on AI deployment across multiple sectors, I have seen a number of organizations already moving—deliberately and experimentally—toward the HAIC benchmarks I favor. 

HAIC benchmarks reframe current benchmarking in four ways: 

1. From individual and single-task performance to team and workflow performance (shifting the unit of analysis)

2. From one-off testing with right/wrong answers to long-term impacts (expanding the time horizon)

3. From correctness and speed to organizational outcomes, coordination quality, and error detectability (expanding outcome measures)

4. From isolated outputs to upstream and downstream consequences (system effects)

Across the organizations where this approach has emerged and started to be applied, the first step is shifting the unit of analysis. 

For example, in one UK hospital system in the period 2021–2024, the question expanded from whether a medical AI application improves diagnostic accuracy to how the presence of AI within the hospital’s multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital specifically assessed coordination and deliberation in human teams using and not using AI. Multiple stakeholders (within and outside the hospital) decided on metrics like how AI influences collective reasoning, whether it surfaces overlooked considerations, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices. 

This shift is fundamental, especially in high-stakes contexts where system-level effects matter more than task-level accuracy. It also matters for the economy. It may help recalibrate inflated expectations of sweeping productivity gains that are so far predicated largely on the promise of improving individual task performance. 

Once that foundation is set, HAIC benchmarking can begin to take on the element of time. 

Today’s benchmarks resemble school exams—one-off, standardized tests of accuracy. But real professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously inside real workflows, under supervision, with feedback loops and accountability structures. Performance is judged over time and in a specific context, because competence is relational. If AI systems are meant to operate alongside professionals, their impact should be judged longitudinally, reflecting how performance unfolds over repeated interactions. 

I saw this aspect of HAIC applied in one of my humanitarian-sector case studies. Over 18 months, an AI system was evaluated within real workflows, with particular attention to how detectable its errors were—that is, how easily human teams could identify and correct them. This long-term “record of error detectability” meant the organizations involved could design and test context-specific guardrails to promote trust in the system, despite the inevitability of occasional AI mistakes.
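
As a rough illustration (mine, not drawn from the case study), a record of error detectability can be as simple as a log of AI errors paired with whether the team caught them, tracked over months of deployment:

```python
from dataclasses import dataclass

@dataclass
class ErrorEvent:
    month: int    # months since deployment
    caught: bool  # did the human team detect and correct the error?

def catch_rate(log: list[ErrorEvent], month: int) -> float:
    """Share of that month's AI errors the team caught in time."""
    monthly = [e for e in log if e.month == month]
    return sum(e.caught for e in monthly) / len(monthly) if monthly else 1.0

# Invented numbers: the team catches half of the errors in month 1 and
# two-thirds of them a year in, evidence that its guardrails are
# maturing in a way no one-off benchmark score could show.
log = [ErrorEvent(1, False), ErrorEvent(1, True),
       ErrorEvent(12, True), ErrorEvent(12, True), ErrorEvent(12, False)]
print(catch_rate(log, 1), round(catch_rate(log, 12), 2))  # 0.5 0.67
```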

A longer time horizon also makes visible the system-level consequences that short-term benchmarks miss. An AI application may outperform a single doctor on a narrow diagnostic task yet fail to improve multidisciplinary decision-making. Worse, it may introduce systemic distortions: anchoring teams too early in plausible but incomplete answers, adding to people’s cognitive workloads, or generating downstream inefficiencies that offset any speed or efficiency gains at the point of the AI’s use. These knock-on effects—often invisible to current benchmarks—are central to understanding real impact. 

The HAIC approach, admittedly, promises to make benchmarking more complex, resource-intensive, and harder to standardize. But continuing to evaluate AI in sanitized conditions detached from the world of work will leave us misunderstanding what it truly can and cannot do for us. To deploy AI responsibly in real-world settings, we must measure what actually matters: not just what a model can do alone, but what it enables—or undermines—when humans and teams in the real world work with it.

 Angela Aristidou is a professor at University College London and a faculty fellow at the Stanford Digital Economy Lab and the Stanford Human-Centered AI Institute. She speaks, writes, and advises about the real-life deployment of artificial-intelligence tools for public good.

Inside the stealthy startup that pitched brainless human clones

After operating in secrecy for years, a startup company called R3 Bio, in Richmond, California, suddenly shared details about its work last week—saying it had raised money to create nonsentient monkey “organ sacks” as an alternative to animal testing.

In an interview with Wired, R3 listed three investors: billionaire Tim Draper, the Singapore-based fund Immortal Dragons, and life-extension investors LongGame Ventures.

But there is more to the story. And R3 doesn’t want that story told.

MIT Technology Review discovered that the stealth startup’s founder John Schloendorn also pitched a startling, medically graphic, and ethically charged vision for what he’s called “brainless clones” to serve the role of backup human bodies.

Imagine it like this: a baby version of yourself with only enough of a brain structure to be alive in case you ever need a new kidney or liver.

Or, alternatively, he has speculated, you might one day get your brain placed into a younger clone. That could be a way to gain a second lifespan through a still hypothetical procedure known as a body transplant.

The fuller context of R3’s proposals, as well as the activities of another stealth startup with related goals, has not previously been reported. They’ve been kept secret by a circle of extreme life-extension proponents who fear that their plans for immortality could be derailed by clickbait headlines and public backlash.

And that’s because the idea can sound like something straight from a creepy science fiction film. One person who heard R3’s clone presentation, and spoke on the condition of anonymity, was left reeling by its implications and shaken by Schloendorn’s enthusiastic delivery. The briefing, this person said, was like a “close encounter of the third kind” with “Dr. Strangelove.”

A key inspiration for Schloendorn is a birth defect in which children are born missing most of their cortical hemispheres; he’s shown people medical scans of these kids’ nearly empty skulls as evidence that a body can live without much of a brain. 

And he’s talked about how to grow a clone. Since artificial wombs don’t exist yet, brainless bodies can’t be grown in a lab. So he’s said the first batch of brainless clones would have to be carried by women paid to do the job. In the future, though, one brainless clone could give birth to another.

Last Monday, the same day it announced itself to the world in Wired, R3 sent us a sweeping disavowal of our findings. It said Schloendorn “never made any statement regarding hypothetical ‘non-sentient human clones’ [that] would be carried by surrogates.” The most overarching of these challenges was its insistence that “any allegations of intent or conspiracy to create human clones or humans with brain damage are categorically false.”

But even Schloendorn and his cofounder, Alice Gilman, can’t seem to keep away from the topic. Just last September, the pair presented at Abundance Longevity, a $70,000-per-ticket event in Boston organized by the anti-aging promoter Peter Diamandis. Although the presentation to about 40 people was not recorded and was meant to be confidential, a copy of the agenda for the event shows that Schloendorn was there to outline his “final bid to defeat aging” in a session called “Full Body Replacement.”

According to a person who was there, both animal research and personal clones for spare organs were discussed. During the presentation, Gilman and Schloendorn even stood in front of an image of a cloning needle. Pressed on whether this was a talk about brainless clones, Gilman told us that while R3’s current business is replacing animal models, “the team reserves the right to hold hypothetical futuristic discussions.”

MIT Technology Review found no evidence that R3 has cloned anyone, or even any animal bigger than a rodent. What we did find were documents, additional meeting agendas, and other sources outlining a technical road map for what R3 called “body replacement cloning” in a 2023 letter to supporters. That road map involved improvements to the cloning process and genetic wiring diagrams for how to create animals without complete brains. 

[Image: A child with hydranencephaly, a rare condition in which most of the brain is missing. Could a human clone also be created without much of a brain as an ethical source of spare organs? Credit: Dimitri Agamanolis, M.D., via Wikipedia]

A main purpose of the fundraising, investors say, was to support efforts to try these techniques in monkeys from a base in the Caribbean. That offered a path to a nearer-term business plan for more ethical medical experiments and toxicology testing—if the company could develop what it now calls monkey “organ sacks.” However, this work would clearly inform any possible human version. 

Though he holds a PhD, Schloendorn is a biotech outsider who has published little and is best known for having once outfitted a DIY lab in his Bay Area garage. Still, his ties to the experimental fringe of longevity science have earned him a network in Silicon Valley and allies at a risk-taking US health innovation agency, ARPA-H. Together with his success at raising money from investors, this signals that the brainless-clone concept should be taken seriously by a wider community of scientists, doctors, and ethicists, some of whom expressed grave concerns. 

“It sounds crazy, in my opinion,” said Jose Cibelli, a researcher at Michigan State University, after MIT Technology Review described R3’s brainless-clone idea to him. “How do you demonstrate safety? What is safety when you’re trying to create an abnormal human?”

Twenty-five years ago, Cibelli was among the first scientists to try to clone human embryos, but he was trying to obtain matched stem cells, not make a baby. “There is no limit to human imagination and ways to make money, but there have to be boundaries,” he says. “And this is the boundary of making a human being who is not a human being.” 

“Feasibility research”

Since Dolly the sheep was born in 1996, researchers have cloned dogs, cats, camels, horses, cattle, ferrets, and other species of mammal. Injecting a cell from an existing animal into an egg creates a carbon-copy embryo that can develop, although not always without problems. Defects, deformities, and stillbirths remain common. 

Those grave risks are why we’ve never heard of a human clone, even though it’s theoretically possible to create one. 

But brainless clones flip the script. That’s because the ultimate aim is to create not a healthy person but an unconscious body that would probably need life support, like a feeding tube, to stay alive. Because this body would share the DNA of the person being copied, its organs would be a near-perfect immunological match. 

Backers of this broad concept argue that a nonsentient body would be ethically acceptable to harvest organs from. Some also believe that swapping in fresh, young body parts—known as “replacement”—is the likeliest path to life extension, since so far no drug can reverse aging. 

And then there’s the idea of a complete body transplant. “Certainly, for the cryonics patients, that sounds like something really promising,” says Anders Sandberg, a prominent Swedish transhumanist and expert in the ethics of future technologies. He notes that many people who opt to be stored in cryonic chambers after death choose the less expensive “head only” option, so “there might be a market for having an extra cloned body.”

MIT Technology Review first approached Schloendorn two years ago after learning he’d led a confidential online seminar called the Body Replacement Mini Conference, in which he presented “recent lab progress towards making replacement bodies.” 

According to a copy of the agenda, that 2023 session also included a presentation by a cloning expert, Young Gie Chung. And there was another from Jean Hébert, who was then a professor at the Albert Einstein College of Medicine and is now a program manager at ARPA-H, where he oversees a project to use stem cells to restore damaged brain tissue. Hébert popularized the so-called replacement solution to avoiding death in a 2020 book called Replacing Aging.

In an interview prior to joining the government in 2024, Hébert described an informal but “very collaborative” relationship with Schloendorn. The overall idea was that to stop aging, one of them would determine how to repair a brain, while the other would figure out how to create a body without one. “It’s a perfect match, right? Body, brain,” Hébert told MIT Technology Review at the time. 

Schloendorn, by working outside the mainstream, had the huge advantage of “not being bound by getting the next paper out, or the next grant,” Hébert said, adding, “It’s such a wonderful way of doing research. It’s just clean and pure.” R3 now appears on the ARPA-H website on a list of prospective partners for Hébert’s program.

In a LinkedIn exchange with MIT Technology Review that same year, Schloendorn described his work as “feasibility research in body replacement.”

“We will try to do it in a way that produces defined societal benefits early on, and we need to be prepared to take no for an answer, if it turns out that this cannot be done safely,” Schloendorn wrote at the time. He declined an interview then, saying that before exiting stealth mode, he wanted to be sure the benefits are “reasonably grounded in reality.”

That could prove challenging. While body-part replacement sounds logical, like swapping the timing belt on an old car, in reality there’s scant evidence that receiving organs from a younger twin would make you live any longer. 

A complete body transplant, meanwhile, would probably be fatal, at least with current techniques. In the latest test of the concept, published last July, Russian surgeons removed a pig’s head and then sewed it back on. The animal did live—breathing weakly and lapping water from a syringe. But because its spinal cord had been cut, it was otherwise totally paralyzed. (As yet, there’s no proven method to rejoin a severed spinal cord.) In an act of mercy, the doctors ended the pig’s life after about 12 hours. 

Even some of R3’s investors say the endeavor is a risky, low-odds project, on par with colonizing Mars. Boyang Wang, head of Immortal Dragons, has spoken at longevity conferences about body-swapping technology, referring to the chance that “when the time comes, you can transplant your brain into a new body.” Wang confirmed in a January Zoom call that he’d been referring to R3 and that he invested $500,000 in the company during a 2024 fundraising round.

But since making his investment, Wang says, he’s become less bullish. He now views whole-body transplant as “very infeasible, not even very scientific” and “far away from hope for any realistic application.” 

Still, he says, the investment in R3 fits with his philosophy of making unorthodox bets that could be breakthroughs against aging. “What can really move the needle?” he asks. “Because time is running out.”

Stealth mode

Clonal bodies sit at the extreme frontier of an advancing cluster of technologies all aimed at growing spare parts. Researchers are exploring stem cells, synthetic embryos, and blob-like organoids, and some companies are cloning genetically engineered pigs whose kidneys and hearts have already been transplanted into a few patients. Each of these methods seeks to harness development—the process by which animal bodies naturally form in the womb—to grow fully functional organs. 

There’s even a growing cadre of mainstream scientists who say nonsentient bodies could solve the organ shortage, if they could be grown through artificial means. Two Stanford University professors, calling these structures “bodyoids,” published an editorial in MIT Technology Review last year in favor of manufacturing spare human bodies. While that editorial left many details to the imagination, they called the idea “at least plausible—and possibly revolutionary.”

“There are a lot of variations on this where they’re trying to find a socially acceptable form,” says George Church, a Harvard University professor who advises startups in the field. But Church says gestating an entire body is probably taking things too far, especially since nearly all patients on transplant lists are waiting for just a single organ, like a heart or kidney. 

“There’s almost no scenario where you need a whole body,” he says. “I just think even if it’s someday acceptable, it’s not a good place to start.” For the moment, Church says, brainless human bodies are “not very useful, in addition to being repulsive.”

That’s arguably why body replacement technology still feels risky to talk about, even among life-extension enthusiasts who are otherwise ready to inject Chinese peptides or have their bodies cryogenically frozen. “I think it’s exciting or interesting from a scientific perspective, but I think the world is not fully ready for it yet,” says Emil Kendziorra, CEO of Tomorrow Bio, a company in Berlin that stores bodies at -196 °C in the hope they can be restored to life in the future. 

“Everybody’s like, yeah, you know, cryopreservation makes total sense,” he says. “And then you talk about total body replacement. And then everybody’s like, Whoa, whoa, whoa.”

Even so, “replacement” technology has found a fervent base of support among a group of self-described “hardcore” longevity adherents who follow a philosophy called Vitalism, which holds that society should redirect resources toward achieving unlimited lifespans. The growing influence of this movement, achieved through lobbying, investment, recruiting, and public messaging, was detailed earlier this year in MIT Technology Review.

Last spring, during a meetup for this community, Kendziorra was among the attendees at an invite-only “Replacement Day” gathering that took place off the public schedule. It was where more radical ideas could be discussed freely, since to some in the Vitalist circle, replacing body parts has emerged as the most plausible, least expensive way to beat death. 

At least that was the conclusion of a road map for anti-aging technology produced by one Vitalist group, the Longevity Biotech Fellowship, which reckoned that a proof-of-concept human clone lacking a neocortex would cost $40 million to create—a tiny amount, relatively speaking. 

Its report cited the existence of two stealth companies working on cloning whole nonsentient bodies, although it took care not to name them. If these companies’ activities become public, “there will be a huge backlash—people will hate it,” the entrepreneur Kris Borer said while presenting the road map at a French resort last August. 

“There are a ton of dystopian movies and novels about this kind of stuff. That is why I didn’t talk about any of the companies working on it. They are trying to hide from public attention,” he said. “We have to have the angel investors and other people invest kind of in secret until things are ready.” 

Borer did say what he sees as the best way to go public: first, to slowly ease body replacement into society’s awareness by disclosing more limited aims, which will be palatable. “We are not going to start with Let’s clone you and give you a body. We are going to start with Let’s solve the organ shortage,” he said. “Eventually people will warm up to it, and then we can go to the more hardcore stuff.”

In an interview earlier this month, Borer declined to name the companies involved in his immortality road map, or to say if R3 is one of them. But we did identify one additional stealthy startup, this one focused on replacing a person’s internal organs, not the whole body. Called Kind Biotechnology, it is a New Hampshire–based company headed by the anti-aging researcher Justin Rebo, a sometime collaborator of Schloendorn’s.

Fig 13 from a patent application
A patent image from Kind Biotechnology shows a mouse pup engineered to lack anatomical features (left) next to a normal animal. The company’s goal is to grow organ “sacks” with a “complete lack of ability to feel, think, or sense.”
WO2025260099 VIA WIPO

According to patent applications filed by the company, Rebo’s team is working to create animals with a “complete lack of ability to feel, think, or sense the environment.” Images included in the patents show mice the company produced that lack a complete brain, and others that don’t have faces or limbs. The team produced these animals by deleting genes in embryos with the gene-editing technology CRISPR, with the goal of creating a “sack of organs that grows mostly on its own,” supported by only a minimal nervous system. A cartoon rendering submitted to the patent office shows what looks like a fleshy duffel bag connected to life support tubes.

In an email, Rebo said his company is working on an “ethical and scalable” way to create animal organs for experimental transplant to humans. He notes that “thousands die while waiting” for an organ. 

Some of Kind’s patent applications do cover the possibility of producing these organ sacks from human cells, though Rebo says that’s more of a speculative possibility. But he does see his work as part of the “replacement” approach to longevity. For one thing, a “scalable production of young, high-quality organs” would let surgeons try transplants in more types of patients, including many older people with heart disease who aren’t candidates for a transplant now.

“With abundant high-quality organs, replacement could become a direct form of rejuvenation by replacement of failing parts,” he says. 

And Rebo imagines that simultaneously replacing multiple internal organs (grown together in the sack) could have even broader rejuvenating effects. “Ultimately, replacing failing parts is a direct path to extending healthy human lifespan,” he says. 

Church, who agreed earlier this year to advise Kind Bio, sees this work as part of an effort to “nudge” these technologies “toward something that is more useful and more acceptable from the get-go,” he says. “And then let’s see how society responds to that—rather than jumping to the most repulsive and most useless form, which some of them seem to be aiming for.” 

“There’s one way to find out”

People who know Schloendorn describe a dynamo-like presence who is “100% dedicated” to the goal of extreme life extension. In 2006, he penned a paper in a bioethics journal outlining why the “desire to live forever” is rational, and his doctoral research at the University of Arizona was sponsored by a longevity research organization called the SENS Foundation.  

He’s also well connected. In an interview, Aubrey de Grey, the influential and controversial fundraiser and prognosticator who cofounded SENS, called Schloendorn “one of my protégés.” And around 2010, Peter Thiel reportedly invested $1.5 million in ImmunePath, a company started by Schloendorn to develop stem-cell treatments, though it soon failed. (A representative for Thiel did not respond to a request to confirm the figure.)

By 2021, Schloendorn had moved on, founding R3 Biotechnologies. He began to circulate the body replacement idea and discuss a step-by-step scheme to get there: assess techniques in the lab first, then in monkeys, and maybe eventually in humans. 

A 2023 “letter to stakeholders” signed by Schloendorn begins by saying that “body replacement cloning will require multicomponent genetic engineering on a scale that has never been attempted in primates.” Fortunately, it adds, molecular techniques for “brain knockout” are well known in mice and should also be expected to function in “birthing whole primates,” a class that includes both monkeys and humans. 

Would it work? “There’s one way to find out,” the letter says. 

Wang, the investor at Immortal Dragons, says he put money into R3 after it showed him it is possible to create mice without complete brains. “There were imperfections, but the resulting mice survived, grew up, and to me, that is a pretty strong experiment,” he says; it was evidence enough for him to fund R3’s attempt to “replicate the result in primates.” 

(In its emailed statement, R3 said the company and its founders “never produced any degree of brain alterations in any species, did not attempt to do so, did not hire another party to do so, and have no specific plans to do so in the future.” It added: “We do not work with live non-human primates.”) 

The bigger technical obstacle, though, remains the cloning. Out of 100 attempts to clone an animal, only a few typically succeed. That inefficiency alone makes cloning a human—or a monkey—all but infeasible.

But R3 does seem to have made an effort to tackle the efficiency problem. In one document reviewed by MIT Technology Review, it claims to have implemented improvements to the basic procedure in rodents, referencing a protein, called a histone demethylase, that helps erase a cell’s genetic memory. Adding it can greatly increase the chance that the cell will form a cloned embryo after being injected into an egg in the lab.

Such molecules were used in the first successful cloning of a monkey, which occurred in 2018 in China. But it still wasn’t easy—in fact, it was a huge and costly effort to handle a crowd of monkeys in estrus and perform IVF on them. According to Michigan State’s Cibelli, monkey cloning remains nearly impossible, at least on US territory, simply because it’s “unaffordable.”

Nevertheless, success in monkeys did help prove, at least biologically, that human reproductive cloning could be possible. 

The company may also have tried to tackle a second long-standing obstacle to cloning: defects in how the placenta works. Because of such problems, some cloned animals die shortly after birth.

The R3 document refers to a “birthing fix” it developed to further improve the cloning success rate. While MIT Technology Review didn’t learn what R3’s process entails, we found a reference to it on the LinkedIn page of Maitriyee Mahanta, a scientist who cosigned the 2023 letter to R3 stakeholders and is a former research assistant to Hébert. (We were unable to reach Mahanta for comment.)

Her page described her current role as “molecular lead” studying cloning, “birth rate fixing,” and cortical development using cells from nonhuman primates. Her job affiliation is given as the Longevity Escape Velocity Foundation, a nonprofit where de Grey is the president and chief science officer. But de Grey says his foundation only arranged a work visa for Mahanta as part of a partnership “with the company she actually spends her time at.”

Like several other people interviewed for this article, de Grey made a resourceful effort to avoid directly confirming the existence of R3 when we spoke, while at the same time freely discussing theoretical aspects of body cloning technology. For instance, he talked about ways to shorten the wait for your double to grow up to a size suitable for organ harvesting; a further genetic mutation could be added to cause “central precocious puberty” in the clone, he said. This condition causes a growth spurt, even pubic hair, in a toddler. 

Cloning dictators

Who would clone a body and pay to keep it alive for years, until it’s needed? The first customers for this costly technology (if it ever proves feasible) would likely be the ultra-rich or the ultra-powerful. 

Indeed, somehow the world’s top dictators seem to have gotten the memo about replacement parts. In September, a hot mic picked up a conversation between Russian president Vladimir Putin and Chinese leader Xi Jinping as they walked through Beijing with North Korean autocrat Kim Jong Un; in the exchange, the Russian speculated on life extension.  

“Biotechnology is continuously developing. Human organs can be continuously transplanted. The longer you live, the younger you become, and [you can] even achieve immortality,” Putin said through an interpreter.

“Some predict that in this century, humans will live to 150 years old,” Xi responded agreeably.

How the leaders learned of these possibilities is unknown. But scenarios involving dictators are a constant topic among body replacement enthusiasts. 

“There are companies working on this. They are in stealth—we can’t reveal too much about them—but the general concept on this is if you didn’t have any ethical qualms, you could do most of it today,” Will Harborne, the chief investment officer of LongGame Advisors, said last year, during an interview with the podcaster Julian Issa. “If you were the dictator of some country and wanted a clone of yourself, you can already go grow one. You can create a cloned embryo of yourself, you can get a surrogate to carry it to term, and you can grow [a] body until age 18 with a brain, and eventually, if you were a dictator, you could kill them and try to transplant your head on their body.”

“And now no one is suggesting you do that—it’s very unethical—but most of the technology is there,” he said. He noted that the reason for removing the cortex of a clone created for such a purpose is that “we don’t want to kill other people to live forever.” 

Harborne subsequently confirmed to MIT Technology Review that his fund invested $1 million in R3 about a year and a half ago.

In order to make the body replacement process ethical, the clone’s brain needs to be stunted so it lacks consciousness. That is where the interest in birth defects comes in. Remarkable medical scans of kids with a rare condition, hydranencephaly, show a total absence of the cerebral hemispheres. Yet if they are cared for, they may be able to live into their 20s, even though they cannot speak or engage in purposeful movement. 

The technical question, then, is how to intentionally produce such a condition in a clone. Sandberg, the futurist, says he’s visited R3’s lab, talked to Gilman, and sat through a presentation about how genetic engineering can be used to shape brain growth. Previous work has shown that by adding a toxic gene, it is possible to kill specific cell types in a growing embryo but spare others, leading to a mouse without a neocortex.

While Sandberg isn’t an expert in biotechnology, he says R3’s theory looked sensible to him. “I think it’s possible to actually prevent the development of the brain well enough that you can say ‘Yeah, there is almost certainly no consciousness here,’” Sandberg says. “Hence, there can’t be any suffering, or any individual, in a practical sense.”

“I think the overall aim—actually, it looks ethically pretty good,” he says. 

Two monkeys with stuffed animals in a plastic research container
Monkeys were successfully cloned in China for the first time in 2018. Although it was a costly and difficult undertaking, the feat suggested human cloning is biologically possible.
QIANG SUN AND MU-MING POO/CHINESE ACADEMY OF SCIENCES VIA AP

Yet it could be difficult to really determine where consciousness starts and ends. Under current medical standards, taking the organs of people with hydranencephaly isn’t allowed because they don’t meet the standard of brain death: They have a functioning brain stem. An even more serious problem is evidence that the brain stem alone produces a basic form of consciousness. If that is so, says Bjorn Merker, a neuroscientist who surveyed caretakers of more than a hundred children with hydranencephaly, a plan “to harvest organs from organisms modeled on this condition would be unethical.”

Of course, the most extreme version of the replacement dream isn’t just to take organs. It’s to take over the body entirely. Sergio Canavero, a controversial Italian surgeon who has proposed head and brain transplants, says he was approached for advice by Schloendorn and others a few years ago. “They told me they were looking at a head transplant on a two- or three-year-old,” he says. “I stopped short. How could you even conceive of that? The biomechanical compatibility is not there. You have to wait until at least 14. And I would say 16. It was very clear to me these guys are not surgeons—they are biologists.” 

Canavero says he’s not opposed to cloning bodies for transplant—he thinks it could work. “But if you want to use a clone,” he says, “it must be a nonsentient clone. Otherwise it’s murder, a homicide.”    

MIT Technology Review has not found any evidence that R3 has yet created an “organ sack,” much less a brainless human clone. And there are many reasons to believe this hypothetical future of “full body replacement” will never come to pass—that it is just a live-forever fantasy.

“There are so many barriers,” says Cibelli. It’s a long list: Human cloning is illegal in many countries, it’s unsafe, and few competent experts would want, or dare, to participate. And then there’s the inconvenient fact that for now, there’s no way to grow a brainless clone to birth, except in a woman’s body. Think about it, Cibelli says: “You’d have to convince a woman to carry a fetus that is going to be abnormal.”

Sandberg agrees that is where things could start to get tricky. “The problem here, of course,” he says, “is that the yuck factor is magnificent.”

The Pentagon’s culture war tactic against Anthropic has backfired

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Last Thursday, a California judge temporarily blocked the Pentagon from labeling Anthropic a supply chain risk and ordering government agencies to stop using its AI. It’s the latest development in the month-long feud. And the matter still isn’t settled: The government was given seven days to appeal, and Anthropic has a second case against the designation that has yet to be decided. Until then, the company remains persona non grata with the government. 

The stakes in the case—how much the government can punish a company for not playing ball—were apparent from the start. Anthropic drew many senior supporters, some of them unlikely bedfellows, including former authors of President Trump’s AI policy.

But Judge Rita Lin’s 43-page opinion suggests that what is really a contract dispute never needed to reach such a frenzy. It did so because the government disregarded the established process for resolving such disputes and fueled the fire with social media posts from officials that would eventually contradict the positions it took in court. The Pentagon, in other words, wanted a culture war (on top of the actual war in Iran that began hours later).

The government used Anthropic’s Claude for much of 2025 without complaint, according to court documents, while the company walked a branding tightrope as a safety-focused AI company that also won defense contracts. Defense employees accessing it through Palantir were required to accept terms of a government-specific usage policy that Anthropic cofounder Jared Kaplan said “prohibited mass surveillance of Americans and lethal autonomous warfare” (Kaplan’s declaration to the court didn’t include details of the policy). Only when the government aimed to contract with Anthropic directly did the disagreements begin. 

What drew the ire of the judge is that when these disagreements became public, the government’s response had more to do with punishing Anthropic than with simply cutting ties. And it followed a pattern: tweet first, lawyer later.

President Trump’s post on Truth Social on February 27 referenced “Leftwing nutjobs” at Anthropic and directed every federal agency to stop using the company’s AI. This was echoed soon after by Defense Secretary Pete Hegseth, who said he’d direct the Pentagon to label Anthropic a supply chain risk. 

Doing so necessitates that the secretary take a specific set of actions, which the judge found Hegseth did not complete. Letters sent to congressional committees, for example, said that less drastic steps were evaluated and deemed not possible, without providing any further details. The government also said the designation as a supply chain risk was necessary because Anthropic could implement a “kill switch,” but its lawyers later had to admit it had no evidence of that, the judge wrote.

Hegseth’s post also stated that “No contractor, supplier, or partner that does business with the United States military may conduct any commercial activity with Anthropic.” But the government’s own lawyers admitted on Tuesday that the secretary doesn’t have the power to do that, and agreed with the judge that the statement had “absolutely no legal effect at all.”

The aggressive posts also led the judge to conclude that Anthropic was on solid ground in complaining that its First Amendment rights were violated. The government, the judge wrote while citing the posts, “set out to publicly punish Anthropic for its ‘ideology’ and ‘rhetoric,’ as well as its ‘arrogance’ for being unwilling to compromise those beliefs.”

Labeling Anthropic a supply chain risk would essentially be identifying it as a “saboteur” of the government, for which the judge did not see sufficient evidence. She issued an order last Thursday halting the designation, preventing the Pentagon from enforcing it and forbidding the government from fulfilling the promises made by Hegseth and Trump. Dean Ball, who worked on AI policy for the Trump administration but wrote a brief supporting Anthropic, described the judge’s order on Thursday as “a devastating ruling for the government, finding Anthropic likely to prevail on essentially all of its theories for why the government’s actions were unlawful and unconstitutional.”

The government is expected to appeal the decision. But Anthropic’s separate case, filed in DC, makes similar allegations. It just invokes a different provision of the law governing supply chain risks.

The court documents reveal a pretty clear pattern. Public statements made by officials and the president did not align at all with what the law says should happen in a contract dispute like this, and the government’s lawyers have consistently had to invent justifications, after the fact, for the social media lambasting of the company.

Pentagon and White House leadership knew that pursuing the nuclear option would spark a court battle; Anthropic vowed on February 27 to fight the supply chain risk designation days before the government formally filed it on March 3. Pursuing it anyway meant senior leadership was, to say the least, distracted during the first five days of the Iran war, launching strikes while also compiling evidence that Anthropic was a saboteur of the government, all while it could have cut ties with Anthropic by simpler means.

But even if Anthropic ultimately wins, the government has other means to shut the company out of government work. Defense contractors who want to stay on good terms with the Pentagon, for example, now have little reason to work with Anthropic even if it’s not flagged as a supply chain risk.

“I think it’s safe to say that there are mechanisms the government can use to apply some degree of pressure without breaking the law,” says Charlie Bullock, a senior research fellow at the Institute for Law and AI. “It kind of depends how invested the government is in punishing Anthropic.”

From the evidence thus far, the administration is committing top-level time and attention to winning an AI culture war. At the same time, Claude is apparently so important to its operations that even President Trump said the Pentagon needed six months to stop using it. The White House demands political loyalty and ideological alignment from top AI companies, but the case against Anthropic, at least for now, exposes the limits of its leverage.

If you have information about the military’s use of AI, you can share it securely via Signal (username jamesodonnell.22).

There are more AI health tools than ever—but how well do they work?


  • Demand is driving the boom: Microsoft, Amazon, and OpenAI have all launched consumer health AI tools in recent months, partly because people are already using general chatbots for medical advice at massive scale—Microsoft alone fields 50 million health questions daily.
  • Independent testing is lagging behind releases: Most experts agree these tools could genuinely help people who struggle to access care, but all six academic researchers interviewed raised concerns that products are going public before independent researchers can assess whether they’re actually safe.
  • Even good benchmarks have blind spots: Studies show that real users—lacking medical expertise—might not know how to get the answers they want from health chatbots, a gap that some lab-based evaluations may not catch.
  • The honest answer is still “we don’t know”: No one is demanding perfection from health AI, but without trusted third-party evaluation, it remains genuinely unclear whether today’s tools help more than they harm.


Earlier this month, Microsoft launched Copilot Health, a new space within its Copilot app where users will be able to connect their medical records and ask specific questions about their health. A couple of days earlier, Amazon had announced that Health AI, an LLM-based tool previously restricted to members of its One Medical service, would now be widely available. These products join the ranks of ChatGPT Health, which OpenAI released back in January, and Anthropic’s Claude, which can access user health records if granted permission. Health AI for the masses is officially a trend. 

There’s a clear demand for chatbots that provide health advice, given how hard it is for many people to get that advice through existing medical systems. And some research suggests that current LLMs are capable of making safe and useful recommendations. But researchers say that these tools should be more rigorously evaluated by independent experts, ideally before they are widely released.

In a high-stakes area like health, trusting companies to evaluate their own products could prove unwise, especially if those evaluations aren’t made available for external expert review. And even if the companies are doing quality, rigorous research—which some, including OpenAI, do seem to be—they might still have blind spots that the broader research community could help to fill.

“To the extent that you always are going to need more health care, I think we should definitely be chasing every route that works,” says Andrew Bean, a doctoral candidate at the Oxford Internet Institute. “It’s entirely plausible to me that these models have reached a point where they’re actually worth rolling out.”

“But,” he adds, “the evidence base really needs to be there.”

Tipping points 

To hear developers tell it, these health products are now being released because large language models have indeed reached a point where they can effectively provide medical advice. Dominic King, the vice president of health at Microsoft AI and a former surgeon, cites AI advancement as a core reason why the company’s health team was formed, and why Copilot Health now exists. “We’ve seen this enormous progress in the capabilities of generative AI to be able to answer health questions and give good responses,” he says.

But that’s only half the story, according to King. The other key factor is demand. Shortly before Copilot Health was launched, Microsoft published a report, and an accompanying blog post, detailing how people used Copilot for health advice. The company says it receives 50 million health questions each day, and health is the most popular discussion topic on the Copilot mobile app.

Other AI companies have noticed, and responded to, this trend. “Even before our health products, we were seeing just a rapid, rapid increase in the rate of people using ChatGPT for health-related questions,” says Karan Singhal, who leads OpenAI’s Health AI team. (OpenAI and Microsoft have a long-standing partnership, and Copilot is powered by OpenAI’s models.)

It’s possible that people simply prefer posing their health problems to a nonjudgmental bot that’s available to them 24-7. But many experts interpret this pattern in light of the current state of the health-care system. “There is a reason that these tools exist and they have a position in the overall landscape,” says Girish Nadkarni, chief AI officer at the Mount Sinai Health System. “That’s because access to health care is hard, and it’s particularly hard for certain populations.”

The virtuous vision of consumer-facing LLM health chatbots hinges on the possibility that they could improve user health while reducing pressure on the health-care system. That might involve helping users decide whether or not they need medical attention, a task known as triage. If chatbot triage works, then patients who need emergency care might seek it out earlier than they would have otherwise, and patients with milder concerns might feel comfortable managing their symptoms at home with the chatbot’s advice rather than needlessly crowding emergency rooms and doctors’ offices.

But a recent, widely discussed study from Nadkarni and other researchers at Mount Sinai found that ChatGPT Health sometimes recommends too much care for mild conditions and fails to identify emergencies. Though Singhal and some other experts have suggested that its methodology might not provide a complete picture of ChatGPT Health’s capabilities, the study has surfaced concerns about how little external evaluation these tools see before being released to the public.

Most of the academic experts interviewed for this piece agreed that LLM health chatbots could have real upsides, given how little access to health care some people have. But all six of them expressed concerns that these tools are being launched without testing from independent researchers to assess whether they are safe. While some advertised uses of these tools, such as recommending exercise plans or suggesting questions that a user might ask a doctor, are relatively harmless, others carry clear risks. Triage is one; another is asking a chatbot to provide a diagnosis or a treatment plan. 

The ChatGPT Health interface includes a prominent disclaimer stating that it is not intended for diagnosis or treatment, and the announcements for Copilot Health and Amazon’s Health AI include similar warnings. But those warnings are easy to ignore. “We all know that people are going to use it for diagnosis and management,” says Adam Rodman, an internal medicine physician and researcher at Beth Israel Deaconess Medical Center and a visiting researcher at Google.

Medical testing

Companies say they are testing the chatbots to ensure that they provide safe responses the vast majority of the time. OpenAI has designed and released HealthBench, a benchmark that scores LLMs on how they respond in realistic health-related conversations—though the conversations themselves are LLM-generated. When GPT-5, which powers both ChatGPT Health and Copilot Health, was released last year, OpenAI reported the model’s HealthBench scores: It did substantially better than previous OpenAI models, though its overall performance was far from perfect. 

But evaluations like HealthBench have limitations. In a study published last month, Bean—the Oxford doctoral candidate—and his colleagues found that even if an LLM can accurately identify a medical condition from a fictional written scenario on its own, a non-expert user who is given the scenario and asked to determine the condition with LLM assistance might figure it out only a third of the time. If they lack medical expertise, users might not know which parts of a scenario—or their real-life experience—are important to include in their prompt, or they might misinterpret the information that an LLM gives them.

Bean says that this performance gap could be significant for OpenAI’s models. In the original HealthBench study, the company reported that its models performed relatively poorly in conversations that required them to seek more information from the user. If that’s the case, then users who don’t have enough medical knowledge to provide a health chatbot with the information that it needs from the get-go might get unhelpful or inaccurate advice.

Singhal, the OpenAI health lead, notes that the company’s current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. However, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version.

Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean’s own study used GPT-4o, which came out almost a year ago and is now outdated. 

Earlier this month, Google released a study that meets Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE’s diagnoses were just as accurate as physicians’, and none of the conversations raised major safety concerns for researchers. 

Despite the encouraging results, Google isn’t planning to release AMIE anytime soon. “While the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing,” wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.

Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. “There’s lots of reasons that the clinical trial paradigm doesn’t always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?”

The key there is “third party.” No matter how extensively companies evaluate their own products, it’s tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots.

OpenAI’s Singhal says he’s strongly in favor of external evaluation. “We try our best to support the community,” he says. “Part of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like.” 

Given how expensive it is to produce a high-quality evaluation, he says, he’s skeptical that any individual academic laboratory would be able to produce what he calls “the one evaluation to rule them all.” But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluation suites—such as Stanford’s MedHELM framework, which tests models on a wide variety of medical tasks. Currently, OpenAI’s GPT-5 holds the highest MedHELM score.

Nigam Shah, a professor of medicine at Stanford University who led the MedHELM project, says it has limitations. In particular, it only evaluates individual chatbot responses, but someone who’s seeking medical advice from a chatbot tool might engage it in a multi-turn, back-and-forth conversation. He says that he and some collaborators are gearing up to build an evaluation that can score those complex conversations, but that it will take time, and money. “You and I have zero ability to stop these companies from releasing [health-oriented products], so they’re going to do whatever they damn please,” he says. “The only thing people like us can do is find a way to fund the benchmark.”

No one interviewed for this article argued that health LLMs need to perform perfectly on third-party evaluations in order to be released. Doctors themselves make mistakes—and for someone who has only occasional access to a doctor, a consistently accessible LLM that sometimes messes up could still be a huge improvement over the status quo, as long as its errors aren’t too grave. 

With the current state of the evidence, however, it’s impossible to know for sure whether the currently available tools do in fact constitute an improvement, or whether their risks outweigh their benefits.

A woman’s uterus has been kept alive outside the body for the first time


  • A uterus survived outside the body for the first time: Scientists in Spain kept a donated human uterus alive for 24 hours using a machine that mimics the body’s circulatory system, pumping modified blood through the organ.
  • The researchers hope to someday keep a uterus alive for a full menstrual cycle: Researchers also want to study how embryos implant into the uterine lining, by observing the process in a living organ outside the body.
  • Bigger ambitions are already on the table: The team’s founder envisions a future where a machine like this could gestate a human fetus entirely outside the body, offering a new path to parenthood for those unable to carry a pregnancy.


“Think of this as a human body,” says Javier González.

In front of me is essentially a metal box on wheels. Standing at around a meter in height, it reminds me of a stainless-steel counter in a restaurant kitchen. It is covered in flexible plastic tubing—which acts as veins and arteries—connecting a series of transparent containers, the organs of this machine.

What makes it extra special is the role of the cream-colored tub that sits on its surface. Ten months ago, González, a biomedical scientist who developed the device with his colleagues at the Carlos Simon Foundation, carefully placed a freshly donated human uterus in the tub. The team connected it to the device’s tubes and pumped in modified human blood.

The device kept the uterus alive for a day—a new feat that could represent the first step toward the long-term maintenance of uteruses outside the human body. The work has not yet been published.

The team members want to keep donated human uteruses alive long enough to see a full menstrual cycle. They hope this will help them study diseases of the uterus and learn more about how embryos burrow their way into the organ’s lining at the start of a pregnancy. They also hope that future iterations of their device might one day sustain the full gestation of a human fetus.

The machine is technically called PUPER, which stands for “preservation of the uterus in perfusion.” But González’s colleague Xavier Santamaria says the team has adopted a nickname for it: “We call it ‘Mother.’”

The organ in the machine

González and Santamaria, medical vice president of the Carlos Simon Foundation, demonstrated how the device might work when I visited the foundation in Valencia, Spain, earlier this month (although it held no organs on that day). 

Both are interested in learning more about implantation, the moment at which an embryo attaches itself to the lining of a uterus—essentially, the very first moment of pregnancy.

The foundation’s founder and director, Carlos Simon, believes it’s a sticking point in IVF: Scientists have made many improvements to the technology over the years, but the failure of embryos to implant underlies plenty of unsuccessful IVF cycles, he says. Being able to carefully study how the process works in a real, living organ might give the team a better idea of how to prevent those failures.

a person in gloves stands next to a machine with lots of tubing coming in and out of the metal exterior; a sheep uterus resting on gauze connected to several tubes
Javier González demonstrates the perfusion machine. A previous iteration of the device kept a sheep’s uterus (right) alive for a day.
JESS HAMZELOU; JAVIER GONZALES/CARLOS SIMON FOUNDATION

The team took inspiration from advances in technologies designed to maintain donated organs for transplantation. In recent years, researchers around the world have created devices that deliver nutrients and filter waste so that organs can survive longer after being removed from donors’ bodies.

The main goal here is to buy time. A human organ might last only a matter of hours outside the body, so a transplant may require frantic preparation for the recipient, sometimes in the middle of the night. With a little more time, doctors could find better donor-patient matches and potentially test the quality of donated organs.

This approach is called normothermic or machine perfusion, and it is already being used clinically for some liver, kidney, and heart transplants.

The team at the Carlos Simon Foundation built a similar machine for uteruses. A blood bag hangs on one side. From there, blood is ferried via plastic tubing to a pump, which functions as the heart. The pump shunts the blood through an oxygenator, which adds oxygen and removes carbon dioxide as the lungs would in a human body.

The blood is warmed and passed through sensors that monitor the levels of glucose and oxygen, along with other factors. It passes through a “kidney” to remove waste. And finally the blood reaches the uterus, hooked up to its own plastic “arteries” and “veins.” The organ itself sits at a tilt, just as in the body, and is kept in a humid environment to stay moist.

Mother’s first uterus

The team first began testing an early prototype of the device with sheep uteruses around four years ago. That meant carting the machine to an animal research center in Zaragoza, around 200 miles away. Over the course of the preliminary study, veterinary surgeons removed the uteruses of six sheep and hooked them up to the machine. They kept each uterus alive for a day, using blood from the same animals.

After the sheep experiments, the researchers carted their machine back to Valencia and modified it to achieve its current incarnation, “Mother.” They started working with a local hospital that performed hysterectomies. And in May last year, they were offered their first human uterus.

The team needed to be quick. “You need to put [the uterus in the machine] within a couple of hours, maximum, of the extraction,” says Santamaria. He and his colleagues also needed to connect the uterus’s blood vessels to the tubing delicately, taking care to avoid any blockages (clotting is a major challenge in organ perfusion). The organ was hooked up to human blood obtained from a blood bank.

It seemed to work—at least temporarily. “We kept it alive for one day,” says Santamaria.

“As a proof of concept, it is impressive,” says Keren Ladin, a bioethicist who has focused on organ transplantation and perfusion at Tufts University. “These are early days.”

It might not sound like much, but 24 hours is a long time for an organ to be out of the body. Maintaining a donated uterus for that long could expand the options for uterus transplant, a fairly new procedure offered to some people who want to be pregnant but don’t have a functional uterus, says Gerald Brandacher, professor of experimental and translational transplant surgery at the Medical University of Innsbruck in Austria.

“It is better than what we currently have, because we have only a couple of hours,” he says. So far, most uterus transplants have been planned operations involving organs from living donors. A technology like this could allow for the use of more organs from deceased donors, he says.

That work is “not in the immediate pipeline” for the team in Spain, says Santamaria. “We are working on other problems.”

Pregnancy in the lab?

Santamaria, González, and their colleagues are more interested in using sustained human uteruses for research. 

They’ve mounted a camera to a wall in the corner of the room, pointed at their machine. It allows the team to monitor “Mother” remotely, and to check if any valves disconnect. (That happened once before—a spike in pressure caused the blood bag to come loose, spilling a liter of blood on the floor, Santamaria says.)

They’d like to be able to keep their uteruses alive for around 28 days to study the menstrual cycle and disorders that affect the uterus, like endometriosis and fibroids.

It won’t be easy to maintain a uterus for that long, cautions Brandacher. As far as he knows, no one has been able to maintain a liver for more than seven days. “No studies out there … have shown 30-day survival in a machine perfusion circuit,” he says.

But the researchers believe it’s worth the effort. The team’s main interest is learning more about how embryos implant in the uterine lining at the start of a pregnancy. They hope to be able to test the process in their outside-the-body uteruses.

They won’t be allowed to use human embryos for this, says González—that would cross an ethical boundary. Instead, they plan to use embryo-like structures made from stem cells. The structures closely resemble human embryos but are created in a lab without sperm or eggs.

Simon himself has grander ambitions.

He sees a future in which a machine like “Mother” will be able to fully gestate a human, all the way from embryo to newborn. It could offer a new path to parenthood for people who don’t have a uterus, for example, or who are not able to get pregnant for other reasons.

He appreciates that it sounds futuristic, to say the least. “I don’t know if we will end up having pregnancies inside of the uterus outside of the body, but at least we are ready to understand all the steps to do that,” he says. “You have to start somewhere.”