A tiny new open-source AI model performs as well as powerful big ones

The Allen Institute for Artificial Intelligence (Ai2), a research nonprofit, is releasing a family of open-source multimodal language models, called Molmo, that it says perform as well as top proprietary models from OpenAI, Google, and Anthropic. 

The organization claims that its biggest Molmo model, which has 72 billion parameters, outperforms OpenAI’s GPT-4o, which is estimated to have over a trillion parameters, in tests that measure things like understanding images, charts, and documents.  

Meanwhile, Ai2 says a smaller Molmo model, with 7 billion parameters, comes close to OpenAI’s state-of-the-art model in performance, an achievement it ascribes to vastly more efficient data collection and training methods. 

What Molmo shows is that open-source AI development is now on par with closed, proprietary models, says Ali Farhadi, the CEO of Ai2. And open-source models have a significant advantage, as their open nature means other people can build applications on top of them. The Molmo demo is available here, and it will be available for developers to tinker with on the Hugging Face website. (Certain elements of the most powerful Molmo model are still shielded from view.) 

Other large multimodal language models are trained on vast data sets containing billions of images and text samples that have been hoovered from the internet, and they can include several trillion parameters. This process introduces a lot of noise to the training data and, with it, hallucinations, says Ani Kembhavi, a senior director of research at Ai2. In contrast, Ai2’s Molmo models have been trained on a significantly smaller and more curated data set containing only 600,000 images, and they have between 1 billion and 72 billion parameters. This focus on high-quality data, versus indiscriminately scraped data, has led to good performance with far fewer resources, Kembhavi says.

Ai2 achieved this by getting human annotators to describe the images in the model’s training data set in excruciating detail over multiple pages of text. They asked the annotators to talk about what they saw instead of typing it. Then they used AI techniques to convert their speech into data, which made the training process much quicker while reducing the computing power required. 

These techniques could prove really useful if we want to meaningfully govern the data that we use for AI development, says Yacine Jernite, who is the machine learning and society lead at Hugging Face, and was not involved in the research. 

“It makes sense that in general, training on higher-quality data can lower the compute costs,” says Percy Liang, the director of the Stanford Center for Research on Foundation Models, who also did not participate in the research. 

Another impressive capability is that the model can “point” at things, meaning it can analyze elements of an image by identifying the pixels that answer queries.

In a demo shared with MIT Technology Review, Ai2 researchers took a photo outside their office of the local Seattle marina and asked the model to identify various elements of the image, such as deck chairs. The model successfully described what the image contained, counted the deck chairs, and accurately pointed to other things in the image as the researchers asked. It was not perfect, however. It could not locate a specific parking lot, for example. 

Other advanced AI models are good at describing scenes and images, says Farhadi. But that’s not enough when you want to build more sophisticated web agents that can interact with the world and can, for example, book a flight. Pointing allows people to interact with user interfaces, he says. 

Jernite says Ai2 is operating with a greater degree of openness than we’ve seen from other AI companies. And while Molmo is a good start, he says, its real significance will lie in the applications developers build on top of it, and the ways people improve it.

Farhadi agrees. AI companies have drawn massive, multitrillion-dollar investments over the past few years. But in the past few months, investors have expressed skepticism about whether that investment will bring returns. Big, expensive proprietary models won’t do that, he argues, but open-source ones can. He says the work shows that open-source AI can also be built in a way that makes efficient use of money and time. 

“We’re excited about enabling others and seeing what others would build with this,” Farhadi says. 

Want AI that flags hateful content? Build it.

Humane Intelligence, an organization focused on evaluating AI systems, is launching a competition that challenges developers to create a computer vision model that can track hateful image-based propaganda online. Organized in partnership with the Nordic counterterrorism group Revontulet, the bounty program opens September 26. It is open to anyone, 18 or older, who wants to compete and promises $10,000 in prizes for the winners.

This is the second of a planned series of 10 “algorithmic bias bounty” programs from Humane Intelligence, a nonprofit that investigates the societal impact of AI and was launched by the prominent AI researcher Rumman Chowdhury in 2022. The series is supported by Google.org, Google’s philanthropic arm.

“The goal of our bounty programs is to, number one, teach people how to do algorithmic assessments,” says Chowdhury, “but also, number two, to actually solve a pressing problem in the field.” 

Its first challenge asked participants to evaluate gaps in sample data sets that may be used to train models—gaps that may specifically produce output that is factually inaccurate, biased, or misleading. 

The second challenge deals with tracking hateful imagery online—an incredibly complex problem. Generative AI has enabled an explosion in this type of content, and AI is also deployed to manipulate content so that it won’t be removed from social media. For example, extremist groups may use AI to slightly alter an image that a platform has already banned, quickly creating hundreds of different copies that can’t easily be flagged by automated detection systems. Extremist networks can also use AI to embed a pattern into an image that is undetectable to the human eye but will confuse and evade detection systems. It has essentially created a cat-and-mouse game between extremist groups and online platforms. 
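To see why slightly altered copies are hard to flag, consider a filter that matches banned images by an exact fingerprint of their pixels. This is only a sketch under stated assumptions: the text does not say how platforms match images, the 4x4 grid of brightness values stands in for a real image, and `exact_hash`, `average_hash`, and `hamming` are illustrative names rather than any platform's API.

```python
import hashlib


def exact_hash(pixels):
    """Fingerprint that changes completely if any single pixel changes."""
    flat = bytes(value for row in pixels for value in row)
    return hashlib.sha256(flat).hexdigest()


def average_hash(pixels):
    """Coarse fingerprint: one bit per pixel, set when brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming(bits_a, bits_b):
    """Number of positions where two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Toy "banned" image and a copy with one pixel nudged slightly.
original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
altered = [row[:] for row in original]
altered[0][0] += 3

if __name__ == "__main__":
    print(exact_hash(original) == exact_hash(altered))
    print(hamming(average_hash(original), average_hash(altered)))
```

In this toy case the exact fingerprints no longer match, while the coarse fingerprint is unchanged. A tolerant fingerprint is not a fix on its own, since, as noted above, patterns can be crafted specifically to confuse detection systems.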

The challenge asks for two different models. The first, a task for those with intermediate skills, is one that identifies hateful images; the second, considered an advanced challenge, is a model that attempts to fool the first one. “That actually mimics how it works in the real world,” says Chowdhury. “The do-gooders make one approach, and then the bad guys make an approach.” The goal is to engage machine-learning researchers on the topic of mitigating extremism, which may lead to the creation of new models that can effectively screen for hateful images.  

A core challenge of the project is that hate-based propaganda can be very dependent on its context. And someone who doesn’t have a deep understanding of certain symbols or signifiers may not be able to tell what even qualifies as propaganda for a white nationalist group. 

“If [the model] never sees an example of a hateful image from a part of the world, then it’s not going to be any good at detecting it,” says Jimmy Lin, a professor of computer science at the University of Waterloo, who is not associated with the bounty program.

This effect is amplified around the world, since many models don’t have a vast knowledge of cultural contexts. That’s why Humane Intelligence decided to partner with a non-US organization for this particular challenge. “Most of these models are often fine-tuned to US examples, which is why it’s important that we’re working with a Nordic counterterrorism group,” says Chowdhury.

Lin, though, warns that solving these problems may require more than algorithmic changes. “We have models that generate fake content. Well, can we develop other models that can detect fake generated content? Yes, that is certainly one approach to it,” he says. “But I think overall, in the long run, training, literacy, and education efforts are actually going to be more beneficial and have a longer-lasting impact. Because you’re not going to be subjected to this cat-and-mouse game.”

The challenge will run till November 7, 2024. Two winners will be selected, one for the intermediate challenge and another for the advanced; they will receive $4,000 and $6,000, respectively. Participants will also have their models reviewed by Revontulet, which may decide to add them to its current suite of tools to combat extremism. 

An AI script editor could help decide what films get made in Hollywood

Every day across Hollywood, scores of film school graduates and production assistants work as script readers. Their job is to find the diamonds in the rough from the 50,000 or so screenplays pitched each year and flag any worth pursuing further. Each script runs anywhere from 100 to 150 pages, and it can take half a day to read one and write up a “coverage,” or summary of the strengths and weaknesses. With only about 50 of these scripts selling in a given year, readers are trained to be ruthless. 

Now the film-focused tech company Cinelytic, which works with major studios like Warner Bros. and Sony Pictures to analyze film budgets and box office potential, aims to offer script feedback with generative AI. 

Today it launched a new tool called Callaia, which amateur writers and professional script readers alike can use to analyze scripts at $79 each. Using AI, it takes Callaia less than a minute to write its own coverage, which includes a synopsis, a list of comparable films, grades for areas like dialogue and originality, and actor recommendations. It also makes a recommendation on whether or not the film should be financed, giving it a rating of “pass,” “consider,” “recommend,” or “strongly recommend.” Though the foundation of the tool is built with ChatGPT’s API, the team had to coach the model on script-specific tasks like evaluating genres and writing a movie’s logline, which summarizes the story in a sentence. 

“It helps people understand the script very quickly,” says Tobias Queisser, Cinelytic’s cofounder and CEO, who also had a career as a film producer. “You can look at more stories and more scripts, and not eliminate them based on factors that are detrimental to the business of finding great content.”

The idea is that Callaia will give studios a more analytical way to predict how a script may perform on the screen before spending on marketing or production. But, the company says, it’s also meant to ease the bottleneck that script readers create in the filmmaking process. With such a deluge to sort through, many scripts can make it to decision-makers only if they have a recognizable name attached. An AI-driven tool would democratize the script selection process and allow better scripts and writers to be discovered, Queisser says.

The tool’s introduction may further fuel the ongoing Hollywood debate about whether AI will help or harm its creatives. Since the public launch of ChatGPT in late 2022, the technology has drawn concern everywhere from writers’ rooms to special effects departments, where people worry that it will cheapen, augment, or replace human talent.  

In this case, Callaia’s success will depend on whether it can provide critical feedback as well as a human script reader can. 

That’s a challenge because of what GPT and other AI models are built to do, according to Tuhin Chakrabarty, a researcher who studied how well AI can analyze creative works during his PhD in computer science at Columbia University. In one of his studies, Chakrabarty and his coauthors had various AI models and a group of human experts—including professors of creative writing and a screenwriter—analyze the quality of 48 stories, 12 that appeared in the New Yorker and the rest of which were AI-generated. His team found that the two groups virtually never agreed on the quality of the works. 

“Whenever you ask an AI model about the creativity of your work, it is never going to say bad things,” Chakrabarty says. “It is always going to say good things, because it’s trained to be a helpful, polite assistant.”

Cinelytic CTO Dev Sen says this trait did present a hurdle in the design of Callaia, and that the initial output of the model was overly positive. That improved with time and tweaking. “We don’t necessarily want to be overly critical, but aim for a more balanced analysis that points out both strengths and weaknesses in the script,” he says. 

Vir Srinivas, an independent filmmaker whose film Orders from Above won Best Historical Film at Cannes in 2021, agreed to look at an example of Callaia’s output to see how well the AI model can analyze a script. I showed him what the model made of a 100-page script about a jazz trumpeter on a journey of self-discovery in San Francisco, which Cinelytic provided. Srinivas says that the coverage generated by the model didn’t go deep enough to present genuinely helpful feedback to a screenwriter.

“It’s approaching the script in too literal a sense and not a metaphorical one—something which human audiences do intuitively and unconsciously,” he says. “It’s as if it’s being forced to be diplomatic and not make any waves.”

There were other flaws, too. For example, Callaia predicted that the film would need a budget of just $5 to $10 million but also suggested that expensive A-listers like Paul Rudd would have been well suited for the lead role.

Cinelytic says it’s currently at work improving the actor recommendation component, and though the company did not provide data on how well its model analyzes a given script, Sen says feedback from 100 script readers who beta-tested the model was overwhelmingly positive. “Most of them were pretty much blown away, because they said that the coverages were on the order of, if not better than, the coverages they’re used to,” he says. 

Overall, Cinelytic is pitching Callaia as a tool meant to quickly provide feedback on lots of scripts, not to replace human script readers, who will still read and adjust the tool’s findings. Queisser, who is cognizant that whether AI can effectively write or edit creatively is hotly contested in Hollywood, is hopeful the tool will allow script readers to more quickly identify standout scripts while also providing an efficient source of feedback for writers.

“Writers that embrace our tool will have something that can help them refine their scripts and find more opportunities,” he says. “It’s positive for both sides.”

OpenAI released its advanced voice mode to more people. Here’s how to get it.

OpenAI is broadening access to Advanced Voice Mode, a feature of ChatGPT that allows you to speak more naturally with the AI model. It allows you to interrupt its responses midsentence, and it can sense and interpret your emotions from your tone of voice and adjust its responses accordingly. 

These features were teased back in May when OpenAI unveiled GPT-4o, but they were not released until July—and then just to an invite-only group. (At least initially, there seem to have been some safety issues with the model; OpenAI gave several Wired reporters access to the voice mode back in May, but the magazine reported that the company “pulled it the next morning, citing safety concerns.”) Users who’ve been able to try it have largely described the model as an impressively fast, dynamic, and realistic voice assistant—which has made its limited availability particularly frustrating to some other OpenAI users. 

Today is the first time OpenAI has promised to bring the new voice mode to a wide range of users. Here’s what you need to know.

What can it do? 

Though ChatGPT currently offers a standard voice mode to paid users, its interactions can be clunky. In the mobile app, for example, you can’t interrupt the model’s often long-winded responses with your voice, only with a tap on the screen. The new version fixes that, and also promises to modify its responses on the basis of the emotion it’s sensing from your voice. As with other versions of ChatGPT, users can personalize the voice mode by asking the model to remember facts about themselves. The new mode also has improved its pronunciation of words in non-English languages.

AI investor Allie Miller posted a demo of the tool in August, which highlighted many of the same strengths as OpenAI’s own release videos: The model is fast and adept at changing its accent, tone, and content to match your needs.

The update also adds new voices. Shortly after the launch of GPT-4o, OpenAI was criticized for the similarity between the female voice in its demo videos, named Sky, and that of Scarlett Johansson, who played an AI love interest in the movie Her. OpenAI then removed the voice. Now it has launched five new voices, named Arbor, Maple, Sol, Spruce, and Vale, which will be available in both the standard and advanced voice modes. MIT Technology Review has not heard them yet, but OpenAI says they were made using professional voice actors from around the world. “We interviewed dozens of actors to find those with the qualities of voices we feel people will enjoy talking to for hours—warm, approachable, inquisitive, with some rich texture and tone,” a company spokesperson says. 

Who can access it and when?

For now, OpenAI is rolling out access to Advanced Voice Mode to Plus users, who pay $20 per month for a premium version, and Team users, who pay $30 per month and have higher message limits. The next group to receive access will be those in the Enterprise and Edu tiers. The exact timing, though, is vague; an OpenAI spokesperson says the company will “gradually roll out access to all Plus and Team users and will roll out to Enterprise and Edu tiers starting next week.” The company hasn’t committed to a firm deadline for when all users in these categories will have access. A message in the ChatGPT app indicates that all Plus users will have access by “the end of fall.”

There are geographic limitations. The new feature is not yet available in the EU, the UK, Switzerland, Iceland, Norway, or Liechtenstein.

There is no immediate plan to release Advanced Voice Mode to free users. (The standard mode remains available to all paid users.)

What steps have been taken to make sure it’s safe?

As the company noted upon the initial release in July and again emphasized this week, Advanced Voice Mode has been safety-tested by external experts “who collectively speak a total of 45 different languages, and represent 29 different geographies.” The GPT-4o system card details how the underlying model handles issues like generating violent or erotic speech, imitating people’s voices without their consent, or generating copyrighted content. 

Still, OpenAI’s models are not open-source. Compared with such models, which are more transparent about their training data and the “model weights” that govern how the AI produces responses, OpenAI’s closed-source models are harder for independent researchers to evaluate from the perspective of safety, bias, and harm.

AI models let robots carry out tasks in unfamiliar environments

It’s tricky to get robots to do things in environments they’ve never seen before. Typically, researchers need to train them on new data for every new place they encounter, which can become very time-consuming and expensive.

Now researchers have developed a series of AI models that teach robots to complete basic tasks in new surroundings without further training or fine-tuning. The five AI models, called robot utility models (RUMs), allow machines to complete five separate tasks—opening doors and drawers, and picking up tissues, bags, and cylindrical objects—in unfamiliar environments with a 90% success rate. 

The team, consisting of researchers from New York University, Meta, and the robotics company Hello Robot, hopes its findings will make it quicker and easier to teach robots new skills while helping them function within previously unseen domains. The approach could make it easier and cheaper to deploy robots in our homes.

“In the past, people have focused a lot on the problem of ‘How do we get robots to do everything?’ but not really asking ‘How do we get robots to do the things that they do know how to do—everywhere?’” says Mahi Shafiullah, a PhD student at New York University who worked on the project. “We looked at ‘How do you teach a robot to, say, open any door, anywhere?’”

Teaching robots new skills generally requires a lot of data, which is pretty hard to come by. Because robotic training data needs to be collected physically—a time-consuming and expensive undertaking—it’s much harder to build and scale training databases for robots than it is for types of AI like large language models, which are trained on information scraped from the internet.

To make it faster to gather the data essential for teaching a robot a new skill, the researchers developed a new version of a tool they had used in previous research: an iPhone attached to a cheap reacher-grabber stick, the kind typically used to pick up trash. 

The team used the setup to record around 1,000 demonstrations in 40 different environments, including homes in New York City and Jersey City, for each of the five tasks—some of which had been gathered as part of previous research. Then they trained learning algorithms on the five data sets to create the five RUM models.

These models were deployed on Stretch, a robot consisting of a wheeled unit, a tall pole, and a retractable arm holding an iPhone, to test how successfully they were able to execute the tasks in new environments without additional tweaking. Although the models achieved a completion rate of 74.4% on their own, the researchers were able to increase this to a 90% success rate when they took images from the iPhone and the robot’s head-mounted camera, gave them to OpenAI’s recent GPT-4o model, and asked it if the task had been completed successfully. If GPT-4o said no, they simply reset the robot and tried again.
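The retry step described above can be sketched in a few lines. This is an illustration only, not the researchers' code: `run_with_retries`, `attempt_task`, `verify`, and `reset` are hypothetical names, and `verify` stands in for the GPT-4o check. The helper `success_within` further assumes that attempts are independent and that the verifier is never wrong. The article claims neither, so the helper shows only why retries can lift a 74.4% single-attempt rate, not how the reported 90% was obtained.

```python
from typing import Callable, Optional


def run_with_retries(
    attempt_task: Callable[[], object],
    verify: Callable[[object], bool],
    reset: Callable[[], None],
    max_attempts: int = 3,
) -> Optional[int]:
    """Return the 1-based attempt that the verifier accepted, or None."""
    for attempt in range(1, max_attempts + 1):
        observation = attempt_task()  # e.g. camera images taken after the attempt
        if verify(observation):       # hypothetical stand-in for the GPT-4o check
            return attempt
        reset()                       # verifier said no: reset and try again
    return None


def success_within(p_single: float, attempts: int) -> float:
    """Chance of at least one success, ASSUMING independent attempts
    and a verifier that is always right (both are simplifications)."""
    return 1 - (1 - p_single) ** attempts


if __name__ == "__main__":
    # Scripted outcomes: the first attempt fails, the second succeeds.
    scripted = iter([False, True])
    resets = []
    used = run_with_retries(lambda: next(scripted), bool, lambda: resets.append(1))
    print(used, len(resets))
    print(f"{success_within(0.744, 2):.3f}")
```

Under those simplifying assumptions, two attempts at a 74.4% single-attempt rate would succeed about 93% of the time. A real verifier can misjudge an attempt, so measured numbers need not match this idealized figure.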

A significant challenge facing roboticists is that training and testing their models in lab environments isn’t representative of what could happen in the real world, meaning research that helps machines to behave more reliably in new settings is much welcomed, says Mohit Shridhar, a research scientist specializing in robotic manipulation who wasn’t involved in the work. 

“It’s nice to see that it’s being evaluated in all these diverse homes and kitchens, because if you can get a robot to work in the wild in a random house, that’s the true goal of robotics,” he says.

The project could serve as a general recipe to build other utility robotics models for other tasks, helping to teach robots new skills with minimal extra work and making it easier for people who aren’t trained roboticists to deploy future robots in their homes, says Shafiullah.

“The dream that we’re going for is that I could train something, put it on the internet, and you should be able to download and run it on a robot in your home,” he says.

Why we need an AI safety hotline

In the past couple of years, regulators have been caught off guard again and again as tech companies compete to launch ever more advanced AI models. It’s only a matter of time before labs release another round of models that pose new regulatory challenges. We’re likely just weeks away, for example, from OpenAI’s release of ChatGPT-5, which promises to push AI capabilities further than ever before. As it stands, it seems there’s little anyone can do to delay or prevent the release of a model that poses excessive risks.

Testing AI models before they’re released is a common approach to mitigating certain risks, and it may help regulators weigh up the costs and benefits, and potentially block models from being released if they’re deemed too dangerous. But the accuracy and comprehensiveness of these tests leave a lot to be desired. AI models may “sandbag” the evaluation, hiding some of their capabilities to avoid raising any safety concerns. Evaluations also suffer from limited scope: current tests are unlikely to reliably uncover the full set of risks posed by any one model, including all those that warrant further investigation. There’s also the question of who conducts the evaluations and how their biases may influence testing efforts. For those reasons, evaluations need to be used alongside other governance tools. 

One such tool could be internal reporting mechanisms within the labs. Ideally, employees should feel empowered to regularly and fully share their AI safety concerns with their colleagues, and they should feel those colleagues can then be counted on to act on the concerns. However, there’s growing evidence that, far from being promoted, open criticism is becoming rarer in AI labs. Just three months ago, 13 former and current workers from OpenAI and other labs penned an open letter expressing fear of retaliation if they attempt to disclose questionable corporate behaviors that fall short of breaking the law. 

How to sound the alarm

In theory, external whistleblower protections could play a valuable role in the detection of AI risks. These could protect employees fired for disclosing corporate actions, and they could help make up for inadequate internal reporting mechanisms. Nearly every state has a public policy exception to at-will employment termination—in other words, terminated employees can seek recourse against their employers if they were retaliated against for calling out unsafe or illegal corporate practices. However, in practice this exception offers employees few assurances. Judges tend to favor employers in whistleblower cases. The likelihood of AI labs’ surviving such suits seems particularly high given that society has yet to reach any sort of consensus as to what qualifies as unsafe AI development and deployment. 

These and other shortcomings explain why the aforementioned 13 AI workers, including ex-OpenAI employee William Saunders, called for a novel “right to warn.” Companies would have to offer employees an anonymous process for disclosing risk-related concerns to the lab’s board, a regulatory authority, and an independent third body made up of subject-matter experts. The ins and outs of this process have yet to be figured out, but it would presumably be a formal, bureaucratic mechanism. The board, regulator, and third party would all need to make a record of the disclosure. It’s likely that each body would then initiate some sort of investigation. Subsequent meetings and hearings also seem like a necessary part of the process. Yet if Saunders is to be taken at his word, what AI workers really want is something different. 

When Saunders went on the Big Technology Podcast to outline his ideal process for sharing safety concerns, his focus was not on formal avenues for reporting established risks. Instead, he indicated a desire for some intermediate, informal step. He wants a chance to receive neutral, expert feedback on whether a safety concern is substantial enough to go through a “high stakes” process such as a right-to-warn system. Current government regulators, as Saunders says, could not serve that role. 

For one thing, they likely lack the expertise to help an AI worker think through safety concerns. What’s more, few workers will pick up the phone if they know it’s a government official on the other end—that sort of call may be “very intimidating,” as Saunders himself said on the podcast. Instead, he envisages being able to call an expert to discuss his concerns. In an ideal scenario, he’d be told that the risk in question does not seem that severe or likely to materialize, freeing him up to return to whatever he was doing with more peace of mind. 

Lowering the stakes

What Saunders is asking for in this podcast isn’t a right to warn, then, as that suggests the employee is already convinced there’s unsafe or illegal activity afoot. What he’s really calling for is a gut check—an opportunity to verify whether a suspicion of unsafe or illegal behavior seems warranted. The stakes would be much lower, so the regulatory response could be lighter. The third party responsible for weighing up these gut checks could be a much more informal one. For example, AI PhD students, retired AI industry workers, and other individuals with AI expertise could volunteer for an AI safety hotline. They could be tasked with quickly and expertly discussing safety matters with employees via a confidential and anonymous phone conversation. Hotline volunteers would have familiarity with leading safety practices, as well as extensive knowledge of what options, such as right-to-warn mechanisms, may be available to the employee. 

As Saunders indicated, few employees will likely want to go from 0 to 100 with their safety concerns—straight from colleagues to the board or even a government body. They are much more likely to raise their issues if an intermediary, informal step is available.

Studying examples elsewhere

The details of how precisely an AI safety hotline would work deserve more debate among AI community members, regulators, and civil society. For the hotline to realize its full potential, for instance, it may need some way to escalate the most urgent, verified reports to the appropriate authorities. How to ensure the confidentiality of hotline conversations is another matter that needs thorough investigation. How to recruit and retain volunteers is another key question. Given leading experts’ broad concern about AI risk, some may be willing to participate simply out of a desire to lend a hand. Should too few folks step forward, other incentives may be necessary. The essential first step, though, is acknowledging this missing piece in the puzzle of AI safety regulation. The next step is looking for models to emulate in building out the first AI hotline. 

One place to start is with ombudspersons. Other industries have recognized the value of identifying these neutral, independent individuals as resources for evaluating the seriousness of employee concerns. Ombudspersons exist in academia, nonprofits, and the private sector. The distinguishing attribute of these individuals and their staffers is neutrality—they have no incentive to favor one side or the other, and thus they’re more likely to be trusted by all. A glance at the use of ombudspersons in the federal government shows that when they are available, issues may be raised and resolved sooner than they would be otherwise.

This concept is relatively new. The US Department of Commerce established the first federal ombudsman in 1971. The office was tasked with helping citizens resolve disputes with the agency and investigate agency actions. Other agencies, including the Social Security Administration and the Internal Revenue Service, soon followed suit. A retrospective review of these early efforts concluded that effective ombudspersons can meaningfully improve citizen-government relations. On the whole, ombudspersons were associated with an uptick in voluntary compliance with regulations and cooperation with the government. 

An AI ombudsperson or safety hotline would surely have different tasks and staff from an ombudsperson in a federal agency. Nevertheless, the general concept is worthy of study by those advocating safeguards in the AI industry. 

A right to warn may play a role in getting AI safety concerns aired, but we need to set up more intermediate, informal steps as well. An AI safety hotline is low-hanging regulatory fruit. A pilot made up of volunteers could be organized in relatively short order and provide an immediate outlet for those, like Saunders, who merely want a sounding board.

Kevin Frazier is an assistant professor at St. Thomas University College of Law and senior research fellow in the Constitutional Studies Program at the University of Texas at Austin.

Why OpenAI’s new model is such a big deal

This story is from The Algorithm, our weekly newsletter on AI. To get it in your inbox first, sign up here.

Last weekend, I got married at a summer camp, and during the day our guests competed in a series of games inspired by the show Survivor that my now-wife and I orchestrated. When we were planning the games in August, we wanted one station to be a memory challenge, where our friends and family would have to memorize part of a poem and then relay it to their teammates so they could re-create it with a set of wooden tiles. 

I thought OpenAI’s GPT-4o, its leading model at the time, would be perfectly suited to help. I asked it to create a short wedding-themed poem, with the constraint that each letter could only appear a certain number of times so we could make sure teams would be able to reproduce it with the provided set of tiles. GPT-4o failed miserably. The model repeatedly insisted that its poem worked within the constraints, even though it didn’t. It would correctly count the letters only after the fact, while continuing to deliver poems that didn’t fit the prompt. Without the time to meticulously craft the verses by hand, we ditched the poem idea and instead challenged guests to memorize a series of shapes made from colored tiles. (That ended up being a total hit with our friends and family, who also competed in dodgeball, egg tosses, and capture the flag.)    
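
The counting step that GPT-4o kept getting wrong is simple to do mechanically. Below is a minimal Python sketch of such a check, assuming a made-up tile set; the function name `letters_over_limit` and the tile counts are hypothetical and not taken from the wedding game itself.

```python
from collections import Counter

def letters_over_limit(poem: str, tiles: dict[str, int]) -> dict[str, int]:
    """Return each letter the poem uses more often than the tile set allows,
    mapped to the number of extra tiles that would be needed."""
    used = Counter(ch for ch in poem.upper() if ch.isalpha())
    return {
        letter: count - tiles.get(letter, 0)
        for letter, count in used.items()
        if count > tiles.get(letter, 0)
    }

# Hypothetical tile set: two of every letter, plus a few extra vowels.
tiles = {chr(c): 2 for c in range(ord("A"), ord("Z") + 1)}
tiles.update({"A": 4, "E": 5, "O": 4})

shortfall = letters_over_limit("Love to you two", tiles)
print(shortfall or "poem fits the tiles")
```

An empty result means every letter stays within its limit; anything else lists the letters that would need more tiles.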

However, last week OpenAI released a new model called o1 (previously referred to under the code name “Strawberry” and, before that, Q*) that blows GPT-4o out of the water for this type of purpose.

Unlike previous models that are well suited for language tasks like writing and editing, OpenAI o1 is focused on multistep “reasoning,” the type of process required for advanced mathematics, coding, or other STEM-based questions. It uses a “chain of thought” technique, according to OpenAI. “It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working,” the company wrote in a blog post on its website.

OpenAI’s tests point to resounding success. The model ranks in the 89th percentile on questions from the competitive coding organization Codeforces and would be among the top 500 high school students in the USA Math Olympiad, which covers geometry, number theory, and other math topics. The model is also trained to answer PhD-level questions in subjects ranging from astrophysics to organic chemistry. 

In math olympiad questions, the new model is 83.3% accurate, versus 13.4% for GPT-4o. In the PhD-level questions, it averaged 78% accuracy, compared with 69.7% from human experts and 56.1% from GPT-4o. (In light of these accomplishments, it’s unsurprising the new model was pretty good at writing a poem for our nuptial games, though still not perfect; it used more Ts and Ss than instructed to.)

So why does this matter? The bulk of LLM progress until now has been language-driven, resulting in chatbots or voice assistants that can interpret, analyze, and generate words. But in addition to getting lots of facts wrong, such LLMs have failed to demonstrate the types of skills required to solve important problems in fields like drug discovery, materials science, coding, or physics. OpenAI’s o1 is one of the first signs that LLMs might soon become genuinely helpful companions to human researchers in these fields. 

It’s a big deal because it brings “chain-of-thought” reasoning in an AI model to a mass audience, says Matt Welsh, an AI researcher and founder of the LLM startup Fixie. 

“The reasoning abilities are directly in the model, rather than one having to use separate tools to achieve similar results. My expectation is that it will raise the bar for what people expect AI models to be able to do,” Welsh says.

That said, it’s best to take OpenAI’s comparisons to “human-level skills” with a grain of salt, says Yves-Alexandre de Montjoye, an associate professor in math and computer science at Imperial College London. It’s very hard to meaningfully compare how LLMs and people go about tasks such as solving math problems from scratch.

Also, AI researchers say that measuring how well a model like o1 can “reason” is harder than it sounds. If it answers a given question correctly, is that because it successfully reasoned its way to the logical answer? Or was it aided by a sufficient starting point of knowledge built into the model? The model “still falls short when it comes to open-ended reasoning,” Google AI researcher François Chollet wrote on X.

Finally, there’s the price. This reasoning-heavy model doesn’t come cheap. Though access to some versions of the model is included in premium OpenAI subscriptions, developers using o1 through the API will pay three times as much as they pay for GPT-4o—$15 per 1 million input tokens in o1, versus $5 for GPT-4o. The new model also won’t be most users’ first pick for more language-heavy tasks, where GPT-4o continues to be the better option, according to OpenAI’s user surveys. 
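
To make the price gap concrete, here is a small Python sketch that applies the two input-token prices quoted above to a hypothetical workload. The dictionary keys are just labels for this example, the 250,000-token figure is invented, and output-token charges are left out because they are not quoted here.

```python
# Input-token prices quoted above, in US dollars per 1 million tokens.
PRICE_PER_MILLION_INPUT = {"o1": 15.00, "gpt-4o": 5.00}

def input_cost(model: str, input_tokens: int) -> float:
    """Cost in dollars of sending `input_tokens` input tokens to `model`."""
    return PRICE_PER_MILLION_INPUT[model] * input_tokens / 1_000_000

tokens = 250_000  # hypothetical workload
for model in PRICE_PER_MILLION_INPUT:
    print(f"{model}: ${input_cost(model, tokens):.2f}")
```

At any volume, the o1 input bill in this sketch is three times the GPT-4o one, since the ratio of the two prices is fixed.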

What will it unlock? We won’t know until researchers and labs have the access, time, and budget to tinker with the new model and find its limits. But it’s surely a sign that the race for models that can outreason humans has begun. 

Now read the rest of The Algorithm


Deeper learning

Chatbots can persuade people to stop believing in conspiracy theories

Researchers believe they’ve uncovered a new tool for combating false conspiracy theories: AI chatbots. Researchers from MIT Sloan and Cornell University found that chatting about a conspiracy theory with a large language model (LLM) reduced people’s belief in it by about 20%—even among participants who claimed that their beliefs were important to their identity. 

Why this matters: The findings could represent an important step forward in how we engage with and educate people who espouse such baseless theories, says Yunhao (Jerry) Zhang, a postdoc fellow affiliated with the Psychology of Technology Institute who studies AI’s impacts on society. “They show that with the help of large language models, we can—I wouldn’t say solve it, but we can at least mitigate this problem,” he says. “It points out a way to make society better.” Read more from Rhiannon Williams here.

Bits and bytes

Google’s new tool lets large language models fact-check their responses

Called DataGemma, it uses two methods to help LLMs check their responses against reliable data and cite their sources more transparently to users. (MIT Technology Review)

Meet the radio-obsessed civilian shaping Ukraine’s drone defense 

Since Russia’s invasion, Serhii “Flash” Beskrestnov has become an influential, if sometimes controversial, force—sharing expert advice and intel on the ever-evolving technology that’s taken over the skies. His work may determine the future of Ukraine, and wars far beyond it. (MIT Technology Review)

Tech companies have joined a White House commitment to prevent AI-generated sexual abuse imagery

The pledges, signed by firms like OpenAI, Anthropic, and Microsoft, aim to “curb the creation of image-based sexual abuse.” The companies promise to set limits on what models will generate and to remove nude images from training data sets where possible.  (Fortune)

OpenAI is now valued at $150 billion

The valuation arose out of talks it’s currently engaged in to raise $6.5 billion. Given that OpenAI is becoming increasingly costly to operate, and could lose as much as $5 billion this year, it’s tricky to see how it all adds up. (The Information)

There are more than 120 AI bills in Congress right now

More than 120 bills related to regulating artificial intelligence are currently floating around the US Congress.

They’re pretty varied. One aims to improve knowledge of AI in public schools, while another is pushing for model developers to disclose what copyrighted material they use in their training.  Three deal with mitigating AI robocalls, while two address biological risks from AI. There’s even a bill that prohibits AI from launching a nuke on its own.

The flood of bills is indicative of the desperation Congress feels to keep up with the rapid pace of technological improvements. “There is a sense of urgency. There’s a commitment to addressing this issue, because it is developing so quickly and because it is so crucial to our economy,” says Heather Vaughan, director of communications for the US House of Representatives Committee on Science, Space, and Technology.

Because of the way Congress works, the majority of these bills will never make it into law. But simply taking a look at all the different bills that are in motion can give us insight into policymakers’ current preoccupations: where they think the dangers are, what each party is focusing on, and more broadly, what vision the US is pursuing when it comes to AI and how it should be regulated.

That’s why, with help from the Brennan Center for Justice, which created a tracker with all the AI bills circulating in various committees in Congress right now, MIT Technology Review has taken a closer look to see if there’s anything we can learn from this legislative smorgasbord. 

It can seem as if Congress is trying to do everything at once when it comes to AI. To get a better sense of what may actually pass, it’s useful to look at which bills are moving along to potentially become law. 

A bill typically needs to pass a committee, or a smaller body of Congress, before it is voted on by the whole Congress. Many will fall short at this stage, while others will simply be introduced and then never spoken of again. This happens because there are so many bills presented in each session, and not all of them are given equal consideration. If the leaders of a party don’t feel a bill from one of its members can pass, they may not even try to push it forward. And then, depending on the makeup of Congress, a bill’s sponsor usually needs to get some members of the opposite party to support it for it to pass. In the current polarized US political climate, that task can be herculean. 

Congress has passed legislation on artificial intelligence before. Back in 2020, the National AI Initiative Act was part of the Defense Authorization Act, which invested resources in AI research and provided support for public education and workforce training on AI.

And some of the current bills are making their way through the system. The Senate Commerce Committee pushed through five AI-related bills at the end of July. The bills focused on authorizing the newly formed US AI Safety Institute (AISI) to create test beds and voluntary guidelines for AI models. The other bills focused on expanding education on AI, establishing public computing resources for AI research, and criminalizing the publication of deepfake pornography. The next step would be to put the bills on the congressional calendar to be voted on, debated, or amended.

“The US AI Safety Institute, as a place to have consortium building and easy collaboration between corporate and civil society actors, is amazing. It’s exactly what we need,” says Yacine Jernite, an AI researcher at Hugging Face.

The progress of these bills is a positive development, says Varun Krovi, executive director of the Center for AI Safety Action Fund. “We need to codify the US AI Safety Institute into law if you want to maintain our leadership on the global stage when it comes to standards development,” he says. “And we need to make sure that we pass a bill that provides computing capacity required for startups, small businesses, and academia to pursue AI.”

Following the Senate’s lead, the House Committee on Science, Space, and Technology just passed nine more bills regarding AI on September 11. Those bills focused on improving education on AI in schools, directing the National Institute of Standards and Technology (NIST) to establish guidelines for artificial-intelligence systems, and expanding the workforce of AI experts. These bills were chosen because they have a narrower focus and thus might not get bogged down in big ideological battles on AI, says Vaughan.

“It was a day that culminated from a lot of work. We’ve had a lot of time to hear from members and stakeholders. We’ve had years of hearings and fact-finding briefings on artificial intelligence,” says Representative Haley Stevens, one of the Democratic members of the House committee.

Many of the bills specify that any guidance they propose for the industry is nonbinding and that the goal is to work with companies to ensure safe development rather than curtail innovation. 

For example, one of the bills from the House, the AI Development Practices Act, directs NIST to establish “voluntary guidance for practices and guidelines relating to the development … of AI systems” and a “voluntary risk management framework.” Another bill, the AI Advancement and Reliability Act, has similar language. It supports “the development of voluntary best practices and technical standards” for evaluating AI systems. 

“Each bill contributes to advancing AI in a safe, reliable, and trustworthy manner while fostering the technology’s growth and progress through innovation and vital R&D,” committee chairman Frank Lucas, an Oklahoma Republican, said in a press release on the bills coming out of the House.

“It’s emblematic of the approach that the US has taken when it comes to tech policy. We hope that we would move on from voluntary agreements to mandating them,” says Krovi.

Avoiding mandates is a practical matter for the House committee. “Republicans don’t go in for mandates for the most part. They generally aren’t going to go for that. So we would have a hard time getting support,” says Vaughan. “We’ve heard concerns about stifling innovation, and that’s not the approach that we want to take.” When MIT Technology Review asked about the origin of these concerns, they were attributed to unidentified “third parties.” 

And fears of slowing innovation don’t just come from the Republican side. “What’s most important to me is that the United States of America is establishing aggressive rules of the road on the international stage,” says Stevens. “It’s concerning to me that actors within the Chinese Communist Party could outpace us on these technological advancements.”

But these bills come at a time when big tech companies have ramped up lobbying efforts on AI. “Industry lobbyists are in an interesting predicament—their CEOs have said that they want more AI regulation, so it’s hard for them to visibly push to kill all AI regulation,” says David Evan Harris, who teaches courses on AI ethics at the University of California, Berkeley. “On the bills that they don’t blatantly try to kill, they instead try to make them meaningless by pushing to transform the language in the bills to make compliance optional and enforcement impossible.”

“A [voluntary commitment] is something that is also only accessible to the largest companies,” says Jernite at Hugging Face, claiming that sometimes the ambiguous nature of voluntary commitments allows big companies to set definitions for themselves. “If you have a voluntary commitment—that is, ‘We’re going to develop state-of-the-art watermarking technology’—you don’t know what state-of-the-art means. It doesn’t come with any of the concrete things that make regulation work.”

“We are in a very aggressive policy conversation about how to do this right, and how this carrot and stick is actually going to work,” says Stevens, indicating that Congress may ultimately draw red lines that AI companies must not cross.

There are other interesting insights to be gleaned from looking at the bills all together. Two-thirds of the AI bills are sponsored by Democrats. This isn’t too surprising, since some House Republicans have claimed to want no AI regulations, believing that guardrails will slow down progress.

The topics of the bills (as specified by Congress) are dominated by science, tech, and communications (28%), commerce (22%), updating government operations (18%), and national security (9%). Topics that don’t receive much attention include labor and employment (2%), environmental protection (1%), and civil rights, civil liberties, and minority issues (1%).

The lack of a focus on equity and minority issues came into view during the Senate markup session at the end of July. Senator Ted Cruz, a Republican, added an amendment that explicitly prohibits any action “to ensure inclusivity and equity in the creation, design, or development of the technology.” Cruz said regulatory action might slow US progress in AI, allowing the country to fall behind China.

On the House side, there was also a hesitation to work on bills dealing with biases in AI models. “None of our bills are addressing that. That’s one of the more ideological issues that we’re not moving forward on,” says Vaughan.

The lead Democrat on the House committee, Representative Zoe Lofgren, told MIT Technology Review, “It is surprising and disappointing if any of my Republican colleagues have made that comment about bias in AI systems. We shouldn’t tolerate discrimination that’s overt and intentional any more than we should tolerate discrimination that occurs because of bias in AI systems. I’m not really sure how anyone can argue against that.”

After publication, Vaughan clarified that “[Bias] is one of the bigger, more cross-cutting issues, unlike the narrow, practical bills we considered that week. But we do care about bias as an issue,” and she expects it to be addressed within an upcoming House Task Force report.

One issue that may rise above the partisan divide is deepfakes. The Defiance Act, one of several bills addressing them, is cosponsored by a Democratic senator, Amy Klobuchar, and a Republican senator, Josh Hawley. Deepfakes have already been abused in elections; for example, someone faked Joe Biden’s voice for a robocall to tell citizens not to vote. And the technology has been weaponized to victimize people by incorporating their images into pornography without their consent. 

“I certainly think that there is more bipartisan support for action on these issues than on many others,” says Daniel Weiner, director of the Brennan Center’s Elections & Government Program. “But it remains to be seen whether that’s going to win out against some of the more traditional ideological divisions that tend to arise around these issues.” 

None of the current slate of bills have resulted in laws yet, but the task of regulating any new technology, and specifically advanced AI systems that no one entirely understands, is difficult. The fact that Congress is making any progress at all may be surprising in itself. 

“Congress is not sleeping on this by any stretch of the means,” says Stevens. “We are evaluating and asking the right questions and also working alongside our partners in the Biden-Harris administration to get us to the best place for the harnessing of artificial intelligence.”

Update: We added further comments from the Republican spokesperson.

AI-generated content doesn’t seem to have swayed recent European elections 

AI-generated falsehoods and deepfakes seem to have had no effect on election results in the UK, France, and the European Parliament this year, according to new research. 

Since the beginning of the generative-AI boom, there has been widespread fear that AI tools could boost bad actors’ ability to spread fake content with the potential to interfere with elections or even sway the results. Such worries were particularly heightened this year, when billions of people were expected to vote in over 70 countries. 

Those fears seem to have been unwarranted, says Sam Stockwell, the researcher at the Alan Turing Institute who conducted the study. He focused on three elections over a four-month period from May to August 2024, collecting data on public reports and news articles on AI misuse. Stockwell identified 16 cases of AI-enabled falsehoods or deepfakes that went viral during the UK general election and only 11 cases in the EU and French elections combined, none of which appeared to definitively sway the results. The fake AI content was created by both domestic actors and groups linked to hostile countries such as Russia. 

These findings are in line with recent warnings from experts that the focus on election interference is distracting us from deeper and longer-lasting threats to democracy.   

AI-generated content seems to have been ineffective as a disinformation tool in most European elections this year so far. This, Stockwell says, is because most of the people who were exposed to the disinformation already believed its underlying message (for example, that levels of immigration to their country are too high). Stockwell’s analysis showed that people who were actively engaging with these deepfake messages by resharing and amplifying them had some affiliation or previously expressed views that aligned with the content. So the material was more likely to strengthen preexisting views than to influence undecided voters. 

Tried-and-tested election interference tactics, such as flooding comment sections with bots and exploiting influencers to spread falsehoods, remained far more effective. Bad actors mostly used generative AI to rewrite news articles with their own spin or to create more online content for disinformation purposes. 

“AI is not really providing much of an advantage for now, as existing, simpler methods of creating false or misleading information continue to be prevalent,” says Felix Simon, a researcher at the Reuters Institute for Journalism, who was not involved in the research. 

However, it’s hard to draw firm conclusions about AI’s impact upon elections at this stage, says Samuel Woolley, a disinformation expert at the University of Pittsburgh. That’s in part because we don’t have enough data.

“There are less obvious, less trackable, downstream impacts related to uses of these tools that alter civic engagement,” he adds.

Stockwell agrees: Early evidence from these elections suggests that AI-generated content could be more effective for harassing politicians and sowing confusion than changing people’s opinions on a large scale. 

Politicians in the UK, such as former prime minister Rishi Sunak, were targeted by AI deepfakes that, for example, showed them promoting scams or admitting to financial corruption. Female candidates were also targeted with nonconsensual sexual deepfake content, intended to disparage and intimidate them. 

“There is, of course, a risk that in the long run, the more that political candidates are on the receiving end of online harassment, death threats, deepfake pornographic smears—that can have a real chilling effect on their willingness to, say, participate in future elections, but also obviously harm their well-being,” says Stockwell. 

Perhaps more worrying, Stockwell says, his research indicates that people are increasingly unable to discern the difference between authentic and AI-generated content in the election context. Politicians are also taking advantage of that. For example, political candidates in the European Parliament elections in France have shared AI-generated content amplifying anti-immigration narratives without disclosing that it had been made with AI. 

“This covert engagement, combined with a lack of transparency, presents in my view a potentially greater risk to the integrity of political processes than the use of AI by the general population or so-called ‘bad actors,’” says Simon. 

Google’s new tool lets large language models fact-check their responses

As long as chatbots have been around, they have made things up. Such “hallucinations” are an inherent part of how AI models work. However, they’re a big problem for companies betting big on AI, like Google, because they make the responses it generates unreliable. 

Google is releasing a tool today to address the issue. Called DataGemma, it uses two methods to help large language models fact-check their responses against reliable data and cite their sources more transparently to users. 

The first of the two methods is called Retrieval-Interleaved Generation (RIG), which acts as a sort of fact-checker. If a user prompts the model with a question—like “Has the use of renewable energy sources increased in the world?”—the model will come up with a “first draft” answer. Then RIG identifies what portions of the draft answer could be checked against Google’s Data Commons, a massive repository of data and statistics from reliable sources like the United Nations or the Centers for Disease Control and Prevention. Next, it runs those checks and replaces any incorrect original guesses with correct facts. It also cites its sources to the user.
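
The draft-then-check loop described above can be illustrated with a toy Python sketch. This is not Google’s implementation: the `TRUSTED` dictionary stands in for Data Commons, and its keys, figures, and source name are all invented for illustration, as are the `Claim` class and `check_draft` function.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    key: str      # which statistic the draft is asserting
    guess: float  # the figure the model first wrote

# Stand-in for a trusted repository; the keys, values, and source
# names are invented for illustration, not real Data Commons data.
TRUSTED = {
    "renewable_share_2010": (10.0, "Example Statistics Office"),
    "renewable_share_2020": (15.0, "Example Statistics Office"),
}

def check_draft(claims):
    """Check each drafted figure against the store; keep, correct, or flag it."""
    checked = []
    for claim in claims:
        record = TRUSTED.get(claim.key)
        if record is None:
            # No usable data: the draft figure stays, marked as unverified.
            checked.append((claim.key, claim.guess, "unverified"))
        else:
            # Stored figure replaces the guess and carries its source.
            value, source = record
            checked.append((claim.key, value, source))
    return checked

draft = [
    Claim("renewable_share_2010", 10.0),  # matches the store
    Claim("renewable_share_2020", 22.0),  # wrong guess, gets replaced
    Claim("renewable_share_2030", 30.0),  # not in the store
]
for key, value, source in check_draft(draft):
    print(f"{key}: {value} ({source})")
```

The third claim shows the limitation discussed later in this story: when the repository has nothing relevant, the check cannot confirm or correct the draft.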

The second method, which is commonly used in other large language models, is called Retrieval-Augmented Generation (RAG). Consider a prompt like “What progress has Pakistan made against global health goals?” In response, the model examines which data in the Data Commons could help it answer the question, such as information about access to safe drinking water, hepatitis B immunizations, and life expectancies. With those figures in hand, the model then builds its answer on top of the data and cites its sources.
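
In rough outline, that retrieve-first approach looks like the Python sketch below. It is a simplified illustration under stated assumptions: the records, tags, and placeholder figures are invented, simple keyword overlap stands in for whatever retrieval DataGemma actually uses, and no language model is called. The sketch only assembles the prompt a model would answer from.

```python
# Invented records standing in for a statistics repository.
RECORDS = [
    {"topic": "safe drinking water access", "tags": {"health", "water"},
     "value": "figure A", "source": "Source 1"},
    {"topic": "hepatitis B immunization", "tags": {"health", "vaccines"},
     "value": "figure B", "source": "Source 2"},
    {"topic": "gross domestic product", "tags": {"economy"},
     "value": "figure C", "source": "Source 3"},
]

def retrieve(question: str, records: list[dict]) -> list[dict]:
    """Return records whose tags share at least one word with the question."""
    words = set(question.lower().replace("?", "").split())
    return [r for r in records if words & r["tags"]]

def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Put the retrieved figures and their sources ahead of the question."""
    lines = [
        f"- {r['topic']}: {r['value']} (source: {r['source']})"
        for r in retrieved
    ]
    return (
        "Answer using only these figures and cite them:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )

question = "What progress has Pakistan made against global health goals?"
print(build_prompt(question, retrieve(question, RECORDS)))
```

Here the two health-tagged records are pulled in and the economic one is left out, so the answer would be built only on data judged relevant to the question.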

“Our goal here was to use Data Commons to enhance the reasoning of LLMs by grounding them in real-world statistical data that you could source back to where you got it from,” says Prem Ramaswami, head of Data Commons at Google. Doing so, he says, will “create more trustable, reliable AI.”

It is only available to researchers for now, but Ramaswami says access could widen further after more testing. If it works as hoped, it could be a real boon for Google’s plan to embed AI deeper into its search engine.  

However, it comes with a host of caveats. First, the usefulness of the methods is limited by whether the relevant data is in the Data Commons, which is more of a data repository than an encyclopedia. It can tell you the GDP of Iran, but it’s unable to confirm the date of the First Battle of Fallujah or when Taylor Swift released her most recent single. In fact, Google’s researchers found that with about 75% of the test questions, the RIG method was unable to obtain any usable data from the Data Commons. And even if helpful data is indeed housed in the Data Commons, the model doesn’t always formulate the right questions to find it. 

Second, there is the question of accuracy. When testing the RAG method, researchers found that the model gave incorrect answers 6% to 20% of the time. Meanwhile, the RIG method pulled the correct stat from Data Commons only about 58% of the time (though that’s a big improvement over the 5% to 17% accuracy rate of Google’s large language models when they’re not pinging Data Commons). 

Ramaswami says DataGemma’s accuracy will improve as it gets trained on more and more data. The initial version has been trained on only about 700 questions, and fine-tuning the model required his team to manually check each individual fact it generated. To further improve the model, the team plans to increase that data set from hundreds of questions to millions.