AI companies are finally being forced to cough up for training data

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

The generative AI boom is built on scale. The more training data, the more powerful the model. 

But there’s a problem: AI companies have pillaged the internet for training data, and many websites and data set owners have started restricting scraping. We’ve also seen a backlash against the AI sector’s practice of indiscriminately scraping online data, in the form of users opting out of making their data available for training, and of lawsuits from artists, writers, and the New York Times claiming that AI companies have taken their intellectual property without consent or compensation. 

Last week three major record labels—Sony Music, Warner Music Group, and Universal Music Group—announced they were suing the AI music companies Suno and Udio over alleged copyright infringement. The labels claim the companies used copyrighted music in their training data “at an almost unimaginable scale,” allowing the AI models to generate songs that “imitate the qualities of genuine human sound recordings.” My colleague James O’Donnell dissects the suits in his story and points out that they could determine the future of AI music. Read it here.

But this moment also sets an interesting precedent for all generative AI development. Thanks to the scarcity of high-quality data and the immense pressure to build ever bigger and better models, we’re in a rare moment where data owners actually have some leverage. The music industry’s lawsuits send the loudest message yet: High-quality training data is not free. 

It will likely take at least a few years before we have legal clarity around copyright law, fair use, and AI training data. But the cases are already ushering in changes. OpenAI has been striking deals with news publishers such as Politico, the Atlantic, Time, the Financial Times, and others, offering money and citations in exchange for their news archives. And YouTube announced in late June that it will offer licensing deals to top record labels in exchange for music for training. 

These changes are a mixed bag. On one hand, I’m concerned that news publishers are making a Faustian bargain with AI. For example, most of the media houses that have made deals with OpenAI say the terms require OpenAI to cite its sources. But language models are not built for factuality and are prone to making things up. Reports have shown that ChatGPT and the AI-powered search engine Perplexity frequently hallucinate citations, which makes it hard for OpenAI to honor its promises. 

It’s tricky for AI companies too. This shift could lead them to build smaller, more efficient models, which are far less polluting. Or they may fork out a fortune to access data at the scale they need to build the next big one. Only the companies most flush with cash, and/or with large existing data sets of their own (such as Meta, with its two decades of social media data), can afford to do that. So the latest developments risk concentrating power even further in the hands of the biggest players. 

On the other hand, the idea of introducing consent into this process is a good one—not just for rights holders, who can benefit from the AI boom, but for all of us. We should all have the agency to decide how our data is used, and a fairer data economy would mean we could all benefit. 


Now read the rest of The Algorithm

Deeper Learning

How AI video games can help reveal the mysteries of the human mind

Neuroscientists and psychologists have long been using games as research tools to learn about the human mind. Video games have been either co-opted or specially designed to study how people learn, navigate, and cooperate with others, for example. AI video games—where characters don’t need scripts and appear to play when you’re not watching—could allow us to probe more deeply and unravel enduring mysteries about our brains and behavior, suggests my colleague Jessica Hamzelou in our weekly biotech newsletter, The Checkup.

Ready, set, go: Scientists who have run these studies were able to observe how players behaved in the games: how they explored their virtual environment, how they sought rewards, how they made decisions. And research volunteers didn’t need to travel to a lab—their gaming behavior could be observed wherever they happened to be playing, whether that was at home, at a library, or even inside an MRI scanner. Read more from Jessica.

Bits and Bytes

AI is already wreaking havoc on global power systems
A really well-done data visualization of the insane amount of electricity AI requires and how it is transforming our energy grid. A startling statistic: Data centers use more electricity than most countries. (Bloomberg)

The AI boom has an unlikely early winner: wonky consultants
It seems every company out there is thinking about how to use AI. But the problem is that nobody is sure exactly how to do that. And so in come consultants, who are profiting from AI FOMO. Work related to generative AI will make up about 40% of McKinsey’s business this year. (The New York Times)

Deepfake creators are revictimizing sex trafficking survivors
A new low: For the past few months, the largest deepfake sexual abuse website has posted deepfake videos based on footage from GirlsDoPorn, a now-defunct sex trafficking operation. (Wired)

I paid $365.63 to replace 404 Media with AI
A journalist paid gig workers to use ChatGPT to plagiarize news. The result: grammatically correct nonsense. (404 Media)

Why artists are becoming less scared of AI

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Knock, knock. 

Who’s there? 

An AI with generic jokes. Researchers from Google DeepMind asked 20 professional comedians to use popular AI language models to write jokes and comedy performances. Their results were mixed. 

The comedians said the tools were useful for producing an initial “vomit draft” they could iterate on, and for structuring their routines. But the AI was not able to produce anything that was original, stimulating, or, crucially, funny. My colleague Rhiannon Williams has the full story.

As Tuhin Chakrabarty, a computer science researcher at Columbia University who specializes in AI and creativity, told Rhiannon, humor often relies on being surprising and incongruous. Creative writing requires its creator to deviate from the norm, whereas LLMs can only mimic it.

And that is becoming pretty clear in the way artists are approaching AI today. I’ve just come back from Hamburg, which hosted one of the largest events for creatives in Europe, and the message I got from those I spoke to was that AI is too glitchy and unreliable to fully replace humans and is best used instead as a tool to augment human creativity. 

Right now, we are in a moment where we are deciding how much creative power we are comfortable giving to AI companies and tools. After the boom started in 2022, when DALL-E 2 and Stable Diffusion entered the scene, many artists raised concerns that AI companies were scraping their copyrighted work without consent or compensation. Tech companies argue that anything on the public internet falls under fair use, a legal doctrine that allows the reuse of copyright-protected material in certain circumstances. Artists, writers, image companies, and the New York Times have filed lawsuits against these companies, and it will likely take years until we have a clear-cut answer as to who is right. 

Meanwhile, the court of public opinion has shifted a lot in the past two years. Artists I have interviewed recently say they were harassed and ridiculed for protesting AI companies’ data-scraping practices two years ago. Now, the general public is more aware of the harms associated with AI. In just two years, the public has gone from being blown away by AI-generated images to sharing viral social media posts about how to opt out of AI scraping—a concept that was alien to most laypeople until very recently. Companies have benefited from this shift too. Adobe has been successful in pitching its AI offerings as an “ethical” way to use the technology without having to worry about copyright infringement. 

There are also several grassroots efforts to shift the power structures of AI and give artists more agency over their data. I’ve written about Nightshade, a tool created by researchers at the University of Chicago, which lets users add an invisible poison attack to their images so that they break AI models when scraped. The same team is behind Glaze, a tool that lets artists mask their personal style from AI copycats. Glaze has been integrated into Cara, a buzzy new art portfolio site and social media platform that has seen a surge of interest from artists, gaining nearly a million new users in a few days. Cara pitches itself as a platform for art created by people, and it filters out AI-generated content. 
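To make the mechanism a little more concrete, here is a minimal sketch of the general idea behind such tools: adding a small, bounded change to an image’s pixels before it is posted online. This is emphatically not the actual Nightshade or Glaze algorithm—the real tools optimize their perturbations against model feature extractors—and the file names are hypothetical; random noise stands in here for the optimized perturbation.

```python
# Conceptual sketch only: a bounded pixel perturbation, NOT the real
# Nightshade/Glaze optimization. File names are hypothetical.
import numpy as np
from PIL import Image

def perturb(path_in: str, path_out: str, epsilon: int = 4) -> None:
    """Add pixel noise bounded by +/- epsilon (out of 255) to an image."""
    img = np.asarray(Image.open(path_in).convert("RGB"), dtype=np.int16)
    rng = np.random.default_rng()
    # The real tools compute this delta to mislead a model's features;
    # random noise here just illustrates the perturbation budget.
    delta = rng.integers(-epsilon, epsilon + 1, size=img.shape)
    out = np.clip(img + delta, 0, 255).astype(np.uint8)
    Image.fromarray(out).save(path_out)

perturb("artwork.png", "artwork_protected.png")
```

The key design constraint is the budget epsilon: small enough that a human viewer sees no difference, while the change (optimized, in the real tools) is large enough to disrupt a model trained on the scraped copy.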

This all should be reassuring news for any creative people worried that they could lose their job to a computer program. And the DeepMind study is a great example of how AI can actually be helpful for creatives. It can take on some of the boring, mundane, formulaic aspects of the creative process, but it can’t replace the magic and originality that humans bring. AI models are limited to their training data and will forever only reflect the zeitgeist at the moment of their training. That gets old pretty quickly.


Now read the rest of The Algorithm

Deeper Learning

Apple is promising personalized AI in a private cloud. Here’s how that will work.

Last week, Apple unveiled its vision for supercharging its product lineup with artificial intelligence. The key feature, which will run across virtually all of its products, is Apple Intelligence, a suite of AI-based capabilities that promises to deliver personalized AI services while keeping sensitive data secure. 

Why this matters: Apple says its privacy-focused system will first attempt to fulfill AI tasks locally, on the device itself. If any data is exchanged with cloud services, it will be encrypted and then deleted afterward. It’s a pitch that offers an implicit contrast with the likes of Alphabet, Amazon, or Meta, which collect and store enormous amounts of personal data. Read more from James O’Donnell here.
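As a rough illustration of that local-first flow, here is a hypothetical sketch in Python. The functions try_on_device and cloud_call are illustrative stand-ins, not Apple APIs; the only real library call is Fernet encryption from Python’s cryptography package, used to show data being encrypted before it leaves the device.

```python
# Hypothetical sketch of a "local-first, encrypted cloud fallback" pattern.
# try_on_device and cloud_call are made-up stand-ins, not Apple APIs.
from cryptography.fernet import Fernet

def try_on_device(prompt: str) -> str | None:
    """Stand-in for an on-device model; returns None if the task is too big."""
    return f"local answer to: {prompt}" if len(prompt) < 80 else None

def cloud_call(encrypted_prompt: bytes) -> str:
    """Stand-in for the cloud service, which per Apple's pitch deletes
    the data after responding."""
    return "cloud answer"

def handle_request(prompt: str) -> str:
    local = try_on_device(prompt)                  # step 1: try locally
    if local is not None:
        return local
    key = Fernet.generate_key()                    # step 2: encrypt before it leaves
    token = Fernet(key).encrypt(prompt.encode())
    return cloud_call(token)                       # step 3: fall back to the cloud

print(handle_request("Summarize my unread messages"))
```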

Bits and Bytes

How to opt out of Meta’s AI training
If you post or interact with chatbots on Facebook, Instagram, Threads, or WhatsApp, Meta can use your data to train its generative AI models. Even if you don’t use any of Meta’s platforms, it can still scrape data such as photos of you if someone else posts them. Here’s our quick guide on how to opt out. (MIT Technology Review)

Microsoft’s Satya Nadella is building an AI empire
Nadella is going all in on AI. His $13 billion investment in OpenAI was just the beginning. Microsoft has become “the world’s most aggressive amasser of AI talent, tools, and technology” and has started building an in-house OpenAI competitor. (The Wall Street Journal)

OpenAI has hired an army of lobbyists
As countries around the world mull AI legislation, OpenAI is on a lobbyist hiring spree to protect its interests. The AI company has expanded its global affairs team from three lobbyists at the start of 2023 to 35 and intends to have up to 50 by the end of this year. (Financial Times)  

UK rolls out Amazon-powered emotion recognition AI cameras on trains
People traveling through some of the UK’s biggest train stations have likely had their faces scanned by Amazon software without their knowledge during an AI trial. London stations such as Euston and Waterloo have tested AI-powered CCTV cameras to reduce crime and detect people’s emotions. Emotion recognition technology is extremely controversial; experts say it is unreliable and simply does not work. (Wired)

Clearview AI used your face. Now you may get a stake in the company.
The facial recognition company, which has been under fire for scraping images of people’s faces from the web and social media without their permission, has agreed to an unusual settlement in a class action against it. Instead of paying cash, it is offering a 23% stake in the company to Americans whose faces are in its data sets. (The New York Times)

Elephants call each other by their names
This is so cool! Researchers used AI to analyze the calls of two herds of African savanna elephants in Kenya. They found that elephants use specific vocalizations for each individual and recognize when they are being addressed by other elephants. (The Guardian)

What I learned from the UN’s “AI for Good” summit

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.

Greetings from Switzerland! I’ve just come back from Geneva, which last week hosted the UN’s AI for Good Summit, organized by the International Telecommunication Union. The summit’s big focus was how AI can be used to meet the UN’s Sustainable Development Goals, such as eradicating poverty and hunger, achieving gender equality, and promoting clean energy and climate action, among others. 

The conference featured lots of robots (including one that dispenses wine), but what I liked most of all was how it managed to convene people working in AI from around the globe, including speakers from China, the Middle East, and Africa, such as Pelonomi Moiloa, the CEO of Lelapa AI, a startup building AI for African languages. AI can be very US-centric and male-dominated, and any effort to make the conversation more global and diverse is laudable. 

But honestly, I didn’t leave the conference feeling confident AI was going to play a meaningful role in advancing any of the UN goals. In fact, the most interesting speeches were about how AI is doing the opposite. Sage Lenier, a climate activist, talked about how we must not let AI accelerate environmental destruction. Tristan Harris, the cofounder of the Center for Humane Technology, gave a compelling talk connecting the dots between our addiction to social media, the tech sector’s financial incentives, and our failure to learn from previous tech booms. And there are still deeply ingrained gender biases in tech, Mia Shah-Dand, the founder of Women in AI Ethics, reminded us. 

So while the conference itself was about using AI for “good,” I would have liked to see more talk about how increased transparency, accountability, and inclusion could make AI itself good from development to deployment.

We now know that generating one image with generative AI uses as much energy as charging a smartphone. I would have liked more honest conversations about how to make the technology itself more sustainable in order to meet climate goals. And it felt jarring to hear discussions about how AI can be used to help reduce inequalities when we know that so many of the AI systems we use are built on the backs of human content moderators in the Global South, who sift through traumatizing content while being paid peanuts. 

Making the case for the “tremendous benefit” of AI was OpenAI’s CEO, Sam Altman, the star speaker of the summit. Altman was interviewed remotely by Nicholas Thompson, the CEO of the Atlantic, which has incidentally just announced a deal allowing OpenAI to train new AI models on its content. OpenAI is the company that instigated the current AI boom, and this would have been a great opportunity to ask Altman about all these issues. Instead, the two had a relatively vague, high-level discussion about safety, leaving the audience none the wiser about what exactly OpenAI is doing to make its systems safer. It seemed the audience was simply supposed to take Altman’s word for it. 

Altman’s talk came a week or so after Helen Toner, a researcher at the Georgetown Center for Security and Emerging Technology and a former OpenAI board member, said in an interview that the board found out about the launch of ChatGPT through Twitter, and that Altman had on multiple occasions given the board inaccurate information about the company’s formal safety processes. She has also argued that it is a bad idea to let AI firms govern themselves, because the immense profit incentives will always win. (Altman said he “disagree[s] with her recollection of events.”) 

When Thompson asked Altman what the first good thing to come out of generative AI will be, Altman mentioned productivity, citing examples such as software developers who can use AI tools to do their work much faster. “We’ll see different industries become much more productive than they used to be because they can use these tools. And that will have a positive impact on everything,” he said. I think the jury is still out on that one. 


Now read the rest of The Algorithm

Deeper Learning

Why Google’s AI Overviews gets things wrong

Google’s new feature, called AI Overviews, provides brief, AI-generated summaries highlighting key information and links on top of search results. Unfortunately, within days of AI Overviews’ release in the US, users were sharing examples of responses that were strange at best, such as suggestions to add glue to pizza or eat at least one small rock a day.

MIT Technology Review explains: In order to understand why AI-powered search engines get things wrong, we need to look at how they work. The models that power them simply predict the next word (or token) in a sequence, which makes them appear fluent but also leaves them prone to making things up. They have no ground truth to rely on; instead, they choose each word purely on the basis of a statistical calculation. Worst of all? There’s probably no way to fix this. That’s why you shouldn’t trust AI search engines. Read more from Rhiannon Williams here.
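For readers who want to see what “purely a statistical calculation” means in practice, here is a toy sketch of next-token sampling. The scores and the five-token vocabulary are made up; real models emit one score per token over vocabularies of tens of thousands of entries.

```python
# Toy sketch of next-token sampling: scores become probabilities, and one
# token is drawn at random. Nothing here checks whether the result is true.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8) -> int:
    """Convert raw model scores into probabilities and sample one token id."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    # The pick is purely statistical: a plausible token, not a checked fact.
    return int(np.random.default_rng().choice(len(probs), p=probs))

# Made-up scores over a five-token vocabulary:
fake_logits = np.array([2.0, 1.5, 0.3, -1.0, 0.0])
print(sample_next_token(fake_logits))
```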

Bits and Bytes

OpenAI’s latest blunder shows the challenges facing Chinese AI models
OpenAI’s GPT-4o data set is polluted by Chinese spam websites. But this problem is indicative of a much wider issue for those building Chinese AI services: finding the high-quality data sets they need for training is tricky, because of the way China’s internet functions. (MIT Technology Review)

Five ways criminals are using AI
Artificial intelligence has brought a big boost in productivity—to the criminal underworld. Generative AI has made phishing, scamming, and doxxing easier than ever. (MIT Technology Review)

OpenAI is rebooting its robotics team
After disbanding its robotics team in 2020, the company is trying again. The resurrection is in part thanks to rapid advancements in robotics brought by generative AI. (Forbes)

OpenAI found Russian and Chinese groups using its tech for propaganda campaigns
OpenAI said that it caught, and removed, groups from Russia, China, Iran, and Israel that were using its technology to try to influence political discourse around the world. But this is likely just the tip of the iceberg when it comes to how AI is being used to affect this year’s record-breaking number of elections. (The Washington Post)

Inside Anthropic, the AI company betting that safety can be a winning strategy
The AI lab Anthropic, creator of the Claude model, was started by former OpenAI employees who resigned over “trust issues.” This profile is an interesting peek inside one of OpenAI’s competitors, showing how the ideology behind AI safety and effective altruism is guiding business decisions. (Time)

AI-directed drones could help find lost hikers faster
Drones are already used for search and rescue, but planning their search paths is more art than science. AI could change that. (MIT Technology Review)