A new tool for copyright holders can show if their work is in AI training data

Since the beginning of the generative AI boom, content creators have argued that their work has been scraped into AI models without their consent. But until now, it has been difficult to know whether specific text has actually been used in a training data set. 

Now they have a new way to prove it: “copyright traps” developed by a team at Imperial College London, pieces of hidden text that allow writers and publishers to subtly mark their work in order to later detect whether it has been used in AI models or not. The idea is similar to traps that have been used by copyright holders throughout history—strategies like including fake locations on a map or fake words in a dictionary. 

These AI copyright traps tap into one of the biggest fights in AI. A number of publishers and writers are in the middle of litigation against tech companies, claiming their intellectual property has been scraped into AI training data sets without their permission. The New York Times’ ongoing case against OpenAI is probably the most high-profile of these.  

The code to generate and detect traps is currently available on GitHub, but the team also intends to build a tool that allows people to generate and insert copyright traps themselves. 

“There is a complete lack of transparency in terms of which content is used to train models, and we think this is preventing finding the right balance [between AI companies and content creators],” says Yves-Alexandre de Montjoye, an associate professor of applied mathematics and computer science at Imperial College London, who led the research. It was presented at the International Conference on Machine Learning, a top AI conference being held in Vienna this week. 

To create the traps, the team used a word generator to create thousands of synthetic sentences. These sentences are long and full of gibberish, and could look something like this: “When in comes times of turmoil … whats on sale and more important when, is best, this list tells your who is opening on Thrs. at night with their regular sale times and other opening time from your neighbors. You still.”
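
The team's actual generator is LLM-based; the overall process can be sketched in miniature with a toy sampler. The vocabulary and function names below are illustrative, not from the team's published code:

```python
import random

# Illustrative vocabulary; the real generator draws from a much larger
# distribution of words to produce long, low-probability sequences.
VOCAB = [
    "turmoil", "sale", "opening", "neighbors", "regular", "times",
    "list", "night", "important", "tells", "your", "when",
]

def generate_trap_sentence(length=75, seed=None):
    """Return one synthetic sentence of `length` randomly chosen words."""
    rng = random.Random(seed)
    words = [rng.choice(VOCAB) for _ in range(length)]
    return " ".join(words).capitalize() + "."

def generate_traps(n=100, length=75):
    """Generate a pool of n candidate trap sentences, as in the study."""
    return [generate_trap_sentence(length, seed=i) for i in range(n)]
```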

The team generated 100 trap sentences and then randomly chose one to inject into a text many times, de Montjoye explains. The trap could be injected into text in multiple ways—for example, as white text on a white background, or embedded in the article’s source code. This sentence had to be repeated in the text 100 to 1,000 times. 
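
One of those injection routes, hidden white-on-white text, can be sketched as follows. This is an assumption about how such an injection might look in practice; the function name and styling are illustrative:

```python
def inject_trap_html(article_html, trap, copies=100):
    """Append visually hidden repetitions of a trap sentence to a page.

    A scraper that extracts all text from the HTML, visible or not, will
    carry the trap into its training corpus, while human readers see nothing.
    """
    hidden = (
        '<span style="color:#fff;background:#fff;font-size:0" aria-hidden="true">'
        + trap + "</span>"
    )
    return article_html + "\n" + "\n".join([hidden] * copies)
```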

To detect the traps, they fed the 100 synthetic sentences they had generated to a large language model and looked at whether it flagged them as familiar or new. If the model had seen a trap sentence in its training data, it would produce a lower “surprise” (also known as “perplexity”) score for it. But if the model was “surprised” by a sentence, it was encountering it for the first time, meaning the sentence had not appeared in its training data. 
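
In code, the detection step amounts to a perplexity threshold. The sketch below assumes a `score_fn` that returns per-token log-probabilities from some language model (for instance, obtained via the `transformers` library); the threshold value is illustrative, not from the paper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity is the exponential of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_memorized(sentences, score_fn, threshold=20.0):
    """Flag sentences the model finds suspiciously unsurprising.

    A low perplexity on a gibberish sentence suggests the model saw it
    during training; a high perplexity suggests it is seeing it fresh.
    """
    return [s for s in sentences if perplexity(score_fn(s)) < threshold]
```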

In the past, researchers have suggested exploiting the fact that language models memorize their training data to determine whether something has appeared in that data. The technique, called a “membership inference attack,” works effectively in large state-of-the-art models, which tend to memorize a lot of their data during training. 

In contrast, smaller models, which are gaining popularity and can be run on mobile devices, memorize less and are thus less susceptible to membership inference attacks, which makes it harder to determine whether or not they were trained on a particular copyrighted document, says Gautam Kamath, an assistant computer science professor at the University of Waterloo, who was not part of the research. 

Copyright traps are a way to do membership inference attacks even on smaller models. The team injected their traps into the training data set of CroissantLLM, a new bilingual French-English language model trained from scratch by a team of industry and academic researchers in partnership with the Imperial College London team. CroissantLLM has 1.3 billion parameters, a small fraction of the count in state-of-the-art models (GPT-4 reportedly has 1.76 trillion, for example).

The research shows it is indeed possible to introduce such traps into text data so as to significantly increase the efficacy of membership inference attacks, even for smaller models, says Kamath. But there’s still a lot to be done, he adds. 

Repeating a 75-word phrase 1,000 times in a document is a big change to the original text, which could allow people training AI models to detect the trap and skip content containing it, or just delete it and train on the rest of the text, Kamath says. It also makes the original text hard to read. 

This makes copyright traps impractical right now, says Sameer Singh, a professor of computer science at the University of California, Irvine, and a cofounder of the startup Spiffy AI. He was not part of the research. “A lot of companies do deduplication, [meaning] they clean up the data, and a bunch of this kind of stuff will probably get thrown out,” Singh says. 

One way to improve copyright traps, says Kamath, would be to find other ways to mark copyrighted content so that membership inference attacks work better on them, or to improve membership inference attacks themselves. 

De Montjoye acknowledges that the traps are not foolproof. A motivated attacker who knows about a trap can remove it, he says. 

“Whether they can remove all of them or not is an open question, and that’s likely to be a bit of a cat-and-mouse game,” he says. But even then, the more traps are applied, the harder it becomes to remove all of them without significant engineering resources.

“It’s important to keep in mind that copyright traps may only be a stopgap solution, or merely an inconvenience to model trainers,” says Kamath. “One cannot release a piece of content containing a trap and have any assurance that it will be an effective trap forever.” 

New Ecommerce Tools: July 25, 2024

We publish a weekly rundown of new products and services for ecommerce and omnichannel merchants. This installment includes updates on AI-powered customer service agents, post-purchase platforms, logistics, business research resources, payments, and financing for ecommerce companies.

Got an ecommerce product release? Email releases@practicalecommerce.com.

New Tools for Merchants: July 25, 2024

Salesforce launches autonomous Einstein Service Agent for chatbot experiences. Salesforce has announced Einstein Service Agent, the company’s first fully autonomous AI agent. Unlike traditional chatbots, Einstein Service Agent uses generative AI to create conversational responses, grounding its responses in a company’s data and tailored to a company’s brand voice, tone, and guidelines. Currently in pilot and generally available later this year, Einstein Service Agent can be set up in minutes with user-friendly interfaces, pre-built templates, and low-code actions and workflows, per Salesforce.

Salesforce’s Einstein Service Agent

Loop and ReBound announce omnichannel returns partnership. Two returns technology providers, Loop and ReBound, have announced a partnership. Retailers using the Loop platform can now access ReBound’s omnichannel returns management services. The integration allows Loop users to leverage ReBound’s global logistics offering, including return shipments, advanced local processing, and consolidation. The companies stated they aim to maximize margins, improve customer service, and increase customer retention for retailers.

Clearco and Boundless partner to provide capital for ecommerce merchants. Clearco, an ecommerce invoice financing service, and Boundless, a capital marketplace for small and medium businesses, have partnered to provide access to working capital for ecommerce brands. Clearco will offer businesses direct access to Boundless’s marketplace, which includes lenders, financiers, and funding providers. Adding Clearco to Boundless’s marketplace provides a working capital provider specific to ecommerce.

WP Engine acquires NitroPack for managed WordPress site performance. WP Engine, a provider of premium WordPress hosting and other products, has acquired NitroPack, an all-in-one SaaS tool for improving site speed and performance metrics, including Google’s Core Web Vitals. NitroPack was previously introduced to WP Engine customers with the launch of Page Speed Boost, which leverages NitroPack’s WordPress optimization technology. The acquisition will further strengthen WP Engine’s website performance capabilities, providing customers with innovative features and flexibility for site optimization.

NitroPack

ReturnGO brings returns and exchanges to Salesforce Commerce Cloud. ReturnGO, a post-purchase service for ecommerce stores, has expanded its platform to include Salesforce Commerce Cloud, enabling retailers to benefit from automated returns and exchange processes. Key features for Salesforce Commerce Cloud merchants include streamlined returns, a user-friendly return process, and data-driven insights. This strategic move is part of ReturnGO’s broader initiative to support enterprise-scale ecommerce markets and enhance post-purchase experiences across diverse platforms.

Fashion marketplace Depop removes selling fees. Depop, a community-powered marketplace for secondhand fashion, is removing its 10% selling fee for U.S.-based users. The removal of selling fees is part of a wider Depop pricing update. A marketplace fee will continue to support investment across the platform. The new fee structure in the U.S. follows Depop’s move to remove selling fees in the U.K. earlier this year.

Logistics provider Cirro E-Commerce integrates with 17Track for post-purchase experience. Cirro E-Commerce, an ecommerce logistics provider, has integrated with 17Track, a global shipment tracking service platform. The integration improves access to real-time tracking information for marketplaces, ecommerce merchants, and consumers. By automatically sharing tracking data with 17Track’s partners and users, it becomes easy for all parties to stay informed about their orders. The tracking platform enhances traceability, ensuring updates are always available and accurate, boosting Cirro E-Commerce’s reliability with consumers.

17Track

Reshop launches returns platform featuring instant refunds. Reshop has launched a platform to manage refunds for consumers and merchants, debuting with premium lifestyle brands such as Alo Yoga, Steve Madden, and Orveon’s premium cosmetic brands, as well as post-purchase provider Narvar. According to Reshop, the platform removes the wait time for shoppers to get their money back, freeing up funds that normally sit dormant.

LexisNexis launches Nexis+ AI, a generative AI platform for company research. LexisNexis, a provider of information and analytics, has announced the commercial launch of Nexis+ AI, incorporating generative AI to transform and accelerate corporate research, intelligence gathering, and business decision-making. Enhanced by more than 700 hours of in-depth customer interviews, Nexis+ AI offers companies the ability to quickly research and analyze business and legal information, summarize documents, compile and share content, create first drafts or outlines, and derive actionable business insights.

Embedded lending platform Jifiti introduces tap now, pay later. Jifiti, an embedded lending technology company, has released its tap now, pay later technology, enabling consumers to add approved loan or credit funds into any digital wallet. Tap now, pay later requires no integration with the merchant and can go live within hours. Jifiti’s white-labeled embedded lending platform enables banks, lenders, and merchants to offer financing options at any point of sale via virtual cards, API integration, or ecommerce plugins.

Design.com launches AI design tools for small businesses. DesignCrowd, an Australia-based design company, has launched Design.com, an online design platform for entrepreneurs and small businesses. The platform offers design tools to help small businesses and entrepreneurs launch and grow their ventures, including 500,000 design templates. With the launch of Design.com, the company also announced two new AI products — an AI business name generator and an AI logo generator.

Design.com

Amazon’s Direct-from-China Plan Criticized

We asked industry pros to comment on Amazon’s plan to create a new section for Chinese sellers to ship directly to U.S. customers. The move is an apparent effort to win back consumers who have turned to Temu and Shein, purveyors of inexpensive household goods and apparel, respectively.

China-based Shein sells inexpensive “fast fashion” apparel.

Bad Idea

It’s a bad idea, according to Phil Masiello, CEO of CrunchGrowth Revenue Acceleration Agency and a longtime Amazon seller and founder of multiple ecommerce companies.

Sellers and brands have been fighting with Amazon against cheap fakes from China for years. “It’s going to anger the brands on there,” Masiello said in a video interview, adding, “Amazon should go higher. They should go into exclusives versus trying to be a Temu.”

The Chinese competition is selling “junk to the uneducated. They buy it once. They’re not long-term Temu customers,” Masiello said. Temu’s popular, he says, but the business model is not sustainable.

“Amazon has one thing that any brand would love, which is retention,” Masiello added. “Everybody has the Amazon app on their phone. It’s the first place we look for something.”

Masiello believes the move will cost Amazon, where quality sellers face increasing fees — about 50% of sales go to Amazon.

Masiello’s not alone in his opinion that Amazon’s making a mistake.

Inviting Competition

“Amazon made a deal with the devil by letting this crap in from overseas,” stated Rick Wilson, chief executive officer of Miva, an ecommerce platform. “They invited the competition.”

Higher-end products will be insulated from the new storefront, but “it ultimately depends on the item.”

“Amazon continues to aggressively pursue overseas manufacturers and make it easier for them to become consumer brands themselves,” James Thomson, managing partner at Equity Value Advisors and a former Amazon executive, said. “In many categories, U.S.-based brands on Amazon are sourcing stuff overseas and now competing against their manufacturers. Amazon’s enabling them.”

“As Amazon goes after lower cost options, it’s harder for small brands in the United States to do well,” Thomson said.

Still, Thomson doesn’t see much of a problem for better sellers at higher price points.

“Lots of stuff on Temu and Shein is a spontaneous purchase,” Thomson said. “I don’t go to these types of sites thinking here’s what I need to refill my supplies at home.”

Thrasio, the ecommerce aggregator that recently emerged from bankruptcy, is focusing on quality and loyalty to avoid competing with Temu and Shein products.

Commoditized items, like kitchen utensils, will become even cheaper for consumers with direct-from-China offerings.

By focusing on quality, Thrasio hopes to avoid competing against inexpensive alternatives.

“There’s just no way to compete on some of those commodity products,” Stephanie Fox, Thrasio’s new chief executive officer, said in an interview. “Your margins are going to be 5%, which will never support a scaled business. Could solo entrepreneurs potentially compete against those products? Sure, but they won’t make a ton of money doing it.”

“Competing in those low-margin commodity products, which is exactly what Amazon is focusing on, it’s not going to be worth it,” Fox said.

Mark Daoust, the founder of Quiet Light, an ecommerce brokerage, compared the move to the launch of Amazon Basics.

“I saw a business that was killed by Amazon Basics,” Daoust said, citing one client who sold lower-priced office chairs. “Most sellers want to build a brand, a quality product, and focus on a uniqueness that no one can imitate. The client made economical chairs that were accessible to a lot of people. That was the whole value proposition.”

It wasn’t necessarily the best chair on the market, but it was economical and worked very well — until it didn’t. “Amazon Basics destroyed that company.”

Study Backs Google’s Claims: AI Search Boosts User Satisfaction via @sejournal, @MattGSouthern

A new study finds that despite concerns about AI in online services, users are more satisfied with search engines and social media platforms than before.

The American Customer Satisfaction Index (ACSI) conducted its annual survey of search and social media users, finding that satisfaction has either held steady or improved.

This comes at a time when major tech companies are heavily investing in AI to enhance their services.

Search Engine Satisfaction Holds Strong

Google, Bing, and other search engines have rapidly integrated AI features into their platforms over the past year. While critics have raised concerns about potential negative impacts, the ACSI study suggests users are responding positively.

Google maintains its position as the most satisfying search engine with an ACSI score of 81, up 1% from last year. Users particularly appreciate its AI-powered features.

Interestingly, Bing and Yahoo! have seen notable improvements in user satisfaction, notching 3% gains to reach scores of 77 and 76, respectively. These are their highest ACSI scores in over a decade, likely due to their AI enhancements launched in 2023.

The study hints at the potential of new AI-enabled search functionality to drive further improvements in the customer experience. Bing has seen its market share improve by small but notable margins, rising from 6.35% in the first quarter of 2023 to 7.87% in Q1 2024.

Customer Experience Improvements

The ACSI study shows improvements across nearly all benchmarks of the customer experience for search engines. Notable areas of improvement include:

  • Ease of navigation
  • Ease of using the site on different devices
  • Loading speed performance and reliability
  • Variety of services and information
  • Freshness of content

These improvements suggest that AI enhancements positively impact various aspects of the search experience.

Social Media Sees Modest Gains

For the third year in a row, user satisfaction with social media platforms is on the rise, increasing 1% to an ACSI score of 74.

TikTok has emerged as the new industry leader among major sites, edging past YouTube with a score of 78. This underscores the platform’s effective use of AI-driven content recommendations.

Meta’s Facebook and Instagram have also seen significant improvements in user satisfaction, showing 3-point gains. While Facebook remains near the bottom of the industry at 69, Instagram’s score of 76 puts it within striking distance of the leaders.

Challenges Remain

Despite improvements, the study highlights ongoing privacy and advertising challenges for search engines and social media platforms. Privacy ratings for search engines remain relatively low but steady at 79, while social media platforms score even lower at 73.

Advertising experiences emerge as a key differentiator between higher- and lower-satisfaction brands, particularly in social media. New ACSI benchmarks reveal user concerns about advertising content’s trustworthiness and personal relevance.

Why This Matters For SEO Professionals

This study provides an independent perspective on how users are responding to the AI push in online services. For SEO professionals, these findings suggest that:

  1. AI-enhanced search features resonate with users, potentially changing search behavior and expectations.
  2. The improving satisfaction with alternative search engines like Bing may lead to a more diverse search landscape.
  3. The continued importance of factors like content freshness and site performance in user satisfaction aligns with long-standing SEO best practices.

As AI becomes more integrated into our online experiences, SEO strategies may need to adapt to changing user preferences.


Elevating author and publisher entities in SEO

The SEO community has been buzzing following the release of internal Google documents, revealing more details about how author and publisher entities influence search rankings. These insights help you strategically optimize your author and publisher profiles. This article will explore these entities and give you some actionable strategies to incorporate their optimization into your existing SEO practices.

Tracking author and publisher entities

The leaked documents confirm that Google tracks and retains content authorship and publisher credibility data. These elements feed into the ranking algorithms. The rationale behind this is straightforward: credible and authoritative content is more likely to be accurate, reliable, and useful to users. Therefore, content attributed to recognized authors and reputable publishers is favored in search results.

Optimizing author and publisher entities

As interpreted by various sources, the Google document leak indicates that author and publisher entities play significant roles in search rankings. However, it does not clearly show whether one is inherently more important than the other. Instead, it highlights the complementary nature of these entities in establishing content credibility and authority.

Recently, Google’s Gary Illyes shed light on specific signals that are not considered beneficial for SEO. This emphasizes the importance of genuine user engagement and content quality rather than relying on easily manipulated elements. The following are signals Google deems less effective in contributing to your site’s search performance.

  1. Authorship markup: Google’s Gary Illyes mentioned that authorship markup, which is controlled by SEOs and site owners, is generally not considered a good signal for ranking purposes.
  2. Controlled markup: Any markup that can be easily manipulated by site owners or SEOs is not typically viewed as a reliable signal by Google.
  3. Quality signals: Google prefers signals that are harder to manipulate and more reflective of genuine user engagement and content quality.

While Google may describe certain signals as “not good signals,” it’s important to note that they are still considered signals. This situation is reminiscent of the famous exchange in “Pirates of the Caribbean”:

  • “You’re the worst pirate I’ve ever heard of.”
  • “But you have heard of me.”

In other words, even if these signals aren’t the best, they still have some recognition in SEO.

Practical implications

Google’s comprehensive approach to assessing online content trustworthiness involves many signals and metrics. Publishers can enhance their trustworthiness by focusing on content freshness, originality, structured data, and robust anti-spam practices. Their history further aids in evaluating long-term credibility, encouraging them to maintain high-quality standards consistently.

The Google document leak highlights the importance of both author and publisher entities in SEO. A balanced approach that optimizes both can significantly enhance content credibility and authority. By focusing on detailed and accurate author and publisher profiles, leveraging structured data, and employing tools like Yoast SEO, SEOs can build a strong foundation for improving search engine rankings and driving organic traffic.

Establishing credible author profiles

Building up your author profile is essential today. But you shouldn’t just limit yourself to building your profile; you should also make sure to present it properly on your publisher’s website. That means you have to build great author pages as well.

But how do you create detailed bio pages? Authors should have a dedicated bio page with qualifications, expertise, and a professional headshot. This page should be linked to all articles written by the author. For example: If Jane Doe writes for your publication, create a page like yourwebsite.com/authors/jane-doe that includes her bio, credentials, and links to all her articles.

On that author page, you should also include social proof. Incorporate links to the author’s social media profiles, professional networks like LinkedIn, and any notable publications they have contributed to. For example, on Jane Doe’s bio page, link to her LinkedIn profile and any major publications where her work has appeared.

Your author pages should have a solid foundation built on structured data, so implement schema markup. Use structured data to tag author information on each article. This helps search engines recognize and index author details accurately.

For example, add JSON-LD markup to each article page, including the author’s name, bio, and profile URL.
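
A minimal sketch of such article-level author markup, with placeholder names and URLs (real pages would include more properties):

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example Article Headline",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "description": "Senior Editor covering SEO.",
    "url": "https://yourwebsite.com/authors/jane-doe"
  }
}
```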

Use the Yoast SEO plugin’s schema framework to add author markup seamlessly. Yoast’s adaptable schema structure ensures all necessary author and publisher information is included and properly formatted.

Enhancing publisher credibility

What works for authors also works for publishers — these things go together. Don’t focus on just your authors; make sure you also put your publication in the spotlight.

Start by making it easy to find information about your publisher. Like author bio pages, create a dedicated publisher page detailing the organization’s mission, history, and achievements. Include logos, awards, and other forms of social proof. For example, create a page like yourwebsite.com/about-us/ that includes your publication’s background, mission statement, and accolades.

It’s important to prove who you are and what you stand for. List editorial policies, team members, and contact information to ensure transparency. For example, on the “About Us” page, include a section detailing your editorial guidelines and a list of key editorial staff with their bios.

Then, like enhancing authors, roll out structured data for your publishing house. Implementing Organization schema markup provides search engines with detailed information about your organization, including its name, logo, contact details, and social media profiles.

Here’s a very basic example: Add JSON-LD markup to your publisher page including your organization’s name, logo, and contact information.

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Your Organization",
  "logo": "https://yourwebsite.com/logo.png",
  "url": "https://yourwebsite.com",
  "contactPoint": {
    "@type": "ContactPoint",
    "telephone": "+1-800-555-5555",
    "contactType": "Customer Service"
  }
}

Again, be sure to use Yoast SEO for your structured data needs. Its schema framework allows you to efficiently add and manage organization markup, ensuring consistency and accuracy.

Combining author and publisher strategies

As mentioned, it’s not one or the other strategy — combine your efforts to make the most of it. Only then can Google truly understand your authors and publications.

This means that you must unify your branding across all content. Ensure that all content consistently reflects the brand’s voice and values. This includes using uniform author bios and publisher information across different platforms. For example, ensure that every article Jane Doe writes includes a standardized author bio snippet linking to her full bio page.

In addition, you should always attribute content to verified authors and the publisher, reinforcing credibility. For example, at the end of each article, include a byline such as “Written by Jane Doe, Senior Editor at Your Organization.”

Focus on content quality, relevance, and topical expertise

You can highlight your publications and authors all you want, but you will never make it without your topical experts writing high-quality, relevant content. This should be at the top of everyone’s list.

Focus on producing high-quality, original content that adds value to readers. This enhances the reputation of both the author and the publisher. Conduct thorough research and provide in-depth analysis in your articles to establish expertise and authority.

Encourage authors to write within their areas of expertise to build authority in specific niches. For example, if Jane Doe specializes in SEO, make sure she writes predominantly on SEO-related topics.

Actionable SEO strategies

You can also use classic SEO tactics to build your authors’ and publisher’s reputations. For instance, you could encourage your authors to contribute to reputable external sites to get a link to their bio pages. This builds both author and publisher authority.

Also, try to build up your citations. Find ways and outlets to get your content cited or mentioned by authoritative sources. You could contact industry influencers to review and mention your content in their articles or social media posts.

Keep everything up to date

Regularly update bio and publisher pages with new achievements, publications, and credentials. For example, you could enhance Jane Doe’s bio page with her latest speaking engagements, citations, and published articles. Also, periodically update older content to keep it relevant and accurate, maintaining the credibility of both authors and the publisher.

Entity SEO and its importance for publishers

Entity SEO focuses on optimizing for entities—people, places, organizations, and things—rather than just keywords. Google’s algorithms leverage the Knowledge Graph to understand and rank entities based on their relationships and attributes. Publishers should also focus on entity SEO.

One of the foundations of entity SEO is helping Google recognize your entities. One way to do that is to implement structured data. This helps Google recognize and categorize entities accurately. This includes using schema markup for authors, publishers, and organizations. You can use schema markup to define relationships between authors, their articles, and the publisher.
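
One way to express those relationships is to have each article node point at shared Person and Organization nodes by `@id`, so every article reinforces the same author and publisher entities. This is a hedged sketch with placeholder `@id` URLs:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example Article Headline",
  "author": { "@id": "https://yourwebsite.com/authors/jane-doe#person" },
  "publisher": { "@id": "https://yourwebsite.com/#organization" }
}
```

Yoast SEO’s schema framework builds this kind of interconnected graph of `@id` references automatically.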

Together with structured data, linking your entities is a staple of Entity SEO. Make sure that internal links connect related entities within your content. For example, link an author’s bio page to their articles and the publisher page.

Be consistent with your entities: maintain the same information about them across various platforms and websites. Inconsistencies can confuse search engines and harm rankings.

Last but not least, try to improve your chances of being included in Google’s Knowledge Graph. Make sure that you provide comprehensive and accurate information. For example, submit your organization and key authors to Wikidata and ensure their information is accurate and up-to-date.

Leveraging structured data and Yoast SEO

Structured data is the backbone of effective SEO for author and publisher entities. It enables search engines to understand and index content more accurately, making attributing credibility to the proper sources easier. The Yoast SEO plugin offers a robust schema framework that simplifies the implementation of structured data.

Yoast SEO provides a comprehensive and adaptable schema framework that supports various schema types, including author and organization markup. This ensures all necessary information is included and formatted correctly, enhancing visibility in search results.

Use Yoast SEO to add structured data to all relevant pages, including author bio pages, publisher information, and individual articles. The plugin’s user-friendly interface makes it easy to manage and update schema markup as needed.

Conclusion

The recent Google document leak has highlighted the critical role of author and publisher entities in SEO. SEOs can significantly enhance a website’s authority and trustworthiness by adopting a structured approach to optimizing these entities.

Implementing detailed author and publisher pages, leveraging structured data, and utilizing tools like Yoast SEO can create a solid foundation for improved search engine rankings.

Integrating these insights into current SEO practices will help build a credible and authoritative online presence, ultimately driving more organic traffic and engagement.

Read more: What is E-E-A-T (Experience, Expertise, Authoritativeness, and Trustworthiness)? »

Coming up next!

OpenAI Launches SearchGPT: AI-Powered Search Prototype via @sejournal, @MattGSouthern

OpenAI has announced the launch of SearchGPT, a prototype AI-powered search engine.

This move marks the company’s entry into the competitive search market, potentially challenging established players.

Key Features & Functionality

SearchGPT aims to directly answer user queries by combining AI language models with real-time web information.

Rather than offering a list of links, SearchGPT attempts to deliver concise responses with citations to source material.

Here’s an example of a search results page for the query: “music festivals in boone north carolina in august.”

Screenshot from openai.com/index/searchgpt-prototype/, July 2024.

The SearchGPT prototype includes:

  • A conversational interface allowing follow-up questions
  • Real-time information retrieval from web sources
  • In-line attributions and links to original content

Publisher Controls & Content Management

OpenAI is also introducing tools for publishers to manage how their content appears in SearchGPT, giving them more control over their presence in AI-powered search results.

Key points about the publisher controls include:

  1. Separate from AI training: OpenAI emphasizes that SearchGPT is distinct from the training of their generative AI models. Sites can appear in search results even if they opt out of AI training data.
  2. Content management options: Publishers can influence how their content is displayed and used within SearchGPT.
  3. Feedback mechanism: OpenAI has provided an email (publishers-feedback@openai.com) for publishers to share their thoughts and concerns.
  4. Performance insights: The company plans to share information with publishers about their content’s performance within the AI search ecosystem.
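
As an illustration of that separation, a publisher could block model-training crawling while staying visible to AI search. The snippet below is a hypothetical robots.txt; the user-agent tokens reflect OpenAI's published crawler names (GPTBot for model training, OAI-SearchBot for SearchGPT) but should be verified against OpenAI's current documentation, as they may change.

```
# Hypothetical robots.txt: opt out of AI training, stay in AI search.
# User-agent tokens are assumptions based on OpenAI's published crawler names.

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
```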

These tools are OpenAI’s response to ongoing debates about AI’s use of web content and concerns over intellectual property rights.

Publisher Partnerships & Reactions

OpenAI reports collaborating with several publishers during the development of SearchGPT.

Nicholas Thompson, CEO of The Atlantic, provided a statement supporting the initiative, emphasizing the importance of valuing and protecting journalism in AI search development.

Robert Thomson, News Corp’s chief executive, also commented on the project, stressing the need for a symbiotic relationship between technology and content and the importance of protecting content provenance.

Limited Availability & Future Plans

Currently, SearchGPT is available to a restricted group of users and publishers.

OpenAI describes it as a temporary prototype, indicating plans to integrate features into their existing ChatGPT product eventually.

Why This Matters

The introduction of SearchGPT represents a potential shakeup to the search engine market.

This development could have far-reaching implications for digital marketing, content creation, and user behavior on the internet.

Potential effects include:

  • Changes in content distribution and discovery mechanisms
  • New considerations for search engine optimization strategies
  • Evolving relationships between AI companies and content creators

Remember, this is still a prototype, and we have yet to see its capabilities.

There’s a waitlist available for those trying to get their hands on it early.

What This Means For You

AI-powered search might offer users more direct access to information. However, the accuracy and comprehensiveness of results may depend on publisher participation and content management choices.

For content creators and publishers, these new tools provide opportunities to have more say in how their work is used in AI search contexts.

While it may increase content visibility and engagement, it also requires adapting to new formats and strategies to ensure content is AI-friendly and easily discoverable.

As SearchGPT moves from prototype to integration with ChatGPT, it will be vital to stay informed about these developments and adapt your strategies.

The future of search is evolving, and AI is at the forefront of this transformation.

Moving From Niche Sites & Affiliate To Agency Owner via @sejournal, @rollerblader

This week’s Ask An SEO question comes from Mike in New York, who asks:

“I have been creating affiliate blogs and niche websites throughout my 10 year career. Had some great successes and setbacks to share. But now I feel tired of it with Google also specifically targeting more of these types of websites in their updates.

What other fields are there to explore in SEO, what else is working right now? How is client side SEO? How can a person like me who has worked on content websites throughout their career can make a switch? What does future hold for SEO?”

Great question, and one that actually led to me starting my agency. Here’s the background and then the answer to your question.

I had niche sites in music, weddings, clothing, architecture, etc.

They were all growing and doing well, but I got bored of the same topics in the same niches, and it became incredibly disheartening when unqualified affiliate managers took over programs, destroying parts of my income, or companies closed their programs because of low-value affiliates being approved in.

That’s when a few interesting things happened:

  • A few of the advertisers on my sites (affiliates and sponsors) asked how I was driving traffic, so I shared with them how I optimized for search and then social later on.
  • Next, I walked them through the email funnels and automations I created. (The email automation backfired on me big time when three of my sites got hacked through a plugin, and 20 spam newsletter blasts went out simultaneously as the hackers posted spam blog posts to my sites.) Never connect your newsletter to automatically send when a new blog post goes live.
  • I met with two of the companies in person.

Both companies asked me if I could do the optimizations I do for myself for them. It was interesting because I was getting bored and missed the structure of a “real job,” which sounds weird because I’m not a 9 to 5 person.

After a few more conversations, it made sense. So I started letting my niche sites die and used them for training purposes as I brought on contractors and staff.

Now, probably 10 or 15 years later, my agency is still going, and each of the domains has expired (I chose not to sell them even though I got an offer when they were bigger).

Now, onto the answer: there are four things to be aware of and prepare for.

The first is how agency work differs so you can mentally prepare.

Next is lining up why you’re more qualified than an experienced agency.

Third, you’ll want to outline the types of clients you can work on (can and want to are different) and projects you’re able to do.

Last is being ready for more instability and inconsistency than being a niche site owner.

Being Mentally Ready For SEO Consulting

When you leave building and monetizing your own websites for consulting, you leave as your own boss. You know what works and does not, and what needs to be included in content, for example.

But it is not your choice anymore, and that becomes frustrating.

You know that Google’s reviews update recommends listing multiple shopping options to provide a better user experience.

Still, the ad sales team or affiliate manager sold a sponsored post. Their agreement prevents you from adding the extra store. Then, they want to know why the review or list isn’t ranking.

Another obstacle is when they insist the content cannot have “real experience” or list certain or any “negatives” because the content is a sponsored post – and everyone needs to make the advertiser happy.

Google and social media guidelines do not matter in the ad and publisher affiliate manager world; it’s about the advertiser and getting more money from them. They are not SEO pros or social media specialists; they are sales and account managers.

Their job isn’t to know that their way of thinking and selling causes traffic loss. They just need to close sales and negotiate higher commissions.

With ecommerce, you have to meet brand guidelines, which can include not being able to use direct and specific language.

Sometimes you cannot follow pixel lengths (character counts) for title tags because it goes “against brand,” and that same team will ask why the titles don’t show up in the search results.

Other times, you have to “stick to branding” instead of meeting customer intent. This will drive you crazy because the opportunities are right there, but branding almost always wins, even when it costs the company money. This is a cycle that runs on repeat.

As a niche site owner, you focus on UX and revenue, but branding usually takes center stage with a company.

The branding team is not a performance marketing team. It is guided by the general counsel and focused on what is required for the trademark and the appearance of the brand.

As a consultant, you will benefit from learning about branding and finding a way to balance the two. Corporate branding is different from niche site branding.

The general counsel also comes into play. When you run a niche site, you can take original photos, give real feedback, and share genuine opinions.

The general counsel of your client may decide that this is not allowed on an ecommerce store, service company, or within the publication.

A story could be about to break, and you have an opportunity to get massive backlinks and Google Discover traffic, or to beat the currently ranking articles because none of them have original thoughts or experiences.

But the general counsel holds it up for review, and the opportunity passes you by. Or the counsel decides with branding that listing negatives is bad for the company or company image, even though it can build consumer confidence through transparency.

The lack of trust builders can negatively impact both SEO and conversion rates. It isn’t their job to know how to rank a website or convert a consumer – that is your job. You also need to work within the atmosphere they provide, as you are not part of the company.

You are no longer the boss, even though you own your agency. It is your client’s website and platform. You are an easy-to-fire version of an employee who relies on the decision-maker trusting your advice.

That means you cannot push back as heavily as an actual employee; you have to make more sacrifices while keeping things afloat.

That does not mean being dishonest or not sharing the downsides, but you do walk a fine line in every meeting.

It is frustrating not to be the boss or able to make the decisions, so prepare yourself mentally for the above and more. With that said, you will learn how to better select clients and who you work with as you establish yourself. My current client base is like living a dream.

It took me over a decade to learn how to detect red flags and when it is time to move on, but now that I have, I love working with all of them.

It is the same enjoyment I got from running my own sites, but I get the structure and deadlines I missed from the corporate world. Many of them have become friends of mine outside of work. Even when I part ways with them, many keep in touch, as we did become actual friends.

List Your Skills

Yes, literally create a list of what you are highly qualified to do and print it out. This comes in handy when pitching clients and when imposter syndrome kicks in. And it will!

Under each skill, list a few successes so you can mention them when talking to clients. Make sure you swap them out as new ones happen.

Something you did five years ago is no longer relevant as an agency. You need new and consistent wins to stay in business.

Skills can include:

  • Email and SMS acquisition.
  • Monetization (CPC, CPM, CPL, CPV, download, sponsorships, affiliate, subscriptions, info products, etc…).
  • SEO.
  • Niche knowledge, including how deep your expertise goes in each part of the niche.
  • Connections with other niche influencers and experts.
  • Audience building (social media, readership, community forming, events planning, etc…).
  • CRO or conversion rate optimization.
  • Code and markup (HTML/CSS, JavaScript, PHP, schema, Python, etc…).
  • Datafeed optimization.
  • Syndication.
  • PR, interviews, and media training.

Once your list is built, rate each skill on a scale of one to 10, with 10 being the highest. Think about which skills help other skills or teams so you can become a go-to resource for clients.

Now invest in yourself to get each marketable skill to at least a seven. I did this through necessity, like learning CSS (which I’m still not very good at) because I had to.

I also went to conferences where I could learn and discovered that many “gurus” and “keynotes” are not actual experts.

A few shows that changed my trajectory learning-wise are Pubcon, State of Search in Dallas, Zenith Duluth, and Barbados SEO. These, in particular, really changed how I see things. But I don’t use them to build business.

As an agency owner with niche experience, if you want to build clients, you go where the hiring managers are. If you are in the electronics space, go to electronics shows and inventors shows and pitch to speak.

That is where the marketers and founders are listening and looking for help. Do you work in housewares, food and recipes, etc.? The home shows are your perfect market to build clients.

The Types Of Clients You Can Work On And Projects You Can Do

Ecommerce SEO is very different from niche sites – the same with publishers and service-based companies. Then you get into non-profits, which is an entirely different ballgame.

Decide whether you want to work with competing companies or only with companies that complement each other.

Now, determine how much time and effort each will take, and price yourself accordingly.

Don’t be afraid to map out hours on your calendar. This helps me keep track of what I need to be doing, in addition to my to-do list (which I literally write out and check off each week). Client types and project formats to consider include:

  • News sites (wholesale, trade, organization, media).
  • Niche sites (publishers, bloggers, podcasters, YouTubers, influencers, etc…).
  • Ecommerce.
  • Service providers.
  • Lead gen.
  • SEO audits.
  • Retainer projects.
  • Advisor roles.
  • Hourly.
  • Workshops.
  • Public speaking.
  • Event hosting.

Prepare For Instability

If you think Google updates or having a social media account closed is bad, wait until all your clients leave in the same week. Things spiral faster than a niche site tanking.

It happens to almost everyone.

Clients are going to get pitched all the time, and you will lose some, even though you did nothing wrong.

Or you grow a company and brand, they make money and hire a new VP or Director, and that person brings in a new agency because they worked with them in the past.

You’ll also hear that the new agency can scale and work with larger companies. This may not be true, but it is what the founders hear from other founders when they attend networking events.

You lose when their peers put ideas in their heads.

Always put a little bit of money away, even $100 a month. It adds up and lets you keep going when things get bad.

This post is getting long; if you can’t tell, I’m passionate about it. So I’ll stop here. Give it a try if you have the ability and can take the risk both financially and mentally.

The worst that will happen is you fail and go back to building niche sites, take a full-time job, or do a hybrid. At least you won’t ever have to wonder what could have been.


 Featured Image: ESB Basic/Shutterstock

Agile SEO: Moving From Strategy To Action via @sejournal, @jes_scholz

Impactful SEO is rarely executed by a lone wolf.

You need resources. You need buy-in from higher-ups – a CMO, head of product, or even CEO.

But here’s the thing: those lengthy SEO documents outlining objectives, audiences, competitors, keywords, and that six-month Gantt chart vaguely detailing optimization projects – they’re not getting read.

Worse, they are a roadblock to getting a green light for resources.

An executive can quickly scan one short email containing a clear request and sign off. However, they need to set aside dedicated time to read a strategy document in depth – and time is not something executives have a lot of.

And even if they sign off today, the reality is business priorities shift. Competitive landscapes change. Algorithms are updated.

SEO is executed in a constant state of flux. It demands flexibility on a monthly, even weekly basis.

So, let’s ditch the long documents and prioritize actions over proposals with agile SEO.

Why Agile SEO Strategies Work

Agile SEO involves incremental iteration.

Break complex, overarching projects down into small, frequent changes.

Enable continual progress.

Google quote. Image from author.

Forget the pursuit of SEO perfection.

The key is to launch a minimum viable product (MVP) and monitor the impact on metrics.

Once you are armed with performance data, you can move on. The key performance indicator (KPI) impact will get you buy-in for the resources you need.

Let me give you an example.

Say your overarching goal is to completely overhaul the website architecture of an e-commerce site – all the URL routes, page titles, meta descriptions, and H1s for the homepage, category pages, and product pages.

The Old Way: One Giant Leap

The traditional approach involves pitching the entire SEO project at once. Your argument is that it’s good for SEO.

The site will rank higher and significantly impact the organic sessions. Which is true.

However, the document communicating all the reasons and requirements is complicated to review.

The project will seem too large. It will likely not make it onto your development team’s roadmap, as they will likely feel your request will overload their development cycle.

Overloaded donkey. Image from author.

Agile SEO Approach: Small Iterations

What if you broke it down into micro-wins?

Instead of pitching the entire project, request approval for a small but impactful change.

For example, optimizing the title tag and meta description of the homepage.

The documentation for this will be less than one page. The change request is equivalent to snackable content. Because it’s easy to implement, it’s much easier to incorporate it into a development sprint.

Now, say this quick change positively impacts KPIs, such as a 3% lift in homepage organic sessions. You can then argue for similar changes for the category pages, pointing out that if we get a similar KPI lift as we did for the homepage, this will achieve X more organic sessions.
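
The arithmetic behind that pitch can be sketched in a few lines. All figures below are made-up assumptions for illustration; only the 3% lift comes from the example above.

```python
# Back-of-the-envelope projection for the category-page pitch.
# Traffic numbers are illustrative assumptions, not real data.
homepage_sessions = 50_000   # monthly organic sessions before the change
observed_lift = 0.03         # 3% lift measured on the homepage

category_sessions = 200_000  # monthly organic sessions across category pages
projected_extra = category_sessions * observed_lift

print(f"Projected extra category sessions/month: {projected_extra:,.0f}")
# With these assumptions: 6,000 additional organic sessions per month.
```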

You have already proven such tactics can increase KPIs. So, there is more trust in your approach. And it’s, again, a small request. So, your development team is more likely to do it.

And you can rinse and repeat until you have the whole site migrated.

How To Document An Agile SEO Strategy

So now we know to stop writing long SEO strategy documents and instead start creating agile, “snackable” tactics.

But we still need to understand what:

  • Has been completed in the past.
  • Is being worked on now.
  • Is coming up next.
  • All the ideas are.

This information must be easy to digest, centrally accessible, and flexible.

One solution for this is an “SEO calendar” document.


Elements of an SEO calendar:

  • Date column: Ideally matched against IT sprint cycles. This does not mean every SEO initiative involves IT. But if you need a developer’s assistance, it will simplify cross-functional team projects. Having it set, for example, every two weeks also promotes small but constant releases from the SEO team.
  • Backlog: This provides space for team members to record ideas without having to make any significant commitment of time. Assess all ideas regularly as you fill your next available calendar slot.
  • Change column: A clear and concise sentence on what has been or will be changed.
  • Tactic brief: A link to the detailed information of that test. More details coming below.
  • Sign off: Ensuring all SEO changes pass a four-eye principle from a strategic point of view lowers the risk of any errors. These quick-to-read, snackable briefs make it easy to get your managers to buy in and sign off for resources.
  • Outcome: One short sentence summing up the KPI impact.
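
The columns above can be sketched as a simple data structure. This is a minimal illustration, not a prescribed format; the field names and example values are assumptions.

```python
from dataclasses import dataclass

# One row of the SEO calendar described above. Field names mirror the
# columns; values are illustrative placeholders.
@dataclass
class CalendarEntry:
    sprint_date: str          # matched to the IT sprint cycle
    change: str               # one concise sentence on what changes
    tactic_brief_url: str     # link to the detailed tactic brief
    signed_off_by: str = ""   # four-eyes sign-off before launch
    outcome: str = ""         # one sentence on KPI impact, post-launch

# Ideas wait in the backlog until the next free calendar slot.
backlog: list[str] = ["Add FAQ schema to service pages"]

entry = CalendarEntry(
    sprint_date="2024-08-05",
    change="Rewrite homepage title tag and meta description",
    tactic_brief_url="https://docs.example.com/briefs/homepage-title",
)
```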

The benefit of a calendar layout is it is fully flexible but time-relevant. Changing priorities is as simple as moving the de-prioritized item to the backlog.

It can act as a website change log for SEO. Everyone can know the timetable of changes, both past and planned upcoming.

Those interested in why the KPIs increased on a certain date have the answer at a glance and more detailed information in one click. This can be invaluable for troubleshooting.

And, for team leaders, if any gaps appear in the iteration cycle, you can see this as gaps will appear in the calendar, allowing you to address the root cause.

Snackable Tactic Briefs

The benefits of tactics briefs are twofold:

  • Pre-launch: They concisely answer the Five Ws of your SEO change to get buy-in from stakeholders. Once aligned, it will act as the specification if you need someone else to execute it.
  • Post-launch: Be the record of what was actually changed. What impact did it have on the KPI funnel? What did we learn? And what are the next steps, if any?

Tactics briefs have five sections:

  • Overview.
  • SMART Goal.
  • Specifications.
  • Results.
  • Learnings & Action Items.

Overview

The overview section should cover the basics of the test:

  • Who is the one person ultimately responsible for leading the execution of the test?
  • When will it (pre-launch)/did it (post-launch) go live?
  • When will we (pre-launch)/did we (post-launch) assess results?
  • Who proposed the change? (It may be important to know if you need more information on the background for the test or if an action has come from senior management.)
  • Who has agreed to this execution? (This may be development, the line manager in marketing, or another key stakeholder. Allowing everyone to see who is on board.)

Overview table. Screenshot from author.

SMART Goal

The SMART goal is the high-level tactical approach.

Align your goal with your stakeholders before a detailed documentation effort goes into a task. This also ensures the change is in line with business goals.


Specifications

This section will vary based on your test. But always try to communicate the “before” and the “after.” This way, you have a clear historical record you can refer back to.

The key is to have only the details needed. Nothing more, nothing less.

You can use tables to keep it easy to scan.

For example, in the case of a title tag change, it could be as simple as a single table.

Title tag formula for category pages. Screenshot from author.

The key is to avoid long paragraphs of text. Focus on clearly communicating the outcome: what was it before, and what will it be after?

Don’t explain how the task was executed.

Results

This section should contain one table to effectively communicate the percentage change between the benchmark weeks and the SEO change from a full-funnel perspective, as well as any additional tables to drill down for more insights.

An example of a table could be similar to the one below.

Category page organic KPI results table. Screenshot from author.
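
The percentage-change column in such a table can be derived as below. This is a minimal sketch: the KPI names and all numbers are made up for illustration.

```python
# Compare each funnel KPI for the weeks after the change against the
# benchmark weeks before it. Numbers are illustrative assumptions.
benchmark = {"impressions": 120_000, "clicks": 4_800, "sessions": 4_200}
post_change = {"impressions": 126_000, "clicks": 5_280, "sessions": 4_620}

pct_change = {
    kpi: round((post_change[kpi] - benchmark[kpi]) / benchmark[kpi] * 100, 1)
    for kpi in benchmark
}
print(pct_change)  # percentage change per KPI
```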

Learnings & Action Items

Here is where you can succinctly analyze the results.

Remember, you have the data clearly available in the table above, so you don’t need to list the numbers again.

Explain what the numbers mean and what actions will be taken next.

Final Thoughts

An agile SEO system provides flexibility and visibility.

At any time, you can understand what actions are underway and what has shifted KPIs.

Forget the fantasy of the perfect SEO strategy, and focus your energy on getting sh!t done.


Featured Image: Andrey_Popov/Shutterstock

Bing’s Updated AI Search Will Make Site Owners Happy via @sejournal, @martinibuster

Bing is rolling out a new version of Generative Search that displays information in an intuitive way that encourages exploration but also prioritizes clicks from the search results to websites.

Microsoft introduced their new version of AI search:

“After introducing LLM-powered chat answers on Bing in February of last year, we’ve been hard at work on the ongoing revolution of search. …Today, we’re excited to share an early view of our new generative search experience which is currently shipping to a small percentage of user queries.”

New Layout

Bing’s announcement discusses new features that not only make it easy for users to find information but also make it easy for them to see the organic search results, click through, and browse websites.

On the desktop view, Bing shows three panels:

  • A table of contents on the left
  • AI answers in the center (with links to website sources)
  • Traditional organic search results on the right hand side
  • Even more organic search results beneath “the fold”

The table of contents on the left-hand side invites exploration. It has the main topic at the top, with directly related subtopics beneath it. This is so much better than a People Also Asked type of navigation because it invites the user to explore and to click on an organic search result to continue exploring.

Screenshot: Table Of Contents

This layout is the result of a conscious decision at Bing to engineer it so that it preserves and encourages clicks to websites.

Below is a screenshot of the new generative AI search experience. What’s notable is how Bing surrounds the AI answers with organic search results.

Screenshot Of The New Bing AI Search Results

Bing makes a point of explaining that it tested the new interface to make sure the search results send the same amount of traffic to websites and to avoid a layout that increases zero-click searches.

When other search engines talk about search quality it is always from the context of user satisfaction. Bing’s announcement makes it clear that sustaining traffic to websites was an important context that guided the design of the new layout.

Below is a screenshot of a typical Bing AI search result for a query about the life span of elephants.

Note that all the areas that I bounded with blue boxes are AI answers while everything outside of the blue boxes are organic search results.

Screenshot Of Mix of AI And Organic Results

Bing's new AI search layout emphasizes organic search results

The screenshot makes it clear that there is a balance of organic search results and AI answers. In addition to those contextually relevant organic search results, there are also search results on the right-hand side (not shown in the above screenshot).

Microsoft’s blog post explained:

“We are continuing to look closely at how generative search impacts traffic to publishers. Early data indicates that this experience maintains the number of clicks to websites and supports a healthy web ecosystem. The generative search experience is designed with this in mind, including retaining traditional search results and increasing the number of clickable links, like the references in the results.”

Bing’s layout is a huge departure from the zero-click style of layouts seen in other search engines. Bing has purposely designed their generative AI layout to maintain clicks to websites. It cannot be overstated how ethical Bing’s approach to the web ecosystem is.

Bing Encourages Browsing And Discovery

An interesting feature of Bing’s implementation of generative AI search is that it shows the answer to the initial question first, and it also anticipates related questions. This is similar to a technique called “information gain” where an AI search assistant will rank an initial set of pages that answers a search query, but will also rank a second, third and fourth set of search results that contain additional information that a user may be interested in, information on related topics.

What Bing does differently from the Information Gain technique is that Bing displays all the different search results on a single page and then uses a table of contents on the left hand side that makes it easy for a user to click and go straight to the additional AI answers and organic search results.

Bing’s Updated AI Search Is Rolling Out Now

Bing’s newly updated AI search layout is slowly rolling out, and the company is observing feedback from users. Microsoft has already tested it and is confident that it will continue to send clicks to websites. Search engines have a relationship with websites, commonly referred to as the web ecosystem. Every strong relationship is based on giving, not taking; when both sides give, both sides receive.

More search engines should take Bing’s approach of engineering their search results to satisfy users in a way that encourages discovery on the websites that originate the content.

Read Bing’s announcement:

Introducing Bing generative search

Featured Image by Shutterstock/Primakov

AI trained on AI garbage spits out AI garbage

AI models work by training on huge swaths of data from the internet. But as AI is increasingly being used to pump out web pages filled with junk content, that process is in danger of being undermined.

New research published in Nature shows that the quality of a model’s output gradually degrades when AI trains on AI-generated data. As subsequent models produce output that is then used as training data for future models, the effect gets worse.

Ilia Shumailov, a computer scientist from the University of Oxford, who led the study, likens the process to taking photos of photos. “If you take a picture and you scan it, and then you print it, and you repeat this process over time, basically the noise overwhelms the whole process,” he says. “You’re left with a dark square.” The equivalent of the dark square for AI is called “model collapse,” he says, meaning the model just produces incoherent garbage. 

This research may have serious implications for the largest AI models of today, because they use the internet as their database. GPT-3, for example, was trained in part on data from Common Crawl, an online repository of over 3 billion web pages. And the problem is likely to get worse as an increasing number of AI-generated junk websites start cluttering up the internet. 

Current AI models aren’t just going to collapse, says Shumailov, but there may still be substantive effects: The improvements will slow down, and performance might suffer. 

To determine the potential effect on performance, Shumailov and his colleagues fine-tuned a large language model (LLM) on a set of data from Wikipedia, then fine-tuned the new model on its own output over nine generations. The team measured how nonsensical the output was using a “perplexity score,” which measures an AI model’s confidence in its ability to predict the next part of a sequence; a higher score translates to a less accurate model. 
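
A perplexity score of this kind can be illustrated in a few lines. This is a simplified sketch, not the paper's exact methodology: perplexity here is computed as the exponential of the average negative log-probability a model assigns to each token, over made-up probabilities.

```python
import math

# Toy perplexity calculation: exp of the average negative log-probability
# the model assigns to each token. Probabilities below are made up.
def perplexity(token_probs):
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

confident = [0.9, 0.8, 0.85, 0.9]   # model predicts tokens well
uncertain = [0.2, 0.1, 0.15, 0.2]   # model is frequently surprised

print(perplexity(confident))  # low perplexity: more accurate model
print(perplexity(uncertain))  # high perplexity: less accurate model
```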

The models trained on other models’ outputs had higher perplexity scores. For example, for each generation, the team asked the model for the next sentence after the following input:

“some started before 1360—was typically accomplished by a master mason and a small team of itinerant masons, supplemented by local parish labourers, according to Poyntz Wright. But other authors reject this model, suggesting instead that leading architects designed the parish church towers based on early examples of Perpendicular.”

On the ninth and final generation, the model returned the following:

“architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”

Shumailov explains what he thinks is going on using this analogy: Imagine you’re trying to find the least likely name of a student in school. You could go through every student name, but it would take too long. Instead, you look at 100 of the 1,000 student names. You get a pretty good estimate, but it’s probably not the correct answer. Now imagine that another person comes and makes an estimate based on your 100 names, but only selects 50. This second person’s estimate is going to be even further off.

“You can certainly imagine that the same happens with machine learning models,” he says. “So if the first model has seen half of the internet, then perhaps the second model is not going to ask for half of the internet, but actually scrape the latest 100,000 tweets, and fit the model on top of it.”
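The mechanism behind the analogy can be simulated in miniature: repeatedly resample a dataset from its own samples, and rare items drop out and never return. This is a toy sketch, not the study's actual experiment; the vocabulary and sizes are invented.

```python
import random

random.seed(0)

# Hypothetical toy corpus: one very common token plus a long tail of rare ones.
data = ["common"] * 900 + [f"rare_{i}" for i in range(100)]

diversity = [len(set(data))]
for generation in range(9):
    # Each new "model" sees only a finite sample of the previous
    # generation's output; a tail token absent from the sample is gone
    # for good, so diversity can only shrink.
    data = random.choices(data, k=len(data))
    diversity.append(len(set(data)))

print(diversity)  # distinct tokens remaining after each generation
```

Because each generation draws only from the previous one, the count of distinct tokens is guaranteed never to increase, and in practice it falls steadily, which is the statistical core of model collapse.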

Additionally, the internet doesn’t hold an unlimited amount of data. To feed their appetite for more, future AI models may need to train on synthetic data—or data that has been produced by AI.   

“Foundation models really rely on the scale of data to perform well,” says Shayne Longpre, who studies how LLMs are trained at the MIT Media Lab, and who didn’t take part in this research. “And they’re looking to synthetic data under curated, controlled environments to be the solution to that. Because if they keep crawling more data on the web, there are going to be diminishing returns.”

Matthias Gerstgrasser, an AI researcher at Stanford who authored a different paper examining model collapse, says adding synthetic data to real-world data instead of replacing it doesn’t cause any major issues. But he adds: “One conclusion all the model collapse literature agrees on is that high-quality and diverse training data is important.”

Another effect of this degradation over time is that information pertaining to minority groups is heavily distorted in the model, because it tends to overfocus on samples that are more prevalent in the training data.

In current models, this may hit underrepresented languages hardest, because they depend more heavily on synthetic (AI-generated) data sets, says Robert Mahari, who studies computational law at the MIT Media Lab (he did not take part in the research).

One idea that might help avoid degradation is to make sure the model gives more weight to the original human-generated data. Another part of Shumailov’s study allowed future generations to sample 10% of the original data set, which mitigated some of the negative effects. 
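One way to picture this mitigation (a toy sketch with invented sizes, not the study's actual training setup): re-inject a 10% slice of the original human-written data at every generation, so tail items that vanish from the synthetic stream can reappear.

```python
import random

random.seed(0)

# Hypothetical human-written corpus: a common token plus a long tail.
original = ["common"] * 900 + [f"rare_{i}" for i in range(100)]

data = list(original)
for generation in range(9):
    n_keep = len(data) // 10                      # 10% human-originated data
    kept = random.sample(original, n_keep)        # fresh draw from the originals
    synthetic = random.choices(data, k=len(data) - n_keep)
    data = kept + synthetic                       # mixed training set for the next generation
```

Unlike pure self-training, the originals act as an anchor: rare tokens lost to sampling noise in one generation can be re-seeded by the next draw from the human data.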

That would require maintaining a traceable trail from the original human-generated data through subsequent generations of models, a practice known as data provenance.

But provenance requires some way to filter the internet into human-generated and AI-generated content, which hasn’t been cracked yet. Though a number of tools now exist that aim to determine whether text is AI-generated, they are often inaccurate.

“Unfortunately, we have more questions than answers,” says Shumailov. “But it’s clear that it’s important to know where your data comes from and how much you can trust it to capture a representative sample of the data you’re dealing with.”