Does Google’s AI Overviews Violate Its Own Spam Policies? via @sejournal, @martinibuster

Search marketers assert that Google’s new long-form AI Overview answers have become the very thing Google’s documentation warns publishers against: scraped content lacking originality or added value, at the expense of content creators who are seeing declining traffic.

Why put the effort into writing great content if it’s going to be rewritten into a complete answer that removes the incentive to click the cited source?

Rewriting Content And Plagiarism

Google previously showed Featured Snippets, which were excerpts from published content that users could click on to read the rest of the article. Google’s AI Overviews (AIO) expands on that by presenting entire articles that answer a user’s questions, sometimes anticipating follow-up questions and providing answers to those, too.

And it’s not an AI providing answers. It’s an AI repurposing published content. That action is called plagiarism when a student does the same thing by repurposing an existing essay without adding unique insight or analysis.

The thing about AI is that it is incapable of unique insight or analysis, so there is zero value-add in Google’s AIO, which in an academic setting would be called plagiarism.

Example Of Rewritten Content

Lily Ray recently published an article on LinkedIn drawing attention to a spam problem in Google’s AIO. Her article explains how SEOs discovered how to inject answers into AIO, taking advantage of the lack of fact checking.

Lily subsequently checked on Google, presumably to see if her article was ranking, and discovered that Google had rewritten her entire article and was providing an answer that was almost as long as her original.

She tweeted:

“It re-wrote everything I wrote in a post that’s basically as long as my original post “

Did Google Rewrite The Entire Article?

One technique that search engines and LLMs may use to analyze content is to determine what questions the content answers. The content can then be annotated according to the answers it provides, making it easier to match a query to a web page.

I used ChatGPT to analyze both Lily’s content and AIO’s answer. The number of questions answered by the two documents was nearly identical: Lily’s article answered 13 questions, while AIO answered 12.

Both articles answered five similar questions:

  • Spam Problem In AI Overviews
    AIO: Is there a spam problem affecting Google AI Overviews?
    Lily Ray: What types of problems have been observed in Google’s AI Overviews?
  • Manipulation And Exploitation of AI Overviews
    AIO: How are spammers manipulating AI Overviews to promote low-quality content?
    Lily Ray: What new forms of SEO spam have emerged in response to AI Overviews?
  • Accuracy And Hallucination Concerns
    AIO: Can AI Overviews generate inaccurate or contradictory information?
    Lily Ray: Does Google currently fact-check or validate the sources used in AI Overviews?
  • Concern About AIO In The SEO Community
    AIO: What concerns do SEO professionals have about the impact of AI Overviews?
    Lily Ray: Why is the ability to manipulate AI Overviews so concerning?
  • Deviation From Principles of E-E-A-T (Experience, Expertise, Authoritativeness, and Trustworthiness)
    AIO: What kind of content is Google prioritizing in response to these issues?
    Lily Ray: How does the quality of information in AI Overviews compare to Google’s traditional emphasis on E-E-A-T and trustworthy content?
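The overlap analysis above can be approximated without an LLM. The following is a minimal sketch, with word-level Jaccard similarity standing in for ChatGPT’s comparison; the question lists are abbreviated examples taken from the pairs above, not the full documents:

```python
import re

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two questions."""
    wa = set(re.findall(r"[a-z']+", a.lower()))
    wb = set(re.findall(r"[a-z']+", b.lower()))
    return len(wa & wb) / len(wa | wb)

def match_questions(doc_a, doc_b, threshold=0.1):
    """Pair each question in doc_a with its closest match in doc_b."""
    pairs = []
    for qa in doc_a:
        best = max(doc_b, key=lambda qb: jaccard(qa, qb))
        score = jaccard(qa, best)
        if score >= threshold:
            pairs.append((qa, best, round(score, 2)))
    return pairs

aio_questions = ["Is there a spam problem affecting Google AI Overviews?"]
lily_questions = ["What types of problems have been observed in Google's AI Overviews?"]
print(match_questions(aio_questions, lily_questions))
```

A real analysis would use semantic embeddings rather than word overlap, but the principle of pairing each question with its nearest counterpart is the same.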

Plagiarizing More Than One Document

Google’s AIO system is designed to answer follow-up and related questions, “synthesizing” answers from more than one original source and that’s the case with this specific answer.

Whereas Lily’s content argues that Google isn’t doing enough, AIO rewrote the content from another document to say that Google is taking action to prevent spam. Google’s AIO differs from Lily’s original by answering five additional questions with answers that are derived from another web page.

This gives the appearance that Google’s AIO answer for this specific query is “synthesizing” or “plagiarizing” from two documents to answer Lily Ray’s search query, “spam in ai overview google.”

Takeaways

  • Google’s AI Overviews repurposes web content to create long-form content that lacks originality or added value.
  • Google’s AIO answers mirror the content they summarize, copying the structure and ideas to answer identical questions inherent in the articles.
  • Google’s AIO arguably deviates from Google’s own quality standards, using rewritten content in a manner that mirrors Google’s own definitions of spam.
  • Google’s AIO features apparent plagiarism of multiple sources.

The quality and trustworthiness of AIO responses may not reach the levels set by Google’s principles of Experience, Expertise, Authoritativeness, and Trustworthiness, because AI lacks experience and there is apparently no mechanism for fact-checking.

The fact that Google’s AIO system provides essay-length answers arguably removes any incentive for users to click through to the original source and may help explain why many in the search and publisher communities are seeing less traffic. The perception of AIO traffic is so bad that one search marketer quipped on X that ranking #1 on Google is the new place to hide a body, because nobody would ever find it there.

Google could be said to plagiarize content because AIO answers are rewrites of published articles that lack unique analysis or added value, placing AIO firmly within most people’s definition of a scraper spammer.

Featured Image by Shutterstock/Luis Molinero

WordPress Scraper Plugin Compromised By Security Vulnerability via @sejournal, @martinibuster

A WordPress plugin that automatically posts content scraped from other websites has been discovered to contain a critical vulnerability that allows anyone to upload malicious files to affected websites. The severity of the vulnerability is rated at 9.8 on a scale of 1-10.

Crawlomatic Multisite Scraper Post Generator Plugin for WordPress

The Crawlomatic WordPress plugin is sold via the Envato CodeCanyon store for $59 per license. It enables users to crawl forums, weather statistics, articles from RSS feeds, and directly scrape the content from other websites and then automatically publish the content on the user’s website.

The plugin’s Envato CodeCanyon web page features a banner that notes that the author of the plugin has been recognized for having met “WordPress quality standards” and displays a badge indicating that it is “Envato WP Requirements Compliant,” an indication that it meets Envato’s “security, quality, performance and coding standards in WordPress plugins and themes.”

The plugin’s directory page explains that it can crawl and scrape virtually any website, including JavaScript-based sites, promising that it can turn a user’s website into a “money making machine.”

Unauthenticated Arbitrary File Upload

The Crawlomatic WordPress plugin is missing a file type validation check in all versions up to and including version 2.6.8.1.

According to a warning posted on Wordfence:

“The Crawlomatic Multipage Scraper Post Generator plugin for WordPress is vulnerable to arbitrary file uploads due to missing file type validation in the crawlomatic_generate_featured_image() function in all versions up to, and including, 2.6.8.1. This makes it possible for unauthenticated attackers to upload arbitrary files on the affected site’s server which may make remote code execution possible.”

Users of the plugin are recommended by Wordfence to update to at least version 2.6.8.2.
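To illustrate the class of check that was missing, here is a hedged sketch in Python (the actual plugin is PHP, and this is not its code): an allowlist-based file type validation of the kind whose absence enables arbitrary file uploads.

```python
from pathlib import Path

# Extensions acceptable for a featured image; anything else is rejected.
ALLOWED_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def is_safe_image_upload(filename: str) -> bool:
    """Allowlist check: every extension in the name must be an image
    extension, which also rejects double extensions like shell.php.jpg."""
    suffixes = Path(filename).suffixes
    return bool(suffixes) and all(
        s.lower() in ALLOWED_IMAGE_EXTENSIONS for s in suffixes
    )

print(is_safe_image_upload("photo.jpg"))     # True
print(is_safe_image_upload("shell.php"))     # False
print(is_safe_image_upload("shell.php.jpg")) # False
```

Checking the extension is only a first line of defense; production code should also verify the file’s actual content type and store uploads outside any directory the web server will execute.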

Read more at Wordfence:

Crawlomatic Multipage Scraper Post Generator <= 2.6.8.1 – Unauthenticated Arbitrary File Upload

Featured Image by Shutterstock/nakaridore

Googler’s Deposition Offers View Of Google’s Ranking Systems via @sejournal, @martinibuster

A Google engineer’s redacted testimony, published online by the U.S. Justice Department, offers a look inside Google’s ranking systems, providing an idea of Google’s quality scores and introducing a mysterious popularity signal that uses Chrome data.

The document offers a high level and very general view of ranking signals, providing a sense of what the algorithms do but not the specifics.

Hand-Crafted Signals

For example, it begins with a section about the “hand crafting” of signals, which describes the general process of taking data from quality raters, clicks, and so on, and applying mathematical and statistical formulas to generate a ranking score from three kinds of signals. Hand-crafted here means scaled algorithms that are tuned by search engineers; it doesn’t mean that engineers are manually ranking websites.

Google’s ABC Signals

The DOJ document lists three kinds of signals that are referred to as ABC Signals and correspond to the following:

  • A – Anchors (pages linking to the target pages),
  • B – Body (search query terms in the document),
  • C – Clicks (user dwell time before returning to the SERP)

The statement about the ABC signals is a generalization of one part of the ranking process. Ranking search results is far more complex and involves hundreds if not thousands of additional algorithms at every step of the ranking process, from indexing, link analysis, anti-spam processes, personalization, re-ranking, and other processes. For example, Liz Reid has discussed Core Topicality Systems as part of the ranking algorithm and Martin Splitt has discussed annotations as a part of understanding web pages.

This is what the document says about the ABC signals:

“ABC signals are the key components of topicality (or a base score), which is Google’s determination of how the document is relevant to the query.

T* (Topicality) effectively combines (at least) these three signals in a relatively hand-crafted way. Google uses to judge the relevance of the document based on the query terms.”

The document offers an idea of the complexity of ranking web pages:

“Ranking development (especially topicality) involves solving many complex mathematical problems. For topicality, there might be a team of engineers working continuously on these hard problems within a given project.

The reason why the vast majority of signals are hand-crafted is that if anything breaks Google knows what to fix. Google wants their signals to be fully transparent so they can trouble-shoot them and improve upon them.”

The document compares their hand-crafted approach to Microsoft’s automated approach, saying that when something breaks at Bing it’s far more difficult to troubleshoot than it is with Google’s approach.
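As a toy illustration of what a hand-crafted combination of the A/B/C signals might look like (the weights and the linear form are invented for this sketch and are not Google’s actual formula):

```python
def topicality(anchor: float, body: float, clicks: float,
               weights=(0.3, 0.4, 0.3)) -> float:
    """Toy base score: a weighted blend of the three ABC signals,
    each normalized to [0, 1]. Weights are invented for illustration."""
    wa, wb, wc = weights
    return wa * anchor + wb * body + wc * clicks

# A page strong on all three signals outscores one that is weak on them.
strong = topicality(0.9, 0.8, 0.7)
weak = topicality(0.1, 0.2, 0.3)
print(strong, weak)
```

The appeal of a formula like this, per the testimony, is transparency: if rankings break, an engineer can inspect each term and see which signal is responsible.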

Interplay Between Page Quality And Relevance

An interesting point revealed by the search engineer is that page quality is independent of the query. If a page is determined to be high quality and trustworthy, it’s regarded as trustworthy across all related queries; that is what is meant by the word static: the score is not dynamically recalculated for each query. However, relevance-related signals in the query can be used to calculate the final rankings, which shows how relevance plays a decisive role in determining what gets ranked.

This is what they said:

“Quality
Generally static across multiple queries and not connected to a specific query.

However, in some cases Quality signal incorporates information from the query in addition to the static signal. For example, a site may have high quality but general information so a query interpreted as seeking very narrow/technical information may be used to direct to a quality site that is more technical.

Q* (page quality (i.e., the notion of trustworthiness)) is incredibly important. If competitors see the logs, then they have a notion of “authority” for a given site.

Quality score is hugely important even today. Page quality is something people complain about the most…”

AI Gives Cause For Complaints Against Google

The engineer states that people complain about quality and adds that AI is aggravating the situation.

He says about page quality:

“Nowadays, people still complain about the quality and AI makes it worse.

This was and continues to be a lot of work but could be easily reverse engineered because Q is largely static and largely related to the site rather than the query.”

eDeepRank – A Way To Understand LLM Rankings

The Googler lists other ranking signals, including one called eDeepRank, an LLM-based system that uses BERT, a transformer-based language model.

He explains:

“eDeepRank is an LLM system that uses BERT, transformers. Essentially, eDeepRank tries to take LLM-based signals and decompose them into components to make them more transparent. “

That part about decomposing LLM signals into components seems to be a reference to making LLM-based ranking signals more transparent so that search engineers can understand why the LLM is ranking something.

PageRank Linked To Distance Ranking Algorithms

PageRank is Google’s original ranking innovation, and it has since been updated. I wrote about this kind of algorithm six years ago. Link distance algorithms calculate the distance from authoritative websites for a given topic (called seed sites) to other websites in the same topic. These algorithms start with a seed set of authoritative sites in a given topic; sites that are further away from their respective seed site are determined to be less trustworthy, while sites closer to the seed set are likelier to be more authoritative and trustworthy.

This is what the Googler said about PageRank:

“PageRank. This is a single signal relating to distance from a known good source, and it is used as an input to the Quality score.”

Read about this kind of link ranking algorithm: Link Distance Ranking Algorithms
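The seed-distance idea can be sketched as a multi-source breadth-first search over the link graph. The graph and site names below are hypothetical; real systems operate at web scale with many refinements:

```python
from collections import deque

def link_distances(graph, seeds):
    """Multi-source BFS: hop distance from the nearest seed site.
    Smaller distance suggests greater trustworthiness."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        for neighbor in graph.get(site, []):
            if neighbor not in dist:
                dist[neighbor] = dist[site] + 1
                queue.append(neighbor)
    return dist

# Hypothetical link graph: seed links to a, which links to b.
graph = {
    "seed.example": ["a.example"],
    "a.example": ["b.example"],
    "b.example": [],
}
print(link_distances(graph, ["seed.example"]))
```

Sites unreachable from any seed get no distance at all, which in this model would mark them as the least trustworthy.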

Cryptic Chrome-Based Popularity Signal

There is another signal whose name is redacted that’s related to popularity.

Here’s the cryptic description:

“[redacted] (popularity) signal that uses Chrome data.”

A plausible claim can be made that this confirms that the Chrome API leak is about actual ranking factors. However, many SEOs, myself included, believe that those APIs are developer-facing tools used by Chrome to show performance metrics like Core Web Vitals within the Chrome Dev Tools interface.

I suspect that this is a reference to a popularity signal that we might not know about.

The Google engineer also refers to another leak of documents that names actual “components of Google’s ranking system,” but says the documents don’t contain enough information to reverse engineer the algorithm.

They explain:

“There was a leak of Google documents which named certain components of Google’s ranking system, but the documents don’t go into specifics of the curves and thresholds.

For example
The documents alone do not give you enough details to figure it out, but the data likely does.”

Takeaway

The newly released document summarizes a U.S. Justice Department deposition of a Google engineer that offers a general outline of parts of Google’s search ranking systems. It discusses hand-crafted signal design, the role of static page quality scores, and a mysterious popularity signal derived from Chrome data.

It provides a rare look into how signals like topicality, trustworthiness, click behavior, and LLM-based transparency are engineered and offers a different perspective on how Google ranks websites.

Featured Image by Shutterstock/fran_kie

HTTP Status Codes Google Cares About (And Those It Ignores) via @sejournal, @MattGSouthern

Google’s Search Relations team recently shared insights about how the search engine handles HTTP status codes during a “Search Off the Record” podcast.

Gary Illyes and Martin Splitt from Google discussed several status code categories commonly misunderstood by SEO professionals.

How Google Views Certain HTTP Status Codes

While the podcast didn’t cover every HTTP status code (obviously, 200 OK remains fundamental), it focused on categories that often cause confusion among SEO practitioners.

Splitt emphasized during the discussion:

“These status codes are actually important for site owners and SEOs because they tell a story about what happened when a particular request came in.”

The podcast revealed several notable points about how Google processes specific status code categories.

The 1xx Codes: Completely Ignored

Google’s crawlers ignore all status codes in the 1xx range, including newer features like “early hints” (HTTP 103).

Illyes explained:

“We are just going to pass through [1xx status codes] anyway without even noticing that something was in the 100 range. We just notice the next non-100 status code instead.”

This means implementing early hints might help user experience, but won’t directly benefit your SEO.

Redirects: Simpler Than Many SEOs Believe

While SEO professionals often debate which redirect type to use (301, 302, 307, 308), Google’s approach focuses mainly on whether redirects are permanent or temporary.

Illyes stated:

“For Google search specifically, it’s just like ‘yeah, it was a redirection.’ We kind of care about in canonicalization whether something was temporary or permanent, but otherwise we just [see] it was a redirection.”

This doesn’t mean redirect implementation is unimportant, but it suggests the permanent vs. temporary distinction is more critical than the specific code number.
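The permanent-vs-temporary distinction Illyes describes can be expressed as a simple classification. (Grouping 303 with the temporary redirects follows standard HTTP semantics; the podcast itself only named 301, 302, 307, and 308.)

```python
PERMANENT_REDIRECTS = {301, 308}
TEMPORARY_REDIRECTS = {302, 303, 307}

def redirect_kind(status: int) -> str:
    """Classify a redirect the way the podcast suggests Google does:
    permanent vs. temporary, ignoring the finer distinctions."""
    if status in PERMANENT_REDIRECTS:
        return "permanent"
    if status in TEMPORARY_REDIRECTS:
        return "temporary"
    raise ValueError(f"{status} is not a redirect status code")

print(redirect_kind(301), redirect_kind(307))  # permanent temporary
```

The finer differences (such as whether the request method is preserved, as with 307/308) still matter to browsers and APIs, just apparently not to Google’s canonicalization.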

Client Error Codes: Standard Processing

The 4xx range of status codes functions largely as expected.

Google appropriately processes standard codes like 404 (not found) and 410 (gone), which remain essential for proper crawl management.

The team humorously mentioned status code 418 (“I’m a teapot”), an April Fool’s joke in the standards, which has no SEO impact.

Network Errors in Search Console: Looking Deeper

Many mysterious network errors in Search Console originate from deeper technical layers below HTTP.

Illyes explained:

“Every now and then you would get these weird messages in Search Console that like there was something with the network… and that can actually happen in these layers that we are talking about.”

When you see network-related crawl errors, you may need to investigate lower-level protocols like TCP, UDP, or DNS.
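A first pass at checking those lower layers can be done with a few lines of Python; this is a rough diagnostic sketch (the host and port are placeholders), not a full troubleshooting tool:

```python
import socket

def diagnose_host(host: str, port: int = 443, timeout: float = 3.0) -> dict:
    """Check the layers below HTTP: DNS resolution, then TCP reachability."""
    result = {"dns_ok": False, "tcp_ok": False}
    try:
        socket.getaddrinfo(host, port)
        result["dns_ok"] = True
    except socket.gaierror:
        return result  # DNS failed; no point trying TCP
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError:
        pass  # TCP unreachable: refused, timed out, etc.
    return result

print(diagnose_host("example.com"))
```

If DNS fails, the problem is upstream of your server entirely; if DNS succeeds but TCP fails, look at firewalls, load balancers, or the server process itself.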

What Wasn’t Discussed But Still Matters

The podcast didn’t cover many status codes that definitely matter to Google, including:

  • 200 OK (the standard successful response)
  • 500-level server errors (which can affect crawling and indexing)
  • 429 Too Many Requests (rate limiting)
  • Various other specialized codes

Practical Takeaways

While this wasn’t a comprehensive guide to HTTP status codes, the discussion revealed several practical insights:

  • For redirects, focus primarily on the permanent vs. temporary distinction
  • Don’t invest resources in optimizing 1xx responses specifically for Google
  • When troubleshooting network errors, look beyond HTTP to deeper protocol layers
  • Continue to implement standard status codes correctly, including those not specifically discussed

As web technology evolves with HTTP/3 and QUIC, understanding how Google processes these signals can help you build more effective technical SEO strategies without overcomplicating implementation.


Featured Image: Roman Samborskyi/Shutterstock

How Referral Traffic Undermines Long-Term Brand Growth via @sejournal, @martinibuster

Mordy Oberstein, a search marketing professional whom I hold in high esteem, recently shared the provocative idea that referral traffic is not a brand’s friend and that every brand, as it matures, should wean itself from it. Referrals from other websites are generally considered a sign of a high-performing business, but they are not a long-term strategy because they depend on sources that cannot be controlled.

Referral Traffic Is Necessary But…

Mordy Oberstein (LinkedIn profile), formerly of Wix, asserted in a Facebook post that relying on a traffic source, whether that’s another website or a search engine, offers a degree of vulnerability to maintaining steady traffic and performance.

He broke it down as a two-fold weakness:

  • Relying on the other site to keep featuring your brand.
  • Relying on Google to keep ranking that other site which in turn sends visitors to your brand.

The flow of traffic can stop at either of those two points, which is a hidden weakness that can affect the long-term sustainability of healthy traffic and sales.

Mordy explained:

“It’s a double vulnerability…

1) Relying on being featured by the website (the traffic source)
2) Relying on Google to give that website …traffic (the channel)

There are two levels of exposure & vulnerability.

As your brand matures, you want to own your own narrative.

More referral traffic is not your friend. It’s why, as a brand matures, it should wean off of it.

Full disclosure, this is my opinion. I am sure a lot of people will disagree.”

Becoming A Destination

I’ve always favored promoting a site in a way that helps it become synonymous with a given topic, because that’s how to make it a default destination and encourage the kinds of signals that Google interprets as authoritative. I’ve done things like creating hats with logos to give away, running annual product giveaways, and other promotional activities, both online and offline. While my competition was doing SEO busy work, I created fans. Promoting a site is basically just getting it in front of people, both online and offline.

Brand Authority Is An Excuse, Not A Goal

Some SEOs believe in a concept called Brand Authority, which is a misleading explanation for why a website ranks. The term Brand Authority is not about branding, and it’s not about authoritativeness, either. It’s just an excuse for why a site is top-ranked.

The phrase Brand Authority has its roots in PageRank. Big brand websites used to have a PageRank of 9 out of 10, or even a 10/10, which enabled them to rank for virtually any keywords they wanted. A link from one of those sites practically guaranteed a top-ten ranking. But Google ended the outsized influence of PageRank because it resulted in less relevant results. That was around 2004, about the time Google started using Navboost, a ranking signal that essentially measures how people feel about a site, which is what PageRank does, too.

This insight, that Google uses signals about how people feel about a site, is important because the feelings people have for a business are what being a brand is all about.

Marty Neumeier, a thought leader on how to promote companies (author of The Brand Gap) explained what being a brand is all about:

“Instead of creating the brand first, the company creates customers (through products and social media), the customers build the brand (through purchases and advocacy), and the customer-built brand sustains the company (through “tribal” loyalty). This model takes into account a profound and counterintuitive truth: a brand is not owned by the company, but by the customers who draw meaning from it. Your brand isn’t what you say it is. It’s what they say it is.”

Neumeier also explains how brand is about customer feelings:

“The best brands are vivid. They create clear mental pictures and powerful feelings in the minds and hearts of customers. They’re brought to life through their touchpoints, the places where customers experience them, from the first exposure to a brand’s name, to buying the product, to eventually making it part of who they are.”

That “tribal loyalty” is the kind of thing Google tries to measure. So when Danny Sullivan talks about differentiating your site to make it like a brand, he is not referring to so-called “brand authority.” He is talking about doing the kinds of things that influence people to feel positive about a site.

Getting Back To Mordy Oberstein

It seems to me that what he’s saying is that referral traffic is a stepping stone towards becoming a destination, it’s a means to an end. It’s not the goal, it’s a step toward the goal of becoming a destination.

On the other side of that process, I think it’s important to maintain relevance with potential site visitors and customers, especially today with the rapid pace of innovation, generational change, new inventions, and new product models. Relevance to people has been a Google ranking signal for a long time, beginning with PageRank, then with additional signals like Navboost.

The SEO factor that the SEO industry has largely missed is the part about getting people to think positive thoughts about your site and your business, enough to share with other people.

Mordy’s insight about traffic is beautiful and elegant.

Read Mordy’s entire post on Facebook.

Featured Image by Shutterstock/Yunus Praditya

Google Clarifies: AI Overview Links Share Single Position In Search Console via @sejournal, @MattGSouthern

Google’s John Mueller has clarified that all links within AI Overviews (AIOs) share a single position in Google Search Console.

SEO consultant Gianluca Fiorelli asked Mueller how Search Console tracks position data for URLs in Google’s AI-generated answer boxes.

Mueller referenced Google’s official help docs, explaining:

“Basically an AIO counts as a block, so it’s all one position. It can be first position, if the block is shown first, but I don’t know if AIO is always shown first.”

This indicates that every website linked in an AI Overview receives the same position value in Search Console reports.

This occurs regardless of where the link appears in the overview panel, whether immediately visible or hidden until a user expands the box.

What Google’s Documentation Says

Google’s Search Console Help docs explain how AI Overview metrics work:

  • Position: “An AI Overview occupies a single position in search results, and all links in the AI Overview are assigned that same position.”
  • Clicks: “Clicking a link to an external page in the AI Overview counts as a click.”
  • Impressions: “Standard impression rules apply. To be counted as an impression, the link must be scrolled or expanded into view.”

The docs also note:

“Search Console doesn’t include data from experiments in Search Labs, as these experiments are still in active development.”
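The shared-position rule from the documentation can be modeled directly. The SERP layout and URLs below are invented for illustration:

```python
def assign_positions(serp_blocks):
    """Assign positions per the GSC documentation: each block occupies
    one position, and every link inside an AIO block inherits it."""
    rows = []
    for position, block in enumerate(serp_blocks, start=1):
        for url in block["links"]:
            rows.append({"url": url, "type": block["type"], "position": position})
    return rows

# Hypothetical SERP: an AI Overview block first, one organic result after.
serp = [
    {"type": "aio", "links": ["a.example", "b.example", "c.example"]},
    {"type": "organic", "links": ["d.example"]},
]
for row in assign_positions(serp):
    print(row)
```

This is why Search Console cannot tell you whether your link was the first citation in the overview or buried behind an expansion: all three AIO links above report the identical position.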

The Missing Data Behind Google’s Click Claims

This discussion highlights an ongoing debate in the SEO community regarding the performance of links in AI Overviews.

Lily Ray, Vice President of SEO Strategy & Research at Amsive, recently pointed out Google’s year-old claim that websites receive more clicks when featured in AI Overviews, stating:

“I would love to see a single GSC report that confirms this statement, because every study so far has shown the opposite.”

Ray’s statement reflects the concerns of many SEO professionals, as Google has not provided data to support its claims.

Looking Ahead

While we now understand how position metrics are recorded, the question remains: Do AI Overview placements drive more or less traffic than traditional search listings?

Google claims one thing, but many people report different experiences.

Since all AIO links share the same position, it’s difficult to determine which specific placements perform better.

This debate highlights the need for more precise data about how AIOs affect website traffic compared to regular search results.


Featured Image: Roman Samborskyi/Shutterstock

Google Links To Itself: 43% Of AI Overviews Point Back To Google via @sejournal, @MattGSouthern

New research shows that Google’s AI Overviews often link to Google, contributing to the walled garden effect that encourages users to stay longer on Google’s site.

A study by SE Ranking examined Google’s AI Overviews in five U.S. states. It found that 43% of these AI answers contain links redirecting users to Google’s search results. Each answer typically includes 4-6 links to Google.

This aligns with recent data indicating that Google users make 10 clicks before visiting other websites. These patterns suggest that Google is working to keep users within its ecosystem for longer periods.

Google Citing Itself in AI Answers

The SE Ranking study analyzed 100,013 keywords across five states: Colorado, Texas, California, New York, and Washington, D.C.

It tracked how Google’s AI summaries function in different regions. Although locations showed slight variance, the study found that Google.com is the most-cited website in AI Overviews.

Google appears in about 44% of all AI answers, significantly ahead of the next most-cited sources (YouTube, Reddit, Quora, and Wikipedia), which appear in about 13%.

The research states:

“Based on the data combined from all five states (141,507 total AI Overview appearances), our data analysis shows that 43.42% (61,437 times) of AI Overview responses contain links to Google organic results, while 56.58% of responses do not.”

Image Credit: SE Ranking
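The study’s headline figure can be reproduced from the raw counts it reports:

```python
total_appearances = 141_507  # AI Overview appearances across the five states
with_google_links = 61_437   # responses containing links to Google organic results

share = with_google_links / total_appearances * 100
print(f"{share:.2f}% of AI Overviews link back to Google")  # 43.42%
print(f"{100 - share:.2f}% do not")                         # 56.58%
```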

Building on the Walled Garden Trend

These findings complement a recent analysis from Momentic, which found that Google’s “pages per visit” has reached 10, indicating users spend significantly more clicks on Google before visiting other sites.

Overall, this research reveals Google is creating a more self-contained search experience:

  • AI Overviews appear in approximately 30% of all searches
  • Nearly half of these AI answers link back to Google itself
  • Users now make 10 clicks within Google before leaving
  • Longer, more specific queries trigger AI Overviews more frequently

Google still drives substantial traffic outward: 175.5 million visits in March, according to Momentic.

However, it’s less effective at sending users away than ChatGPT. Google produces just 0.6 external visits per user, while ChatGPT generates 1.4 visits per user.

More Key Stats from the Study

The SE Ranking research uncovered several additional findings:

  • AI Overviews almost always appear alongside other SERP features (99.25% of the time), most commonly with People Also Ask boxes (98.5%)
  • The typical AI Overview consists of about 1,766 characters (roughly 254 words) and cites an average of 13.3 sources
  • Medium-difficult keywords (21-40 on the difficulty scale) most frequently trigger AI Overviews (33.4%), whereas highly competitive terms (81-100) rarely generate them (just 3.7%)
  • Keywords with CPC values between $2-$5 produce the highest rate of AI Overviews (32%), while expensive keywords ($10+) yield them the least (17.3%)
  • Fashion and Beauty has the lowest AI Overview appearance rate (just 1.4%), followed by E-Commerce (2.1%) and News/Politics (3.8%)
  • The longer an AI Overview’s answer, the more sources it cites. Responses under 600 characters cite about five sources, while those over 6,600 characters cite around 28 sources.

These statistics further emphasize how Google’s AI Overviews are reshaping search behavior.

This data stresses the need to optimize for multiple traffic sources while remaining visible within Google’s results pages.

U.S. Copyright Office Cites Legal Risk At Every Stage Of Generative AI via @sejournal, @martinibuster

The United States Copyright Office released a pre-publication version of a report on the use of copyrighted materials for training generative AI, outlining a legal and factual case that identifies copyright risks at every stage of generative AI development.

The report was created in response to public and congressional concern about the use of copyrighted content, including pirated versions, by AI systems without first obtaining permission. While the Copyright Office doesn’t make legal rulings, the reports it creates offer legal and technical guidance that can influence legislation and court decisions.

The report offers four reasons AI technology companies should be concerned:

  1. The report states that many acts of data acquisition, the process of creating datasets from copyrighted work, and training could “constitute prima facie infringement.”
  2. It challenges the common industry defense that training models does not involve “copying,” noting that the process of creating datasets involves the creation of multiple copies, and that improvements in model weights can also contain copies of those works. The report cites reports of instances where AI reproduces copyrighted works, either word for word or “near identical” copies.
  3. It states that the training process implicates the right of reproduction, one of the exclusive rights granted to copyright holders, and emphasizes that memorization and regurgitation of copyrighted content by models may constitute infringement, even if unintended.
  4. Transformative use, in which a use adds new meaning to an original work, is an important consideration in fair use analysis. The report acknowledges that “some uses of copyrighted works in AI training are likely to be transformative,” but it “disagrees” with the argument that AI training is transformative simply because it resembles “human learning,” such as when a person reads a book and learns from it.

Copyright Implications At Every Stage of AI Development

Perhaps the most damning part of the report is where it identifies potential copyright issues at every stage of AI development, listing each stage and what may be legally problematic about it.

A. Data Collection and Curation

The steps required to produce a training dataset containing copyrighted works clearly implicate the right of reproduction…

B. Training

The training process also implicates the right of reproduction. First, the speed and scale of training requires developers to download the dataset and copy it to high-performance storage prior to training. Second, during training, works or substantial portions of works are temporarily reproduced as they are “shown” to the model in batches.

Those copies may persist long enough to infringe the right of reproduction, depending on the model at issue and the specific hardware and software implementations used by developers.

Third, the training process—providing training examples, measuring the model’s performance against expected outputs, and iteratively updating weights to improve performance—may result in model weights that contain copies of works in the training data. If so, then subsequent copying of the model weights, even by parties not involved in the training process, could also constitute prima facie infringement.

C. RAG

RAG also involves the reproduction of copyrighted works. Typically, RAG works in one of two ways. In one, the AI developer copies material into a retrieval database, and the generative AI system can later access that database to retrieve relevant material and supply it to the model along with the user’s prompt. In the other, the system retrieves material from an external source (for example, a search engine or a specific website). Both methods involve making reproductions, including when the system copies retrieved content at generation time to augment its response.
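The first RAG pattern the report describes, copying material into a retrieval store ahead of time and later supplying matched text to the model alongside the prompt, can be illustrated with a minimal sketch. This is a hypothetical toy (plain keyword overlap instead of embeddings, and all function names are invented for illustration), not how any production system works:

```python
# Toy sketch of the "copy into a retrieval database" RAG pattern.
# All names (build_store, retrieve, augment) are hypothetical.

def build_store(documents):
    """Copy source documents into an in-memory retrieval store."""
    return [(doc, set(doc.lower().split())) for doc in documents]

def retrieve(store, prompt, k=1):
    """Return the k stored documents sharing the most words with the prompt."""
    words = set(prompt.lower().split())
    ranked = sorted(store, key=lambda item: len(item[1] & words), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def augment(prompt, store):
    """Prepend retrieved text to the prompt before it reaches the model."""
    context = "\n".join(retrieve(store, prompt))
    return f"Context:\n{context}\n\nQuestion: {prompt}"

store = build_store([
    "The reproduction right covers copying a work.",
    "Fair use weighs transformative purpose.",
])
print(augment("What does the reproduction right cover?", store))
```

Note that the source text is copied twice in this flow, once into the store and again into the augmented prompt, which is the reproduction the report flags.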

D. Outputs

Generative AI models sometimes output material that replicates or closely resembles copyrighted works. Users have demonstrated that generative AI can produce near exact replicas of still images from movies, copyrightable characters, or text from news stories. Such outputs likely infringe the reproduction right and, to the extent they adapt the originals, the right to prepare derivative works.

The report finds infringement risks at every stage of generative AI development, and while its findings are not legally binding, they could be used to create legislation and serve as guidance for courts.

Takeaways

  • AI Training And Copyright Infringement:
    The report argues that both data acquisition and model training can involve unauthorized copying, possibly constituting “prima facie infringement.”
  • Rejection Of Industry Defenses:
    The Copyright Office disputes common AI industry claims that training does not involve copying and that AI training is analogous to human learning.
  • Fair Use And Transformative Use:
    The report disagrees with the broad application of transformative use as a defense, especially when based on comparisons to human cognition.
  • Concern About All Stages Of AI Development:
    Copyright concerns are identified at every stage of AI development, from data collection and training to retrieval-augmented generation (RAG) and model outputs.
  • Memorization and Model Weights:
    The Office warns that AI models may retain copyrighted content in their weights, meaning even the use or distribution of those weights could be infringing.
  • Output Reproduction and Derivative Works:
    The ability of AI to generate near-identical outputs (e.g., movie stills, characters, or articles) raises concerns about violations of both reproduction and derivative work rights.
  • RAG-Specific Infringement Risk:
    Both methods of RAG, copying content into a database or retrieving from external sources, are described as involving potentially infringing reproductions.

The U.S. Copyright Office report describes multiple ways that generative AI development may infringe copyright law, challenging the legality of using copyrighted data without permission at every technical stage, from dataset creation to model outputs. It rejects the analogy to human learning as a defense, along with the industry’s broad application of fair use. Although the report does not carry the force of a judicial ruling, it can serve as guidance for lawmakers and courts.

Featured Image by Shutterstock/Treecha