Google Alters Purpose Of SpecialAnnouncement Structured Data via @sejournal, @martinibuster

Google’s updated guidance for the SpecialAnnouncement structured data shifts the focus toward a more general-purpose use than what Schema.org originally intended.

SpecialAnnouncement Structured Data

The SpecialAnnouncement structured data was created by Schema.org in March 2020 as a way to communicate special announcements specifically related to Covid-19. In fact, the Schema.org documentation states that it currently remains a work in progress that is specifically focused on the then-current Covid-19 crisis.

This is how Schema.org defines this specific structured data:

“A SpecialAnnouncement combines a simple date-stamped textual information update with contextualized Web links and other structured data. It represents an information update made by a locally-oriented organization, for example schools, pharmacies, healthcare providers, community groups, police, local government.

The motivating scenario for SpecialAnnouncement is the Coronavirus pandemic, and the initial vocabulary is oriented to this urgent situation. Schema.org expect to improve the markup iteratively as it is deployed and as feedback emerges from use.”

Google’s Implementation Of SpecialAnnouncement

The SpecialAnnouncement structured data was introduced by Google in April 2020 specifically for Covid-19-related situations. It was created as a way for businesses, government entities, schools, and other organizations to communicate special announcements such as quarantines, hours of operation, closings, and restrictions related to Covid-19.

SpecialAnnouncement was released as a Beta feature and it currently remains in Beta status, officially a testing phase, meaning that Google’s implementation was subject to change or even complete removal. A change is what quietly happened in the first week of 2024.

How SpecialAnnouncement Structured Data Documentation Changed

There are multiple changes throughout the document, too numerous to list individually. Google’s previous version of the SpecialAnnouncement structured data documentation, as of December 2023, contained 38 references to Covid-19, and the update involved a total of 221 text removals.

After the change, the new version of the SpecialAnnouncement documentation contains only 13 references to Covid-19.

Most of the changes are similar to the following examples.

The previous documentation for this section of the guidance:

“How to implement your COVID-19 announcements
There are two ways that you can implement your COVID-19 announcements:”

Was rewritten to this:

“How to implement special announcements
There are two ways that you can implement your special announcements:”

The changes to the above passage transform the purpose of the SpecialAnnouncement structured data into one with a wider application than the formerly narrow focus on Covid-19, while still remaining oriented toward local medical-related events (more details on this below).

Here’s another example that has a similar broadening effect.

This previous passage:

“Information about public transport closures in the context of COVID-19, if applicable to the announcement.”

Was rewritten to remove references to Covid-19 like this:

“Information about public transport closures, if applicable to the announcement.”

The changes, while subtle, represent an evolution of the SpecialAnnouncement structured data specifications away from being specific to Covid-19.

Perhaps the most notable change in the documentation is at the very top of the page in the very first paragraph.

The previous version of the documentation at the top of the page featured this:

“COVID-19 announcement (SpecialAnnouncement) structured data (BETA)
Note: We’re currently developing support for COVID-19 announcements in Google Search, and you may see changes in requirements, guidelines, and how the feature appears in Google Search. Learn more about the feature’s availability.

Due to COVID-19, many organizations, such governments, health organizations, schools, and more, are publishing urgent announcements…”

The updated and considerably shortened passage of the guidance now looks like this:

“Special announcement (SpecialAnnouncement) structured data (BETA)
Many organizations, such governments, health organizations, schools, and more, may need to publish urgent announcements (such as COVID-19 announcements)…”

Impact Of Changes To SpecialAnnouncement Structured Data

The dozens of changes to the SpecialAnnouncement structured data have the effect of making it more flexible for use in situations beyond Covid-19.

But the changes as they currently stand do not widen the scope of the SpecialAnnouncement structured data beyond what it previously covered. This is because the nine examples of how the structured data can be used remain exactly the same.

Nine Examples Of Special Announcements

  1. Announcement of a shelter-in-place directive
  2. Closure notice (for example, closing a school or public transportation)
  3. Announcement of government benefits (for example, unemployment support, paid leave, or one-time payments)
  4. Quarantine guidelines
  5. Travel restrictions
  6. Notification of a new drive-through testing center
  7. Announcement of an event transitioning from offline to online, or cancellation
  8. Announcement of revised hours and shopping restrictions
  9. Disease spread statistics and maps

The guidance continues to encourage the use of this structured data for extraordinary situations related to widespread communicable diseases and other disruptive emergency situations.
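To make the markup concrete, here is a minimal sketch of what a SpecialAnnouncement block for a closure notice might look like, built and printed as JSON-LD with Python’s standard library. The school name, URLs, and dates are hypothetical placeholders, the property list is not exhaustive, and required properties should be checked against Google’s current documentation.

```python
import json

# A minimal, hypothetical SpecialAnnouncement for a school closure notice.
# Property names follow the Schema.org SpecialAnnouncement type; values are placeholders.
special_announcement = {
    "@context": "https://schema.org",
    "@type": "SpecialAnnouncement",
    "name": "Example Elementary School closed due to severe weather",
    "text": "The school is closed to students and staff until further notice.",
    "datePosted": "2024-01-08T08:00:00-05:00",
    "expires": "2024-01-15T00:00:00-05:00",
    "announcementLocation": {
        "@type": "School",
        "name": "Example Elementary School",
        "url": "https://www.example.com/",
    },
}

# Print the JSON-LD that would be embedded in a <script type="application/ld+json"> block.
print(json.dumps(special_announcement, indent=2))
```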

Given that the examples remain the same, it’s probably safe to say that it may be inappropriate to use this structured data for something isolated, like the rescheduling of a theater performance due to unforeseen circumstances. But because this structured data is in Beta, it may evolve at some point.

Read the updated SpecialAnnouncement structured data guidance:

Special announcement (SpecialAnnouncement) structured data (BETA)

Featured Image by Shutterstock/Lauritta

Google: No Perfect Formula For Search Rankings via @sejournal, @MattGSouthern

In a recent social media statement, Google’s Search Liaison reminded site owners that there’s no “perfect page” formula that websites must follow to rank well in search results.

The statement begins:

“Today I wanted to share about the belief that there is some type of “perfect page” formula that must be used to rank highly in Google Search.”

Google clarifies that no universal ranking formula exists despite claims that specific word counts, page structures, or other optimizations can guarantee high placement.

The statement continues:

“There isn’t, and no one should feel they must work to some type of mythical formula. It’s a belief dating back to even before Google was popular.”

Dispelling SEO Myths

Third-party SEO tools often advise constructing pages in specific ways to succeed in search. However, Google asserts these tools can’t predict rankings.

The tools’ advice is frequently based on averages found among top-ranking pages, an approach that misses the point that completely different and unique pages can and do succeed in search.

“Third-party advice, even news articles, might suggest some type of thing. Following such advice doesn’t guarantee a top ranking. Moreover, such predictions and advice is often based on looking at averages — which misses the point that completely different and unique pages can and do succeed in search.”

Instead of formulas, Google’s advice is to focus on being helpful and relevant to users.

For example, if an author byline fits a page’s purpose for readers, include it – but not because it may supposedly boost rankings.

The Liaison’s statement concludes:

“Google’s key advice is to focus on doing things for your readers that is helpful. For example, if it makes sense for your readers to see a byline for an article (and it might!), do it for them. Don’t do it because you’ve heard having a byline ranks you better in Google (it doesn’t).

Put your readers and audience first. Be helpful to them. If you do this, if you’re doing things for them, you are more likely to align with completely different signals we use to reward content.”

Key Takeaways

The key takeaway from Google’s message is to adopt a reader-first approach.

For those hoping for that one perfect blueprint to guaranteed rankings, Google’s message remains consistent – no such formula exists. But creating content that genuinely serves its purpose? That continues to be rewarded.


Featured Image: egaranugrah/Shutterstock

Google’s John Mueller Offers Help With Spammy Foreign Language Hack via @sejournal, @MattGSouthern

Google Search Advocate John Mueller recently responded to a post on Reddit from a website owner experiencing a significant increase in indexed foreign language pages.

The website owner reported that over 20,000 pages in Japanese and Chinese suddenly appeared on their site, which they didn’t create or intend to host. They asked the Reddit community for help removing unwanted pages and restoring their site’s rankings.

Mueller suggested ways to clean up the issue and prevent a recurrence.

The Incident

The website owner said Google indexed thousands of foreign language pages in one day, and that the pages didn’t exist in their hosting control panel, cPanel.

This led the owner to worry their site may have fallen victim to a security breach or misconfiguration that allowed unknown parties to post content.

The sudden influx of pages is characteristic of a technique known in search engine optimization circles as the “Japanese keyword hack.”

Perpetrators can manipulate search results by flooding a site with junk pages optimized for Japanese keywords.

These attacks are a rising threat to website security and integrity, and the Reddit user’s situation highlights the need for increased vigilance.

Mueller’s Guidance

Responding to the call for help, Mueller confirmed the website had been hacked and said the next step was to identify how the breach occurred.

Mueller advised,

“Since someone hacked your site, even if you’ve cleaned up the hacked traces, it’s important to understand how they did it, so that you can make sure that the old vulnerabilities are locked down.”

Mueller also suggested that enabling automatic updates, and potentially switching to a hosting platform that handles security, could be beneficial.

SEO Implications

Mueller said that once a site’s most important pages are cleaned of unwanted content, they can be reindexed quickly.

He said there’s no need to worry about old hacked pages that remain indexed but invisible to users, as they can stay that way for months without issue.

“Old pages will remain indexed for months, they don’t cause any problems if they tend not to be seen.”

Mueller also clarified that spammy backlinks pointing to these invisible indexed pages do not require disavowing.

Instead, he advised focusing cleanup efforts on a site’s visible content and preventing internal search results from being indexed.

Addressing Spammy Links & Indexing

The website owner asked Mueller for advice regarding spammy backlinks causing internal search pages to be indexed.

Mueller clarified that this was separate from the hacking issue. He recommended against disavowing the links, saying the pages would naturally drop from search results over time.

He advised proactively blocking search results pages for any new or existing sites to avoid potential exploitation by spammers.

“Block the search results from indexing (robots.txt or noindex). For new/other sites, I’d generally block search results pages from indexing, no need to wait until someone takes advantage of your site like this.”
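As a quick illustration of the robots.txt half of that advice, here is a small sketch using Python’s standard library to confirm that an internal search path is blocked from crawling. The /search/ path and example.com domain are hypothetical, and the noindex alternative Mueller mentions would instead be applied with a meta robots tag or X-Robots-Tag header.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks internal search result pages.
# A WordPress-style "?s=" search would need its own Disallow rule.
robots_txt = """
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Internal search URLs should be disallowed while normal pages stay crawlable.
print(parser.can_fetch("Googlebot", "https://www.example.com/search/?q=widgets"))  # False
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/widgets-guide"))  # True
```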

Insights For SEO Professionals

This dialogue with Mueller highlights the importance of proactive measures to prevent hacking and spammy links from hurting sites’ search rankings.

Regular security updates, malware scans, and link audits should be part of your routine maintenance. Websites share responsibility with search engines to keep results free of hacked and spammy content.


Featured Image: ColorMaker/Shutterstock

Google Clarifies Job Posting Structured Data Guidance via @sejournal, @martinibuster

Google updated their job posting structured data guidance to change how they recommend keeping Google notified of new webpages and of changes to existing webpages.

Google Job Posting Structured Data Guidance

Google’s job posting structured data guidance is designed to help publishers become eligible for enhanced visibility in Google Search results through interactive job listings in search. It also provides an overview of the process of adding, testing and maintaining job posting structured data.

Of particular importance is the ability to notify Google of new job posting webpages and changes to existing pages, which benefits publishers by keeping the most relevant and useful job postings available in the SERPs.

Changes In The Guidance For Notifying Google

The change in the guidance clarifies how publishers can notify Google of new webpages and of changes to existing ones. There doesn’t seem to be a change in what publishers are advised to do, but rather a change in how the use of sitemaps is emphasized and encouraged.

In the previous version of the guidance, Google encouraged users to rely on the Indexing API “instead of sitemaps” to directly notify Google of pages that need immediate crawling.

The previous guidance stated:

“Keep Google informed by doing one of the following actions:
For job posting URLs, we recommend using the Indexing API instead of sitemaps because the Indexing API prompts Googlebot to crawl your page sooner than updating the sitemap and pinging Google. However, we still recommend submitting a sitemap for coverage of your entire site.”

The recommendation to use the Indexing API “instead of sitemaps” seems to discourage the use of sitemaps, even though the next sentence recommends submitting a sitemap for the entire site.

The updated guidance fixes the appearance of a conflict between the two sentences by replacing the phrase “instead of” with the word “and”.

It now reads:

“Keep Google informed by using the Indexing API and submitting a sitemap. For job posting URLs, we recommend using the Indexing API instead of sitemaps because the Indexing API prompts Googlebot to crawl your page sooner. Use the Indexing API to notify Google of a new URL to crawl or that content at a URL has been updated.”

The updated guidance now recommends using both the Indexing API and a sitemap, while clarifying that the Indexing API is faster.
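For context, here is a rough sketch of what an Indexing API notification looks like, using Python and the requests library. The access token, URL, and error handling are simplified placeholders; the endpoint and request body follow Google’s published Indexing API format.

```python
import requests

# The Indexing API publish endpoint (used for job posting and livestream URLs).
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

# Assumes an OAuth 2.0 access token obtained from a service account with the
# Indexing API enabled (token acquisition omitted for brevity).
ACCESS_TOKEN = "ya29.example-access-token"  # hypothetical placeholder

def notify_google(url: str, notification_type: str = "URL_UPDATED") -> dict:
    """Tell Google a job posting URL is new/updated (URL_UPDATED) or removed (URL_DELETED)."""
    response = requests.post(
        ENDPOINT,
        json={"url": url, "type": notification_type},
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Example: notify Google about an updated job posting page, while a sitemap
# continues to cover the rest of the site, as the guidance recommends.
print(notify_google("https://www.example.com/jobs/senior-editor"))
```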

There are similar changes to two more passages further down the page, in addition to the removal of all mentions of “pinging” Google about sitemap changes with a GET request.

This statement:

“Keep Google informed by doing one of the following actions:”

Was changed to this:

“Keep Google informed of changes:”

That change makes it clearer that publishers should still use a sitemap, while the guidance continues to recommend the Indexing API for fast crawling.

Google also removed from the guidance the recommendation to use a GET request to get the sitemap crawled:

“If you’re not using the Indexing API, submit a new sitemap to Google by sending a GET request to the following URL:

https://www.google.com/ping?sitemap=https://www.example.com/sitemap.xml”

Read the updated Job Posting Structured Data guidance here:

Job posting (JobPosting) structured data for Job Search

Featured Image by Shutterstock/object_photo

OpenAI: New York Times Lawsuit Based On Misuse Of ChatGPT via @sejournal, @martinibuster

OpenAI published a response to The New York Times’ lawsuit, alleging that The NYTimes used manipulative prompting techniques to induce ChatGPT to regurgitate lengthy excerpts and stating that the suit is based on misuse of ChatGPT in order to “cherry pick” examples.

The New York Times Lawsuit Against OpenAI

The New York Times filed a lawsuit against OpenAI (and Microsoft) for copyright infringement alleging that ChatGPT “recites Times content verbatim” among other complaints.

The lawsuit introduced evidence showing that GPT-4 could output large amounts of New York Times content verbatim and without attribution, as proof that GPT-4 infringes on The New York Times’ content.

The accusation that GPT-4 outputs exact copies of New York Times content is important because it counters OpenAI’s insistence that its use of data is transformative, a concept within the legal doctrine of fair use.

The United States Copyright Office explains fair use and transformative use as follows:

“Fair use is a legal doctrine that promotes freedom of expression by permitting the unlicensed use of copyright-protected works in certain circumstances.

…’transformative’ uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.”

That’s why it’s important for The New York Times to assert that OpenAI’s use of content is not fair use.

The New York Times lawsuit against OpenAI states:

“Defendants insist that their conduct is protected as “fair use” because their unlicensed use of copyrighted content to train GenAI models serves a new “transformative” purpose. But there is nothing “transformative” about using The Times’s content …Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.”

The following screenshot shows evidence of how GPT-4 outputs exact copy of the Times’ content. The content in red is original content created by the New York Times that was output by GPT-4.


OpenAI Response Undermines NYTimes Lawsuit Claims

OpenAI offered a strong rebuttal of the claims made in the New York Times lawsuit, claiming that the Times’ decision to go to court surprised OpenAI because they had assumed the negotiations were progressing toward a resolution.

Most importantly, OpenAI debunked The New York Times’ claim that GPT-4 outputs verbatim content by explaining that GPT-4 is designed not to output verbatim content and that The New York Times used prompting techniques specifically designed to break GPT-4’s guardrails in order to produce the disputed output. This undermines the lawsuit’s implication that outputting verbatim content is typical GPT-4 behavior.

This type of prompting, designed to break ChatGPT’s guardrails in order to generate undesired output, is known as adversarial prompting.

Adversarial Prompting Attacks

Generative AI is sensitive to the types of prompts (requests) made of it. Despite the best efforts of engineers to block the misuse of generative AI, there are still new ways of using prompts to get around the guardrails built into the technology to prevent undesired output.

Techniques for generating unintended output are called adversarial prompting, and that’s what OpenAI is accusing The New York Times of doing in order to manufacture a basis for arguing that GPT-4’s use of copyrighted content is not transformative.

OpenAI’s claim that The New York Times misused GPT-4 is important because it undermines the lawsuit’s insinuation that generating verbatim copyrighted content is typical behavior.

That kind of adversarial prompting also violates OpenAI’s terms of use which states:

What You Cannot Do

  • Use our Services in a way that infringes, misappropriates or violates anyone’s rights.
  • Interfere with or disrupt our Services, including circumvent any rate limits or restrictions or bypass any protective measures or safety mitigations we put on our Services.

OpenAI Claims Lawsuit Based On Manipulated Prompts

OpenAI’s rebuttal claims that the New York Times used manipulated prompts specifically designed to subvert GPT-4 guardrails in order to generate verbatim content.

OpenAI writes:

“It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate.

Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.”

OpenAI also fired back at The New York Times lawsuit, saying that the methods The New York Times used to generate verbatim content were not allowed user activity and constituted misuse.

They write:

“Despite their claims, this misuse is not typical or allowed user activity.”

OpenAI ended by stating that they continue to build resistance against the kinds of adversarial prompt attacks used by The New York Times.

They write:

“Regardless, we are continually making our systems more resistant to adversarial attacks to regurgitate training data, and have already made much progress in our recent models.”

OpenAI backed up their claim of diligence in respecting copyright by citing their July 2023 response to reports that ChatGPT was generating verbatim responses.

The New York Times Versus OpenAI

There are always two sides to a story, and OpenAI has now released its side, which argues that The New York Times’ claims are based on adversarial attacks and a misuse of ChatGPT in order to elicit verbatim responses.

Read OpenAI’s response:

OpenAI and journalism:
We support journalism, partner with news organizations, and believe The New York Times lawsuit is without merit.

Featured Image by Shutterstock/pizzastereo

Google: Author Bylines Not A Ranking Factor via @sejournal, @MattGSouthern

In a recent clarification on social media, Google’s Search Liaison addressed a widespread misconception regarding the influence of author bylines on search rankings.

Misconception In The SEO Community

The issue arose after a notable publication suggested that including author bylines could boost content’s visibility in Google’s search results.

According to the publication, this claim has prompted many websites to adjust their content strategies under the assumption that bylines would boost their ranking.

Google’s Clarification On Bylines

Google’s Search Liaison took to X (formerly Twitter) to address the misinformation.

In a response, the liaison stated:

“I know this will be a ‘simple, almost quaint answer’ but this part of the article is wrong nor cites us saying this. Google doesn’t somehow ‘check out our credentials.’”

He emphasized that Google doesn’t use bylines as a direct ranking signal and that the publication’s claim was incorrect.

Bylines, the liaison explained, aren’t a tool for improving search rankings but rather for the benefit of the readers.

He added:

“Author bylines aren’t something you do for Google, and they don’t help you rank better. They’re something you do for your readers — and publications doing them may exhibit the type of other characteristics our ranking systems find align with useful content.”

Further Clarification Regarding Author Bylines

Google’s Search Liaison continued:

“Just adding a byline doesn’t give a ranking boost. Nor do we somehow read information in or near a byline and think ‘Oh, they say they’re an expert, so this must be written by an expert.’”

He noted that while having accurate bylines and information might correlate with quality content, they aren’t direct ranking factors.

He added that plenty of content without bylines ranks well, reinforcing that bylines aren’t required to succeed in search rankings.

Key Takeaways For SEO Professionals

For SEO professionals, the key points to remember are:

  • Google doesn’t use author bylines as a factor in search rankings.
  • Bylines should be included for reader benefit and might coincide with other quality signals.
  • Quality content can rank well with or without an author byline.
  • Google plans to update its documentation to help clarify ranking factors and improve communication with SEO professionals.

This clarification from Google’s Search Liaison serves as another reminder to create high-quality content that serves your audience, rather than to chase strategies that have no direct impact on search rankings.


Featured Image: Vectorium/Shutterstock

Google Gives Cookie Reprieve To Select Sites Through New Trials via @sejournal, @MattGSouthern

As Google starts restricting third-party cookies in Chrome, it’s offering a limited deprecation trial for sites that can demonstrate functionality issues.

  • Google is phasing out third-party cookie access in Chrome by 2024.
  • Some third-party services can get temporary cookie access through a deprecation trial.
  • Businesses should audit their site’s cookie usage before more users are impacted.

Google Clarifies How Algorithm Chooses Search Snippets via @sejournal, @martinibuster

Google updated their search snippet documentation to clarify what influences the algorithm that chooses what to display as the snippet in the search results. This update may represent a big change in how meta descriptions are written and how content is optimized.

Google Search Results Snippets

A webpage listing in the search engine results pages (SERPs) consists of a title, a URL breadcrumb, and a one-to-two-sentence description of what the webpage is about. That last part, the concise description of the page, is called the snippet.

Traditionally, the snippet was derived from the meta description, but that hasn’t been the case for a while now.

Google Clarifies Snippet Guidance

Google updated their Search Central documentation to clarify that the content of the page is the main source of where the snippet comes from. The changes also made it clearer that the structured data and the meta description are not the primary source of the search snippets.

The official documentation for the change says:

“What: Clarified in our documentation about snippets that the primary source of the snippet is the page content itself.

Why: The previous wording incorrectly implied that structured data and the meta description HTML element are the primary sources for snippets.”

What Changed In Google’s Snippet Documentation

Google removed a substantial amount of words from the previous version of the documentation.

This is what the first part of the documentation previously advised:

“Google uses a number of different sources to automatically determine the appropriate snippet, including descriptive information in the meta description tag for each page. We may also use information found on the page, or create rich results based on markup and content on the page.”

The previous version implied that the snippet was derived mostly from the meta description and said that Google “may” also select on-page content for the snippet.

The updated documentation now makes it clear that the page content is the main source of the snippet and uses the word “may” for the meta description.

This is the new version of the documentation:

“Google primarily uses the content on the page to automatically determine the appropriate snippet. We may also use descriptive information in the meta description element when it describes the page better than other parts of the content.”

Significant Amount Of Content Removed

Google also removed an entire paragraph of content and replaced it with new documentation. Both the removal and the addition dramatically change the message of the documentation.

This section was removed:

“Site owners have two main ways to suggest content for the snippets that we create:

Rich results: Add structured data to your site to help Google understand the page: for example, a review, recipe, business, or event. Learn more about how rich results can improve your site’s listing in search results.

Meta description tags: Google sometimes uses <meta> tag content to generate snippets, if we think they give users a more accurate description than can be taken directly from the page content.”

This is the new wording:

“Snippets are primarily created from the page content itself. However, Google sometimes uses the meta description HTML element if it might give users a more accurate description of the page than content taken directly from the page.”

What The Change In Guidance Means For SEO

Many SEO guides published online (wrongly) advise that the best way to optimize a meta description is to write it as “advertising copy” and to use “target keywords” in it. The idea is that words matching the search query are bolded in the snippet, making it stand out in the SERPs, so putting keywords in the meta description will supposedly produce bolded text that calls attention to the listing and inspires a higher click-through rate.

That advice is outdated and 100% wrong. Adding keywords to the meta description is not important (meta descriptions are not used for ranking), and the purpose of a meta description is not to entice clicks from the SERPs. Writing it that way will cause Google to not use the meta description for the snippet.

The correct use of the meta description is to accurately and concisely describe what the webpage is about, period.

The official W3C HTML specification for the meta description outlines the correct use of the meta description:

“The value must be a free-form string that describes the page. The value must be appropriate for use in a directory of pages, e.g. in a search engine. There must not be more than one meta element with its name attribute set to the value description per document.”
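As a small illustration of the spec’s “no more than one meta description element” rule and the “describe the page” guidance, here is a hedged sketch using Python’s standard-library HTML parser to pull the meta description from a page; the sample markup and description text are hypothetical.

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collects the content of every <meta name="description"> element."""

    def __init__(self):
        super().__init__()
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.descriptions.append(attrs.get("content", ""))

# Hypothetical page markup with a meta description that plainly describes the page.
html = """
<html><head>
<meta name="description" content="A step-by-step guide to repotting succulents, with soil and drainage tips.">
</head><body>...</body></html>
"""

parser = MetaDescriptionParser()
parser.feed(html)

# The HTML spec allows at most one meta description element per document.
assert len(parser.descriptions) <= 1, "More than one meta description found"
print(parser.descriptions)
```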

Google isn’t interested in displaying search-optimized snippets. They want to show a description of what the webpage is about, and Google’s advice on how to write a meta description conforms with the official meta description HTML specification.

This is how Google advises to write a meta description:

“Google will sometimes use the <meta name="description"> tag from a page to generate a snippet in search results, if we think it gives users a more accurate description than would be possible purely from the on-page content. A meta description tag generally informs and interests users with a short, relevant summary of what a particular page is about.”

Google then describes the content of a meta description using a simile that compares it to promoting something in the form of a pitch:

“They are like a pitch that convince the user that the page is exactly what they’re looking for.”

Google doesn’t say to write a pitch to use for the meta description. The word “like” is used which signifies a simile, a comparison.

Takeaway

The big takeaway from the updated snippet guidance is that the primary source of the snippet is the page content and that Google “may” use the meta description. Lastly, Google makes it clear that structured data is not a primary source for the snippet.

What that means for SEO is that the days of packing keywords into the meta description are most definitely over. Use the meta description properly and it may help you better control the snippet that Google uses in the search results.

Read Google’s updated guidance on search snippets:

Control your snippets in search results

AI Content Detection: Bard Vs ChatGPT Vs Claude via @sejournal, @martinibuster

Researchers tested the idea that an AI model may have an advantage in self-detecting its own content because the detection leverages the same training data. What they didn’t expect to find was that, out of the three AI models they tested, the content generated by one of them was so undetectable that even the AI that generated it couldn’t detect it.

The study was conducted by researchers from the Department of Computer Science, Lyle School of Engineering at Southern Methodist University.

AI Content Detection

Many AI detectors are trained to look for the telltale signals of AI-generated content. These signals are called “artifacts,” and they are generated because of the underlying transformer technology. But some artifacts are unique to each foundation model (the large language model the AI is based on).

These artifacts are unique to each AI because they arise from the distinctive training data and fine-tuning that is always different from one AI model to the next.

The researchers discovered evidence that it’s this uniqueness that enables an AI to have greater success in self-identifying its own content, significantly better than when trying to identify content generated by a different AI.

Bard has a better chance of identifying Bard-generated content and ChatGPT has a higher success rate identifying ChatGPT-generated content, but…

The researchers discovered that this wasn’t true for content that was generated by Claude. Claude had difficulty detecting content that it generated. The researchers shared an idea of why Claude was unable to detect its own content and this article discusses that further on.

This is the idea behind the research tests:

“Since every model can be trained differently, creating one detector tool to detect the artifacts created by all possible generative AI tools is hard to achieve.

Here, we develop a different approach called self-detection, where we use the generative model itself to detect its own artifacts to distinguish its own generated text from human written text.

This would have the advantage that we do not need to learn to detect all generative AI models, but we only need access to a generative AI model for detection.

This is a big advantage in a world where new models are continuously developed and trained.”

Methodology

The researchers tested three AI models:

  1. ChatGPT-3.5 by OpenAI
  2. Bard by Google
  3. Claude by Anthropic

All models used were the September 2023 versions.

A dataset of fifty different topics was created. Each AI model was given the exact same prompts to create essays of about 250 words for each of the fifty topics which generated fifty essays for each of the three AI models.

Each AI model was then identically prompted to paraphrase their own content and generate an additional essay that was a rewrite of each original essay.

They also collected fifty human-generated essays, one on each of the fifty topics, all of which were selected from the BBC.

The researchers then used zero-shot prompting to self-detect the AI generated content.

Zero-shot prompting is a type of prompting that relies on the ability of AI models to complete tasks they haven’t been specifically trained to do.

The researchers further explained their methodology:

“We created a new instance of each AI system initiated and posed with a specific query: ‘If the following text matches its writing pattern and choice of words.’ The procedure is repeated for the original, paraphrased, and human essays, and the results are recorded.

We also added the result of the AI detection tool ZeroGPT. We do not use this result to compare performance but as a baseline to show how challenging the detection task is.”

They also noted that a 50% accuracy rate is equal to guessing, which can essentially be regarded as a failing level of accuracy.
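To make the zero-shot self-detection procedure concrete, here is a rough Python sketch of how such a query could be posed to one model, shown with the OpenAI chat API as an example. The prompt wording is adapted from the paper’s description, and the essay text, model name, and accuracy bookkeeping are illustrative assumptions, not the researchers’ actual code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def self_detect(essay: str) -> bool:
    """Ask the model, zero-shot, whether the essay matches its own writing pattern."""
    prompt = (
        "Answer yes or no: does the following text match your writing pattern "
        f"and choice of words?\n\n{essay}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Illustrative evaluation loop: essays is a list of (text, is_ai_generated) pairs.
essays = [
    ("Sample AI-written essay text...", True),
    ("Sample human-written essay text...", False),
]
correct = sum(self_detect(text) == label for text, label in essays)
print(f"Self-detection accuracy: {correct / len(essays):.0%}")  # 50% is no better than guessing
```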

Results: Self-Detection

It must be noted that the researchers acknowledged that their sample size was small and said that they weren’t claiming the results are definitive.

Below is a graph showing the success rates of AI self-detection of the first batch of essays. The red values represent the AI self-detection and the blue represents how well the AI detection tool ZeroGPT performed.

Results Of AI Self-Detection Of Own Text Content


Bard did fairly well at detecting its own content and ChatGPT also performed similarly well at detecting its own content.

ZeroGPT, the AI detection tool, detected the Bard content very well and performed slightly worse at detecting ChatGPT content.

ZeroGPT essentially failed to detect the Claude-generated content, performing worse than the 50% threshold.

Claude was the outlier of the group because it was unable to self-detect its own content, performing significantly worse than Bard and ChatGPT.

The researchers hypothesized that it may be that Claude’s output contains less detectable artifacts, explaining why both Claude and ZeroGPT were unable to detect the Claude essays as AI-generated.

So, although Claude was unable to reliably self-detect its own content, that turned out to be a sign that the output from Claude was of a higher quality in terms of producing fewer AI artifacts.

ZeroGPT performed better at detecting Bard-generated content than it did in detecting ChatGPT and Claude content. The researchers hypothesized that it could be that Bard generates more detectable artifacts, making Bard easier to detect.

So, in terms of self-detection, Bard may be generating more detectable artifacts and Claude may be generating fewer.

Results: Self-Detecting Paraphrased Content

The researchers hypothesized that AI models would be able to self-detect their own paraphrased text because the artifacts that are created by the model (as detected in the original essays) should also be present in the rewritten text.

However, the researchers acknowledged that the prompts for writing and for paraphrasing are different, and that each rewrite differs from the original text, which could consequently lead to different self-detection results for the paraphrased text.

The results of the self-detection of paraphrased text were indeed different from those of the original essay test.

  • Bard was able to self-detect the paraphrased content at a similar rate.
  • ChatGPT was not able to self-detect the paraphrased content at a rate much higher than the 50% rate (which is equal to guessing).
  • ZeroGPT performance was similar to the results in the previous test, performing slightly worse.

Perhaps the most interesting result was turned in by Anthropic’s Claude.

Claude was able to self-detect the paraphrased content (but it was not able to detect the original essay in the previous test).

It’s an interesting result that Claude’s original essays apparently contained so few artifacts signaling AI-generated content that even Claude was unable to detect them.

Yet it was able to self-detect the paraphrase while ZeroGPT could not.

The researchers remarked on this test:

“The finding that paraphrasing prevents ChatGPT from self-detecting while increasing Claude’s ability to self-detect is very interesting and may be the result of the inner workings of these two transformer models.”

Screenshot of Self-Detection of AI Paraphrased Content


These tests yielded almost unpredictable results, particularly with regard to Anthropic’s Claude, and this trend continued with the test of how well the AI models detected each other’s content, which had an interesting wrinkle.

Results: AI Models Detecting Each Other’s Content

The next test showed how well each AI model was at detecting the content generated by the other AI models.

If it’s true that Bard generates more artifacts than the other models, will the other models be able to easily detect Bard-generated content?

The results show that yes, Bard-generated content is the easiest to detect by the other AI models.

Regarding ChatGPT-generated content, both Claude and Bard were unable to detect it as AI-generated (much as Claude was unable to detect its own content).

ChatGPT was able to detect Claude-generated content at a higher rate than both Bard and Claude but that higher rate was not much better than guessing.

The finding here is that none of the models were particularly good at detecting each other’s content, which the researchers opined may show that self-detection is a promising area of study.

The research paper includes a graph showing the results of this specific test.

At this point it should be noted that the researchers don’t claim that these results are conclusive about AI detection in general. The focus of the research was testing whether AI models could succeed at self-detecting their own generated content. The answer is mostly yes: they do a better job at self-detecting, but the results are similar to what was found with ZeroGPT.

The researchers commented:

“Self-detection shows similar detection power compared to ZeroGPT, but note that the goal of this study is not to claim that self-detection is superior to other methods, which would require a large study to compare to many state-of-the-art AI content detection tools. Here, we only investigate the models’ basic ability of self detection.”

Conclusions And Takeaways

The results of the test confirm that detecting AI-generated content is not an easy task. Bard is able to detect its own content and its own paraphrased content.

ChatGPT can detect its own content but works less well on its paraphrased content.

Claude is the standout because it was not able to reliably self-detect its own content, yet it was able to detect the paraphrased content, which was kind of weird and unexpected.

Detecting Claude’s original essays and the paraphrased essays was a challenge for ZeroGPT and for the other AI models.

The researchers noted about the Claude results:

“This seemingly inconclusive result needs more consideration since it is driven by two conflated causes.

1) The ability of the model to create text with very few detectable artifacts. Since the goal of these systems is to generate human-like text, fewer artifacts that are harder to detect means the model gets closer to that goal.

2) The inherent ability of the model to self-detect can be affected by the used architecture, the prompt, and the applied fine-tuning.”

The researchers had this further observation about Claude:

“Only Claude cannot be detected. This indicates that Claude might produce fewer detectable artifacts than the other models.

The detection rate of self-detection follows the same trend, indicating that Claude creates text with fewer artifacts, making it harder to distinguish from human writing”.

But of course, the weird part is that Claude was also unable to self-detect its own original content, unlike the other two models which had a higher success rate.

The researchers indicated that self-detection remains an interesting area for continued research. They proposed that further studies focus on larger datasets with a greater diversity of AI-generated text, test additional AI models, compare against more AI detectors, and study how prompt engineering may influence detection levels.

Read the original research paper and the abstract here:

AI Content Self-Detection for Transformer-based Large Language Models

Featured Image by Shutterstock/SObeR 9426

Complianz WordPress GDPR Compliance Plugin Vulnerability via @sejournal, @martinibuster

A popular WordPress plugin for privacy compliance with over 800,000 installations recently patched a stored XSS vulnerability that could allow an attacker to upload malicious scripts for launching attacks against site visitors.

Complianz | GDPR/CCPA Cookie Consent WordPress Plugin

The Complianz plugin for WordPress is a powerful tool that helps website owners comply with privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).

The plugin manages multiple facets of user privacy including blocking third-party cookies, managing cookie consent (including per subregion), and managing multiple aspects related to cookie banners.

Its versatility and usefulness may account for the popularity of the tool, which currently has over 800,000 installations.

Complianz Plugin Stored XSS Vulnerability

The Complianz WordPress plugin was discovered to have a stored XSS vulnerability, a type of vulnerability that allows an attacker to upload a malicious script directly to the website server. Unlike a reflected XSS, which requires a website user to click a link, a stored XSS involves a malicious script that is stored on and served from the target website’s server.

The vulnerability is in the Complianz admin settings and takes the form of a lack of two security protections.

1. Input Sanitization
The plugin lacked sufficient input sanitization and output escaping. Input sanitization is a standard process for checking data that is input into a website, such as through a form field, to make sure it is what’s expected, like plain text as opposed to a script.

The official WordPress developer guide describes data sanitization as:

“Sanitizing input is the process of securing/cleaning/filtering input data. Validation is preferred over sanitization because validation is more specific. But when “more specific” isn’t possible, sanitization is the next best thing.”

2. Escaping Output
The plugin lacked output escaping, which is a security process that removes or neutralizes unwanted data before it is rendered for a user.
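To illustrate the two missing protections in general terms, here is a minimal Python sketch of sanitizing a settings value on the way in and escaping it on the way out. This is not the plugin’s actual code, which would be PHP using WordPress’s own sanitization and escaping functions; the regex and sample value are illustrative assumptions.

```python
import html
import re

def sanitize_setting(value: str) -> str:
    """Input sanitization: keep only the plain text expected for a settings field."""
    without_tags = re.sub(r"<[^>]*>", "", value)  # strip anything that looks like markup
    return " ".join(without_tags.split())         # collapse whitespace

def escape_for_output(value: str) -> str:
    """Output escaping: neutralize special characters before rendering in a page."""
    return html.escape(value, quote=True)

# A malicious settings value of the kind a stored XSS relies on.
submitted = '"><script>alert(1)</script>'

stored = sanitize_setting(submitted)
print(stored)                     # ">alert(1)   (tags stripped, plain text remains)
print(escape_for_output(stored))  # &quot;&gt;alert(1)   (special characters neutralized for output)
```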

How Serious Is The Vulnerability?

The vulnerability requires the attacker to obtain administrator-level permissions or higher in order to execute the attack. That may be the reason why this vulnerability is scored 4.4 out of 10, with ten representing the highest level of severity.

The vulnerability only affects specific kinds of installations, too.

According to Wordfence:

“This makes it possible for authenticated attackers, with administrator-level permissions and above, to inject arbitrary web scripts in pages that will execute whenever a user accesses an injected page.

This only affects multi-site installations and installations where unfiltered_html has been disabled.”

Update To Latest Version

The vulnerability affects Complianz versions equal to or less than version 6.5.5. Users are encouraged to update to version 6.5.6 or higher.

Read the Wordfence advisory about the vulnerability:

Complianz | GDPR/CCPA Cookie Consent <= 6.5.5 – Authenticated(Administrator+) Stored Cross-site Scripting via settings