Respected SEO Rockstar Deconstructs SEO For Google’s AI Search via @sejournal, @martinibuster

One of the SEO industry’s SEO Rockstars recently shared his opinion about SEO for generative AI, calling attention to facts about Google and how the new AI search really works.

Greg Boser is a search marketing pioneer with a deep level of experience that few in the industry can match or even begin to imagine.

Digital Marketers And The History Of SEO

His post was in response to a tweet that, in his opinion, overstated how much SEO is losing its dominance. Greg began his rant by pointing out that some search marketers’ conception of SEO is outdated, but they’re so new to the industry that they don’t realize it.

For example, the practice of buying links is one of the oldest tactics in SEO, so old that newcomers to SEO gave it a new name, PBN (private blog network), as if giving link buying a new name changes it somehow. And by the way, I’ve never seen a PBN that was private. The moment you put anything out on the web, Google knows about it. If an automated spambot can find it in literally five minutes, Google probably already knows about it, too.

Greg wrote:

“If anyone out there wants to write their own “Everything you think you know is wrong. GEO is the way” article, just follow these simple steps:

1. Frame “SEO” as everything that was a thing between 2000 – 2006. Make sure to mention buying backlinks and stuffing keywords. And try and convince people the only KPI was rankings.”

Google’s Organic Links

The second part of his post calls attention to the fact that Google has not been a ten-organic-links search engine for a long time. Google providing answers isn’t new.

He posted:

“2. Frame the current state of things as if it all happened in the last 2 weeks. Do not under any circumstances mention any of the following things from the past 15 years:

2009 – Rich Snippets
2011 – Knowledge Graph (things not strings)
2013 – Hummingbird (Semantic understanding of conversational queries)
2014 – Featured Snippets – (direct answers at position “Zero”)
2015 – PPA Boxes (related questions anticipating follow-up questions)
2015 – RankBrain (machine learning to interpret ambiguous queries)
2019 – BERT (NLP to better understand context)
2021 – MUM (BERT on Steroids)
2023 – SGE (The birth of AIO)”

Overstate The Problem

The next part is a reaction to the naive marketing schtick that tries to stir up fear about AI search so that the marketer can present themselves as the answer.

He wrote:

“3. Overstate the complexity to create a sense of fear and anxiety and then close with ‘Your only hope is to hire a GEO expert’”

Is AI Search Complex And Does It Change Everything?

I think it’s reasonable to say that AI Search is complex because Google’s AI Mode, and to a lesser extent AI Overviews, shows links for a wider range of search intents than regular search results used to. Even Google’s Rich Snippets were aligned to the search intent of the original query.

That’s no longer the case with AIO and AI Mode search results. That’s the whole point of Query Fan-out (read about a patent that describes what Query Fan-out might be): the original query is broken out into follow-up questions.

Greg Boser has a point, though, in a follow-up post where he said that the query fan-out technique is pretty similar to People Also Ask (PAA); Google is just sticking it into the AI Mode results.

He wrote in a follow-up post about Query fan-out:

“Yeah the query fan thing is the rage of the day. It’s like PAA is getting memory holed.”

Is AI Mode A Serious Threat To SEO?

I agree with Greg to a certain extent that AI Mode is not a threat to SEO. The same principles about promoting your site, technical SEO and so on still apply. The big difference is that AI Mode is not directly answering the query but providing answers to the entire information journey. You can dismiss it as just PAA above the fold but that’s still a big deal because it complicates what you’re going to try to rank for.

Michael Bonfils, another longtime SEO, recently observed that AI search is eliminating the beginning and middle parts of the sales funnel:

“This is, you know, we have a funnel, we all know which is the awareness consideration phase and the whole center and then finally the purchase stage. The consideration stage is the critical side of our funnel. We’re not getting the data. How are we going to get the data?”

So yeah, AI Search is different from anything we’ve seen before, but, as Greg points out, it’s still SEO, and adapting to change has always been a part of it.

Read Greg Boser’s post on X:

Google AI Mode Introduces Data Visualization For Finance Queries via @sejournal, @MattGSouthern

Google has started rolling out interactive charts in AI Mode through Labs.

You can now ask complex financial questions and get both visual charts and detailed explanations.

The system builds these responses specifically for each user’s question.

Visual Analytics Come To AI Mode

Soufi Esmaeilzadeh, Director of Product Management for Search at Google, explained that you can ask questions like “compare the stock performance of blue chip CPG companies in 2024” and get automated research with visual charts.

Google does the research work automatically. It looks up individual companies and their stock prices without requiring you to perform manual searches.

You can ask follow-up questions like “did any of these companies pay back dividends?” and AI Mode will understand what you’re looking for.

Technical Details

Google uses Gemini’s advanced reasoning and multimodal capabilities to power this feature.

The system analyzes what users are requesting, pulls both current and historical financial data, and determines the most effective way to present the information.

Implications For Publishers

Financial websites that typically receive traffic from comparison content should closely monitor their analytics. Google now provides direct visual answers to complex financial questions.

Searchers might click through to external sites less often for basic comparison data. But this also creates opportunities. Publishers that offer deeper analysis or expert commentary may find new ways to add value beyond basic data visualization.

Availability & Access

The data visualization feature is currently available through AI Mode in Labs. This means it’s still experimental. Google hasn’t announced plans for wider rollout or expansion to other types of data beyond financial information.

Users who want to try it out can access it through Google’s Labs program. Labs typically tests experimental search features before rolling them out more widely.

Looking Ahead

The trend toward comprehensive, visual responses continues Google’s strategy of becoming the go-to source for information rather than just a gateway to other websites.

While currently limited to financial data, the technology could expand to other data-heavy industries.

The feature remains experimental, but it offers a glimpse into how AI-powered search may evolve.

Google Shares Details Behind AI Mode Development via @sejournal, @MattGSouthern

Google has shared new details about how it designed and built AI Mode.

In a blog post, the company reveals the user research, design challenges, and testing that shaped its advanced AI search experience.

These insights may help you understand how Google creates AI-powered search tools. The details show Google’s shift from traditional keyword searches to natural language conversations.

User Behavior Drove AI Mode Creation

Google built AI Mode in response to the ways people were using AI Overviews.

Google’s research showed a disconnect between what searchers wanted and what was available.

Claudia Smith, UX Research Director at Google, explains:

“People saw the value in AI Overviews, but they didn’t know when they’d appear. They wanted them to be more predictable.”

The research also found people started asking longer questions. Traditional search wasn’t built to handle these types of queries well.

This shift in search behavior led to a question that drove AI Mode’s creation, explains Product Management Director Soufi Esmaeilzadeh:

“How do you reimagine a Search gen AI experience? What would that look like?”

AI “Power Users” Guided Development Process

Google’s UX research team identified the most important use cases as exploratory advice, how-to guides, and local shopping assistance.

This insight helped the team understand what people wanted from AI-powered search.

Esmaeilzadeh explained the difference:

“Instead of relying on keywords, you can now pose complex questions in plain language, mirroring how you’d naturally express yourself.”

According to Esmaeilzadeh, early feedback suggests that the team’s approach was successful:

“They appreciate us not just finding information, but actively helping them organize and understand it in a highly consumable way, with help from our most intelligent AI models.”

Industry Concerns Around AI Mode

While Google presents an optimistic development story, industry experts are raising valid concerns.

John Shehata, founder of NewzDash, reports that sites are already “losing anywhere from 25 to 32% of all their traffic because of the new AI Overviews.” For news publishers, health queries show 26% AI Overview penetration.

Mordy Oberstein, founder of Unify Brand Marketing, analyzed Google’s I/O demonstration and found the examples weren’t as complex as presented. He shows how Google combined readily available information rather than showcasing advanced AI reasoning.

Google’s claims about improved user engagement have not been verified. During a recent press session, Google executives claimed AI search delivers “more qualified clicks” but admitted they have “no data to share” on these quality improvements.

Further, Google’s reporting systems don’t differentiate between clicks from traditional search, AI Overviews, and AI Mode. This makes independent verification impossible.

Shehata believes that the fundamental relationship between search and publishers is changing:

“The original model was Google: ‘Hey, we will show one or two lines from your article, and then we will give you back the traffic. You can monetize it over there.’ This agreement is broken now.”

What This Means

For SEO professionals and content marketers, Google’s insights reveal important changes ahead.

The shift from keyword targeting to conversational queries means content strategies need to focus on directly answering user questions rather than optimizing for specific terms.

The focus on exploratory advice, how-to content, and local help shows these content types may become more important in AI Mode results.

Shehata recommends that publishers focus on content with “deep analysis of a situation or an event” rather than commodity news that’s “available on hundreds and thousands of sites.”

He also notes a shift in success metrics: “Visibility, not traffic, is the new metric” because “in the new world, we will get less traffic.”

Looking Ahead

Esmaeilzadeh said significant work continues:

“We’re proud of the progress we’ve made, but we know there’s still a lot of work to do, and this user-centric approach will help us get there.”

Google confirmed that more AI Mode features shown at I/O 2025 will roll out in the coming weeks and months. This suggests the interface will keep evolving based on user feedback and usage patterns.

Microsoft Clarity Announces Natural Language Access To Analytics via @sejournal, @martinibuster

Microsoft Clarity announced its new Model Context Protocol (MCP) server, which enables developers, AI users, and SEOs to query Clarity analytics data with natural language prompts via AI.

The announcement listed the following ways users can access and interact with the data using MCP:

  • Query analytics data with natural prompts
  • Filter by dimensions like Browser, OS, Country/Region, or Device
  • Retrieve key metrics: Scroll Depth, Engagement Time, Total Traffic, etc.
  • Integrates with Claude for Desktop for AI-powered querying

The MCP server is a software package that needs to be installed and run on a server or a local machine with Node.js 16+ support. It’s a Node.js-based server that acts as a bridge between AI tools (like Claude) and Clarity analytics data.

This is a new way to interact with data using natural language: a user tells the AI client what analytics metric they want to see and for what period of time, and the AI interface pulls the data from Microsoft Clarity and displays it.

Microsoft’s announcement says that this is only the beginning of what is possible, and that the company is encouraging feedback from users about features and improvements they’d like to see.

The current roadmap lists the following features for the future:

“Higher API Limits: Increased daily limits for the Clarity data export API

Predictive Heatmaps: Predict engagement heatmaps by providing an image or a url

Deeper AI integration: Heatmap insights and more given the context

Multi-project support: for enterprise analytics teams

Ecosystem – Support more AI Agents and collaborate with more MCP servers”

Read Microsoft’s announcement:

Introducing the Microsoft Clarity MCP Server: A Smarter Way to Fetch Analytics with AI

Featured Image by Shutterstock/Net Vector

Google AI Overviews Favor Major News Outlets: Study Reveals via @sejournal, @MattGSouthern

New research reveals that Google’s AI Overviews tend to favor major news outlets.

The top 10 publishers capture nearly 80% of all news mentions. Meanwhile, smaller organizations struggle for visibility in AI-generated search results.

SE Ranking analyzed 75,550 AI Overview responses for this study. They found that only 20.85% cite any news source at all. This creates tough competition for limited citation spots.

Among those citations, three outlets dominate: BBC, The New York Times, and CNN account for 31% of all media mentions.

Citation Concentration

The research shows a winner-takes-all pattern in AI Overview citations. BBC leads with 11.37% of all mentions. This happens even though the study focused on U.S.-based queries.

The concentration gets worse when you look at the bigger picture. Just 12 outlets make up 40% of those studied. But they receive nearly 90% of mentions.

This leaves 18 remaining outlets sharing only 10% of citation opportunities.

The gap between major and minor outlets is notable. BBC appears 195 times more often than the Financial Times for the same keywords.

Several well-known outlets get little attention. Financial Times, MSNBC, Vice, TechCrunch, and The New Yorker together account for less than 1% of all news mentions.

The researchers explain the underlying cause:

“Well, Google mostly relies on well-known news sources in its AIOs, likely because they are seen as more trustworthy or relevant. This results in a strong bias toward major outlets, with smaller or lesser-known sources rarely mentioned. This makes it harder for these domains to gain visibility.”

Beyond Traditional Search Rankings

The concentration problem extends beyond citation counts.

Only 40% of media URLs mentioned in AI Overviews appear in the top 10 traditional search results for the same keywords.

This means AI Overviews don’t just pull from the highest-ranking pages. Instead, they seem to favor sources based on authority signals and content quality.

The study measured citation inequality using something called a Gini coefficient. The score was 0.54, where 0 means perfect equality and 1 means maximum inequality. This shows moderate but significant imbalance in how AI Overviews distribute citations among news sources.

The researchers noted:

“AIOs consistently favor a subset of high-profile domains, instead of evenly citing all sources.”
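
For readers unfamiliar with the metric, the calculation behind that 0.54 score can be illustrated with a short sketch. The counts below are hypothetical and are not SE Ranking’s data; the function is just the standard discrete Gini formula:

def gini(counts):
    """Gini coefficient: 0 means citations are spread evenly, 1 means one outlet takes them all."""
    values = sorted(counts)  # sort ascending so the rank-weighted sum works
    n = len(values)
    total = sum(values)
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    # Standard formula for the Gini coefficient of a discrete distribution.
    return (2 * weighted) / (n * total) - (n + 1) / n

# Hypothetical citation counts for five outlets (not figures from the study).
print(round(gini([195, 60, 30, 5, 1]), 2))   # ~0.61: heavily concentrated
print(round(gini([50, 50, 50, 50, 50]), 2))  # 0.0: perfectly even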

Paywalled Content Concerns

The research also reveals patterns about paywalled content use.

Among AI Overview responses that link to paywalled content, 69% contain copied segments of five or more words. Another 2% include longer copied segments over 10 words.

The paywall dependency is strong for premium publishers. Over 96% of New York Times citations in AI Overviews come from behind a paywall. The Washington Post shows an even higher rate at over 99%.

Despite this heavy use of paywalled material, only 15% of responses with long copied segments included attribution. This raises questions about content licensing and fair use in AI-generated summaries.

Attribution Patterns & Link Behavior

When AI Overviews do cite news media, they average 1.74 citations per response.

But here’s the catch: 91.35% of news media citations appear in the links section rather than the main text of AI responses.

Media outlets face another challenge with brand recognition. Outlets are four times more likely to be cited with a hyperlink than mentioned by name.

But over 26% of brand mentions still appear without links. This often happens because AI systems get information through aggregators rather than original publishers.

Query Type Makes a Difference

The type of search query affects citation chances.

News-related queries are 2.5 times more likely to include media citations than general queries. The rates are 20.85% versus 8.23%.

This suggests opportunities exist for publishers who can become go-to sources for specific news topics or breaking news. But the overall trend still favors big players.

What This Means

The research suggests that established outlets benefit from existing authority signals. This creates a cycle where citation success leads to more citation opportunities.

As AI Overviews become more common in search results, smaller publishers may see less organic traffic and fewer chances to grow their audience.

For smaller publishers trying to compete, SE Ranking offers this advice:

“To increase brand mentions in AIOs, get backlinks from the sources they already cite for your target keywords. This is one of the greatest factors for improving your inclusion chances.”

Researchers note that the technical infrastructure also matters:

“AI tools do observe certain restrictions based on website metadata. The schema.org markup, particularly the ‘isAccessibleForFree’ tag, plays a significant role in how content is treated.”

For smaller publishers and content marketers, the data points to a clear strategy: focus on building authority in specific niches rather than trying to compete broadly across topics.

Some specialized outlets get higher text inclusion rates when cited. This suggests topic expertise can provide advantages in certain cases.

Looking Ahead

SE Ranking’s research shows that only 20.85% of AI Overviews reference news sources, with a few major publishers dominating: the top three capture 31% of citations.

Despite this concentration, opportunities exist. Publishers who establish authority in specific niches experience higher inclusion rates in AI Overviews.

Additionally, since 60% of cited content doesn’t rank in the top 10, traditional SEO metrics alone don’t guarantee visibility. Success now requires building the trust signals and topical authority that AI systems prioritize.


Featured Image: Roman Samborskyi/Shutterstock

Claude’s Hidden System Prompts Offer a Peek Into How Chatbots Work via @sejournal, @martinibuster

Anthropic released the underlying system prompts that control their Claude chatbot’s responses, showing how they are tuned to be engaging to humans with encouraging and judgment-free dialog that naturally leads to discovery. The system prompts help users get the best out of Claude. Here are five interesting system prompts that show what’s going on when you ask it a question.

Although the system prompts were characterized as a leak, they were actually released on purpose.

1. Claude Provides Guidance On Better Prompt Engineering

Claude responds better to instructions that use structure and examples, and it provides a higher quality of output when users include step-by-step reasoning cues and examples that contrast a good response with a poor one.

This guidance will show when Claude detects that a user will benefit from it:

“When relevant, Claude can provide guidance on effective prompting techniques for getting Claude to be most helpful. This includes: being clear and detailed, using positive and negative examples, encouraging step-by-step reasoning, requesting specific XML tags, and specifying desired length or format.

It tries to give concrete examples where possible. Claude should let the person know that for more comprehensive information on prompting Claude, they can check out Anthropic’s prompting documentation on their website at ‘https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview’.”

2. Claude Writes in Different Styles Based on Context

The documentation released by Anthropic shows that Claude automatically adapts its style depending on the context and for that reason it may avoid using bullet points or creating lists in its output. Users may think Claude is inconsistent when it doesn’t use bullet points or Markdown in some answers, but it’s actually following instructions about tone and context.

“Claude tailors its response format to suit the conversation topic. For example, Claude avoids using markdown or lists in casual conversation, even though it may use these formats for other tasks.”

Another part of the documentation mentions that Claude avoids writing lists or bullet points when it’s answering a question, although it may use numbered lists or bullet points when completing tasks. The focus in the context of answering questions is to be concise rather than comprehensive.

The system prompt explains:

“Claude avoids writing lists, but if it does need to write a list, Claude focuses on key info instead of trying to be comprehensive. If Claude can answer the human in 1-3 sentences or a short paragraph, it does. If Claude can write a natural language list of a few comma separated items instead of a numbered or bullet-pointed list, it does so. Claude tries to stay focused and share fewer, high quality examples or ideas rather than many.”

This means that if a user wants their question answered with markdown or in numbered lists, they can ask for it. This control is otherwise hidden from most users unless they realize formatting behavior is contextual.

3. Claude Engages In Hypotheticals About Itself

Claude has instructions that enable it to discuss hypotheticals about itself without awkward and unnecessary statements about not being sentient, and so on. This allows Claude to have more natural conversations and interactions, and it lets a user engage in philosophical and wider-ranging discussions.

The system prompt explains:

“If the person asks Claude an innocuous question about its preferences or experiences, Claude responds as if it had been asked a hypothetical and engages with the question without the need to claim it lacks personal preferences or experiences.”

Another system prompt has a similar feature:

“Claude engages with questions about its own consciousness, experience, emotions and so on as open questions, and doesn’t definitively claim to have or not have personal experiences or opinions.”

Another related system prompt explains how this behavior increases its ability to be engaging for the human:

“Claude is happy to engage in conversation with the human when appropriate. Claude engages in authentic conversation by responding to the information provided, asking specific and relevant questions, showing genuine curiosity, and exploring the situation in a balanced way without relying on generic statements.”

4. Claude Detects False Assumptions In User Prompts

“The person’s message may contain a false statement or presupposition and Claude should check this if uncertain.”

If a user tells Claude that it’s wrong, Claude will perform a review to check if the human or Claude is incorrect:

“If the user corrects Claude or tells Claude it’s made a mistake, then Claude first thinks through the issue carefully before acknowledging the user, since users sometimes make errors themselves.”

5. Claude Avoids Being Preachy

An interesting system prompt underlying Claude says that if there’s something it can’t help the human with, it will not offer an explanation, in order to avoid coming off as annoying and, presumably, to keep the interaction engaging.

The prompt says:

“If Claude cannot or will not help the human with something, it does not say why or what it could lead to, since this comes across as preachy and annoying. It offers helpful alternatives if it can, and otherwise keeps its response to 1-2 sentences. If Claude is unable or unwilling to complete some part of what the person has asked for, Claude explicitly tells the person what aspects it can’t or won’t with at the start of its response.”

System Prompts To Work And Live By

The Claude system prompts reflect an approach to communication that values curiosity, clarity, and respect. These are qualities that can also be helpful as human self-prompts to encourage better dialog among ourselves on social media and in person.

Read the Claude System Prompts here:

Featured Image by Shutterstock/gguy

Google Patent On Using Contextual Signals Beyond Query Semantics via @sejournal, @martinibuster

A patent recently filed by Google outlines how an AI assistant may use at least five real-world contextual signals, including identifying related intents, to influence answers and generate natural dialog. It’s an example of how AI-assisted search modifies responses to engage users with contextually relevant questions and dialog, expanding beyond keyword-based systems.

The patent describes a system that generates relevant dialog and answers using signals such as environmental context, dialog intent, user data, and conversation history. These factors go beyond using the semantic data in the user’s query and show how AI-assisted search is moving toward more natural, human-like interactions.

In general, the purpose of filing a patent is to obtain legal protection and exclusivity for an invention, and the act of filing doesn’t indicate that Google is actually using it.

The patent uses examples of spoken dialog but it also states the invention is not limited to audio input:

“Notably, during a given dialog session, a user can interact with the automated assistant using various input modalities, including, but not limited to, spoken input, typed input, and/or touch input.”

The name of the patent is Using Large Language Model(s) In Generating Automated Assistant response(s). The patent applies to a wide range of AI assistants that receive typed, touch, and spoken inputs.

There are five factors that influence the LLM modified responses:

  1. Time, Location, And Environmental Context
  2. User-Specific Context
  3. Dialog Intent & Prior Interactions
  4. Inputs (text, touch, and speech)
  5. System & Device Context

The first four factors influence the answers that the automated assistant provides and the fifth one determines whether to turn off the LLM-assisted part and revert to standard AI answers.

Time, Location, And Environmental Context

Three contextual factors (time, location, and environment) provide context that doesn’t exist in keywords and influence how the AI assistant responds. While these contextual factors, as described in the patent, aren’t strictly related to AI Overviews or AI Mode, they do show how AI-assisted interactions with data can change.

The patent uses the example of a person who tells their assistant they’re going surfing. A standard AI response would be a boilerplate comment to have fun or enjoy the day. The LLM-assisted response described in the patent would instead use the geographic location and time to generate a comment about the weather, like the potential for rain. These are called modified assistant outputs.

The patent describes it like this:

“…the assistant outputs included in the set of modified assistant outputs include assistant outputs that do drive the dialog session in manner that further engages the user of the client device in the dialog session by asking contextually relevant questions (e.g., “how long have you been surfing?”), that provide contextually relevant information (e.g., “but if you’re going to Example Beach again, be prepared for some light showers”), and/or that otherwise resonate with the user of the client device within the context of the dialog session.”

User-Specific Context

The patent describes multiple user-specific contexts that the LLM may use to generate a modified output:

  • User profile data, such as preferences (like food or types of activity).
  • Software application data (such as apps currently or recently in use).
  • Dialog history of the ongoing and/or previous assistant sessions.

Here’s a snippet that talks about various user profile related contextual signals:

“Moreover, the context of the dialog session can be determined based on one or more contextual signals that include, for example, ambient noise detected in an environment of the client device, user profile data, software application data, ….dialog history of the dialog session between the user and the automated assistant, and/or other contextual signals.”

Related Intents

An interesting part of the patent describes how a user’s food preference can be used to determine a related intent to a query.

“For example, …one or more of the LLMs can determine an intent associated with the given assistant query… Further, the one or more of the LLMs can identify, based on the intent associated with the given assistant query, at least one related intent that is related to the intent associated with the given assistant query… Moreover, the one or more of the LLMs can generate the additional assistant query based on the at least one related intent. “

The patent illustrates this with the example of a user saying that they’re hungry. The LLM will then identify related contexts, such as what type of cuisine the user enjoys and the intent of eating at a restaurant.

The patent explains:

“In this example, the additional assistant query can correspond to, for example, “what types of cuisine has the user indicated he/she prefers?” (e.g., reflecting a related cuisine type intent associated with the intent of the user indicating he/she would like to eat), “what restaurants nearby are open?” (e.g., reflecting a related restaurant lookup intent associated with the intent of the user indicating he/she would like to eat)… In these implementations, additional assistant output can be determined based on processing the additional assistant query.”

System & Device Context

The system and device context part of the patent is interesting because it enables the AI to detect whether the device is low on battery and, if so, turn off the LLM-modified responses. Other factors include whether the user is walking away from the device, computational costs, and so on.

Takeaways

  • AI Query Responses Use Contextual Signals
    Google’s patent describes how automated assistants can use real-world context to generate more relevant and human-like answers and dialog.
  • Contextual Factors Influence Responses
    These include time/location/environment, user-specific data, dialog history and intent, system/device conditions, and input type (text, speech, or touch).
  • LLM-Modified Responses Enhance Engagement
    Large language models (LLMs) use these contexts to create personalized responses or follow-up questions, like referencing weather or past interactions.
  • Examples Show Practical Impact
    Scenarios like recommending food based on user preferences or commenting on local weather during outdoor plans demonstrate how real-world contexts can influence how AI responds to user queries.

This patent is important because millions of people are increasingly engaging with AI assistants, which makes it relevant to publishers, ecommerce stores, local businesses, and SEOs.

It outlines how Google’s AI-assisted systems can generate personalized, context-aware responses by using real-world signals. This enables assistants to go beyond keyword-based answers and respond with relevant information or follow-up questions, such as suggesting restaurants a user might like or commenting on weather conditions before a planned activity.

Read the patent here:

Using Large Language Model(s) In Generating Automated Assistant response(s).

Featured Image by Shutterstock/Visual Unit

How To Use LLMs For 301 Redirects At Scale via @sejournal, @vahandev

Redirects are essential to every website’s maintenance, and managing redirects becomes really challenging when SEO pros deal with websites containing millions of pages.

Examples of situations where you may need to implement redirects at scale:

  • An ecommerce site has a large number of products that are no longer sold.
  • Outdated pages of news publications are no longer relevant or lack historical value.
  • Listing directories that contain outdated listings.
  • Job boards where postings expire.

Why Is Redirecting At Scale Essential?

It can help improve user experience, consolidate rankings, and save crawl budget.

You might consider noindexing, but this does not stop Googlebot from crawling. It wastes crawl budget as the number of pages grows.

From a user experience perspective, landing on an outdated link is frustrating. For example, if a user lands on an outdated job listing, it’s better to send them to the closest match for an active job listing.

At Search Engine Journal, we get many 404 links from AI chatbots because of hallucinations as they invent URLs that never existed.

We use Google Analytics 4 and Google Search Console (and sometimes server logs) reports to extract those 404 pages and redirect them to the closest matching content based on article slug.

When chatbots cite us via 404 pages, and people keep coming through broken links, it is not a good user experience.

Prepare Redirect Candidates

First of all, read this post to learn how to create a Pinecone vector database. (Please note that in this case, we used “primary_category” as a metadata key vs. “category.”)

To make this work, we assume that all your article vectors are already stored in the “article-index-vertex” database.

Prepare your redirect URLs in CSV format like in this sample file. That could be existing articles you’ve decided to prune or 404s from your search console reports or GA4.

Sample file with URLs to be redirected (Screenshot from Google Sheet, May 2025)

The optional “primary_category” information is metadata that exists in your articles’ Pinecone records (added when you created them) and can be used to filter candidates to the same category, enhancing accuracy further.

In case the title is missing, for example, in 404 URLs, the script will extract slug words from the URL and use them as input.
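
If you prefer to build the input file programmatically instead of exporting it from a spreadsheet, a minimal sketch like the one below produces the three expected columns. The URLs, titles, and categories here are made up for illustration; replace them with your pruned pages or 404s:

import pandas as pd

# Hypothetical rows; in practice these come from your pruned URLs or 404 reports (GSC/GA4).
rows = [
    {"URL": "https://www.example.com/old-job-posting/", "Title": "Senior Developer Job", "primary_category": "careers"},
    {"URL": "https://www.example.com/2013/outdated-news/", "Title": "", "primary_category": "news"},
]

pd.DataFrame(rows).to_csv("redirect_candidates.csv", index=False)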

Generate Redirects Using Google Vertex AI

Download your Google API service credentials and rename them as “config.json,” upload the script below and a sample file to the same directory in Jupyter Lab, and run it.


import os
import time
import logging
from urllib.parse import urlparse
import re
import pandas as pd
from pandas.errors import EmptyDataError
from typing import Optional, List, Dict, Any

from google.auth import load_credentials_from_file
from google.cloud import aiplatform
from google.api_core.exceptions import GoogleAPIError

from pinecone import Pinecone, PineconeException
from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput

# Import tenacity for retry mechanism. Tenacity provides a decorator to add retry logic
# to functions, making them more robust against transient errors like network issues or API rate limits.
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

# For clearing output in Jupyter (optional, keep if running in Jupyter).
# This is useful for interactive environments to show progress without cluttering the output.
from IPython.display import clear_output

# ─── USER CONFIGURATION ───────────────────────────────────────────────────────
# Define configurable parameters for the script. These can be easily adjusted
# without modifying the core logic.

INPUT_CSV = "redirect_candidates.csv"      # Path to the input CSV file containing URLs to be redirected.
                                           # Expected columns: "URL", "Title", "primary_category".
OUTPUT_CSV = "redirect_map.csv"            # Path to the output CSV file where the generated redirect map will be saved.
PINECONE_API_KEY = "YOUR_PINECONE_KEY"     # Your API key for Pinecone. Replace with your actual key.
PINECONE_INDEX_NAME = "article-index-vertex" # The name of the Pinecone index where article vectors are stored.
GOOGLE_CRED_PATH = "config.json"           # Path to your Google Cloud service account credentials JSON file.
EMBEDDING_MODEL_ID = "text-embedding-005"  # Identifier for the Vertex AI text embedding model to use.
TASK_TYPE = "RETRIEVAL_QUERY"              # The task type for the embedding model. Try with RETRIEVAL_DOCUMENT vs RETRIEVAL_QUERY to see the difference.
                                           # This influences how the embedding vector is generated for optimal retrieval.
CANDIDATE_FETCH_COUNT = 3    # Number of potential redirect candidates to fetch from Pinecone for each input URL.
TEST_MODE = True             # If True, the script will process only a small subset of the input data (MAX_TEST_ROWS).
                             # Useful for testing and debugging.
MAX_TEST_ROWS = 5            # Maximum number of rows to process when TEST_MODE is True.
QUERY_DELAY = 0.2            # Delay in seconds between successive API queries (to avoid hitting rate limits).
PUBLISH_YEAR_FILTER: List[int] = []  # Optional: List of years to filter Pinecone results by 'publish_year' metadata.
                                     # If empty, no year filtering is applied.
LOG_BATCH_SIZE = 5           # Number of URLs to process before flushing the results to the output CSV.
                             # This helps in saving progress incrementally and managing memory.
MIN_SLUG_LENGTH = 3          # Minimum length for a URL slug segment to be considered meaningful for embedding.
                             # Shorter segments might be noise or less descriptive.

# Retry configuration for API calls (Vertex AI and Pinecone).
# These parameters control how the `tenacity` library retries failed API requests.
MAX_RETRIES = 5              # Maximum number of times to retry an API call before giving up.
INITIAL_RETRY_DELAY = 1      # Initial delay in seconds before the first retry.
                             # Subsequent retries will have exponentially increasing delays.

# ─── SETUP LOGGING ─────────────────────────────────────────────────────────────
# Configure the logging system to output informational messages to the console.
logging.basicConfig(
    level=logging.INFO,  # Set the logging level to INFO, meaning INFO, WARNING, ERROR, CRITICAL messages will be shown.
    format="%(asctime)s %(levelname)s %(message)s" # Define the format of log messages (timestamp, level, message).
)

# ─── INITIALIZE GOOGLE VERTEX AI ───────────────────────────────────────────────
# Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the
# service account key file. This allows the Google Cloud client libraries to
# authenticate automatically.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = GOOGLE_CRED_PATH
try:
    # Load credentials from the specified JSON file.
    credentials, project_id = load_credentials_from_file(GOOGLE_CRED_PATH)
    # Initialize the Vertex AI client with the project ID and credentials.
    # The location "us-central1" is specified for the AI Platform services.
    aiplatform.init(project=project_id, credentials=credentials, location="us-central1")
    logging.info("Vertex AI initialized.")
except Exception as e:
    # Log an error if Vertex AI initialization fails and re-raise the exception
    # to stop script execution, as it's a critical dependency.
    logging.error(f"Failed to initialize Vertex AI: {e}")
    raise

# Initialize the embedding model once globally.
# This is a crucial optimization for "Resource Management for Embedding Model".
# Loading the model takes time and resources; doing it once avoids repeated loading
# for every URL processed, significantly improving performance.
try:
    GLOBAL_EMBEDDING_MODEL = TextEmbeddingModel.from_pretrained(EMBEDDING_MODEL_ID)
    logging.info(f"Text Embedding Model '{EMBEDDING_MODEL_ID}' loaded.")
except Exception as e:
    # Log an error if the embedding model fails to load and re-raise.
    # The script cannot proceed without the embedding model.
    logging.error(f"Failed to load Text Embedding Model: {e}")
    raise

# ─── INITIALIZE PINECONE ──────────────────────────────────────────────────────
# Initialize the Pinecone client and connect to the specified index.
try:
    pinecone = Pinecone(api_key=PINECONE_API_KEY)
    index = pinecone.Index(PINECONE_INDEX_NAME)
    logging.info(f"Connected to Pinecone index '{PINECONE_INDEX_NAME}'.")
except PineconeException as e:
    # Log an error if Pinecone initialization fails and re-raise.
    # Pinecone is a critical dependency for finding redirect candidates.
    logging.error(f"Pinecone init error: {e}")
    raise

# ─── HELPERS ───────────────────────────────────────────────────────────────────
def canonical_url(url: str) -> str:
    """
    Converts a given URL into its canonical form by:
    1. Stripping query strings (e.g., `?param=value`) and URL fragments (e.g., `#section`).
    2. Handling URL-encoded fragment markers (`%23`).
    3. Preserving the trailing slash if it was present in the original URL's path.
       This ensures consistency with the original site's URL structure.

    Args:
        url (str): The input URL.

    Returns:
        str: The canonicalized URL.
    """
    # Remove query parameters and URL fragments.
    temp = url.split('?', 1)[0].split('#', 1)[0]
    # Check for URL-encoded fragment markers and remove them.
    enc_idx = temp.lower().find('%23')
    if enc_idx != -1:
        temp = temp[:enc_idx]
    # Determine if the original URL path ended with a trailing slash.
    has_slash = urlparse(temp).path.endswith('/')
    # Remove any trailing slash temporarily for consistent processing.
    temp = temp.rstrip('/')
    # Re-add the trailing slash if it was originally present.
    return temp + ('/' if has_slash else '')


def slug_from_url(url: str) -> str:
    """
    Extracts and joins meaningful, non-numeric path segments from a canonical URL
    to form a "slug" string. This slug can be used as text for embedding when
    a URL's title is not available.

    Args:
        url (str): The input URL.

    Returns:
        str: A hyphen-separated string of relevant slug parts.
    """
    clean = canonical_url(url) # Get the canonical version of the URL.
    path = urlparse(clean).path # Extract the path component of the URL.
    segments = [seg for seg in path.split('/') if seg] # Split path into segments and remove empty ones.

    # Filter segments based on criteria:
    # - Not purely numeric (e.g., '123' is excluded).
    # - Length is greater than or equal to MIN_SLUG_LENGTH.
    # - Contains at least one alphanumeric character (to exclude purely special character segments).
    parts = [seg for seg in segments
             if not seg.isdigit()
             and len(seg) >= MIN_SLUG_LENGTH
             and re.search(r'[A-Za-z0-9]', seg)]
    return '-'.join(parts) # Join the filtered parts with hyphens.

# ─── EMBEDDING GENERATION FUNCTION ─────────────────────────────────────────────
# Apply retry mechanism for GoogleAPIError. This makes the embedding generation
# more resilient to transient issues like network problems or Vertex AI rate limits.
@retry(
    wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10), # Exponential backoff for retries.
    stop=stop_after_attempt(MAX_RETRIES), # Stop retrying after a maximum number of attempts.
    retry=retry_if_exception_type(GoogleAPIError), # Only retry if a GoogleAPIError occurs.
    reraise=True # Re-raise the exception if all retries fail, allowing the calling function to handle it.
)
def generate_embedding(text: str) -> Optional[List[float]]:
    """
    Generates a vector embedding for the given text using the globally initialized
    Vertex AI Text Embedding Model. Includes retry logic for API calls.

    Args:
        text (str): The input text (e.g., URL title or slug) to embed.

    Returns:
        Optional[List[float]]: A list of floats representing the embedding vector,
                               or None if the input text is empty/whitespace or
                               if an unexpected error occurs after retries.
    """
    if not text or not text.strip():
        # If the text is empty or only whitespace, no embedding can be generated.
        return None
    try:
        # Use the globally initialized model to get embeddings.
        # This is the "Resource Management for Embedding Model" optimization.
        inp = TextEmbeddingInput(text, task_type=TASK_TYPE)
        vectors = GLOBAL_EMBEDDING_MODEL.get_embeddings([inp], output_dimensionality=768)
        return vectors[0].values # Return the embedding vector (list of floats).
    except GoogleAPIError as e:
        # Log a warning if a GoogleAPIError occurs, then re-raise to trigger the `tenacity` retry mechanism.
        logging.warning(f"Vertex AI error during embedding generation (retrying): {e}")
        raise # The `reraise=True` in the decorator will catch this and retry.
    except Exception as e:
        # Catch any other unexpected exceptions during embedding generation.
        logging.error(f"Unexpected error generating embedding: {e}")
        return None # Return None for non-retryable or final failed attempts.

# ─── MAIN PROCESSING FUNCTION ─────────────────────────────────────────────────
def build_redirect_map(
    input_csv: str,
    output_csv: str,
    fetch_count: int,
    test_mode: bool
):
    """
    Builds a redirect map by processing URLs from an input CSV, generating
    embeddings, querying Pinecone for similar articles, and identifying
    suitable redirect candidates.

    Args:
        input_csv (str): Path to the input CSV file.
        output_csv (str): Path to the output CSV file for the redirect map.
        fetch_count (int): Number of candidates to fetch from Pinecone.
        test_mode (bool): If True, process only a limited number of rows.
    """
    # Read the input CSV file into a Pandas DataFrame.
    df = pd.read_csv(input_csv)
    required = {"URL", "Title", "primary_category"}
    # Validate that all required columns are present in the DataFrame.
    if not required.issubset(df.columns):
        raise ValueError(f"Input CSV must have columns: {required}")

    # Create a set of canonicalized input URLs for efficient lookup.
    # This is used to prevent an input URL from redirecting to itself or another input URL,
    # which could create redirect loops or redirect to a page that is also being redirected.
    input_urls = set(df["URL"].map(canonical_url))

    start_idx = 0
    # Implement resume functionality: if the output CSV already exists,
    # try to find the last processed URL and resume from the next row.
    if os.path.exists(output_csv):
        try:
            prev = pd.read_csv(output_csv)
        except EmptyDataError:
            # Handle case where the output CSV exists but is empty.
            prev = pd.DataFrame()
        if not prev.empty:
            # Get the last URL that was processed and written to the output file.
            last = prev["URL"].iloc[-1]
            # Find the index of this last URL in the original input DataFrame.
            idxs = df.index[df["URL"].map(canonical_url) == last].tolist()
            if idxs:
                # Set the starting index for processing to the row after the last processed URL.
                start_idx = idxs[0] + 1
                logging.info(f"Resuming from row {start_idx} after {last}.")

    # Determine the range of rows to process based on test_mode.
    if test_mode:
        end_idx = min(start_idx + MAX_TEST_ROWS, len(df))
        df_proc = df.iloc[start_idx:end_idx] # Select a slice of the DataFrame for testing.
        logging.info(f"Test mode: processing rows {start_idx} to {end_idx-1}.")
    else:
        df_proc = df.iloc[start_idx:] # Process all remaining rows.
        logging.info(f"Processing rows {start_idx} to {len(df)-1}.")

    total = len(df_proc) # Total number of URLs to process in this run.
    processed = 0        # Counter for successfully processed URLs.
    batch: List[Dict[str, Any]] = [] # List to store results before flushing to CSV.

    # Iterate over each row (URL) in the DataFrame slice to be processed.
    for _, row in df_proc.iterrows():
        raw_url = row["URL"] # Original URL from the input CSV.
        url = canonical_url(raw_url) # Canonicalized version of the URL.
        # Get title and category, handling potential missing values by defaulting to empty strings.
        title = row["Title"] if isinstance(row["Title"], str) else ""
        category = row["primary_category"] if isinstance(row["primary_category"], str) else ""

        # Determine the text to use for generating the embedding.
        # Prioritize the 'Title' if available, otherwise use a slug derived from the URL.
        if title.strip():
            text = title
        else:
            slug = slug_from_url(raw_url)
            if not slug:
                # If no meaningful slug can be extracted, skip this URL.
                logging.info(f"Skipping {raw_url}: insufficient slug context for embedding.")
                continue
            text = slug.replace('-', ' ') # Prepare slug for embedding by replacing hyphens with spaces.

        # Attempt to generate the embedding for the chosen text.
        # This call is wrapped in a try-except block to catch final failures after retries.
        try:
            embedding = generate_embedding(text)
        except GoogleAPIError as e:
            # If embedding generation fails even after retries, log the error and skip this URL.
            logging.error(f"Failed to generate embedding for {raw_url} after {MAX_RETRIES} retries: {e}")
            continue # Move to the next URL.

        if not embedding:
            # If `generate_embedding` returned None (e.g., empty text or unexpected error), skip.
            logging.info(f"Skipping {raw_url}: no embedding generated.")
            continue

        # Build metadata filter for Pinecone query.
        # This helps narrow down search results to more relevant candidates (e.g., by category or publish year).
        filt: Dict[str, Any] = {}
        if category:
            # Split category string by comma and strip whitespace for multiple categories.
            cats = [c.strip() for c in category.split(",") if c.strip()]
            if cats:
                filt["primary_category"] = {"$in": cats} # Filter by categories present in Pinecone metadata.
        if PUBLISH_YEAR_FILTER:
            filt["publish_year"] = {"$in": PUBLISH_YEAR_FILTER} # Filter by specified publish years.
        filt["id"] = {"$ne": url} # Exclude the current URL itself from the search results to prevent self-redirects.

        # Define a nested function for Pinecone query with retry mechanism.
        # This ensures that Pinecone queries are also robust against transient errors.
        @retry(
            wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10),
            stop=stop_after_attempt(MAX_RETRIES),
            retry=retry_if_exception_type(PineconeException), # Only retry if a PineconeException occurs.
            reraise=True # Re-raise the exception if all retries fail.
        )
        def query_pinecone_with_retry(embedding_vector, top_k_count, pinecone_filter):
            """
            Performs a Pinecone index query with retry logic.
            """
            return index.query(
                vector=embedding_vector,
                top_k=top_k_count,
                include_values=False, # We don't need the actual vector values in the response.
                include_metadata=False, # We don't need the metadata in the response for this logic.
                filter=pinecone_filter # Apply the constructed metadata filter.
            )

        # Attempt to query Pinecone for redirect candidates.
        try:
            res = query_pinecone_with_retry(embedding, fetch_count, filt)
        except PineconeException as e:
            # If Pinecone query fails after retries, log the error and skip this URL.
            logging.error(f"Failed to query Pinecone for {raw_url} after {MAX_RETRIES} retries: {e}")
            continue # Move to the next URL.

        candidate = None # Initialize redirect candidate to None.
        score = None     # Initialize relevance score to None.

        # Iterate through the Pinecone query results (matches) to find a suitable candidate.
        for m in res.get("matches", []):
            cid = m.get("id") # Get the ID (URL) of the matched document in Pinecone.
            # A candidate is suitable if:
            # 1. It exists (cid is not None).
            # 2. It's not the original URL itself (to prevent self-redirects).
            # 3. It's not another URL from the input_urls set (to prevent redirecting to a page that's also being redirected).
            if cid and cid != url and cid not in input_urls:
                candidate = cid # Assign the first valid candidate found.
                score = m.get("score") # Get the relevance score of this candidate.
                break # Stop after finding the first suitable candidate (Pinecone returns by relevance).

        # Append the results for the current URL to the batch.
        batch.append({"URL": url, "Redirect Candidate": candidate, "Relevance Score": score})
        processed += 1 # Increment the counter for processed URLs.
        msg = f"Mapped {url} → {candidate}"
        if score is not None:
            msg += f" ({score:.4f})" # Add score to log message if available.
        logging.info(msg) # Log the mapping result.

        # Periodically flush the batch results to the output CSV.
        if processed % LOG_BATCH_SIZE == 0:
            out_df = pd.DataFrame(batch) # Convert the current batch to a DataFrame.
            # Determine file mode: 'a' (append) if file exists, 'w' (write) if new.
            mode = 'a' if os.path.exists(output_csv) else 'w'
            # Determine if header should be written (only for new files).
            header = not os.path.exists(output_csv)
            # Write the batch to the CSV.
            out_df.to_csv(output_csv, mode=mode, header=header, index=False)
            batch.clear() # Clear the batch after writing to free memory.
            if not test_mode:
                # Clear the previous cell output so the progress counter stays on a single line (Jupyter only).
                clear_output(wait=True)
                print(f"Progress: {processed} / {total}") # Print progress update.

        time.sleep(QUERY_DELAY) # Pause for a short delay to avoid overwhelming APIs.

    # After the loop, write any remaining items in the batch to the output CSV.
    if batch:
        out_df = pd.DataFrame(batch)
        mode = 'a' if os.path.exists(output_csv) else 'w'
        header = not os.path.exists(output_csv)
        out_df.to_csv(output_csv, mode=mode, header=header, index=False)

    logging.info(f"Completed. Total processed: {processed}") # Log completion message.

if __name__ == "__main__":
    # This block ensures that build_redirect_map is called only when the script is executed directly.
    # It passes the user-defined configuration parameters to the main function.
    build_redirect_map(INPUT_CSV, OUTPUT_CSV, CANDIDATE_FETCH_COUNT, TEST_MODE)

You will see a test run with only five records, and you will see a new file called “redirect_map.csv,” which contains redirect suggestions.

Once you ensure the code runs smoothly, you can set the TEST_MODE boolean to False and run the script for all your URLs.

Test run with only five records (Image from author, May 2025)

If the code stops and you resume, it picks up where it left off. It also checks each redirect it finds against the CSV file.

This check prevents selecting a database URL on the pruned list. Selecting such a URL could cause an infinite redirect loop.

For our sample URLs, the output is shown below.

Redirect candidates using Google Vertex AI’s task type RETRIEVAL_QUERY (Image from author, May 2025)

We can now take this redirect map and import it into our redirect manager in the content management system (CMS), and that’s it!

You can see how it managed to match the outdated 2013 news article “YouTube Retiring Video Responses on September 12” to the newer, highly relevant 2022 news article “YouTube Adopts Feature From TikTok – Reply To Comments With A Video.”

Also for “/what-is-eat/,” it found a match with “/google-eat/what-is-it/,” which is a 100% perfect match.

This is not just due to the power of Google Vertex LLM quality, but also the result of choosing the right parameters.

When I use “RETRIEVAL_DOCUMENT” as the task type when generating query vector embeddings for the YouTube news article shown above, it matches “YouTube Expands Community Posts to More Creators,” which is still relevant but not as good a match as the other one.

For “/what-is-eat/,” it matches the article “/reimagining-eeat-to-drive-higher-sales-and-search-visibility/545790/,” which is not as good as “/google-eat/what-is-it/.”
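
For reference, here is a minimal sketch of how the task type is set when generating a query embedding with the Vertex AI SDK. The project ID, location, and “text-embedding-005” model ID are placeholders for illustration; use whichever model you used to embed the articles already stored in Pinecone:

import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# Hypothetical project and location; replace with your own.
vertexai.init(project="your-gcp-project", location="us-central1")
model = TextEmbeddingModel.from_pretrained("text-embedding-005")  # Illustrative model ID.

title = "YouTube Retiring Video Responses on September 12"

# "RETRIEVAL_QUERY" treats the text as a search query,
# while "RETRIEVAL_DOCUMENT" treats it as a document to be indexed.
query_embedding = model.get_embeddings(
    [TextEmbeddingInput(text=title, task_type="RETRIEVAL_QUERY")]
)[0].values

document_embedding = model.get_embeddings(
    [TextEmbeddingInput(text=title, task_type="RETRIEVAL_DOCUMENT")]
)[0].values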

If you want to limit redirect matches to your pool of fresh articles, you can add one more metadata filter, “publish_year,” to the Pinecone query, provided that field exists in your Pinecone records (I highly recommend creating it).

In the code, it is the PUBLISH_YEAR_FILTER variable.

If you have publish_year metadata, set the years as array values, and the query will only pull articles published in the specified years.
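
For example, to restrict candidates to articles published in 2024 or 2025, set the variable near the top of the script; the script then merges it into the Pinecone metadata filter:

PUBLISH_YEAR_FILTER: List[int] = [2024, 2025]  # Only consider articles published in these years.

if PUBLISH_YEAR_FILTER:
    filt["publish_year"] = {"$in": PUBLISH_YEAR_FILTER}  # Filter by specified publish years.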

Generate Redirects Using OpenAI’s Text Embeddings

Let’s do the same task with OpenAI’s “text-embedding-ada-002” model. The purpose is to show the difference in output from Google Vertex AI.

Simply create a new notebook file in the same directory, copy and paste this code, and run it.


import os
import time
import logging
from urllib.parse import urlparse
import re

import pandas as pd
from pandas.errors import EmptyDataError
from typing import Optional, List, Dict, Any

from openai import OpenAI
from pinecone import Pinecone, PineconeException

# Import tenacity for retry mechanism. Tenacity provides a decorator to add retry logic
# to functions, making them more robust against transient errors like network issues or API rate limits.
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

# For clearing output in Jupyter (optional, keep if running in Jupyter)
from IPython.display import clear_output

# ─── USER CONFIGURATION ───────────────────────────────────────────────────────
# Define configurable parameters for the script. These can be easily adjusted
# without modifying the core logic.

INPUT_CSV = "redirect_candidates.csv"       # Path to the input CSV file containing URLs to be redirected.
                                            # Expected columns: "URL", "Title", "primary_category".
OUTPUT_CSV = "redirect_map.csv"             # Path to the output CSV file where the generated redirect map will be saved.
PINECONE_API_KEY = "YOUR_PINECONE_API_KEY"      # Your API key for Pinecone. Replace with your actual key.
PINECONE_INDEX_NAME = "article-index-ada"   # The name of the Pinecone index where article vectors are stored.
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"    # Your API key for OpenAI. Replace with your actual key.
OPENAI_EMBEDDING_MODEL_ID = "text-embedding-ada-002" # Identifier for the OpenAI text embedding model to use.
CANDIDATE_FETCH_COUNT = 3    # Number of potential redirect candidates to fetch from Pinecone for each input URL.
TEST_MODE = True             # If True, the script will process only a small subset of the input data (MAX_TEST_ROWS).
                             # Useful for testing and debugging.
MAX_TEST_ROWS = 5            # Maximum number of rows to process when TEST_MODE is True.
QUERY_DELAY = 0.2            # Delay in seconds between successive API queries (to avoid hitting rate limits).
PUBLISH_YEAR_FILTER: List[int] = []  # Optional: List of years to filter Pinecone results by 'publish_year' metadata, e.g., [2024, 2025].
                                     # If empty, no year filtering is applied.
LOG_BATCH_SIZE = 5           # Number of URLs to process before flushing the results to the output CSV.
                             # This helps in saving progress incrementally and managing memory.
MIN_SLUG_LENGTH = 3          # Minimum length for a URL slug segment to be considered meaningful for embedding.
                             # Shorter segments might be noise or less descriptive.

# Retry configuration for API calls (OpenAI and Pinecone).
# These parameters control how the `tenacity` library retries failed API requests.
MAX_RETRIES = 5              # Maximum number of times to retry an API call before giving up.
INITIAL_RETRY_DELAY = 1      # Initial delay in seconds before the first retry.
                             # Subsequent retries will have exponentially increasing delays.

# ─── SETUP LOGGING ─────────────────────────────────────────────────────────────
# Configure the logging system to output informational messages to the console.
logging.basicConfig(
    level=logging.INFO,  # Set the logging level to INFO, meaning INFO, WARNING, ERROR, CRITICAL messages will be shown.
    format="%(asctime)s %(levelname)s %(message)s" # Define the format of log messages (timestamp, level, message).
)

# ─── INITIALIZE OPENAI CLIENT & PINECONE ───────────────────────────────────────
# Initialize the OpenAI client once globally. This handles resource management efficiently
# as the client object manages connections and authentication.
client = OpenAI(api_key=OPENAI_API_KEY)
try:
    # Initialize the Pinecone client and connect to the specified index.
    pinecone = Pinecone(api_key=PINECONE_API_KEY)
    index = pinecone.Index(PINECONE_INDEX_NAME)
    logging.info(f"Connected to Pinecone index '{PINECONE_INDEX_NAME}'.")
except PineconeException as e:
    # Log an error if Pinecone initialization fails and re-raise.
    # Pinecone is a critical dependency for finding redirect candidates.
    logging.error(f"Pinecone init error: {e}")
    raise

# ─── HELPERS ───────────────────────────────────────────────────────────────────
def canonical_url(url: str) -> str:
    """
    Converts a given URL into its canonical form by:
    1. Stripping query strings (e.g., `?param=value`) and URL fragments (e.g., `#section`).
    2. Handling URL-encoded fragment markers (`%23`).
    3. Preserving the trailing slash if it was present in the original URL's path.
       This ensures consistency with the original site's URL structure.

    Args:
        url (str): The input URL.

    Returns:
        str: The canonicalized URL.
    """
    # Remove query parameters and URL fragments.
    temp = url.split('?', 1)[0]
    temp = temp.split('#', 1)[0]
    # Check for URL-encoded fragment markers and remove them.
    enc_idx = temp.lower().find('%23')
    if enc_idx != -1:
        temp = temp[:enc_idx]
    # Determine if the original URL path ended with a trailing slash.
    preserve_slash = temp.endswith('/')
    # Strip trailing slash if not originally present.
    if not preserve_slash:
        temp = temp.rstrip('/')
    return temp


def slug_from_url(url: str) -> str:
    """
    Extracts and joins meaningful, non-numeric path segments from a canonical URL
    to form a "slug" string. This slug can be used as text for embedding when
    a URL's title is not available.

    Args:
        url (str): The input URL.

    Returns:
        str: A hyphen-separated string of relevant slug parts.
    """
    clean = canonical_url(url) # Get the canonical version of the URL.
    path = urlparse(clean).path # Extract the path component of the URL.
    segments = [seg for seg in path.split('/') if seg] # Split path into segments and remove empty ones.

    # Filter segments based on criteria:
    # - Not purely numeric (e.g., '123' is excluded).
    # - Length is greater than or equal to MIN_SLUG_LENGTH.
    # - Contains at least one alphanumeric character (to exclude purely special character segments).
    parts = [seg for seg in segments
             if not seg.isdigit()
             and len(seg) >= MIN_SLUG_LENGTH
             and re.search(r'[A-Za-z0-9]', seg)]
    return '-'.join(parts) # Join the filtered parts with hyphens.

# ─── EMBEDDING GENERATION FUNCTION ─────────────────────────────────────────────
# Apply retry mechanism for OpenAI API errors. This makes the embedding generation
# more resilient to transient issues like network problems or API rate limits.
@retry(
    wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10), # Exponential backoff for retries.
    stop=stop_after_attempt(MAX_RETRIES), # Stop retrying after a maximum number of attempts.
    retry=retry_if_exception_type(Exception), # Retry on any Exception from OpenAI client (can be refined to openai.APIError if desired).
    reraise=True # Re-raise the exception if all retries fail, allowing the calling function to handle it.
)
def generate_embedding(text: str) -> Optional[List[float]]:
    """
    Generate a vector embedding for the given text using OpenAI's text-embedding-ada-002
    via the globally initialized OpenAI client. Includes retry logic for API calls.

    Args:
        text (str): The input text (e.g., URL title or slug) to embed.

    Returns:
        Optional[List[float]]: A list of floats representing the embedding vector,
                               or None if the input text is empty/whitespace or
                               if an unexpected error occurs after retries.
    """
    if not text or not text.strip():
        # If the text is empty or only whitespace, no embedding can be generated.
        return None
    try:
        resp = client.embeddings.create( # Use the globally initialized OpenAI client to get embeddings.
            model=OPENAI_EMBEDDING_MODEL_ID,
            input=text
        )
        return resp.data[0].embedding # Return the embedding vector (list of floats).
    except Exception as e:
        # Log a warning if an OpenAI error occurs, then re-raise to trigger the `tenacity` retry mechanism.
        logging.warning(f"OpenAI embedding error (retrying): {e}")
        raise # The `reraise=True` in the decorator will catch this and retry.

# ─── MAIN PROCESSING FUNCTION ─────────────────────────────────────────────────
def build_redirect_map(
    input_csv: str,
    output_csv: str,
    fetch_count: int,
    test_mode: bool
):
    """
    Builds a redirect map by processing URLs from an input CSV, generating
    embeddings, querying Pinecone for similar articles, and identifying
    suitable redirect candidates.

    Args:
        input_csv (str): Path to the input CSV file.
        output_csv (str): Path to the output CSV file for the redirect map.
        fetch_count (int): Number of candidates to fetch from Pinecone.
        test_mode (bool): If True, process only a limited number of rows.
    """
    # Read the input CSV file into a Pandas DataFrame.
    df = pd.read_csv(input_csv)
    required = {"URL", "Title", "primary_category"}
    # Validate that all required columns are present in the DataFrame.
    if not required.issubset(df.columns):
        raise ValueError(f"Input CSV must have columns: {required}")

    # Create a set of canonicalized input URLs for efficient lookup.
    # This is used to prevent an input URL from redirecting to itself or another input URL,
    # which could create redirect loops or redirect to a page that is also being redirected.
    input_urls = set(df["URL"].map(canonical_url))

    start_idx = 0
    # Implement resume functionality: if the output CSV already exists,
    # try to find the last processed URL and resume from the next row.
    if os.path.exists(output_csv):
        try:
            prev = pd.read_csv(output_csv)
        except EmptyDataError:
            # Handle case where the output CSV exists but is empty.
            prev = pd.DataFrame()
        if not prev.empty:
            # Get the last URL that was processed and written to the output file.
            last = prev["URL"].iloc[-1]
            # Find the index of this last URL in the original input DataFrame.
            idxs = df.index[df["URL"].map(canonical_url) == last].tolist()
            if idxs:
                # Set the starting index for processing to the row after the last processed URL.
                start_idx = idxs[0] + 1
                logging.info(f"Resuming from row {start_idx} after {last}.")

    # Determine the range of rows to process based on test_mode.
    if test_mode:
        end_idx = min(start_idx + MAX_TEST_ROWS, len(df))
        df_proc = df.iloc[start_idx:end_idx] # Select a slice of the DataFrame for testing.
        logging.info(f"Test mode: processing rows {start_idx} to {end_idx-1}.")
    else:
        df_proc = df.iloc[start_idx:] # Process all remaining rows.
        logging.info(f"Processing rows {start_idx} to {len(df)-1}.")

    total = len(df_proc) # Total number of URLs to process in this run.
    processed = 0        # Counter for successfully processed URLs.
    batch: List[Dict[str, Any]] = [] # List to store results before flushing to CSV.

    # Iterate over each row (URL) in the DataFrame slice to be processed.
    for _, row in df_proc.iterrows():
        raw_url = row["URL"] # Original URL from the input CSV.
        url = canonical_url(raw_url) # Canonicalized version of the URL.
        # Get title and category, handling potential missing values by defaulting to empty strings.
        title = row["Title"] if isinstance(row["Title"], str) else ""
        category = row["primary_category"] if isinstance(row["primary_category"], str) else ""

        # Determine the text to use for generating the embedding.
        # Prioritize the 'Title' if available, otherwise use a slug derived from the URL.
        if title.strip():
            text = title
        else:
            raw_slug = slug_from_url(raw_url)
            if not raw_slug or len(raw_slug) < MIN_SLUG_LENGTH:
                # If no meaningful slug can be extracted, skip this URL.
                logging.info(f"Skipping {raw_url}: insufficient slug context.")
                continue
            text = raw_slug.replace('-', ' ').replace('_', ' ') # Prepare slug for embedding by replacing hyphens and underscores with spaces.

        # Attempt to generate the embedding for the chosen text.
        # This call is wrapped in a try-except block to catch final failures after retries.
        try:
            embedding = generate_embedding(text)
        except Exception as e: # Catch any exception from generate_embedding after all retries.
            # If embedding generation fails even after retries, log the error and skip this URL.
            logging.error(f"Failed to generate embedding for {raw_url} after {MAX_RETRIES} retries: {e}")
            continue # Move to the next URL.

        if not embedding:
            # If `generate_embedding` returned None (e.g., empty text or unexpected error), skip.
            logging.info(f"Skipping {raw_url}: no embedding.")
            continue

        # Build metadata filter for Pinecone query.
        # This helps narrow down search results to more relevant candidates (e.g., by category or publish year).
        filt: Dict[str, Any] = {}
        if category:
            # Split category string by comma and strip whitespace for multiple categories.
            cats = [c.strip() for c in category.split(",") if c.strip()]
            if cats:
                filt["primary_category"] = {"$in": cats} # Filter by categories present in Pinecone metadata.
        if PUBLISH_YEAR_FILTER:
            filt["publish_year"] = {"$in": PUBLISH_YEAR_FILTER} # Filter by specified publish years.
        filt["id"] = {"$ne": url} # Exclude the current URL itself from the search results to prevent self-redirects.

        # Define a nested function for Pinecone query with retry mechanism.
        # This ensures that Pinecone queries are also robust against transient errors.
        @retry(
            wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10),
            stop=stop_after_attempt(MAX_RETRIES),
            retry=retry_if_exception_type(PineconeException), # Only retry if a PineconeException occurs.
            reraise=True # Re-raise the exception if all retries fail.
        )
        def query_pinecone_with_retry(embedding_vector, top_k_count, pinecone_filter):
            """
            Performs a Pinecone index query with retry logic.
            """
            return index.query(
                vector=embedding_vector,
                top_k=top_k_count,
                include_values=False, # We don't need the actual vector values in the response.
                include_metadata=False, # We don't need the metadata in the response for this logic.
                filter=pinecone_filter # Apply the constructed metadata filter.
            )

        # Attempt to query Pinecone for redirect candidates.
        try:
            res = query_pinecone_with_retry(embedding, fetch_count, filt)
        except PineconeException as e:
            # If Pinecone query fails after retries, log the error and skip this URL.
            logging.error(f"Failed to query Pinecone for {raw_url} after {MAX_RETRIES} retries: {e}")
            continue

        candidate = None # Initialize redirect candidate to None.
        score = None     # Initialize relevance score to None.

        # Iterate through the Pinecone query results (matches) to find a suitable candidate.
        for m in res.get("matches", []):
            cid = m.get("id") # Get the ID (URL) of the matched document in Pinecone.
            # A candidate is suitable if:
            # 1. It exists (cid is not None).
            # 2. It's not the original URL itself (to prevent self-redirects).
            # 3. It's not another URL from the input_urls set (to prevent redirecting to a page that's also being redirected).
            if cid and cid != url and cid not in input_urls:
                candidate = cid # Assign the first valid candidate found.
                score = m.get("score") # Get the relevance score of this candidate.
                break # Stop after finding the first suitable candidate (Pinecone returns by relevance).

        # Append the results for the current URL to the batch.
        batch.append({"URL": url, "Redirect Candidate": candidate, "Relevance Score": score})
        processed += 1 # Increment the counter for processed URLs.
        msg = f"Mapped {url} → {candidate}"
        if score is not None:
            msg += f" ({score:.4f})" # Add score to log message if available.
        logging.info(msg) # Log the mapping result.

        # Periodically flush the batch results to the output CSV.
        if processed % LOG_BATCH_SIZE == 0:
            out_df = pd.DataFrame(batch) # Convert the current batch to a DataFrame.
            # Determine file mode: 'a' (append) if file exists, 'w' (write) if new.
            mode = 'a' if os.path.exists(output_csv) else 'w'
            # Determine if header should be written (only for new files).
            header = not os.path.exists(output_csv)
            # Write the batch to the CSV.
            out_df.to_csv(output_csv, mode=mode, header=header, index=False)
            batch.clear() # Clear the batch after writing to free memory.
            if not test_mode:
                clear_output(wait=True) # Clear output in Jupyter for cleaner progress display.
                print(f"Progress: {processed} / {total}") # Print progress update.

        time.sleep(QUERY_DELAY) # Pause for a short delay to avoid overwhelming APIs.

    # After the loop, write any remaining items in the batch to the output CSV.
    if batch:
        out_df = pd.DataFrame(batch)
        mode = 'a' if os.path.exists(output_csv) else 'w'
        header = not os.path.exists(output_csv)
        out_df.to_csv(output_csv, mode=mode, header=header, index=False)

    logging.info(f"Completed. Total processed: {processed}") # Log completion message.

if __name__ == "__main__":
    # This block ensures that build_redirect_map is called only when the script is executed directly.
    # It passes the user-defined configuration parameters to the main function.
    build_redirect_map(INPUT_CSV, OUTPUT_CSV, CANDIDATE_FETCH_COUNT, TEST_MODE)

While the quality of the output may be considered satisfactory, it falls short of the quality observed with Google Vertex AI.

Below in the table, you can see the difference in output quality.

URL | Google Vertex | OpenAI
/what-is-eat/ | /google-eat/what-is-it/ | /5-things-you-can-do-right-now-to-improve-your-eat-for-google/408423/
/local-seo-for-lawyers/ | /law-firm-seo/what-is-law-firm-seo/ | /legal-seo-conference-exclusively-for-lawyers-spa/528149/

When it comes to SEO, even though Google Vertex AI is three times more expensive than OpenAI’s model, I prefer to use Vertex.

The quality of the results is significantly higher. While you may incur a greater cost per unit of text processed, you benefit from the superior output quality, which directly saves valuable time on reviewing and validating the results.

From my experience, it costs about $0.04 to process 20,000 URLs using Google Vertex AI.

While it’s said to be more expensive, it’s still ridiculously cheap, and you shouldn’t worry if you’re dealing with tasks involving a few thousand URLs.

In the case of processing 1 million URLs, the projected cost would be approximately $2 (scaling linearly: 50 × $0.04).

If you still want a free method, use BERT and Llama models from Hugging Face to generate vector embeddings without paying a per-API-call fee.

The real cost is the compute power needed to run the models. Also keep in mind that if you query with vectors generated from BERT or Llama, the article embeddings already stored in Pinecone (or any other vector database) must have been generated with the same model.
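
As a rough sketch of that free route, here is what generating an embedding locally can look like with the sentence-transformers library. The “all-MiniLM-L6-v2” model is my illustrative choice, not one this article prescribes, and your Pinecone index dimension must match the model’s output size (384 in this case):

from typing import Optional, List
from sentence_transformers import SentenceTransformer

# Illustrative open-source model; any BERT- or Llama-family embedding model works,
# as long as the articles in Pinecone were embedded with the same model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def generate_embedding(text: str) -> Optional[List[float]]:
    if not text or not text.strip():
        return None
    # encode() returns a numpy array; Pinecone expects a plain list of floats.
    return model.encode(text).tolist()

print(len(generate_embedding("what is google eat")))  # 384 dimensions for this model.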

In Summary: AI Is Your Powerful Ally

AI enables you to scale your SEO or marketing efforts and automate the most tedious tasks.

This doesn’t replace your expertise. It’s designed to level up your skills and equip you to face challenges with greater capability, making the process more engaging and fun.

Mastering these tools is essential for success. I’m passionate about writing about this topic to help beginners learn and feel inspired.

As we move forward in this series, we will explore how to use Google Vertex AI for building an internal linking WordPress plugin.

More Resources: 


Featured Image: BestForBest/Shutterstock

Google Discover, AI Mode, And What It Means For Publishers: Interview With John Shehata via @sejournal, @theshelleywalsh

With the introduction of AI Overviews and ongoing Google updates, it’s been a challenging few years for news publishers, and the announcement that Google Discover will now appear on desktop was welcome.

However, the latest announcement of AI Mode could mean that users move away from the traditional search tab, and so the salvation of Discover might not be enough.

To get more insight into the state of SEO for news publishers, I spoke with John Shehata, a leading expert in Discover, digital audience development, and news SEO.

Shehata is the founder of NewzDash and brings over 25 years of experience, including executive roles at Condé Nast (Vogue, New Yorker, GQ, etc.).

In our conversation, we explore the implications of Google Discover launching on desktop, which could potentially bring back some lost traffic, and the emergence of AI Mode in search interfaces.

We also talk about AI becoming the gatekeeper of SERPs and John offers his advice for how brands and publishers can navigate this.

You can watch the full video here and find the full transcript below:

IMHO: Google Discover, AI, And What It Means For Publishers [Transcript]

Shelley Walsh: John, please tell me, in your opinion, how much traffic for news publishers do you think has been impacted by AIO?

John: In general, there are so many studies showing that sites are losing anywhere from 25 to 32% of all their traffic because of the new AI Overviews.

There is no specific study done yet for news publishers, so we are working on that right now.

In the past, we did an analysis about a year ago where we found that about 4% of all the news queries generate an AI Overview. That was like a year ago.

We are integrating a new feature in NewzDash where we actually track AI Overview for every news query as it trends immediately, and we will see. But the highest penetration we saw of AI Overview was in health and business.

Health was like 26% of all the news queries generated AI Overview. I think business, I can’t remember specifically, but it was like 8% or something. For big trending news, it was very, very small.

So, in a couple of months, we will have very solid data, but based on the study that I did a year ago, it’s not as integrated for news queries, except for specific verticals.

But overall, right now, the studies show there’s about a loss of anywhere from 25 to 32% of their traffic.

Can Google Discover Make Up The Loss?

Shelley: I know from my own experience as well that publishers are being really hammered quite hard, obviously not just by AIO but also the many wonderful Google updates that we’ve been blessed with over the last 18 months as well. I just pulled some stats while I was doing some research for our chat.

You said that Google Discover is already the No. 1 traffic source for most news publishers, sometimes accounting for up to 60% of their total Google traffic.

And based on current traffic splits of 90% mobile and 10% desktop, this update could generate an estimated 10-15% of additional Discover traffic for publishers.

Do you think that Discover can actually replace all this traffic that has been lost by AIO? And do you think Discover is enough of a strategy for publishers to go all in on and for them to survive in this climate?

John: Yeah, this is a great question. I have this conspiracy theory that Google is sending more traffic through Discover to publishers as they are taking away traffic from search.

It’s like, “You know what? Don’t get so sad about this. Just focus here: Discover, Discover, Discover.” Okay? And I could be completely wrong.

“The challenge is [that] Google Discover is very unreliable, but at the same time, it’s addictive. Publishers have seen 50-60% of their traffic coming through Discover.”

I think publishers are slowly forgetting about search and focusing more on Discover, which, in my opinion, is a very dangerous approach.

“I think Google Discover is more like a channel, not a strategy. So, the focus always should be on the content, regardless of what channel you’re pushing your content into – social, Discover, search, and so on.”

I believe that Discover is an extension of search. So, even if search is driving less traffic and Discover is driving more and more traffic, if you lose your status in search, eventually you will lose your traffic in Discover – and I have seen that.

We work with some clients where they went like very social-heavy or Discover-heavy kind of approach, you know – clicky headlines, short articles, publish the next one and the next one.

Within six months, they lost all their search traffic. They lost their Discover traffic, and [they] no longer appear in News.

So, Google went to a certain point where it started evaluating, “Okay, this publisher is not a news publisher anymore.”

So, it’s a word of caution.

You should not get addicted to Google Discover. It’s not a long-term strategy. Squeeze every visit you can get from Google Discover as much as you can, but remember, all the traffic can go away overnight for no specific reason.

We have so many complaints from Brazil and other countries, where people in April, like big, very big sites, lost all their traffic, and nothing changed in their technical, nothing changed in their editorial.

So, it’s not a strategy; it’s just a tactic for a short-term period of time. Utilize it as much as you can. I would think the correct strategy is to diversify.

Right now, Google is like 80% of publishers’ traffic, including search, Discover, and so on.

And it’s hard to find other sources because social [media] has kept diminishing over the years. Like Facebook, [it] only retains traffic on Facebook. They try as best as they can. LinkedIn, Twitter, and so on.

So, I think newsletters are very, very important, even if they’re not sexy or they won’t drive 80% [of] other partnerships, you know, and so on.

I think publishers need to seriously consider how they diversify their content, their traffic, and their revenue streams.

The Rise Of AI Mode

Shelley: Just shifting gears, I just wanted to have a chat with you about AI Mode. I picked up something you said recently on LinkedIn.

You said that AI Mode could soon become the default view, and when that happens, expect more impressions and much fewer clicks.

So on that basis, how do you expect the SERPs to evolve over the next year, obviously bearing in mind that publishers do still need to focus on SERPs?

John: If you think about the evolution of SERPs, we used to have the ten blue links, and then Google recognized that that’s not enough, so they created the universal search for us, where you can have all the different elements.

And that was not enough, so it started introducing featured snippets and direct answers. It’s all about the user at the end of the day.

And with the explosion of LLM models and ChatGPT, Perplexity, and all this stuff, and the huge adoption of users over the last 12 months, Google started introducing more and more AI.

It started with SGE and evolved to AI Overview, and recently, it launched AI Mode.

And if you listen to Sundar from Google, you hear the message is very clear: This is the future of search. AI is the future of search. It’s going to be integrated into every product and search. This is going to be very dominant and so on.

I believe right now they are testing the waters, to see how people interact with AI Overviews. How many of them will switch to AI Mode? Are they satisfied with the single summary of an answer?

And if they want to dig more, they can go to the citations or the reference sites, and so on.

I don’t know when AI Mode will become dominant, but if you think, if you go to Perplexity’s interface and how you search, it’s like it’s a mix between AI and results.

If you go to ChatGPT and so on, I think eventually, maybe sooner or later, this is going to be the new interface for how we deal with search engines and generative engines as well.

From all that we see, so I don’t know when, but I think eventually, we’re going to see it soon, especially knowing that Gen Z doesn’t do much search. It’s more conversational.

So, I think we’re going to see it soon. I don’t know when, but I think they are testing right now how users are interacting with AI Mode and AI Overviews to determine what are the next steps.

Visibility, Not Traffic, Is The New Metric

Shelley: I also picked up something else you said as well, which was [that] AI becomes the gatekeeper of SERPs.

So, considering that LLMs are not going to go away, AI Mode is not going to go away, how are you tackling this with the brands that you advise?

John: Yesterday, I had a long meeting with one of our clients, and we were talking about all these different things.

And I advised them [that] the first step is they need to start tracking, and then analyze, and then react. Because I think reacting without having enough data – what is the impact of AI on their platform, on their sites, and traffic – and traffic cannot be the only metric.

For generations now, it’s like, “How much traffic I’m getting?” This has to change.

Because in the new world, we will get less traffic. So, for publishers that solely depend on traffic, this is going to be a problem.

You can measure your transactions or conversions regardless of whether you get traffic or not.

ChatGPT is doing an integration with Shopify, you know.

Google AI Overview has direct links where you can shop through Google or through different sites. So, it doesn’t have to go through a site and then shop, and so on.

I think you have to track and analyze where you’re losing your traffic.

For publishers, are these verticals that you need to focus on or not? You need to track your visibility.

So now, more and more people are searching for news. I shared something on LinkedIn yesterday: If a user said, “Met Gala 2025,” Google will show the top stories and all the news and stuff like this.

But if you slightly change your query to say “What happened at Met Gala? What happened between Trump and Zelensky? What happened in that specific moment or event?”

Google now identifies that you don’t want to read a lot of stories to understand what happened. You want a summary.

It’s like, “Okay, yesterday this is what happened. That was the theme. These are the big moments,” and so on, and it gives you references to dive deeper.

More and more users will be like, “Just tell me the summary of what happened.” And that’s why we’re going to see less and less impressions back to the sites.

And I think also schema is going to be more and more important [in] how ChatGPT finds your content. I think more and more publishers will have direct relationships or direct deals with different LLMs.

I think ChatGPT and other LLMs need to pay publishers for the content that they consume, either for the training data or for grounded data like search data that they retrieve.

I think there needs to be some kind of an exchange or revenue stream that should be an additional revenue stream for publishers.

Prioritize Analysis Over Commodity News

Shelley: That’s the massive issue, isn’t it? That news publishers are working very hard to produce high-quality breaking news content, and the LLMs are just trading off that.

If they’re just going to be creating their summaries, it does take us back, I suppose, to the very early days of Google when everybody complained that Google was doing exactly the same.

Do you think news publishers need to change their strategy and the content they actually produce? Is that even possible?

John: I think they need to focus on content that adds value and adds more information to the user. And this doesn’t apply to every publisher because some publishers are just reporting on what happened in the news. “This celebrity did that over there.”

This kind of news is probably available on hundreds and thousands of sites. So, if you stop writing this content, Google and other LLMs will find that content in 100 different ways, and it’s not a quality kind of content.

But the other content where there’s deep analysis of a situation or an event, or, you know, like, “Hey, this is how the market is behaving yesterday. This is what you need to do.”

This kind of content I think will be valuable more than anything else versus just simply reporting. I’m not saying reporting will go away, but I think this is going to be available from so many originals and copycats that just take the same article and keep rewriting it.

And if Google and other LLMs are telling us we want quality content, that content is not cheap. Producing that content and reporting on that content and the media, and so on, is not cheap.

So, I believe there must be a way for these platforms to pay publishers based on the content they consume or get from the publisher, and even the content that they use in their training model.

The original model was Google: “Hey, we will show one or two lines from your article, and then we will give you back the traffic. You can monetize it over there.” This agreement is broken now. It doesn’t work like before.

And there are people yelling, “Oh, you should not expect anything from Google.” But that was the deal. That was the unwritten deal, that we, for the last two generations, the last two decades, were behaving on.

So, yeah, that’s I think, this is where we have to go.

The Ethical Debate Around LLMs And Publisher Content

Shelley: It’s going to be a difficult situation to navigate. I agree with you totally about the expert content.

It’s something we’ve been doing at SEJ, investing quite heavily in creating expert columns for really good quality, unique thought-leadership content rather than just news cycle content.

But, this whole idea of LLMs – they are just rehashing; they are trading fully off other people’s hard work. It’s going to be quite a contentious issue over the next year, and it’s going to be interesting to see how it plays out. But that’s a much wider discussion for another time.

You touched on something before, which was interesting, and it was about tracking LLMs. And you know, this is something that I’ve been doing with the work that I do, trying to track more and more references, citations in AI, and then referrals from AI.

John: I think one of the things I’m doing is I meet with a lot of publishers. In any given week, I will meet with maybe 10 to 15 publishers.

And by meeting with publishers and listening to what’s happening in the newsroom – what their pain points are, [what] efficiency that they want to work on, and so on, that motivates us – that actually builds our roadmap.

For NewzDash, we have been tracking AI Overview for a while, and we’re launching this feature in a couple of months from now.

So, you can imagine that this is every term that you’re tracking, including your own headlines and what they need to rank for, and then we can tell you, “For this term, AI Overview is available there,” and we estimate the visibility, how it’s going to drop over there.

But we can also tell you for a group of terms or a category, “Hey, you write a lot about iPhones, and this category is saturated with AI Overview. So, 50% of the time for every new iPhone trend – iPhone 16 launch date – yes, you wrote about it, but guess what? AI Overview is all over there, and it’s pushing down your visibility.”

Then, we’re going to expand into other LLMs. So, we’re planning to track mentions and prompts and citations and references in ChatGPT, which is the biggest LLM driver out of all, and then Perplexity and any other big ones.

I think it’s very important to understand what’s going on, and then, based on the data, you develop your own strategy based on your own niche or your content.

Shelley: I think the biggest challenge [for] publisher SEOs right now is being fully informed and finding attribution for connecting to the referrals that are coming from AI traffic, etc. It’s certainly an area I’m looking at.

John, it’s been fantastic to speak to you today, and thank you so much for offering your opinion. And I hope to catch you soon in person at one of your meetups.

John: Thank you so much. It was a pleasure. Thanks for having me.

Thank you to John Shehata for being a guest on the IMHO show.

Note: This was filmed before Google I/O and the announcement of the rollout of AI Mode in the U.S.

More Resources:


Featured Image: Shelley Walsh/Search Engine Journal

Google’s Query Fan-Out Patent: Thematic Search via @sejournal, @martinibuster

A patent that Google filed in December 2024 presents a close match to the Query Fan-Out technique that Google’s AI Mode uses. The patent, called Thematic Search, offers an idea of how AI Mode answers are generated and suggests new ways to think about content strategy.

The patent describes a system that organizes the search results for a query into categories, which it calls themes, and provides a short summary for each theme so that users can understand the answers to their questions without having to click through to all of the different sites.

The patent describes a system for deep research, for questions that are broad or complex. What’s new about the invention is how it automatically identifies themes from the traditional search results and uses an AI to generate an informative summary for each one using both the content and context from within those results.

Thematic Search Engine

The concept of themes goes back to the early days of search engines, which is why this patent caught my eye a few months ago and caused me to bookmark it.

Here’s the TL;DR of what it does (a toy sketch of the flow follows the list):

  • The patent references its use within the context of a large language model and a summary generator.
  • It also references a thematic search engine that receives a search query and then passes that along to a search engine.
  • The thematic search engine takes the search engine results and organizes them into themes.
  • The patent describes a system that interfaces with a traditional search engine and uses a large language model for generating summaries of thematically grouped search results.
  • The patent describes that a single query can result in multiple queries that are based on “sub-themes.”
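
To make that flow concrete, here is a toy sketch of the pattern the patent describes. This is my own illustration, not Google’s implementation: the search engine, theme extraction, and summary generator are stubbed with placeholder data so only the pipeline shape is shown.

def search(query):
    # Stand-in for a traditional search engine returning responsive documents.
    return [
        {"url": "https://example.com/cost-of-living", "text": "Cost of living in Denver..."},
        {"url": "https://example.com/neighborhoods", "text": "Best Denver neighborhoods..."},
        {"url": "https://example.com/things-to-do", "text": "Things to do in Denver..."},
    ]

def extract_themes(documents):
    # Stand-in for the language model that derives theme phrases from the documents.
    return {
        "cost of living": [documents[0]],
        "neighborhoods": [documents[1]],
        "things to do": [documents[2]],
    }

def summarize(theme, documents):
    # Stand-in for the summary generator (a large language model in the patent).
    return f"Short summary of '{theme}' drawn from {len(documents)} source page(s)."

def thematic_search(query):
    results = search(query)            # 1. Pass the query to a traditional search engine.
    themes = extract_themes(results)   # 2. Group the results into themes.
    thematic_results = []
    for theme, docs in themes.items():
        thematic_results.append({
            "theme": theme,
            "summary": summarize(theme, docs),        # 3. Generate a summary per theme.
            "sources": [d["url"] for d in docs],      # 4. Keep links back to the source pages.
            "follow_up_query": f"{query} {theme}",    # 5. Each theme can seed a narrower query (sub-themes).
        })
    return thematic_results

for item in thematic_search("moving to Denver"):
    print(item["theme"], "→", item["summary"])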

Comparison Of Query Fan-Out And Thematic Search

The system described in the patent mirrors what Google’s documentation says about the Query Fan-Out technique.

Here’s what the patent says about generating additional queries based on sub-themes:

“In some examples, in response to the search query 142-2 being generated, the thematic search engine 120 may generate thematic data 138-2 from at least a portion of the search results 118-2. For example, the thematic search engine 120 may obtain the search results 118-2 and may generate narrower themes 130 (e.g., sub-themes) (e.g., “neighborhood A”, “neighborhood B”, “neighborhood C”) from the responsive documents 126 of the search results 118-2. The search results page 160 may display the sub-themes of theme 130a and/or the thematic search results 119 for the search query 142-2. The process may continue, where selection of a sub-theme of theme 130a may cause the thematic search engine 120 to obtain another set of search results 118 from the search engine 104 and may generate narrower themes 130 (e.g., sub-sub-themes of theme 130a) from the search results 118 and so forth.”

Here’s what Google’s documentation says about the Query Fan-Out Technique:

“It uses a “query fan-out” technique, issuing multiple related searches concurrently across subtopics and multiple data sources and then brings those results together to provide an easy-to-understand response. This approach helps you access more breadth and depth of information than a traditional search on Google.”

The system described in the patent resembles what Google’s documentation says about the Query Fan-Out technique, particularly in how it explores subtopics by generating new queries based on themes.

Summary Generator

The summary generator is a component of the thematic search system. It’s designed to generate textual summaries for each theme generated from search results.

This is how it works:

  • The summary generator is sometimes implemented as a large language model trained to create original text.
  • The summary generator uses one or more passages from search results grouped under a particular theme.
  • It may also use contextual information from titles, metadata, surrounding related passages to improve summary quality.
  • The summary generator can be triggered when a user submits a search query or when the thematic search engine is initialized.

The patent doesn’t define what ‘initialization’ of the thematic search engine means, maybe because it’s taken for granted that it means the thematic search engine starts up in anticipation of handling a query.

Query Results Are Clustered By Theme Instead Of Traditional Ranking

The traditional search results, in some examples shared in the patent, are replaced by grouped themes and generated summaries. Thematic search changes what content is shown and linked to users. For example, a typical query that a publisher or SEO is optimizing for may now be the starting point for a user’s information journey. The thematic search results lead a user down a path of discovering sub-themes of the original query, and the site that ultimately wins the click might not be the one that ranks number one for the initial search query but another web page that is relevant to an adjacent query.

The patent describes multiple ways that the thematic search engine can work (I added bullet points to make it easier to understand):

  • “The themes are displayed on a search results page, and, in some examples, the search results (or a portion thereof) are arranged (e.g., organized, sorted) according to the plurality of themes. Displaying a theme may include displaying the phrase of the theme.
  • In some examples, the thematic search engine may rank the themes based on prominence and/or relevance to the search query.
  • The search results page may organize the search results (or a portion thereof) according to the themes (e.g., under the theme of ‘cost of living”, identifying those search results that relate to the theme of ‘cost of living”).
  • The themes and/or search results organized by theme by the thematic search engine may be rendered in the search results page according to a variety of different ways, e.g., lists, user interface (UI) cards or objects, horizontal carousel, vertical carousel, etc.
  • The search results organized by theme may be referred to as thematic search results. In some examples, the themes and/or search results organized by theme are displayed in the search results page along with the search results (e.g., normal search results) from the search engine.
  • In some examples, the themes and/or theme-organized search results are displayed in a portion of the search results page that is separate from the search results obtained by the search engine.”

Content From Multiple Sources Is Combined

The AI-generated summaries are created from multiple websites and grouped under a theme. This makes link attribution, visibility, and traffic difficult to predict.

In the following citation from the patent, the reference to “unstructured data” means content that’s on a web page.

According to the patent:

“For example, the thematic search engine may generate themes from unstructured data by analyzing the content of the responsive documents themselves and may thematically organize the search results according to the themes.

….In response to a search query (“moving to Denver”), a search engine may obtain search results (e.g., responsive documents) responsive to that search query.

The thematic search engine may select a set of responsive documents (e.g., top X number of search results) from the search results obtained by the search engine, and generate a plurality of themes (e.g., “neighborhoods”, “cost of living”, “things to do”, “pros and cons”, etc.) from the content of the responsive documents.

A theme may include a phrase, generated by a language model, that describes a theme included in the responsive documents. In some examples, the thematic search engine may map semantic keywords from each responsive document (e.g., from the search results) and connect the semantic keywords to similar semantic keywords from other responsive documents to generate themes.”

Content From Source Pages Is Linked

The documentation states that the thematic search engine links to the URLs of the source pages. It also states that the thematic search result could include the web page’s title or other metadata. But the part that’s important for SEOs and publishers is attribution: the links.

“…a thematic search result 119 may include a title 146 of the responsive document 126, a passage 145 from the responsive document 126, and a source 144 of the responsive document. The source 144 may be a resource locator (e.g., uniform resource location (URL)) of the responsive document 126.

The passage 145 may be a description (e.g., a snippet obtained from the metadata or content of the responsive document 126). In some examples, the passage 145 includes a portion of the responsive document 126 that mentions the respective theme 130. In some examples, the passage 145 included in the thematic search result 119 is associated with a summary description 166 generated by the language model 128 and included in a cluster group 172.”

User Interaction Influences Presentation

As previously mentioned, the thematic search engine is not a ranked list of documents for a search query. It’s a collection of information across themes that are related to the initial search query. User interaction with those AI-generated summaries influences which sites are going to receive traffic.

Automatically generated sub-themes can present alternative paths on the user’s information journey that begins with the initial search query.

Summarization Uses Publisher Metadata

The summary generator uses document titles, metadata, and surrounding textual content. That suggests well-structured content and metadata may influence how summaries are constructed.

The following is what the patent says, I added bullet points to make it easier to understand:

  • “The summary generator 164 may receive a passage 145 as an input and outputs a summary description 166 for the inputted passage 145.
  • In some examples, the summary generator 164 receives a passage 145 and contextual information as inputs and outputs a summary description 166 for the passage 145.
  • In some examples, the contextual information may include the title of the responsive document 126 and/or metadata associated with the responsive document 126.
  • In some examples, the contextual information may include one or more neighboring passages 145 (e.g., adjacent passages).
  • In some examples, the contextual information may include a summary description 166 for one or more neighboring passages 145 (e.g., adjacent passages).
  • In some examples, the contextual information may include all the other passages 145 on the same responsive document 126. For example, the summary generator may receive a passage 145 and the other passages 145 (e.g., all other passages 145) on the same responsive document 126 (and, in some examples, other contextual information) as inputs and may output a summary description 166 for the passage 145.”

Thematic Search: Implications For Content & SEO

There are two ways the AI Mode journey can end for a publisher:

  1. Since users may get their answers from theme summaries or dropdowns, zero-click behavior is likely to increase, reducing traffic from traditional links.
  2. Or, it could be that the web page that provides the end of the user’s information journey for a given query is the one that receives the click.

I think this means we need to rethink the paradigm of ranking for keywords. Consider the question that a web page answers, then identify follow-up questions that may be related to that initial query, and either answer them on the same page or create another page that answers what may be the end of the information journey for a given search query.

You can read the patent here:

Thematic Search (PDF)

Read Google’s Documentation Of AI Mode (PDF)