An AI-Powered Workflow To Solve Content Cannibalization via @sejournal, @Kevin_Indig

Your site likely suffers from at least some content cannibalization, and you might not even realize it.

Cannibalization hurts organic traffic and revenue: The impact can stretch from key pages not ranking to algorithm issues due to low domain quality.

However, cannibalization is tricky to detect, can change over time, and exists on a spectrum.

It’s the “microplastics of SEO.”

In this Memo, I’ll show you:

  1. How to identify and fix content cannibalization reliably.
  2. How to automate content cannibalization detection.
  3. An automated workflow you can try out right now: The Cannibalization Detector, my new keyword cannibalization tool.

I could never have done this without Nicole Guercia from AirOps. I designed the concept and stress-tested the automated workflow, but Nicole built the whole thing.

How To Think About Content Cannibalization The Right Way

Before jumping into the workflow, we must clarify a few guiding principles about content cannibalization that are often misunderstood.

The biggest misconception about cannibalization is that it happens on the keyword level.

It’s actually happening on the user intent level.

We all need to stop thinking about this concept as keyword cannibalization and instead as content cannibalization based on user intent.

With this in mind, cannibalization…

  • Is a moving target: When Google updates its understanding of intent during a core update, suddenly two pages can compete with each other that previously didn’t.
  • Exists on a spectrum: A page can compete with another page or several pages, with an intent overlap from 10% to 100%. It’s hard to say exactly how much overlap is fine without looking at outcomes and context.
  • Doesn’t stop at rankings: Looking for two pages that are getting a “substantial” amount of impressions or rankings for the same keyword(s) can help you spot cannibalization, but it is not a very accurate method. It’s not enough proof.
  • Needs regular check-ups: You need to check your site for cannibalization regularly and treat your content library as a “living” ecosystem.
  • Can be sneaky: Many cases are not clear-cut. For example, international content cannibalization is not obvious. A /en directory to address all English-speaking countries can compete with a /en-us directory for the U.S. market.
Image Credit: Kevin Indig

Different types of sites have fundamentally different weaknesses for cannibalization.

My model for site types is the integrator vs. aggregator model. Online retailers and other marketplaces face fundamentally different cases of cannibalization than SaaS or D2C companies.

Integrators cannibalize between pages. Aggregators cannibalize between page types.

  • With aggregators, cannibalization often happens when two page types are too similar. For example, you can have two page types that may or may not compete with each other: “points of interest in {city}” and “things to do in {city}”.
  • With integrators, cannibalization often happens when companies publish new content without maintenance and a plan for the existing content. A big part of the issue is that, past a certain number of articles, it becomes harder to keep an overview of what you have and what keywords/intent each piece targets (I’ve found the tipping point to be around 250 articles).

How To Spot Content Cannibalization

An example of content cannibalization (Image Credit: Kevin Indig)

Content cannibalization can have one or more of the following symptoms:

  • “URL flickering”: At least two URLs alternate in ranking for one or more keywords.
  • A page loses traffic and/or ranking positions after another one goes live.
  • A new page hits a ranking plateau for its main keyword and cannot break into the top 3 positions.
  • Google doesn’t index a new page or pages within the same page type.
  • Exact duplicate titles appear in Google’s search index.
  • Google reports “crawled, not indexed” or “discovered, not indexed” for URLs that don’t have thin content or technical issues.

Since Google doesn’t give us a clear signal for cannibalization, the best way to measure similarity between two or more pages is cosine similarity between their tokenized embeddings (I know, it’s a mouthful).

But this is what it means: Basically, you compare how similar two pages are by turning their text into numbers and seeing how closely those numbers point in the same direction.

Think about it like a chocolate cookie recipe:

  • Tokenization = Break down each recipe (e.g., page content) into ingredients: flour, sugar, chocolate chips, etc.
  • Embeddings = Convert each ingredient into numbers, like how much of each ingredient is used and how important each one is to the recipe’s identity.
  • Cosine Similarity = Compare the recipes mathematically. This gives you a number between 0 and 1. A score of 1 means the recipes are identical, while 0 means they’re completely different.
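
A tiny worked example: If recipe A’s ingredient vector is (2, 1, 0) and recipe B’s is (1, 1, 0), their cosine similarity is (2×1 + 1×1 + 0×0) / (√5 × √2) ≈ 3 / 3.16 ≈ 0.95 – two nearly identical recipes.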

Follow this process to scan your site and find cannibalization candidates:

  • Crawl: Scrape your site with a tool like Screaming Frog (optionally, exclude pages that have no SEO purpose) to extract the URL and meta title of each page.
  • Tokenization: Turn words in both the URL and title into pieces of words that are easier to work with. These are your tokens.
  • Embeddings: Turn the tokens into numbers to do “word math.”
  • Similarity: Calculate the cosine similarity between all URLs and meta titles.

Ideally, this gives you a shortlist of URLs and titles that are too similar.
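
If you want to prototype this first pass, here’s a minimal sketch in Python. It assumes the sentence-transformers and scikit-learn packages; the model choice, the example pages, and the 0.7 shortlist cutoff are illustrative, not prescriptive.

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# (URL, meta title) pairs from your crawl; these examples are hypothetical.
pages = [
    ("https://example.com/points-of-interest-paris", "Points of Interest in Paris"),
    ("https://example.com/things-to-do-paris", "Things to Do in Paris"),
    ("https://example.com/paris-hotels", "Best Hotels in Paris"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # handles tokenization + embeddings
embeddings = model.encode([f"{url} {title}" for url, title in pages])
scores = cosine_similarity(embeddings)

# Shortlist suspiciously similar pairs for the deeper content check.
for i, j in combinations(range(len(pages)), 2):
    if scores[i, j] > 0.7:
        print(f"{pages[i][0]} <-> {pages[j][0]}: {scores[i, j]:.2f}")
```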

In the next step, you can apply the following process to make sure they truly cannibalize each other:

  • Extract content: Clearly isolate the main content (exclude navigation, footer, ads, etc.). Maybe clean up certain elements, like stop words.
  • Chunking or tokenization: Either split content into meaningful chunks (sentences or paragraphs) or tokenize directly. I prefer the latter.
  • Embeddings: Embed the tokens.
  • Entities: Extract named entities from the tokens and weigh them higher in embeddings. In essence, you check which embeddings are “known things” and give them more power in your analysis.
  • Aggregation of embeddings: Aggregate token/chunk embeddings with weighted averaging (e.g., TF-IDF) or attention-weighted pooling.
  • Cosine similarity: Calculate cosine similarity between resulting embeddings.
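
Here’s a rough sketch of that deeper check, assuming spaCy (with its small English model) and sentence-transformers. The 1.5x entity weight is an invented illustrative value, and attention-weighted pooling is left out for brevity.

```python
import numpy as np
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")               # sentences + named entities
model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model

def page_embedding(main_content: str) -> np.ndarray:
    # Chunk the already-extracted main content into sentences and embed each.
    sentences = [s.text for s in nlp(main_content).sents]
    embeddings = model.encode(sentences)
    # Weigh chunks that contain named entities higher (1.5x is illustrative).
    weights = [1.5 if nlp(s).ents else 1.0 for s in sentences]
    return np.average(embeddings, axis=0, weights=weights)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```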

You can use my Apps Script if you’d like to try it out in Google Sheets (but I have a better alternative for you in a moment).

About cosine similarity: It’s not perfect, but good enough.

Yes, you can fine-tune embedding models for specific topics.

And yes, you can use advanced embedding models like sentence transformers on top, but this simplified process is usually sufficient. No need to make an astrophysics project out of it.

How To Fix Cannibalization

Once you’ve identified cannibalization, you should take action.

But don’t forget to adjust your long-term approach to content creation and governance. If you don’t, all this work to find and fix cannibalization is going to be a waste.

Solving Cannibalization In The Short Term

The short-term action you should take depends on the degree of cannibalization and how quickly you can act.

“Degree” means how similar the content across two or more pages is, expressed in cosine or content similarity.

Though not an exact science, in my experience, a cosine similarity higher than 0.7 is classified as “high”, while it’s “low” below a value of 0.5.
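
In code, that triage could look like this (the cutoffs are my heuristics from above, not an exact science):

```python
def classify_similarity(score: float) -> str:
    # Rough triage of a page pair's cosine similarity.
    if score > 0.7:
        return "high"      # canonicalize, noindex, or consolidate
    if score < 0.5:
        return "low"       # disambiguate intent, or noindex/remove
    return "moderate"      # gray zone: review the intent overlap manually
```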

4 ways to fix cannibalization (Image Credit: Kevin Indig)

What to do if the pages have a high degree of similarity:

  • Canonicalize or noindex the page when cannibalization happens due to technical issues like parameter URLs, or if the cannibalizing page is irrelevant for SEO, like paid landing pages. In this case, canonicalize the parameter URL to the non-parameter URL (or noindex the paid landing page).
  • Consolidate with another page when it’s not a technical issue. Consolidation means combining the content and redirecting the URLs. I suggest taking the older page and/or the worse-performing page and redirecting to a new, better page. Then, transfer any useful content to the new variant.
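
To make the consolidation case concrete, here’s a minimal sketch of the redirect side in Flask; the routes are hypothetical placeholders, and merging the useful content still happens in your CMS.

```python
from flask import Flask, redirect

app = Flask(__name__)

@app.route("/blog/keyword-cannibalization-basics")
def retired_page():
    # The older, worse-performing page permanently redirects (301) to the
    # consolidated page after its useful content has been merged over.
    return redirect("/blog/content-cannibalization-guide", code=301)
```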

What to do if the pages have a low degree of similarity:

  • Noindex or remove (status code: 410) when you don’t have the capacity or ability to make content changes.
  • Disambiguate the intent focus of the content if you have the capacity, and if the overlap is not too strong. In essence, you want to differentiate the parts of the pages that are too similar.

Solving Cannibalization In The Long Term

It’s critical to take long-term action to adjust your strategy or production process because content cannibalization is a symptom of a bigger issue, not a root cause.

(Unless we’re talking about Google changing its understanding of intent during a core algorithm update, and that has nothing to do with you or your team.)

The most critical long-term changes you need to make are:

  1. Create a content roadmap: SEO Integrators should maintain a living spreadsheet or database with all SEO-relevant URLs and their main target keywords and intent to tighten editorial oversight. Whoever is in charge of the content roadmap needs to ensure there is no overlap between articles and other page types. Writers need to have a clear target intent for new and existing content.
  2. Develop clear site architecture: The counterpart of a content roadmap for SEO Aggregators is a site architecture map, which is simply an overview of different page types and the intent they target. It’s critical to underpin the intent as you define it with example keywords that you verify on a regular basis (“Are we still ranking well for those keywords?”) to match it against Google’s understanding and competitors.

The last question is: “How do I know when content cannibalization is fixed?”

The answer is when the symptoms mentioned in the previous chapter go away:

  • Indexing issues resolve.
  • URL flickering goes away.
  • No duplicate titles appear in Google’s search index.
  • “Crawled, not indexed” or “discovered, not indexed” issues decrease.
  • Rankings stabilize and break through a plateau (if the page has no other apparent issues).

And, after working with my clients under this manual framework for years, I decided it’s time to automate it.

Introducing: A Fully Automated Cannibalization Detector

Together with Nicole, I used AirOps to build a fully automated AI workflow that goes through 37 steps to detect cannibalization within minutes.

It performs a thorough analysis of content cannibalization by examining keyword rankings, content similarity, and historical data.

Below, I’ll break down the most important steps that it automates on your behalf:

1. Initial URL Processing

The workflow extracts and normalizes the domain and brand name from the input URL.

This foundational step establishes the target website’s identity and creates the baseline for all subsequent analysis.

Image Credit: Kevin Indig
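
AirOps doesn’t publish the workflow’s internals beyond these step descriptions, but a domain and brand normalization step might look roughly like this sketch, which assumes the tldextract package and a hypothetical input URL.

```python
import tldextract

# Normalize the input URL into a registrable domain and a rough brand name.
ext = tldextract.extract("https://www.example-brand.com/blog/some-post")
domain = f"{ext.domain}.{ext.suffix}"         # "example-brand.com"
brand = ext.domain.replace("-", " ").title()  # "Example Brand" (naive guess)
print(domain, brand)
```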

2. Target Content Analysis

To ensure that the system has quality source material to analyze and compare against competitors, Step 2 involves:

  • Scraping the page.
  • Validating and analyzing the HTML structure for main content extraction.
  • Cleaning the article content and generating target embeddings.
Image Credit: Kevin Indig

3. Keyword Analysis

Step 3 reveals the target URL’s search visibility and potential vulnerabilities by:

  • Analyzing ranking keywords through Semrush data.
  • Filtering branded versus non-branded terms.
  • Identifying SERP overlap with competing URLs.
  • Conducting historical ranking analysis.
  • Determining page value based on multiple metrics.
  • Analyzing position differential changes over time.
Image Credit: Kevin Indig

4. Competing Content Analysis (Iteration Over Competing URLs)

Step 4 gathers additional context for cannibalization by iteratively processing each competing URL in the search results through the previous steps.

Image Credit: Kevin Indig

5. Final Report Generation

In the final step, the workflow cleans up the data and generates an actionable report.

Image Credit: Kevin Indig

Try The Automated Content Cannibalization Detector

Image Credit: Kevin Indig

Try the Cannibalization Detector and check out an example report.

A few things to note:

  1. This is an early version. We’re planning to optimize and improve it over time.
  2. The workflow can time out due to a high number of requests. We intentionally limit usage so as not to get overwhelmed by API calls (they cost money). We’ll monitor usage and might temporarily raise the limit. If your first attempt isn’t successful, try again in a few minutes; it might just be a temporary spike in usage.
  3. I’m an advisor to AirOps but was neither paid nor incentivized in any other way to build this workflow.

Please leave your feedback in the comments.

We’d love to hear how we can take the Cannibalization Detector to the next level!



Featured Image: Paulo Bobita/Search Engine Journal

AI Researchers Warn: Hallucinations Persist In Leading AI Models via @sejournal, @MattGSouthern

A report from the Association for the Advancement of Artificial Intelligence (AAAI) reveals a disconnect between public perceptions of AI capabilities and the reality of current technology.

Factuality remains a major unsolved challenge for even the most advanced models.

The AAAI’s “Presidential Panel on the Future of AI Research” report draws on input from 24 experienced AI researchers and survey responses from 475 participants.

Here are the findings that directly impact search and digital marketing strategies.

Leading AI Models Fail Basic Factuality Tests

Despite billions in research investment, AI factuality remains largely unsolved.

According to the report, even the most advanced models from OpenAI and Anthropic “correctly answered less than half of the questions” on new benchmarks like SimpleQA, a collection of straightforward factual questions.

The report identifies three main techniques being deployed to improve factuality:

  • Retrieval-augmented generation (RAG): Gathering relevant documents using traditional information retrieval before generating answers.
  • Automated reasoning checks: Verifying outputs against predefined rules to cull inconsistent responses.
  • Chain-of-thought (CoT): Breaking questions into smaller units and prompting the AI to reflect on tentative conclusions.
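
To make the first of these techniques concrete, here’s a toy sketch of the RAG pattern in Python. The generate() function is a stand-in for an LLM call, not a real API, and the documents are invented.

```python
# Toy RAG: retrieve the documents most relevant to the question, then
# condition the model's answer on them instead of on parameters alone.
def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    q_terms = set(question.lower().split())
    # Score each document by simple term overlap with the question.
    return sorted(docs, key=lambda d: len(q_terms & set(d.lower().split())), reverse=True)[:k]

def generate(prompt: str) -> str:
    return f"[LLM answer grounded in: {prompt[:80]}...]"  # placeholder for a model call

docs = [
    "The AAAI panel drew on input from 24 experienced AI researchers.",
    "SimpleQA is a benchmark of straightforward factual questions.",
    "Flights to Costa Rica are cheapest in May.",
]
question = "How many researchers contributed to the AAAI panel?"
context = "\n".join(retrieve(question, docs))
print(generate(f"Context:\n{context}\n\nQuestion: {question}"))
```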

However, these techniques show limited success, with 60% of AI researchers expressing pessimism that factuality issues will be “solved” in the near future.

This suggests you should prepare for continuous human oversight to ensure content and data accuracy. AI tools may speed up routine tasks, but full autonomy remains risky.

The Reality Gap: AI Capabilities vs. Public Perception

The report highlights a concerning perception gap, with 79% of AI researchers surveyed disagreeing or strongly disagreeing that “current perception of AI capabilities matches the reality.”

The report states:

“The current Generative AI Hype Cycle is the first introduction to AI for perhaps the majority of people in the world and they do not have the tools to gauge the validity of many claims.”

As of November, Gartner placed Generative AI just past the peak of inflated expectations in its Hype Cycle framework, now heading toward the “trough of disillusionment.”

For those in SEO and digital marketing, this cycle can provoke boom-or-bust investment patterns. Decision-makers might overcommit resources based on AI’s short-term promise, only to experience setbacks when performance fails to meet objectives.

Perhaps most concerning, 74% of researchers believe research directions are driven by hype rather than scientific priorities, potentially diverting resources from foundational issues like factuality.

Dr. Henry Kautz, chair of the Factuality & Trustworthiness section of the report, notes that “many of the public statements of people quite new to the field are out of line with reality,” suggesting that even expert commentary should be evaluated cautiously.

Why This Matters for SEOs & Digital Marketing

Adopting New Tools

The pressure to adopt AI tools can overshadow their limitations. Since issues of factual accuracy remain unresolved, marketers should use AI responsibly.

Conducting regular audits and seeking expert reviews can help reduce the risks of misinformation, particularly in industries regulated by YMYL (Your Money, Your Life) standards, such as finance and healthcare.

The Impact On Content Quality

AI-based content generation can lead to inaccuracies that can directly harm user trust and brand reputation. Search engines may demote websites that publish unreliable or deceptive material produced by AI.

Taking a human-plus-AI approach, where editors meticulously fact-check AI outputs, is recommended.

Navigating the Hype

Beyond content creation challenges, leaders must adopt a clear-eyed view to navigate the hype cycle. The report warns that hype can misdirect resources and overshadow more sustainable gains.

Search professionals who understand AI’s capabilities and limitations will be best positioned to make strategic decisions that deliver real value.

For more details, read the full report (PDF link).


Featured Image: patpitchaya/Shutterstock

The Role Of E-E-A-T In AI Narratives: Building Brand Authority For Search Success via @sejournal, @cshel

For over a decade, E-A-T (expertise, authoritativeness, and trustworthiness) has played a role in search rankings, first introduced in Google’s Search Quality Rater Guidelines in 2014.

But with the rise of AI-generated content and AI-synthesized answers, E-E-A-T (now including experience) is no longer just a good idea. It has become the defining factor in determining which sources AI-driven search results consider authoritative enough to cite and include in their synthesized narratives and responses.

AI Overviews and other AI-generated search features don’t just favor sites that “align with E-E-A-T principles” – they favor recognized experts.

To be cited in AI-driven answers, a brand needs to demonstrate undeniable expertise and establish itself as the authority in its field.

This means consistently producing original research, providing real-world insights, and gaining industry-wide (or broader) recognition.

In this article, we’ll explore how E-E-A-T determines visibility in AI-driven search and AI-generated answers, what challenges brands face in maintaining credibility, and strategies for ensuring that AI models and search engines rely on your content as a trusted source.

The Intersection Of E-E-A-T And AI-Generated Answers

The rise of AI-generated search results presents both opportunities and challenges for brands.

AI-powered features like Google’s AI Overviews, ChatGPT search integrations, and Perplexity AI are synthesizing answers instead of just returning traditional blue links.

This means that appearing in AI-driven answers requires more than just good SEO – it requires E-E-A-T-backed authority.

Key considerations for ensuring visibility in AI search features:

  • Experience: AI models favor content backed by first-hand knowledge. Brands that demonstrate real-world expertise through case studies, original research, and hands-on experience have a greater chance of being cited.
  • Expertise: AI-generated answers prefer sources with clear subject matter expertise. Author bylines, credentials, and expert contributions all signal trustworthiness to AI-driven search.
  • Authoritativeness: AI Overviews and LLM-generated answers prioritize brands that own their knowledge graph, are widely referenced, and are recognized leaders in their industry.
  • Trustworthiness: AI-generated content is acceptable to use (in that it is not inherently “bad” or penalized) but must be factually accurate and verifiable. Content backed by reliable sources, citations, and transparent authorship is more likely to surface in AI-generated search features.

Read More: A Candid Assessment Of AI Search & SEO

AI Overviews And E-E-A-T: What Google’s Latest Research Reveals

Google’s recent post on AI Overviews and AI Mode highlights how AI-generated search experiences are evolving and underscores the importance of E-E-A-T in shaping AI-driven responses.

Here are key takeaways that reinforce the role of E-E-A-T:

Google Integrates E-E-A-T Into AI Overviews

  • AI Overviews leverage Google’s ranking systems and Knowledge Graph to determine which sources are most authoritative. (Hint: Ensure your Knowledge Graph exists and is accurate!)
  • E-E-A-T signals directly influence which websites AI Overviews pull from, reinforcing the need for brands to establish themselves as leading authorities.

High-Quality Sources Are A Requirement

  • AI Overviews corroborate AI-generated summaries with top-ranked content, (theoretically) ensuring the information is reliable.
  • For Your Money or Your Life (YMYL) queries, the bar for trustworthiness is even higher, emphasizing the importance of expert-driven content. (This is why author biographies with CVs, other credentials, and proof of expertise are necessary.)

AI Overviews Increase Engagement With High-Quality Content

  • Google reports that users who interact with AI Overviews visit a greater diversity of websites and that click-throughs from AI Overviews are of higher quality.
  • This presents an opportunity for brands with strong E-E-A-T signals to attract engaged visitors who trust the AI-curated results (but click through to verify).

Manual And Algorithmic Safety Checks Reinforce E-E-A-T’s Importance

  • Google’s Search Quality Raters, adversarial testing, and fact-checking systems ensure AI Overviews prioritize reliable information.
  • Brands that lack E-E-A-T credentials (specifically Knowledge Graphs and other key indicators that your brand is considered authoritative) may struggle to appear in AI-generated search experiences.

Future AI Search Innovations Will Reward E-E-A-T Signals

  • Google’s experimental AI Mode in Search expands AI-generated responses using multimodal data and real-time corroboration with authoritative sources.
  • Brands with verified expertise, structured citations, and widespread recognition will have an advantage in AI-driven search.

This reinforces the need for brands to proactively establish E-E-A-T authority to maintain visibility in AI-driven search features.

Read More: AI Search Optimization: Data Finds Brand Mentions Improve Visibility

Challenges In Applying E-E-A-T To AI-Generated Search

Despite its benefits, AI-driven search presents several challenges for brands trying to maintain visibility:

1. AI Prioritizes Recognized Authorities: Simply optimizing for E-E-A-T is not enough. Brands must become the trusted source that AI search engines consistently reference.

It’s easy to optimize for or align with E-E-A-T in principle, but much more difficult to achieve in reality because some of the requirements simply aren’t within your control.

2. Potential For Misinformation: AI-generated search results can fabricate statistics, misquote sources, or create misleading narratives. Brands must actively monitor AI-generated mentions for accuracy.

3. Duplicate And Unoriginal Content: AI often pulls from widely cited knowledge bases, meaning brands that don’t produce original insights and research risk being ignored.

4. Algorithmic Bias And Filtering: AI search models prioritize widely referenced sources, which can disadvantage emerging brands. Overcoming this requires strategic partnerships, citations, and broad industry engagement.

AI’s Tendency To Be “Confidently Wrong”

A March 2025 study by the Columbia Journalism Review found that AI-powered search tools frequently provide incorrect answers with “alarming confidence.”

The study tested eight major AI search engines and found that chatbots collectively provided inaccurate answers more than 60% of the time, nearly always without acknowledging uncertainty.

Most interesting finding: Premium AI models were even more prone to confidently incorrect responses than their free counterparts, contradicting the assumption that paid AI services are more reliable.

ChatGPT, in particular, indicated uncertainty in its wrong answers only 7.5% of the time, which means that 92.5% of the time it was wrong, it was confident it was correct.

If ChatGPT’s success rate at indicating uncertainty were a batting average, it would be .075.

John Vukovich, known for recording the lowest-ever MLB batting average (for non-pitchers with more than 500 at-bats), had a career BA of .161 – still more than double ChatGPT’s rate of acknowledging it might not be right.

The findings in this report only underscore the need for careful, attentive human oversight when producing content and active reputation management to ensure accuracy in AI-generated search environments.

Read More: The Impact Of AI And Other Innovations On Data Storytelling

Strategies For Strengthening E-E-A-T In AI-Driven Search

To ensure visibility in AI-generated search results, brands must prioritize establishing true authority, not just optimizing content.

1. Own And Optimize Your Knowledge Graph

  • Ensure Google’s Knowledge Graph accurately represents your brand.
  • Claim your entity in Google Search and establish schema markup for credibility.
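
As one small, concrete step, here’s a minimal Organization schema snippet built as JSON-LD in Python; every field value is a hypothetical placeholder, and the properties your entity needs may differ.

```python
import json

# Organization schema markup (schema.org), embedded on your site inside a
# <script type="application/ld+json"> tag; all values are placeholders.
org_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Brand",
    "url": "https://www.example.com",
    "logo": "https://www.example.com/logo.png",
    "sameAs": [
        "https://www.linkedin.com/company/example-brand",
        "https://en.wikipedia.org/wiki/Example_Brand",
    ],
}

print(json.dumps(org_schema, indent=2))
```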

2. Demonstrate Real-World Expertise

  • Publish original research, case studies, and expert insights.
  • Engage in media interviews, guest contributions, and speaking engagements.

3. Become The Primary Source Of Industry Insights

4. Monitor And Influence AI Search Results

  • Actively track how AI-generated answers represent your brand.
  • Engage with AI search models via feedback loops and corrections.

5. Leverage Thought Leadership Beyond Your Website

  • Be featured on authoritative platforms, podcasts, and news outlets.
  • Participate in peer-reviewed research and industry collaborations.

Read More: What 7 SEO Experts Think About AI Overviews And Where Search Is Going

Becoming The Source AI Can’t Ignore

E-E-A-T is the key to visibility in AI-driven search – but it’s not just about optimization.

Brands must become the expert sources AI models trust, reference, and cite.

Those who invest in credibility, expertise, and real-world authority will survive in AI-powered search landscapes, and those who don’t will fade into irrelevance.



Featured Image: insta_photos/Shutterstock

    Google Rolls Out AI-Powered Travel Updates For Search & Maps via @sejournal, @MattGSouthern

    Google has released its annual summer travel trends report alongside several AI-powered updates to its travel planning tools.

    The announcement reveals shifting travel preferences while introducing enhancements to Search, Maps, Lens, and Gemini functionality.

    New AI Search and Planning Features

    Google announced five major updates to its travel planning ecosystem.

    Expanded AI Overviews

    Google has enhanced its AI Overviews in Search to generate travel recommendations for entire countries and regions, not just cities.

    You can now request specialized itineraries by entering queries like “create an itinerary for Costa Rica with a focus on nature.”

    The feature includes visual elements and the ability to export recommendations to various Google products.

    Image Credit: Google

    Price Monitoring for Hotels

    Following its flight price tracking implementation, Google has extended similar functionality to accommodations.

    When browsing google.com/hotels, you can now toggle price tracking to receive alerts when hotel rates decrease for selected dates and destinations.

    The system factors in applied filters, including amenity preferences and star ratings.

    Image Credit: Google

    Screenshot Recognition in Maps

    A new Google Maps feature can help organize travel plans by automatically identifying places mentioned in screenshots.

    Using Gemini AI capabilities, the system recognizes venues from saved images and allows users to add them to dedicated lists.

    The feature is launching first on iOS in English, with Android rollout planned.

    Gemini Travel Assistance

    Google’s Gemini AI assistant now offers enhanced travel planning support, allowing users to create “Gems” – customized AI assistants for specific travel needs.

    Now available at no cost, these specialized assistants can help with destination selection, local recommendations, and trip logistics.

    Expanded Lens Capabilities

    Google Lens continues evolving, offering enhanced AI-powered information delivery when pointing your camera at landmarks or objects.

    The feature is expanding beyond English to include Hindi, Indonesian, Japanese, Korean, Portuguese, and Spanish, complementing its existing translation capabilities.

    Image Credit: Google

    Travel Search Trends

    According to Google’s Flights and Search data analysis, travelers are increasingly drawn to coastal destinations for summer 2025.

    Caribbean islands, including Puerto Rico, Curacao, and St. Lucia, are seeing significant search growth, along with other beach destinations like Rio de Janeiro, Maui, and Nantucket.

    The data also reveals continued momentum for outdoor adventure travel within the U.S.:

    • Cities with proximity to nature experiences (Billings, Montana; Juneau, Alaska; and Bangor, Maine) are experiencing higher search volume
    • “Cabins” has emerged as the top accommodation search for romantic getaways
    • Family travelers are increasingly searching for “dude ranch” vacations
    • Weekend getaway searches concentrate on natural destinations, including upstate New York, Joshua Tree National Park, and Sedona.

    An unexpected trend in luggage preferences was also noted, with “checked bags” queries now exceeding historically dominant “carry on” searches.

    Supporting this shift, space-saving solutions like vacuum bags and compression packing cubes have become top trending travel accessory searches.

    Implications for SEO and Travel Content

    These updates signal Google’s continued investment in controlling the travel research journey within its own ecosystem.

    The expansion of AI-generated itineraries and information potentially reduces the need for users to visit traditional travel content sites during the planning phase.

    Travel brands and publishers may need to adapt their SEO and content strategies to account for these changes, focusing more on unique experiences and in-depth content beyond what Google’s AI tools can generate.

    The trend data also provides valuable insights for travel-related keyword targeting and content development as summer vacation planning begins for many consumers.

    OpenAI Rolls Out GPT-4o Image Creation To Everyone via @sejournal, @MattGSouthern

    OpenAI has rolled out a new image generation system directly integrated with GPT-4o. This system allows the AI to access its knowledge base and conversation context when creating images.

    This integration is said to enable more contextually relevant and accurate visual outputs.

    OpenAI’s announcement reads:

    “GPT‑4o image generation excels at accurately rendering text, precisely following prompts, and leveraging 4o’s inherent knowledge base and chat context—including transforming uploaded images or using them as visual inspiration. These capabilities make it easier to create exactly the image you envision, helping you communicate more effectively through visuals and advancing image generation into a practical tool with precision and power.”

    Here’s everything else you need to know.

    Technical Capabilities

    OpenAI highlights the following capabilities of its new image generation system:

    1. It accurately renders text within images.
    2. It allows users to refine images through conversation while keeping a consistent style.
    3. It supports complex prompts with up to 20 different objects.
    4. It can generate images based on uploaded references.
    5. It creates visuals using information from GPT-4o’s training data.

    OpenAI states in its announcement:

    “Because image generation is now native to GPT‑4o, you can refine images through natural conversation. GPT‑4o can build upon images and text in chat context, ensuring consistency throughout. For example, if you’re designing a video game character, the character’s appearance remains coherent across multiple iterations as you refine and experiment.”

    Examples

    To demonstrate character consistency, here’s an example showing a cat and then that same cat with a hat and monocle.

    Screenshot from: openai.com/index/introducing-4o-image-generation/, March 2025.

    Here’s a more practical example for marketers, demonstrating text generation: a full restaurant menu generated with a detailed prompt.

    Screenshot from: openai.com/index/introducing-4o-image-generation/, March 2025.

    There are dozens more examples in OpenAI’s announcement post, many of which contain several prompts and follow-ups.

    Limitations

    OpenAI admits:

    “Our model isn’t perfect. We’re aware of multiple limitations at the moment which we will work to address through model improvements after the initial launch.”

    The company notes the following limitations of its new image generation system:

    • Cropping: GPT-4o sometimes crops long images, like posters, too closely at the bottom.
    • Hallucinations: This model can create false information, especially with vague prompts.
    • High Binding Problems: It struggles to accurately depict more than 10 to 20 concepts at once, like a complete periodic table.
    • Multilingual Text: The model can have issues showing non-Latin characters, leading to errors.
    • Editing: Requests to edit specific image parts may change other areas or create new mistakes. It also struggles to keep faces consistent in uploaded images.
    • Information Density: The model has difficulty showing detailed information at small sizes.

    Search Implications

    This update changes AI image generation from mainly decorative uses to more practical functions in business and communication.

    Websites can use AI-generated images but with important considerations.

    Google’s guidelines do not prohibit AI-generated visuals, focusing instead on whether content provides value regardless of how it’s produced.

    Following these best practices is recommended:

    • Using C2PA metadata (which GPT-4o adds automatically) to maintain transparency
    • Adding proper alt text for accessibility and indexing
    • Ensuring images serve user intent rather than just filling space
    • Creating unique visuals rather than generic AI templates

    Google Search Advocate John Mueller has expressed a negative opinion regarding AI-generated images. While his personal preferences don’t influence Google’s algorithms, they may indicate how others feel about AI images.

    Screenshot from: bsky.app/profile/johnmu.com, March 2025.

    Note that Google is implementing measures to label AI-generated images in search results.

    Availability

    The feature is now available to ChatGPT users with Plus, Pro, Team, or Free plans. Access for Enterprise and Edu users will be available soon.

    Developers can expect API access in the coming weeks. Because of higher processing needs, image generation takes about one minute on average.


    Featured Image: PatrickAssale/Shutterstock

    70% Of Media Companies Not Fully Using AI, IAB Report Finds via @sejournal, @MattGSouthern

    IAB’s latest “State of Data” report reveals that despite recognizing its potential, 70% of agencies, brands, and publishers have yet to integrate AI into their campaigns fully.

    Here’s a look at the study, which examines the current use of AI in advertising, the challenges of adoption, and the opportunities for success.

    Current State of AI Adoption

    A report from the Interactive Advertising Bureau (IAB) surveyed over 500 experts and found that AI use varies across the industry:

    • 30% of companies have implemented AI in their media campaigns.
    • Agencies (37%) and publishers (34%) are more advanced in using AI compared to brands (19%).
    • Half of the companies that haven’t adopted AI plan to do so by 2026.
    • Most organizations (85%) are using general AI tools, while fewer are using custom solutions (45%) or proprietary tools (24%).

    One SVP from an undisclosed brand stated in the report:

    “We have been slow to fully implement AI into our day-to-day processes. We are wary to go ‘all in’ until it’s become a bit more of a societal norm with a long-standing track record of scalable success.”

    AI Perceptions

    Companies using AI generally have positive experiences:

    • 82% say AI meets or exceeds their efficiency expectations, saving time and costs.
    • 75% believe AI helps their media campaigns effectively.
    • 73% find AI reliable over time.

    AI excels in data-heavy tasks, like audience segmentation and targeting, but struggles with tasks needing human judgment, such as RFP management and campaign setup.

    Adoption Barriers

    The research found several barriers to adopting AI in media campaigns:

    • 62% said they’re concerned about how complex it is to set up and maintain AI.
    • 62% worry about the risk of data security.
    • 61% noted that their organizations lack AI knowledge.
    • 60% have concerns about how accurate and transparent AI is.

    Interestingly, job displacement isn’t seen as a major issue, with only 37% identifying it as a concern.

    Buy-Side vs. Sell-Side Challenges

    Agencies, brands, and publishers face unique challenges with AI:

    • Publishers struggle with complex technology (67%) and scattered capabilities (62%).
    • Brands and publishers (56% each) lack a clear AI vision.
    • Agencies encounter the most resistance to change from teammates and clients (61%).
    • Additionally, 51% of brands worry about transparency in how their partners use AI.

    Looking Ahead

    AI is changing media campaigns, and IAB’s report highlights some important points.

    First, many companies are in the early stages of adopting AI, but this is happening faster than before. Companies without clear plans risk falling behind by 2026.

    Second, companies need good data and solid governance guidelines to succeed with AI. Organizations should train their teams in best practices and set clear goals.

    Standards for transparency, privacy, and reliability are still being developed across the industry. Companies that collaborate to set these standards will be best positioned to handle this change in digital advertising.

    The full “State of Data” report is available through IAB.


    Featured Image: eamesBot/Shutterstock

    Google Search Central Live NYC: Insights On SEO For AI Overviews via @sejournal, @martinibuster

    Danny Sullivan, Google’s Search Liaison, shared insights about AI Overviews, explaining how predictive summaries, grounding links, and the query fan-out technique work together to shape AI-generated search results.

    Optimizing For AIO

    Danny Sullivan shared insights into how AI Overviews are generated, helping explain why Google may link to websites that don’t match the typical search results. While the links can differ, he emphasized that the fundamentals of search optimization remain unchanged.

    This is what Danny Sullivan said, based on my notes:

    “The core fundamental things haven’t really changed. If you’re doing things that are making you successful on search, those sorts of things should transfer into some of the things that you see in the generative AI kind of summaries.”

    Google Explains Why AIO Results Are Different

    One of the main takeaways from this part of Danny’s presentation was his explanation of why Google AIO search results are different. It’s the clearest explanation of this to date; every SEO and publisher needs to know it.

    He introduced two concepts to familiarize yourself with in order to better understand AIO search results.

    1. Predictive Summaries
    2. Grounding Links

    Predictive Summaries

    Danny solved the mystery behind AIO search results that show content and links different from what the organic search results show, which makes it harder to understand how to optimize for that kind of AI search result.

    He shared that the reason for that kind of AIO is something called predictive summaries. Predictive summaries show answers to a search query but also try to predict related variations of what a user will want to see. This sounds a lot like Google’s patent about Information Gain, which is about predicting the next question that a searcher may ask after reading the answer to their present question. The Information Gain patent applies strictly to the context of AI search and AI assistants.

    Here is what he said, according to my notes:

    “One thing I think that people find really confusing sometimes is that they’ll do a query and especially you’ll see …these are the top 10 results, but I don’t see them in the AIO, what’s going on?

    And it’s like, yeah, the query in the search box is the same query, but the model that’s going out there to try to understand what to show is kind of an overview, going beyond just the top 10 results. It’s understanding a lot of results and it’s understanding a lot of variations that you might kind of get and so that it’s coming back and it’s trying to provide its predictive summary of what the query is related to.”

    Grounding Links

    Sullivan also revealed that “grounding links” are another reason why AIO search results are different from the regular organic search results. An AIO search result is a summary of a topic that includes facts about multiple subtopics. The purpose of grounding is to anchor the entire summary to verifiable information from the web ecosystem.

    In the context of AIO, grounding is the process of confirming the factual authenticity of the AI summaries so that a searcher can click to read about any subtopic discussed in the answer summary provided by AIO. This is the second reason why the links in AIO show a variety not normally seen in the organic search results.

    One way to look at this is that the links are more contextual than the regular ten blue links of the organic search results. These contextual links are also referred to as qualified clicks or qualified links: links that are hyper-specific and generally more relevant than organic search results.

    Danny appears to say that the grounding links are created from searches that are related to the initial search query but are not the same. For example, if you want to explain how a conventional automobile runs, you need information about the powertrain, which is made up of a gas combustion engine, a transmission, the axles, and so on. Answering a complex question requires grounding from a wide array of information sources.

    According to my notes, this is how Danny Sullivan explained it:

    “And then on top of that, it’s then also trying to bring in the grounding links. And those grounding links, because it kind of comes from a broader set aren’t just going to match. The queries are going to be different and the overall set is going to be different.

    Which is why it’s a great opportunity for diversity and whatever our query thing is that we say, but that’s why you can see different things that are showing there.”

    Don’t Mess Up Your Rankings

    Sullivan cautioned about trying to rank for both the organic and the different parts of the AIO summaries, saying that it’s likely to “mess things up” because “it doesn’t really work like that.”

    Query Fan-Out Technique

    Danny Sullivan also touched on the topic of AI Mode, saying that right now it’s not really something to optimize for because it’s still in Google Labs and it’s very likely to change and be something different if it ever gets out of Google Labs.

    But he did say that AI Mode uses something called a query fan-out technique.

    He said:

    “…one of the things they talk about is like ‘we use an advanced query fan out technique with multiple related queries in it…’ And it’s basically that what I said before.

    You issued a query. You try to understand the variations and things that are related. which by the way is not that much different to how search works at the moment even when you didn’t have the AI elements to it. Because when you would issue a query now we try to understand synonyms, we try to understand the meaning of the entire query. If it’s a sentence, we try to match it in all sorts of different ways …because sometimes it just brings you better results.”

    Takeaways:

    Google Search Liaison, aka Danny Sullivan, encouraged the use of the core SEO fundamentals, saying that they are still relevant for ranking. Danny explained why the links in AI Overviews can sometimes differ significantly from those in the organic search results, introducing three concepts that help understand AIO search results better.

    Three concepts related to AIO search results to understand:

    1. Predictive Summaries
    2. Grounding Links
    3. Query Fan-Out Technique

    Google Expands AI Overviews To Thousands More Health Queries via @sejournal, @MattGSouthern

    Google is expanding AI overviews to “thousands more health topics,” per an announcement at the company’s health-focused ‘The Check Up’ event.

    The event included developments spanning research, wearable technology, and medical records.

    Here’s more about how Google is refining health results in Search.

    AI Overviews For Health Queries

    Google is showing AI overviews for more health-related queries.

    Compared to other types of questions, this topic has had fewer AI overviews. Now, these overviews will be available for more queries and in more languages.

    Google states:

    “Now, using AI and our best-in-class quality and ranking systems, we’ve been able to expand these types of overviews to cover thousands more health topics. We’re also expanding to more countries and languages, including Spanish, Portuguese and Japanese, starting on mobile.”

    Google notes health-focused advancements to its Gemini models will go into summarizing information for health topics.

    With these updates, Google claims AI overviews for health queries are “more relevant, comprehensive and continue to meet a high bar for clinical factuality.”

    New “What People Suggest” Feature

    Google is introducing a new feature for health queries called “What People Suggest.”

    It uses AI to organize perspectives from online discussions and to analyze what people with similar health conditions are saying.

    For example, someone with arthritis looking for exercise recommendations could use this feature to learn what works for others with the same condition.

    See an example below.

    Screenshot from: blog.google/technology/health/the-check-up-health-ai-updates-2025/, March 2025.

    “What People Suggest” is currently available only on mobile devices in the U.S.

    Broader Health AI Initiatives

    The search updates were part of a larger set of health technology announcements at The Check Up event. Google also revealed:

    • Medical Records APIs in Health Connect for managing health data across applications
    • FDA clearance for Loss of Pulse Detection on Pixel Watch 3
    • An AI co-scientist built on Gemini 2.0 to help biomedical researchers
    • TxGemma, a collection of open models for AI-powered drug discovery
    • Capricorn, an AI tool for pediatric oncology treatment developed with Princess Máxima Center

    Looking Ahead

    Hallucination remains a problem for AI models. While Gemini may have upgrades that make it more accurate, it will still be wrong at least sometimes.

    Google’s inclusion of personal experiences alongside medical websites marks a shift, recognizing people value both clinical information and real-world perspectives.

    Health publishers should be aware that this could affect search visibility but may also increase chances of appearing for more queries or the “What People Suggest” section.

    What Content Works Well In LLMs? via @sejournal, @Kevin_Indig

    Over the last 12 months, we filled significant gaps in our understanding of AI Chatbots like ChatGPT & Co.

    We know:

    1. Adoption is growing rapidly.
    2. AI chatbots send more referrals to websites over time.
    3. Referral traffic from AI chatbots has a higher quality than that from Google.

    You can read all about it in the state of AI chatbots and SEO.

    But there isn’t much content about examples and success factors of content that drives citations and mentions in AI chatbots.

    To get an answer, I analyzed over 7,000 citations across 1,600 URLs to content-heavy sites (think: Integrators) in three AI chatbots (ChatGPT, Perplexity, AI Overviews) in February 2024 with the help of Profound.

    My goal is to figure out:

    1. Why some pages are more cited than others, so we can optimize content for AI chatbots.
    2. Whether classic SEO factors matter for AI chatbot visibility, so we can prioritize.
    3. What traps to avoid, so we don’t have to learn the same lessons many times.
    4. If different factors influence mentions and citations, so we can be more targeted in our efforts.

    Here are my findings:


    The Key To Brand Citation In AI Chatbots: Deep Content

    Image Credit: Kevin Indig

    🔍 Context: We know that AI chatbots use Retrieval Augmented Generation (RAG) to weigh their answers with results from Google and Bing. However, does that mean classic SEO ranking factors also translate to AI chatbot citations? No.

    My correlation analysis shows that none of the classic SEO metrics have strong relationships with citations. LLMs have light preferences: Perplexity and AI Overviews weigh word and sentence count higher, while ChatGPT weighs domain rating and Flesch Score.

    💡Takeaway: Classic SEO metrics don’t matter nearly as much for AI chatbot mentions and citations. The best thing you can do for content optimization is to aim for depth, comprehensiveness, and readability (how easy the text is to understand).

    The following examples all demonstrate those attributes:

    • https://www.byrdie.com/digital-prescription-services-dermatologist-5179537
    • https://www.healthline.com/nutrition/best-weight-loss-programs
    • https://www.verywellmind.com/we-tried-online-therapy-com-these-were-our-experiences-8780086

    Broad correlations didn’t reveal enough meat on the bone and left me with too many open questions.

    So, I looked at what the most-cited content does differently than the rest. That approach showed much stronger patterns.

    Image Credit: Kevin Indig

    🔍Context: Because I didn’t get much out of statistical correlations, I wanted to see how the top 10% of most cited content stacks up against the bottom 90%.

    The bigger the difference, the more critical the factor for the top 10%. In other words, the multiplier (x-axis on the chart) indicates what factors LLMs reward with citations.

    The results:

    • The two factors that stand out are sentence and word count, followed by the Flesch Score. Metrics related to backlinks and traffic seem to have a negative effect, which doesn’t mean that AI chatbots weigh them negatively but simply that they don’t matter for mentions or citations.
    • The top 10% of most cited pages across all three LLMs have much less traffic, rank for fewer keywords, and get fewer total backlinks. How does that make sense? It almost looks like being strong in traditional SEO metrics is bad for AI chatbot visibility.
    • Copilot (not included in the chart) has the starkest inequality, by the way. The top 10% get 17.6x more citations than the bottom 90%. However, the top 10% also rank for 1.7x more keywords in organic search. So, Copilot seems to have stronger preferences than other AI chatbots.

    Splitting the data up by AI Chatbot shows you their unique preferences:

    Image Credit: Kevin Indig

    💡Takeaway: Content depth (word and sentence count) and readability (Flesch Score) have the biggest impact on citations in AI chatbots.

    This is important to understand: Longer content isn’t better because it’s longer, but because it has a higher chance of answering a specific question prompted in an AI chatbot.

    Examples:

    • www.verywellmind.com/best-online-psychiatrists-5119854 has 187 citations, over 10,000 words, and over 1,500 sentences, with a Flesch Score of 55, and is cited 72 times by ChatGPT.
    • On the other hand, www.onlinetherapy.com/best-online-psychiatrists/ has only three citations and a similarly low Flesch Score of 48, but comes up “short” with only 3,900 words and 580 sentences.

    🔍Context: We don’t yet know the value of a brand being mentioned by an AI chatbot.

    Early research indicates it’s high, especially when prompts indicate purchase intent.

    However, I wanted to get a step closer by understanding what leads to brand mentions in AI chatbots in the first place.

    After matching many metrics with AI chatbot visibility, I found one factor that stands out more than anything else: Brand search volume.

    The number of AI chatbot mentions and brand search volume have a correlation of .334 – pretty good in this field. In other words, the popularity of a brand broadly decides how visible it is in AI chatbots.

    Image Credit: Kevin Indig

    Popularity is the most significant predictor for ChatGPT, which also sends the most traffic and has the highest usage of all AI chatbots.

    When breaking it down by AI chatbot, I found ChatGPT has the highest correlation with .542 (strong), but Perplexity (.196) and Google AIOs (.254) have lower correlations.

    To be clear, there is a lot of nuance on the prompt and category level. But broadly, a brand’s visibility seems to be strongly driven by how popular it is.

    Example of popular brands and their visibility in the health category (Image Credit: Kevin Indig)

    However, when brands are mentioned, all AI chatbots prefer popular brands and consistently rank them in the same order.

    • There is a clear link between the categories of the users’ questions (mental health, skincare, weight loss, hair loss, erectile dysfunction) and brands.
    • Early data shows that the most visible brands are digital-first and invest heavily in their online presence with content, SEO, reviews, social media, and digital advertising.

    💡Takeaway: Popularity is the biggest criterion that decides whether a brand is mentioned in AI chatbots or not. The way consumers connect brands to product categories also matters.

    Comparing brand search volume and product category presence with your competitors gives you the best idea of how competitive you are on ChatGPT & Co.

    Examples: All models in my analysis cite Healthline most often. Not a single other domain was in the top 10 citations for all four models, showing their distinctly different tastes and how important it is to keep track of many models as opposed to only ChatGPT – if those models also send you traffic.

    Image Credit: Kevin Indig

    Other well-cited domains across most models:

    • verywellmind.com
    • onlinedoctor.com
    • medicalnewstoday.com
    • byrdie.com
    • cnet.com
    • ncoa.org
    Image Credit: Kevin Indig

    🔍 Context: Not all AI chatbots mentioned brands with the same frequency. Even though ChatGPT has the highest adoption and sends the most referral traffic to sources, Perplexity mentions the most brands on average in answers.

    Prompt structure matters for brand visibility:

    • The word “best” was a strong trigger for brand mentions in 69.71% of prompts.
    • Words like “trusted” (5.77%), “source” (2.88%), “recommend” (0.96%), and “reliable” (0.96%) were also associated with an increased likelihood of brand mentions.
    • Prompts including “recommend” often mention public organizations like the FDA, especially when the prompt includes words like “trusted” or “leading.”
    • Google AIOs show the highest brand diversity, followed by Perplexity, then ChatGPT.

    💡Takeaway: Prompt structure has a meaningful impact on the brands that come up in the answer.

    However, we’re not yet able to truly know which prompts users use. This is important to keep in mind: All prompts we look at and track are just proxies for what users might be doing.

    Image Credit: Kevin Indig

    🔍Context: In my research, I encountered several ways brands unintentionally sabotage their AI chatbot visibility.

    I surface them here because the pre-requisite to being visible in LLMs is, of course, their ability to crawl your site, whether that’s directly or through training data.

    For example, Copilot doesn’t cite onlinedoctor.com because it’s not indexed in Bing. I couldn’t find indicators that this was done on purpose, so I assume it’s an accident that could quickly be fixed and rewarded with referral traffic.

    On the other hand, ChatGPT 4o doesn’t cite cnet.com, and Perplexity doesn’t cite everydayhealth.com because both sites intentionally block the respective LLM in their robots.txt.

    But there are also cases in which AI chatbots reference sites even though they technically shouldn’t.

    The most cited domain in Perplexity in my dataset is blocked.goodrx.com. GoodRX blocks users from non-U.S. countries, and it seems it accidentally or intentionally blocks Perplexity.

    Image Credit: Kevin Indig

    It’s important to single out Google’s AI Overviews here: There is no opt-out for AIOs, meaning if you want to get organic traffic from Google, you need to allow it to crawl your site, potentially use your content to train its models and surface it in AI Overviews. Chegg recently filed a lawsuit against Google for this.

    💡Takeaway: Monitor your site in Google Search Console and Bing Webmaster Tools, especially to confirm that all the URLs you want indexed actually are.

    Double-check whether you accidentally block an LLM crawler in your robots.txt or through your CDN.

    If you intentionally block LLM crawlers, double-check whether you appear in their answers simply by asking them what they know about your domain.
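
    One quick way to run that double-check is a small Python script using only the standard library; the crawler tokens below are common examples (GPTBot, PerplexityBot, ClaudeBot, Google-Extended), and the domain is a placeholder.

    ```python
    from urllib import robotparser

    # Check whether robots.txt blocks common LLM crawler user-agents.
    rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
    rp.read()

    for bot in ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]:
        verdict = "allowed" if rp.can_fetch(bot, "https://www.example.com/") else "blocked"
        print(f"{bot}: {verdict}")
    ```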

    Summary: 6 Key Learnings

    • Classic SEO metrics don’t strongly influence AI chatbot citations.
    • Content depth (higher word and sentence counts) and readability (good Flesch Score) matter more.
    • Different AI chatbots have distinct preferences – monitoring multiple platforms is important.
    • Brand popularity (measured by search volume) is the strongest predictor of brand mentions in AI chatbots, especially in ChatGPT.
    • Prompt structure influences brand visibility, and we don’t yet know how users phrase prompts.
    • Technical issues can sabotage AI visibility – ensure your site isn’t accidentally blocking LLM crawlers through robots.txt or CDN settings.

    Featured Image: Paulo Bobita/Search Engine Journal