How To Use LLMs For 301 Redirects At Scale via @sejournal, @vahandev

Redirects are essential to every website’s maintenance, and managing them becomes especially challenging when SEO pros deal with websites containing millions of pages.

Examples of situations where you may need to implement redirects at scale:

  • An ecommerce site has a large number of products that are no longer sold.
  • Outdated pages of news publications are no longer relevant or lack historical value.
  • Listing directories that contain outdated listings.
  • Job boards where postings expire.

Why Is Redirecting At Scale Essential?

Redirecting at scale can help improve user experience, consolidate rankings, and save crawl budget.

You might consider noindexing, but this does not stop Googlebot from crawling. It wastes crawl budget as the number of pages grows.

From a user experience perspective, landing on an outdated link is frustrating. For example, if a user lands on an outdated job listing, it’s better to send them to the closest match for an active job listing.

At Search Engine Journal, we get many 404 hits from AI chatbots because of hallucinations, as they invent URLs that never existed.

We use Google Analytics 4 and Google Search Console (and sometimes server logs) reports to extract those 404 pages and redirect them to the closest matching content based on article slug.

When chatbots cite us via 404 pages and people keep coming through broken links, it is not a good user experience.

Prepare Redirect Candidates

First of all, read this post to learn how to create a Pinecone vector database. (Please note that in this case, we used “primary_category” as a metadata key vs. “category.”)

To make this work, we assume that all your article vectors are already stored in the “article-index-vertex” database.
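If you are not sure what those records look like, below is a minimal, hypothetical sketch of upserting a single article vector with a “primary_category” (and optional “publish_year”) metadata field using the current Pinecone client. The record ID, metadata values, and the placeholder vector are illustrative only and not the exact setup from the earlier post.

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_KEY")
index = pc.Index("article-index-vertex")

# Placeholder for the 768-dimensional vector you generated for the article
# with the same Vertex AI embedding model used later for querying.
article_embedding = [0.0] * 768

index.upsert(vectors=[{
    "id": "https://www.example.com/sample-article/",  # the article URL doubles as the record ID
    "values": article_embedding,
    "metadata": {
        "primary_category": "SEO",   # used for category filtering during redirect matching
        "publish_year": 2024         # optional, enables the publish_year filter shown later
    }
}])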

Prepare your redirect URLs in CSV format like in this sample file. These could be existing articles you’ve decided to prune, or 404s from your Search Console reports or GA4.

Sample file with URLs to be redirected (Screenshot from Google Sheet, May 2025)

The optional “primary_category” field is metadata that already exists in your articles’ Pinecone records (added when you created them) and can be used to restrict candidates to the same category, further improving accuracy.

If the title is missing (for example, for 404 URLs), the script will extract slug words from the URL and use them as the embedding input.
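As a rough illustration of preparing that file programmatically (the export file name and its “URL” column are assumptions; adjust them to your GSC or GA4 export), you can leave “Title” and “primary_category” empty for 404s so the script falls back to slug words:

import pandas as pd

# Hypothetical 404 export (e.g., from Google Search Console's "Not found (404)" report).
report = pd.read_csv("gsc_not_found_export.csv")

candidates = pd.DataFrame()
candidates["URL"] = report["URL"].str.split("?").str[0].drop_duplicates()  # strip query strings, dedupe
candidates["Title"] = ""              # unknown for 404s; the script falls back to slug words
candidates["primary_category"] = ""   # leave empty if the category is unknown

candidates.to_csv("redirect_candidates.csv", index=False)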

Generate Redirects Using Google Vertex AI

Download your Google API service credentials and rename them as “config.json,” upload the script below and a sample file to the same directory in Jupyter Lab, and run it.


import os
import time
import logging
from urllib.parse import urlparse
import re
import pandas as pd
from pandas.errors import EmptyDataError
from typing import Optional, List, Dict, Any

from google.auth import load_credentials_from_file
from google.cloud import aiplatform
from google.api_core.exceptions import GoogleAPIError

from pinecone import Pinecone, PineconeException
from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput

# Import tenacity for retry mechanism. Tenacity provides a decorator to add retry logic
# to functions, making them more robust against transient errors like network issues or API rate limits.
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

# For clearing output in Jupyter (optional, keep if running in Jupyter).
# This is useful for interactive environments to show progress without cluttering the output.
from IPython.display import clear_output

# ─── USER CONFIGURATION ───────────────────────────────────────────────────────
# Define configurable parameters for the script. These can be easily adjusted
# without modifying the core logic.

INPUT_CSV = "redirect_candidates.csv"      # Path to the input CSV file containing URLs to be redirected.
                                           # Expected columns: "URL", "Title", "primary_category".
OUTPUT_CSV = "redirect_map.csv"            # Path to the output CSV file where the generated redirect map will be saved.
PINECONE_API_KEY = "YOUR_PINECONE_KEY"     # Your API key for Pinecone. Replace with your actual key.
PINECONE_INDEX_NAME = "article-index-vertex" # The name of the Pinecone index where article vectors are stored.
GOOGLE_CRED_PATH = "config.json"           # Path to your Google Cloud service account credentials JSON file.
EMBEDDING_MODEL_ID = "text-embedding-005"  # Identifier for the Vertex AI text embedding model to use.
TASK_TYPE = "RETRIEVAL_QUERY"              # The task type for the embedding model. Try with RETRIEVAL_DOCUMENT vs RETRIEVAL_QUERY to see the difference.
                                           # This influences how the embedding vector is generated for optimal retrieval.
CANDIDATE_FETCH_COUNT = 3    # Number of potential redirect candidates to fetch from Pinecone for each input URL.
TEST_MODE = True             # If True, the script will process only a small subset of the input data (MAX_TEST_ROWS).
                             # Useful for testing and debugging.
MAX_TEST_ROWS = 5            # Maximum number of rows to process when TEST_MODE is True.
QUERY_DELAY = 0.2            # Delay in seconds between successive API queries (to avoid hitting rate limits).
PUBLISH_YEAR_FILTER: List[int] = []  # Optional: List of years to filter Pinecone results by 'publish_year' metadata.
                                     # If empty, no year filtering is applied.
LOG_BATCH_SIZE = 5           # Number of URLs to process before flushing the results to the output CSV.
                             # This helps in saving progress incrementally and managing memory.
MIN_SLUG_LENGTH = 3          # Minimum length for a URL slug segment to be considered meaningful for embedding.
                             # Shorter segments might be noise or less descriptive.

# Retry configuration for API calls (Vertex AI and Pinecone).
# These parameters control how the `tenacity` library retries failed API requests.
MAX_RETRIES = 5              # Maximum number of times to retry an API call before giving up.
INITIAL_RETRY_DELAY = 1      # Initial delay in seconds before the first retry.
                             # Subsequent retries will have exponentially increasing delays.

# ─── SETUP LOGGING ─────────────────────────────────────────────────────────────
# Configure the logging system to output informational messages to the console.
logging.basicConfig(
    level=logging.INFO,  # Set the logging level to INFO, meaning INFO, WARNING, ERROR, CRITICAL messages will be shown.
    format="%(asctime)s %(levelname)s %(message)s" # Define the format of log messages (timestamp, level, message).
)

# ─── INITIALIZE GOOGLE VERTEX AI ───────────────────────────────────────────────
# Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the
# service account key file. This allows the Google Cloud client libraries to
# authenticate automatically.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = GOOGLE_CRED_PATH
try:
    # Load credentials from the specified JSON file.
    credentials, project_id = load_credentials_from_file(GOOGLE_CRED_PATH)
    # Initialize the Vertex AI client with the project ID and credentials.
    # The location "us-central1" is specified for the AI Platform services.
    aiplatform.init(project=project_id, credentials=credentials, location="us-central1")
    logging.info("Vertex AI initialized.")
except Exception as e:
    # Log an error if Vertex AI initialization fails and re-raise the exception
    # to stop script execution, as it's a critical dependency.
    logging.error(f"Failed to initialize Vertex AI: {e}")
    raise

# Initialize the embedding model once globally.
# This is a crucial optimization for "Resource Management for Embedding Model".
# Loading the model takes time and resources; doing it once avoids repeated loading
# for every URL processed, significantly improving performance.
try:
    GLOBAL_EMBEDDING_MODEL = TextEmbeddingModel.from_pretrained(EMBEDDING_MODEL_ID)
    logging.info(f"Text Embedding Model '{EMBEDDING_MODEL_ID}' loaded.")
except Exception as e:
    # Log an error if the embedding model fails to load and re-raise.
    # The script cannot proceed without the embedding model.
    logging.error(f"Failed to load Text Embedding Model: {e}")
    raise

# ─── INITIALIZE PINECONE ──────────────────────────────────────────────────────
# Initialize the Pinecone client and connect to the specified index.
try:
    pinecone = Pinecone(api_key=PINECONE_API_KEY)
    index = pinecone.Index(PINECONE_INDEX_NAME)
    logging.info(f"Connected to Pinecone index '{PINECONE_INDEX_NAME}'.")
except PineconeException as e:
    # Log an error if Pinecone initialization fails and re-raise.
    # Pinecone is a critical dependency for finding redirect candidates.
    logging.error(f"Pinecone init error: {e}")
    raise

# ─── HELPERS ───────────────────────────────────────────────────────────────────
def canonical_url(url: str) -> str:
    """
    Converts a given URL into its canonical form by:
    1. Stripping query strings (e.g., `?param=value`) and URL fragments (e.g., `#section`).
    2. Handling URL-encoded fragment markers (`%23`).
    3. Preserving the trailing slash if it was present in the original URL's path.
       This ensures consistency with the original site's URL structure.

    Args:
        url (str): The input URL.

    Returns:
        str: The canonicalized URL.
    """
    # Remove query parameters and URL fragments.
    temp = url.split('?', 1)[0].split('#', 1)[0]
    # Check for URL-encoded fragment markers and remove them.
    enc_idx = temp.lower().find('%23')
    if enc_idx != -1:
        temp = temp[:enc_idx]
    # Determine if the original URL path ended with a trailing slash.
    has_slash = urlparse(temp).path.endswith('/')
    # Remove any trailing slash temporarily for consistent processing.
    temp = temp.rstrip('/')
    # Re-add the trailing slash if it was originally present.
    return temp + ('/' if has_slash else '')


def slug_from_url(url: str) -> str:
    """
    Extracts and joins meaningful, non-numeric path segments from a canonical URL
    to form a "slug" string. This slug can be used as text for embedding when
    a URL's title is not available.

    Args:
        url (str): The input URL.

    Returns:
        str: A hyphen-separated string of relevant slug parts.
    """
    clean = canonical_url(url) # Get the canonical version of the URL.
    path = urlparse(clean).path # Extract the path component of the URL.
    segments = [seg for seg in path.split('/') if seg] # Split path into segments and remove empty ones.

    # Filter segments based on criteria:
    # - Not purely numeric (e.g., '123' is excluded).
    # - Length is greater than or equal to MIN_SLUG_LENGTH.
    # - Contains at least one alphanumeric character (to exclude purely special character segments).
    parts = [seg for seg in segments
             if not seg.isdigit()
             and len(seg) >= MIN_SLUG_LENGTH
             and re.search(r'[A-Za-z0-9]', seg)]
    return '-'.join(parts) # Join the filtered parts with hyphens.

# ─── EMBEDDING GENERATION FUNCTION ─────────────────────────────────────────────
# Apply retry mechanism for GoogleAPIError. This makes the embedding generation
# more resilient to transient issues like network problems or Vertex AI rate limits.
@retry(
    wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10), # Exponential backoff for retries.
    stop=stop_after_attempt(MAX_RETRIES), # Stop retrying after a maximum number of attempts.
    retry=retry_if_exception_type(GoogleAPIError), # Only retry if a GoogleAPIError occurs.
    reraise=True # Re-raise the exception if all retries fail, allowing the calling function to handle it.
)
def generate_embedding(text: str) -> Optional[List[float]]:
    """
    Generates a vector embedding for the given text using the globally initialized
    Vertex AI Text Embedding Model. Includes retry logic for API calls.

    Args:
        text (str): The input text (e.g., URL title or slug) to embed.

    Returns:
        Optional[List[float]]: A list of floats representing the embedding vector,
                               or None if the input text is empty/whitespace or
                               if an unexpected error occurs after retries.
    """
    if not text or not text.strip():
        # If the text is empty or only whitespace, no embedding can be generated.
        return None
    try:
        # Use the globally initialized model to get embeddings.
        # This is the "Resource Management for Embedding Model" optimization.
        inp = TextEmbeddingInput(text, task_type=TASK_TYPE)
        vectors = GLOBAL_EMBEDDING_MODEL.get_embeddings([inp], output_dimensionality=768)
        return vectors[0].values # Return the embedding vector (list of floats).
    except GoogleAPIError as e:
        # Log a warning if a GoogleAPIError occurs, then re-raise to trigger the `tenacity` retry mechanism.
        logging.warning(f"Vertex AI error during embedding generation (retrying): {e}")
        raise # The `reraise=True` in the decorator will catch this and retry.
    except Exception as e:
        # Catch any other unexpected exceptions during embedding generation.
        logging.error(f"Unexpected error generating embedding: {e}")
        return None # Return None for non-retryable or final failed attempts.

# ─── MAIN PROCESSING FUNCTION ─────────────────────────────────────────────────
def build_redirect_map(
    input_csv: str,
    output_csv: str,
    fetch_count: int,
    test_mode: bool
):
    """
    Builds a redirect map by processing URLs from an input CSV, generating
    embeddings, querying Pinecone for similar articles, and identifying
    suitable redirect candidates.

    Args:
        input_csv (str): Path to the input CSV file.
        output_csv (str): Path to the output CSV file for the redirect map.
        fetch_count (int): Number of candidates to fetch from Pinecone.
        test_mode (bool): If True, process only a limited number of rows.
    """
    # Read the input CSV file into a Pandas DataFrame.
    df = pd.read_csv(input_csv)
    required = {"URL", "Title", "primary_category"}
    # Validate that all required columns are present in the DataFrame.
    if not required.issubset(df.columns):
        raise ValueError(f"Input CSV must have columns: {required}")

    # Create a set of canonicalized input URLs for efficient lookup.
    # This is used to prevent an input URL from redirecting to itself or another input URL,
    # which could create redirect loops or redirect to a page that is also being redirected.
    input_urls = set(df["URL"].map(canonical_url))

    start_idx = 0
    # Implement resume functionality: if the output CSV already exists,
    # try to find the last processed URL and resume from the next row.
    if os.path.exists(output_csv):
        try:
            prev = pd.read_csv(output_csv)
        except EmptyDataError:
            # Handle case where the output CSV exists but is empty.
            prev = pd.DataFrame()
        if not prev.empty:
            # Get the last URL that was processed and written to the output file.
            last = prev["URL"].iloc[-1]
            # Find the index of this last URL in the original input DataFrame.
            idxs = df.index[df["URL"].map(canonical_url) == last].tolist()
            if idxs:
                # Set the starting index for processing to the row after the last processed URL.
                start_idx = idxs[0] + 1
                logging.info(f"Resuming from row {start_idx} after {last}.")

    # Determine the range of rows to process based on test_mode.
    if test_mode:
        end_idx = min(start_idx + MAX_TEST_ROWS, len(df))
        df_proc = df.iloc[start_idx:end_idx] # Select a slice of the DataFrame for testing.
        logging.info(f"Test mode: processing rows {start_idx} to {end_idx-1}.")
    else:
        df_proc = df.iloc[start_idx:] # Process all remaining rows.
        logging.info(f"Processing rows {start_idx} to {len(df)-1}.")

    total = len(df_proc) # Total number of URLs to process in this run.
    processed = 0        # Counter for successfully processed URLs.
    batch: List[Dict[str, Any]] = [] # List to store results before flushing to CSV.

    # Iterate over each row (URL) in the DataFrame slice to be processed.
    for _, row in df_proc.iterrows():
        raw_url = row["URL"] # Original URL from the input CSV.
        url = canonical_url(raw_url) # Canonicalized version of the URL.
        # Get title and category, handling potential missing values by defaulting to empty strings.
        title = row["Title"] if isinstance(row["Title"], str) else ""
        category = row["primary_category"] if isinstance(row["primary_category"], str) else ""

        # Determine the text to use for generating the embedding.
        # Prioritize the 'Title' if available, otherwise use a slug derived from the URL.
        if title.strip():
            text = title
        else:
            slug = slug_from_url(raw_url)
            if not slug:
                # If no meaningful slug can be extracted, skip this URL.
                logging.info(f"Skipping {raw_url}: insufficient slug context for embedding.")
                continue
            text = slug.replace('-', ' ') # Prepare slug for embedding by replacing hyphens with spaces.

        # Attempt to generate the embedding for the chosen text.
        # This call is wrapped in a try-except block to catch final failures after retries.
        try:
            embedding = generate_embedding(text)
        except GoogleAPIError as e:
            # If embedding generation fails even after retries, log the error and skip this URL.
            logging.error(f"Failed to generate embedding for {raw_url} after {MAX_RETRIES} retries: {e}")
            continue # Move to the next URL.

        if not embedding:
            # If `generate_embedding` returned None (e.g., empty text or unexpected error), skip.
            logging.info(f"Skipping {raw_url}: no embedding generated.")
            continue

        # Build metadata filter for Pinecone query.
        # This helps narrow down search results to more relevant candidates (e.g., by category or publish year).
        filt: Dict[str, Any] = {}
        if category:
            # Split category string by comma and strip whitespace for multiple categories.
            cats = [c.strip() for c in category.split(",") if c.strip()]
            if cats:
                filt["primary_category"] = {"$in": cats} # Filter by categories present in Pinecone metadata.
        if PUBLISH_YEAR_FILTER:
            filt["publish_year"] = {"$in": PUBLISH_YEAR_FILTER} # Filter by specified publish years.
        filt["id"] = {"$ne": url} # Exclude the current URL itself from the search results to prevent self-redirects.

        # Define a nested function for Pinecone query with retry mechanism.
        # This ensures that Pinecone queries are also robust against transient errors.
        @retry(
            wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10),
            stop=stop_after_attempt(MAX_RETRIES),
            retry=retry_if_exception_type(PineconeException), # Only retry if a PineconeException occurs.
            reraise=True # Re-raise the exception if all retries fail.
        )
        def query_pinecone_with_retry(embedding_vector, top_k_count, pinecone_filter):
            """
            Performs a Pinecone index query with retry logic.
            """
            return index.query(
                vector=embedding_vector,
                top_k=top_k_count,
                include_values=False, # We don't need the actual vector values in the response.
                include_metadata=False, # We don't need the metadata in the response for this logic.
                filter=pinecone_filter # Apply the constructed metadata filter.
            )

        # Attempt to query Pinecone for redirect candidates.
        try:
            res = query_pinecone_with_retry(embedding, fetch_count, filt)
        except PineconeException as e:
            # If Pinecone query fails after retries, log the error and skip this URL.
            logging.error(f"Failed to query Pinecone for {raw_url} after {MAX_RETRIES} retries: {e}")
            continue # Move to the next URL.

        candidate = None # Initialize redirect candidate to None.
        score = None     # Initialize relevance score to None.

        # Iterate through the Pinecone query results (matches) to find a suitable candidate.
        for m in res.get("matches", []):
            cid = m.get("id") # Get the ID (URL) of the matched document in Pinecone.
            # A candidate is suitable if:
            # 1. It exists (cid is not None).
            # 2. It's not the original URL itself (to prevent self-redirects).
            # 3. It's not another URL from the input_urls set (to prevent redirecting to a page that's also being redirected).
            if cid and cid != url and cid not in input_urls:
                candidate = cid # Assign the first valid candidate found.
                score = m.get("score") # Get the relevance score of this candidate.
                break # Stop after finding the first suitable candidate (Pinecone returns by relevance).

        # Append the results for the current URL to the batch.
        batch.append({"URL": url, "Redirect Candidate": candidate, "Relevance Score": score})
        processed += 1 # Increment the counter for processed URLs.
        msg = f"Mapped {url} → {candidate}"
        if score is not None:
            msg += f" ({score:.4f})" # Add score to log message if available.
        logging.info(msg) # Log the mapping result.

        # Periodically flush the batch results to the output CSV.
        if processed % LOG_BATCH_SIZE == 0:
            out_df = pd.DataFrame(batch) # Convert the current batch to a DataFrame.
            # Determine file mode: 'a' (append) if file exists, 'w' (write) if new.
            mode = 'a' if os.path.exists(output_csv) else 'w'
            # Determine if header should be written (only for new files).
            header = not os.path.exists(output_csv)
            # Write the batch to the CSV.
            out_df.to_csv(output_csv, mode=mode, header=header, index=False)
            batch.clear() # Clear the batch after writing to free memory.
            if not test_mode:
                clear_output(wait=True) # Clear output in Jupyter for cleaner progress display.
                print(f"Progress: {processed} / {total}") # Print progress update.

        time.sleep(QUERY_DELAY) # Pause for a short delay to avoid overwhelming APIs.

    # After the loop, write any remaining items in the batch to the output CSV.
    if batch:
        out_df = pd.DataFrame(batch)
        mode = 'a' if os.path.exists(output_csv) else 'w'
        header = not os.path.exists(output_csv)
        out_df.to_csv(output_csv, mode=mode, header=header, index=False)

    logging.info(f"Completed. Total processed: {processed}") # Log completion message.

if __name__ == "__main__":
    # This block ensures that build_redirect_map is called only when the script is executed directly.
    # It passes the user-defined configuration parameters to the main function.
    build_redirect_map(INPUT_CSV, OUTPUT_CSV, CANDIDATE_FETCH_COUNT, TEST_MODE)

You will see a test run with only five records, and a new file called “redirect_map.csv,” containing redirect suggestions, will be created.

Once you ensure the code runs smoothly, you can set the TEST_MODE boolean to False and run the script for all your URLs.

Test run with only five records (Image from author, May 2025)

If the script stops and you rerun it, it picks up where it left off. It also checks each redirect candidate it finds against the input CSV file.

This check prevents selecting a URL from the vector database that is itself on the pruned list, which could otherwise cause an infinite redirect loop.

For our sample URLs, the output is shown below.

Redirect candidates using Google Vertex AI’s task type RETRIEVAL_QUERY (Image from author, May 2025)

We can now take this redirect map and import it into our redirect manager in the content management system (CMS), and that’s it!
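If your CMS does not have a redirect manager, here is a hedged example of what that import step could look like instead, assuming an Apache setup and that the redirect map stores full absolute URLs as in the output above:

import pandas as pd
from urllib.parse import urlparse

# Convert the generated redirect map into Apache mod_alias 301 rules.
redirect_map = pd.read_csv("redirect_map.csv").dropna(subset=["Redirect Candidate"])

with open("redirects.htaccess", "w") as f:
    for _, row in redirect_map.iterrows():
        source_path = urlparse(row["URL"]).path   # e.g., /what-is-eat/
        target = row["Redirect Candidate"]        # full destination URL
        f.write(f"Redirect 301 {source_path} {target}\n")

Review the generated rules before deploying them, since the candidates are suggestions, not guaranteed matches.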

You can see how it managed to match the outdated 2013 news article “YouTube Retiring Video Responses on September 12” to the newer, highly relevant 2022 news article “YouTube Adopts Feature From TikTok – Reply To Comments With A Video.”

Also for “/what-is-eat/,” it found a match with “/google-eat/what-is-it/,” which is a 100% perfect match.

This is not just due to the quality of Google Vertex AI’s model, but also the result of choosing the right parameters.

When I use “RETRIEVAL_DOCUMENT” as the task type for generating query vector embeddings for the YouTube news article shown above, it matches “YouTube Expands Community Posts to More Creators,” which is still relevant but not as good a match as the other one.

For “/what-is-eat/,” it matches the article “/reimagining-eeat-to-drive-higher-sales-and-search-visibility/545790/,” which is not as good as “/google-eat/what-is-it/.”

If you want to find redirect matches only from your pool of fresh articles, you can query Pinecone with one additional metadata filter, “publish_year” (provided you have that metadata field in your Pinecone records, which I highly recommend creating).

In the code, it is the PUBLISH_YEAR_FILTER variable.

If you have publish_year metadata, you can set the years as array values, and it will pull articles published in the specified years.
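For example, to restrict redirect targets to recently published articles, you could set (the years here are only an illustration):

PUBLISH_YEAR_FILTER: List[int] = [2023, 2024, 2025]  # only match articles published in these years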

Generate Redirects Using OpenAI’s Text Embeddings

Let’s do the same task with OpenAI’s “text-embedding-ada-002” model. The purpose is to show the difference in output from Google Vertex AI.

Simply create a new notebook file in the same directory, copy and paste this code, and run it.


import os
import time
import logging
from urllib.parse import urlparse
import re

import pandas as pd
from pandas.errors import EmptyDataError
from typing import Optional, List, Dict, Any

from openai import OpenAI
from pinecone import Pinecone, PineconeException

# Import tenacity for retry mechanism. Tenacity provides a decorator to add retry logic
# to functions, making them more robust against transient errors like network issues or API rate limits.
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

# For clearing output in Jupyter (optional, keep if running in Jupyter)
from IPython.display import clear_output

# ─── USER CONFIGURATION ───────────────────────────────────────────────────────
# Define configurable parameters for the script. These can be easily adjusted
# without modifying the core logic.

INPUT_CSV = "redirect_candidates.csv"       # Path to the input CSV file containing URLs to be redirected.
                                            # Expected columns: "URL", "Title", "primary_category".
OUTPUT_CSV = "redirect_map.csv"             # Path to the output CSV file where the generated redirect map will be saved.
PINECONE_API_KEY = "YOUR_PINECONE_API_KEY"      # Your API key for Pinecone. Replace with your actual key.
PINECONE_INDEX_NAME = "article-index-ada"   # The name of the Pinecone index where article vectors are stored.
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"    # Your API key for OpenAI. Replace with your actual key.
OPENAI_EMBEDDING_MODEL_ID = "text-embedding-ada-002" # Identifier for the OpenAI text embedding model to use.
CANDIDATE_FETCH_COUNT = 3    # Number of potential redirect candidates to fetch from Pinecone for each input URL.
TEST_MODE = True             # If True, the script will process only a small subset of the input data (MAX_TEST_ROWS).
                             # Useful for testing and debugging.
MAX_TEST_ROWS = 5            # Maximum number of rows to process when TEST_MODE is True.
QUERY_DELAY = 0.2            # Delay in seconds between successive API queries (to avoid hitting rate limits).
PUBLISH_YEAR_FILTER: List[int] = []  # Optional: List of years to filter Pinecone results by 'publish_year' metadata eg. [2024,2025].
                                     # If empty, no year filtering is applied.
LOG_BATCH_SIZE = 5           # Number of URLs to process before flushing the results to the output CSV.
                             # This helps in saving progress incrementally and managing memory.
MIN_SLUG_LENGTH = 3          # Minimum length for a URL slug segment to be considered meaningful for embedding.
                             # Shorter segments might be noise or less descriptive.

# Retry configuration for API calls (OpenAI and Pinecone).
# These parameters control how the `tenacity` library retries failed API requests.
MAX_RETRIES = 5              # Maximum number of times to retry an API call before giving up.
INITIAL_RETRY_DELAY = 1      # Initial delay in seconds before the first retry.
                             # Subsequent retries will have exponentially increasing delays.

# ─── SETUP LOGGING ─────────────────────────────────────────────────────────────
# Configure the logging system to output informational messages to the console.
logging.basicConfig(
    level=logging.INFO,  # Set the logging level to INFO, meaning INFO, WARNING, ERROR, CRITICAL messages will be shown.
    format="%(asctime)s %(levelname)s %(message)s" # Define the format of log messages (timestamp, level, message).
)

# ─── INITIALIZE OPENAI CLIENT & PINECONE ───────────────────────────────────────
# Initialize the OpenAI client once globally. This handles resource management efficiently
# as the client object manages connections and authentication.
client = OpenAI(api_key=OPENAI_API_KEY)
try:
    # Initialize the Pinecone client and connect to the specified index.
    pinecone = Pinecone(api_key=PINECONE_API_KEY)
    index = pinecone.Index(PINECONE_INDEX_NAME)
    logging.info(f"Connected to Pinecone index '{PINECONE_INDEX_NAME}'.")
except PineconeException as e:
    # Log an error if Pinecone initialization fails and re-raise.
    # Pinecone is a critical dependency for finding redirect candidates.
    logging.error(f"Pinecone init error: {e}")
    raise

# ─── HELPERS ───────────────────────────────────────────────────────────────────
def canonical_url(url: str) -> str:
    """
    Converts a given URL into its canonical form by:
    1. Stripping query strings (e.g., `?param=value`) and URL fragments (e.g., `#section`).
    2. Handling URL-encoded fragment markers (`%23`).
    3. Preserving the trailing slash if it was present in the original URL's path.
       This ensures consistency with the original site's URL structure.

    Args:
        url (str): The input URL.

    Returns:
        str: The canonicalized URL.
    """
    # Remove query parameters and URL fragments.
    temp = url.split('?', 1)[0]
    temp = temp.split('#', 1)[0]
    # Check for URL-encoded fragment markers and remove them.
    enc_idx = temp.lower().find('%23')
    if enc_idx != -1:
        temp = temp[:enc_idx]
    # Determine if the original URL path ended with a trailing slash.
    preserve_slash = temp.endswith('/')
    # Strip trailing slash if not originally present.
    if not preserve_slash:
        temp = temp.rstrip('/')
    return temp


def slug_from_url(url: str) -> str:
    """
    Extracts and joins meaningful, non-numeric path segments from a canonical URL
    to form a "slug" string. This slug can be used as text for embedding when
    a URL's title is not available.

    Args:
        url (str): The input URL.

    Returns:
        str: A hyphen-separated string of relevant slug parts.
    """
    clean = canonical_url(url) # Get the canonical version of the URL.
    path = urlparse(clean).path # Extract the path component of the URL.
    segments = [seg for seg in path.split('/') if seg] # Split path into segments and remove empty ones.

    # Filter segments based on criteria:
    # - Not purely numeric (e.g., '123' is excluded).
    # - Length is greater than or equal to MIN_SLUG_LENGTH.
    # - Contains at least one alphanumeric character (to exclude purely special character segments).
    parts = [seg for seg in segments
             if not seg.isdigit()
             and len(seg) >= MIN_SLUG_LENGTH
             and re.search(r'[A-Za-z0-9]', seg)]
    return '-'.join(parts) # Join the filtered parts with hyphens.

# ─── EMBEDDING GENERATION FUNCTION ─────────────────────────────────────────────
# Apply retry mechanism for OpenAI API errors. This makes the embedding generation
# more resilient to transient issues like network problems or API rate limits.
@retry(
    wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10), # Exponential backoff for retries.
    stop=stop_after_attempt(MAX_RETRIES), # Stop retrying after a maximum number of attempts.
    retry=retry_if_exception_type(Exception), # Retry on any Exception from OpenAI client (can be refined to openai.APIError if desired).
    reraise=True # Re-raise the exception if all retries fail, allowing the calling function to handle it.
)
def generate_embedding(text: str) -> Optional[List[float]]:
    """
    Generate a vector embedding for the given text using OpenAI's text-embedding-ada-002
    via the globally initialized OpenAI client. Includes retry logic for API calls.

    Args:
        text (str): The input text (e.g., URL title or slug) to embed.

    Returns:
        Optional[List[float]]: A list of floats representing the embedding vector,
                               or None if the input text is empty/whitespace or
                               if an unexpected error occurs after retries.
    """
    if not text or not text.strip():
        # If the text is empty or only whitespace, no embedding can be generated.
        return None
    try:
        resp = client.embeddings.create( # Use the globally initialized OpenAI client to get embeddings.
            model=OPENAI_EMBEDDING_MODEL_ID,
            input=text
        )
        return resp.data[0].embedding # Return the embedding vector (list of floats).
    except Exception as e:
        # Log a warning if an OpenAI error occurs, then re-raise to trigger the `tenacity` retry mechanism.
        logging.warning(f"OpenAI embedding error (retrying): {e}")
        raise # The `reraise=True` in the decorator will catch this and retry.

# ─── MAIN PROCESSING FUNCTION ─────────────────────────────────────────────────
def build_redirect_map(
    input_csv: str,
    output_csv: str,
    fetch_count: int,
    test_mode: bool
):
    """
    Builds a redirect map by processing URLs from an input CSV, generating
    embeddings, querying Pinecone for similar articles, and identifying
    suitable redirect candidates.

    Args:
        input_csv (str): Path to the input CSV file.
        output_csv (str): Path to the output CSV file for the redirect map.
        fetch_count (int): Number of candidates to fetch from Pinecone.
        test_mode (bool): If True, process only a limited number of rows.
    """
    # Read the input CSV file into a Pandas DataFrame.
    df = pd.read_csv(input_csv)
    required = {"URL", "Title", "primary_category"}
    # Validate that all required columns are present in the DataFrame.
    if not required.issubset(df.columns):
        raise ValueError(f"Input CSV must have columns: {required}")

    # Create a set of canonicalized input URLs for efficient lookup.
    # This is used to prevent an input URL from redirecting to itself or another input URL,
    # which could create redirect loops or redirect to a page that is also being redirected.
    input_urls = set(df["URL"].map(canonical_url))

    start_idx = 0
    # Implement resume functionality: if the output CSV already exists,
    # try to find the last processed URL and resume from the next row.
    if os.path.exists(output_csv):
        try:
            prev = pd.read_csv(output_csv)
        except EmptyDataError:
            # Handle case where the output CSV exists but is empty.
            prev = pd.DataFrame()
        if not prev.empty:
            # Get the last URL that was processed and written to the output file.
            last = prev["URL"].iloc[-1]
            # Find the index of this last URL in the original input DataFrame.
            idxs = df.index[df["URL"].map(canonical_url) == last].tolist()
            if idxs:
                # Set the starting index for processing to the row after the last processed URL.
                start_idx = idxs[0] + 1
                logging.info(f"Resuming from row {start_idx} after {last}.")

    # Determine the range of rows to process based on test_mode.
    if test_mode:
        end_idx = min(start_idx + MAX_TEST_ROWS, len(df))
        df_proc = df.iloc[start_idx:end_idx] # Select a slice of the DataFrame for testing.
        logging.info(f"Test mode: processing rows {start_idx} to {end_idx-1}.")
    else:
        df_proc = df.iloc[start_idx:] # Process all remaining rows.
        logging.info(f"Processing rows {start_idx} to {len(df)-1}.")

    total = len(df_proc) # Total number of URLs to process in this run.
    processed = 0        # Counter for successfully processed URLs.
    batch: List[Dict[str, Any]] = [] # List to store results before flushing to CSV.

    # Iterate over each row (URL) in the DataFrame slice to be processed.
    for _, row in df_proc.iterrows():
        raw_url = row["URL"] # Original URL from the input CSV.
        url = canonical_url(raw_url) # Canonicalized version of the URL.
        # Get title and category, handling potential missing values by defaulting to empty strings.
        title = row["Title"] if isinstance(row["Title"], str) else ""
        category = row["primary_category"] if isinstance(row["primary_category"], str) else ""

        # Determine the text to use for generating the embedding.
        # Prioritize the 'Title' if available, otherwise use a slug derived from the URL.
        if title.strip():
            text = title
        else:
            raw_slug = slug_from_url(raw_url)
            if not raw_slug or len(raw_slug) < MIN_SLUG_LENGTH:
                # If no meaningful slug can be extracted, skip this URL.
                logging.info(f"Skipping {raw_url}: insufficient slug context.")
                continue
            text = raw_slug.replace('-', ' ').replace('_', ' ') # Prepare slug for embedding by replacing hyphens with spaces.

        # Attempt to generate the embedding for the chosen text.
        # This call is wrapped in a try-except block to catch final failures after retries.
        try:
            embedding = generate_embedding(text)
        except Exception as e: # Catch any exception from generate_embedding after all retries.
            # If embedding generation fails even after retries, log the error and skip this URL.
            logging.error(f"Failed to generate embedding for {raw_url} after {MAX_RETRIES} retries: {e}")
            continue # Move to the next URL.

        if not embedding:
            # If `generate_embedding` returned None (e.g., empty text or unexpected error), skip.
            logging.info(f"Skipping {raw_url}: no embedding.")
            continue

        # Build metadata filter for Pinecone query.
        # This helps narrow down search results to more relevant candidates (e.g., by category or publish year).
        filt: Dict[str, Any] = {}
        if category:
            # Split category string by comma and strip whitespace for multiple categories.
            cats = [c.strip() for c in category.split(",") if c.strip()]
            if cats:
                filt["primary_category"] = {"$in": cats} # Filter by categories present in Pinecone metadata.
        if PUBLISH_YEAR_FILTER:
            filt["publish_year"] = {"$in": PUBLISH_YEAR_FILTER} # Filter by specified publish years.
        filt["id"] = {"$ne": url} # Exclude the current URL itself from the search results to prevent self-redirects.

        # Define a nested function for Pinecone query with retry mechanism.
        # This ensures that Pinecone queries are also robust against transient errors.
        @retry(
            wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10),
            stop=stop_after_attempt(MAX_RETRIES),
            retry=retry_if_exception_type(PineconeException), # Only retry if a PineconeException occurs.
            reraise=True # Re-raise the exception if all retries fail.
        )
        def query_pinecone_with_retry(embedding_vector, top_k_count, pinecone_filter):
            """
            Performs a Pinecone index query with retry logic.
            """
            return index.query(
                vector=embedding_vector,
                top_k=top_k_count,
                include_values=False, # We don't need the actual vector values in the response.
                include_metadata=False, # We don't need the metadata in the response for this logic.
                filter=pinecone_filter # Apply the constructed metadata filter.
            )

        # Attempt to query Pinecone for redirect candidates.
        try:
            res = query_pinecone_with_retry(embedding, fetch_count, filt)
        except PineconeException as e:
            # If Pinecone query fails after retries, log the error and skip this URL.
            logging.error(f"Failed to query Pinecone for {raw_url} after {MAX_RETRIES} retries: {e}")
            continue

        candidate = None # Initialize redirect candidate to None.
        score = None     # Initialize relevance score to None.

        # Iterate through the Pinecone query results (matches) to find a suitable candidate.
        for m in res.get("matches", []):
            cid = m.get("id") # Get the ID (URL) of the matched document in Pinecone.
            # A candidate is suitable if:
            # 1. It exists (cid is not None).
            # 2. It's not the original URL itself (to prevent self-redirects).
            # 3. It's not another URL from the input_urls set (to prevent redirecting to a page that's also being redirected).
            if cid and cid != url and cid not in input_urls:
                candidate = cid # Assign the first valid candidate found.
                score = m.get("score") # Get the relevance score of this candidate.
                break # Stop after finding the first suitable candidate (Pinecone returns by relevance).

        # Append the results for the current URL to the batch.
        batch.append({"URL": url, "Redirect Candidate": candidate, "Relevance Score": score})
        processed += 1 # Increment the counter for processed URLs.
        msg = f"Mapped {url} → {candidate}"
        if score is not None:
            msg += f" ({score:.4f})" # Add score to log message if available.
        logging.info(msg) # Log the mapping result.

        # Periodically flush the batch results to the output CSV.
        if processed % LOG_BATCH_SIZE == 0:
            out_df = pd.DataFrame(batch) # Convert the current batch to a DataFrame.
            # Determine file mode: 'a' (append) if file exists, 'w' (write) if new.
            mode = 'a' if os.path.exists(output_csv) else 'w'
            # Determine if header should be written (only for new files).
            header = not os.path.exists(output_csv)
            # Write the batch to the CSV.
            out_df.to_csv(output_csv, mode=mode, header=header, index=False)
            batch.clear() # Clear the batch after writing to free memory.
            if not test_mode:
                clear_output(wait=True) # Clear output in Jupyter for cleaner progress display.
                print(f"Progress: {processed} / {total}") # Print progress update.

        time.sleep(QUERY_DELAY) # Pause for a short delay to avoid overwhelming APIs.

    # After the loop, write any remaining items in the batch to the output CSV.
    if batch:
        out_df = pd.DataFrame(batch)
        mode = 'a' if os.path.exists(output_csv) else 'w'
        header = not os.path.exists(output_csv)
        out_df.to_csv(output_csv, mode=mode, header=header, index=False)

    logging.info(f"Completed. Total processed: {processed}") # Log completion message.

if __name__ == "__main__":
    # This block ensures that build_redirect_map is called only when the script is executed directly.
    # It passes the user-defined configuration parameters to the main function.
    build_redirect_map(INPUT_CSV, OUTPUT_CSV, CANDIDATE_FETCH_COUNT, TEST_MODE)

While the quality of the output may be considered satisfactory, it falls short of the quality observed with Google Vertex AI.

In the table below, you can see the difference in output quality.

| URL | Google Vertex AI | OpenAI |
| --- | --- | --- |
| /what-is-eat/ | /google-eat/what-is-it/ | /5-things-you-can-do-right-now-to-improve-your-eat-for-google/408423/ |
| /local-seo-for-lawyers/ | /law-firm-seo/what-is-law-firm-seo/ | /legal-seo-conference-exclusively-for-lawyers-spa/528149/ |

When it comes to SEO, even though Google Vertex AI is three times more expensive than OpenAI’s model, I prefer to use Vertex.

The quality of the results is significantly higher. While you may incur a greater cost per unit of text processed, you benefit from the superior output quality, which directly saves valuable time on reviewing and validating the results.

From my experience, it costs about $0.04 to process 20,000 URLs using Google Vertex AI.

While it’s said to be more expensive, it’s still ridiculously cheap, and you shouldn’t worry if you’re dealing with tasks involving a few thousand URLs.

In the case of processing 1 million URLs, the projected price would be approximately $2.
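That projection is simple proportional math: $0.04 / 20,000 URLs ≈ $0.000002 per URL, and $0.000002 × 1,000,000 URLs ≈ $2.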

If you still want a free method, use BERT and Llama models from Hugging Face to generate vector embeddings without paying a per-API-call fee.

The real cost comes from the compute power needed to run the models. Also note that if you will be querying with vectors generated from BERT or Llama, you must generate the vector embeddings of all your articles in Pinecone (or any other vector database) using those same models.
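As a minimal sketch of that free route, using the sentence-transformers library as one convenient way to run a BERT-family Hugging Face model locally (the model choice and its 384-dimensional output are illustrative assumptions, not a recommendation from this article):

from sentence_transformers import SentenceTransformer

# Runs locally: no per-API-call fee, but you pay with your own compute.
model = SentenceTransformer("all-MiniLM-L6-v2")  # small BERT-family model, 384 dimensions

texts = [
    "YouTube Retiring Video Responses on September 12",
    "what is eat",
]
embeddings = model.encode(texts)  # NumPy array with one row per text

print(embeddings.shape)  # (2, 384)

Keep in mind that your Pinecone index dimensionality must match whichever model you pick (384 here vs. 768 for text-embedding-005), and you cannot mix models between indexing and querying.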

In Summary: AI Is Your Powerful Ally

AI enables you to scale your SEO or marketing efforts and automate the most tedious tasks.

This doesn’t replace your expertise. It’s designed to level up your skills and equip you to face challenges with greater capability, making the process more engaging and fun.

Mastering these tools is essential for success. I’m passionate about writing about this topic to help beginners learn and feel inspired.

As we move forward in this series, we will explore how to use Google Vertex AI for building an internal linking WordPress plugin.


Featured Image: BestForBest/Shutterstock

Google: Database Speed Beats Page Count For Crawl Budget via @sejournal, @MattGSouthern

Google has confirmed that most websites still don’t need to worry about crawl budget unless they have over one million pages. However, there’s a twist.

Google Search Relations team member Gary Illyes revealed on a recent podcast that how quickly your database operates matters more than the number of pages you have.

This update comes five years after Google shared similar guidance on crawl budgets. Despite significant changes in web technology, Google’s advice remains unchanged.

The Million-Page Rule Stays The Same

During the Search Off the Record podcast, Illyes maintained Google’s long-held position when co-host Martin Splitt inquired about crawl budget thresholds.

Illyes stated:

“I would say 1 million is okay probably.”

This implies that sites with fewer than a million pages can stop worrying about their crawl budget.

What’s surprising is that this number has remained unchanged since 2020. The web has grown significantly, with an increase in JavaScript, dynamic content, and more complex websites. Yet, Google’s threshold has remained the same.

Your Database Speed Is What Matters

Here’s the big news: Illyes revealed that slow databases hinder crawling more than having a large number of pages.

Illyes explained:

“If you are making expensive database calls, that’s going to cost the server a lot.”

A site with 500,000 pages but slow database queries might face more crawl issues than a site with 2 million fast-loading static pages.

What does this mean? You need to evaluate your database performance, not just count the number of pages. Sites with dynamic content, complex queries, or real-time data must prioritize speed and performance.
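As a rough, self-contained illustration of what evaluating database performance can look like in practice (sqlite3 is used here purely as a stand-in for whatever database your site actually runs on), timing individual queries quickly reveals where dynamic pages get slow:

import sqlite3
import time

# Stand-in database; in a real audit you would point this at your CMS or product database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, slug TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages (slug, body) VALUES (?, ?)",
    [(f"post-{i}", "lorem ipsum") for i in range(200_000)],
)

def time_query(sql, params=()):
    """Run a query and return (row_count, elapsed_milliseconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return len(rows), round((time.perf_counter() - start) * 1000, 2)

# The same lookup before and after adding an index on the queried column.
print("no index  :", time_query("SELECT id FROM pages WHERE slug = ?", ("post-123456",)))
conn.execute("CREATE INDEX idx_pages_slug ON pages (slug)")
print("with index:", time_query("SELECT id FROM pages WHERE slug = ?", ("post-123456",)))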

The Real Resource Hog: Indexing, Not Crawling

Illyes shared a sentiment that contradicts what many SEOs believe.

He said:

“It’s not crawling that is eating up the resources, it’s indexing and potentially serving or what you are doing with the data when you are processing that data.”

Consider what this means. If crawling doesn’t consume many resources, then blocking Googlebot may not be helpful. Instead, focus on making your content easier for Google to process after it has been crawled.

How We Got Here

The podcast provided some context about scale. In 1994, the World Wide Web Worm indexed only 110,000 pages, while WebCrawler indexed 2 million. Illyes called these numbers “cute” compared to today.

This helps explain why the one-million-page mark has remained unchanged. What once seemed huge in the early web is now just a medium-sized site. Google’s systems have expanded to manage this without altering the threshold.

Why The Threshold Remains Stable

Google has been striving to reduce its crawling footprint. Illyes revealed why that’s a challenge.

He explained:

“You saved seven bytes from each request that you make and then this new product will add back eight.”

This push-and-pull between efficiency improvements and new features helps explain why the crawl budget threshold remains consistent. While Google’s infrastructure evolves, the basic math regarding when crawl budget matters stays unchanged.

What You Should Do Now

Based on these insights, here’s what you should focus on:

Sites Under 1 Million Pages:
Continue with your current strategy. Prioritize excellent content and user experience. Crawl budget isn’t a concern for you.

Larger Sites:
Enhance database efficiency as your new priority. Review:

  • Query execution time
  • Caching effectiveness
  • Speed of dynamic content generation

All Sites:
Redirect focus from crawl prevention to indexing optimization. Since crawling isn’t the resource issue, assist Google in processing your content more efficiently.

Key Technical Checks:

  • Database query performance
  • Server response times
  • Content delivery optimization
  • Proper caching implementation

Looking Ahead

Google’s consistent crawl budget guidance demonstrates that some SEO fundamentals are indeed fundamental. Most sites don’t need to worry about it.

However, the insight regarding database efficiency shifts the conversation for larger sites. It’s not just about the number of pages you have; it’s about how efficiently you serve them.

For SEO professionals, this means incorporating database performance into your technical SEO audits. For developers, it underscores the significance of query optimization and caching strategies.

Five years from now, the million-page threshold might still exist. But sites that optimize their database performance today will be prepared for whatever comes next.

Listen to the full podcast episode below:


Featured Image: Novikov Aleksey/Shutterstock

Google’s Gary Illyes Warns AI Agents Will Create Web Congestion via @sejournal, @MattGSouthern

A Google engineer has warned that AI agents and automated bots will soon flood the internet with traffic.

Gary Illyes, who works on Google’s Search Relations team, said “everyone and my grandmother is launching a crawler” during a recent podcast.

The warning comes from Google’s latest Search Off the Record podcast episode.

AI Agents Will Strain Websites

During his conversation with fellow Search Relations team member Martin Splitt, Illyes warned that AI agents and “AI shenanigans” will be significant sources of new web traffic.

Illyes said:

“The web is getting congested… It’s not something that the web cannot handle… the web is designed to be able to handle all that traffic even if it’s automatic.”

This surge occurs as businesses deploy AI tools for content creation, competitor research, market analysis, and data gathering. Each tool requires crawling websites to function, and with the rapid growth of AI adoption, this traffic is expected to increase.

How Google’s Crawler System Works

The podcast provides a detailed discussion of Google’s crawling setup. Rather than employing different crawlers for each product, Google has developed one unified system.

Google Search, AdSense, Gmail, and other products utilize the same crawler infrastructure. Each one identifies itself with a different user agent name, but all adhere to the same protocols for robots.txt and server health.

Illyes explained:

“You can fetch with it from the internet but you have to specify your own user agent string.”

This unified approach ensures that all Google crawlers adhere to the same protocols and scale back when websites encounter difficulties.

The Real Resource Hog? It’s Not Crawling

Illyes challenged conventional SEO wisdom with a potentially controversial claim: crawling doesn’t consume significant resources.

Illyes stated:

“It’s not crawling that is eating up the resources, it’s indexing and potentially serving or what you are doing with the data.”

He even joked he would “get yelled at on the internet” for saying this.

This perspective suggests that fetching pages uses minimal resources compared to processing and storing the data. For those concerned about crawl budget, this could change optimization priorities.

From Thousands to Trillions: The Web’s Growth

The Googlers provided historical context. In 1994, the World Wide Web Worm search engine indexed only 110,000 pages, whereas WebCrawler managed to index 2 million. Today, individual websites can exceed millions of pages.

This rapid growth necessitated technological evolution. Crawlers progressed from basic HTTP 1.1 protocols to modern HTTP/2 for faster connections, with HTTP/3 support on the horizon.

Google’s Efficiency Battle

Google spent last year trying to reduce its crawling footprint, acknowledging the burden on site owners. However, new challenges continue to arise.

Illyes explained the dilemma:

“You saved seven bytes from each request that you make and then this new product will add back eight.”

Every efficiency gain is offset by new AI products requiring more data. This is a cycle that shows no signs of stopping.

What Website Owners Should Do

The upcoming traffic surge necessitates action in several areas:

  • Infrastructure: Current hosting may not support the expected load. Assess server capacity, CDN options, and response times before the influx occurs.
  • Access Control: Review robots.txt rules to control which AI crawlers can access your site. Block unnecessary bots while allowing legitimate ones to function properly.
  • Database Performance: Illyes specifically pointed out “expensive database calls” as problematic. Optimize queries and implement caching to alleviate server strain.
  • Monitoring: Differentiate between legitimate crawlers, AI agents, and malicious bots through thorough log analysis and performance tracking.
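Tying the monitoring point above to something concrete, here is a rough log-parsing sketch (the log path and the regex assume a standard combined access log format; adjust both to your server setup). Counting hits per user agent is often enough to see which crawlers and AI agents generate the most traffic:

import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical path to a combined-format access log

# In the combined log format, the user agent is the last quoted field on each line.
ua_pattern = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="ignore") as f:
    for line in f:
        match = ua_pattern.search(line.strip())
        if match:
            counts[match.group(1)] += 1

for user_agent, hits in counts.most_common(20):
    print(f"{hits:>8}  {user_agent}")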

The Path Forward

Illyes pointed to Common Crawl as a potential model, which crawls once and shares data publicly, reducing redundant traffic. Similar collaborative solutions may emerge as the web adapts.

While Illyes expressed confidence in the web’s ability to manage increased traffic, the message is clear: AI agents are arriving in massive numbers.

Websites that strengthen their infrastructure now will be better equipped to weather the storm. Those who wait may find themselves overwhelmed when the full force of the wave hits.

Listen to the full podcast episode below:


Featured Image: Collagery/Shutterstock

How To Automate SEO Keyword Clustering By Search Intent With Python via @sejournal, @andreasvoniatis

There’s a lot to know about search intent, from using deep learning to infer search intent by classifying text and breaking down SERP titles using Natural Language Processing (NLP) techniques, to clustering based on semantic relevance, with the benefits explained.

Not only do we know the benefits of deciphering search intent, but we also have a number of techniques at our disposal for scale and automation.

So, why do we need another article on automating search intent?

Search intent is ever more important now that AI search has arrived.

While surfacing more results was generally the norm in the 10 blue links search era, the opposite is true with AI search technology, as these platforms generally seek to minimize computing costs (per FLOP) in order to deliver the service.

SERPs Still Contain The Best Insights For Search Intent

The techniques so far involve doing your own AI, that is, getting all of the copy from titles of the ranking content for a given keyword and then feeding it into a neural network model (which you have to then build and test) or using NLP to cluster keywords.

What if you don’t have the time or the knowledge to build your own AI or invoke the OpenAI API?

While cosine similarity has been touted as the answer to helping SEO professionals navigate the demarcation of topics for taxonomy and site structures, I still maintain that search clustering by SERP results is a far superior method.

That’s because AI is very keen to ground its results on SERPs and for good reason – it’s modelled on user behaviors.

There is another way that uses Google’s very own AI to do the work for you, without having to scrape all the SERPs content and build an AI model.

Let’s assume that Google ranks site URLs by the likelihood of the content satisfying the user query in descending order. It follows that if the intent for two keywords is the same, then the SERPs are likely to be similar.

For years, many SEO professionals have compared SERP results for keywords to infer shared (or differing) search intent and to stay on top of core updates, so this is nothing new.

The value-add here is the automation and scaling of this comparison, offering both speed and greater precision.

How To Cluster Keywords By Search Intent At Scale Using Python (With Code)

Assuming you have your SERPs results in a CSV download, let’s import it into your Python notebook.

1. Import The List Into Your Python Notebook

import pandas as pd
import numpy as np

serps_input = pd.read_csv('data/sej_serps_input.csv')
del serps_input['Unnamed: 0']
serps_input

Below is the SERPs file now imported into a Pandas dataframe.

Image from author, April 2025

2. Filter Data For Page 1

We want to compare the Page 1 results of each SERP between keywords.

We’ll split the dataframe into mini keyword dataframes to run the filtering function before recombining into a single dataframe, because we want to filter at the keyword level:

# Split 
serps_grpby_keyword = serps_input.groupby("keyword")
k_urls = 15

# Apply Combine
def filter_k_urls(group_df):
    filtered_df = group_df.loc[group_df['url'].notnull()]
    filtered_df = filtered_df.loc[filtered_df['rank'] <= k_urls]
    return filtered_df
filtered_serps = serps_grpby_keyword.apply(filter_k_urls)

# Combine
## Add prefix to column names
#normed = normed.add_prefix('normed_')

# Concatenate with initial data frame
filtered_serps_df = pd.concat([filtered_serps],axis=0)
del filtered_serps_df['keyword']
filtered_serps_df = filtered_serps_df.reset_index()
del filtered_serps_df['level_1']
filtered_serps_df
SERPs file imported into a Pandas dataframe.Image from author, April 2025

3. Convert Ranking URLs To A String

Because each keyword has multiple ranking URLs (one row per URL), we need to compress those URLs into a single string that represents the keyword’s SERP.

Here’s how:


# convert results to strings using Split Apply Combine 
filtserps_grpby_keyword = filtered_serps_df.groupby("keyword")

def string_serps(df):
    # Join the URLs with a space so they can be tokenized again later.
    df['serp_string'] = ' '.join(df['url'])
    return df

# Combine
strung_serps = filtserps_grpby_keyword.apply(string_serps)

# Concatenate with initial data frame and clean 
strung_serps = pd.concat([strung_serps],axis=0) 
strung_serps = strung_serps[['keyword', 'serp_string']]#.head(30) 
strung_serps = strung_serps.drop_duplicates() 
strung_serps

Below shows the SERP compressed into a single line for each keyword.

SERP compressed into single line for each keyword.Image from author, April 2025

4. Compare SERP Distance

To perform the comparison, we now need every keyword’s SERP paired with every other keyword’s SERP:


# align serps
def serps_align(k, df):
    prime_df = df.loc[df.keyword == k]
    prime_df = prime_df.rename(columns = {"serp_string" : "serp_string_a", 'keyword': 'keyword_a'})
    comp_df = df.loc[df.keyword != k].reset_index(drop=True)
    prime_df = prime_df.loc[prime_df.index.repeat(len(comp_df.index))].reset_index(drop=True)
    prime_df = pd.concat([prime_df, comp_df], axis=1)
    prime_df = prime_df.rename(columns = {"serp_string" : "serp_string_b", 'keyword': 'keyword_b', "serp_string_a" : "serp_string", 'keyword_a': 'keyword'})
    return prime_df

columns = ['keyword', 'serp_string', 'keyword_b', 'serp_string_b']
matched_serps = pd.DataFrame(columns=columns)
matched_serps = matched_serps.fillna(0)
queries = strung_serps.keyword.to_list()

for q in queries:
    temp_df = serps_align(q, strung_serps)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
    matched_serps = pd.concat([matched_serps, temp_df], ignore_index=True)

matched_serps

Compare SERP similarity.

The above shows all of the keyword SERP pair combinations, making it ready for SERP string comparison.

There is no open-source library that compares list objects by order, so the function has been written for you below.

The function “serps_similarity” compares the overlap of sites and the order of those sites between SERPs.


import py_stringmatching as sm
ws_tok = sm.WhitespaceTokenizer()

# Only compare the top k_urls results 
def serps_similarity(serps_str1, serps_str2, k=15):
    denom = k+1
    norm = sum([2*(1/i - 1.0/(denom)) for i in range(1, denom)])
    #use to tokenize the URLs
    ws_tok = sm.WhitespaceTokenizer()
    #keep only first k URLs
    serps_1 = ws_tok.tokenize(serps_str1)[:k]
    serps_2 = ws_tok.tokenize(serps_str2)[:k]
    #get positions of matches 
    match = lambda a, b: [b.index(x)+1 if x in b else None for x in a]
    #positions intersections of form [(pos_1, pos_2), ...]
    pos_intersections = [(i+1,j) for i,j in enumerate(match(serps_1, serps_2)) if j is not None] 
    pos_in1_not_in2 = [i+1 for i,j in enumerate(match(serps_1, serps_2)) if j is None]
    pos_in2_not_in1 = [i+1 for i,j in enumerate(match(serps_2, serps_1)) if j is None]
    
    a_sum = sum([abs(1/i -1/j) for i,j in pos_intersections])
    b_sum = sum([abs(1/i -1/denom) for i in pos_in1_not_in2])
    c_sum = sum([abs(1/i -1/denom) for i in pos_in2_not_in1])

    intent_prime = a_sum + b_sum + c_sum
    intent_dist = 1 - (intent_prime/norm)
    return intent_dist

# Apply the function
matched_serps['si_simi'] = matched_serps.apply(lambda x: serps_similarity(x.serp_string, x.serp_string_b), axis=1)

# This is what you get
matched_serps[['keyword', 'keyword_b', 'si_simi']]

Overlap of sites and the order of those sites between SERPs.

Now that the comparisons have been executed, we can start clustering keywords.

We will be treating any keyword pairs that have a weighted similarity of 40% or more as sharing the same search intent.


# group keywords by search intent
simi_lim = 0.4

# join search volume
keysv_df = serps_input[['keyword', 'search_volume']].drop_duplicates()
keysv_df.head()

# append topic vols
keywords_crossed_vols = matched_serps.merge(keysv_df, on = 'keyword', how = 'left')
keywords_crossed_vols = keywords_crossed_vols.rename(columns = {'keyword': 'topic', 'keyword_b': 'keyword',
                                                                'search_volume': 'topic_volume'})

# sim si_simi
keywords_crossed_vols.sort_values('topic_volume', ascending = False)

# strip NAN
keywords_filtered_nonnan = keywords_crossed_vols.dropna()
keywords_filtered_nonnan

We now have the potential topic name, the SERP similarity between keywords, and the search volume of each.
Clustering keywords.

You’ll note that keyword and keyword_b have been renamed to topic and keyword, respectively.

Now we’re going to iterate over the columns in the dataframe using the lambda technique.

The lambda technique (passing a lambda function to .apply()) is an efficient way to process rows in a Pandas dataframe because it avoids the overhead of looping with the .iterrows() function.

Here goes:


queries_in_df = list(set(matched_serps['keyword'].to_list()))
# Dictionaries that will hold clustered (similar) and unclustered keywords.
sim_topic_groups = {}
non_sim_topic_groups = {}

def dict_key(dicto, keyo):
    return keyo in dicto

def dict_values(dicto, vala):
    return any(vala in val for val in dicto.values())

def what_key(dicto, vala):
    for k, v in dicto.items():
            if vala in v:
                return k

def find_topics(si, keyw, topc):
    if (si >= simi_lim):

        if (not dict_key(sim_topic_groups, keyw)) and (not dict_key(sim_topic_groups, topc)): 

            if (not dict_values(sim_topic_groups, keyw)) and (not dict_values(sim_topic_groups, topc)): 
                # Start a new group containing both keywords.
                sim_topic_groups[keyw] = [keyw, topc]
                if dict_key(non_sim_topic_groups, keyw):
                    non_sim_topic_groups.pop(keyw)
                if dict_key(non_sim_topic_groups, topc): 
                    non_sim_topic_groups.pop(topc)
            if (dict_values(sim_topic_groups, keyw)) and (not dict_values(sim_topic_groups, topc)): 
                d_key = what_key(sim_topic_groups, keyw)
                sim_topic_groups[d_key].append(topc)
                if dict_key(non_sim_topic_groups, keyw):
                    non_sim_topic_groups.pop(keyw)
                if dict_key(non_sim_topic_groups, topc): 
                    non_sim_topic_groups.pop(topc)
            if (not dict_values(sim_topic_groups, keyw)) and (dict_values(sim_topic_groups, topc)): 
                d_key = what_key(sim_topic_groups, topc)
                sim_topic_groups[d_key].append(keyw)
                if dict_key(non_sim_topic_groups, keyw):
                    non_sim_topic_groups.pop(keyw)
                if dict_key(non_sim_topic_groups, topc): 
                    non_sim_topic_groups.pop(topc) 

        elif (keyw in sim_topic_groups) and (not topc in sim_topic_groups): 
            sim_topic_groups[keyw].append(topc)
            sim_topic_groups[keyw].append(keyw)
            if keyw in non_sim_topic_groups:
                non_sim_topic_groups.pop(keyw)
            if topc in non_sim_topic_groups: 
                non_sim_topic_groups.pop(topc)
        elif (not keyw in sim_topic_groups) and (topc in sim_topic_groups):
            sim_topic_groups[topc].append(keyw)
            sim_topic_groups[topc].append(topc)
            if keyw in non_sim_topic_groups:
                non_sim_topic_groups.pop(keyw)
            if topc in non_sim_topic_groups: 
                non_sim_topic_groups.pop(topc)
        elif (keyw in sim_topic_groups) and (topc in sim_topic_groups):
            if len(sim_topic_groups[keyw]) > len(sim_topic_groups[topc]):
                sim_topic_groups[keyw].append(topc) 
                [sim_topic_groups[keyw].append(x) for x in sim_topic_groups.get(topc)] 
                sim_topic_groups.pop(topc)

            elif len(sim_topic_groups[keyw]) < len(sim_topic_groups[topc]):
                sim_topic_groups[topc].append(keyw)
                [sim_topic_groups[topc].append(x) for x in sim_topic_groups.get(keyw)]
                sim_topic_groups.pop(keyw)
            elif len(sim_topic_groups[keyw]) == len(sim_topic_groups[topc]):
                # The two groups mirror each other, so keep only one of them.
                if sim_topic_groups[keyw] == [topc] and sim_topic_groups[topc] == [keyw]:
                    sim_topic_groups.pop(keyw)

    elif si < simi_lim:
  
        if (not dict_key(non_sim_topic_groups, keyw)) and (not dict_key(sim_topic_groups, keyw)) and (not dict_values(sim_topic_groups,keyw)): 
            non_sim_topic_groups[keyw] = [keyw]
        if (not dict_key(non_sim_topic_groups, topc)) and (not dict_key(sim_topic_groups, topc)) and (not dict_values(sim_topic_groups,topc)): 
            non_sim_topic_groups[topc] = [topc]
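
The function above defines the clustering logic, but it still has to be applied to every keyword pair, and the resulting groups renumbered, before they can be displayed. A minimal sketch of that step, assuming the matched_serps dataframe and the sim_topic_groups dictionary from the code above:


# Apply the clustering function to every keyword pair.
matched_serps.apply(lambda x: find_topics(x.si_simi, x.keyword, x.keyword_b), axis=1)

# Renumber the clusters (1, 2, 3, ...) so they can be converted into a dataframe below.
topic_groups_numbered = {i + 1: group for i, group in enumerate(sim_topic_groups.values())}
topic_groups_numbered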

Below shows a dictionary containing all the keywords clustered by search intent into numbered groups:

{1: ['fixed rate isa',
  'isa rates',
  'isa interest rates',
  'best isa rates',
  'cash isa',
  'cash isa rates'],
 2: ['child savings account', 'kids savings account'],
 3: ['savings account',
  'savings account interest rate',
  'savings rates',
  'fixed rate savings',
  'easy access savings',
  'fixed rate bonds',
  'online savings account',
  'easy access savings account',
  'savings accounts uk'],
 4: ['isa account', 'isa', 'isa savings']}

Let’s stick that into a dataframe:


topic_groups_lst = []

for k, l in topic_groups_numbered.items():
    for v in l:
        topic_groups_lst.append([k, v])

topic_groups_dictdf = pd.DataFrame(topic_groups_lst, columns=['topic_group_no', 'keyword'])
                                
topic_groups_dictdf
Topic group dataframe.Image from author, April 2025

The search intent groups above are a good approximation of what an SEO expert would likely produce by grouping the keywords manually.

Although we only used a small set of keywords, the method can obviously be scaled to thousands (if not more).

Activating The Outputs To Make Your Search Better

Of course, the above could be taken further using neural networks, processing the ranking content for more accurate clusters and cluster group naming, as some of the commercial products out there already do.

For now, with this output, you can:

  • Incorporate this into your own SEO dashboard systems to make your trends and SEO reporting more meaningful.
  • Build better paid search campaigns by structuring your Google Ads accounts by search intent for a higher Quality Score.
  • Merge redundant facet ecommerce search URLs.
  • Structure a shopping site’s taxonomy according to search intent instead of a typical product catalog.

I’m sure there are more applications that I haven’t mentioned – feel free to comment on any important ones.

In any case, your SEO keyword research just got that little bit more scalable, accurate, and quicker!

Download the full code here for your own use.



Featured Image: Buch and Bee/Shutterstock

HTTP Status Codes Google Cares About (And Those It Ignores) via @sejournal, @MattGSouthern

Google’s Search Relations team recently shared insights about how the search engine handles HTTP status codes during a “Search Off the Record” podcast.

Gary Illyes and Martin Splitt from Google discussed several status code categories commonly misunderstood by SEO professionals.

How Google Views Certain HTTP Status Codes

While the podcast didn’t cover every HTTP status code (obviously, 200 OK remains fundamental), it focused on categories that often cause confusion among SEO practitioners.

Splitt emphasized during the discussion:

“These status codes are actually important for site owners and SEOs because they tell a story about what happened when a particular request came in.”

The podcast revealed several notable points about how Google processes specific status code categories.

The 1xx Codes: Completely Ignored

Google’s crawlers ignore all status codes in the 1xx range, including newer features like “early hints” (HTTP 103).

Illyes explained:

“We are just going to pass through [1xx status codes] anyway without even noticing that something was in the 100 range. We just notice the next non-100 status code instead.”

This means implementing early hints might help user experience, but won’t directly benefit your SEO.

Redirects: Simpler Than Many SEOs Believe

While SEO professionals often debate which redirect type to use (301, 302, 307, 308), Google’s approach focuses mainly on whether redirects are permanent or temporary.

Illyes stated:

“For Google search specifically, it’s just like ‘yeah, it was a redirection.’ We kind of care about in canonicalization whether something was temporary or permanent, but otherwise we just [see] it was a redirection.”

This doesn’t mean redirect implementation is unimportant, but it suggests the permanent vs. temporary distinction is more critical than the specific code number.

Client Error Codes: Standard Processing

The 4xx range of status codes functions largely as expected.

Google appropriately processes standard codes like 404 (not found) and 410 (gone), which remain essential for proper crawl management.

The team humorously mentioned status code 418 (“I’m a teapot”), an April Fool’s joke in the standards, which has no SEO impact.

Network Errors in Search Console: Looking Deeper

Many mysterious network errors in Search Console originate from deeper technical layers below HTTP.

Illyes explained:

“Every now and then you would get these weird messages in Search Console that like there was something with the network… and that can actually happen in these layers that we are talking about.”

When you see network-related crawl errors, you may need to investigate lower-level protocols like TCP, UDP, or DNS.

What Wasn’t Discussed But Still Matters

The podcast didn’t cover many status codes that definitely matter to Google, including:

  • 200 OK (the standard successful response)
  • 500-level server errors (which can affect crawling and indexing)
  • 429 Too Many Requests (rate limiting)
  • Various other specialized codes

Practical Takeaways

While this wasn’t a comprehensive guide to HTTP status codes, the discussion revealed several practical insights:

  • For redirects, focus primarily on the permanent vs. temporary distinction
  • Don’t invest resources in optimizing 1xx responses specifically for Google
  • When troubleshooting network errors, look beyond HTTP to deeper protocol layers
  • Continue to implement standard status codes correctly, including those not specifically discussed

As web technology evolves with HTTP/3 and QUIC, understanding how Google processes these signals can help you build more effective technical SEO strategies without overcomplicating implementation.


Featured Image: Roman Samborskyi/Shutterstock

Create Your Own ChatGPT Agent For On-Page SEO Audits via @sejournal, @makhyan

ChatGPT is more than just a prompting and response platform. You can send prompts to ask for help with SEO, but it becomes more powerful the moment that you make your own agent.

I conduct many SEO audits – it’s a necessity for an enterprise site – so I was looking for a way to streamline some of these processes.

How did I do it? By creating a ChatGPT agent that I’m going to share with you so that you can customize and change it to meet your needs.

I’ll keep things as “untechnical” as possible, but just follow the instructions, and everything should work.

I’m going to explain the following steps:

  1. Configuration of your own ChatGPT.
  2. Creating your own Cloudflare code to fetch a page’s HTML data.
  3. Putting your SEO audit agents to work.

At the end, you’ll have a bot that provides you with information, such as:

Custom ChatGPT for SEOCustom ChatGPT for SEO (Image from author, May 2025)

You’ll also receive a list of actionable steps to take to improve your SEO based on the agent’s findings.

Creating A Cloudflare Pages Worker For Your Agent

Cloudflare Pages workers help your agent gather information from the website you’re trying to parse and view its current state of SEO.

You can use a free account to get started, and you can register by doing the following:

  1. Going to http://pages.dev/
  2. Creating an account

I used Google to sign up because it’s easier, but choose the method you’re most comfortable with. You’ll end up on a screen that looks something like this:

Cloudflare DashboardCloudflare Dashboard (Screenshot from Cloudflare, May 2025)

Navigate to Add > Workers.

Add a Cloudflare WorkerAdd a Cloudflare Worker (Screenshot from Cloudflare, May 2025)

You can then select a template, import a repository, or start with Hello World! I chose the Hello World option, as it’s the easiest one to use.

Selecting Cloudflare WorkerSelecting Cloudflare Worker (Screenshot from Cloudflare, May 2025)

Go through the next screen and hit “Deploy.” You’ll end up on a screen that says, “Success! Your project is deployed to Region: Earth.”

Don’t click off this page.

Instead, click on “Edit code,” remove all of the existing code, and enter the following code into the editor:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  const { searchParams } = new URL(request.url);
  const targetUrl = searchParams.get('url');
  const userAgentName = searchParams.get('user-agent');

  if (!targetUrl) {
    return new Response(
      JSON.stringify({ error: "Missing 'url' parameter" }),
      { status: 400, headers: { 'Content-Type': 'application/json' } }
    );
  }

  const userAgents = {
    googlebot: 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.184 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    samsung5g: 'Mozilla/5.0 (Linux; Android 13; SM-S901B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36',
    iphone13pmax: 'Mozilla/5.0 (iPhone14,3; U; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/19A346 Safari/602.1',
    msedge: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    safari: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
    bingbot: 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/',
    chrome: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
  };

  const userAgent = userAgents[userAgentName] || userAgents.chrome;

  const headers = {
    'User-Agent': userAgent,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip',
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
  };


  try {
    let redirectChain = [];
    let currentUrl = targetUrl;
    let finalResponse;

    // Follow redirects
    while (true) {
      const response = await fetch(currentUrl, { headers, redirect: 'manual' });

      // Add the current URL and status to the redirect chain only if it's not already added
      if (!redirectChain.length || redirectChain[redirectChain.length - 1].url !== currentUrl) {
        redirectChain.push({ url: currentUrl, status: response.status });
      }

      // Check if the response is a redirect
      if (response.status >= 300 && response.status < 400 && response.headers.get('location')) {
        const redirectUrl = new URL(response.headers.get('location'), currentUrl).href;
        currentUrl = redirectUrl; // Follow the redirect
      } else {
        // No more redirects; capture the final response
        finalResponse = response;
        break;
      }
    }

    if (!finalResponse.ok) {
      throw new Error(`Request to ${targetUrl} failed with status code: ${finalResponse.status}`);
    }

    const html = await finalResponse.text();

    // Robots.txt
    const domain = new URL(targetUrl).origin;
    const robotsTxtResponse = await fetch(`${domain}/robots.txt`, { headers });
    const robotsTxt = robotsTxtResponse.ok ? await robotsTxtResponse.text() : 'robots.txt not found';
    const sitemapMatches = robotsTxt.match(/Sitemap:\s*(https?:\/\/[^\s]+)/gi) || [];
    const sitemaps = sitemapMatches.map(sitemap => sitemap.replace('Sitemap: ', '').trim());

    // Metadata
    // Standard tag-matching patterns; adjust them to your markup if needed.
    const titleMatch = html.match(/<title[^>]*>\s*(.*?)\s*<\/title>/i);
    const title = titleMatch ? titleMatch[1] : 'No Title Found';

    const metaDescriptionMatch = html.match(/<meta[^>]+name=["']description["'][^>]+content=["'](.*?)["']/i);
    const metaDescription = metaDescriptionMatch ? metaDescriptionMatch[1] : 'No Meta Description Found';

    const canonicalMatch = html.match(/<link[^>]+rel=["']canonical["'][^>]+href=["'](.*?)["']/i);
    const canonical = canonicalMatch ? canonicalMatch[1] : 'No Canonical Tag Found';

    // Open Graph and Twitter Info
    const ogTags = {
      ogTitle: (html.match(/<meta[^>]+property=["']og:title["'][^>]+content=["'](.*?)["']/i) || [])[1] || 'No Open Graph Title',
      ogDescription: (html.match(/<meta[^>]+property=["']og:description["'][^>]+content=["'](.*?)["']/i) || [])[1] || 'No Open Graph Description',
      ogImage: (html.match(/<meta[^>]+property=["']og:image["'][^>]+content=["'](.*?)["']/i) || [])[1] || 'No Open Graph Image',
    };

    const twitterTags = {
      // name= or property= is captured as group 1, the content value as group 2.
      twitterTitle: (html.match(/<meta[^>]+(name|property)=["']twitter:title["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Title',
      twitterDescription: (html.match(/<meta[^>]+(name|property)=["']twitter:description["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Description',
      twitterImage: (html.match(/<meta[^>]+(name|property)=["']twitter:image["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Image',
      twitterCard: (html.match(/<meta[^>]+(name|property)=["']twitter:card["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Card Type',
      twitterCreator: (html.match(/<meta[^>]+(name|property)=["']twitter:creator["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Creator',
      twitterSite: (html.match(/<meta[^>]+(name|property)=["']twitter:site["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Site',
      twitterLabel1: (html.match(/<meta[^>]+(name|property)=["']twitter:label1["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Label 1',
      twitterData1: (html.match(/<meta[^>]+(name|property)=["']twitter:data1["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Data 1',
      twitterLabel2: (html.match(/<meta[^>]+(name|property)=["']twitter:label2["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Label 2',
      twitterData2: (html.match(/<meta[^>]+(name|property)=["']twitter:data2["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Data 2',
      twitterAccountId: (html.match(/<meta[^>]+(name|property)=["']twitter:account_id["'][^>]+content=["'](.*?)["']/i) || [])[2] || 'No Twitter Account ID',
    };

    // Headings
    const headings = {
      h1: [...html.matchAll(/<h1[^>]*>(.*?)<\/h1>/gis)].map(match => match[1]),
      h2: [...html.matchAll(/<h2[^>]*>(.*?)<\/h2>/gis)].map(match => match[1]),
      h3: [...html.matchAll(/<h3[^>]*>(.*?)<\/h3>/gis)].map(match => match[1]),
    };

    // Images
    const imageMatches = [...html.matchAll(/<img[^>]*src="(.*?)"[^>]*>/gi)];
    const images = imageMatches.map(img => img[1]);
    const imagesWithoutAlt = imageMatches.filter(img => !/alt=".*?"/i.test(img[0])).length;

    // Links
    const linkMatches = [...html.matchAll(/<a[^>]*href="(.*?)"[^>]*>/gi)];
    const links = {
      internal: linkMatches.filter(link => link[1].startsWith(domain)).map(link => link[1]),
      external: linkMatches.filter(link => !link[1].startsWith(domain) && link[1].startsWith('http')).map(link => link[1]),
    };

    // Schemas (JSON-LD)
    const schemaJSONLDMatches = [...html.matchAll(/<script[^>]+type="application\/ld\+json"[^>]*>(.*?)<\/script>/gis)];
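
Once the worker is deployed, you can sanity-check it by requesting it with a target URL and one of the user-agent keys defined above. A minimal sketch in Python, where the workers.dev hostname is a placeholder for the subdomain Cloudflare assigns your project:

import requests

# Placeholder hostname - replace it with the URL shown on your worker's deploy screen.
WORKER_URL = "https://your-worker.your-subdomain.workers.dev/"

params = {
    "url": "https://www.example.com/",  # the page you want the agent to audit
    "user-agent": "googlebot",          # any key from the userAgents map above
}

response = requests.get(WORKER_URL, params=params, timeout=30)
print(response.status_code)
print(response.text[:500])  # first part of whatever payload the worker returns
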
Google Reminds That Hreflang Tags Are Hints, Not Directives via @sejournal, @MattGSouthern

A recent exchange between SEO professional Neil McCarthy and Google Search Advocate John Mueller has highlighted how Google treats hreflang tags.

McCarthy observed pages intended for Belgian French users (fr-be) appearing in France. Mueller clarified that hreflang is a suggestion, not a guarantee.

Here’s what this interaction shows us about hreflang, canonical tags, and international SEO.

French-Belgian Pages in French Search Results

McCarthy noticed that pages tagged for French-Belgian audiences were appearing in searches conducted from France.

In a screenshot shared on Bluesky, Google stated the result:

  • Contains the search terms
  • Is in French
  • “Seems coherent with this search, even if it usually appears in searches outside of France”

McCarthy asked whether Google was ignoring his hreflang instructions.

What Google Says About hreflang

Mueller replied:

“hreflang doesn’t guarantee indexing, so it can also just be that not all variations are indexed. And, if they are the same (eg fr-fr, fr-be), it’s common that one is chosen as canonical (they’re the same).”

In a follow-up, he added:

“I suspect this is a ‘same language’ case where our systems just try to simplify things for sites. Often hreflang will still swap out the URL, but reporting will be on the canonical URL.”

Key Takeaways

Hreflang is a Hint, Not a Command

Google uses hreflang as a suggestion for which regional URL to display. It doesn’t require that each version be indexed or shown separately.

Canonical Tags Can Override Variations

Google may select one as the canonical URL when two pages are nearly identical. That URL then receives all the indexing and reporting.

“Same Language” Simplification

If two pages share the same language, Google’s systems may group them. Even if hreflang presents the correct variant to users, metrics often consolidate into the canonical version.

What This Means for International SEO Teams

Add unique elements to each regional page. The more distinct the content, the less likely Google is to group it under one canonical URL.

In Google Search Console, verify which URL is shown as canonical. Don’t assume that hreflang tags alone will generate separate performance data.

Use VPNs or location-based testing tools to search from various countries. Ensure Google displays the correct pages for the intended audience.

Review Google’s official documentation on hreflang, sitemaps, and HTTP headers. Remember that hreflang signals are hints that work best alongside a solid site structure.
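
Alongside those checks, it is worth confirming that the hreflang annotations on the page are present and point where you expect. A minimal sketch that lists the alternates declared in a page's HTML (the URL is a placeholder, and a real audit should also cover sitemap and HTTP-header annotations):

import requests
from bs4 import BeautifulSoup

# Placeholder page to check; run this for each regional variant you care about.
url = "https://www.example.com/fr-be/"
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

# List every <link rel="alternate" hreflang="..."> annotation declared in the HTML.
for link in soup.find_all("link", rel="alternate"):
    if link.get("hreflang"):
        print(link["hreflang"], "->", link.get("href"))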

Next Steps for Marketers

International SEO can be complex, but clear strategies help:

  1. Audit Your hreflang Setup: Check tag syntax, XML sitemaps, and HTTP header configurations.
  2. Review Page Similarity: Ensure each language-region version serves a unique user need.
  3. Monitor Continuously: Set up alerts for unexpected traffic patterns or drops in regional performance.

SEO teams can set realistic goals and fine-tune their international strategies by understanding hreflang’s limits and Google’s approach to canonical tags. Regular testing, precise localization, and vigilant monitoring will keep regional campaigns on track.


Featured Image: Roman Samborskyi/Shutterstock

Server-Side vs. Client-Side Rendering: What Google Recommends via @sejournal, @MattGSouthern

In an interview with Kenichi Suzuki from Faber Company Inc., Google Developer Advocate Martin Splitt recently shared key information about JavaScript rendering, server-side vs. client-side rendering, and structured data.

The talk cleared up common SEO confusion and offered practical tips for developers and marketers working with Google’s changing search systems.

Google’s AI Crawler & JavaScript Rendering

When asked how AI systems handle JavaScript content, Splitt revealed that Google’s AI crawler (used by Gemini) processes JavaScript well through a shared service.

Splitt explained:

“We don’t share what Googlebot sees for web search, but Google’s AI crawler that Gemini uses also renders. It uses WRS [Web Rendering Service], but it’s basically like we have a service Googlebot uses, and Gemini uses the service as well.”

This gives Google’s AI tools an edge over competitors that have trouble with JavaScript.

While one study mentioned in the interview claimed rendering sometimes takes weeks, Splitt explained that it usually happens much faster.

“The 99th percentile is within minutes,” Splitt noted, suggesting that long delays are rare and might be due to measurement errors.

Server-Side vs. Client-Side Rendering: Which is Better?

Part of the discussion covered the debate between server-side rendering (SSR) and client-side rendering (CSR).

Instead of saying one is always better, Splitt stressed that the right choice depends on what your website does.

Splitt stated:

“If you have a website that is a classical website that is basically just presenting information to the user, then requiring JavaScript is a drawback. It can break. It can cause problems. It will make things slower. It will need more battery on your phone.”

Splitt suggests SSR or even pre-rendering static HTML for websites focused on content. But CSR works better for interactive tools like CAD programs or video editors.

Splitt clarified:

“It’s not one or the other. It is two tools. Do you need a hammer or do you need a screwdriver? That depends on what you’re trying to do.”

See also: Understand the Difference Between Client-Side and Server-Side Rendering.

Structured Data’s Role in AI Understanding

The talk then moved to structured data, which is becoming more important as AI systems grow in search.

When asked if structured data helps Google’s AI understand content better, like Microsoft claims about Bing, Splitt confirmed it helps.

He stated:

“Structured data gives us more information and gives us more confidence in information. So it makes sense to have structured data.”

However, Splitt clarified that while structured data adds context, it “does not push rankings” directly. This is an important difference for SEO professionals who might think it directly boosts search positions.

What This Means

Here are the key things we learned from this interview:

  1. Google’s rendering usually happens within minutes, so the old fear of JavaScript-heavy sites being at a disadvantage is less of an issue now.
  2. Non-Google AI tools may still have trouble with JavaScript, making SSR possibly more critical for visibility across all AI systems.
  3. Use SSR for content sites and CSR for interactive tools. Don’t use one solution for everything.
  4. Though not a ranking factor, structured data helps Google understand your content better. This matters more as AI becomes a bigger part of search.

In his final advice to SEO professionals, Splitt highlighted basic principles over technical tricks:

“Think about your users. Figure out what is your business goal, how to make users happy, and then just create great content.”

As AI changes search technology, understanding these technical details becomes more important for marketers who want to optimize content for people and search algorithms.

Hear the full discussion in the video below:

11 Lessons Learned From Auditing Over 500 Websites via @sejournal, @olgazarr

After conducting more than 500 in-depth website audits in the past 12 years, I’ve noticed clear patterns about what works and doesn’t in SEO.

I’ve seen almost everything that can go right – and wrong – with websites of different types.

To help you avoid costly SEO mistakes, I’m sharing 11 practical lessons from critical SEO areas, such as technical SEO, on-page SEO, content strategy, SEO tools and processes, and off-page SEO.

It took me more than a decade to discover all these lessons. By reading this article, you can apply these insights to save yourself and your SEO clients time, money, and frustration – in less than an hour.

Lesson #1: Technical SEO Is Your Foundation For SEO Success

  • Lesson: You should always start any SEO work with technical fundamentals; crawlability and indexability determine whether search engines can even see your site.

Technical SEO ensures search engines can crawl, index, and fully understand your content. If search engines can’t properly access your site, no amount of quality content or backlinks will help.

After auditing over 500 websites, I believe technical SEO is the most critical aspect of SEO, which comes down to two fundamental concepts:

  • Crawlability: Can search engines easily find and navigate your website’s pages?
  • Indexability: Once crawled, can your pages appear in search results?

If your pages fail these two tests, they won’t even enter the SEO game — and your SEO efforts won’t matter.

I strongly recommend regularly monitoring your technical SEO health using at least two essential tools: Google Search Console and Bing Webmaster Tools.

Google Search Console Indexing ReportGoogle Search Console Indexing Report provides valuable insights into crawlability and indexability. Screenshot from Google Search Console, April 2025

When starting any SEO audit, always ask yourself these two critical questions:

  • Can Google, Bing, or other search engines crawl and index my important pages?
  • Am I letting search engine bots crawl only the right pages?

This step alone can save you huge headaches and ensure no major technical SEO blockages.

→ Read more: 13 Steps To Boost Your Site’s Crawlability And Indexability

Lesson #2: JavaScript SEO Can Easily Go Wrong

  • Lesson: You should be cautious when relying heavily on JavaScript. It can easily prevent Google from seeing and indexing critical content.

JavaScript adds great interactivity, but search engines (even as smart as Google) often struggle to process it reliably.

Google handles JavaScript in three steps (crawling, rendering, and indexing) using an evergreen Chromium browser. However, rendering delays (from minutes to weeks) and limited resources can prevent important content from getting indexed.

I’ve audited many sites whose SEO was failing because key JavaScript-loaded content wasn’t visible to Google.

Typically, important content was missing from the initial HTML, it didn’t load properly during rendering, or there were significant differences between the raw HTML and rendered HTML when it came to content or meta elements.

You should always test if Google can see your JavaScript-based content:

  • Use the Live URL Test in Google Search Console and verify rendered HTML.
Google Search Console LIVE TestGoogle Search Console LIVE Test allows you to see the rendered HTML. (Screenshot from Google Search Console, April 2025)
  • Or, search Google for a unique sentence from your JavaScript content (in quotes). If your content isn’t showing up, Google probably can’t index it.*
Site: search in Google The site: search in Google allows you to quickly check whether a given piece of text on a given page is indexed by Google. (Screenshot from Google Search, April 2025)

*This will only work for URLs that are already in Google’s index.
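
A quick scripted version of the same check is to fetch the raw HTML, before any JavaScript runs, and see whether a key sentence is already present. A minimal sketch, where the URL and phrase are placeholders:

import requests

url = "https://www.example.com/some-page/"          # placeholder URL to test
phrase = "a unique sentence from the page content"  # placeholder phrase

# Raw HTML only - nothing here has been rendered by JavaScript.
raw_html = requests.get(url, timeout=30).text

if phrase.lower() in raw_html.lower():
    print("Phrase found in the initial HTML - it does not depend on JavaScript.")
else:
    print("Phrase missing from the initial HTML - it is likely injected by JavaScript.")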

Here are a few best practices regarding JavaScript SEO:

  • Critical content in HTML: You should include titles, descriptions, and important content directly in the initial HTML so search engines can index it immediately. You should remember that Google doesn’t scroll or click.
  • Server-Side Rendering (SSR): You should consider implementing SSR to serve fully rendered HTML. It’s more reliable and less resource-intensive for search engines.
  • Proper robots.txt setup: Websites should not block essential JavaScript files needed for rendering; blocking them prevents Google from rendering and indexing that content.
  • Use crawlable URLs: You should ensure each page has a unique, crawlable URL. You should also avoid URL fragments (#section) for important content; they often don’t get indexed.

For a full list of JavaScript SEO common errors and best practices, you can navigate to the JavaScript SEO guide for SEO pros and developers.

Read more: 6 JavaScript Optimization Tips From Google

Lesson #3: Crawl Budget Matters, But Only If Your Website Is Huge

  • Lesson: You should only worry about the crawl budget if your website has hundreds of thousands or millions of pages.

Crawl budget refers to how many pages a search engine like Google crawls on your site within a certain timeframe. It’s determined by two main factors:

  • Crawl capacity limit: This prevents Googlebot from overwhelming your server with too many simultaneous requests.
  • Crawl demand: This is based on your site’s popularity and how often content changes.

No matter what you hear or read on the internet, most websites don’t need to stress about crawl budget at all. Google typically handles crawling efficiently for smaller websites.

But for huge websites – especially those with millions of URLs or daily-changing content – crawl budget becomes critical (as Google confirms in its crawl budget documentation).

Google documentation on crawl budgetGoogle, in its documentation, clearly defines what types of websites should be concerned about crawl budget. (Screenshot from Search Central, April 2025)

In this case, you need to ensure that Google prioritizes and crawls important pages frequently without wasting resources on pages that should never be crawled or indexed.

You can check your crawl budget health using Google Search Console’s Indexing report. Pay attention to:

  • Crawled – Currently Not Indexed: This usually indicates indexing problems, not crawl budget.
  • Discovered – Currently Not Indexed: This typically signals crawl budget issues.

You should also regularly review Google Search Console’s Crawl Stats report to see how many pages Google crawls per day. Comparing crawled pages with total pages on your site helps you spot inefficiencies.

While those quick checks in GSC naturally won’t replace log file analysis, they will give quick insights into possible crawl budget issues and may suggest that a detailed log file analysis may be necessary.

Read more: 9 Tips To Optimize Crawl Budget For SEO

This brings us to the next point.

Lesson #4: Log File Analysis Lets You See The Entire Picture

  • Lesson: Log file analysis is a must for many websites. It reveals details you can’t see otherwise and helps diagnose problems with crawlability and indexability that affect your site’s ability to rank.

Log files track every visit from search engine bots, like Googlebot or Bingbot. They show which pages are crawled, how often, and what the bots do. This data lets you spot issues and decide how to fix them.

For example, on an ecommerce site, you might find Googlebot crawling product pages, adding items to the cart, and removing them, wasting your crawl budget on useless actions.

With this insight, you can block those cart-related URLs with parameters to save resources so that Googlebot can crawl and index valuable, indexable canonical URLs.

Here is how you can make use of log file analysis:

  • Start by accessing your server access logs, which record bot activity.
  • Look at what pages bots hit most, how frequently they visit, and if they’re stuck on low-value URLs.
  • You don’t need to analyze logs manually. Tools like Screaming Frog Log File Analyzer make it easy to identify patterns quickly.
  • If you notice issues, like bots repeatedly crawling URLs with parameters, you can easily update your robots.txt file to block those unnecessary crawls (a minimal log-parsing sketch follows this list).
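
If you do have access to raw logs, even a few lines of Python can show which paths Googlebot requests most often. A minimal sketch, assuming a combined-format access log and a placeholder file name:

from collections import Counter
import re

hits = Counter()

# "access.log" is a placeholder path; point it at wherever your server stores access logs.
with open("access.log", encoding="utf-8", errors="ignore") as f:
    for line in f:
        # Simple user-agent match; a full audit would also verify Googlebot via reverse DNS.
        if "Googlebot" not in line:
            continue
        # In the combined log format, the request looks like: "GET /some/path HTTP/1.1"
        match = re.search(r'"(?:GET|POST|HEAD) ([^ ]+)', line)
        if match:
            hits[match.group(1)] += 1

# Top 20 most-crawled paths - a quick proxy for where the crawl budget is going.
for path, count in hits.most_common(20):
    print(count, path)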

Getting log files isn’t always easy, especially for big enterprise sites where server access might be restricted.

If that’s the case, you can use the aforementioned Google Search Console’s Crawl Stats, which provides valuable insights into Googlebot’s crawling activity, including pages crawled, crawl frequency, and response times.

Google Search Console Crawl Stats reportThe Google Search Console Crawl Stats report provides a sample of data about Google’s crawling activity. (Screenshot from Google Search Console, April 2025)

While log files offer the most detailed view of search engine interactions, even a quick check in Crawl Stats helps you spot issues you might otherwise miss.

Read more: 14 Must-Know Tips For Crawling Millions Of Webpages

Lesson #5: Core Web Vitals Are Overrated. Stop Obsessing Over Them

  • Lesson: You should focus less on Core Web Vitals. They rarely make or break SEO results.

Core Web Vitals measure loading speed, interactivity, and visual stability, but they do not influence SEO as significantly as many assume.

After auditing over 500 websites, I’ve rarely seen Core Web Vitals alone significantly improve rankings.

Most sites only see measurable improvement if their loading times are extremely poor – taking more than 30 seconds – or have critical issues flagged in Google Search Console (where everything is marked in red).

Core Web Vitals in Google Search ConsoleThe Core Web Vitals report in Google Search Console provides real-world user data. (Screenshot from Google Search Console, April 2025)

I’ve watched clients spend thousands, even tens of thousands of dollars, chasing perfect Core Web Vitals scores while overlooking fundamental SEO basics, such as content quality or keyword strategy.

Redirecting those resources toward content and foundational SEO improvements usually yields way better results.

When evaluating Core Web Vitals, you should focus exclusively on real-world data from Google Search Console (as opposed to lab data in Google PageSpeed Insights) and consider users’ geographic locations and typical internet speeds.

If your users live in urban areas with reliable high-speed internet, Core Web Vitals won’t affect them much. But if they’re rural users on slower connections or older devices, site speed and visual stability become critical.

The bottom line here is that you should always base your decision to optimize Core Web Vitals on your specific audience’s needs and real user data – not just industry trends.

Read more: Are Core Web Vitals A Ranking Factor?

Lesson #6: Use Schema (Structured Data) To Help Google Understand & Trust You

  • Lesson: You should use structured data (Schema) to tell Google who you are, what you do, and why your website deserves trust and visibility.

Schema Markup (or structured data) explicitly defines your content’s meaning, which helps Google easily understand the main topic and context of your pages.

Certain schema types, like rich results markup, allow your listings to display extra details, such as star ratings, event information, or product prices. These “rich snippets” can grab attention in search results and increase click-through rates.

You can think of schema as informative labels for Google. You can label almost anything – products, articles, reviews, events – to clearly explain relationships and context. This clarity helps search engines understand why your content is relevant for a given query.

You should always choose the correct schema type (like “Article” for blog posts or “Product” for e-commerce pages), implement it properly with JSON-LD, and carefully test it using Google’s Rich Results Test or the Schema Markup Validator.

Structured data markup typesIn its documentation, Google shows examples of structured data markup supported by Google Search. (Screenshot from Google Search Console, April 2025)
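
As an illustration, here is a minimal sketch that assembles an Article JSON-LD block in Python; the values are placeholders, and the printed JSON is what you would embed in an application/ld+json script tag and then validate with the testing tools above:

import json

# Placeholder values - swap in your real page data.
article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example Article Headline",
    "datePublished": "2025-04-01",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "publisher": {"@type": "Organization", "name": "Example Publisher"},
}

# JSON-LD is just JSON, so dumping the dict gives you the markup to embed.
print(json.dumps(article_schema, indent=2))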

Schema lets you optimize SEO behind the scenes without affecting what your audience sees.

While SEO clients often hesitate about changing visible content, they usually feel comfortable adding structured data because it’s invisible to website visitors.

Read more: CMO Guide To Schema: How Your Organization Can Implement A Structured Data Strategy

Lesson #7: Keyword Research And Mapping Are Everything

  • Lesson: Technical SEO gets you into the game by controlling what search engines can crawl and index. But, the next step – keyword research and mapping – tells them what your site is about and how to rank it.

Too often, websites chase the latest SEO tricks or target broad, competitive keywords without any strategic planning. They skip proper keyword research and rarely invest in keyword mapping, both essential steps to long-term SEO success:

  • Keyword research identifies the exact words and phrases your audience actually uses to search.
  • Keyword mapping assigns these researched terms to specific pages and gives each page a clear, focused purpose.

Every website should have a spreadsheet listing all its indexable canonical URLs.

Next to each URL, there should be the main keyword that the page should target, plus a few related synonyms or variations.

Keyword research and keyword mappingHaving the keyword mapping document is a vital element of any SEO strategy. (Image from author, April 2025)
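
The mapping sheet itself needs nothing more than a spreadsheet or a few lines of pandas. A minimal sketch, with placeholder URLs and keywords:

import pandas as pd

# Placeholder rows - one indexable canonical URL per row, with its main keyword and variations.
keyword_map = pd.DataFrame([
    {"url": "https://www.example.com/isa-rates/", "main_keyword": "isa rates",
     "variations": "best isa rates, isa interest rates"},
    {"url": "https://www.example.com/savings-accounts/", "main_keyword": "savings account",
     "variations": "online savings account, savings rates"},
])

keyword_map.to_csv("keyword_map.csv", index=False)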

Without this structure, you’ll be guessing and hoping your pages rank for terms that may not even match your content.

A clear keyword map ensures every page has a defined role, which makes your entire SEO strategy more effective.

This isn’t busywork; it’s the foundation of a solid SEO strategy.

→ Read more: How To Use ChatGPT For Keyword Research

Lesson #8: On-Page SEO Accounts For 80% Of Success

  • Lesson: From my experience auditing hundreds of websites, on-page SEO drives about 80% of SEO results. Yet, only about 1 in 20 or 30 sites I review have done it well. Most get it wrong from the start.

Many websites rush straight into link building, generating hundreds or even thousands of low-quality backlinks with exact-match anchor texts, before laying any SEO groundwork.

They skip essential keyword research, overlook keyword mapping, and fail to optimize their key pages first.

I’ve seen this over and over: chasing advanced or shiny tactics while ignoring the basics that actually work.

When your technical SEO foundation is strong, focusing on on-page SEO can often deliver significant results.

There are thousands of articles about basic on-page SEO: optimizing titles, headers, and content around targeted keywords.

Yet, almost nobody implements all of these basics correctly. Instead of chasing trendy or complex tactics, you should focus first on the essentials:

  • Do proper keyword research to identify terms your audience actually searches.
  • Map these keywords clearly to specific pages.
  • Optimize each page’s title tags, meta descriptions, headers, images, internal links, and content accordingly.

These straightforward steps are often enough to achieve SEO success, yet many overlook them while searching for complicated shortcuts.

Read more: Google E-E-A-T: What Is It & How To Demonstrate It For SEO

Lesson #9: Internal Linking Is An Underused But Powerful SEO Opportunity

  • Lesson: Internal links hold more power than overhyped external backlinks and can significantly clarify your site’s structure for Google.

Internal links are way more powerful than most website owners realize.

Everyone talks about backlinks from external sites, but internal linking – when done correctly – can actually make a huge impact.

Unless your website is brand new, improving your internal linking can give your SEO a serious lift by helping Google clearly understand the topic and context of your site and its specific pages.

Still, many websites don’t use internal links effectively. They rely heavily on generic anchor texts like “Read more” or “Learn more,” which tell search engines absolutely nothing about the linked page’s content.

Low-value internal linksImage from author, April 2025

Website owners often approach me convinced they need a deep technical audit.

Yet, when I take a closer look, their real issue frequently turns out to be poor internal linking or unclear website structure, both making it harder for Google to understand the site’s content and value.

Internal linking can also give a boost to underperforming pages.

For example, if you have a page with strong external backlinks, linking internally from that high-authority page to weaker ones can pass authority and help those pages rank better.

Investing a little extra time in improving your internal links is always worth it. They’re one of the easiest yet most powerful SEO tools you have.

Read more: Internal Link Structure Best Practices to Boost Your SEO

Lesson #10: Backlinks Are Just One SEO Lever, Not The Only One

  • Lesson: You should never blindly chase backlinks to fix your SEO. Build them strategically only after mastering the basics.

SEO audits often show websites placing too much emphasis on backlinks while neglecting many other critical SEO opportunities.

Blindly building backlinks without first covering SEO fundamentals – like removing technical SEO blockages, doing thorough keyword research, and mapping clear keywords to every page – is a common and costly mistake.

Even after getting those basics right, link building should never be random or reactive.

Too often, I see sites start building backlinks simply because their SEO isn’t progressing, hoping more links will magically help. This rarely works.

Instead, you should always approach link building strategically, by first carefully analyzing your direct SERP competitors to determine if backlinks are genuinely your missing element:

  • Look closely at the pages outranking you.
  • Identify whether their advantage truly comes from backlinks or better on-page optimization, content quality, or internal linking.
Backlink analysisThe decision on whether or not to build backlinks should be based on whether direct competitors have more and better backlinks. (Image from author, April 2025)

Only after ensuring your on-page SEO and internal links are strong and confirming that backlinks are indeed the differentiating factor, should you invest in targeted link building.

Typically, you don’t need hundreds of low-quality backlinks. Often, just a few strategic editorial links or well-crafted SEO press releases can close the gap and improve your rankings.

Read more: How To Get Quality Backlinks: 11 Ways That Really Work

Lesson #11: SEO Tools Alone Can’t Replace Manual SEO Checks

  • Lesson: You should never trust SEO tools blindly. Always cross-check their findings manually using your own judgment and common sense.

SEO tools make our work faster, easier, and more efficient, but they still can’t fully replicate human analysis or insight.

Tools lack the ability to understand context and strategy in the way that SEO professionals do. They usually can’t “connect the dots” or assess the real significance of certain findings.

This is exactly why every recommendation provided by a tool needs manual verification. You should always evaluate the severity and real-world impact of the issue yourself.

Often, website owners come to me alarmed by “fatal” errors flagged by their SEO tools.

Yet, when I manually inspect these issues, most turn out to be minor or irrelevant.

Meanwhile, fundamental aspects of SEO, such as strategic keyword targeting or on-page optimization, are completely missing since no tool can fully capture these nuances.

Screaming Frog SEO Spider flagging SEO issuesScreaming Frog SEO Spider says there are rich result validation errors, but when I check that manually, there are no errors. (Screenshot from Screaming Frog, April 2025)

SEO tools are still incredibly useful because they handle large-scale checks that humans can’t easily perform, like analyzing millions of URLs at once.

However, you should always interpret their findings carefully and manually verify the importance and actual impact before taking any action.

Final Thoughts

After auditing hundreds of websites, the biggest pattern I notice isn’t complex technical SEO issues, though they do matter.

Instead, the most frequent and significant problem is simply a lack of a clear, prioritized SEO strategy.

Too often, SEO is done without a solid foundation or clear direction, which makes all other efforts less effective.

Another common issue is undiagnosed technical problems lingering from old site migrations or updates. These hidden problems can quietly hurt rankings for years if left unresolved.

The lessons above cover the majority of challenges I encounter daily, but remember: Each website is unique. There’s no one-size-fits-all checklist.

Every audit must be personalized and consider the site’s specific context, audience, goals, and limitations.

SEO tools and AI are increasingly helpful, but they’re still just tools. Ultimately, your own human judgment, experience, and common sense remain the most critical factors in effective SEO.



Featured Image: inspiring.team/Shutterstock