Google: Many Top Sites Have Invalid HTML And Still Rank via @sejournal, @MattGSouthern

A recent discussion on Google's Search Off the Record podcast challenges long-held assumptions about technical SEO, revealing that most top-ranking websites don't use valid HTML.

Despite these imperfections, they continue to rank well in search results.

Search Advocate John Mueller and Developer Relations Engineer Martin Splitt referenced a study by former Google webmaster Jens Meiert, which found that only one homepage among the top 200 websites passed HTML validation tests.

Mueller highlighted:

"0.5% of the top 200 websites have valid HTML on their homepage. One site had valid HTML. That's it."

He described the result as "crazy," noting that the study surprised even developers who take pride in clean code.

Mueller added:

"Search engines have to deal with whatever broken HTML is out there. It doesn't have to be perfect, it'll still work."

When HTML Errors Matter

While most HTML issues are tolerated, certain technical elements, such as metadata, must be correctly implemented.

Splitt said:

"If something is written in a way that isn't HTML compliant, then the browser will make assumptions."

That usually works fine for visible content, but can fail "catastrophically" when it comes to elements that search engines rely on.

Mueller said:

"If [metadata] breaks, then it's probably not going to do anything in your favor."

SEO Is Not A Technical Checklist

Google also challenged the notion that SEO is a box-ticking exercise for developers.

Mueller said:

"Sometimes SEO is also not so much about purely technical things that you do, but also kind of a mindset."

Splitt said:

"Am I using the terminology that my potential customers would use? And do I have the answers to the things that they will ask?"

Naming things appropriately, he said, is one of the most overlooked SEO skills and often more important than technical precision.

Core Web Vitals and JavaScript

Two recurring sources of confusion, Core Web Vitals and JavaScript, were also addressed.

Core Web Vitals

The podcast hosts reiterated that good Core Web Vitals scores don't guarantee better rankings.

Mueller said:

"Core Web Vitals is not the solution to everything."

Mueller added:

"Developers love scores… it feels like 'oh I should like maybe go from 85 to 87 and then I will rank first,' but there's a lot more involved."

JavaScript

On the topic of JavaScript, Splitt said that while Google can process it, implementation still matters.

Splitt said:

"If the content that you care about is showing up in the rendered HTML, you'll be fine generally speaking."

Splitt added:

"Use JavaScript responsibly and don't use it for everything."

Misuse can still create problems for indexing and rendering, especially if assumptions are made without testing.

What This Means

The key takeaway from the podcast is that technical perfection isn't 100% necessary for SEO success.

While critical elements like metadata must function correctly, the vast majority of HTML validation errors won't prevent ranking.

As a result, developers and marketers should be cautious about overinvesting in code validation at the expense of content quality and search intent alignment.

Listen to the full podcast episode below:

Google's 'srsltid' Parameter Appears In Organic URLs, Creating Confusion via @sejournal, @MattGSouthern

Google's srsltid parameter, originally meant for product tracking, is now showing on blog pages and homepages, creating confusion among SEO pros.

Per a recent Reddit thread, people are seeing the parameter attached not just to product pages, but also to blog posts, category listings, and homepages.

Google Search Advocate John Mueller responded saying, "it doesn't cause any problems for search." However, it may still raise more questions than it answers.

Here's what you need to know.

What Is the srsltid Parameter Supposed to Do?

The srsltid parameter is part of Merchant Center auto-tagging. It's designed to help merchants track conversions from organic listings connected to their product feeds.

When enabled, the parameter is appended to URLs shown in search results, allowing for better attribution of downstream behavior.

A post on Google's Search Central community forum clarifies that these URLs aren't indexed.

As Product Expert Barry Hunter (not affiliated with Google) explained:

"The URLs with srsltid are NOT really indexed. The param is added dynamically at runtime. That's why they don't show as indexed in Search Console… but they may appear in search results."

While it's true the URLs aren't indexed, they're showing up in indexed pages reported by third-party tools.

Why SEO Pros Are Confused

Despite Google's assurances, the real-world impact of srsltid is causing confusion for these reasons:

  • Inflated URL counts: Tools often treat URLs with unique parameters as separate pages. This inflates site page counts and can obscure crawl reports or site audits.
  • Data fragmentation: Without filtering, analytics platforms like GA4 split traffic between canonical and parameterized URLs, making it harder to measure performance accurately.
  • Loss of visibility in Search Console: As documented in a study by Oncrawl, sites saw clicks and impressions for srsltid URLs drop to zero around September, even though those pages still appeared in search results.
  • Unexpected reach: The parameter is appearing on pages beyond product listings, including static pages, blogs, and category hubs.

Oncrawl's analysis also found that Googlebot crawled 0.14% of pages with the srsltid parameter, suggesting minimal crawling impact.

Can Anything Be Done?

Google hasn't indicated any rollback or revision to how srsltid works in organic results. But you do have a few options depending on how you're affected.

Option 1: Disable Auto-Tagging

You can turn off Merchant Center auto-tagging by navigating to Tools and settings > Conversion settings > Automatic tagging. Switching to UTM parameters can provide greater control over traffic attribution.

Option 2: Keep Auto-Tagging, Filter Accordingly

If you need to keep auto-tagging active:

  • Ensure all affected pages have correct canonical tags.
  • Configure caching systems to ignore srsltid as a cache key.
  • Update your analytics filters to exclude or consolidate srsltid traffic.

Blocking the parameter in robots.txt won't prevent the URLs from appearing in search results, as they're added dynamically and not crawled directly.
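
If you keep auto-tagging and filter instead, the cleanup can be scripted. Below is a minimal sketch (not an official GA4 or Search Console integration) that strips the srsltid parameter from URLs in an exported report so metrics roll up to canonical pages; the file name and column names are assumptions about your export.

# A minimal sketch: strip the srsltid parameter from exported URLs so page counts
# and traffic roll up to canonical pages. The export file and its column names
# ("page_url", "sessions") are hypothetical and depend on your reporting tool.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

import pandas as pd

def strip_srsltid(url: str) -> str:
    """Return the URL with any srsltid query parameter removed."""
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k.lower() != "srsltid"]
    return urlunparse(parts._replace(query=urlencode(params)))

df = pd.read_csv("analytics_export.csv")                 # hypothetical export
df["canonical_url"] = df["page_url"].map(strip_srsltid)

# Consolidate metrics that were split across parameterized variants.
consolidated = df.groupby("canonical_url", as_index=False)["sessions"].sum()
consolidated.to_csv("consolidated_report.csv", index=False)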

What This Means

The srsltid parameter may not affect rankings, but its indirect impact on analytics and reporting is being felt.

When performance reporting shifts without explanation, SEO pros need to provide answers. Understanding how srsltid functions, and how it doesn't, helps mitigate confusion.

Staying informed, filtering correctly, and communicating with stakeholders are the best options for navigating this issue.


Featured Image: Roman Samborskyi/Shutterstock

The Smart SEO Team's Guide To Timing & Executing A Large-Scale Site Migration via @sejournal, @inmotionhosting

This post was sponsored by InMotion Hosting. The opinions expressed in this article are the sponsor's own.

We've all felt it, that sinking feeling in your stomach when your site starts crawling instead of sprinting.

Page speed reports start flashing red. Search Console is flooding your inbox with errors.

You know it's time for better hosting, but here's the thing: moving a large website without tanking your SEO is like trying to change tires while your car is still moving.

We've seen too many migrations go sideways, which is why we put together this guide.

Let's walk through a migration plan that works. One that'll future-proof your site without disrupting your rankings or overburdening your team.

Free Website Migration Checklist

Step 1: Set Your Performance Goals & Audit Your Environment

Establish Performance Benchmarks

Before you touch a single line of code, you need benchmarks. Think of these as your "before" pictures in a website makeover.

If you skip this step, you'll regret it later. How will you know if your migration was successful if you don't know where you started?

Gather your current page speed numbers, uptime percentages, and server response times. These will serve as proof that the migration was worth it.

Document Current Site Architecture

Next, let's identify what's working for your site and what's holding it back. Keep a detailed record of your current setup, including your content management system (CMS), plugins, traffic patterns, and peak periods.

Large sites often have unusual, hidden connections that only reveal themselves at the worst possible moments during migrations. Trust us, documenting this now prevents those 2 AM panic attacks later.

Define Your Website Migration Goals

Let's get specific about what success looks like. Saying "we want the site to be faster" is like saying "we want more leads." It sounds great, but how do you measure it?

Aim for concrete targets, such as:

  • Load times under 2 seconds on key pages (we like to focus on product pages first).
  • 99.99% uptime guarantees (because every minute of downtime is money down the drain).
  • Server response times under 200ms.
  • 30% better crawl efficiency (so Google sees your content updates).

We recommend running tests with Google Lighthouse and GTmetrix at different times of day. You'd be surprised how performance can vary between your morning coffee and afternoon slump.

Your top money-making pages deserve special attention during migration, so keep tabs on those.

Step 2: Choose The Right Hosting Fit

Not all hosting options can handle the big leagues.

We've seen too many migrations fail because someone picked a hosting plan better suited for a personal blog than an enterprise website.

Match Your Needs To Solutions

Let's break down what we've found works best.

Managed VPS is excellent for medium-sized sites. If you're receiving 100,000 to 500,000 monthly visitors, this might be your sweet spot. You'll have the control you need without the overkill.

Dedicated servers are what we recommend for the major players. If you're handling millions of visitors or running complex applications, this is for you.

What we appreciate about dedicated resources is that they eliminate the "noisy neighbor" problem, where someone else's traffic spike can tank your performance. Enterprise sites on dedicated servers load 40-60% faster and rarely experience those resource-related outages.

WordPress-optimized hosting is ideal if you're running WordPress. These environments come pre-tuned with built-in caching and auto-updates. Why reinvent the wheel, right?

Understand The Must-Have Features Checklist

Let's talk about what your web hosting will need for SEO success.

NVMe SSDs are non-negotiable these days. They're about six times faster than regular storage for database work, and you'll feel the difference immediately.

A good CDN is essential if you want visitors from different regions to have the same snappy experience. Server-level caching makes a huge difference, as it reduces processing work and speeds up repeat visits and search crawls.

Illustration showing how caching works on a website. Image created by InMotion Hosting, June 2025.

Staging environments aren't optional for big migrations. They're your safety net. Keep in mind that emergency fixes can cost significantly more than setting up staging beforehand.

And please ensure you have 24/7 migration support from actual humans. Not chatbots, real engineers who answer the phone when things go sideways at midnight.

Key Considerations for Growth

Think about where your site is headed, not just where it is now.

Are you launching in new markets? Planning a big PR push? Your hosting should handle growth without making you migrate again six months later.

One thing that often gets overlooked: redirect limits. Many platforms cap at 50,000-100,000 redirects, which sounds like a lot until you're migrating a massive product catalog.

Step 3: Prep for Migration – The Critical Steps

Preparation separates smooth migrations from disasters. This phase makes or breaks your project.

Build Your Backup Strategy

First things first: backups, backups, backups. We're talking complete copies of both files and databases.

Don't dump everything into one giant folder labeled "Site Stuff." Organize backups by date and type. Include the entire file system, database exports, configuration files, SSL certificates, and everything else.

Here's a mistake we often see: not testing the restore process before migration day. A backup you can't restore is wasted server space. Always conduct a test restore on a separate server to ensure everything works as expected.

Set Up the New Environment and Test in Staging

Your new hosting environment should closely mirror your production environment. Match PHP versions, database settings, security rules, everything. This isn't the time to upgrade seven different things at once (we've seen that mistake before).

Run thorough pre-launch tests on staging. Check site speed on different page types. Pull out your phone and verify that the mobile display works.

Use Google's testing tools to confirm that your structured data remains intact. The goal is no surprises on launch day.

Map Out DNS Cutover and Minimize TTL for a Quick Switch

DNS strategy might sound boring, but it can make or break your downtime window.

Here's what works: lower your TTL to 300 seconds (5 minutes) or less about 48 hours before migration. This makes DNS changes propagate quickly when you flip the switch.

Have all your DNS records prepared in advance: A records, CNAMEs for subdomains, MX records for email, and TXT records for verification. Keep a checklist and highlight the mission-critical ones that would cause panic if forgotten.
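
If you want a scripted sanity check, the sketch below (using the third-party dnspython package) looks up current TTLs so you can confirm the lower values have propagated before cutover; the domain and record types are placeholders.

# A sketch, not registrar tooling: confirm that lowered TTLs are actually being
# served before you flip the switch. Requires the third-party dnspython package.
import dns.resolver  # pip install dnspython

DOMAIN = "example.com"                 # replace with your domain
RECORD_TYPES = ["A", "CNAME", "MX", "TXT"]
TARGET_TTL = 300                       # the 5-minute TTL discussed above

for record_type in RECORD_TYPES:
    try:
        answer = dns.resolver.resolve(DOMAIN, record_type)
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        continue  # not every record type exists for every name
    ttl = answer.rrset.ttl
    status = "OK" if ttl <= TARGET_TTL else "still too high"
    print(f"{record_type:<6} TTL = {ttl:>6}s  ({status})")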

Freeze Non-Essential Site Updates Before Migration

This might be controversial, but we're advocates for freezing all content and development changes for at least 48 hours before migration.

The last thing you need is someone publishing a new blog post right as you're moving servers.

You can use this freeze time for team education. It's a perfect moment to run workshops on technical SEO or explain how site speed affects rankings. Turn downtime into learning time.

Step 4: Go-Live Without the Guesswork

Migration day! This is where all your planning pays off, or where you realize what you forgot.

Launch Timing Is Everything

Choose your timing carefully. You should aim for when traffic is typically lowest.

For global sites, consider the "follow-the-sun" approach. This means migrating region by region during their lowest traffic hours. While it takes longer, it dramatically reduces risk.

Coordinate Your Teams

Clear communication is everything. Everyone should know exactly what they're doing and when.

Define clear go/no-go decision points. Who makes the call if something looks off? What's the threshold for rolling back vs. pushing through?

Having these conversations before you're in the middle of a migration saves a ton of stress.

Live Performance Monitoring

Once you flip the switch, monitoring becomes your best friend. Here are the key items to monitor:

  • Watch site speed across different page types and locations.
  • Set up email alerts for crawl errors in Search Console.
  • Monitor 404 error rates and redirect performance.

Sudden spikes in 404 errors or drops in speed need immediate attention. They're usually signs that something didn't migrate correctly.

The faster you catch these issues, the less impact they'll have on your rankings.

Post-Migration Validation

After launch, run through a systematic checklist:

  • Test redirect chains (we recommend Screaming Frog for this).
  • Make sure internal links work.
  • Verify your analytics tracking (you'd be surprised how often this breaks).
  • Check conversion tracking.
  • Validate SSL certificates.
  • Watch server logs for crawl issues.

One step people often forget: resubmitting your sitemap in Search Console as soon as possible. This helps Google discover your new setup faster.
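
For the redirect checks specifically, a small script can complement a crawler. This is a rough sketch using Python's requests library; the URL mapping is a placeholder, and it assumes you expect clean, single-hop 301s.

# A rough spot-check of redirects after launch; it complements a full crawl
# rather than replacing one. The mapping of old to expected URLs is a placeholder.
import requests

REDIRECTS_TO_CHECK = {
    "https://www.example.com/old-page/": "https://www.example.com/new-page/",
}

for old_url, expected in REDIRECTS_TO_CHECK.items():
    response = requests.get(old_url, allow_redirects=True, timeout=10)
    hops = [r.status_code for r in response.history]   # e.g. [301] or [301, 302]
    problems = []
    if len(hops) > 1:
        problems.append(f"redirect chain of {len(hops)} hops")
    if hops and hops[0] != 301:
        problems.append(f"first hop is {hops[0]}, not 301")
    if response.url.rstrip("/") != expected.rstrip("/"):
        problems.append(f"lands on {response.url}, expected {expected}")
    print(old_url, "->", response.url, "| OK" if not problems else "| " + "; ".join(problems))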

Even with a perfect migration, most large sites take 3-6 months for complete re-indexing, so patience is key.

Step 5: Optimize, Tune, and Report: How To Increase Wins

The migration itself is just the beginning. Post-migration tuning is where the magic happens.

Fine-Tune Your Configuration

Now that you're observing real traffic patterns, you can optimize your setup.

Start by enhancing caching rules based on actual user behavior. Adjust compression settings, and optimize those database queries that seemed fine during testing but are sluggish in production.

Handling redirects at the server level, rather than through plugins or CMS settings, is faster and reduces server load.

Automate Performance Monitoring

Set up alerts for issues before they become problems. We recommend monitoring:

  • Page speed drops by over 10%.
  • Uptime drops.
  • Changes in crawl rates.
  • Spikes in server resource usage.
  • Organic traffic drops by over 20%.

Automation saves you from constantly checking dashboards, allowing you to focus on improvements instead of firefighting.
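
To picture that automation, here is a bare-bones sketch of the threshold logic: compare current metrics against the baseline you collected in Step 1 and flag anything that crosses the limits above. How you gather the numbers (PageSpeed data, uptime checks, log parsing) depends on your stack, and the values shown are illustrative.

# A bare-bones sketch of the alerting thresholds described above. Baseline and
# current values are illustrative; plug in whatever your monitoring collects.
BASELINE = {"lcp_ms": 2100, "organic_sessions": 48000}
CURRENT = {"lcp_ms": 2500, "organic_sessions": 35000}

def pct_change(metric: str) -> float:
    return (CURRENT[metric] - BASELINE[metric]) / BASELINE[metric]

alerts = []
if pct_change("lcp_ms") > 0.10:             # page speed degrades by more than 10%
    alerts.append(f"LCP regressed {pct_change('lcp_ms'):+.0%} vs. baseline")
if pct_change("organic_sessions") < -0.20:  # organic traffic drops by more than 20%
    alerts.append(f"Organic sessions down {pct_change('organic_sessions'):+.0%}")

print("\n".join(alerts) if alerts else "All monitored metrics within thresholds")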

Analyze for SEO Efficiency

Server logs tell you a lot about how well your migration went from an SEO perspective. Look for fewer crawl errors, faster Googlebot response times, and better crawl budget usage.

Improvements in crawl efficiency mean Google can discover and index your new content much faster.

Measure and Report Success

Compare your post-migration performance to those baseline metrics you wisely collected.

When showing results to executives, connect each improvement to business outcomes. For example:

  • "Faster pages reduced our bounce rate by 15%, which means more people are staying on the site."
  • "Better uptime means we're not losing sales during peak hours."
  • "Improved crawl efficiency means our new products get indexed faster."

Pro tip: Build easy-to-read dashboards that executives can access at any time. This helps build confidence and alleviate concerns.

Ready to Execute Your High-Performance Migration?

You don't need more proof that hosting matters. Every slow page load and server hiccup already demonstrates it. What you need is a plan that safeguards your SEO investment while achieving tangible improvements.

This guide provides you with that playbook. You now know how to benchmark, choose the right solutions, and optimize for success.

This approach can be applied to sites of all sizes, ranging from emerging e-commerce stores to large enterprise platforms. The key lies in preparation and partnering with the right support team.

If you're ready to take action, consider collaborating with a hosting provider that understands the complexities of large-scale migrations. Look for a team that manages substantial redirect volumes and builds infrastructure specifically for high-traffic websites. Your future rankings will thank you!

Image Credits

Featured Image: Image by InMotion Hosting. Used with permission.

In-Post Image: Images by InMotion Hosting. Used with permission.

How Much Code Should SEO Pros Know? Google Weighs In via @sejournal, @MattGSouthern

Google's Martin Splitt and Gary Illyes recently addressed a common question in search marketing: how technical do SEO professionals need to be?

In a Search Off the Record podcast, they offered guidance on which technical skills are helpful in SEO and discussed the long-standing friction between developers and SEO professionals.

Splitt noted:

"I think in order to optimize a system or work with a system so deeply like SEOs do, you have to understand some of the characteristics of the system."

However, he clarified that strong coding skills aren't a requirement for doing effective SEO work.

The Developer-SEO Divide

Splitt, who regularly speaks at both developer and SEO events, acknowledged that the relationship between these groups can sometimes be difficult.

Splitt says:

"Even if you go to a developer event and talk about SEO, it is a strained relationship you're entering."

He added that developers often approach SEO conversations with skepticism, even when they come from someone with a developer background.

This disconnect can cause real-world problems.

Illyes shared an example of a large agency that added a calendar plugin across multiple websites, unintentionally generating "100 million URLs." Google began crawling all of them, creating a major crawl budget issue.

What SEO Pros Need To Know

Rather than recommending that SEO professionals learn to code, Splitt advises understanding how web technologies function.

Splitt states:

"You should understand what is a header, how does HTTPS conceptually work, what's the certificate, how does that influence how the connection works."

He also advised being familiar with the differences between web protocols, such as HTTP/2 and HTTP/1.1.

While SEO pros don't need to write in programming languages like C, C++, or JavaScript, having some awareness of how JavaScript affects page rendering can be helpful.

Context Matters: Not All SEOs Need The Same Skills

Google also pointed out that SEO is a broad discipline, and the amount of technical knowledge needed can vary depending on your focus.

Splitt gave the example of international SEO. He initially said these specialists might not need technical expertise, but later clarified that internationalization often includes technical components too.

"SEO is such a broad field. There are people who are amazing at taking content international… they specialize on a much higher layer as in like the content and the structure and language and localization in different markets."

Still, he emphasized that people working in more technical roles, or in generalist positions, should aim to understand development concepts.

What This Means

Here's what the discussion means for SEO professionals:

  • Technical understanding matters, but being able to code is not always necessary. Knowing HTTP protocols, HTML basics, and how JavaScript interacts with pages can go a long way.
  • Your role defines your needs. If you're working on content strategy or localization, deep technical knowledge might not be essential. But if you're handling site migrations or audits, that knowledge becomes more critical.
  • Context should guide your decisions. Applying advice without understanding the "why" can lead to problems. SEO isn't one-size-fits-all.
  • Cross-team collaboration is vital. Google's comments suggest there's still a divide between developers and SEO teams. Improving communication between these groups could prevent technical missteps that affect rankings.

Looking Ahead

As websites become more complex and JavaScript frameworks continue to grow, technical literacy will likely become more important.

Google's message is clear: SEOs don't need to become developers, but having a working understanding of how websites function can make you far more effective.

For companies, closing the communication gap between development and marketing remains a key area of opportunity.

Listen to the full podcast episode below:


Featured Image: Roman Samborskyi/Shutterstock

See What AI Sees: AI Mode Killed the Old SEO Playbook – Here's the New One via @sejournal, @mktbrew

This post was sponsored by MarketBrew. The opinions expressed in this article are the sponsor's own.

Is Google using AI to censor thousands of independent websites?

Wondering why your traffic has suddenly dropped, even though you're doing SEO properly?

Between letters to the FTC describing a systematic dismantling of the open web by Google and SEO professionals who may be unaware that their strategies no longer make an impact, these changes represent a definite re-architecting of the web's entire incentive structure.

It's time to adapt.

While some were warning about AI passage retrieval and vector scoring, the industry largely stuck to legacy thinking. SEOs continued to focus on E-E-A-T, backlinks, and content refresh cycles, assuming that if they simply improved quality, recovery would come.

But the rules had changed.

Google's Silent Pivot: From Keywords to Embedding Vectors

In late 2023 and early 2024, Google began rolling out what it now refers to as AI Mode.

What Is Google's AI Mode?

AI Mode breaks content into passages, embeds those passages into a multi-dimensional vector space, and compares them directly to queries using cosine similarity.

In this new model, relevance is determined geometrically rather than lexically. Instead of ranking entire pages, Google evaluates individual passages. The most relevant passages are then surfaced in a ChatGPT-like interface, often without any need for users to click through to the source.

Beneath this visible change is a deeper shift: content scoring has become embedding-first.

What Are Embedding Vectors?

Embedding vectors are mathematical representations of meaning. When Google processes a passage of content, it converts that passage into a vector, a list of numbers that captures the semantic context of the text. These vectors exist in a multi-dimensional space where the distance between vectors reflects how similar the meanings are.

Instead of relying on exact keywords or matching phrases, Google compares the embedding vector of a search query to the embedding vectors of individual passages. This allows it to identify relevance based on deeper context, implied meaning, and overall intent.

Traditional SEO practices like keyword targeting and topical coverage do not carry the same weight in this system. A passage does not need to use specific words to be considered relevant. What matters is whether its vector lands close to the query vector in this semantic space.

How Are Embedding Vectors Different From Keywords?

Keywords focus on exact matches. Embedding vectors focus on meaning.

Traditional SEO relied on placing target terms throughout a page. But Google's AI Mode now compares the semantic meaning of a query and a passage using embedding vectors. A passage can rank well even if it doesn't use the same words, as long as its meaning aligns closely with the query.

This shift has made many SEO strategies outdated. Pages may be well-written and keyword-rich, yet still underperform if their embedded meaning doesn't match search intent.
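
To make the difference concrete, here is a small sketch using the open-source sentence-transformers library rather than Google's internal models: the passage that shares meaning with the query can score higher than the passage that merely shares a keyword. The model name and example texts are illustrative.

# A sketch with an open-source embedding model, not Google's systems. The passage
# that matches the query's meaning can outscore the one that only shares a word.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how do I keep my houseplants alive in winter"
passages = [
    # Aligned meaning, almost no shared keywords with the query.
    "Reduce watering and move pots toward bright windows once indoor heating dries out the air.",
    # Shares the keyword "winter" but answers a different intent.
    "Winter is the coldest season of the year in polar and temperate climates.",
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = model.encode(query)
for passage, passage_vec in zip(passages, model.encode(passages)):
    print(f"{cosine(query_vec, passage_vec):.3f}  {passage}")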

What SEO Got Wrong & What Comes Next

The story isn't just about Google changing the game; it's also about how the SEO industry failed to notice the rules had already shifted.

Don't: Misread the Signals

As rankings dropped, many teams assumed they'd been hit by a quality update or core algorithm tweak. They doubled down on familiar tactics: improving E-E-A-T signals, updating titles, and refreshing content. They pruned thin pages, boosted internal links, and ran audits.

But these efforts were based on outdated models. They treated the symptom (visibility loss), not the cause: semantic drift.

Semantic drift happens when your content's vector no longer aligns with the evolving vector of search intent. It's invisible to traditional SEO tools because it occurs in latent space, not your HTML.

No amount of backlinks or content tweaks can fix that.

This wasn't just platform abuse. It was also a strategic oversight.

Many SEO teams believed that doing what Google said (improving helpfulness, pruning content, and writing for humans) would be enough.

That promise collapsed under AI scrutiny.

But we're not powerless.

Don't: Fall Into The Trap of Compliance

Google told the industry to "focus on helpful content," and SEOs listened, through a lexical lens. They optimized for tone, readability, and FAQs.

But "helpfulness" was being determined mathematically by whether your vectors aligned with the AI's interpretation of the query.

Thousands of reworked sites still dropped in visibility. Why? Because while polishing copy, they never asked: Does this content geometrically align with search intent?

Do: Optimize For Data, Not Keywords

The new SEO playbook begins with a simple truth: you are optimizing for math, not words.

The New SEO Playbook: How To Optimize For AI-Powered SERPs

Here's what we now know:

  1. AI Mode is real and measurable.
    ✅ You can calculate embedding similarity.
    ✅ You can test passages against queries.
    ✅ You can visualize how Google ranks.
  2. Content must align semantically, not just topically.
    ✅ Two pages about "best hiking trails" may be lexically similar, but if one focuses on family hikes and the other on extreme terrain, their vectors diverge.
  3. Authority still matters, but only after similarity.
    ✅ The AI Mode fan-out selects relevant passages first. Authority reranking comes later.
    ✅ If you don't pass the similarity threshold, your authority won't matter.
  4. Passage-level optimization is the new frontier.
    ✅ Optimizing entire pages isn't enough. Each chunk of content must pull semantic weight.

How Do I Track Google AI Mode Data To Improve SERP Visibility?

It depends on your goals; for success in SERPs, you need tools that not only show you visibility data but also show you how to get there.

Profound was one of the first tools to measure whether content appeared inside large language models, essentially offering a visibility check for LLM inclusion. It gave SEOs early signals that AI systems were beginning to treat search results differently, sometimes surfacing pages that never ranked traditionally. Profound made it clear: LLMs were not relying on the same scoring systems that SEOs had spent decades trying to influence.

But Profound stopped short of offering explanations. It told you if your content was chosen, but not why. It didn't simulate the algorithmic behavior of AI Mode or reveal what changes would lead to better inclusion.

That's where simulation-based platforms came in.

Market Brew approached the challenge differently. Instead of auditing what was visible inside an AI system, they reconstructed the inner logic of those systems, building search engine models that mirrored Google's evolution toward embeddings and vector-based scoring. These platforms didn't just observe the effects of AI Mode; they recreated its mechanisms.

As early as 2023, Market Brew had already implemented:

  • Passage segmentation that divides page content into consistent ~700-character blocks.
  • Embedding generation using Sentence-BERT to capture the semantic fingerprint of each passage.
  • Cosine similarity calculations to simulate how queries match specific blocks of content, not just the page as a whole.
  • Thematic clustering algorithms, like Top Cluster Similarity, to determine which groupings of passages best aligned with a search intent.

Market Brew Tutorial: Mastering the Top Cluster Similarity Ranking Factor | First Principles SEO

This meant users could test a set of prompts against their content and watch the algorithm think, block by block, similarity score by score.
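
That kind of block-by-block analysis can be roughed out with open-source tools. The sketch below is an approximation under stated assumptions (roughly 700-character blocks, a public Sentence-BERT model, plain cosine similarity), not Market Brew's or Google's actual code, and the page file and prompts are placeholders.

# An approximation of block-by-block scoring with open-source tools: segment the
# page into ~700-character blocks, embed blocks and prompts, and compare them
# with cosine similarity. This is a sketch, not any vendor's actual pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

BLOCK_SIZE = 700  # characters, mirroring the passage segmentation described above

def chunk(text, size=BLOCK_SIZE):
    """Split page text into fixed-size character blocks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

model = SentenceTransformer("all-MiniLM-L6-v2")

page_text = open("page.txt", encoding="utf-8").read()  # plain text of the page (placeholder file)
prompts = [
    "best hiking trails for families",
    "extreme terrain hiking routes",
]

blocks = chunk(page_text)
block_vecs = model.encode(blocks)    # shape: (n_blocks, dim)
prompt_vecs = model.encode(prompts)  # shape: (n_prompts, dim)

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Cosine similarity matrix: rows are prompts, columns are content blocks.
similarity = normalize(prompt_vecs) @ normalize(block_vecs).T

for row, prompt in enumerate(prompts):
    best = int(similarity[row].argmax())
    print(f"{prompt!r}: best block #{best} (cosine {similarity[row, best]:.3f})")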

Where Profound offered visibility, Market Brew offered agency.

Instead of asking "Did I show up in an AI overview?", simulation tools helped SEOs ask, "Why didn't I?" and more importantly, "What can I change to improve my chances?"

By visualizing AI Mode behavior before Google ever acknowledged it publicly, these platforms gave early adopters a critical edge. The SEOs using them didn't wait for traffic to drop before acting; they were already optimizing for vector alignment and semantic coverage long before most of the industry knew it mattered.

And in an era where rankings hinge on how well your embeddings match a user's intent, that head start has made all the difference.

Visualize AI Mode Coverage. For Free.

SEO didn't die. It transformed, from art into applied geometry.

AI Mode Visualizer Tutorial

To help SEOs adapt to this AI-driven landscape, Market Brew has just announced the AI Mode Visualizer, a free tool that simulates how Google's AI Overviews evaluate your content:

  • Enter a page URL.
  • Input up to 10 search prompts or generate them automatically from a single master query using LLM-style prompt expansion.
  • See a cosine similarity matrix showing how each content chunk (700 characters) for your page aligns with each intent.
  • Click any score to view exactly which passage matched, and why.

🔗 Try the AI Mode Visualizer

This is the only tool that lets you watch AI Mode think.

Two Truths, One Future

Nate Hake is right: Google restructured the game. The data reflects an industry still catching up to the new playbook.

Because two things can be true:

  • Google may be clearing space for its own services, ad products, and AI monopolies.
  • And many SEOs are still chasing ghosts in a world governed by geometry.

It's time to move beyond guesses.

If AI Mode is the new architecture of search, we need tools that expose how it works, not just theories about what changed.

We brought you this story back in early 2024, before AI Overviews had a name, explaining how embeddings and vector scoring would reshape SEO.

Tools like the AI Mode Visualizer offer a rare chance to see behind the curtain.

Use it. Test your assumptions. Map the space between your content and modern relevance.

Search didn't end.

But the way forward demands new eyes.

Image Credits

Featured Image: Image by MarketBrew. Used with permission.

Google Removes Robots.txt Guidance For Blocking Auto-Translated Pages via @sejournal, @MattGSouthern

Google removes robots.txt guidance for blocking auto-translated pages. This change aligns Google's technical documents with its spam policies.

  • Google removed guidance advising websites to block auto-translated pages via robots.txt.
  • This aligns with Google's policies that judge content by user value, not creation method.
  • Use meta tags like "noindex" for low-quality translations instead of sitewide exclusions.

Google Launches Loyalty Program Structured Data Support via @sejournal, @MattGSouthern

Google now supports structured data that allows businesses to show loyalty program benefits in search results.

Businesses can use two new types of structured data. One type defines the loyalty program itself, while the other illustrates the benefits members receive for specific products.

Here's what you need to know.

Loyalty Structured Data

When businesses use this new structured data for loyalty programs, their products can display member benefits directly in Google. This allows shoppers to view the perks before clicking on any listings.

Google recognizes four specific types of loyalty benefits that can be displayed:

  • Loyalty Points: Points earned per purchase
  • Member-Only Prices: Exclusive pricing for members
  • Special Returns: Perks like free returns
  • Special Shipping: Benefits like free or expedited shipping

This is a new way to make products more visible. It may also result in higher clicks from search results.

The announcement states:

"… member benefits, such as lower prices and earning loyalty points, are a major factor considered by shoppers when buying products online."

Details & Requirements

The new feature needs two steps.

  1. First, add loyalty program info to your 'Organization' structured data.
  2. Then, add loyalty benefits to your 'Product' structured data.
  3. Bonus step: Check if your markup works using the Rich Results Test tool.

With valid markup in place, Google will be aware of your loyalty program and the perks associated with each product.

Important implementation note: Google recommends placing all loyalty program information on a single dedicated page rather than spreading it across multiple pages. This helps ensure proper crawling and indexing.
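
As a rough illustration of those two steps, the sketch below renders the markup as JSON-LD from Python dictionaries. The property names (hasMemberProgram, MemberProgram, hasTiers, MemberProgramTier, validForMemberTier) are assumptions based on schema.org-style conventions, so verify the exact vocabulary against Google's documentation and the Rich Results Test before publishing.

# A rough sketch of the two steps above, rendered as JSON-LD from Python dicts.
# Property names below are assumptions; confirm them against Google's docs.
import json

organization_markup = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Shop",
    "url": "https://www.example.com",
    # Step 1: describe the loyalty program itself (assumed property names).
    "hasMemberProgram": {
        "@type": "MemberProgram",
        "name": "Example Rewards",
        "hasTiers": [{"@type": "MemberProgramTier", "name": "Gold"}],
    },
}

product_markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Sneaker",
    "offers": {
        "@type": "Offer",
        "price": "89.99",
        "priceCurrency": "USD",
        # Step 2: attach the member benefit (a member-only price, assumed names).
        "priceSpecification": {
            "@type": "UnitPriceSpecification",
            "price": "79.99",
            "priceCurrency": "USD",
            "validForMemberTier": {"@type": "MemberProgramTier", "name": "Gold"},
        },
    },
}

for markup in (organization_markup, product_markup):
    print('<script type="application/ld+json">')
    print(json.dumps(markup, indent=2))
    print("</script>")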

Multi-Tier Programs Now Supported

Businesses can define multiple membership tiers within a single loyalty program (think bronze, silver, and gold levels). Each tier can have different requirements for joining, such as:

  • Credit card signup requirements
  • Minimum spending thresholds (e.g., $250 annual spend)
  • Periodic membership fees

This flexibility allows businesses to create sophisticated loyalty structures that match their existing programs.

Merchant Center Takes Priority

Google Shopping software engineers Irina Tuduce and Pascal Fleury say this feature is:

"… especially important if you don't have a Merchant Center account and want the ability to provide a loyalty program for your business."

It's worth reiterating: If your business already uses Google Merchant Center, keep using that for loyalty programs.

In fact, if you implement both structured data markup and Merchant Center loyalty programs, Google will prioritize the Merchant Center settings. This override ensures there's no confusion about which data source takes precedence.

Looking Ahead

The update seems aimed at helping smaller businesses compete with larger retailers, which often have complex Merchant Center setups.

Now, smaller sites can share similar information using structured data, including sophisticated multi-tier programs that were previously difficult to implement without Merchant Center.

Small and medium e-commerce sites without Merchant Center accounts should strongly consider adopting this markup.

For more details, see Google's new help page.

How To Host Or Migrate A Website In 2025: Factors That May Break Rankings [+ Checklist] via @sejournal, @inmotionhosting

This post was sponsored by InMotion Hosting. The opinions expressed in this article are the sponsor's own.

Is your website struggling to maintain visibility in search results despite your SEO efforts?

Are your Core Web Vitals scores inconsistent, no matter how many optimizations you implement?

Have you noticed competitors outranking you even when your content seems superior?

In 2025, hosting isn't just a backend choice. It's a ranking signal.

In this guide, you'll learn how hosting decisions impact your ability to rank, and how to choose (or migrate to) hosting that helps your visibility.

Learn to work with your rankings, not against them, with insights from InMotion Hosting's enterprise SEO specialists.

Jump Straight To Your Needs

Best For | Hosting Type | How Easy Is Migration?
Growing SMBs | VPS | Easy: Launch Assist (free)
Enterprise / SaaS | Dedicated | Very Easy: White-Glove + Managed Service

Don't know which one you need? Read on.

Hosting Directly Impacts SEO Performance

Your hosting environment is the foundation of your SEO efforts. Poor hosting can undermine even the best content and keyword strategies.

Key Areas That Hosting Impacts

Core Web Vitals

Server response time directly affects Largest Contentful Paint (LCP) and First Input Delay (FID), two critical ranking factors.

Solution: Hosting with NVMe storage and sufficient RAM improves these metrics.

Crawl Budget

Your website's visibility to search engines can be affected by limited server resources, wrong settings, and firewalls that restrict access.

When search engines encounter these issues, they index fewer pages and visit your site less often.

Solution: Upgrade to a hosting provider that's built for SEO performance and consistent uptime.

Indexation Success

Proper .htaccess rules for redirects, error handling, and DNS configurations are essential for search engines to index your content effectively.

Many hosting providers limit your ability to change this important file, restricting you from:

– Editing your .htaccess file.

– Installing certain SEO or security plugins.

– Adjusting server settings.

These restrictions can hurt your site's ability to be indexed and affect your overall SEO performance.

Solution: VPS and dedicated hosting solutions give you full access to these settings.

SERP Stability During Traffic Spikes

If your content goes viral or experiences a temporary surge in traffic, poor hosting can cause your site to crash or slow down significantly. This can lead to drops in your rankings if not addressed right away.

Solution: Using advanced caching mechanisms can help prevent these problems.

Server Security

Google warns users about sites with security issues in Search Console. Warnings like "Social Engineering Detected" can erode user trust and hurt your rankings.

Solution: Web Application Firewalls offer important protection against security threats.

Server Location

The location of your server affects how fast your site loads for different users, which can influence your rankings.

Solution: Find a web host that operates data centers in multiple server locations, such as two in the United States, one in Amsterdam, and, soon, one in Singapore. This helps reduce loading times for users worldwide.

Load Times

Faster-loading pages lead to lower bounce rates, which can improve your SEO. Server-side optimizations, such as caching and compression, are vital for achieving fast load times.

These factors have always been important, but they are even more critical now that AI plays a role in search engine results.

40 Times Faster Page Speeds with Top Scoring Core Web Vitals with InMotion Hosting UltraStack One. (Source: InMotion Hosting UltraStack One for WordPress.) Image created by InMotion Hosting, 2025.

2025 Update: Search Engines Are Prioritizing Hosting & Technical Performance More Than Ever

In 2025, search engines have fully embraced AI-driven results, and with this shift has come an increased emphasis on technical performance signals that only proper hosting can deliver.

How 2025 AI Overview SERPs Affect Your Website's Technical SEO

Google is doubling down on performance signals. Its systems now place even greater weight on:

  • Uptime: Sites with frequent server errors due to outages experience more ranking fluctuations than in previous years. 99.99% uptime guarantees are now essential.
  • Server-Side Rendering: As JavaScript frameworks become more prevalent, servers that efficiently handle rendering deliver a better user experience and improved Core Web Vitals scores. Server-optimized JS rendering can make a difference.
  • Trust Scores: Servers free of malware with healthy dedicated IP addresses isolated to just your site (rather than shared with potentially malicious sites) receive better crawling and indexing treatment. InMotion Hosting's security-first approach helps maintain these crucial trust signals.
  • Content Freshness: Server E-Tags and caching policies affect how quickly Google recognizes and indexes new or updated content.
  • TTFB (Time To First Byte): Server location, network stability, and input/output speeds all impact TTFB. Servers equipped with NVMe storage technology excel at I/O speeds, delivering faster data retrieval and improved SERP performance.

Infographic illustrating how browser caching works. (Source: Ultimate Guide to Optimize WordPress Performance.) Created by InMotion Hosting, May 2025.

Modern search engines utilize AI models that prioritize sites that deliver consistent, reliable, and fast data. This shift means hosting that can render pages quickly is no longer optional for competitive rankings.

What You Can Do About It (Even If You're Not Into Technical SEO)

You don't need to be a server administrator to improve your website's performance. Here's what you can do.

1. Choose Faster Hosting

Upgrade from shared hosting to VPS or dedicated hosting with NVMe storage. InMotion Hosting's plans are specifically designed to boost SEO performance.

2. Use Monitoring Tools

Free tools like UptimeRobot.com, WordPress plugins, or cPanel's resource monitoring can alert you to performance issues before they affect your rankings.

3. Implement Server-Side Caching

Set up caching with Redis or Memcached using WordPress plugins like W3 Total Cache, or through cPanel.

4. Add a CDN

Content Delivery Networks (CDNs) can enhance global performance without needing server changes. InMotion Hosting makes CDN integration easy.

5. Utilize WordPress Plugins

Use WordPress plugins that generate llms.txt files to help AI tools crawl your site more effectively.

6. Work with Hosting Providers Who Understand SEO

InMotion Hosting offers managed service packages for thorough server optimization, tailored for optimal SEO performance.

Small Business: VPS Hosting Is Ideal for Reliable Performance on a Budget

VPS hosting is every growing business's secret SEO weapon.

Imagine two competing local service businesses, both with similar content and backlink profiles, but one uses shared hosting while the other uses a VPS.

When customers search for services, the VPS-hosted site consistently appears higher in results because it loads faster and delivers a smoother user experience.

What Counts as an SMB

Small to medium-sized businesses typically have fewer than 500 employees, annual revenue under $100 million, and websites that receive up to 50,000 monthly visitors.

If your business falls into this category, VPS hosting offers the ideal balance of performance and cost.

What You Get With VPS Hosting

1. Fast Speeds with Less Competition

VPS hosting gives your website dedicated resources, unlike shared hosting where many sites compete for the same resources. InMotion Hosting's VPS solutions ensure your site runs smoothly with optimal resource allocation.

2. More Control Over SEO

With VPS hosting, you can easily set up caching, SSL, and security features that affect SEO. Full root access enables you to have complete control over your server environment.

3. Affordable for Small Businesses Focused on SEO

VPS hosting provides high-quality performance at a lower cost than dedicated servers, making it a great option for growing businesses.

4. Reliable Uptime

InMotion Hosting's VPS platform guarantees 99.99% uptime through triple replication across multiple nodes. If one node fails, two copies of your site will keep it running.

5. Better Performance for Core Web Vitals

Dedicated CPU cores and RAM lead to faster loading times and improved Core Web Vitals scores. You can monitor server resources to keep track of performance.

6. Faster Connections

Direct links to major internet networks improve TTFB (Time To First Byte), an important SEO measure.

7. Strong Security Tools

InMotion Hosting provides security measures to protect your site against potential threats that could harm it and negatively impact your search rankings. Their malware prevention systems keep your site safe.

How To Set Up VPS Hosting For Your SEO-Friendly Website

  1. Assess your website’s current performance using tools like Google PageSpeed Insights and Search Console
  2. Choose a VPS plan that matches your traffic volume and resource needs
  3. Work with your provider's migration team to transfer your site (InMotion Hosting offers Launch Assist for seamless transitions)
  4. Implement server-level caching for optimal performance
  5. Configure your SSL certificate to ensure secure connections
  6. Set up performance monitoring to track improvements
  7. Update DNS settings to point to your new server

Large & Enterprise Businesses: Dedicated Hosting Is Perfect For Scaling SEO

What Counts As An Enterprise Business?

Enterprise businesses typically have complex websites with over 1,000 pages, receive more than 100,000 monthly visitors, operate multiple domains or subdomains, or run resource-intensive applications that serve many concurrent users.

Benefits of Dedicated Hosting

Control Over Server Settings

Dedicated hosting provides you with full control over how your server is configured. This is important for enterprise SEO, which often needs specific settings to work well.

Better Crawlability for Large Websites

More server resources allow search engines to crawl more pages quickly. This helps ensure your content gets indexed on time. Advanced server logs provide insights to help you improve crawl patterns.

Reliable Uptime for Global Users

Enterprise websites need to stay online. Dedicated hosting offers reliable service that meets the expectations of users around the world.

Strong Processing Power for Crawlers

Dedicated CPU resources provide the power needed to handle spikes from search engine crawlers when they index your site. InMotion Hosting uses the latest Intel Xeon processors for better performance.

Multiple Dedicated IP Addresses

Having multiple dedicated IP addresses is important for businesses and SaaS platforms that offer API microservices. IP management tools make it easier to manage these addresses.

Custom Security Controls

You can create specific firewall rules and access lists to manage traffic and protect against bots. DDoS protection systems enhance your security.

Real-Time Server Logs

You can watch for crawl surges and performance issues as they happen with detailed server logs. Log analysis tools help you find opportunities to improve.

Load Balancing for Traffic Management

Load balancing helps spread traffic evenly across resources. This way, you can handle increases in traffic without slowing down performance. InMotion Hosting provides strong load balancing solutions.

Future Scalability

You can use multiple servers and networks to manage traffic and resources as your business grows. Scalable infrastructure planning keeps your performance ready for the future.

Fixed Pricing Plans

You can manage costs effectively as you grow with predictable pricing plans.

How To Migrate To Dedicated Hosting

  1. Conduct a thorough site audit to identify all content and technical requirements.
  2. Document your current configuration, including plugins, settings, and custom code.
  3. Work with InMotion Hosting's migration specialists to plan the transition.
  4. Set up a staging environment to test the new configuration before going live.
  5. Configure server settings for optimal SEO performance.
  6. Implement monitoring tools to track key metrics during and after migration.
  7. Create a detailed redirect map for any URL changes.
  8. Roll out the migration during low-traffic periods to minimize impact.
  9. Verify indexing status in Google Search Console post-migration.

[DOWNLOAD] Website Migration Checklist

Free Website Migration Checklist download from InMotion Hosting – a step-by-step guide to smoothly transfer your website. Image created by InMotion Hosting, May 2025.

    Why Shared Hosting Can Kill Your SERP Rankings & Core Web Vitals

    If you're serious about SEO in 2025, shared hosting is a risk that doesn't come with rewards.

    Shared Hosting Issues & Risks

    Capped Resource Environments

    Shared hosting plans typically impose strict limits on CPU usage, memory, and connections. These limitations directly impact Core Web Vitals scores and can lead to temporary site suspensions during traffic spikes.

    Resource Competition

    Every website on a shared server competes for the same limited resources.

    This becomes even more problematic with AI bots accessing hundreds of sites simultaneously on a single server.

    Neighbor Problems

    A resource-intensive website on your shared server can degrade performance for all sites, including yours. Isolated hosting environments eliminate this risk.

    Collateral Damage During Outages

    When a shared server becomes overwhelmed, not only does your website go down, but so do connected services like domains and email accounts. InMotion Hosting's VPS and dedicated solutions provide isolation from these cascading failures.

    Limited Access to Server Logs

    Without detailed server logs, diagnosing and resolving technical SEO issues becomes nearly impossible. Advanced log analysis is essential for optimization.

    Restricted Configuration Access

    Shared hosting typically prevents modifications to server-level configurations that are essential for optimizing technical SEO.

    Inability to Adapt Quickly

    Shared environments limit your ability to implement emerging SEO techniques, particularly those designed to effectively handle AI crawlers. Server-level customization is increasingly important for SEO success.

    In 2025, Reliable Hosting Is a Competitive Advantage

    As search engines place greater emphasis on technical performance, your hosting choice is no longer just an IT decision; it's a strategic marketing investment.

    InMotion Hosting's VPS and Dedicated Server solutions are engineered specifically to address the technical SEO challenges of 2025 and beyond. With NVMe-powered storage, optimized server configurations, and 24/7 expert human support, we provide the foundation your site needs to achieve and maintain top rankings.

    Ready to turn your hosting into an SEO advantage? Learn more about our SEO-first hosting solutions designed for performance and scale.


    Image Credits

    Featured Image: Image by Shutterstock. Used with permission.

    In-Post Image: Images by InMotion Hosting. Used with permission.

    How To Use LLMs For 301 Redirects At Scale via @sejournal, @vahandev

    Redirects are essential to every website's maintenance, and managing redirects becomes really challenging when SEO pros deal with websites containing millions of pages.

    Examples of situations where you may need to implement redirects at scale:

    • An ecommerce site has a large number of products that are no longer sold.
    • Outdated pages of news publications are no longer relevant or lack historical value.
    • Listing directories that contain outdated listings.
    • Job boards where postings expire.

    Why Is Redirecting At Scale Essential?

    It can help improve user experience, consolidate rankings, and save crawl budget.

    You might consider noindexing, but this does not stop Googlebot from crawling. It wastes crawl budget as the number of pages grows.

    From a user experience perspective, landing on an outdated link is frustrating. For example, if a user lands on an outdated job listing, it's better to send them to the closest match for an active job listing.

    At Search Engine Journal, we get many 404 links from AI chatbots because of hallucinations as they invent URLs that never existed.

    We use Google Analytics 4 and Google Search Console (and sometimes server logs) reports to extract those 404 pages and redirect them to the closest matching content based on article slug.

    When chatbots cite us via 404 pages, and people keep coming through broken links, it is not a good user experience.

    Prepare Redirect Candidates

    First of all, read this post to learn how to create a Pinecone vector database. (Please note that in this case, we used "primary_category" as a metadata key vs. "category.")

    To make this work, we assume that all your article vectors are already stored in the "article-index-vertex" database.

    Prepare your redirect URLs in CSV format like in this sample file. That could be existing articles you've decided to prune or 404s from your Search Console reports or GA4.

    Sample file with URLs to be redirected (Screenshot from Google Sheet, May 2025)

    Optional "primary_category" information is metadata that exists with your articles' Pinecone records when you created them and can be used to filter articles from the same category, enhancing accuracy further.

    In case the title is missing, for example, in 404 URLs, the script will extract slug words from the URL and use them as input.
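
    For illustration only, here is the general idea of turning a 404 URL's slug into query text; the full script below has its own configuration (such as MIN_SLUG_LENGTH) and its implementation may differ:

    # A sketch of slug extraction for 404 URLs with no title available.
    import re
    from urllib.parse import urlparse

    def slug_to_query(url: str, min_length: int = 3) -> str:
        """Extract meaningful words from the last path segment of a URL."""
        slug = urlparse(url).path.rstrip("/").split("/")[-1]
        words = re.split(r"[-_]+", slug)
        return " ".join(w for w in words if len(w) >= min_length and not w.isdigit())

    print(slug_to_query("https://www.example.com/category/how-to-audit-redirects-2024/"))
    # -> "how audit redirects"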

    Generate Redirects Using Google Vertex AI

    Download your Google API service credentials and rename them as "config.json," upload the script below and a sample file to the same directory in Jupyter Lab, and run it.

    
    import os
    import time
    import logging
    from urllib.parse import urlparse
    import re
    import pandas as pd
    from pandas.errors import EmptyDataError
    from typing import Optional, List, Dict, Any
    
    from google.auth import load_credentials_from_file
    from google.cloud import aiplatform
    from google.api_core.exceptions import GoogleAPIError
    
    from pinecone import Pinecone, PineconeException
    from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput
    
    # Import tenacity for retry mechanism. Tenacity provides a decorator to add retry logic
    # to functions, making them more robust against transient errors like network issues or API rate limits.
    from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
    
    # For clearing output in Jupyter (optional, keep if running in Jupyter).
    # This is useful for interactive environments to show progress without cluttering the output.
    from IPython.display import clear_output
    
    # ─── USER CONFIGURATION ───────────────────────────────────────────────────────
    # Define configurable parameters for the script. These can be easily adjusted
    # without modifying the core logic.
    
    INPUT_CSV = "redirect_candidates.csv"      # Path to the input CSV file containing URLs to be redirected.
                                               # Expected columns: "URL", "Title", "primary_category".
    OUTPUT_CSV = "redirect_map.csv"            # Path to the output CSV file where the generated redirect map will be saved.
    PINECONE_API_KEY = "YOUR_PINECONE_KEY"     # Your API key for Pinecone. Replace with your actual key.
    PINECONE_INDEX_NAME = "article-index-vertex" # The name of the Pinecone index where article vectors are stored.
    GOOGLE_CRED_PATH = "config.json"           # Path to your Google Cloud service account credentials JSON file.
    EMBEDDING_MODEL_ID = "text-embedding-005"  # Identifier for the Vertex AI text embedding model to use.
    TASK_TYPE = "RETRIEVAL_QUERY"              # The task type for the embedding model. Try with RETRIEVAL_DOCUMENT vs RETRIEVAL_QUERY to see the difference.
                                               # This influences how the embedding vector is generated for optimal retrieval.
    CANDIDATE_FETCH_COUNT = 3    # Number of potential redirect candidates to fetch from Pinecone for each input URL.
    TEST_MODE = True             # If True, the script will process only a small subset of the input data (MAX_TEST_ROWS).
                                 # Useful for testing and debugging.
    MAX_TEST_ROWS = 5            # Maximum number of rows to process when TEST_MODE is True.
    QUERY_DELAY = 0.2            # Delay in seconds between successive API queries (to avoid hitting rate limits).
    PUBLISH_YEAR_FILTER: List[int] = []  # Optional: List of years to filter Pinecone results by 'publish_year' metadata.
                                         # If empty, no year filtering is applied.
    LOG_BATCH_SIZE = 5           # Number of URLs to process before flushing the results to the output CSV.
                                 # This helps in saving progress incrementally and managing memory.
    MIN_SLUG_LENGTH = 3          # Minimum length for a URL slug segment to be considered meaningful for embedding.
                                 # Shorter segments might be noise or less descriptive.
    
    # Retry configuration for API calls (Vertex AI and Pinecone).
    # These parameters control how the `tenacity` library retries failed API requests.
    MAX_RETRIES = 5              # Maximum number of times to retry an API call before giving up.
    INITIAL_RETRY_DELAY = 1      # Initial delay in seconds before the first retry.
                                 # Subsequent retries will have exponentially increasing delays.
    
    # ─── SETUP LOGGING ─────────────────────────────────────────────────────────────
    # Configure the logging system to output informational messages to the console.
    logging.basicConfig(
        level=logging.INFO,  # Set the logging level to INFO, meaning INFO, WARNING, ERROR, CRITICAL messages will be shown.
        format="%(asctime)s %(levelname)s %(message)s" # Define the format of log messages (timestamp, level, message).
    )
    
    # ─── INITIALIZE GOOGLE VERTEX AI ───────────────────────────────────────────────
    # Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the
    # service account key file. This allows the Google Cloud client libraries to
    # authenticate automatically.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = GOOGLE_CRED_PATH
    try:
        # Load credentials from the specified JSON file.
        credentials, project_id = load_credentials_from_file(GOOGLE_CRED_PATH)
        # Initialize the Vertex AI client with the project ID and credentials.
        # The location "us-central1" is specified for the AI Platform services.
        aiplatform.init(project=project_id, credentials=credentials, location="us-central1")
        logging.info("Vertex AI initialized.")
    except Exception as e:
        # Log an error if Vertex AI initialization fails and re-raise the exception
        # to stop script execution, as it's a critical dependency.
        logging.error(f"Failed to initialize Vertex AI: {e}")
        raise
    
    # Initialize the embedding model once globally.
    # This is a crucial optimization for "Resource Management for Embedding Model".
    # Loading the model takes time and resources; doing it once avoids repeated loading
    # for every URL processed, significantly improving performance.
    try:
        GLOBAL_EMBEDDING_MODEL = TextEmbeddingModel.from_pretrained(EMBEDDING_MODEL_ID)
        logging.info(f"Text Embedding Model '{EMBEDDING_MODEL_ID}' loaded.")
    except Exception as e:
        # Log an error if the embedding model fails to load and re-raise.
        # The script cannot proceed without the embedding model.
        logging.error(f"Failed to load Text Embedding Model: {e}")
        raise
    
    # ─── INITIALIZE PINECONE ──────────────────────────────────────────────────────
    # Initialize the Pinecone client and connect to the specified index.
    try:
        pinecone = Pinecone(api_key=PINECONE_API_KEY)
        index = pinecone.Index(PINECONE_INDEX_NAME)
        logging.info(f"Connected to Pinecone index '{PINECONE_INDEX_NAME}'.")
    except PineconeException as e:
        # Log an error if Pinecone initialization fails and re-raise.
        # Pinecone is a critical dependency for finding redirect candidates.
        logging.error(f"Pinecone init error: {e}")
        raise
    
    # ─── HELPERS ───────────────────────────────────────────────────────────────────
    def canonical_url(url: str) -> str:
        """
        Converts a given URL into its canonical form by:
        1. Stripping query strings (e.g., `?param=value`) and URL fragments (e.g., `#section`).
        2. Handling URL-encoded fragment markers (`%23`).
        3. Preserving the trailing slash if it was present in the original URL's path.
           This ensures consistency with the original site's URL structure.
    
        Args:
            url (str): The input URL.
    
        Returns:
            str: The canonicalized URL.
        """
        # Remove query parameters and URL fragments.
        temp = url.split('?', 1)[0].split('#', 1)[0]
        # Check for URL-encoded fragment markers and remove them.
        enc_idx = temp.lower().find('%23')
        if enc_idx != -1:
            temp = temp[:enc_idx]
        # Determine if the original URL path ended with a trailing slash.
        has_slash = urlparse(temp).path.endswith('/')
        # Remove any trailing slash temporarily for consistent processing.
        temp = temp.rstrip('/')
        # Re-add the trailing slash if it was originally present.
        return temp + ('/' if has_slash else '')
    
    
    def slug_from_url(url: str) -> str:
        """
        Extracts and joins meaningful, non-numeric path segments from a canonical URL
        to form a "slug" string. This slug can be used as text for embedding when
        a URL's title is not available.
    
        Args:
            url (str): The input URL.
    
        Returns:
            str: A hyphen-separated string of relevant slug parts.
        """
        clean = canonical_url(url) # Get the canonical version of the URL.
        path = urlparse(clean).path # Extract the path component of the URL.
        segments = [seg for seg in path.split('/') if seg] # Split path into segments and remove empty ones.
    
        # Filter segments based on criteria:
        # - Not purely numeric (e.g., '123' is excluded).
        # - Length is greater than or equal to MIN_SLUG_LENGTH.
        # - Contains at least one alphanumeric character (to exclude purely special character segments).
        parts = [seg for seg in segments
                 if not seg.isdigit()
                 and len(seg) >= MIN_SLUG_LENGTH
                 and re.search(r'[A-Za-z0-9]', seg)]
        return '-'.join(parts) # Join the filtered parts with hyphens.
    
    # ─── EMBEDDING GENERATION FUNCTION ─────────────────────────────────────────────
    # Apply retry mechanism for GoogleAPIError. This makes the embedding generation
    # more resilient to transient issues like network problems or Vertex AI rate limits.
    @retry(
        wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10), # Exponential backoff for retries.
        stop=stop_after_attempt(MAX_RETRIES), # Stop retrying after a maximum number of attempts.
        retry=retry_if_exception_type(GoogleAPIError), # Only retry if a GoogleAPIError occurs.
        reraise=True # Re-raise the exception if all retries fail, allowing the calling function to handle it.
    )
    def generate_embedding(text: str) -> Optional[List[float]]:
        """
        Generates a vector embedding for the given text using the globally initialized
        Vertex AI Text Embedding Model. Includes retry logic for API calls.
    
        Args:
            text (str): The input text (e.g., URL title or slug) to embed.
    
        Returns:
            Optional[List[float]]: A list of floats representing the embedding vector,
                                   or None if the input text is empty/whitespace or
                                   if an unexpected error occurs after retries.
        """
        if not text or not text.strip():
            # If the text is empty or only whitespace, no embedding can be generated.
            return None
        try:
            # Use the globally initialized model to get embeddings.
            # This is the "Resource Management for Embedding Model" optimization.
            inp = TextEmbeddingInput(text, task_type=TASK_TYPE)
            vectors = GLOBAL_EMBEDDING_MODEL.get_embeddings([inp], output_dimensionality=768)
            return vectors[0].values # Return the embedding vector (list of floats).
        except GoogleAPIError as e:
            # Log a warning if a GoogleAPIError occurs, then re-raise to trigger the `tenacity` retry mechanism.
            logging.warning(f"Vertex AI error during embedding generation (retrying): {e}")
            raise # The `reraise=True` in the decorator will catch this and retry.
        except Exception as e:
            # Catch any other unexpected exceptions during embedding generation.
            logging.error(f"Unexpected error generating embedding: {e}")
            return None # Return None for non-retryable or final failed attempts.
    
    # ─── MAIN PROCESSING FUNCTION ─────────────────────────────────────────────────
    def build_redirect_map(
        input_csv: str,
        output_csv: str,
        fetch_count: int,
        test_mode: bool
    ):
        """
        Builds a redirect map by processing URLs from an input CSV, generating
        embeddings, querying Pinecone for similar articles, and identifying
        suitable redirect candidates.
    
        Args:
            input_csv (str): Path to the input CSV file.
            output_csv (str): Path to the output CSV file for the redirect map.
            fetch_count (int): Number of candidates to fetch from Pinecone.
            test_mode (bool): If True, process only a limited number of rows.
        """
        # Read the input CSV file into a Pandas DataFrame.
        df = pd.read_csv(input_csv)
        required = {"URL", "Title", "primary_category"}
        # Validate that all required columns are present in the DataFrame.
        if not required.issubset(df.columns):
            raise ValueError(f"Input CSV must have columns: {required}")
    
        # Create a set of canonicalized input URLs for efficient lookup.
        # This is used to prevent an input URL from redirecting to itself or another input URL,
        # which could create redirect loops or redirect to a page that is also being redirected.
        input_urls = set(df["URL"].map(canonical_url))
    
        start_idx = 0
        # Implement resume functionality: if the output CSV already exists,
        # try to find the last processed URL and resume from the next row.
        if os.path.exists(output_csv):
            try:
                prev = pd.read_csv(output_csv)
            except EmptyDataError:
                # Handle case where the output CSV exists but is empty.
                prev = pd.DataFrame()
            if not prev.empty:
                # Get the last URL that was processed and written to the output file.
                last = prev["URL"].iloc[-1]
                # Find the index of this last URL in the original input DataFrame.
                idxs = df.index[df["URL"].map(canonical_url) == last].tolist()
                if idxs:
                    # Set the starting index for processing to the row after the last processed URL.
                    start_idx = idxs[0] + 1
                    logging.info(f"Resuming from row {start_idx} after {last}.")
    
        # Determine the range of rows to process based on test_mode.
        if test_mode:
            end_idx = min(start_idx + MAX_TEST_ROWS, len(df))
            df_proc = df.iloc[start_idx:end_idx] # Select a slice of the DataFrame for testing.
            logging.info(f"Test mode: processing rows {start_idx} to {end_idx-1}.")
        else:
            df_proc = df.iloc[start_idx:] # Process all remaining rows.
            logging.info(f"Processing rows {start_idx} to {len(df)-1}.")
    
        total = len(df_proc) # Total number of URLs to process in this run.
        processed = 0        # Counter for successfully processed URLs.
        batch: List[Dict[str, Any]] = [] # List to store results before flushing to CSV.
    
        # Iterate over each row (URL) in the DataFrame slice to be processed.
        for _, row in df_proc.iterrows():
            raw_url = row["URL"] # Original URL from the input CSV.
            url = canonical_url(raw_url) # Canonicalized version of the URL.
            # Get title and category, handling potential missing values by defaulting to empty strings.
            title = row["Title"] if isinstance(row["Title"], str) else ""
            category = row["primary_category"] if isinstance(row["primary_category"], str) else ""
    
            # Determine the text to use for generating the embedding.
            # Prioritize the 'Title' if available, otherwise use a slug derived from the URL.
            if title.strip():
                text = title
            else:
                slug = slug_from_url(raw_url)
                if not slug:
                    # If no meaningful slug can be extracted, skip this URL.
                    logging.info(f"Skipping {raw_url}: insufficient slug context for embedding.")
                    continue
                text = slug.replace('-', ' ') # Prepare slug for embedding by replacing hyphens with spaces.
    
            # Attempt to generate the embedding for the chosen text.
            # This call is wrapped in a try-except block to catch final failures after retries.
            try:
                embedding = generate_embedding(text)
            except GoogleAPIError as e:
                # If embedding generation fails even after retries, log the error and skip this URL.
                logging.error(f"Failed to generate embedding for {raw_url} after {MAX_RETRIES} retries: {e}")
                continue # Move to the next URL.
    
            if not embedding:
                # If `generate_embedding` returned None (e.g., empty text or unexpected error), skip.
                logging.info(f"Skipping {raw_url}: no embedding generated.")
                continue
    
            # Build metadata filter for Pinecone query.
            # This helps narrow down search results to more relevant candidates (e.g., by category or publish year).
            filt: Dict[str, Any] = {}
            if category:
                # Split category string by comma and strip whitespace for multiple categories.
                cats = [c.strip() for c in category.split(",") if c.strip()]
                if cats:
                    filt["primary_category"] = {"$in": cats} # Filter by categories present in Pinecone metadata.
            if PUBLISH_YEAR_FILTER:
                filt["publish_year"] = {"$in": PUBLISH_YEAR_FILTER} # Filter by specified publish years.
            filt["id"] = {"$ne": url} # Exclude the current URL itself from the search results to prevent self-redirects.
    
            # Define a nested function for Pinecone query with retry mechanism.
            # This ensures that Pinecone queries are also robust against transient errors.
            @retry(
                wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10),
                stop=stop_after_attempt(MAX_RETRIES),
                retry=retry_if_exception_type(PineconeException), # Only retry if a PineconeException occurs.
                reraise=True # Re-raise the exception if all retries fail.
            )
            def query_pinecone_with_retry(embedding_vector, top_k_count, pinecone_filter):
                """
                Performs a Pinecone index query with retry logic.
                """
                return index.query(
                    vector=embedding_vector,
                    top_k=top_k_count,
                    include_values=False, # We don't need the actual vector values in the response.
                    include_metadata=False, # We don't need the metadata in the response for this logic.
                    filter=pinecone_filter # Apply the constructed metadata filter.
                )
    
            # Attempt to query Pinecone for redirect candidates.
            try:
                res = query_pinecone_with_retry(embedding, fetch_count, filt)
            except PineconeException as e:
                # If Pinecone query fails after retries, log the error and skip this URL.
                logging.error(f"Failed to query Pinecone for {raw_url} after {MAX_RETRIES} retries: {e}")
                continue # Move to the next URL.
    
            candidate = None # Initialize redirect candidate to None.
            score = None     # Initialize relevance score to None.
    
            # Iterate through the Pinecone query results (matches) to find a suitable candidate.
            for m in res.get("matches", []):
                cid = m.get("id") # Get the ID (URL) of the matched document in Pinecone.
                # A candidate is suitable if:
                # 1. It exists (cid is not None).
                # 2. It's not the original URL itself (to prevent self-redirects).
                # 3. It's not another URL from the input_urls set (to prevent redirecting to a page that's also being redirected).
                if cid and cid != url and cid not in input_urls:
                    candidate = cid # Assign the first valid candidate found.
                    score = m.get("score") # Get the relevance score of this candidate.
                    break # Stop after finding the first suitable candidate (Pinecone returns by relevance).
    
            # Append the results for the current URL to the batch.
            batch.append({"URL": url, "Redirect Candidate": candidate, "Relevance Score": score})
            processed += 1 # Increment the counter for processed URLs.
            msg = f"Mapped {url} β†’ {candidate}"
            if score is not None:
                msg += f" ({score:.4f})" # Add score to log message if available.
            logging.info(msg) # Log the mapping result.
    
            # Periodically flush the batch results to the output CSV.
            if processed % LOG_BATCH_SIZE == 0:
                out_df = pd.DataFrame(batch) # Convert the current batch to a DataFrame.
                # Determine file mode: 'a' (append) if file exists, 'w' (write) if new.
                mode = 'a' if os.path.exists(output_csv) else 'w'
                # Determine if header should be written (only for new files).
                header = not os.path.exists(output_csv)
                # Write the batch to the CSV.
                out_df.to_csv(output_csv, mode=mode, header=header, index=False)
                batch.clear() # Clear the batch after writing to free memory.
                if not test_mode:
                    clear_output(wait=True)  # Clear output in Jupyter for a cleaner progress display.
                    print(f"Progress: {processed} / {total}") # Print progress update.
    
            time.sleep(QUERY_DELAY) # Pause for a short delay to avoid overwhelming APIs.
    
        # After the loop, write any remaining items in the batch to the output CSV.
        if batch:
            out_df = pd.DataFrame(batch)
            mode = 'a' if os.path.exists(output_csv) else 'w'
            header = not os.path.exists(output_csv)
            out_df.to_csv(output_csv, mode=mode, header=header, index=False)
    
        logging.info(f"Completed. Total processed: {processed}") # Log completion message.
    
    if __name__ == "__main__":
        # This block ensures that build_redirect_map is called only when the script is executed directly.
        # It passes the user-defined configuration parameters to the main function.
        build_redirect_map(INPUT_CSV, OUTPUT_CSV, CANDIDATE_FETCH_COUNT, TEST_MODE)
    

    You will see a test run with only five records, and a new file called "redirect_map.csv" will be created, containing the redirect suggestions.

    Once you've confirmed the code runs smoothly, set the TEST_MODE boolean to False and run the script for all your URLs.

    Test run with only five records (Image from author, May 2025)

    If the script stops and you rerun it, it picks up where it left off. It also checks each redirect candidate it finds against the input CSV file.

    This check prevents selecting a URL from the vector database that is also on the pruned list, since redirecting to a page that is itself being redirected could cause an infinite redirect loop.

    For our sample URLs, the output is shown below.

    Redirect candidates using Google Vertex AI's task type RETRIEVAL_QUERY (Image from author, May 2025)

    We can now take this redirect map and import it into our redirect manager in the content management system (CMS), and that’s it!
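
    If your setup doesn't have a redirect manager, a small sketch like the one below can convert "redirect_map.csv" into Apache-style Redirect 301 rules instead. This is an assumption about your stack (the "redirects.conf" include file is hypothetical); adapt it to Nginx or your CMS plugin's import format as needed.

    import pandas as pd
    from urllib.parse import urlparse
    
    df = pd.read_csv("redirect_map.csv")
    
    # Keep only the rows where the script actually found a candidate.
    df = df.dropna(subset=["Redirect Candidate"])
    
    with open("redirects.conf", "w") as f:
        for _, row in df.iterrows():
            old_path = urlparse(row["URL"]).path  # Redirect the path, not the full URL.
            f.write(f'Redirect 301 {old_path} {row["Redirect Candidate"]}\n')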

    You can see how it managed to match the outdated 2013 news article β€œYouTube Retiring Video Responses on September 12” to the newer, highly relevant 2022 news article β€œYouTube Adopts Feature From TikTok – Reply To Comments With A Video.”

    Also, for "/what-is-eat/," it found a match with "/google-eat/what-is-it/," which is a perfect match.

    This is not just due to the quality of Google Vertex AI's embedding model, but also the result of choosing the right parameters.

    When I use "RETRIEVAL_DOCUMENT" as the task type for generating the query vector embeddings for the YouTube news article shown above, it matches "YouTube Expands Community Posts to More Creators," which is still relevant but not as good a match as the other one.

    For β€œ/what-is-eat/,” it matches the article β€œ/reimagining-eeat-to-drive-higher-sales-and-search-visibility/545790/,” which is not as good as β€œ/google-eat/what-is-it/.”
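
    If you want to see the difference yourself, you can embed the same title with both task types, using the same model calls as in the script above, and compare the Pinecone matches each vector returns. A quick sketch:

    from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput
    
    # Assumes Vertex AI has already been initialized (aiplatform.init), as in the script above.
    model = TextEmbeddingModel.from_pretrained("text-embedding-005")
    title = "YouTube Retiring Video Responses on September 12"
    
    # The same title embedded with the two task types produces different vectors,
    # which is why the nearest neighbors returned by Pinecone also differ.
    query_vec = model.get_embeddings(
        [TextEmbeddingInput(title, task_type="RETRIEVAL_QUERY")], output_dimensionality=768
    )[0].values
    doc_vec = model.get_embeddings(
        [TextEmbeddingInput(title, task_type="RETRIEVAL_DOCUMENT")], output_dimensionality=768
    )[0].values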

    If you want to find redirect matches only from your pool of fresh articles, you can query Pinecone with one additional metadata filter, "publish_year," provided that field exists in your Pinecone records (which I highly recommend creating).

    In the code, this is controlled by the PUBLISH_YEAR_FILTER variable.

    If you have publish_year metadata, you can set the years as array values, and the query will pull only articles published in the specified years, as shown below.
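
    For example, the setting below would limit candidates to articles published in 2024 or 2025 (the years here are illustrative, and this assumes your Pinecone records carry a numeric "publish_year" metadata field):

    from typing import List
    
    # Restrict redirect candidates to recent articles.
    PUBLISH_YEAR_FILTER: List[int] = [2024, 2025]
    
    # Inside build_redirect_map(), this setting becomes part of the Pinecone metadata filter:
    # filt["publish_year"] = {"$in": [2024, 2025]}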

    Generate Redirects Using OpenAI’s Text Embeddings

    Let’s do the same task with OpenAI’s β€œtext-embedding-ada-002” model. The purpose is to show the difference in output from Google Vertex AI.

    Simply create a new notebook file in the same directory, copy and paste this code, and run it.

    
    import os
    import time
    import logging
    from urllib.parse import urlparse
    import re
    
    import pandas as pd
    from pandas.errors import EmptyDataError
    from typing import Optional, List, Dict, Any
    
    from openai import OpenAI
    from pinecone import Pinecone, PineconeException
    
    # Import tenacity for retry mechanism. Tenacity provides a decorator to add retry logic
    # to functions, making them more robust against transient errors like network issues or API rate limits.
    from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
    
    # For clearing output in Jupyter (optional, keep if running in Jupyter)
    from IPython.display import clear_output
    
    # ─── USER CONFIGURATION ───────────────────────────────────────────────────────
    # Define configurable parameters for the script. These can be easily adjusted
    # without modifying the core logic.
    
    INPUT_CSV = "redirect_candidates.csv"       # Path to the input CSV file containing URLs to be redirected.
                                                # Expected columns: "URL", "Title", "primary_category".
    OUTPUT_CSV = "redirect_map.csv"             # Path to the output CSV file where the generated redirect map will be saved.
    PINECONE_API_KEY = "YOUR_PINECONE_API_KEY"      # Your API key for Pinecone. Replace with your actual key.
    PINECONE_INDEX_NAME = "article-index-ada"   # The name of the Pinecone index where article vectors are stored.
    OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"    # Your API key for OpenAI. Replace with your actual key.
    OPENAI_EMBEDDING_MODEL_ID = "text-embedding-ada-002" # Identifier for the OpenAI text embedding model to use.
    CANDIDATE_FETCH_COUNT = 3    # Number of potential redirect candidates to fetch from Pinecone for each input URL.
    TEST_MODE = True             # If True, the script will process only a small subset of the input data (MAX_TEST_ROWS).
                                 # Useful for testing and debugging.
    MAX_TEST_ROWS = 5            # Maximum number of rows to process when TEST_MODE is True.
    QUERY_DELAY = 0.2            # Delay in seconds between successive API queries (to avoid hitting rate limits).
    PUBLISH_YEAR_FILTER: List[int] = []  # Optional: List of years to filter Pinecone results by 'publish_year' metadata eg. [2024,2025].
                                         # If empty, no year filtering is applied.
    LOG_BATCH_SIZE = 5           # Number of URLs to process before flushing the results to the output CSV.
                                 # This helps in saving progress incrementally and managing memory.
    MIN_SLUG_LENGTH = 3          # Minimum length for a URL slug segment to be considered meaningful for embedding.
                                 # Shorter segments might be noise or less descriptive.
    
    # Retry configuration for API calls (OpenAI and Pinecone).
    # These parameters control how the `tenacity` library retries failed API requests.
    MAX_RETRIES = 5              # Maximum number of times to retry an API call before giving up.
    INITIAL_RETRY_DELAY = 1      # Initial delay in seconds before the first retry.
                                 # Subsequent retries will have exponentially increasing delays.
    
    # ─── SETUP LOGGING ─────────────────────────────────────────────────────────────
    # Configure the logging system to output informational messages to the console.
    logging.basicConfig(
        level=logging.INFO,  # Set the logging level to INFO, meaning INFO, WARNING, ERROR, CRITICAL messages will be shown.
        format="%(asctime)s %(levelname)s %(message)s" # Define the format of log messages (timestamp, level, message).
    )
    
    # ─── INITIALIZE OPENAI CLIENT & PINECONE ───────────────────────────────────────
    # Initialize the OpenAI client once globally. This handles resource management efficiently
    # as the client object manages connections and authentication.
    client = OpenAI(api_key=OPENAI_API_KEY)
    try:
        # Initialize the Pinecone client and connect to the specified index.
        pinecone = Pinecone(api_key=PINECONE_API_KEY)
        index = pinecone.Index(PINECONE_INDEX_NAME)
        logging.info(f"Connected to Pinecone index '{PINECONE_INDEX_NAME}'.")
    except PineconeException as e:
        # Log an error if Pinecone initialization fails and re-raise.
        # Pinecone is a critical dependency for finding redirect candidates.
        logging.error(f"Pinecone init error: {e}")
        raise
    
    # ─── HELPERS ───────────────────────────────────────────────────────────────────
    def canonical_url(url: str) -> str:
        """
        Converts a given URL into its canonical form by:
        1. Stripping query strings (e.g., `?param=value`) and URL fragments (e.g., `#section`).
        2. Handling URL-encoded fragment markers (`%23`).
        3. Preserving the trailing slash if it was present in the original URL's path.
           This ensures consistency with the original site's URL structure.
    
        Args:
            url (str): The input URL.
    
        Returns:
            str: The canonicalized URL.
        """
        # Remove query parameters and URL fragments.
        temp = url.split('?', 1)[0]
        temp = temp.split('#', 1)[0]
        # Check for URL-encoded fragment markers and remove them.
        enc_idx = temp.lower().find('%23')
        if enc_idx != -1:
            temp = temp[:enc_idx]
        # Determine if the original URL path ended with a trailing slash.
        has_slash = urlparse(temp).path.endswith('/')
        # Remove any trailing slash temporarily for consistent processing.
        temp = temp.rstrip('/')
        # Re-add the trailing slash if it was originally present.
        return temp + ('/' if has_slash else '')
    
    
    def slug_from_url(url: str) -> str:
        """
        Extracts and joins meaningful, non-numeric path segments from a canonical URL
        to form a "slug" string. This slug can be used as text for embedding when
        a URL's title is not available.
    
        Args:
            url (str): The input URL.
    
        Returns:
            str: A hyphen-separated string of relevant slug parts.
        """
        clean = canonical_url(url) # Get the canonical version of the URL.
        path = urlparse(clean).path # Extract the path component of the URL.
        segments = [seg for seg in path.split('/') if seg] # Split path into segments and remove empty ones.
    
        # Filter segments based on criteria:
        # - Not purely numeric (e.g., '123' is excluded).
        # - Length is greater than or equal to MIN_SLUG_LENGTH.
        # - Contains at least one alphanumeric character (to exclude purely special character segments).
        parts = [seg for seg in segments
                 if not seg.isdigit()
                 and len(seg) >= MIN_SLUG_LENGTH
                 and re.search(r'[A-Za-z0-9]', seg)]
        return '-'.join(parts) # Join the filtered parts with hyphens.
    
    # ─── EMBEDDING GENERATION FUNCTION ─────────────────────────────────────────────
    # Apply retry mechanism for OpenAI API errors. This makes the embedding generation
    # more resilient to transient issues like network problems or API rate limits.
    @retry(
        wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10), # Exponential backoff for retries.
        stop=stop_after_attempt(MAX_RETRIES), # Stop retrying after a maximum number of attempts.
        retry=retry_if_exception_type(Exception), # Retry on any Exception from OpenAI client (can be refined to openai.APIError if desired).
        reraise=True # Re-raise the exception if all retries fail, allowing the calling function to handle it.
    )
    def generate_embedding(text: str) -> Optional[List[float]]:
        """
        Generate a vector embedding for the given text using OpenAI's text-embedding-ada-002
        via the globally initialized OpenAI client. Includes retry logic for API calls.
    
        Args:
            text (str): The input text (e.g., URL title or slug) to embed.
    
        Returns:
            Optional[List[float]]: A list of floats representing the embedding vector,
                                   or None if the input text is empty/whitespace or
                                   if an unexpected error occurs after retries.
        """
        if not text or not text.strip():
            # If the text is empty or only whitespace, no embedding can be generated.
            return None
        try:
            resp = client.embeddings.create( # Use the globally initialized OpenAI client to get embeddings.
                model=OPENAI_EMBEDDING_MODEL_ID,
                input=text
            )
            return resp.data[0].embedding # Return the embedding vector (list of floats).
        except Exception as e:
            # Log a warning if an OpenAI error occurs, then re-raise to trigger the `tenacity` retry mechanism.
            logging.warning(f"OpenAI embedding error (retrying): {e}")
            raise # The `reraise=True` in the decorator will catch this and retry.
    
    # ─── MAIN PROCESSING FUNCTION ─────────────────────────────────────────────────
    def build_redirect_map(
        input_csv: str,
        output_csv: str,
        fetch_count: int,
        test_mode: bool
    ):
        """
        Builds a redirect map by processing URLs from an input CSV, generating
        embeddings, querying Pinecone for similar articles, and identifying
        suitable redirect candidates.
    
        Args:
            input_csv (str): Path to the input CSV file.
            output_csv (str): Path to the output CSV file for the redirect map.
            fetch_count (int): Number of candidates to fetch from Pinecone.
            test_mode (bool): If True, process only a limited number of rows.
        """
        # Read the input CSV file into a Pandas DataFrame.
        df = pd.read_csv(input_csv)
        required = {"URL", "Title", "primary_category"}
        # Validate that all required columns are present in the DataFrame.
        if not required.issubset(df.columns):
            raise ValueError(f"Input CSV must have columns: {required}")
    
        # Create a set of canonicalized input URLs for efficient lookup.
        # This is used to prevent an input URL from redirecting to itself or another input URL,
        # which could create redirect loops or redirect to a page that is also being redirected.
        input_urls = set(df["URL"].map(canonical_url))
    
        start_idx = 0
        # Implement resume functionality: if the output CSV already exists,
        # try to find the last processed URL and resume from the next row.
        if os.path.exists(output_csv):
            try:
                prev = pd.read_csv(output_csv)
            except EmptyDataError:
                # Handle case where the output CSV exists but is empty.
                prev = pd.DataFrame()
            if not prev.empty:
                # Get the last URL that was processed and written to the output file.
                last = prev["URL"].iloc[-1]
                # Find the index of this last URL in the original input DataFrame.
                idxs = df.index[df["URL"].map(canonical_url) == last].tolist()
                if idxs:
                    # Set the starting index for processing to the row after the last processed URL.
                    start_idx = idxs[0] + 1
                    logging.info(f"Resuming from row {start_idx} after {last}.")
    
        # Determine the range of rows to process based on test_mode.
        if test_mode:
            end_idx = min(start_idx + MAX_TEST_ROWS, len(df))
            df_proc = df.iloc[start_idx:end_idx] # Select a slice of the DataFrame for testing.
            logging.info(f"Test mode: processing rows {start_idx} to {end_idx-1}.")
        else:
            df_proc = df.iloc[start_idx:] # Process all remaining rows.
            logging.info(f"Processing rows {start_idx} to {len(df)-1}.")
    
        total = len(df_proc) # Total number of URLs to process in this run.
        processed = 0        # Counter for successfully processed URLs.
        batch: List[Dict[str, Any]] = [] # List to store results before flushing to CSV.
    
        # Iterate over each row (URL) in the DataFrame slice to be processed.
        for _, row in df_proc.iterrows():
            raw_url = row["URL"] # Original URL from the input CSV.
            url = canonical_url(raw_url) # Canonicalized version of the URL.
            # Get title and category, handling potential missing values by defaulting to empty strings.
            title = row["Title"] if isinstance(row["Title"], str) else ""
            category = row["primary_category"] if isinstance(row["primary_category"], str) else ""
    
            # Determine the text to use for generating the embedding.
            # Prioritize the 'Title' if available, otherwise use a slug derived from the URL.
            if title.strip():
                text = title
            else:
                raw_slug = slug_from_url(raw_url)
                if not raw_slug or len(raw_slug) < MIN_SLUG_LENGTH:
                    # If no meaningful slug can be extracted, skip this URL.
                    logging.info(f"Skipping {raw_url}: insufficient slug context.")
                    continue
                text = raw_slug.replace('-', ' ').replace('_', ' ') # Prepare slug for embedding by replacing hyphens with spaces.
    
            # Attempt to generate the embedding for the chosen text.
            # This call is wrapped in a try-except block to catch final failures after retries.
            try:
                embedding = generate_embedding(text)
            except Exception as e: # Catch any exception from generate_embedding after all retries.
                # If embedding generation fails even after retries, log the error and skip this URL.
                logging.error(f"Failed to generate embedding for {raw_url} after {MAX_RETRIES} retries: {e}")
                continue # Move to the next URL.
    
            if not embedding:
                # If `generate_embedding` returned None (e.g., empty text or unexpected error), skip.
                logging.info(f"Skipping {raw_url}: no embedding.")
                continue
    
            # Build metadata filter for Pinecone query.
            # This helps narrow down search results to more relevant candidates (e.g., by category or publish year).
            filt: Dict[str, Any] = {}
            if category:
                # Split category string by comma and strip whitespace for multiple categories.
                cats = [c.strip() for c in category.split(",") if c.strip()]
                if cats:
                    filt["primary_category"] = {"$in": cats} # Filter by categories present in Pinecone metadata.
            if PUBLISH_YEAR_FILTER:
                filt["publish_year"] = {"$in": PUBLISH_YEAR_FILTER} # Filter by specified publish years.
            filt["id"] = {"$ne": url} # Exclude the current URL itself from the search results to prevent self-redirects.
    
            # Define a nested function for Pinecone query with retry mechanism.
            # This ensures that Pinecone queries are also robust against transient errors.
            @retry(
                wait=wait_exponential(multiplier=INITIAL_RETRY_DELAY, min=1, max=10),
                stop=stop_after_attempt(MAX_RETRIES),
                retry=retry_if_exception_type(PineconeException), # Only retry if a PineconeException occurs.
                reraise=True # Re-raise the exception if all retries fail.
            )
            def query_pinecone_with_retry(embedding_vector, top_k_count, pinecone_filter):
                """
                Performs a Pinecone index query with retry logic.
                """
                return index.query(
                    vector=embedding_vector,
                    top_k=top_k_count,
                    include_values=False, # We don't need the actual vector values in the response.
                    include_metadata=False, # We don't need the metadata in the response for this logic.
                    filter=pinecone_filter # Apply the constructed metadata filter.
                )
    
            # Attempt to query Pinecone for redirect candidates.
            try:
                res = query_pinecone_with_retry(embedding, fetch_count, filt)
            except PineconeException as e:
                # If Pinecone query fails after retries, log the error and skip this URL.
                logging.error(f"Failed to query Pinecone for {raw_url} after {MAX_RETRIES} retries: {e}")
                continue
    
            candidate = None # Initialize redirect candidate to None.
            score = None     # Initialize relevance score to None.
    
            # Iterate through the Pinecone query results (matches) to find a suitable candidate.
            for m in res.get("matches", []):
                cid = m.get("id") # Get the ID (URL) of the matched document in Pinecone.
                # A candidate is suitable if:
                # 1. It exists (cid is not None).
                # 2. It's not the original URL itself (to prevent self-redirects).
                # 3. It's not another URL from the input_urls set (to prevent redirecting to a page that's also being redirected).
                if cid and cid != url and cid not in input_urls:
                    candidate = cid # Assign the first valid candidate found.
                    score = m.get("score") # Get the relevance score of this candidate.
                    break # Stop after finding the first suitable candidate (Pinecone returns by relevance).
    
            # Append the results for the current URL to the batch.
            batch.append({"URL": url, "Redirect Candidate": candidate, "Relevance Score": score})
            processed += 1 # Increment the counter for processed URLs.
            msg = f"Mapped {url} β†’ {candidate}"
            if score is not None:
                msg += f" ({score:.4f})" # Add score to log message if available.
            logging.info(msg) # Log the mapping result.
    
            # Periodically flush the batch results to the output CSV.
            if processed % LOG_BATCH_SIZE == 0:
                out_df = pd.DataFrame(batch) # Convert the current batch to a DataFrame.
                # Determine file mode: 'a' (append) if file exists, 'w' (write) if new.
                mode = 'a' if os.path.exists(output_csv) else 'w'
                # Determine if header should be written (only for new files).
                header = not os.path.exists(output_csv)
                # Write the batch to the CSV.
                out_df.to_csv(output_csv, mode=mode, header=header, index=False)
                batch.clear() # Clear the batch after writing to free memory.
                if not test_mode:
                    clear_output(wait=True) # Clear output in Jupyter for cleaner progress display.
                    print(f"Progress: {processed} / {total}") # Print progress update.
    
            time.sleep(QUERY_DELAY) # Pause for a short delay to avoid overwhelming APIs.
    
        # After the loop, write any remaining items in the batch to the output CSV.
        if batch:
            out_df = pd.DataFrame(batch)
            mode = 'a' if os.path.exists(output_csv) else 'w'
            header = not os.path.exists(output_csv)
            out_df.to_csv(output_csv, mode=mode, header=header, index=False)
    
        logging.info(f"Completed. Total processed: {processed}") # Log completion message.
    
    if __name__ == "__main__":
        # This block ensures that build_redirect_map is called only when the script is executed directly.
        # It passes the user-defined configuration parameters to the main function.
        build_redirect_map(INPUT_CSV, OUTPUT_CSV, CANDIDATE_FETCH_COUNT, TEST_MODE)
    

    While the quality of the output may be considered satisfactory, it falls short of the quality observed with Google Vertex AI.

    In the table below, you can see the difference in output quality.

    URL | Google Vertex AI | OpenAI
    /what-is-eat/ | /google-eat/what-is-it/ | /5-things-you-can-do-right-now-to-improve-your-eat-for-google/408423/
    /local-seo-for-lawyers/ | /law-firm-seo/what-is-law-firm-seo/ | /legal-seo-conference-exclusively-for-lawyers-spa/528149/

    When it comes to SEO, even though Google Vertex AI is three times more expensive than OpenAI’s model, I prefer to use Vertex.

    The quality of the results is significantly higher. While you may incur a greater cost per unit of text processed, you benefit from the superior output quality, which directly saves valuable time on reviewing and validating the results.

    From my experience, it costs about $0.04 to process 20,000 URLs using Google Vertex AI.

    While it’s said to be more expensive, it’s still ridiculously cheap, and you shouldn’t worry if you’re dealing with tasks involving a few thousand URLs.

    At that rate ($0.04 / 20,000 URLs, or roughly $0.000002 per URL), processing 1 million URLs would cost approximately $2.

    If you still want a free method, use BERT and Llama models from Hugging Face to generate vector embeddings without paying a per-API-call fee.

    The real cost comes from the compute power needed to run the models. Keep in mind that if you query using vectors generated from BERT or Llama, you must also generate the vector embeddings for all of your articles in Pinecone (or any other vector database) with the same model.
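
    As a minimal sketch of the free route, the example below uses a small BERT-based model from the sentence-transformers library on Hugging Face (the model choice is illustrative, not a recommendation from this tutorial):

    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer
    
    # "all-MiniLM-L6-v2" is a small BERT-based model that produces 384-dimensional vectors.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    
    texts = ["what is eat", "local seo for lawyers"]
    vectors = model.encode(texts, normalize_embeddings=True)
    
    # Important: your Pinecone index dimension must match this model's output (384 here,
    # not the 768 used with Vertex AI), and every stored article vector must be
    # regenerated with the same model before you can query against it.
    print(vectors.shape)  # (2, 384)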

    In Summary: AI Is Your Powerful Ally

    AI enables you to scale your SEO or marketing efforts and automate the most tedious tasks.

    This doesn’t replace your expertise. It’s designed to level up your skills and equip you to face challenges with greater capability, making the process more engaging and fun.

    Mastering these tools is essential for success. I’m passionate about writing about this topic to help beginners learn and feel inspired.

    As we move forward in this series, we will explore how to use Google Vertex AI for building an internal linking WordPress plugin.

    More Resources:


    Featured Image: BestForBest/Shutterstock