What AI gets wrong about your site, and why it’s not your fault: meet llms.txt 

AI tools are everywhere — from chatbots that answer customer questions to language models that summarize everything from documentation to legal text. But if you’ve ever asked a model like ChatGPT to explain your site, your product, or your API, the results might not feel quite right. In fact, sometimes they’re way off. And no, that’s not your fault. 

The disconnect between websites and LLMs 

Large language models (LLMs) like ChatGPT, Claude, or Gemini are trained to understand a wide range of content. But when they try to interpret your website at runtime (that is, while someone is actively asking them a question), they run into a few core problems:

  • HTML is noisy. Navigation bars, cookie banners, modal popups, and analytics scripts clutter the page. 
  • Context windows are limited. Most websites are too large for an LLM to process all at once. 
  • Important details are spread across multiple pages or hidden in tables, code blocks, or comments. 
  • Markdown docs may exist, but the model often can’t locate them, or doesn’t even know they exist.

So, when you ask an AI tool to “explain what this company does” or “summarize this library API”, it often gets stuck. It either skips key context or grabs the wrong signals from cluttered markup. 

It’s not bad intent; it’s a design limitation. 

Why it’s not your SEO’s fault, either 

You’ve probably invested time and effort into search engine optimization. Maybe your robots.txt and sitemap.xml are in place. You’ve got meta tags, structured data, and clean internal links. Good, but LLMs don’t always work like Google. 

Traditional SEO helps your site get found. However, it doesn’t guarantee that AI tools will understand your site the way a human user would. That’s where a new proposal comes in.

Meet llms.txt: A simple way to help AI understand your site 

A growing number of developers and AI researchers are adopting a lightweight, human-readable standard called llms.txt.  

What is llms.txt? 

llms.txt is a plain Markdown file placed at the root of your site that provides language models with a summary of your project and direct links to clean, LLM-readable versions of important pages. It’s designed for inference-time use, helping AI tools quickly understand a site’s structure, purpose, and content without relying on cluttered HTML or metadata intended for search engines. 

What it does: 

  • Gives a short summary of your site or project 
  • Links to clean, LLM-ready Markdown versions of key pages 
  • Helps AI tools find exactly what matters, without parsing messy HTML

Is it widely supported? Not yet 

Right now, no major LLM provider officially supports llms.txt. Tools like GPTBot (OpenAI), Claude (Anthropic), and Google’s AI crawlers don’t reference or follow it as part of their crawling behavior. Some companies like Anthropic publish llms.txt files themselves, but there’s no evidence that any crawler is actively using them in retrieval or training. 

Still, it’s a low-effort, no-risk addition that helps prepare your site for a future where structured LLM access becomes more standardized. And LLM-facing tools, or even your own AI agents, can make use of it today. 

Example use cases: 

  • A dev library links to .md-formatted API docs and usage examples. 
  • A university site highlights course descriptions and academic policies. 
  • A personal blog offers a simplified timeline of key projects or topics. 

You control the content and the structure. LLMs benefit from curated, LLM-aware context. And users asking questions about your site get better answers. 

Using our Yoast SEO plugin? 

If you’re already using our Yoast SEO plugin (free or Premium), generating an llms.txt file is easy. Just enable the feature in your settings, and the plugin will automatically create and serve a complete llms.txt file for your site. You can view it anytime at yourdomain.com/llms.txt.


An LLM-friendly web isn’t the same as a Google-friendly web 

This doesn’t replace SEO. Think of llms.txt as a companion to robots.txt. It tells AI bots: “Here’s the good stuff. Skip the noise.” 

Sitemaps help crawlers find everything. llms.txt tells LLMs what to focus on. 

It’s especially useful for: 

  • Developers and open-source maintainers 
  • Product marketers looking to reduce support load 
  • Teams that want chatbots to pull answers from docs, not guess 

You don’t need a new CMS or tech stack 

All this requires is creating two things: 

  1. A basic llms.txt file in Markdown
  2. Optionally, Markdown versions of key pages, served alongside the originals at the same URL with .md appended (so page.html becomes page.html.md)

No new tools, plugins, or frameworks needed, although some ecosystems are already adding support. 
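As a sketch of step 1, a small script can assemble an llms.txt file following the proposed format (an H1 title, a blockquote summary, and H2 sections of links). The site name, summary, and page list below are illustrative, not part of any spec:

```python
def build_llms_txt(title, summary, sections):
    """Assemble an llms.txt body.

    sections maps a heading to a list of (link_text, url) pairs, e.g.
    {"Docs": [("Getting started", "https://example.com/start.md")]}.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        for text, url in pages:
            lines.append(f"- [{text}]({url})")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"

if __name__ == "__main__":
    # Hypothetical example site; write the result to the web root as llms.txt.
    print(build_llms_txt(
        "example.com: A demo project",
        "Short, plain-language summary of what the site is about.",
        {"Docs": [("Getting started", "https://example.com/start.md")]},
    ))
```

The output is ordinary Markdown, so you can review it by eye before serving it at /llms.txt.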

Here’s an example of a file automatically built by Yoast SEO, which has an llms.txt generator built in:

Generated by Yoast SEO v25.3, this is an llms.txt file, meant for consumption by LLMs. This is the [sitemap](https://everydayimtravelling.com/sitemap_index.xml) of this website. 
 
# everydayimtravelling.com: Stories from our travels 
 
## Posts 
- [Test video](https://everydayimtravelling.com/test-video/) 
- [A Journey Through Portugal’s Wine Country: A Suggested Wine Tour Route](https://everydayimtravelling.com/a-wine-tour-through-portugal/) 
- [Travel essentials for backpackers FAQ](https://everydayimtravelling.com/travel-essentials-for-backpackers-faq/) 
 
## Pages 
- [Checkout](https://everydayimtravelling.com/checkout/) 
- [Contact us](https://everydayimtravelling.com/contact-us/) 
- [How we started this blog](https://everydayimtravelling.com/pagina-harry-potter/) 
- [My account](https://everydayimtravelling.com/my-account/) 
- [Cart](https://everydayimtravelling.com/cart/) 
 
## Categories 
- [Europe](https://everydayimtravelling.com/category/europe/) 
- [Asia](https://everydayimtravelling.com/category/asia/) 
- [South America](https://everydayimtravelling.com/category/south-america/) 
- [Food](https://everydayimtravelling.com/category/food/) 
- [Western Europe](https://everydayimtravelling.com/category/europe/west-europe/) 
 
## Tags 
- [Budget](https://everydayimtravelling.com/tag/budget/) 
Yoast SEO has an llms.txt generator onboard; you can find it in the API settings
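On the consuming side, the structure above is plain Markdown, so a tool or agent can recover it with a few lines of code. A rough sketch (not an official parser; it only handles the H2-plus-link-list shape shown above):

```python
import re

# Matches a Markdown list link: "- [text](url)"
LINK = re.compile(r"-\s*\[(?P<text>[^\]]+)\]\((?P<url>[^)]+)\)")

def parse_llms_txt(text):
    """Return {section_heading: [(link_text, url), ...]} from an llms.txt body."""
    sections, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("## "):
            current = line[3:]
            sections[current] = []
        elif current is not None:
            m = LINK.match(line)
            if m:
                sections[current].append((m["text"], m["url"]))
    return sections
```

Run against the example file above, this would yield entries keyed by "Posts", "Pages", "Categories", and "Tags".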

Helping AI help you 

So, if AI is misinterpreting your website, producing erroneous summaries, or skipping critical content, there’s a reason, and it’s fixable. 

It’s not always your copy. Not your design or your metadata. It’s just that these language tools need a little guidance. In the future, llms.txt could be the way to give it to them, on your terms. 

Do you need help creating an llms.txt file or converting your existing content to Markdown for LLMs? Yoast SEO can automatically generate an llms.txt file for you. 

New: Future proof your website for tomorrow’s visitors with Yoast SEO llms.txt

Increased usage of AI is changing how people discover businesses and services online. While your website may be optimized for traditional search engines, large language models (LLMs) process your website’s information differently. Our new feature, llms.txt, aims to bridge that gap. Yoast SEO generates a file that highlights the most important, up-to-date content on your website, as an invitation for LLMs to get the right picture. It’s automatic, requires no technical setup, and is ready in one click.

Helping AI understand your website

Unlike search engines that regularly crawl and index websites, LLMs like ChatGPT and Google Gemini work differently. They don’t store website content for future use. Instead, they gather information in real time when responding to user queries.

This means LLMs often only access a small portion of a website while looking for answers. This is especially true for large websites such as news platforms or ecommerce stores. This can lead to incomplete or even inaccurate AI-generated responses. Not ideal if you’re aiming to improve your visibility in LLM-generated answers as part of your marketing strategy.

If you want to better understand what LLMs tend to look for when accessing websites, this guide on optimizing content for LLMs offers a helpful overview.

What is an llms.txt file?

The llms.txt file gives LLMs a suggested, pre-prepared slice of your website, highlighting your most important and up-to-date content.

Think of it like a helpful guide at the entrance of a large department store. Imagine you’re walking in looking for socks. Someone greets you and hands you a store map that highlights where the socks are, along with other key departments like shoes, checkout, and customer service. You don’t have to use the map; you can wander around on your own, but it makes it much easier to quickly find what you’re looking for.

In the same way, this file helps LLMs quickly identify the most relevant and useful parts of your website. While the models can still explore other areas, giving them clear guidance increases the chances that they’ll surface the right information in their responses.

How is it different from robots.txt?

robots.txt
  • Tells bots what not to access
  • Focuses on permission
  • Used for search engine indexing and crawling
  • Supported by traditional search engines

llms.txt

  • Suggests what AI should read
  • Focuses on guidance and clarity
  • Helps AI answer user questions more accurately
  • Designed for large language models like ChatGPT

How does Yoast SEO llms.txt work?

When you turn the feature on, it automatically generates an llms.txt file for your website, using a mix of relevant website data. It draws from:

  • Your most recently updated content
  • Technical SEO elements like your sitemap for context
  • Descriptions you’ve added about your website

This offers large language models a website summary to understand what your website is about and what content is most important.

Managing your llms.txt file

The plugin automatically creates and maintains the llms.txt file for you, refreshing every week. You can preview the file to ensure it accurately reflects your brand and prioritizes the right content before implementation.

Want full control or prefer to manage it yourself? Learn how to manually add an llms.txt file to your website by visiting our developer documentation.

At Yoast, our mission is SEO for everyone

Setting up an llms.txt file manually may only be accessible to a technical few. By automating the process, we make it easier for all website owners to benefit from this new technology, without needing to dive into code.

At Yoast, we believe that everyone should have a say in how their content is seen and used. Especially as AI plays a bigger role in how people discover information online. That’s why we’ve introduced this feature as opt-in, so you can decide if and when it makes sense for your website. We’ve seen early signs that this is something more website owners are starting to think about.

Just as robots.txt helps search engines understand what to crawl and index, llms.txt suggests which parts of your website large language models should pay attention to. If you’d like to see what an llms.txt file looks like in practice, you can view the live version on yoast.com.

The Download: an inspiring toy robot arm, and why AM radio matters

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

How a 1980s toy robot arm inspired modern robotics

—Jon Keegan

As a child of an electronic engineer, I spent a lot of time in our local Radio Shack as a kid. While my dad was locating capacitors and resistors, I was in the toy section. It was there, in 1984, that I discovered the best toy of my childhood: the Armatron robotic arm.

Described as a “robot-like arm to aid young masterminds in scientific and laboratory experiments,” it was a legit robotic arm. And the bold look and function of Armatron made quite an impression on many young kids who would one day have a career in robotics. Read the full story.

If you’re interested in the future of robots, why not check out:

+ Will we ever trust robots? If most robots still need remote human operators to be safe and effective, why should we welcome them into our homes? Read the full story.

+ When you might start speaking to robots. Google is only the latest to fuse large language models with robots. Here’s why the trend has big implications.

+ How AI models let robots carry out tasks in unfamiliar environments. Read the full story.

+ China’s EV giants are betting big on humanoid robots. Technical know-how and existing supply chains give Chinese electric-vehicle makers a significant head start in the sector. Read the full story.

Why we still need AM radio

The most reliable way to keep us informed in times of disaster is being threatened. Check out Ariel Aberg-Riger’s beautiful visual story illustrating AM radio’s importance in uncertain times. 

Both of these stories are from the most recent edition of our print magazine, which is all about how technology is changing creativity. Subscribe now to get future copies before they land.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Protestors set Waymo robotaxis alight in Los Angeles
The groups clashed with police over the Trump administration’s immigration raids. (LA Times $)
+ Much of the technology that fuels deportation orders is error-ridden. (Slate $)
+ Immigrants are using a swathe of new apps to stay ahead of deportation. (Rest of World)

2 What’s next for Elon Musk and Donald Trump
A full breakdown in relations could be much worse for Musk in the long run. (NY Mag $)
+ Trump’s backers are rapidly turning on Musk, too. (New Yorker $)
+ The biggest winner from their falling-out? Jeff Bezos. (The Information $) 

3 DOGE used an inaccurate AI tool to terminate Veterans Affairs contracts
Its code frequently produced glaring mistakes. (ProPublica)
+ Undeterred, the department is on a hiring spree. (Wired $)
+ Can AI help DOGE slash government budgets? It’s complex. (MIT Technology Review)

4 Europe’s shrinking forests could cause it to miss net-zero targets
Its trees aren’t soaking up as much carbon as they used to. (New Scientist $)
+ Inside the controversial tree farms powering Apple’s carbon neutral goal. (MIT Technology Review)

5 OpenAI wants to embed ChatGPT into college campuses 
The ultimate goal? A personalized AI account for every student. (NYT $)
+ Meanwhile, other universities are experimenting with tech-free classes. (The Atlantic $)
+ ChatGPT is going to change education, not destroy it. (MIT Technology Review)

6 Chinese regulators are pumping the brakes on self-driving cars
They’re developing a new framework to assess the safety of autonomous features. (FT $)
+ The country’s robotaxis are rapidly catching up with the west. (Rest of World)
+ How China is regulating robotaxis. (MIT Technology Review)

7 Desalination is finally becoming a reality
Removing salt from seawater is one way to combat water scarcity. (WSJ $)
+ If you can make it through tons of plastic, that is. (The Atlantic $)

8 We’re getting better at fighting cancer
Deaths from the disease in the US have dropped by a third since 1991. (Vox)
+ Why it’s so hard to use AI to diagnose cancer. (MIT Technology Review)

9 Teenage TikTokers’ skin regimes offer virtually no benefit
And could even be potentially harmful. (The Guardian)
+ The fight for “Instagram face” (MIT Technology Review)

10 Tech’s layoff groups are providing much-needed support
Workers who have been let go by their employers are forming little communities. (Insider $)

Quote of the day

“Every tech company is doing similar things but we were open about it.”

—Luis von Ahn, chief executive of the language-learning app Duolingo, tells the Financial Times that his company is far from the only one adopting an AI-first strategy. 

One more thing

How to break free of Spotify’s algorithm

Since the heyday of radio, the branding of sound has evolved from broad genres like rock and hip-hop to “paranormal dark cabaret afternoon” and “synth space,” and streaming has become the default.

Meanwhile, the ritual of discovering something new is now neatly packaged in a 30-song playlist. The only rule in music streaming is personalization.

What we’ve gained in convenience, we’ve lost in curiosity. But it doesn’t have to be this way. Read the full story.

—Tiffany Ng

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ Happy birthday to Michael J Fox, who turns 64 today!
+ Whenever you need to play the world’s smallest violin, these scientists can help you out 🎻
+ An early JMW Turner oil painting has been rediscovered.
+ Watching robots attempt to kickbox is pretty amusing.

SEO for Google’s AI Fan-Out Results

Google introduced “fan-out” search results in a March 2025 blog post announcing AI Mode, its expanded version of AI Overviews. The term is new, but the concept is not.

Google’s algorithm has long moved beyond merely matching keywords. It now interprets what searchers are looking for. This intent-based approach is also known as thematic or semantic search.

Similarly, AI Mode “fans out” beyond searchers’ initial queries to address likely follow-ups. A single AI Mode response could include what once required multiple searches. Google’s March post included an example of a searcher seeking the best smartwatch for sleep tracking. An AI Mode answer could fan out to address related topics, such as explaining sleeping heart rates.

Yet keyword research remains essential. The words and phrases of prospects reveal their needs and shopping journeys.

And optimizing those keywords is crucial for citations and sources in AI Overviews. One tactic is to use Gemini, Google’s AI chatbot that powers Overviews and AI Mode.

Here’s how.

Use Gemini

First, generate keywords.

  • For existing pages, access Search Console’s “Performance” > “Queries” tab. Enter the URL for the “Page” filter to limit the query report for that content. Then download the final report as a CSV file.
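Before uploading, it can help to deduplicate the export and rank queries by clicks so the most important ones survive any file-size limit. A quick sketch (Search Console’s CSV column headers vary by locale and report, so "Top queries" and "Clicks" here are assumptions; adjust them to match your download):

```python
import csv
import io

def top_queries(csv_text, limit=500):
    """Rank queries from a Search Console CSV export by clicks, deduplicated."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Sort descending by clicks; empty cells count as zero.
    rows.sort(key=lambda r: int(r.get("Clicks", 0) or 0), reverse=True)
    seen, out = set(), []
    for r in rows:
        q = (r.get("Top queries") or "").strip().lower()
        if q and q not in seen:
            seen.add(q)
            out.append(q)
        if len(out) >= limit:
            break
    return out
```

The resulting list can be saved back out as a one-column CSV and attached to the prompt below.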

In Gemini, upload the keyword file and enter the following prompt:

Here’s my target keyword list. What are the fan-out subqueries or facets related to these topics?

I uploaded a CSV file of approximately 500 keywords related to “website maintenance costs.” Gemini’s response, below, was very helpful. (I’ve bolded words for emphasis.) Answering many of its questions would enhance a website maintenance article.

Primary Topic: Website Maintenance and Running Costs
Core Query Theme: “How much does it cost to maintain/run a website?”
Fan-out Subqueries/Facets:

1. Cost Breakdown & Components

    • “What are the typical recurring costs for a website?”
    • “How much does website hosting cost per month/year?”
    • “What is the average cost of a domain name?”
    • “How much should I budget for website security (SSL, firewalls)?”
    • “What are the costs associated with website content updates or management?”
    • “What is the cost of website development or new feature implementation?”
    • “Are there hidden costs of website ownership?”
    • “Cost of plugins, themes, or extensions for platforms like WordPress.”

2. Factors Influencing Cost

    • “What factors increase or decrease website maintenance costs?”
    • “Does website complexity affect maintenance cost?”
    • “How does the type of website (e-commerce, blog, portfolio) impact running costs?”
    • “Is a custom-built website more expensive to maintain than a template-based one?”
    • “Does the frequency of updates influence maintenance expenses?”
    • “What’s the difference in cost between DIY website maintenance and hiring a professional?”

3. Importance & Benefits of Maintenance

    • “Why is website maintenance important for security?”
    • “How does regular maintenance improve website performance and speed?”
    • “What role does website maintenance play in SEO (Search Engine Optimization)?”
    • “How does maintenance impact user experience and conversions?”
    • “What are the risks of not maintaining a website?”

4. Types of Maintenance

    •  “What does website technical maintenance involve?”
    • “What is included in website security maintenance?”
    • “What is content maintenance for a website?”
    • “What is ‘preventative maintenance’ for a website?”

5. Solutions & Services

    • “What are common website maintenance packages or plans?”
    • “What tools are available for website maintenance (monitoring, backups)?”
    • “Should I use a website maintenance service or do it myself?”
    • “Comparison of website maintenance services and their pricing.”

6. Comparative Analysis & Alternatives

    • “How do website maintenance costs compare across different platforms (WordPress, Shopify, Wix, Squarespace)?”
    • “Is it cheaper to rebuild a website than to maintain an old one?”
    • “What are the long-term costs of a website?”
    • “Cost of maintaining a small business website vs. a large enterprise site.”

7. Related Services & Value

    • “Does website maintenance include SEO or marketing services?”
    • “What is the ROI (Return on Investment) of regular website maintenance?”

Gemini’s Reasoning

For the best prompts, run Gemini and open the “Show thinking” option (available in 2.5 Pro). Then follow Gemini’s logic to understand how it treats your core topic.

For example, I entered this prompt and then clicked “Show thinking”:

Where to buy designer shoes online. I’d like to be able to try them on.

Gemini’s “Show thinking” option explains how it treats a core topic. Click image to enlarge.

Gemini’s reasoning was eye-opening in terms of how it can help shoppers. (Words are bolded by me.)

I’m starting my search for online designer shoe retailers with “try before you buy” programs. I’m also looking for luxury stores with free returns and exchanges, focusing on options with the best return policies for my needs.

Compiling program details

My focus is now on compiling a detailed list of “try before you buy” options for designer shoes. I’m actively comparing return policies, specifically looking at restocking fees and shipping costs across various retailers and personal styling services. The goal is to identify the best value and flexibility for trying on shoes at home. I am ready to start compiling a list of top options for the user.

Expanding search scope

I’ve moved beyond the initial “try before you buy” focus. Subscription services, like JustFab and ShoeDazzle, emerged, though their designer offerings might be limited. My focus also shifted to retailers with liberal return policies, essentially facilitating home try-ons. DSW, Nordstrom, and Zappos surfaced as promising. I’m now also exploring personal styling services that include home try-on components.

Refining search criteria

I’ve moved on to the second part of the plan: detailing the specific “try-on” policies. I’m focusing on costs, trial periods, and brand selections for each retailer and service. I’m structuring the information into “Try Before You Buy,” “Retailers with Free Returns,” and “Personal Styling Services” to ensure clarity. I’m now investigating the designer shoe selection for services like Stitch Fix and Wantable.

Gemini suggested options beyond the initial prompt:

  • “Try Before You Buy”
  • “Retailers with Free Returns”
  • “Personal Styling Services”

It also generated a comparison chart of options meeting one or more of those options:

A comparison table of shoe retailers/services—Amazon Prime Try Before You Buy, Nordstrom, and Zappos—detailing their try-on policies, designer brand selection, return periods, shipping/return costs, and key features. Amazon allows customers to try up to 6 items at home for 7 days with no upfront charge. Nordstrom offers extensive high-end selections with no time limit on returns. Zappos allows 365-day returns. All offer free shipping and returns. Highlighted features include Amazon’s “Try Before You Buy,” Nordstrom’s customer service, and Zappos’s long return window. The right panel lists sources, including WeSupply Labs, Nordstrom, and Zappos.

Gemini suggested options beyond the initial prompt and generated a comparison chart. Click image to enlarge.

Note Gemini’s sources and citations from ecommerce brands. Thus, to appear in AI Overviews, develop site content that addresses prospects’ core needs and values, such as shipping, returns, unique products, free virtual help with installation, and more.

Additional Tools

Ultimately, adjust your content based on your knowledge of the niche and target audience. Third-party keyword tools can help brainstorm (i) related queries to expand your keyword list and (ii) related questions of the problems behind the queries.

Google Responds To Site That Lost Ranks After Googlebot DDoS Crawl via @sejournal, @martinibuster

Google’s John Mueller answered a question about a site that received millions of Googlebot requests for pages that don’t exist, with one non-existent URL receiving over two million hits, essentially DDoS-level page requests. The publisher’s concerns about crawl budget and rankings seemingly were realized, as the site subsequently experienced a drop in search visibility.

NoIndex Pages Removed And Converted To 410

The 410 Gone server response code belongs to the 4xx family of response codes, which indicate a page is not available. The 404 response also means a page is not available, but it makes no claim as to whether the URL will return in the future; it simply says the page is not available right now.

The 410 Gone status code means that the page is gone and likely will never return. Unlike the 404 status code, the 410 signals the browser or crawler that the missing status of the resource is intentional and that any links to the resource should be removed.
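In handler terms, the distinction is a one-line decision. A minimal sketch (the path set and function names are hypothetical, for illustration only):

```python
# Status codes for missing resources, per RFC 9110.
OK, NOT_FOUND, GONE = 200, 404, 410

# Hypothetical set of URLs that were removed on purpose and will not return.
INTENTIONALLY_REMOVED = {"/software/virtual-dj/"}

def status_for(path, known_paths):
    """Pick a status: 200 if it exists, 410 for deliberate removals, else 404."""
    if path in known_paths:
        return OK
    if path in INTENTIONALLY_REMOVED:
        return GONE  # tells crawlers: gone for good, drop your links
    return NOT_FOUND  # unknown URL; no claim about the future
```

Serving 410 for known removals gives crawlers a stronger signal than a blanket 404.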

The person asking the question was following up on a post from three weeks earlier on Reddit, where they noted that they had about 11 million URLs that should never have been discoverable, which they removed entirely and began serving with a 410 response code. After a month and a half, Googlebot continued to return looking for the missing pages. They shared their concern about crawl budget and subsequent impacts to their rankings as a result.

Mueller at the time forwarded them to a Google support page.

Rankings Loss As Google Continues To Hit Site At DDOS Levels

Three weeks later, things had not improved, and they posted a follow-up question noting they’ve received over five million requests for pages that don’t exist. They posted an actual URL in their question, but I anonymized it; otherwise it’s verbatim.

The person asked:

“Googlebot continues to aggressively crawl a single URL (with query strings), even though it’s been returning a 410 (Gone) status for about two months now.

In just the past 30 days, we’ve seen approximately 5.4 million requests from Googlebot. Of those, around 2.4 million were directed at this one URL:
https://example.net/software/virtual-dj/ with the ?feature query string.

We’ve also seen a significant drop in our visibility on Google during this period, and I can’t help but wonder if there’s a connection — something just feels off. The affected page is:
https://example.net/software/virtual-dj/?feature=…

The reason Google discovered all these URLs in the first place is that we unintentionally exposed them in a JSON payload generated by Next.js — they weren’t actual links on the site.

We have changed how our “multiple features” works (using ?mf querystring and that querystring is in robots.txt)

Would it be problematic to add something like this to our robots.txt?

Disallow: /software/virtual-dj/?feature=*

Main goal: to stop this excessive crawling from flooding our logs and potentially triggering unintended side effects.”

Google’s John Mueller confirmed that it’s Google’s normal behavior to keep returning to check if a page that is missing has returned. This is Google’s default behavior based on the experience that publishers can make mistakes and so they will periodically return to verify whether the page has been restored. This is meant to be a helpful feature for publishers who might unintentionally remove a web page.

Mueller responded:

“Google attempts to recrawl pages that once existed for a really long time, and if you have a lot of them, you’ll probably see more of them. This isn’t a problem – it’s fine to have pages be gone, even if it’s tons of them. That said, disallowing crawling with robots.txt is also fine, if the requests annoy you.”

Caution: Technical SEO Ahead

This next part is where the SEO gets technical. Mueller cautions that the proposed solution of adding a robots.txt disallow rule could inadvertently break rendering for pages that aren’t supposed to be missing.

He’s basically advising the person asking the question to:

  • Double-check that the ?feature= URLs are not being used at all in any frontend code or JSON payloads that power important pages.
  • Use Chrome DevTools to simulate what happens if those URLs are blocked — to catch breakage early.
  • Monitor Search Console for Soft 404s to spot any unintended impact on pages that should be indexed.

John Mueller continued:

“The main thing I’d watch out for is that these are really all returning 404/410, and not that some of them are used by something like JavaScript on pages that you want to have indexed (since you mentioned JSON payload).

It’s really hard to recognize when you’re disallowing crawling of an embedded resource (be it directly embedded in the page, or loaded on demand) – sometimes the page that references it stops rendering and can’t be indexed at all.

If you have JavaScript client-side-rendered pages, I’d try to find out where the URLs used to be referenced (if you can) and block the URLs in Chrome dev tools to see what happens when you load the page.

If you can’t figure out where they were, I’d disallow a part of them, and monitor the Soft-404 errors in Search Console to see if anything visibly happens there.

If you’re not using JavaScript client-side-rendering, you can probably ignore this paragraph :-).”
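To verify what Googlebot is actually hitting, the server access log is the ground truth. A rough sketch, assuming the common "combined" log format (field positions are an assumption; adapt the pattern to your server’s configuration):

```python
import re
from collections import Counter

# Combined-log-format fields we care about: request path, status, user agent.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def googlebot_410_hits(lines):
    """Count Googlebot requests per URL that were answered with 410."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m["status"] == "410" and "Googlebot" in m["ua"]:
            counts[m["path"]] += 1
    return counts
```

Running this over a month of logs shows exactly which gone URLs Googlebot keeps revisiting, and at what volume, before you decide whether a robots.txt block is worth the rendering risk Mueller describes.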

The Difference Between The Obvious Reason And The Actual Cause

Google’s John Mueller is right to suggest a deeper diagnostic to rule out errors on the part of the publisher. A publisher error started the chain of events that led to the indexing of pages against the publisher’s wishes, so it’s reasonable to check whether a more plausible reason accounts for the loss of search visibility. This is a classic situation where the obvious reason is not necessarily the actual cause, and Mueller’s suggestion to keep looking for that cause is good advice.

Read the original discussion here.

Featured Image by Shutterstock/PlutusART

What Are Good Google Ads Benchmarks In 2025? [STUDY] via @sejournal, @brookeosmundson

Keeping up to date on industry Google Ads benchmarks is crucial for answering questions you might get from clients or execs, such as:

  • “Is this a good CTR?”
  • “Why is our CPA so high?”
  • “What’s a good conversion rate, anyway?”

Questions like these come up all the time, especially when budgets are tight and performance dips even slightly.

But unless you’ve got fresh benchmark data on hand, these conversations are usually filled with guesswork, vague assurances, or worse, outdated reports that no longer reflect how competitive today’s ad landscape really is.

WordStream by LocaliQ recently updated its search advertising benchmarks for 2025, compiling real data from thousands of Google and Microsoft Ads campaigns across 20 verticals.

The data consists of thousands of campaigns in both Google and Microsoft Ads across some of the top industries, including:

  • Arts & Entertainment.
  • Automotive.
  • Education.
  • Finance & Insurance.
  • Health & Fitness.
  • Home Improvement.
  • Shopping & Retail.
  • Travel.

While these benchmarks are a starting point, it’s important to note that many factors go into setting benchmarks that are attainable for your business.

We hope this data helps you level-set expectations and goals for your business and get a sense of how you stack up against the competition.

In this report, you’ll find benchmarks for Search campaigns in Google & Microsoft Ads for:

  • Click-through rate (CTR).
  • Average cost-per-click (CPC).
  • Conversion rate (CVR).
  • Cost per lead (CPL).

Let’s dig into the data.

Average Click-Through Rate In Google & Microsoft Ads By Industry

Average CTR by industry. Data from LocaliQ benchmark report, June 2025.

The average click-through rate for Google & Microsoft Ads across all industries averaged out to 6.66% over the last 12 months.

When the company first started gathering data in 2015, the average CTR for search ads was a modest 1.35%.

The business category that boasted the highest CTR was Arts & Entertainment, with an astounding 13.10% CTR.

At the other end of the spectrum was Dentists and Dental Services at a 5.44% CTR.

The CTR metric should be analyzed as only one indicator of performance, not the be-all and end-all when trying to determine whether your ads are doing well.

The wide spread in CTR performance is influenced by:

  • Your competition (Is the SERP saturated?).
  • Your bidding strategy.
  • Your position on the results page.
  • Your ad copy relevancy.
  • Your audience targeting.

High CTR doesn’t always mean high performance, though. Sometimes it just means your ad is click-worthy, not necessarily that it’s converting. That’s why CTR should be viewed as one piece of the puzzle, not the whole picture.

If your CTR is low compared to your industry average, tools like Google’s Quality Score can help pinpoint the problem areas, from poor ad relevance to weak expected click-through rate.

Average Cost-Per-Click In Google & Microsoft Ads By Industry

Average CPC by industry. Data from LocaliQ benchmark report, June 2025.

The average cost-per-click for Google and Microsoft Ads across all industries over the past 12 months averaged $5.26.

While Attorneys and Legal Services had one of the lowest CTRs, the industry also boasted the highest average CPC. In 2025, its average CPC came in at $8.58.

This average is unsurprising, given the higher-than-average cost of acquiring a customer.

On the lower end of the spectrum, the Arts & Entertainment industry had the lowest average CPC at $1.60.

Similar to analyzing the CTR metric, average CPC is just one performance indicator.

For example, your ads may show a low average CPC and a low CTR. This could mean your bids aren’t high enough to be competitive in the market, and you may want to consider raising bids.

On the other hand, if you have a higher-than-average CPC, you’ll want to monitor these more closely to ensure you can prove your return on ad spend/investment.

Average Conversion Rates In Google & Microsoft Ads By Industry

Average conversion rate by industry. Data from LocaliQ benchmark report, June 2025.

The average conversion rate across all industries for Google and Microsoft Ads in the last twelve months was 7.52%.

The average conversion rate is calculated as the number of leads/sales you get divided by the number of clicks on your ad.
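As a quick sketch of that calculation (the 47-leads-from-625-clicks figures are made-up numbers, chosen only because they land on the report's 7.52% average):

```python
def conversion_rate(conversions: int, clicks: int) -> float:
    """Conversion rate (%) = leads or sales divided by ad clicks."""
    if clicks == 0:
        return 0.0  # avoid dividing by zero on zero-click campaigns
    return conversions / clicks * 100

# Hypothetical campaign: 47 leads from 625 ad clicks.
print(f"{conversion_rate(47, 625):.2f}%")  # → 7.52%
```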

When looking at the data from 2025, the average conversion rate varied highly across industries.

On the high end of performance, Automotive had the highest conversion rate at 14.67%, followed by Animals and Pets at 13.07%.

The industries that had the lowest conversion rate included:

  • Finance & Insurance: 2.55%
  • Furniture: 2.73%
  • Real Estate: 3.28%

When looking at these industries and the products they sell, these conversion rates make sense.

Furniture is a high-ticket item for many customers. Users do a lot of research online before making a purchase. Not only that, but because of the price tag, many customers end up purchasing in stores instead of online.

While the conversion rate may be low in this particular industry, it’s more important than ever to be able to measure offline conversions, such as in-store visits or purchases.

In the apparel industry, new brands seem to pop up every day.

A simple search for Nike sneakers shows how the number of sellers and resellers for these types of products has skyrocketed in recent years.

The amount of competition can directly contribute to a low (or high) conversion rate.

Average Cost Per Lead In Google & Microsoft Ads By Industry

Average cost per lead by industry. Data from LocaliQ benchmark report, June 2025.

The average cost per lead across all industries for Google and Microsoft Ads in the last twelve months was $70.11.

The average cost per lead is a core KPI that advertisers should keep a pulse on when analyzing performance.

It remains one of the most scrutinized metrics by marketing and finance teams alike.

It’s no surprise that certain industries have a much higher CPL compared to other industries. Some of the factors that can influence CPL include:

  • Average CPC.
  • Average CTR (this influences your CPC).
  • Audience targeting.
  • Conversion rate.
  • The type of product/service you’re selling.

The Attorneys and Legal Services industry had the highest CPL out of all industries at a whopping $131.63.

However, while the CPL may be high, many businesses in that industry find it well worth the investment, considering the return on each client they represent.

Those industries with lower-priced products and services likely have a lower CPL goal.

The industries that showed the lowest CPL in 2025 were Automotive Repair, Services & Parts at $28.50, followed by Arts & Entertainment and Restaurants & Food at $30.27.

Compared to last year’s data, 13 out of the 23 industries reported an increase in CPL.

Average Google Ads cost per lead by year. Data from LocaliQ benchmark report, June 2025.

While the last few years saw large fluctuations in CPL due to record inflation and economic instability, the year-over-year changes in CPL have mellowed out a bit.

Summary

Benchmark reports are exactly that: benchmarks. They’re not scorecards, and they don’t account for your specific brand, audience, goals, or tech stack.

So, if your numbers don’t perfectly align with the averages, it doesn’t mean you’re underperforming.

If you’re looking to make progress in the second half of the year, try following the tips below:

  • Make sure your goals are aligned with your industry’s actual buying journey.
  • Explore alternative platforms like Microsoft Ads to diversify CPC risk.
  • Prioritize ad relevance and landing page experience.
  • Improve tracking for offline conversions where applicable.
  • Don’t forget to test (and retest) your keyword and bidding strategy.
  • Don’t forget about the mobile experience!

Make sure to check out Wordstream by LocaliQ’s full report on benchmarks and tips to improve your campaigns.

Featured Image: Roman Samborskyi/Shutterstock

How AI Detects Customer Hesitation (And Converts It Into Sales) via @sejournal, @purnavirji

Yesterday, I had hiking boots in my cart. Size selected, reviews read, I was even picturing myself on the trail. Then I hesitated. “Will these pinch my wide feet?” Three clicks later, I bounced.

These types of hesitations cost businesses millions.

We’ve gotten excellent at grabbing attention and driving traffic. But success comes down to attention coupled with intention.

The real challenge is optimizing for the micro-moments that determine conversions. Those moments where a finger hovers over “buy.” Eyes flick to the return policy. And then, that dreaded tab back to your competitor.

An essential skill for today’s marketers is conversion design, where we decode hesitation as a behavioral signal.

How do you guide attention toward action? How do you eliminate the friction that causes hesitation? AI can help us spot and solve for these in a way that we haven’t been able to previously.

78% of organizations now use AI in at least one business function, according to McKinsey’s 2025 State of AI research, yet most aren’t applying it where it matters most: the critical seconds when attention converts to action.

Understanding The Hesitation Moment

Your visitors have done their research. They’re on your product page, comparing options, genuinely considering a purchase. Then doubt creeps in:

“Will this integration work with our current setup?”

“Is this jacket too warm for Seattle?”

“Can I trust this company with a project this important?”

These small but significant moments determine whether someone converts or walks away. Behavioral science calls this “ambiguity aversion,” our brain’s tendency to avoid uncertain outcomes.

AI is now giving us visibility into these hesitation patterns that were invisible before. Let’s look at how leading brands are responding.

Retail: Removing Size Uncertainty

A Fortune 100 retailer analyzed cart abandonment and discovered shoppers were lingering over size charts before dropping off.

Instead of simply displaying standard measurements, they built a system that detects hesitation patterns and immediately surfaces:

  • Photos of real customers with height/weight stats wearing that exact item.
  • One-click connection to a live sizing consultant.
  • 90-day wear reviews showing how fit changed over time.

This resulted in 22% fewer returns and 37% higher conversion rates [Source: Anonymized client data].

Lululemon: AI-Powered Customer Segmentation

Google’s recent case study on Lululemon shows how the activewear brand used AI to address hesitation at scale.

Instead of treating all visitors the same, Lululemon’s AI identifies where customers are in their decision journey and adjusts messaging accordingly.

Their approach included:

The results showed a substantial reduction in customer acquisition costs, increased new customer revenue from 6% to 15%, and an 8% boost in return on ad spend (ROAS). The strategy was so effective that it earned top honors at the Google Search Honours Awards in Canada.

B2B: Enterprise Software Hesitation

In B2B, hesitation moments are different but no less critical. Enterprise buyers often get stuck on three key concerns:

  • Integration compatibility: “Will this work with our existing systems?”
  • ROI justification: “How do I prove value to leadership?”
  • Implementation risk: “What if this disrupts our operations?”

Smart B2B companies use AI to detect these hesitation patterns:

  • Someone spends 60+ seconds on pricing pages, especially toggling between tiers.
  • Someone downloads technical specs, then immediately visits competitor comparison pages.
  • Someone views implementation timelines multiple times without requesting a demo.

Leading SaaS platforms can trigger personalized responses based on these signals, such as custom ROI calculators, implementation case studies from similar companies, or direct connection to technical specialists.
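The signals above can start as simple rules over session events. The following is a hypothetical sketch: the page labels, thresholds, and event names are illustrative assumptions, not any real platform's API.

```python
# Hypothetical sketch of the hesitation signals listed above. Page labels,
# thresholds, and event names are illustrative assumptions, not a real API.
from dataclasses import dataclass

@dataclass
class PageView:
    page: str               # e.g. "pricing", "specs", "competitor_comparison"
    seconds: int            # time spent on the page
    toggled_tiers: bool = False

def hesitation_signals(session: list[PageView]) -> list[str]:
    """Flag the three B2B hesitation patterns described in the article."""
    signals = []
    for i, view in enumerate(session):
        # 60+ seconds on pricing while toggling between tiers
        if view.page == "pricing" and view.seconds >= 60 and view.toggled_tiers:
            signals.append("lingering on pricing tiers")
        # Spec download immediately followed by a competitor comparison page
        if (view.page == "specs" and i + 1 < len(session)
                and session[i + 1].page == "competitor_comparison"):
            signals.append("specs download followed by competitor comparison")
    # Repeated implementation-timeline views with no demo request
    timeline_views = sum(1 for v in session if v.page == "implementation_timeline")
    if timeline_views >= 2 and not any(v.page == "demo_request" for v in session):
        signals.append("repeated timeline views without a demo request")
    return signals

session = [
    PageView("pricing", seconds=75, toggled_tiers=True),
    PageView("specs", seconds=30),
    PageView("competitor_comparison", seconds=40),
]
print(hesitation_signals(session))
```

A production system would learn these thresholds from behavioral data rather than hard-coding them; the point is only that hesitation detection can begin as explicit rules before any machine learning is involved.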

Microsoft’s Conversational AI In Action

Microsoft’s data shows the power of AI in addressing customer hesitation in real-time. Their recent analysis reveals:

  • AI-powered ads deliver 25% higher relevance compared to traditional search ads.
  • Copilot ad conversions increased by 1.3x across all ad types since the November 2024 relaunch.
  • 40% of users say well-placed AI-powered ads enhance their online experience.

AI has moved well beyond automating existing processes; it now anticipates uncertainty and responds in real time.

The Hesitation-To-Action Framework

Here’s how to start optimizing for hesitation reduction:

1. Identify Hesitation Moments

Use tools like:

  • Heatmaps to see where users pause or hover, e.g., Users hover over “compatibility” but don’t click. Add clarity to product specs.
  • Session recordings to watch actual user behavior, e.g., A user toggles pricing tiers, then exits, indicating confusion or doubt.
  • Behavioral tracking to identify patterns before drop-off, e.g., Users who view the return policy are 2x more likely to abandon cart.
  • Sales call logs to find commonly asked questions and concerns, e.g., “How long does onboarding take?” Add a visual onboarding timeline.

2. Create Confidence Content

Address uncertainty directly:

  • Technical specifications for B2B concerns, e.g., “Compare to Your Stack” chart.
  • Social proof from similar customers, e.g., Quotes from similar customers with similar concerns.
  • Transparent information about potential drawbacks, e.g., a “Who This Isn’t Right For” section builds trust (sometimes, showing a drawback increases trust more than another benefit).
  • Comparison tools that highlight advantages, e.g., “Compare us to [Competitor X]” chart, to keep people on site.

3. Deploy Behavioral Triggers

Implement AI-powered responses:

  • Dynamic content that adapts based on user behavior, e.g., Lingers on “Team Plan” pricing tier? Show a testimonial from a similar-sized company.
  • Personalized chat prompts triggered by hesitation signals, e.g., Toggles pricing three times? Prompt: “Want help calculating ROI for your team size?”
  • Targeted offers that address specific concerns, e.g., Returning visitor? “Still deciding? Here’s 10% off.”
  • Smart recommendations based on similar customer patterns, e.g., Read three CRM blog posts? Show a case study on CRM integration.

4. Test And Optimize

Microsoft emphasizes the importance of continuous testing. 85% of marketers using generative AI report improved productivity across content and ad creation.

Start small:

  • Choose one campaign or conversion point to optimize, e.g., Demo sign-ups underperforming? Test new headline and CTA.
  • Test AI-generated variations of copy and creative, e.g., Speed vs. security vs. ROI messaging.
  • Monitor real-time insights to refine approaches, e.g., “See how it works” gets more clicks than “Get Started.”
  • Scale successful tactics across other touchpoints, e.g., Winning copy gets rolled into LinkedIn ads and webinar invites.

5. Solve For The Measurement Challenge

Lululemon’s success came from implementing what they called a “measurement trifecta,” blending marketing mix modeling (MMM), experiments, and attribution to gain a more holistic view of performance.

This comprehensive approach revealed:

  • How different activities influenced sales over time.
  • Which touchpoints were most effective in the customer journey.
  • Where hesitation was occurring and being resolved.

The Strategic Shift For Search And Social

SEO

AI Overviews (AIO) are changing how content gets discovered. It’s important to anticipate doubts before they form, structure answers for AI extraction, and prove claims with third-party data.

Create content that addresses hesitation at different stages of the buying journey. Your product pages need to rank and convert uncertain visitors into confident customers.

Paid Search

Use AI to detect behavioral signals that indicate hesitation. Adjust landing pages, ad copy, and bidding strategies based on where users are in their decision process.

Track micro-conversions that indicate reduced hesitation, such as time spent with size charts, clicks on customer reviews, and interactions with chat.

Social Media

  • Share case studies and video testimonials addressing common concerns.
  • Post behind-the-scenes content showing actual product usage.
  • Share first-party data and statistics as proof points.
  • Use polls to identify hesitation points in your audience.
  • Use sentiment analysis to identify hesitation in comments and messages.
  • Test dynamic ad content and AI-generated social copy variations.

Closing The Attention To Intention Gap

Traffic is just the beginning.

For high impact, you need to earn trust in the seconds that matter most. AI gives us the power to see hesitation in real time and resolve it before it becomes regret.

Success often comes down to these micro-moments, these seconds when someone hovers between interest and action.

Master those micro-moments and everything else follows.

Featured Image: fizkes/Shutterstock

WordPress Co-Founder Mullenweg’s Reaction To FAIR Project via @sejournal, @martinibuster

The Linux Foundation recently announced the FAIR Package Manager project, an open-source, distributed WordPress plugin and theme repository that decentralizes control of the repository. A distributed theme and plugin repository became a priority for many in the WordPress community after Matt Mullenweg took control of certain paid premium plugins and created free versions from them, in addition to removing access to the free versions of the original plugins.

The Linux Foundation announcement, made on Friday, June 6, came in the middle of WordCamp Europe, all but assuring that it would be a topic of discussion at the three-day conference.

According to the Linux Foundation announcement:

“…The FAIR Package Manager project paves the way for the stability and growth of open source content management, giving contributors and businesses additional options governed by a neutral community…”

It was inevitable that Matt Mullenweg would be asked about it, and that’s what happened, twice. Mullenweg was gracious about answering the questions, but he was also understandably cautious, given that less than 24 hours had passed since the FAIR project was announced.

Initial Reaction To Project FAIR

The first question came early in the question-and-answer period: Mullenweg was asked how he sees such initiatives coexisting with WordPress and what he sees as the ideal outcome.

Mullenweg expressed cautious optimism, praising the open source nature of WordPress by saying that that’s the point of open source: it can coexist with everything. But he was also reluctant to say much more. He did seem a little annoyed that the FAIR project was created “in secret.” I don’t know to what extent the FAIR project was created in secret, but it did seem as if the Linux Foundation essentially ambushed WordPress and WordCamp Europe with its announcement.

Mullenweg answered:

“…I think that’s part of the beauty, that something like this can be written with the APIs that WordPress has. I don’t know if I want to comment too much further on it just because I kind of just found out about it last night; there hasn’t been that much time. There’s a lot of code and complexities.

You know, I do wish, if the team did want to collaborate, or the team says we want to be transparent and everything. But it did sort of drop as a surprise. It was worked on in secret for six months. But we can work past that and look at it.”

Do Users Want A Federated Repository?

Mullenweg next turned the question away from what he might think about it and asked whether this is something that WordPress users would want. He also explained the immensity of undertaking a decentralized system for the repository.

He continued his answer:

“I do think things we need to keep in mind are, you know, what are users asking for?

What are the challenges they’re facing around finding the right things, knowing it’s secure, getting updates? You know the stats around how many sites that are hacked are from out-of-date plugins. Those are things that are top of mind for me for the plugin directory, and so the trust and safety elements of that for the .org directory.

…So we’re now up to 72,000 plugins and themes. This is about 3.2 terabytes of zip files. That’s not counting all the SVN history and everything like that. So there’s a lot of data there, which also we need to make sure, like, if 500 mirrors are set up and they’re all sucking down the directory, that could DDoS us.”

About twenty minutes later, someone else stepped up and asked the question again, sharing her long history with WordPress and her opinion of why the FAIR project may be useful.

She said:

“I’ve been contributing to the communication team for 14 years and contributing to the plugin review team for a couple of years, and my whole work in documentation was serving the user; every decision we made was to serve the user. And in the plugin review team we also include plugin authors. So everything we do, we do for plugin authors and users to make their lives easier and better.”

Next she offered an explanation of why she thinks the FAIR project is good for plugin authors and users:

“So the Fair project is actually federated and independent repository of trusted plugins and teams. And it is under the Linux Foundation. So that means a lot when it’s under the Linux foundation.

And what it means for users and plugin authors and team authors is actually making their lives easier and better, more secure. It makes all the products more discoverable and also developers can choose their source. Where are they using their supply chain from.

But also, it is helping WordPress.org because these are mirrors so it will reduce the load from WordPress.org for every update and all of that.

…I don’t know if you trust me, but it seemed to me that this aligns with the idea of having users and developers first in mind. Would you as wordpress.org consider collaborating with this project?”

Mullenweg’s answer was cautious in tone, giving the impression that he didn’t know much about the FAIR project aside from the public announcement made by the Linux Foundation.

He answered:

“Of course we consider everything, but even in what you said, I think there’s a lot of challenges to it. So for example, right now, a supply chain attack needs to breach wordpress.org which has never been hacked.”

At this point loud laughter rang out in the hall, catching Mullenweg by surprise.

He then continued, offering an idea of the complexity of a federated theme and plugin repository:

“Now, all of a sudden, there are N places that could potentially be compromised, and you know there’s ways to do that, many ways. There’s N places with uptime issues.

And… it makes it much more difficult for, I don’t know if it’s actually better for WordPress.org, because it makes it much more difficult to do things like rollouts, phased rollouts, or let’s say we get plugin authors the ability to ship to 5% of users and then see what happens, which means we also need things being checked back and then we can roll out to the rest, which is something that I’ve heard a ton of plugin authors ask for.

It will break all the analytics and stats that we provide and also that we internally …use to make decisions, for example which versions of PHP we support…

So I think that it’s uh a big part of why WordPress is where it is today is because of the infrastructure and the sort of feedback loop that we get from wordpress.org.

Also, the trust that we’re able to engender by having that be a resource. When you look at marketplaces, people aren’t asking necessarily for I want it to be downloaded from more locations.

  • They’re asking for how do I know this is trustworthy?
  • How do I know these reviews are real?
  • Who’s moderating?
  • Who’s checking the IP’s on these different reviews?
  • What’s the plugin rating?
  • What’s the compatibility for it?
  • Is it compatible with my other plugins?

These are things I’m hearing from users, not I need it hosted in a different place. This is one example.

And again, I don’t want to get too far into it because I want to read the code. I want to dive more into it. I want colleagues to look at it. So, I think it’s kind of premature, less than 24 hours in to say like we’re going to …this or not.”

At this point Mullenweg praised the fact that people were being constructive rather than arguing.

He continued:

“But I do think it’s awesome that people are shipping code versus just arguing or talking or writing blog posts. I think that’s a pretty productive way to sort of channel possible disagreements or anything, and then we can see how it looks. Might be a super niche thing that a few people use, maybe one or two hosts or it might be something that maybe there’s something in there that becomes …popular.”

Then he returned to listing things that still need to be looked into, trying to give an idea of how complex creating a decentralized repository is.

Mullenweg continued:

“Like, something that we probably need to do in the plugin review is something about these admin banners, right? Now, how is that enforced in a distributed FAIR system?”

Mullenweg then asked the person asking the question how she would solve all of those problems, to which she answered that she’s not the smartest person in the room but that this is something to be collaborated on. Then she tossed off a joking remark that maybe they can ask ChatGPT, which drew laughter and applause, breaking the tension and ending the question on a light note.

Watch the question-and-answer session at about the eight-hour mark of the video:

Google’s Local Job Type Algorithm Detailed In Research Paper via @sejournal, @martinibuster

Google published a research paper describing how it extracts “services offered” information from local business sites to add it to business profiles in Google Maps and Search. The paper describes specific relevance factors and confirms that the system has been in successful use for a year.

What makes this research paper especially notable is that one of the authors is Marc Najork, a distinguished research scientist at Google who is associated with many milestones in information retrieval, natural language processing, and artificial intelligence.

The purpose of this system is to make it easier for users to find local businesses that provide the services they are looking for. The paper was published in 2024 (according to the Internet Archive) and is dated 2023.

The research paper explains:

“…to reduce user effort, we developed and deployed a pipeline to automatically extract the job types from business websites. For example, if a web page owned by a plumbing business states: “we provide toilet installation and faucet repair service”, our pipeline outputs toilet installation and faucet repair as the job types for this business.”

Developing A Local Search System

The first step in creating a system for crawling and extracting job type information was to create training data from scratch. They selected billions of home pages listed in Google business profiles and extracted job type information from tables and formatted lists on those home pages or on pages one click away from them. This job type data became the seed set of job types.

The extracted job type data was used as search queries, augmented with query expansion (synonyms) to expand the list of job types to include all possible variations of job type keyword phrases.

Second Step: Fixing A Relevance Problem

Google’s researchers applied their system to billions of pages, and it didn’t work as intended because many pages contained job type phrases that were not describing services offered.

The research paper explains:

“We found that many pages mention job type names for other purposes like giving life tips. For example, a web page that teaches readers to deal with bed bugs might contain a sentence like a solution is to call home cleaning services if you find bed bugs in your home. They usually provide services like bed bug control. Though this page mentions multiple job type names, the page is not provided by a home cleaning business.”

Limiting crawling and indexing to identifying job type keyword phrases resulted in false positives. The solution was to incorporate the sentences surrounding the keyword phrases so that the system could better understand their context.

The success of using surrounding text is explained:

“As shown in Table 2, JobModelSurround performs significantly better than JobModel, which suggests that the surrounding words could indeed explain the intent of the seed job type mentions. This successfully improves the semantic understanding without processing the entire text of each page, keeping our models efficient.”
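The surrounding-words idea can be sketched as a toy illustration (the cue-word list, window size, and function names below are my assumptions; the paper's production system uses a fine-tuned language model, not a hard-coded cue list):

```python
# Illustrative sketch, not Google's actual pipeline: judge a job-type mention
# by the words immediately around it instead of processing the entire page.
# The cue-word list and window size are assumptions for demonstration.
SERVICE_CUES = {"we", "our", "provide", "offer", "offers", "service", "services"}

def mentions_with_context(text: str, phrase: str, window: int = 5):
    """Return (left_words, right_words) for each mention of `phrase` in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    hits = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            hits.append((words[max(0, i - window):i],
                         words[i + len(target):i + len(target) + window]))
    return hits

def looks_like_service_offered(text: str, phrase: str) -> bool:
    """True if any mention of `phrase` has a service cue in its context window."""
    return any(set(left + right) & SERVICE_CUES
               for left, right in mentions_with_context(text, phrase))

business = "We provide toilet installation and faucet repair service for local homes."
tips = "One fix is to call someone if a leak appears, since faucet repair is tricky."
print(looks_like_service_offered(business, "faucet repair"))  # → True
print(looks_like_service_offered(tips, "faucet repair"))      # → False
```

The same phrase is classified differently in the two sentences purely because of its neighboring words, which is the efficiency point the researchers make: context can disambiguate a mention without processing the whole page.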

SEO Insight
The described local search algorithm purposely excludes everything on the page except job type keyword phrases and the words surrounding them. This shows how the words around important keyword phrases can provide context for those phrases and make it easier for Google’s crawlers to understand what the page is about without processing the entire web page.

SEO Insight
Another insight is that, for the limited purpose of identifying job type keyword phrases, Google is not indexing the entire web page. The algorithm hunts for the keyword phrase and the surrounding keyword phrases.

SEO Insight
The concept of analyzing only a part of a page is similar to Google’s Centerpiece Annotation where a section of content is identified as the main topic of the page. I’m not saying these are related. I’m just pointing out one feature out of many where a Google algorithm zeroes in on just a section of a page.

The System Uses BERT

Google used the BERT language model to classify whether phrases extracted from business websites describe actual job types. BERT was fine-tuned on labeled examples and given additional context such as website structure, URL patterns, and business category to improve precision without sacrificing scalability.

The Extraction System Can Be Generalized To Other Contexts

An interesting finding detailed by the research paper is that the system they developed can be used in areas (domains) other than local businesses, such as “expertise finding, legal and medical information extraction.”

They write:

“The lessons we shared in developing the largescale extraction pipeline from scratch can generalize to other information extraction or machine learning tasks. They have direct applications to domain-specific extraction tasks, exemplified by expertise finding, legal and medical information extraction.

Three most important lessons are:

(1) utilizing the data properties such as structured content could alleviate the cold start problem of data annotation;

(2) formulating the task as a retrieval problem could help researchers and practitioners deal with a large dataset;

(3) the context information could improve the model quality without sacrificing its scalability.”

Job Type Extraction Is A Success

The research paper says that their system is a success: it has a high level of precision (accuracy), it is scalable, and it has already been in use for a year. The research is dated 2023, but according to the Internet Archive (Wayback Machine), it was published sometime in July 2024.

The researchers write:

“Our pipeline is executed periodically to keep the extracted content up-to-date. It is currently deployed in production, and the output job types are surfaced to millions of Google Search and Maps users.”

Takeaways

  • Google’s Algorithm That Extracts Job Types from Webpages
    Google developed an algorithm that extracts “job types” (i.e., services offered) from business websites to display in Google Maps and Search.
  • Pipeline Extracts From Unstructured Content
    Instead of relying on structured HTML elements, the algorithm reads free-text content, making it effective even when services are buried in paragraphs.
  • Contextual Relevance Is Important
    The system evaluates surrounding words to confirm that service-related terms are actually relevant to the business, improving accuracy.
  • Model Generalization Potential
    The approach can be applied to other fields, such as legal or medical information extraction, showing that it generalizes beyond local business data.
  • High Accuracy and Scalability
    The system has been deployed for over a year and delivers scalable, high-precision results across billions of webpages.

Google published a research paper about an algorithm that automatically extracts service descriptions from local business websites by analyzing keyword phrases and their surrounding context, enabling more accurate and up-to-date listings in Google Maps and Search. This technique avoids dependence on HTML structure and can be adapted for use in other industries where extracting information from unstructured text is needed.

Read the research paper abstract and download the PDF version here:

Job Type Extraction for Service Businesses

Featured Image by Shutterstock/ViDI Studio

Can One Person Run a Billion-Dollar Store?

Eduardo Samayoa believes a solo entrepreneur could someday run a billion-dollar ecommerce company. Not with hustle. With AI.

For most folks, the idea of an individual (or even a small team) managing a massive online shop sounds outrageous. Yet AI tools are not just helping ecommerce operators work faster. They are changing who can operate, and at what scale.

Today, AI agents are automating work that once required entire departments, according to Samayoa, who is the co-founder and CEO of Thinkr, a Shopify-centric AI platform.

Thinkr is an AI platform for Shopify-powered stores.

AI Growth

While artificial intelligence has existed for many years, the current AI boom emerged on November 30, 2022. That was when OpenAI released ChatGPT.

AI tools have expanded from generating basic text, such as ecommerce product descriptions, to all forms of analysis and content and, most recently, agentic automation.

Samayoa’s Thinkr, for example, could recommend discounting several products to boost sales and deplete aging inventory. The tool makes this recommendation to the store’s staff and, if approved, executes the plan, updating pricing in Shopify.

A store staff member clicked one button, and the AI agent did all of the work.

Other, unrelated AI tools recommend advertising budgets and execute on them.
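Thinkr’s internals are not public, so here is only a rough sketch of the kind of rule such an agent might apply to the discounting example above. The age threshold, discount rate, and product data are all invented for illustration, and a human still approves the proposal before anything changes in the store.

```python
# Hypothetical sketch (not Thinkr's actual logic): flag aging inventory
# and propose a discounted price for human approval.
from dataclasses import dataclass
from datetime import date

@dataclass
class Product:
    title: str
    price: float
    last_sold: date
    stock: int

def recommend_discounts(products, today, max_age_days=90, discount=0.20):
    """Propose a discount for items with stock on hand and no sale
    in the last `max_age_days` days."""
    proposals = []
    for p in products:
        age = (today - p.last_sold).days
        if age > max_age_days and p.stock > 0:
            proposals.append((p.title, round(p.price * (1 - discount), 2)))
    return proposals

catalog = [
    Product("Canvas Tote", 30.00, date(2025, 1, 5), 40),
    Product("Travel Mug", 18.00, date(2025, 5, 28), 12),
]
print(recommend_discounts(catalog, today=date(2025, 6, 1)))
# [('Canvas Tote', 24.0)]
```

In a real agent, the approved proposal would then be pushed to the store via the platform’s API, which is exactly why the integrations discussed next matter.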

Prerequisites

Merely asking ChatGPT, Gemini, Claude, or any other generative AI to analyze a store’s sales patterns and update pricing won’t work. An AI tool needs data sources and application programming interfaces (APIs).

Samayoa noted that Thinkr is connected to Shopify’s API, meaning it can access a store’s sales history and implement operational changes.
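As a concrete illustration of what “connected to Shopify’s API” can mean, here is a minimal sketch of preparing a request for recent orders via Shopify’s Admin REST API. The endpoint path and access-token header are real Shopify conventions; the shop name and token are placeholders, and how Thinkr itself talks to the API is not public.

```python
# Sketch: building a request for a store's order history via the
# Shopify Admin REST API. Shop name and token below are placeholders.
def build_orders_request(shop: str, token: str, api_version: str = "2024-01"):
    url = f"https://{shop}.myshopify.com/admin/api/{api_version}/orders.json"
    headers = {"X-Shopify-Access-Token": token}
    params = {"status": "any", "limit": 250}  # 250 is the API's max page size
    return url, headers, params

url, headers, params = build_orders_request("example-store", "shpat_xxx")
print(url)
# An agent would then fetch and analyze the data, e.g. with:
# requests.get(url, headers=headers, params=params)
```

With read access to orders and write access to products and prices, a platform can both observe a store’s sales history and implement changes, which is the combination Samayoa describes.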

Thinkr also integrates with Google and Meta, enhancing its understanding of a shop’s analytics and advertising performance while expanding the operations it can perform.

Similarly, Shopify’s own AI tools — collectively called Shopify Magic — can access a shop’s sales history and execute all sorts of approved tasks.

According to Samayoa, AI platforms that operate a store require context, i.e., a history of the store’s operations, to make effective recommendations. Hence the tools likely work best for established businesses with at least $200,000 in annual sales.

Do It for You

The purpose of AI operations platforms is to make and implement recommendations. The platforms do it for you.

This could mean completing complex tasks in combination. For example, an AI operations platform might:

  • Recognize an upcoming retail holiday, such as Father’s Day.
  • Plan a Father’s Day promotion based on previous campaign data.
  • Build and publish the promotional landing page.
  • Generate advertising assets.
  • Set up a Meta Ads campaign.
  • Generate and schedule promotional email messages.
  • Launch the promotion after a one-click approval.
  • Report on the campaign’s performance.

The solo entrepreneur remains in charge, but the AI handles everything else.

Sidekick (part of Shopify Magic) can update themes and pages. Thinkr could plan the Father’s Day promotion and perform some of its tasks thanks to integrations with Meta and Klaviyo.

In a sense, operational AI has only just launched and could someday have its own November 30, 2022 moment.

Innovate?

Ecommerce, while competitive, has enabled solo entrepreneurs and small businesses to thrive. Virtual assistants, agencies, and low-cost offshore talent extend those capabilities, as do platforms such as Shopify and Amazon.

AI operations platforms will likely extend this trend, making it possible for one person to manage massive operations, as Samayoa predicts. Given their explosive growth and trajectory, AI platforms could soon lead product development, customer service, marketing, financial forecasting, and most day-to-day operations.

Still, some aspects — accountability, decision-making, emotional intelligence — will remain human endeavors even as AI takes on more tasks.

A final consideration is whether operational AI platforms can innovate. Early on, AI could be a significant advantage, but could widespread adoption eventually homogenize ecommerce operations? Would that be bad or good?