Google Research’s ALDRIFT: AI Answers That Do More Than Sound Plausible via @sejournal, @martinibuster

Google Research published a paper that studies how to make generative AI systems produce answers that do more than sound plausible. The researchers say that their ALDRIFT framework “opens exciting avenues” for moving beyond answers that merely have a high probability.

The paper, titled “Sample-Efficient Optimization over Generative Priors via Coarse Learnability,” examines a problem in which generated answers must remain likely under a model while also moving toward a separate goal. The research points toward new avenues for addressing the AI plausibility trap.

Google ALDRIFT

The evidence in the paper centers on a framework called ALDRIFT (Algorithm Driven Iterated Fitting of Targets). The method repeatedly refines a generative model toward lower-cost answers and uses a correction step to reduce accumulated error during the process.

The paper also introduces “coarse learnability.” The term means the learned model does not need to perfectly match the ideal target. It needs to keep enough coverage over important parts of the answer space so useful possibilities are not lost too early. Under that assumption, the authors prove that ALDRIFT can approximate the target distribution with a polynomial number of samples.

ALDRIFT Operates On A Two-Part Setup

ALDRIFT operates on a two-part setup:

  1. The generative model represents what kinds of answers remain likely under the model.
  2. The outside scoring process measures whether a candidate answer performs well against the target goal.

The authors describe that score as a “cost.” The word “cost” refers to the measured penalty assigned to a candidate answer. A lower cost means the candidate did better according to the requirement being checked. ALDRIFT does not simply search for any low-cost answer. It searches for answers that score well while still remaining likely under the generative model.
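
One way to picture the interplay between the generative model and the cost is a toy example in which a prior over a few candidates is reweighted by an exponential penalty on cost. The sketch below is an illustration under assumptions, not the paper's algorithm: the candidate names, probabilities, costs, and the `beta` strength parameter are all hypothetical, and it leaves out ALDRIFT's iterative fitting and correction step.

```python
import math


def tilted_distribution(prior, cost, beta):
    """Reweight each candidate's prior probability by exp(-beta * cost), then renormalize."""
    weights = {c: p * math.exp(-beta * cost[c]) for c, p in prior.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical candidates: how likely the model finds each one, and its external cost.
prior = {"route_a": 0.6, "route_b": 0.3, "route_c": 0.1}
cost = {"route_a": 2.0, "route_b": 0.0, "route_c": 0.0}

# beta = 0 ignores the cost entirely; larger beta penalizes costly candidates harder.
for beta in (0.0, 1.0, 3.0):
    target = tilted_distribution(prior, cost, beta)
    print(beta, {c: round(p, 3) for c, p in target.items()})
```

As `beta` grows, probability shifts toward the low-cost candidates, yet `route_b` stays three times as likely as `route_c` because the prior still favors it. That is the sense in which a low-cost answer can also remain likely under the model.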

Some AI Answers Need To Work As A Whole

The researchers focus on problems where the AI’s answer has to function in the real world, and they use route planning and conference planning as examples.

  • Route planning: The paper explains that an LLM may evaluate whether individual route segments are scenic, but may struggle to ensure that those segments connect into a valid path.
  • Conference planning: An LLM may group sessions by topic, while a classical algorithm may be needed to schedule those sessions into a timetable without conflicts.

These examples show why the paper treats plausible answers as only part of the problem. The harder issue is producing answers that remain coherent when separate parts have to work together as one complete solution.

The Coarse Learnability Assumption

The paper treats this as a problem of guiding a generative model toward answers that hold together across all their parts. The authors connect the problem to inference-time alignment, where a model is adjusted during use based on whether a specific answer works as a complete solution. That connection gives the research practical relevance, although the paper’s contribution remains theoretical and depends on the coarse learnability assumption.

The phrase “coarse learnability assumption” means the paper’s theory depends on an assumption that the model can keep enough useful possibilities available while it is being pushed toward better answers.

It does not mean the model has to learn the target perfectly. It means the model has to preserve enough coverage of the answer space so the process does not get stuck too early or lose possible better answers.

Existing Optimization Methods Leave Sample-Limited Gaps

The paper identifies several gaps in how existing optimization methods are understood:

  • Limitation of existing methods: Classical model-based optimization methods rely on “asymptotic convergence arguments.” This means they are theoretically understood after very large amounts of sampling, but not necessarily in practical settings with limited samples.
  • Failure with expressive models: The paper says these classical assumptions “break down” when using expressive generative models like neural networks.
  • Gap in understanding: The authors say the “finite-sample behavior” of optimization in this setting is “theoretically uncharacterized.” That means the theory does not fully explain how these methods behave when only limited samples are available.

The paper’s solution is to introduce “coarse learnability” to explain how a generative model can be pushed toward better answers while keeping enough useful possibilities available along the way.

The LLM Evidence Is Limited

The paper’s main proof applies to analytic generative models, which are easier to analyze mathematically than modern LLMs. The LLM evidence is narrower: the authors use GPT-2 in simple scheduling and graph-related problems, showing behavior that supports the idea without proving that the same assumptions hold for modern LLMs.

The Research Points To A Foundation For Future Research

The paper offers a theoretical foundation for studying how generative models could be combined with external checking processes.

The research shows that Google researchers are exploring a framework for addressing the “plausible answer” problem, and the authors write that the “framework opens exciting avenues for future research.” They conclude that this research points “toward a principled foundation for adaptive generative models.”

Takeaways

  • The “Coverage” Requirement:
    Coarse learnability means the model does not have to learn the target perfectly. It needs to avoid losing useful areas of the answer space where better solutions might exist.
  • The Correction Step Matters:
    ALDRIFT uses a correction step to keep the search closer to the intended target as the model is pushed toward better answers.
  • Two-Part Approach:
    The framework uses a division of labor. The generative model handles qualitative or semantic preferences, while a separate process checks whether the answer works as a complete solution.
  • Limited LLM Evidence:
    Tests with GPT-2 showed behavior that supports the idea in simple scheduling and graph-related examples, but not proof that the same assumptions hold for modern LLMs.
  • Real-World Use Is The Larger Goal:
    The research matters to SEOs and businesses because AI answers are increasingly expected to do more than summarize information. They need to support decisions, plans, and actions that hold together outside the chat interface. While the framework is likely not being used in production, it does show Google is making progress on providing answers that are more than plausible.

Read the research paper here:

Sample-Efficient Optimization over Generative Priors via Coarse Learnability (PDF)

Featured Image by Shutterstock/Faizal Ramli

The Fully Non-Human Web: No One Builds The Page, No One Visits It via @sejournal, @slobodanmanic

In January 2026, Google was granted patent US12536233B1. Six engineers worked on it, and it describes a system that scores a landing page on conversion rate, bounce rate, and design quality. If the landing page falls below a threshold, the system generates an AI replacement personalized to the searcher. The advertiser never sees it. Never approves it. Might not even know it happened.

The debate around this patent has centered on scope: Is it limited to shopping ads, or does it signal something broader? That’s the wrong question.

The right question: What happens when you combine AI-generated pages with AI agents that browse, shop, and transact on behalf of humans?

For the first time, we have the infrastructure for a web where no human creates the page and no human visits it. Both sides can be non-human. That changes everything.

The Supply Side: AI-Generated Pages

The supply side of the web has always been human. Someone designs a page, writes copy, publishes it. Three developments are changing that.

Google’s patent US12536233B1 is the most direct: Score a landing page on conversion rate, bounce rate, and design quality, then replace underperforming pages with AI-generated versions. The replacement pages draw on the searcher’s full search history, previous queries, click behavior, location, and device data. Google builds personalized landing pages no advertiser can match, because no advertiser has access to cross-query behavioral data at that scale. Barry Schwartz covered the patent on Search Engine Land, describing a system where Google could automatically create custom landing pages, replacing organic results. Glenn Gabe called Google’s AI landing page patent potentially more controversial than AI Overviews. Roger Montti at Search Engine Journal argued the patent’s scope is limited to shopping and ads. Both camps agree: the technology to score and replace landing pages with AI exists and works.
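
The score-and-replace logic described for the patent can be sketched in a few lines. This is a minimal illustration, not the patented method: the weights, the threshold, the 0-to-1 normalization of each metric, and every name in the code are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LandingPageMetrics:
    conversion_rate: float  # assumed normalized to 0..1
    bounce_rate: float      # assumed normalized to 0..1 (higher is worse)
    design_quality: float   # assumed normalized to 0..1


# Hypothetical weights and threshold; no actual values are given in the article.
WEIGHTS = {"conversion": 0.5, "retention": 0.3, "design": 0.2}
REPLACEMENT_THRESHOLD = 0.4


def page_score(m: LandingPageMetrics) -> float:
    """Combine the three signals into one score; bounce rate counts against the page."""
    return (
        WEIGHTS["conversion"] * m.conversion_rate
        + WEIGHTS["retention"] * (1.0 - m.bounce_rate)
        + WEIGHTS["design"] * m.design_quality
    )


def should_replace(m: LandingPageMetrics, threshold: float = REPLACEMENT_THRESHOLD) -> bool:
    """True when the page scores below the threshold and would be swapped for a generated one."""
    return page_score(m) < threshold


weak_page = LandingPageMetrics(conversion_rate=0.1, bounce_rate=0.9, design_quality=0.3)
strong_page = LandingPageMetrics(conversion_rate=0.8, bounce_rate=0.2, design_quality=0.9)
print(should_replace(weak_page), should_replace(strong_page))
```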

NLWeb, Microsoft’s open project, takes a different approach. NLWeb turns any website into a natural language interface using existing Schema.org markup and RSS feeds. An AI agent querying an NLWeb-enabled site doesn’t load a page at all. The agent asks a structured question, NLWeb returns a structured answer. The rendered page becomes optional.

WebMCP goes further still. With WebMCP, a website registers tools with defined input/output schemas that AI agents discover and call as functions. A product search becomes a function call. A checkout becomes an API request. WebMCP eliminates the “page” concept entirely, dissolving the web page as a unit of content into a set of callable capabilities.
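
The general idea of registering tools with declared inputs that an agent discovers and calls as functions can be sketched generically. The code below is not the WebMCP API; the `ToolRegistry` class, the schema format, the catalog, and the `search_products` tool are all hypothetical and exist only to show the pattern.

```python
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Hypothetical registry: a site exposes named tools with declared inputs."""

    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, input_schema: Dict[str, type],
                 handler: Callable[..., Any]) -> None:
        self._tools[name] = {"input_schema": input_schema, "handler": handler}

    def describe(self) -> List[Dict[str, Any]]:
        """What an agent would read to discover the available tools."""
        return [
            {"name": name,
             "inputs": {k: t.__name__ for k, t in tool["input_schema"].items()}}
            for name, tool in self._tools.items()
        ]

    def call(self, name: str, **kwargs: Any) -> Any:
        """Validate arguments against the declared schema, then run the handler."""
        tool = self._tools[name]
        schema = tool["input_schema"]
        if set(kwargs) != set(schema):
            raise ValueError(f"{name} expects arguments {sorted(schema)}")
        for key, expected in schema.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return tool["handler"](**kwargs)


# Made-up catalog and tool for illustration.
CATALOG = [
    {"name": "Trail Runner", "price": 120.0},
    {"name": "Road Racer", "price": 95.0},
]


def search_products(query: str, max_price: float) -> List[Dict[str, Any]]:
    return [p for p in CATALOG
            if query.lower() in p["name"].lower() and p["price"] <= max_price]


registry = ToolRegistry()
registry.register("search_products", {"query": str, "max_price": float}, search_products)

print(registry.describe())
print(registry.call("search_products", query="runner", max_price=150.0))
```

In this pattern the agent never renders a page. It reads the tool descriptions, picks one, and gets structured data back.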

Each mechanism works differently, but the direction is the same: the page is becoming something generated, queried, or bypassed entirely. The human-designed, human-published web page is no longer the only way content reaches an audience.

The Demand Side: AI Agents As Visitors

The demand side shifted faster. In 2024, bots surpassed human traffic for the first time in a decade, accounting for 51% of all web activity. Cloudflare’s data shows AI “user action” crawling (agents actively doing things, not just indexing) grew 15x during 2025. Gartner predicts 40% of enterprise applications will feature task-specific AI agents by end of 2026, up from less than 5% in 2025. The scale is hard to overstate.

Agentic browsers are the most visible shift. Chrome’s auto browse turned 3 billion Chrome installations into potential AI agent launchpads. Google’s Gemini scrolls, clicks, fills forms, and completes multi-step tasks autonomously inside Chrome. Perplexity’s Comet browser conducts deep research across multiple sites simultaneously. Microsoft’s Edge Copilot Mode handles multi-step workflows from within the browser sidebar. The full agentic browser landscape now includes over a dozen consumer and developer tools, all browsing on behalf of humans.

Commerce agents have moved past browsing into buying. OpenAI launched Instant Checkout to let users purchase products directly inside ChatGPT, powered by Stripe’s Agentic Commerce Protocol (ACP). OpenAI killed the feature in March 2026 after near-zero purchase conversions and only a dozen merchant integrations out of over a million promised. The failure was execution, not concept: Alibaba’s Qwen app processed 120 million orders in six days in February 2026 because Alibaba owns the AI model, the marketplace, the payment rails (Alipay), and the logistics. OpenAI tried to replicate agentic commerce without owning the stack. Google and Shopify’s Universal Commerce Protocol (UCP) connects over 20 companies, including Walmart, Target, and Mastercard, in a framework designed for AI agents to handle commerce from product discovery through checkout. Shopify auto-opted over a million merchants into agentic shopping experiences with ChatGPT, Copilot, and Perplexity. The transaction happens in an AI conversation. No checkout page loads.

Agent-to-agent communication removes the human from both ends. Google’s Agent-to-Agent (A2A) protocol lets AI agents from different vendors discover each other’s capabilities and collaborate on tasks without human mediation. A travel planning agent negotiates directly with a booking agent. A procurement agent evaluates supplier agents across vendors. Over 150 organizations support A2A, including Salesforce, SAP, and PayPal, making agent-to-agent commerce and coordination a production reality.

When Both Sides Go Non-Human

Until now, one side of the web was always human. A person built the page, or a person visited it. Usually both.

Google’s patent closes the circuit.

Here’s what a complete non-human flow might look like. A user tells their AI assistant they need running shoes. The assistant queries product data through NLWeb or WebMCP, no page load needed. The assistant evaluates options by checking inventory across retailers via A2A. If the user needs to review a comparison, Google generates a landing page personalized to that specific user’s search history and preferences. The assistant completes checkout through ACP or UCP using Shared Payment Tokens. The user receives a confirmation.

The human’s role in that entire flow: stating intent and approving the purchase. Discovery, page generation, product evaluation, and transaction completion are all handled by AI systems. The human touches only the two endpoints of the chain.

Every piece of technology in that chain exists in production today. Chrome auto browse is live for 3 billion Chrome users. A2A has 150+ organizational supporters. ACP underpins Stripe’s agentic commerce infrastructure (ChatGPT’s Instant Checkout failed on execution, not protocol). UCP connects Shopify, Google, Walmart, and Target. Patent US12536233B1 is granted. No single company has assembled the full loop yet, but every component is operational.

Who’s Building The Non-Human Web

Here’s where it gets interesting. Map out who’s building what, and a pattern emerges:

Layer                | What                   | Who
Page generation      | AI landing pages       | Google
Content-as-API       | WebMCP, NLWeb          | Google, Microsoft
Agent infrastructure | MCP, A2A               | Anthropic, Google
Agent browsers       | Chrome, Comet, Copilot | Google, Perplexity, Microsoft
Agent commerce       | ACP, UCP               | Stripe + OpenAI, Shopify + Google
Edge delivery        | Markdown for Agents    | Cloudflare

Google appears in five of six layers: page generation (patent US12536233B1), content-as-API (WebMCP), agent infrastructure (A2A), agent browsers (Chrome auto browse), and commerce (UCP). Google is positioning itself to mediate the non-human web the same way Google mediates the human one through Search.

The Agentic AI Foundation (AAIF), formed under the Linux Foundation with Anthropic, OpenAI, Google, and Microsoft as platinum members, provides the governance layer. The AAIF functions as the W3C for the agentic web: the vendor-neutral body that decides which protocols become standards for agent interoperability.

What Website Owners Need To Know

This isn’t an optimization checklist. It’s three structural shifts in what your website is for.

Your Data Layer Is Your Website

Google’s patent generates landing pages from product feed data, making product feeds the most important asset an ecommerce business maintains. NLWeb queries Schema.org markup instead of rendering pages, making structured markup the front door to your content. WebMCP exposes site capabilities as function calls, making tool definitions the user interface agents interact with.

Structured data, product feeds, JSON-LD, and API surfaces have traditionally been treated as backend infrastructure. In the non-human web, these data layers become the primary way a business reaches customers. Product feed accuracy (specs, pricing, stock levels, images) matters more than homepage design when AI systems generate the page from that feed.
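
As a concrete picture of that data layer, here is a minimal sketch that turns one product record into Schema.org `Product` markup in JSON-LD. The function name, the sample product, and the URL are made up for illustration; the `@type`, `offers`, `priceCurrency`, and `availability` terms come from the Schema.org vocabulary.

```python
import json


def product_json_ld(name, sku, price, currency, in_stock, image_url):
    """Build a Schema.org Product description as a JSON-LD dictionary."""
    availability = "https://schema.org/InStock" if in_stock else "https://schema.org/OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "sku": sku,
        "image": image_url,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": availability,
        },
    }


# Hypothetical product record, as it might come from a product feed.
markup = product_json_ld(
    name="Fleece Jacket",
    sku="FJ-001",
    price=129.0,
    currency="USD",
    in_stock=True,
    image_url="https://example.com/images/fleece-jacket.jpg",
)
print(json.dumps(markup, indent=2))
```

If the price or stock level in the record is wrong, every system that reads this markup inherits the error, which is why feed accuracy carries so much weight here.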

Trust Is The Moat

AI can generate a page. It cannot generate a reason to seek you out by name.

Direct traffic, email subscribers, community members, and brand reputation persist when the page itself becomes replaceable. An AI agent can build a product page, but no AI agent can build the trust that makes a consumer (or their agent) request a specific brand by name.

The brands that matter in the non-human web are the ones people tell their agents to find. “Get me a fleece jacket” is a commodity query. “Get me a fleece jacket from Patagonia” is a brand moat.

The Measurement Problem

How do you measure a page you didn’t build? How do you A/B test against something Google generates dynamically? How do you attribute a conversion that happened inside ChatGPT, initiated by an agent acting on behalf of a user who never saw your website?

Traditional web analytics (page views, sessions, bounce rate, time on site) assume two things: a human visitor and a page you control. On the non-human web, neither assumption holds. A Google-generated landing page isn’t yours. A ChatGPT checkout session doesn’t register in your analytics.

I don’t have a clean answer here, and neither does anyone else. Measurement is the genuinely unsolved problem of the non-human web. New metrics will need to track agent discoverability, agent conversion rate, and data feed quality. But as of March 2026, the measurement infrastructure hasn’t caught up to the technology it needs to measure.

Four Predictions For 2026-2027

Four things to watch over the next 12-18 months.

Google ships patent US12536233B1, or something like it. The technology for scoring and replacing landing pages exists. The business incentive exists. Google has a history of introducing features in ads first, then expanding (Google Shopping went from free to paid to essential). AI-generated landing pages will likely appear in shopping ads first, then broaden to other verticals. Landing page quality scores in Google Ads serve as the early warning system for which pages Google considers replaceable.

Agent traffic becomes measurable. Analytics platforms will need to distinguish human sessions from agent sessions. BrightEdge reports AI agents account for roughly 33% of organic search activity as of early 2026. WP Engine’s traffic data shows 1 AI bot visit for every 31 human visits by Q4 2025, up from 1 per 200 at the start of that year. Agent traffic ratios will accelerate further as Chrome auto browse rolls out globally beyond the US. New metrics around agent conversion rate and agent discoverability will emerge from necessity.

The protocol stack consolidates. MCP, A2A, NLWeb, and WebMCP form a coherent stack covering tool access, agent communication, content querying, and browser-level integration. Expect more interoperability between these protocols and fewer competing standards. The Agentic AI Foundation (AAIF) accelerates consolidation. Within 18 months, “does your site support MCP?” will be as standard a question as “is your site mobile-friendly?”

Brand differentiation gets harder and more important. When AI generates pages and agents do the shopping, the only defensible position is being the brand people (and their agents) seek out by name. Direct relationships, owned audiences, trust signals. Everything else is a commodity.

The Web Splits In Two

When Shopify auto-opted merchants into agentic shopping, I asked whether your website just became optional. The answer is more nuanced than optional or essential. It’s becoming something different.

The web isn’t dying. It’s splitting.

The transactional web (product listings, checkout flows, information retrieval, comparison shopping) is going non-human first. AI generates the landing pages. AI agents visit and transact on those pages. Humans approve decisions at the endpoints. Google’s patent lives in the transactional web, and the economics of conversion optimization push hardest toward automation in this layer.

The experiential web (brand storytelling, community, content that rewards sustained attention, design that creates emotional response) stays human. Not because AI can’t generate brand experiences, but because the value of those experiences comes from the human connection behind them. Nobody tells their agent to “go enjoy a brand experience on my behalf.”

Your website’s new job description: data source for the agents, trust anchor for the humans, brand home for both. The companies that treat their structured data, product feeds, and API surfaces with the same care they give their homepage design are the ones that show up in both worlds.

The non-human web isn’t replacing the human web. It’s growing alongside it. Your job is to show up in both.


This was originally published on No Hacks.


Featured Image: Yaaaaayy/Shutterstock

The Facts About Google Click Signals, Rankings, And SEO via @sejournal, @martinibuster

Clicks as a ranking-related signal have been a subject of debate for over twenty years, although nowadays most SEOs understand that clicks are not a direct ranking factor. The simple truth about clicks is that they are raw data and, surprisingly, processed with some similarity to human rater scores.

Clicks Are A Raw Signal

The DOJ Antitrust memorandum opinion from September 2025 mentions clicks as a “raw signal” that Google uses. It also categorizes content and search queries as raw signals. This is important because a raw signal is the lowest-level data point which is processed into higher level ranking signals or used for training a model like RankEmbed and its successor, RankEmbedBERT.

Those are considered raw signals because they are:

  • Directly observed
  • But not yet interpreted or used for training data

The DOJ document quotes professor James Allan, who gave expert testimony on behalf of Google:

“Signals range in complexity. There are “raw” signals, like the number of clicks, the content of a web page, and the terms within a query.

…These signals can be created with simple methods, such as counting occurrences (e.g., how many times a web page was clicked in response to a particular query). Id. at 2859:3–2860:21 (Allan) (discussing Navboost signal)”

He then contrasts the raw signals with how they are processed:

“At the other end of the spectrum are innovative deep-learning models, which are machine-learning models that discern complex patterns in large datasets.

Deep models find and exploit patterns in vast data sets. They add unique capabilities at high cost.”

Professor Allan explains that “top-level signals” are used to produce the “final” scores for a web page, including popularity and quality.

Raw Signals Are Data To Be Further Processed

Navboost is mentioned several times in the September 2025 antitrust document as popularity data. It’s not mentioned in the context of clicks having a ranking effect on individual sites.

It’s referred to as a way to measure popularity and intent:

“…popularity as measured by user intent and feedback systems including Navboost/Glue…”

And elsewhere, in the context of explaining why some of the Navboost data is privileged:

“They are ‘popularity as measured by user intent and feedback systems including Navboost/Glue’…”

And in the context of the proposed remedy:

“Under the proposed remedy, Google must make available to Qualified Competitors …the following datasets:

1. User-side Data used to build, create, or operate the GLUE statistical model(s);

2. User-side Data used to train, build, or operate the RankEmbed model(s); and

3. The User-side Data used as training data for GenAI Models used in Search or any GenAI Product that can be used to access Search.

Google uses the first two datasets to build search signals and the third to train and refine the models underlying AI Overviews and (arguably) the Gemini app.”

Clicks, like human rater scores, are just a raw signal that is used further up the algorithm chain, either to train AI models to better match web pages to queries or to generate a quality or relevance signal that a ranking engine or rank modifier engine then adds to the rest of the ranking signals.

70 Days Of Search Logs

The DOJ document makes reference to using 70 days of search logs. But that’s just eleven words in a larger context.

Here is the part that is frequently quoted:

“70 days of search logs plus scores generated by human raters”

I get it, it’s simple and direct. But there is more context to it:

“RankEmbed and its later iteration RankEmbedBERT are ranking models that rely on two main sources of data: [Redacted]% of 70 days of search logs plus scores generated by human raters and used by Google to measure the quality of organic search results.”

The 70 days of search logs are not click data used for ranking purposes in Google, AI Mode, or Gemini. It’s data in aggregate that is further processed in order to train specialized AI models like RankEmbedBERT that in turn rank web pages based on natural language analysis.

That part of the DOJ document does not claim that Google is directly using click data for ranking search results. It’s data, like the human rater data, that’s used by other systems for training data or to be further processed.

What Is Google’s RankEmbed?

RankEmbed is a natural language approach to identifying relevant documents and ranking them.

The same DOJ document explains:

“The RankEmbed model itself is an AI-based, deep-learning system that has strong natural-language understanding. This allows the model to more efficiently identify the best documents to retrieve, even if a query lacks certain terms.”

It’s trained on less data than previous models. The data partially consists of query terms and web page pairs:

“…RankEmbed is trained on 1/100th of the data used to train earlier ranking models yet provides higher quality search results.

…Among the underlying training data is information about the query, including the salient terms that Google has derived from the query, and the resultant web pages.”

That’s training data for training a model to recognize how query terms are relevant to web pages.

The same document explains:

“The data underlying RankEmbed models is a combination of click-and-query data and scoring of web pages by human raters.”

It’s crystal clear that in the context of this specific passage, it’s describing the use of click data (and human rater data) to train AI models, not to directly influence rankings.

What About Google’s Click Ranking Patent?

Way back in 2006 Google filed a patent related to clicks called, Modifying search result ranking based on implicit user feedback. The invention is about the mathematical formula for creating a “measure of relevance” out of the aggregated raw data of clicks (plural).

The patent distinguishes between the creation of the signal and the act of ranking itself. The “measure of relevance” is output to a ranking engine, which then can add it to existing ranking scores to rank search results for new searches.

Here’s what the patent describes:

“A ranking Sub-system can include a rank modifier engine that uses implicit user feedback to cause re-ranking of search results in order to improve the final ranking presented to a user of an information retrieval system.

User selections of search results (click data) can be tracked and transformed into a click fraction that can be used to re-rank future search results.”

That “click fraction” is a measure of relevance. The invention described in the patent isn’t about tracking the click; it’s about the mathematical measure (the click fraction) that results from combining all those individual clicks together. That includes the Short Click, Medium Click, Long Click, and the Last Click.

Technically, it’s called the LCIC (Long Click divided by Clicks) Fraction. It’s “clicks” plural because it’s making decisions based on the sums of many clicks (aggregate), not the individual click.

That click fraction is an aggregate because:

  • Summation:
    The “first number” used for ranking is the sum of all those individual weighted clicks for a specific query-document pair.
  • Normalization:
    It takes that sum and divides it by the total count of all clicks (the “second number”).
  • Statistical Smoothing:
    The system applies “smoothing factors” to this aggregate number to ensure that a single click on a “rare” query doesn’t unfairly skew the results, especially for spammers.

That 2006 patent describes its weighting formula like this:

“A base LCC click fraction can be defined as:

LCC_BASE = [#WC(Q,D)] / [#C(Q,D) + S0]

where #WC(Q,D) is the sum of weighted clicks for a query URL…pair, #C(Q,D) is the total number of clicks (ordinal count, not weighted) for the query-URL pair, and S0 is a smoothing factor.”

That formula describes summing and dividing the data from many users to create a single score for a document. The “query-URL” pair is a “bucket” of data that stores the click behavior of every user who ever typed that specific query and clicked that specific search result. The smoothing factor is the anti-spam part that includes not counting single clicks on rare search queries.
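
The formula can be worked through in a few lines of code. This is a sketch under assumptions: the per-click weights and the smoothing value are hypothetical, since the quoted passage gives the structure of the formula but no numbers.

```python
# Hypothetical weights per click type; the quoted patent text gives no values.
CLICK_WEIGHTS = {"short": 0.1, "medium": 0.5, "long": 1.0}


def base_click_fraction(clicks, smoothing):
    """Compute weighted clicks / (click count + smoothing) for one query-URL pair."""
    weighted_clicks = sum(CLICK_WEIGHTS[c] for c in clicks)  # #WC(Q,D)
    click_count = len(clicks)                                # #C(Q,D)
    return weighted_clicks / (click_count + smoothing)       # S0 is the smoothing term


# One long click on a rare query versus many clicks on a popular one.
rare_query_clicks = ["long"]
popular_query_clicks = ["long"] * 80 + ["short"] * 20

S0 = 10  # hypothetical smoothing factor
print(round(base_click_fraction(rare_query_clicks, S0), 3))
print(round(base_click_fraction(popular_query_clicks, S0), 3))
```

With these made-up numbers, the single long click scores far lower than the popular pair even though it is a "perfect" click, because the smoothing factor dominates the denominator when the click count is small. The measure depends on aggregate behavior, not on any one click.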

Even way back in 2006, clicks were just raw data that was transformed further up the chain, across multiple stages of aggregation, into a statistical measure of relevance before it ever reached the ranking stage. In this patent, the clicks themselves are not ranking factors that directly influence whether a site is ranked or not. They were used in aggregate as a measure of relevance, which in turn was fed into another engine for ranking.

By the time the information reaches the ranking engine, the raw data has been transformed from individual user actions into an aggregate measure of relevance.

  • Thinking about clicks in relation to ranking is not as simple as clicks drive search rankings.
  • Clicks are just raw data.
  • Clicks are used to train AI systems like RankEmbedBERT.
  • Clicks are not directly influencing search results. They have always been raw data, the starting point for systems that use the data in aggregate to create a signal that is then mixed into ranking decision making systems at Google.
  • So yes, like human rater data, raw data is processed to create a signal or to train AI systems.

Read the DOJ memorandum in PDF form here.

Read about four research papers about CTR.

Read the 2006 Google patent, Modifying search result ranking based on implicit user feedback.

Featured Image by Shutterstock/Carkhe

Google’s Patent On Autonomous Search Results via @sejournal, @martinibuster

The United States Patent and Trademark Office recently published Google’s continuation of a patent for a search system that detects when there is no satisfactory answer for a query and waits to automatically deliver the answer when it becomes available.

Search And AI Assistant

The patent, published in February 2026, is a continuation of an older patent, and the main change is that it applies the invention within the context of an AI assistant. The invention addresses the problem of answering a question when no actual answer is available at the time a user makes the query. The system waits until there is a satisfactory answer, then circles back to the user with it, without the user having to ask again.

The patent is titled, Autonomously providing search results post-facto, including in assistant context. Although the patent mentions quality thresholds, those thresholds are defined in the sense of whether the answer meets the user’s needs.

The patent describes six scenarios that would trigger the invention:

  1. When no search results meet defined quality or authoritative-answer criteria.
  2. When results exist but fail to provide a definitive or authoritative answer that satisfies those criteria.
  3. When no results meet quality criteria because the information is not yet available.
  4. When a query seeks a specific answer and no result satisfies the required criteria.
  5. When a resource later satisfies the defined criteria after previously lacking required information.
  6. When a previously available resource is refined or updated so that it now meets the criteria.

Useful And Complete Answers

Google’s patent says that the invention is a solution for times when there are no useful or complete answers because the information does not yet exist or is not good enough, forcing users to keep searching repeatedly.

The system checks if results meet:

  • A quality standard
  • An authoritativeness standard
  • Or a completeness standard

If the current answers don’t meet those standards, the system will store the query and monitor for new or updated information. Once it becomes available it will send the results to the user later without them searching again.
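The store-and-monitor flow described above can be sketched roughly as follows. This is only an illustration of the idea, not the patent's implementation: the class names, the `quality` field, and the `min_quality` value of 0.8 are all assumptions made for the example, since the patent does not publish a numeric threshold.

```python
from dataclasses import dataclass, field


@dataclass
class PendingQuery:
    user_id: str
    text: str


@dataclass
class DeferredAnswerStore:
    # Hypothetical threshold; the patent does not publish a numeric value.
    min_quality: float = 0.8
    pending: list = field(default_factory=list)

    def search(self, user_id, text, results):
        """Return a result now if one meets the bar, else store the query."""
        good = [r for r in results if r["quality"] >= self.min_quality]
        if good:
            return max(good, key=lambda r: r["quality"])
        self.pending.append(PendingQuery(user_id, text))
        return None

    def on_new_content(self, text, result):
        """Called when new or updated content appears for a query text.

        Returns (user_id, result) deliveries and clears the fulfilled queries.
        """
        if result["quality"] < self.min_quality:
            return []
        matched = [q for q in self.pending if q.text == text]
        self.pending = [q for q in self.pending if q.text != text]
        return [(q.user_id, result) for q in matched]


store = DeferredAnswerStore()
# No result is good enough yet, so the query is stored instead of answered.
first = store.search("user-1", "concert ticket sale date", [{"url": "a", "quality": 0.3}])
# Later, a page that meets the bar appears and is delivered without a new search.
deliveries = store.on_new_content("concert ticket sale date", {"url": "b", "quality": 0.9})
```

In this toy version the match is an exact string comparison on the query text. A real system would need a far more flexible way of deciding that new content answers an old query.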

Follow-Up Questions Are Not Necessary

What is novel about the invention is that it enables follow-up delivery of results after the original query without requiring a new follow-up question. It also surfaces search results proactively in notifications or assistant conversations.

At a later time, when new or updated information becomes available that satisfies the criteria, the system proactively delivers that information to the user. This delivery can occur through notifications, within an unrelated interaction, or during a later conversation with an automated assistant.

The system may also optionally notify the user that no good results are currently available and ask if they want to be informed when better results appear.

This system transforms search from a one-time, user-initiated action into a persistent, ongoing process where the system continues working in the background and updates the user when meaningful information becomes available.

Cross-Device Continuity

An interesting feature of this invention is that it can reach out to the user across multiple devices.

Here is where it’s outlined:

[0012] “In some implementations, the query is received on an additional computing device that is in addition to the computing device for which the content is provided for presentation to the user.”

This capability is highlighted again in section [0067]:

“For example, the content may be provided for presentation to the user via the same computing device the user utilized to submit the query and/or via a separate computing device.”

It can also go cross-device as a visual and/or audible output across devices and in the form of an automated assistant, and can present the information when the user is interacting with the automated assistant in a different context, describing an “ecosystem” of devices.

Lastly, the patent explains that the information can be surfaced when the user is interfacing with the automated assistant in a completely different context:

[0040] “…the content may be provided for presentation to the user via the same computing device the user utilized to submit the query and/or via a separate computing device. The content may be provided for presentation in various forms. For example, the content may be provided as a visual and/or audible push notification on a mobile computing device of the user, and may be surfaced independent of the user again submitting the query and/or another query.

Also, for example, the content may be presented as visual and/or audible output of an automated assistant during a dialog session between the user and the automated assistant, where the dialog session is unrelated to the query and/or another query seeking similar information.”

Takeaways

The patent (Autonomously providing search results post-facto, including in assistant context) is in line with Google’s vision of task-based agentic search, where AI assistants help users accomplish things. This patent could be applied to an AI agent that is asked for tickets to an event when the tickets aren’t yet available. Or it could be applied to making restaurant reservations when the dates open up. Both of those scenarios are related to task-based agentic search (TBAS).

Here are seven takeaways:

  1. The system stores data associated with the user about unresolved queries, allowing it to track unanswered information needs over time rather than treating each search as a one-off event.
  2. It delivers results within future interactions, including unrelated assistant conversations, not just through standalone notifications.
  3. The notifications can happen across an ecosystem of devices.
  4. A lack of results is defined by failing to meet quality criteria, which can be the absence of information, the answer not being available yet, or the answer not being available from authoritative sources.
  5. The system focuses on queries that seek specific answers, rather than general informational searches.
  6. It supports cross-device continuity, enabling a query on one device to be fulfilled later on another.
  7. The design reduces repeated searches by eliminating the need for users to check back, then autonomously circling back when the information is available.

Featured Image by Shutterstock/uyabdami

Google’s SAGE Agentic AI Research: What It Means For SEO via @sejournal, @martinibuster

Google published a research paper about creating a challenging dataset for training AI agents for deep research. The paper offers insights into how agentic AI deep research works, which implies insights for optimizing content.

The acronym SAGE stands for Steerable Agentic Data Generation for Deep Search with Execution Feedback.

Synthetic Question And Answer Pairs

The researchers noted that the previous state-of-the-art AI training datasets (like Musique and HotpotQA) required no more than four reasoning steps in order to answer the questions. On the number of searches needed to answer a question, Musique averaged 2.7 searches per question and HotpotQA averaged 2.1 searches. Another commonly used dataset named Natural Questions (NQ) only required an average of 1.3 searches per question.

These datasets that are used to train AI agents created a training gap for deep search tasks that required more reasoning steps and a greater number of searches. How can you train an AI agent for complex real-world deep search tasks if the AI agents haven’t been trained to tackle genuinely difficult questions?

The researchers created a system called SAGE that automatically generates high-quality, complex question-answer pairs for training AI search agents. SAGE is a “dual-agent” system where one AI writes a question and a second “search agent” AI tries to solve it, providing feedback on the complexity of the question.

  • The goal of the first AI is to write a question that’s challenging to answer and requires many reasoning steps and multiple searches to solve.
  • The goal of the second AI is to measure whether the question is answerable and calculate how difficult it is (minimum number of search steps required).

The key to SAGE is that if the second AI solves the question too easily or gets it wrong, the specific steps and documents it found (the execution trace) are fed back to the first AI. This feedback enables the first AI to identify one of four shortcuts that enable the second AI to solve the question in fewer steps.
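The writer-and-solver feedback loop can be pictured with a small sketch. It is a simplification under stated assumptions rather than the SAGE implementation: `write_question` and `solve` are hypothetical stand-ins for the two agents, and `min_steps` and `max_rounds` are made-up parameters for the example.

```python
def generate_hard_question(write_question, solve, min_steps=3, max_rounds=5):
    """Iterate writer and solver until the question needs at least min_steps searches.

    write_question(feedback) -> question
    solve(question) -> (answered, trace)
    Both callables are hypothetical stand-ins for the two agents.
    """
    feedback = None
    for _ in range(max_rounds):
        question = write_question(feedback)
        answered, trace = solve(question)
        if answered and len(trace) >= min_steps:
            return question, trace
        # Too easy or unanswerable: hand the execution trace back to the writer.
        feedback = {"answered": answered, "trace": trace}
    return None, []


def toy_writer(feedback):
    # Each round of feedback makes the toy question require one more hop.
    hops = 1 if feedback is None else len(feedback["trace"]) + 1
    return {"hops": hops}


def toy_solver(question):
    # Pretend each required hop costs exactly one search.
    trace = [f"search {i + 1}" for i in range(question["hops"])]
    return True, trace


question, trace = generate_hard_question(toy_writer, toy_solver)
```

With these toy agents the loop stops on the third round, once the question requires three searches.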

It’s these shortcuts that provide insights into how to rank better for deep research tasks.

Four Ways That Deep Research Was Avoided

The goal of the paper was to create a set of question and answer pairs that were so difficult that it took the AI agent multiple steps to solve. The feedback showed four ways that made it less necessary for the AI agent to do additional searches to find an answer.

Four Reasons Deep Research Was Unnecessary

  1. Information Co-Location
    This is the most common shortcut, accounting for 35% of the times when deep research was not necessary. This happens when two or more pieces of information needed to answer a question are located in the same document. Instead of searching twice, the AI finds both answers in one “hop”.
  2. Multi-query Collapse
    This happened in 21% of cases. The cause is when a single, clever search query retrieves enough information from different documents to solve multiple parts of the problem at once. This “collapses” what should have been a multi-step process into a single step.
  3. Superficial Complexity
    This accounts for 13% of times when deep research was not necessary. The question looks long and complicated to a human, but a search engine (that an AI agent is using) can jump straight to the answer without needing to reason through the intermediate steps.
  4. Overly Specific Questions
    31% of the failures are questions that contain so much detail that the answer becomes obvious in the very first search, removing the need for any “deep” investigation.

The researchers found that some questions look hard but are actually relatively easy because the information is “co-located” in one document. If an agent can answer a 4-hop question in 1 hop because one website was comprehensive enough to have all the answers, that data point is considered a failure for training the agent to reason. But it’s still something that can happen in real life, and the agent will take advantage of finding all the information on one page.
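The 4-hop versus 1-hop contrast can be illustrated with a toy calculation. This sketch is not from the paper. It assumes each document is simply a set of facts, uses hypothetical fact names, and counts documents with a greedy pick, which ignores the reasoning dependencies that real multi-hop questions have.

```python
def hops_needed(required_facts, documents):
    """Greedy count of documents an agent must open to cover every fact."""
    remaining = set(required_facts)
    hops = 0
    while remaining:
        # Pick the document covering the most still-missing facts.
        best = max(documents, key=lambda doc: len(remaining & doc))
        if not remaining & best:
            raise ValueError("some facts are not in any document")
        remaining -= best
        hops += 1
    return hops


# Hypothetical facts needed to answer one question.
facts = {"founder", "founding_year", "headquarters", "ceo"}
scattered = [{"founder"}, {"founding_year"}, {"headquarters"}, {"ceo"}]
co_located = scattered + [{"founder", "founding_year", "headquarters", "ceo"}]

scattered_hops = hops_needed(facts, scattered)
co_located_hops = hops_needed(facts, co_located)
```

When the facts are scattered the toy agent opens four documents. Once a single comprehensive page exists, it opens one.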

SEO Takeaways

It’s possible to gain some insights into what kinds of content satisfy deep research. While these aren’t necessarily tactics for ranking better in agentic AI deep search, these insights do show what kinds of scenarios caused the AI agents to find all or most of the answers in one web page.

“Information Co-location” Could Be An SEO Win
The researchers found that when multiple pieces of information required to answer a question occur in the same document, it reduces the number of search steps needed. For a publisher, this means consolidating “scattered” facts into one page prevents an AI agent from having to “hop” to a competitor’s site to find the rest of the answer.

Triggering “Multi-query Collapse”
The authors identified a phenomenon where information from different documents can be retrieved using a single query. By structuring content to answer several sub-questions at once, you enable the agent to find the full solution on your page faster, effectively “short-circuiting” the long reasoning chain the agent was prepared to undertake.

Eliminating “Shortcuts” (The Reasoning Gap)
The research paper notes that the data generator fails when it accidentally creates a “shortcut” to the answer. As an SEO, your goal is to be that shortcut—providing the specific data points like calculations, dates, or names that allow the agent to reach the final answer without further exploration.

The Goal Is Still To Rank In Classic Search

For an SEO and a publisher, these shortcuts underline the value of creating a comprehensive document because it removes the need for an AI agent to hop somewhere else. This doesn’t mean it will be helpful to put all the information on one page. If it makes sense for a user, it may be useful to link out from one page to another page for related information.

The reason I say that is because the AI agent is conducting classic search looking for answers, so the goal remains to optimize a web page for classic search. Furthermore, in this research, the AI agent is pulling from the top three ranked web pages for each query that it’s executing. I don’t know if this is how agentic AI search works in a live environment, but this is something to consider.

In fact, one of the tests that the researchers did was conducted using the Serper API to extract search results from Google.

So when it comes to ranking in agentic AI search, consider these takeaways:

  • It may be useful to consider the importance of ranking in the top three.
  • Do optimize web pages for classic search.
  • Do not optimize web pages for AI search.
  • If it’s possible to be comprehensive, remain on-topic, and rank in the top three, then do that.
  • Interlink to relevant pages to help those rank in classic search, preferably in the top three (to be safe).

It could be that agentic AI search will consider pulling from more than the top three in classic search. But it may be helpful to set the goal of ranking for the top 3 in classic search and to focus on ranking other pages that may be a part of the multi-hop deep research.

The research paper was published by Google on January 26, 2026. It’s available in PDF form: SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback.

Featured Image by Shutterstock/Shutterstock AI Generator

Google’s New User Intent Extraction Method via @sejournal, @martinibuster

Google published a research paper on how to extract user intent from user interactions that can then be used for autonomous agents. The method they discovered uses on-device small models that do not need to send data back to Google, which means that a user’s privacy is protected.

The researchers discovered they were able to solve the problem by splitting it into two tasks. Their solution worked so well it was able to beat the base performance of multi-modal large language models (MLLMs) in massive data centers.

Smaller Models On Browsers And Devices

The focus of the research is on identifying the user intent through the series of actions that a user takes on their mobile device or browser while also keeping that information on the device so that no information is sent back to Google. That means the processing must happen on the device.

They accomplished this in two stages.

  1. In the first stage, the model on the device summarizes what the user was doing.
  2. The sequence of summaries is then sent to a second model that identifies the user intent.

The researchers explained:

“…our two-stage approach demonstrates superior performance compared to both smaller models and a state-of-the-art large MLLM, independent of dataset and model type.
Our approach also naturally handles scenarios with noisy data that traditional supervised fine-tuning methods struggle with.”

Intent Extraction From UI Interactions

Intent extraction from screenshots and text descriptions of user interactions was a technique that was proposed in 2025 using Multimodal Large Language Models (MLLMs). The researchers say they followed this approach to their problem but using an improved prompt.

The researchers explained that extracting intent is not a trivial problem to solve and that there are multiple errors that can happen along the steps. The researchers use the word trajectory to describe a user journey within a mobile or web application, represented as a sequence of interactions.

The user journey (trajectory) is turned into a formula where each interaction step consists of two parts:

  1. An Observation
    This is the visual state of the screen (screenshot) of where the user is at that step.
  2. An Action
    The specific action that the user performed on that screen (like clicking a button, typing text, or clicking a link).
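One way to picture this observation-and-action structure is as a small data type. The class and field names below are assumptions for illustration, and the screenshot is represented by a placeholder file name rather than real image data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interaction:
    observation: str  # stand-in for the screenshot, e.g. a file name
    action: str       # textual description of what the user did


@dataclass
class Trajectory:
    steps: List[Interaction]

    def actions(self) -> List[str]:
        """Return the action descriptions in the order they happened."""
        return [step.action for step in self.steps]


# A hypothetical two-step user journey.
trajectory = Trajectory(steps=[
    Interaction("screen_001.png", "typed 'running shoes' into search box"),
    Interaction("screen_002.png", "clicked first product link"),
])
```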

They described three qualities of a good extracted intent:

  • “faithful: only describes things that actually occur in the trajectory;
  • comprehensive: provides all of the information about the user intent required to re-enact the trajectory;
  • and relevant: does not contain extraneous information beyond what is needed for comprehensiveness.”

Challenging To Evaluate Extracted Intents

The researchers explain that grading extracted intent is difficult because user intents contain complex details (like dates or transaction data) and the user intents are inherently subjective, containing ambiguities, which is a hard problem to solve. The reason trajectories are subjective is because the underlying motivations are ambiguous.

For example, did a user choose a product because of the price or the features? The actions are visible but the motivations are not. Previous research shows that intents between humans matched 80% on web trajectories and 76% on mobile trajectories, so it’s not like a given trajectory can always indicate a specific intent.

Two-Stage Approach

After ruling out other methods like Chain of Thought (CoT) reasoning (because small language models struggled with the reasoning), they chose a two-stage approach that emulated Chain of Thought reasoning.

The researchers explained their two-stage approach:

“First, we use prompting to generate a summary for each interaction (consisting of a visual screenshot and textual action representation) in a trajectory. This stage is prompt-based as there is currently no training data available with summary labels for individual interactions.

Second, we feed all of the interaction-level summaries into a second stage model to generate an overall intent description. We apply fine-tuning in the second stage…”

The First Stage: Screenshot Summary

For the first stage, the summary of each interaction’s screenshot is divided into two parts, but there is also a third part.

  1. A description of what’s on the screen.
  2. A description of the user’s action.

The third component (speculative intent) is a way to get rid of speculation about the user’s intent, where the model is basically guessing at what’s going on. This third part is labeled “speculative intent” and they actually just get rid of it. Surprisingly, allowing the model to speculate and then getting rid of that speculation leads to a higher quality result.

The researchers cycled through multiple prompting strategies and this was the one that worked the best.
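The speculate-then-discard step can be sketched as a simple post-processing function. The dictionary keys (`screen`, `action`, `speculative_intent`) and the text format passed to the second stage are assumptions for this example. The paper's actual prompt and output format may differ.

```python
def summarize_interaction(model_output):
    """Keep the screen and action descriptions; discard the model's guess."""
    return {
        "screen": model_output["screen"],
        "action": model_output["action"],
        # "speculative_intent" is deliberately not copied over.
    }


def build_stage_two_input(stage_one_outputs):
    """Join the cleaned per-step summaries into one text for stage two."""
    lines = []
    for i, output in enumerate(stage_one_outputs, start=1):
        summary = summarize_interaction(output)
        lines.append(f"Step {i}: {summary['screen']} | {summary['action']}")
    return "\n".join(lines)


# Hypothetical first-stage output for a single interaction.
raw = [
    {
        "screen": "Search results for flights",
        "action": "tapped the cheapest fare",
        "speculative_intent": "user wants a budget holiday",
    },
]
stage_two_input = build_stage_two_input(raw)
```

The guess is generated but never reaches the second model, which only sees what was on the screen and what the user did.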

The Second Stage: Generating Overall Intent Description

For the second stage, the researchers fine-tuned a model for generating an overall intent description. They fine-tuned the model with training data that is made up of two parts:

  1. Summaries that represent all interactions in the trajectory
  2. The matching ground truth that describes the overall intent for each of the trajectories.

The model initially tended to hallucinate because the first part (input summaries) is potentially incomplete, while the “target intents” are complete. That caused the model to learn to fill in the missing parts in order to make the input summaries match the target intents.

They solved this problem by “refining” the target intents by removing details that aren’t reflected in the input summaries. This trained the model to infer the intents based only on the inputs.
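The refinement idea can be shown with a deliberately crude sketch. It assumes the target intent has already been split into a list of detail strings and uses plain substring matching to decide whether a detail is supported. The article does not say how the researchers performed the refinement, so this is an illustration of the goal, not their method.

```python
def refine_target(target_details, summaries):
    """Keep only target details that are supported by some input summary."""
    evidence = " ".join(summaries).lower()
    return [d for d in target_details if d.lower() in evidence]


# Hypothetical summaries and a target intent that mentions an unseen detail.
summaries = ["Opened flight search for Boston", "Selected a flight on May 3"]
target = ["Boston", "May 3", "window seat"]
refined = refine_target(target, summaries)
```

Here "window seat" is dropped because nothing in the summaries supports it, so a model trained on the refined target is not rewarded for inventing it.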

The researchers compared four different approaches and settled on this approach because it performed so well.

Ethical Considerations And Limitations

The research paper ends by summarizing potential ethical issues where an autonomous agent might take actions that are not in the user’s interest and stressed the necessity to build the proper guardrails.

The authors also acknowledged limitations in the research that might limit generalizability of the results. For example, the testing was done only on Android and web environments, which means that the results might not generalize to Apple devices. Another limitation is that the research was limited to users in the United States in the English language.

There is nothing in the research paper or the accompanying blog post that suggests that these processes for extracting user intent are currently in use. The blog post ends by communicating that the described approach is helpful:

“Ultimately, as models improve in performance and mobile devices acquire more processing power, we hope that on-device intent understanding can become a building block for many assistive features on mobile devices going forward.”

Takeaways

Neither the blog post about this research nor the research paper itself describes the results of these processes as something that might be used in AI search or classic search. It does mention the context of autonomous agents.

The research paper explicitly mentions the context of an autonomous agent on the device that observes how the user is interacting with a user interface and is then able to infer what the goal (the intent) of those actions is.

The paper lists two specific applications for this technology:

  1. Proactive Assistance:
    An agent that watches what a user is doing for “enhanced personalization” and “improved work efficiency”.
  2. Personalized Memory
    The process enables a device to “remember” past activities as an intent for later.

Shows The Direction Google Is Heading In

While this might not be used right away, it shows the direction that Google is heading, where small models on a device will be watching user interactions and sometimes stepping in to assist users based on their intent. Intent here is used in the sense of understanding what a user is trying to do.

Read Google’s blog post here:

Small models, big results: Achieving superior intent extraction through decomposition

Read the PDF research paper:

Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition (PDF)

Featured Image by Shutterstock/ViDI Studio

How Recommender Systems Like Google Discover May Work via @sejournal, @martinibuster

Google Discover is largely a mystery to publishers and the search marketing community even though Google has published official guidance about what it is and what they feel publishers should know about it. Nevertheless, it’s so mysterious that it’s generally not even considered as a recommender system, yet that is what it is. This is a review of a classic research paper that shows how to scale a recommender system. Although it’s for YouTube, it’s not hard to imagine how this kind of system can be adapted to Google Discover.

Recommender Systems

Google Discover belongs to the class of systems known as recommender systems. A classic recommender system I remember is the MovieLens system from way back in 1997. It is a university science department project that allowed users to rate movies and it would use those ratings to recommend movies to watch. The way it worked was that people who tend to like these kinds of movies tend to also like these other kinds of movies. But these kinds of algorithms have limitations that make them fall short for the scale necessary to personalize recommendations for YouTube or Google Discover.

Two-Tower Recommender System Model

The modern style of recommender system is sometimes referred to as the Two-Tower architecture or the Two-Tower model. The Two-Tower model came about as a solution for YouTube, even though the original research paper (Deep Neural Networks for YouTube Recommendations) does not use this term.

It may seem counterintuitive to look to YouTube to understand how the Google Discover algorithm works, but the fact is that the system Google developed for YouTube became the foundation for how to scale a recommender system for an environment where massive amounts of content are generated every hour of the day, 24 hours a day.

It’s called the Two-Tower architecture because there are two representations that are matched against each other, like two towers.

In this model, which handles the initial “retrieval” of content from the database, a neural network processes user information to produce a user embedding, while content items are represented by their own embeddings. These two representations are matched using similarity scoring rather than being combined inside a single network.

I’m going to repeat that the research paper does not refer to the architecture as a Two-Tower architecture, it’s a description for this kind of approach that was created later. So, while the research paper doesn’t use the word tower, I’m going to continue using it as it makes it easier to visualize what’s going on in this kind of recommender system.

User Tower
The User Tower processes things like a user’s watch history, search tokens, location, and basic demographics. It uses this data to create a vector representation that maps the user’s specific interests in a mathematical space.

Item Tower
The Item Tower represents content using learned embedding vectors. In the original YouTube implementation, these were trained alongside the user model and stored for fast retrieval. This allows the system to compare a user’s “coordinates” against millions of video “coordinates” instantly, without having to run a complex analysis on every single video each time you refresh your feed.
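The matching step between the two towers can be sketched with a plain dot product. The three-dimensional vectors and item names below are made up for illustration. Real systems use much larger learned embeddings and approximate nearest-neighbor search rather than sorting every item.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(user_embedding, item_embeddings, k=2):
    """Return the k item ids whose stored embeddings score highest."""
    scored = sorted(
        item_embeddings.items(),
        key=lambda pair: dot(user_embedding, pair[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in scored[:k]]


# Hypothetical precomputed item embeddings (the stored "coordinates").
items = {
    "cooking_video": [0.9, 0.1, 0.0],
    "travel_video": [0.1, 0.8, 0.2],
    "gaming_video": [0.0, 0.2, 0.9],
}
user = [0.7, 0.6, 0.1]  # hypothetical output of the user tower for one user
top = retrieve(user, items)
```

Because the item vectors are stored ahead of time, the only work done per request is computing the user vector and scoring it against the stored items.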

The Fresh Content Problem

Google’s research paper offers an interesting take on freshness. The problem of freshness is described as a tradeoff between exploitation and exploration. The YouTube recommendation system has to balance between showing users content that is already known to be popular (exploitation) versus exposing them to new and unproven content (exploration). What motivates Google to show new but unproven content, at least for the context of YouTube, is that users show a strong preference for new and fresh content.

The research paper explains why fresh content is important:

“Many hours worth of videos are uploaded each second to YouTube. Recommending this recently uploaded (“fresh”) content is extremely important for YouTube as a product. We consistently observe that users prefer fresh content, though not at the expense of relevance.”

This tendency to show fresh content seems to hold true for Google Discover, where Google tends to show fresh content on topics that users are personally trending with. Have you ever noticed how Google Discover tends to favor fresh content? The insights that the researchers had about user preferences probably carry over to the Google Discover recommendation system. The takeaway here is that producing content on a regular basis could be helpful for getting web pages surfaced in Google Discover.

An interesting insight in this research paper, and I don’t know if it’s still true but it’s still interesting, is that the researchers state that machine learning algorithms show an implicit bias toward older existing content because they are trained on historical data.

They explain:

“Machine learning systems often exhibit an implicit bias towards the past because they are trained to predict future behavior from historical examples.”

The neural network is trained on past videos, so it learns that things from one or two days ago were popular, which creates a bias toward the past. The researchers solved the freshness issue by feeding the age of each training example to the model as an input feature during training. When the system is recommending videos to a user (serving), this time-based feature is set to zero days ago (or slightly negative). This signals to the model that it is making a prediction at the very end of the training window, essentially forcing it to predict what is popular right now rather than what was popular on average in the past.
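The difference between training and serving can be sketched as two small feature-building functions. The feature name `example_age_days` and the day-based units are assumptions for the example, not names taken from the paper.

```python
def training_features(base_features, example_time_days, training_end_days):
    """Attach the example's age, measured back from the end of training."""
    age = training_end_days - example_time_days
    return {**base_features, "example_age_days": age}


def serving_features(base_features):
    """At serving time the age is fixed at zero: predict for right now."""
    return {**base_features, "example_age_days": 0.0}


# A training example logged two days before the end of the training window.
train_row = training_features({"user_id": 42}, example_time_days=88.0, training_end_days=90.0)
# The same user at serving time.
serve_row = serving_features({"user_id": 42})
```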

Accuracy Of Click Data

Google’s foundational research paper also provides insights about implicit user feedback signals, which is a reference to click data. The researchers say that this kind of data rarely provides accurate user satisfaction information.

The researchers write:

“Noise: Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. We rarely obtain the ground truth of user satisfaction and instead model noisy implicit feedback signals. Furthermore, metadata associated with content is poorly structured without a well defined ontology. Our algorithms need to be robust to these particular characteristics of our training data.”

The researchers conclude the paper by stating that this approach to recommender systems helped increase user watch time and proved to be more effective than other systems.

They write:

“We have described our deep neural network architecture for recommending YouTube videos, split into two distinct problems: candidate generation and ranking.
Our deep collaborative filtering model is able to effectively assimilate many signals and model their interaction with layers of depth, outperforming previous matrix factorization approaches used at YouTube.

We demonstrated that using the age of the training example as an input feature removes an inherent bias towards the past and allows the model to represent the time-dependent behavior of popular of videos. This improved offline holdout precision results and increased the watch time dramatically on recently uploaded videos in A/B testing.

Ranking is a more classical machine learning problem yet our deep learning approach outperformed previous linear and tree-based methods for watch time prediction. Recommendation systems in particular benefit from specialized features describing past user behavior with items. Deep neural networks require special representations of categorical and continuous features which we transform with embeddings and quantile normalization, respectively.”

Although this research paper is ten years old, it still offers insights into how recommender systems work and takes a little of the mystery out of recommender systems like Google Discover. Read the original research paper: Deep Neural Networks for YouTube Recommendations

Featured Image by Shutterstock/Andrii Iemelianenko

Google Ads Using New AI Model To Catch Fraudulent Advertisers via @sejournal, @martinibuster

Google published a research paper about a new AI model for detecting fraud in the Google Ads system that’s a strong improvement over what they were previously using. What’s interesting is that the research paper, dated December 31, 2025, says that the new AI is deployed, resulting in an improvement in the detection rate of over 40 percentage points and achieving 99.8% precision on specific policies.

ALF: Advertiser Large Foundation Model

The new AI is called ALF (Advertiser Large Foundation Model), the details of which were published on December 31, 2025. ALF is a multimodal large foundation model that analyzes text, images, and video, together with factors like account age, billing details, and historical performance metrics.

The researchers explain that many of these factors in isolation won’t flag an account as potentially problematic, but that comparing all of these factors together provides a better understanding of advertiser behavior and intent.

They write:

“A core challenge in this ecosystem is to accurately and efficiently understand advertiser intent and behavior. This understanding is critical for several key applications, including matching users with ads and identifying fraud and policy violations.

Addressing this challenge requires a holistic approach, processing diverse data types including structured account information (e.g., account age, billing details), multi-modal ad creative assets (text, images, videos), and landing page content.

For example, an advertiser might have a recently created account, have text and image ads for a well known large brand, and have had a credit card payment declined once. Although each element could exist innocently in isolation, the combination strongly suggests a fraudulent operation.”

The researchers address three challenges that previous systems were unable to overcome:

1. Heterogeneous and High-Dimensional Data
Heterogeneous data refers to the fact that advertiser data comes in multiple formats, not just one type. This includes structured data like account age and billing type and unstructured data like creative assets such as images, text, and video. High-dimensional data refers to the hundreds or thousands of data points associated with each advertiser, causing the mathematical representation of each one to become high-dimensional, which presents challenges for conventional models.

2. Unbounded Sets of Creative Assets
Advertisers could have thousands of creative assets, such as images, and hide one or two malicious ones among thousands of innocent assets. This scenario overwhelmed the previous system.

3. Real-World Reliability and Trustworthiness
The system needs to be able to generate trustworthy confidence scores that a business has malicious intent because a false positive would otherwise affect an innocent advertiser. The system is also expected to work without having to be constantly retuned to catch mistakes.

Privacy and Safety

Although ALF analyzes sensitive signals like billing history and account details, the researchers emphasize that the system is designed with strict privacy safeguards. Before the AI processes any data, all personally identifiable information (PII) is stripped away. This ensures that the model identifies risk based on behavioral patterns rather than sensitive personal data.

The Secret Sauce: How It Spots Outliers

The model also uses a technique called “Inter-Sample Attention” to improve its detection skills. Instead of analyzing a single advertiser in a vacuum, ALF looks at “large advertiser batches” to compare their interactions against one another. This allows the AI to learn what normal activity looks like across the entire ecosystem, which makes it more accurate at spotting suspicious outliers that don’t fit normal behavior.
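To make the idea of comparing advertisers within a batch concrete, here is a minimal sketch of generic scaled dot-product attention applied across the rows of a batch. It is not ALF’s architecture, which the article does not detail. The function name, the toy numbers, and the “attention received” readout are all assumptions made for illustration.

```python
import math

def attention_weights(batch):
    """Scaled dot-product attention weights between every pair of rows."""
    dim = len(batch[0])
    weights = []
    for query in batch:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in batch]
        top = max(scores)                      # subtract the max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])   # softmax over the batch
    return weights

# Toy batch (made-up numbers): three similar "advertisers" and one unusual one.
batch = [
    [1.0, 1.1, 0.9, 1.0],
    [0.9, 1.0, 1.0, 1.1],
    [1.1, 0.9, 1.0, 1.0],
    [-3.0, 3.0, -3.0, 3.0],
]

weights = attention_weights(batch)

# How much attention each row receives from the *other* rows in the batch.
received = [
    sum(weights[i][j] for i in range(len(batch)) if i != j)
    for j in range(len(batch))
]
least_attended = min(range(len(batch)), key=received.__getitem__)
print(least_attended)  # prints 3, the unusual row
```

In this toy batch, the similar rows attend mostly to one another, so the unusual row stands out as the one the rest of the batch largely ignores. That is only one simple way to read the weights, not a claim about how ALF scores advertisers.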

ALF Outperforms Production Benchmarks

The researchers explain that their tests show that ALF outperforms a heavily tuned production baseline:

“Our experiments show ALF significantly outperforms a heavily tuned production baseline while also performing strongly on public benchmarks. In production, ALF delivers substantial and simultaneous gains in precision and recall, boosting recall by over 40 percentage points on one critical policy while increasing precision to 99.8% on another.”

This result shows that ALF delivers measurable gains across multiple evaluation criteria under real-world production conditions, not just in offline or benchmark environments.
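For readers less familiar with the two metrics, the following sketch shows how precision and recall are computed and what a gain measured in percentage points looks like. The counts are hypothetical and chosen only to illustrate the arithmetic. They are not figures from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of flagged advertisers that were truly violating.
    Recall: share of truly violating advertisers that were flagged."""
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical counts for illustration only.
baseline = precision_recall(true_positives=500, false_positives=50, false_negatives=500)
improved = precision_recall(true_positives=900, false_positives=10, false_negatives=100)

# A move from 50% recall to 90% recall is a gain of 40 percentage points.
recall_gain_points = (improved[1] - baseline[1]) * 100
print(round(recall_gain_points, 1))
```

Note that a gain in percentage points is an absolute difference between two rates, which is different from a 40 percent relative improvement.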

Elsewhere they mention tradeoffs in speed:

“The effectiveness of this approach was validated against an exceptionally strong production baseline, itself the result of an extensive search across various architectures and hyperparameters, including DNNs, ensembles, GBDTs, and logistic regression with feature cross exploration.

While ALF’s latency is higher due to its larger model size, it remains well within the acceptable range for our production environment and can be further optimized using hardware accelerators. Experiments show ALF significantly outperforms the baseline on key risk detection tasks, a performance lift driven by its unique ability to holistically model content embeddings, which simpler architectures struggled to leverage. This trade-off is justified by its successful deployment, where ALF serves millions of requests daily.”

Latency is the amount of time a system takes to produce a response after receiving a request. The researchers’ data shows that although ALF increases response time relative to the baseline, latency remains acceptable for production use, and the system is already operating at scale while delivering substantially better fraud detection performance.

Improved Fraud Detection

The researchers say that ALF is now deployed to the Google Ads Safety system for identifying advertisers that are violating Google Ads policies. There is no indication that the system is being used elsewhere such as in Search or Google Business Profiles. But they did say that future work could focus on time-based factors (“temporal dynamics”) for catching evolving patterns. They also indicated that it could be useful for audience modeling and creative optimization.

Read the original PDF version of the research paper:

ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding

Featured Image by Shutterstock/Login

Google’s Recommender System Breakthrough Detects Semantic Intent via @sejournal, @martinibuster

Google published a research paper about helping recommender systems understand what users mean when they interact with them. The goal of the new approach is to overcome the limitations inherent in current state-of-the-art recommender systems and get a finer, more detailed understanding of what users want to read, listen to, or watch at the level of the individual.

Personalized Semantics

Recommender systems predict what a user would like to read or watch next. YouTube, Google Discover, and Google News are examples of recommender systems that recommend content to users. Shopping recommendations are another kind of recommender system.

Recommender systems generally work by collecting data about the kinds of things a user clicks on, rates, buys, and watches and then using that data to suggest more content that aligns with a user’s preferences.

The researchers referred to those kinds of signals as primitive user feedback because they are poorly suited to recommendations that depend on an individual’s subjective judgment about what’s funny, cute, or boring.

The intuition behind the research is that the rise of LLMs presents an opportunity to leverage natural language interactions to better understand what a user wants through identifying semantic intent.

The researchers explain:

“Interactive recommender systems have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional recommender systems (e.g., clicks, item consumption, ratings). They allow users to express intent, preferences, constraints, and contexts in a richer fashion, often using natural language (including faceted search and dialogue).

Yet more research is needed to find the most effective ways to use this feedback. One challenge is inferring a user’s semantic intent from the open-ended terms or attributes often used to describe a desired item. This is critical for recommender systems that wish to support users in their everyday, intuitive use of natural language to refine recommendation results.”

The Soft Attributes Challenge

The researchers explained that recommender systems can understand hard attributes because they are objective ground truths like “genre, artist, director.” What the systems struggled with were “soft attributes,” which are subjective and can’t be definitively matched with movies, content, or product items.

The research paper states the following characteristics of soft attributes:

  • “There is no definitive “ground truth” source associating such soft attributes with items
  • The attributes themselves may have imprecise interpretations
  • And they may be subjective in nature (i.e., different users may interpret them differently)”

Soft attributes are the problem the researchers set out to solve, which is why the research paper is titled “Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors.”

Novel Use Of Concept Activation Vectors (CAVs)

Concept Activation Vectors (CAVs) are a way to probe AI models to understand the mathematical representations (vectors) the models use internally. They provide a way for humans to connect those internal vectors to concepts.

CAVs are normally used to interpret the model. The researchers reversed that direction so that the goal is to interpret the users, translating subjective soft attributes into mathematical representations for recommender systems. They found that adapting CAVs to interpret users produced vector representations that helped AI models detect subtle intent and subjective human judgments that are personalized to an individual.

As they write:

“We demonstrate … that our CAV representation not only accurately interprets users’ subjective semantics, but can also be used to improve recommendations through interactive item critiquing.”

For example, the model can learn that users mean different things by “funny” and be better able to leverage those personalized semantics when making recommendations.

The problem the researchers are solving is figuring out how to bridge the semantic gap between how humans speak and how recommender systems “think.”

Humans think in concepts, using vague or subjective descriptions (called soft attributes).

Recommender systems “think” in math: They operate on vectors (lists of numbers) in a high-dimensional “embedding space”.

The problem then becomes making the subjective human speech less ambiguous but without having to modify or retrain the recommender system with all the nuances. The CAVs do that heavy lifting.

The researchers explain:

“…we infer the semantics of soft attributes using the representation learned by the recommender system model itself.”

They list four advantages of their approach:

“(1) The recommender system’s model capacity is directed to predicting user-item preferences without further trying to predict additional side information (e.g., tags), which often does not improve recommender system performance.

(2) The recommender system model can easily accommodate new attributes without retraining should new sources of tags, keywords or phrases emerge from which to derive new soft attributes.

(3) Our approach offers a means to test whether specific soft attributes are relevant to predicting user preferences. Thus, we are able focus attention on attributes most relevant to capturing a user’s intent (e.g., when explaining recommendations, eliciting preferences, or suggesting critiques).

(4) One can learn soft attribute/tag semantics with relatively small amounts of labelled data, in the spirit of pre-training and few-shot learning.”

They then provide a high-level explanation of how the system works:

“At a high-level, our approach works as follows. we assume we are given:

(i) a collaborative filtering-style model (e.g., probabilistic matrix factorization or dual encoder) which embeds items and users in a latent space based on user-item ratings; and

(ii) a (small) set of tags (i.e., soft attribute labels) provided by a subset of users for a subset of items.

We develop methods that associate with each item the degree to which it exhibits a soft attribute, thus determining that attribute’s semantics. We do this by applying concept activation vectors (CAVs) —a recent method developed for interpretability of machine-learned models—to the collaborative filtering model to detect whether it learned a representation of the attribute.

The projection of this CAV in embedding space provides a (local) directional semantics for the attribute that can then be applied to items (and users). Moreover, the technique can be used to identify the subjective nature of an attribute, specifically, whether different users have different meanings (or tag senses) in mind when using that tag. Such a personalized semantics for subjective attributes can be vital to the sound interpretation of a user’s true intent when trying to assess her preferences.”
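The idea in that passage can be sketched in a few lines. This is a minimal illustration and not the paper’s code: it assumes tiny made-up 2-D item embeddings, a hypothetical “funny” tag, and a plain logistic-regression probe trained by gradient descent. The direction the probe learns stands in for the CAV, and projecting an item onto it gives the degree to which the item exhibits the attribute.

```python
import math

def train_cav(embeddings, labels, steps=500, lr=0.5):
    """Fit a logistic-regression probe; its unit-length weight vector is the CAV."""
    dim = len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    n = len(embeddings)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y   # predicted minus actual
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

# Hypothetical item embeddings, as if taken from a collaborative filtering model.
item_embeddings = {
    "movie_a": [0.9, 0.2],
    "movie_b": [0.8, -0.1],
    "movie_c": [0.7, 0.3],
    "movie_d": [-0.6, 0.1],
    "movie_e": [-0.8, -0.2],
    "movie_f": [-0.7, 0.4],
}
tagged_funny = {"movie_a", "movie_b", "movie_c"}   # a small set of user tags

cav = train_cav(
    list(item_embeddings.values()),
    [1 if name in tagged_funny else 0 for name in item_embeddings],
)

# Projection onto the CAV: how strongly each item exhibits the attribute.
scores = {
    name: sum(c * x for c, x in zip(cav, vec))
    for name, vec in item_embeddings.items()
}
```

Training a separate probe on one user’s tags instead of everyone’s would give a user-specific direction, which is one way to picture the personalized semantics the authors describe.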

Does This System Work?

One of the interesting findings is that their test of an artificial tag (odd year) showed that the system’s accuracy rate was barely better than random selection, which corroborated their hypothesis that “CAVs are useful for identifying preference related attributes/tags.”

They also found that using CAVs in recommender systems was useful for understanding “critiquing-based” user behavior and improved those kinds of recommender systems.
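Critiquing means a user reacts to a recommendation with feedback like “funnier,” and the system adjusts. The sketch below shows one generic way that could work: nudging the user’s embedding along an attribute direction and re-ranking. The article does not describe the paper’s exact critiquing procedure, so the embeddings, the `funnier_cav` vector, and the step size here are all hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def critique(user_vec, cav, step):
    """Shift the user's embedding along the attribute direction."""
    return [u + step * c for u, c in zip(user_vec, cav)]

def rank(user_vec, items):
    """Order items by how well they match the user embedding."""
    return sorted(items, key=lambda name: dot(user_vec, items[name]), reverse=True)

# Hypothetical 2-D embeddings; the second axis stands in for a "funnier" CAV.
items = {
    "drama": [0.9, -0.5],
    "dramedy": [0.7, 0.4],
    "slapstick": [0.2, 0.8],
}
funnier_cav = [0.0, 1.0]
user = [1.0, 0.0]

before = rank(user, items)
after = rank(critique(user, funnier_cav, step=1.0), items)
print(before)  # ['drama', 'dramedy', 'slapstick']
print(after)   # ['dramedy', 'slapstick', 'drama']
```

The recommender’s item embeddings are never changed in this sketch. Only the user’s position moves, which fits the article’s point that the approach does not require retraining the underlying model.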

The researchers listed four benefits:

“(i) using a collaborative filtering representation to identify attributes of greatest relevance to the recommendation task;

(ii) distinguishing objective and subjective tag usage;

(iii) identifying personalized, user-specific semantics for subjective attributes; and

(iv) relating attribute semantics to preference representations, thus allowing interactions using soft attributes/tags in example critiquing and other forms of preference elicitation.”

They found that their approach improved recommendations in situations where discovery of soft attributes is important. Whether the approach would also aid product recommendations in situations where hard attributes are more the norm, such as product shopping, is an area for future study.

Takeaways

The research paper was published in 2024 and I had to dig around to actually find it, which may explain why it generally went unnoticed in the search marketing community.

Google tested some of this approach with an algorithm called WALS (Weighted Alternating Least Squares), using actual production code. WALS is also a product in Google Cloud for developers.

Two notes in a footnote and in the appendix explain:

“CAVs on MovieLens20M data with linear attributes use embeddings that were learned (via WALS) using internal production code, which is not releasable.”

“…The linear embeddings were learned (via WALS, Appendix A.3.1) using internal production code, which is not releasable.”

“Production code” refers to software that is currently running in Google’s user-facing products, in this case Google Cloud. It’s likely not the underlying engine for Google Discover, but it’s important to note because it shows how easily the approach can be integrated into an existing recommender system.

They tested this system using the MovieLens20M dataset, a public dataset of 20 million ratings, with some of the tests done with Google’s proprietary recommendation engine (WALS). This lends credibility to the inference that the approach can be used on a live system without having to retrain or modify it.

The takeaway that I see in this research paper is that this makes it possible for recommender systems to leverage semantic data about soft attributes. Google Discover is regarded by Google as a subset of search, and search patterns are some of the data that the system uses to surface content. Google doesn’t say whether they are using this kind of method, but given the positive results, it is possible that this approach could be used in Google’s recommender systems. If that’s the case, then that means Google’s recommendations may be more responsive to users’ subjective semantics.

The research paper credits Google Research (60% of the credits), and also Amazon, Midjourney, and Meta AI.

The PDF is available here:

Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors

Featured Image by Shutterstock/Here

Google’s New Graph Foundation Model Catches Spam Up To 40x Better via @sejournal, @martinibuster

Google published details of a new kind of AI based on graphs called a Graph Foundation Model (GFM) that generalizes to previously unseen graphs and delivers a three to forty times boost in precision over previous methods, with successful testing in scaled applications such as spam detection in ads.

The announcement describes the new technology as expanding the boundaries of what has been possible until now:

“Today, we explore the possibility of designing a single model that can excel on interconnected relational tables and at the same time generalize to any arbitrary set of tables, features, and tasks without additional training. We are excited to share our recent progress on developing such graph foundation models (GFM) that push the frontiers of graph learning and tabular ML well beyond standard baselines.”

Google's Graph Foundation Model shows 3-40 times performance improvement in precision

Graph Neural Networks Vs. Graph Foundation Models

Graphs are representations of data that are related to each other. The connections between the objects are called edges and the objects themselves are called nodes. In SEO, the most familiar type of graph could be said to be the Link Graph, which is a map of the entire web by the links that connect one web page to another.

Current technology uses Graph Neural Networks (GNNs) to represent data like web page content, and GNNs can be used to identify the topic of a web page.

A Google Research blog post about GNNs explains their importance:

“Graph neural networks, or GNNs for short, have emerged as a powerful technique to leverage both the graph’s connectivity (as in the older algorithms DeepWalk and Node2Vec) and the input features on the various nodes and edges. GNNs can make predictions for graphs as a whole (Does this molecule react in a certain way?), for individual nodes (What’s the topic of this document, given its citations?)…

Apart from making predictions about graphs, GNNs are a powerful tool used to bridge the chasm to more typical neural network use cases. They encode a graph’s discrete, relational information in a continuous way so that it can be included naturally in another deep learning system.”

The downside to GNNs is that they are tethered to the graph on which they were trained and can’t be used on a different kind of graph. To use a GNN on a different graph, Google has to train another model specifically for that graph.

To make an analogy, it’s like having to train a new generative AI model on French-language documents just to get it to work in French. LLMs don’t have that limitation because they can generalize to other languages, but models that work with graphs do. This is the problem the invention solves: creating a model that generalizes to other graphs without having to be trained on them first.

The breakthrough that Google announced is that with the new Graph Foundation Models, Google can now train a model that can generalize across new graphs that it hasn’t been trained on and understand patterns and connections within those graphs. And it can do it three to forty times more precisely.

Announcement But No Research Paper

Google’s announcement does not link to a research paper. It’s been variously reported that Google has decided to publish fewer research papers, and this is a big example of that policy change. Is it because this innovation is so big they want to keep this as a competitive advantage?

How Graph Foundation Models Work

In a conventional graph, let’s say a graph of the Internet, web pages are the nodes. The links between the nodes (web pages) are called the edges. In that kind of graph, you can see similarities between pages because the pages about a specific topic tend to link to other pages about the same specific topic.

In very simple terms, a Graph Foundation Model turns every row in every table into a node and connects related nodes based on the relationships in the tables. The result is a single large graph that the model uses to learn from existing data and make predictions (like identifying spam) on new data.

Screenshot Of Five Tables

Image by Google

Transforming Tables Into A Single Graph

The announcement says this about the images that illustrate the process:

“Data preparation consists of transforming tables into a single graph, where each row of a table becomes a node of the respective node type, and foreign key columns become edges between the nodes. Connections between five tables shown become edges in the resulting graph.”

Screenshot Of Tables Converted To Edges

Image by Google

What makes this new model exceptional is that the process of creating it is “straightforward” and it scales. The part about scaling is important because it means that the invention is able to work across Google’s massive infrastructure.

“We argue that leveraging the connectivity structure between tables is key for effective ML algorithms and better downstream performance, even when tabular feature data (e.g., price, size, category) is sparse or noisy. To this end, the only data preparation step consists of transforming a collection of tables into a single heterogeneous graph.

The process is rather straightforward and can be executed at scale: each table becomes a unique node type and each row in a table becomes a node. For each row in a table, its foreign key relations become typed edges to respective nodes from other tables while the rest of the columns are treated as node features (typically, with numerical or categorical values). Optionally, we can also keep temporal information as node or edge features.”
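The data preparation step described in that quote is simple enough to sketch. The following is an illustration under stated assumptions, not Google’s code: the `tables_to_graph` function, the `accounts` and `ads` tables, and the convention that every row has an `id` column are all made up for the example.

```python
from collections import defaultdict

def tables_to_graph(tables, foreign_keys):
    """Turn relational tables into one heterogeneous graph.

    tables: {table_name: [row dicts, each with an "id" key]}
    foreign_keys: {(table_name, column): referenced_table_name}
    Returns (nodes, edges).
    """
    nodes = {}                   # (node_type, id) -> feature dict
    edges = defaultdict(list)    # edge_type -> [(source_node, target_node)]
    for table_name, rows in tables.items():
        for row in rows:
            node = (table_name, row["id"])      # each row becomes a node
            features = {}
            for column, value in row.items():
                if column == "id":
                    continue
                target_table = foreign_keys.get((table_name, column))
                if target_table is not None:
                    # Foreign key columns become typed edges.
                    edge_type = (table_name, column, target_table)
                    edges[edge_type].append((node, (target_table, value)))
                else:
                    # All other columns are kept as node features.
                    features[column] = value
            nodes[node] = features
    return nodes, dict(edges)

# Hypothetical tables for illustration.
tables = {
    "accounts": [{"id": 1, "country": "US"}, {"id": 2, "country": "DE"}],
    "ads": [
        {"id": 10, "account_id": 1, "format": "text"},
        {"id": 11, "account_id": 1, "format": "image"},
        {"id": 12, "account_id": 2, "format": "text"},
    ],
}
foreign_keys = {("ads", "account_id"): "accounts"}

nodes, edges = tables_to_graph(tables, foreign_keys)
print(len(nodes))                                   # 5 nodes, one per row
print(edges[("ads", "account_id", "accounts")])     # 3 edges from ads to accounts
```

Two tables with five rows become five typed nodes joined by three edges, and the `account_id` column disappears from the ad features because it is now represented as a connection.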

Tests Are Successful

Google’s announcement says that they tested it in identifying spam in Google Ads, which was difficult because it’s a system that uses dozens of large graphs. Current systems are unable to make connections between unrelated graphs and miss important context.

Google’s new Graph Foundation Model was able to make the connections between all the graphs and improved performance.

The announcement described the achievement:

“We observe a significant performance boost compared to the best tuned single-table baselines. Depending on the downstream task, GFM brings 3x – 40x gains in average precision, which indicates that the graph structure in relational tables provides a crucial signal to be leveraged by ML models.”
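The metric in that quote is average precision, which measures how well a model ranks the true positives (here, spam) near the top of its list. The sketch below uses the standard definition with two made-up rankings to show how a gain is expressed as a multiple. The numbers are illustrative and are not Google’s results.

```python
def average_precision(ranked_labels):
    """ranked_labels: 1 for spam, 0 for not spam, ordered from the
    highest model score to the lowest."""
    hits, precision_sum = 0, 0.0
    for position, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / position   # precision at this cutoff
    return precision_sum / hits if hits else 0.0

# Two hypothetical rankings of the same five items, two of which are spam.
baseline_ranking = [0, 0, 1, 0, 1]   # spam buried lower in the list
improved_ranking = [1, 0, 1, 0, 0]   # spam ranked near the top

gain = average_precision(improved_ranking) / average_precision(baseline_ranking)
print(round(gain, 2))  # about 2.27x in this toy example
```

A “3x” or “40x” gain in the announcement is this kind of ratio between the new model’s average precision and the baseline’s on a given task.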

Is Google Using This System?

It’s notable that Google successfully tested the system with Google Ads for spam detection and reported upsides and no downsides. This means that it can be used in a live environment for a variety of real-world tasks. They used it for Google Ads spam detection, and because it’s a flexible model, it can be used for other tasks that involve multiple graphs, from identifying content topics to identifying link spam.

Normally, when something falls short, research papers and announcements say that it points the way for future work, but that’s not how this new invention is presented. It’s presented as a success, and it ends with a statement saying that these results can be further improved, meaning it can get even better than these already spectacular results.

“These results can be further improved by additional scaling and diverse training data collection together with a deeper theoretical understanding of generalization.”

Read Google’s announcement:

Graph foundation models for relational data

Featured Image by Shutterstock/SidorArt