Google’s SAGE Agentic AI Research: What It Means For SEO via @sejournal, @martinibuster

Google published a research paper about creating a challenging dataset for training AI agents for deep research. The paper offers insights into how agentic AI deep research works, which implies insights for optimizing content.

The acronym SAGE stands for Steerable Agentic Data Generation for Deep Search with Execution Feedback.

Synthetic Question And Answer Pairs

The researchers noted that the previous state-of-the-art AI training datasets (like Musique and HotpotQA) required no more than four reasoning steps to answer their questions. On the number of searches needed to answer a question, Musique averages 2.7 searches per question and HotpotQA averages 2.1 searches. Another commonly used dataset named Natural Questions (NQ) requires an average of only 1.3 searches per question.

These datasets that are used to train AI agents created a training gap for deep search tasks that required more reasoning steps and a greater number of searches. How can you train an AI agent for complex real-world deep search tasks if the AI agents haven’t been trained to tackle genuinely difficult questions?

The researchers created a system called SAGE that automatically generates high-quality, complex question-answer pairs for training AI search agents. SAGE is a “dual-agent” system where one AI writes a question and a second “search agent” AI tries to solve it, providing feedback on the complexity of the question.

  • The goal of the first AI is to write a question that’s challenging to answer and requires many reasoning steps and multiple searches to solve.
  • The goal of the second AI is to measure whether the question is answerable and to calculate how difficult it is (the minimum number of search steps required).

The key to SAGE is that if the second AI solves the question too easily or gets it wrong, the specific steps and documents it found (the execution trace) are fed back to the first AI. This feedback enables the first AI to identify one of four shortcuts that enable the second AI to solve the question in fewer steps.
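
The feedback loop described above can be sketched roughly as follows. This is a minimal illustration and not the paper’s implementation: the function names (write_question, solve, generate_pair), the Trace structure, and the stub logic standing in for the two AI models are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Execution trace from the search agent (hypothetical structure)."""
    searches: list = field(default_factory=list)  # queries the agent issued
    answer: str = ""


def write_question(target_steps, feedback=None):
    # Stub for the first AI. A real system would prompt a language model and
    # use the feedback trace to remove the shortcut the solver exploited.
    revision = 0 if feedback is None else feedback["revision"] + 1
    return {"text": f"question v{revision}", "answer": "42", "revision": revision}


def solve(question):
    # Stub for the second AI (the search agent). For illustration, pretend
    # that each revision forces one more search before the answer is found.
    n_searches = question["revision"] + 1
    return Trace(searches=[f"query {i}" for i in range(n_searches)], answer="42")


def generate_pair(target_steps, max_rounds=10):
    feedback = None
    for _ in range(max_rounds):
        question = write_question(target_steps, feedback)
        trace = solve(question)
        correct = trace.answer == question["answer"]
        hard_enough = len(trace.searches) >= target_steps
        if correct and hard_enough:
            return question, trace  # keep this question-answer pair
        # Too easy or wrong: feed the execution trace back to the writer.
        feedback = {"revision": question["revision"], "trace": trace}
    return None  # no acceptable question within the round limit


result = generate_pair(target_steps=3)
```

With these stubs, the loop rejects the first two drafts because they are solved in fewer than three searches, then accepts the third.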

It’s these shortcuts that provide insights into how to rank better for deep research tasks.

Four Ways That Deep Research Was Avoided

The goal of the paper was to create a set of question and answer pairs that were so difficult that it took the AI agent multiple steps to solve. The feedback showed four ways that made it less necessary for the AI agent to do additional searches to find an answer.

Four Reasons Deep Research Was Unnecessary

  1. Information Co-Location
    This is the most common shortcut, accounting for 35% of the times when deep research was not necessary. This happens when two or more pieces of information needed to answer a question are located in the same document. Instead of searching twice, the AI finds both answers in one “hop”.
  2. Multi-query Collapse
    This happened in 21% of cases. The cause is when a single, clever search query retrieves enough information from different documents to solve multiple parts of the problem at once. This “collapses” what should have been a multi-step process into a single step.
  3. Superficial Complexity
    This accounts for 13% of times when deep research was not necessary. The question looks long and complicated to a human, but a search engine (that an AI agent is using) can jump straight to the answer without needing to reason through the intermediate steps.
  4. Overly Specific Questions
    31% of the failures are questions that contain so much detail that the answer becomes obvious in the very first search, removing the need for any “deep” investigation.

The researchers found that some questions look hard but are actually relatively easy because the information is “co-located” in one document. If an agent can answer a 4-hop question in 1 hop because one website was comprehensive enough to have all the answers, that data point is considered a failure for training the agent for reasoning but it’s still something that can happen in real-life and the agent will take advantage of finding all the information on one page.
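
To make the co-location idea concrete, here is a small sketch that counts how many documents an agent would need to open to collect every fact a question requires. The representation of documents as sets of fact labels, the greedy strategy, and the name hops_needed are assumptions for illustration, not something taken from the paper.

```python
def hops_needed(required_facts, documents):
    """Greedily count how many documents must be read to cover every fact.

    `documents` maps a document name to the set of facts it contains.
    Returns None if the facts cannot all be found.
    """
    remaining = set(required_facts)
    hops = 0
    while remaining:
        if not documents:
            return None
        # Pick the document that covers the most still-missing facts.
        best = max(documents, key=lambda name: len(documents[name] & remaining))
        covered = documents[best] & remaining
        if not covered:
            return None
        remaining -= covered
        hops += 1
    return hops


facts = {"founder", "founding_year", "headquarters", "ceo"}

scattered = {
    "page_a": {"founder"},
    "page_b": {"founding_year"},
    "page_c": {"headquarters"},
    "page_d": {"ceo"},
}
comprehensive = {"page_all": {"founder", "founding_year", "headquarters", "ceo"}}

print(hops_needed(facts, scattered))      # 4
print(hops_needed(facts, comprehensive))  # 1
```

The same four facts take four hops when they are scattered and one hop when a single page contains them all, which is the situation the researchers counted as a shortcut.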

SEO Takeaways

It’s possible to gain some insights into what kinds of content satisfy deep research. While these aren’t necessarily tactics for ranking better in agentic AI deep search, these insights do show what kinds of scenarios caused the AI agents to find all or most of the answers in one web page.

“Information Co-location” Could Be An SEO Win
The researchers found that when multiple pieces of information required to answer a question occur in the same document, it reduces the number of search steps needed. For a publisher, this means consolidating “scattered” facts into one page prevents an AI agent from having to “hop” to a competitor’s site to find the rest of the answer.

Triggering “Multi-query Collapse”
The authors identified a phenomenon where information from different documents can be retrieved using a single query. By structuring content to answer several sub-questions at once, you enable the agent to find the full solution on your page faster, effectively “short-circuiting” the long reasoning chain the agent was prepared to undertake.

Eliminating “Shortcuts” (The Reasoning Gap)
The research paper notes that the data generator fails when it accidentally creates a “shortcut” to the answer. As an SEO, your goal is to be that shortcut—providing the specific data points like calculations, dates, or names that allow the agent to reach the final answer without further exploration.

The Goal Is Still To Rank In Classic Search

For an SEO and a publisher, these shortcuts underline the value of creating a comprehensive document because it removes the need for an AI agent to hop somewhere else. This doesn’t mean it will be helpful to add all the information in one page. If it makes sense for a user, it may be useful to link out from one page to another page for related information.

The reason I say that is because the AI agent is conducting classic search looking for answers, so the goal remains to optimize a web page for classic search. Furthermore, in this research, the AI agent is pulling from the top three ranked web pages for each query that it’s executing. I don’t know if this is how agentic AI search works in a live environment, but this is something to consider.

In fact, one of the tests that the researchers did was conducted using the Serper API to extract search results from Google.

So when it comes to ranking in agentic AI search, consider these takeaways:

  • It may be useful to consider the importance of ranking in the top three.
  • Do optimize web pages for classic search.
  • Do not optimize web pages for AI search.
  • If it’s possible to be comprehensive, remain on-topic, and rank in the top three, then do that.
  • Interlink to relevant pages to help those rank in classic search, preferably in the top three (to be safe).

It could be that agentic AI search will consider pulling from more than the top three in classic search. But it may be helpful to set the goal of ranking for the top 3 in classic search and to focus on ranking other pages that may be a part of the multi-hop deep research.

The research paper was published by Google on January 26, 2026. It’s available in PDF form: SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback.


Google’s New User Intent Extraction Method via @sejournal, @martinibuster

Google published a research paper on how to extract user intent from user interactions that can then be used for autonomous agents. The method they discovered uses on-device small models that do not need to send data back to Google, which means that a user’s privacy is protected.

The researchers discovered they were able to solve the problem by splitting it into two tasks. Their solution worked so well it was able to beat the base performance of multi-modal large language models (MLLMs) in massive data centers.

Smaller Models On Browsers And Devices

The focus of the research is on identifying the user intent through the series of actions that a user takes on their mobile device or browser while also keeping that information on the device so that no information is sent back to Google. That means the processing must happen on the device.

They accomplished this in two stages.

  1. In the first stage, the model on the device summarizes what the user was doing.
  2. The sequence of summaries is then sent to a second model that identifies the user intent.

The researchers explained:

“…our two-stage approach demonstrates superior performance compared to both smaller models and a state-of-the-art large MLLM, independent of dataset and model type.
Our approach also naturally handles scenarios with noisy data that traditional supervised fine-tuning methods struggle with.”

Intent Extraction From UI Interactions

Intent extraction from screenshots and text descriptions of user interactions is a technique that was proposed in 2025 using Multimodal Large Language Models (MLLMs). The researchers say they followed this approach for their problem but used an improved prompt.

The researchers explained that extracting intent is not a trivial problem to solve and that there are multiple errors that can happen along the steps. The researchers use the word trajectory to describe a user journey within a mobile or web application, represented as a sequence of interactions.

The user journey (trajectory) is turned into a formula where each interaction step consists of two parts:

  1. An Observation
    This is the visual state of the screen (screenshot) of where the user is at that step.
  2. An Action
    The specific action that the user performed on that screen (like clicking a button, typing text, or clicking a link).
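
One way to picture this representation is as a simple data structure. The sketch below is an assumption-based illustration: the class name Interaction, the field names, and the screenshot file paths are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    observation: str  # stand-in for the screenshot, e.g. a (hypothetical) file path
    action: str       # what the user did on that screen


# A trajectory is an ordered sequence of interaction steps.
Trajectory = List[Interaction]

checkout: Trajectory = [
    Interaction("screenshots/step1.png", "typed 'running shoes' in the search box"),
    Interaction("screenshots/step2.png", "clicked the first product link"),
    Interaction("screenshots/step3.png", "clicked the 'Add to cart' button"),
]

for step_number, step in enumerate(checkout, start=1):
    print(step_number, step.action)
```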

They described three qualities of a good extracted intent:

  • “faithful: only describes things that actually occur in the trajectory;
  • comprehensive: provides all of the information about the user intent required to re-enact the trajectory;
  • and relevant: does not contain extraneous information beyond what is needed for comprehensiveness.”

Challenging To Evaluate Extracted Intents

The researchers explain that grading extracted intent is difficult because user intents contain complex details (like dates or transaction data) and the user intents are inherently subjective, containing ambiguities, which is a hard problem to solve. The reason trajectories are subjective is because the underlying motivations are ambiguous.

For example, did a user choose a product because of the price or the features? The actions are visible but the motivations are not. Previous research shows that intents between humans matched 80% on web trajectories and 76% on mobile trajectories, so it’s not like a given trajectory can always indicate a specific intent.

Two-Stage Approach

After ruling out other methods like Chain of Thought (CoT) reasoning (because small language models struggled with the reasoning), they chose a two-stage approach that emulated Chain of Thought reasoning.

The researchers explained their two-stage approach:

“First, we use prompting to generate a summary for each interaction (consisting of a visual screenshot and textual action representation) in a trajectory. This stage is prompt-based as there is currently no training data available with summary labels for individual interactions.

Second, we feed all of the interaction-level summaries into a second stage model to generate an overall intent description. We apply fine-tuning in the second stage…”
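
The data flow of the two stages can be sketched as below. Both functions are placeholder stubs written for this example: in the paper, the first stage is a prompted model and the second stage is a fine-tuned model, whereas here they are plain string operations so that the flow is visible. All names are assumptions.

```python
def summarize_interaction(screen_description, action):
    # Stage 1 stub: the paper prompts a small on-device model here.
    return f"On '{screen_description}', the user chose to {action}."


def extract_intent(summaries):
    # Stage 2 stub: the paper uses a fine-tuned model here. This placeholder
    # simply joins the summaries into one description.
    return " ".join(summaries)


def two_stage_intent(trajectory):
    # One summary per interaction, then one overall intent from all summaries.
    summaries = [summarize_interaction(screen, action) for screen, action in trajectory]
    return extract_intent(summaries)


trajectory = [
    ("flight search form", "enter 'Lisbon' as the destination"),
    ("results list", "tap the cheapest flight"),
]
intent = two_stage_intent(trajectory)
print(intent)
```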

The First Stage: Screenshot Summary

For the first stage, the summary of each interaction is divided into two parts, although the model also produces a third part.

  1. A description of what’s on the screen.
  2. A description of the user’s action.

The third component, labeled “speculative intent,” is where the model is allowed to guess at what the user is trying to do. The researchers then simply discard it. Surprisingly, allowing the model to speculate and then getting rid of that speculation leads to a higher quality result.

The researchers cycled through multiple prompting strategies and this was the one that worked the best.
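
The idea of letting the model speculate and then discarding the speculation might look something like this in code. The structure and field names are assumptions for illustration, and the example text is invented.

```python
from dataclasses import dataclass


@dataclass
class StageOneSummary:
    screen: str              # what is on the screen
    action: str              # what the user did
    speculative_intent: str  # the model's guess: produced, then discarded


def keep_factual_parts(summary):
    """Return only the parts that are passed on to the second stage."""
    return {"screen": summary.screen, "action": summary.action}


raw = StageOneSummary(
    screen="Product page for running shoes",
    action="Tapped 'Add to cart'",
    speculative_intent="User is probably training for a marathon",
)
cleaned = keep_factual_parts(raw)
print(cleaned)
```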

The Second Stage: Generating Overall Intent Description

For the second stage, the researchers fine-tuned a model for generating an overall intent description. They fine-tuned the model with training data that is made up of two parts:

  1. Summaries that represent all interactions in the trajectory
  2. The matching ground truth that describes the overall intent for each of the trajectories.

The model initially tended to hallucinate because the first part (the input summaries) is potentially incomplete, while the “target intents” are complete. That caused the model to learn to fill in the missing parts in order to make the input summaries match the target intents.

They solved this problem by “refining” the target intents by removing details that aren’t reflected in the input summaries. This trained the model to infer the intents based only on the inputs.

The researchers compared four different approaches and settled on this approach because it performed so well.
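
As a rough illustration of the target refinement described above, the sketch below drops any detail of a target intent that does not appear in the input summaries. The simple substring check is a stand-in chosen for this example, since the method the researchers actually used for refinement is not described here, and the function name and sample data are hypothetical.

```python
def refine_target(target_details, input_summaries):
    """Keep only the target details that are mentioned in some input summary."""
    text = " ".join(input_summaries).lower()
    return [detail for detail in target_details if detail.lower() in text]


summaries = [
    "User searched for flights to Lisbon",
    "User selected a flight on May 3",
]
target = ["Lisbon", "May 3", "window seat"]

refined = refine_target(target, summaries)
print(refined)  # ['Lisbon', 'May 3']
```

Because “window seat” never appears in the summaries, it is removed from the target, so a model trained on this pair is not rewarded for inventing it.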

Ethical Considerations And Limitations

The research paper ends by summarizing potential ethical issues where an autonomous agent might take actions that are not in the user’s interest and stressed the necessity to build the proper guardrails.

The authors also acknowledged limitations in the research that might limit generalizability of the results. For example, the testing was done only on Android and web environments, which means that the results might not generalize to Apple devices. Another limitation is that the research was limited to users in the United States in the English language.

There is nothing in the research paper or the accompanying blog post that suggests that these processes for extracting user intent are currently in use. The blog post ends by communicating that the described approach is helpful:

“Ultimately, as models improve in performance and mobile devices acquire more processing power, we hope that on-device intent understanding can become a building block for many assistive features on mobile devices going forward.”

Takeaways

Neither the blog post about this research nor the research paper itself describes the results of these processes as something that might be used in AI search or classic search. Both do mention the context of autonomous agents.

The research paper explicitly mentions the context of an autonomous agent on the device that observes how the user interacts with a user interface and then infers what the goal (the intent) of those actions is.

The paper lists two specific applications for this technology:

  1. Proactive Assistance
    An agent that watches what a user is doing for “enhanced personalization” and “improved work efficiency”.
  2. Personalized Memory
    The process enables a device to “remember” past activities as an intent for later.

Shows The Direction Google Is Heading In

While this might not be used right away, it shows the direction that Google is heading, where small models on a device will be watching user interactions and sometimes stepping in to assist users based on their intent. Intent here is used in the sense of understanding what a user is trying to do.

Read Google’s blog post here:

Small models, big results: Achieving superior intent extraction through decomposition

Read the PDF research paper:

Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition (PDF)


How Recommender Systems Like Google Discover May Work via @sejournal, @martinibuster

Google Discover is largely a mystery to publishers and the search marketing community even though Google has published official guidance about what it is and what they feel publishers should know about it. Nevertheless, it’s so mysterious that it’s generally not even considered as a recommender system, yet that is what it is. This is a review of a classic research paper that shows how to scale a recommender system. Although it’s for YouTube, it’s not hard to imagine how this kind of system can be adapted to Google Discover.

Recommender Systems

Google Discover belongs to the class of systems known as recommender systems. A classic recommender system I remember is the MovieLens system from way back in 1997. It is a university science department project that allowed users to rate movies and it would use those ratings to recommend movies to watch. The way it worked is that people who tend to like these kinds of movies tend to also like these other kinds of movies. But these kinds of algorithms have limitations that make them fall short for the scale necessary to personalize recommendations for YouTube or Google Discover.

Two-Tower Recommender System Model

The modern style of recommender system is sometimes referred to as the Two-Tower architecture or the Two-Tower model. The Two-Tower model came about as a solution for YouTube, even though the original research paper (Deep Neural Networks for YouTube Recommendations) does not use this term.

It may seem counterintuitive to look to YouTube to understand how the Google Discover algorithm works, but the fact is that the system Google developed for YouTube became the foundation for how to scale a recommender system for an environment where massive amounts of content are generated every hour of the day, 24 hours a day.

It’s called the Two-Tower architecture because there are two representations that are matched against each other, like two towers.

In this model, which handles the initial “retrieval” of content from the database, a neural network processes user information to produce a user embedding, while content items are represented by their own embeddings. These two representations are matched using similarity scoring rather than being combined inside a single network.

I’m going to repeat that the research paper does not refer to the architecture as a Two-Tower architecture. That term is a description for this kind of approach that was created later. So, while the research paper doesn’t use the word tower, I’m going to continue using it as it makes it easier to visualize what’s going on in this kind of recommender system.

User Tower
The User Tower processes things like a user’s watch history, search tokens, location, and basic demographics. It uses this data to create a vector representation that maps the user’s specific interests in a mathematical space.

Item Tower
The Item Tower represents content using learned embedding vectors. In the original YouTube implementation, these were trained alongside the user model and stored for fast retrieval. This allows the system to compare a user’s “coordinates” against millions of video “coordinates” instantly, without having to run a complex analysis on every single video each time you refresh your feed.
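
Matching a user’s “coordinates” against item “coordinates” can be illustrated with a tiny example. This is a sketch under stated assumptions: the three-dimensional vectors are hand-written rather than learned, a plain dot product is used as the similarity score, and the function names are made up for the example. Large systems typically rely on approximate nearest-neighbor search rather than scoring every item as this toy version does.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve(user_embedding, item_embeddings, k=2):
    """Return the k item ids whose embeddings score highest against the user."""
    ranked = sorted(
        item_embeddings,
        key=lambda item_id: dot(user_embedding, item_embeddings[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hand-written vectors for illustration. Real embeddings are learned.
user = [0.9, 0.1, 0.0]
items = {
    "cooking_video": [1.0, 0.0, 0.0],
    "sports_video": [0.0, 1.0, 0.0],
    "news_video": [0.0, 0.0, 1.0],
    "baking_video": [0.8, 0.2, 0.1],
}

print(retrieve(user, items))  # ['cooking_video', 'baking_video']
```

Because the item vectors are stored ahead of time, only the user vector needs to be computed when the feed is refreshed. The comparison itself is simple arithmetic.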

The Fresh Content Problem

Google’s research paper offers an interesting take on freshness. The problem of freshness is described as a tradeoff between exploitation and exploration. The YouTube recommendation system has to balance between showing users content that is already known to be popular (exploitation) versus exposing them to new and unproven content (exploration). What motivates Google to show new but unproven content, at least for the context of YouTube, is that users show a strong preference for new and fresh content.

The research paper explains why fresh content is important:

“Many hours worth of videos are uploaded each second to YouTube. Recommending this recently uploaded (“fresh”) content is extremely important for YouTube as a product. We consistently observe that users prefer fresh content, though not at the expense of relevance.”

This tendency to show fresh content seems to hold true for Google Discover, where Google tends to show fresh content on topics that users are personally trending with. Have you ever noticed how Google Discover tends to favor fresh content? The insights that the researchers had about user preferences probably carry over to the Google Discover recommendation system. The takeaway here is that producing content on a regular basis could be helpful for getting web pages surfaced in Google Discover.

An interesting insight in this research paper, and I don’t know if it’s still true but it’s still interesting, is that the researchers state that machine learning algorithms show an implicit bias toward older existing content because they are trained on historical data.

They explain:

“Machine learning systems often exhibit an implicit bias towards the past because they are trained to predict future behavior from historical examples.”

The neural network is trained on past videos, so it learns that things from one or two days ago were popular. But this creates a bias for things that happened in the past. The researchers addressed this by feeding the age of each training example to the model as an input feature during training. When the system is recommending videos to a user (serving), this time-based feature is set to zero days ago (or slightly negative). This signals to the model that it is making a prediction at the very end of the training window, essentially forcing it to predict what is popular right now rather than what was popular on average in the past.
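
A minimal sketch of how such an age feature could be handled differently at training time and at serving time is shown below. The function names, the feature name example_age_days, and the choice of days as the unit are assumptions made for this example. The paper describes the idea, not this code.

```python
from datetime import datetime


def example_age_days(example_time, training_window_end):
    """Age of a training example, measured back from the end of the training window."""
    return (training_window_end - example_time).total_seconds() / 86400


def training_features(base_features, example_time, training_window_end):
    # During training, each example carries its real age.
    age = example_age_days(example_time, training_window_end)
    return {**base_features, "example_age_days": age}


def serving_features(base_features):
    # At serving time the age is fixed to zero, so the model predicts as if
    # it were standing at the very end of the training window.
    return {**base_features, "example_age_days": 0.0}


window_end = datetime(2016, 1, 10)
watched_at = datetime(2016, 1, 8)

train_row = training_features({"user_id": 7}, watched_at, window_end)
serve_row = serving_features({"user_id": 7})

print(train_row["example_age_days"])  # 2.0
print(serve_row["example_age_days"])  # 0.0
```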

Accuracy Of Click Data

Google’s foundational research paper also provides insights about implicit user feedback signals, which is a reference to click data. The researchers say that this kind of data rarely provides accurate user satisfaction information.

The researchers write:

“Noise: Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. We rarely obtain the ground truth of user satisfaction and instead model noisy implicit feedback signals. Furthermore, metadata associated with content is poorly structured without a well defined ontology. Our algorithms need to be robust to these particular characteristics of our training data.”

The researchers conclude the paper by stating that this approach to recommender systems helped increase user watch time and proved to be more effective than other systems.

They write:

“We have described our deep neural network architecture for recommending YouTube videos, split into two distinct problems: candidate generation and ranking.
Our deep collaborative filtering model is able to effectively assimilate many signals and model their interaction with layers of depth, outperforming previous matrix factorization approaches used at YouTube.

We demonstrated that using the age of the training example as an input feature removes an inherent bias towards the past and allows the model to represent the time-dependent behavior of popular of videos. This improved offline holdout precision results and increased the watch time dramatically on recently uploaded videos in A/B testing.

Ranking is a more classical machine learning problem yet our deep learning approach outperformed previous linear and tree-based methods for watch time prediction. Recommendation systems in particular benefit from specialized features describing past user behavior with items. Deep neural networks require special representations of categorical and continuous features which we transform with embeddings and quantile normalization, respectively.”

Although this research paper is ten years old, it still offers insights into how recommender systems work and takes a little of the mystery out of recommender systems like Google Discover. Read the original research paper: Deep Neural Networks for YouTube Recommendations


Google Ads Using New AI Model To Catch Fraudulent Advertisers via @sejournal, @martinibuster

Google published a research paper about a new AI model for detecting fraud in the Google Ads system that’s a strong improvement over what they were previously using. What’s interesting is that the research paper, dated December 31, 2025, says that the new AI is deployed, resulting in an improvement in the detection rate of over 40 percentage points on one policy and achieving 99.8% precision on another.

ALF: Advertiser Large Foundation Model

The new AI is called ALF (Advertiser Large Foundation Model), the details of which were published on December 31, 2025. ALF is a multimodal large foundation model that analyzes text, images, and video, together with factors like account age, billing details, and historical performance metrics.

The researchers explain that many of these factors in isolation won’t flag an account as potentially problematic, but that comparing all of these factors together provides a better understanding of advertiser behavior and intent.

They write:

“A core challenge in this ecosystem is to accurately and efficiently understand advertiser intent and behavior. This understanding is critical for several key applications, including matching users with ads and identifying fraud and policy violations.

Addressing this challenge requires a holistic approach, processing diverse data types including structured account information (e.g., account age, billing details), multi-modal ad creative assets (text, images, videos), and landing page content.

For example, an advertiser might have a recently created account, have text and image ads for a well known large brand, and have had a credit card payment declined once. Although each element could exist innocently in isolation, the combination strongly suggests a fraudulent operation.”

The researchers address three challenges that previous systems were unable to overcome:

1. Heterogeneous and High-Dimensional Data
Heterogeneous data refers to the fact that advertiser data comes in multiple formats, not just one type. This includes structured data like account age and billing type and unstructured data like creative assets such as images, text, and video. High-dimensional data refers to the hundreds or thousands of data points associated with each advertiser, causing the mathematical representation of each one to become high-dimensional, which presents challenges for conventional models.

2. Unbounded Sets of Creative Assets
Advertisers could have thousands of creative assets, such as images, and hide one or two malicious ones among thousands of innocent assets. This scenario overwhelmed the previous system.

3. Real-World Reliability and Trustworthiness
The system needs to be able to generate trustworthy confidence scores that a business has malicious intent because a false positive would otherwise affect an innocent advertiser. The system must be expected to work without having to constantly retune it to catch mistakes.

Privacy and Safety

Although ALF analyzes sensitive signals like billing history and account details, the researchers emphasize that the system is designed with strict privacy safeguards. Before the AI processes any data, all personally identifiable information (PII) is stripped away. This ensures that the model identifies risk based on behavioral patterns rather than sensitive personal data.

The Secret Sauce: How It Spots Outliers

The model also uses a technique called “Inter-Sample Attention” to improve its detection skills. Instead of analyzing a single advertiser in a vacuum, ALF looks at “large advertiser batches” to compare their interactions against one another. This allows the AI to learn what normal activity looks like across the entire ecosystem and make it more accurate in spotting suspicious outliers that don’t fit into normal behavior.
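
The general idea of attention across samples in a batch can be sketched as follows. This is only an illustration of the concept: it uses plain scaled dot-product attention with no learned weights, the two-dimensional vectors are invented, and nothing here should be read as a description of how ALF’s layer is actually built.

```python
import math


def softmax(values):
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def inter_sample_attention(batch):
    """Mix each sample's vector with every sample in the batch.

    Each sample acts as a query against all samples (including itself), and
    its output is the attention-weighted average of the batch.
    """
    dim = len(batch[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in batch:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in batch]
        weights = softmax(scores)
        mixed = [
            sum(w * sample[i] for w, sample in zip(weights, batch))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Invented advertiser vectors: three similar accounts and one outlier.
batch = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [-1.0, 2.0]]
contextualized = inter_sample_attention(batch)
```

Each output vector now carries information about the rest of the batch, which is what lets a model judge one advertiser relative to the others instead of in isolation.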

ALF Outperforms Production Benchmarks

The researchers explain that their tests show that ALF outperforms a heavily tuned production baseline:

“Our experiments show ALF significantly outperforms a heavily tuned production baseline while also performing strongly on public benchmarks. In production, ALF delivers substantial and simultaneous gains in precision and recall, boosting recall by over 40 percentage points on one critical policy while increasing precision to 99.8% on another.”

This result demonstrates that ALF can deliver measurable gains across multiple evaluation criteria under actual real-world production conditions, rather than just in offline or benchmarked environments.
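
For readers unfamiliar with the two metrics, the short example below shows how precision and recall are calculated and what a gain measured in percentage points means. The counts are invented purely for illustration and are not taken from the paper.

```python
def precision(true_positives, false_positives):
    """Share of flagged advertisers that really were violating policy."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of actual violators that the system managed to flag."""
    return true_positives / (true_positives + false_negatives)


# Invented counts for illustration only. They are not from the paper.
baseline_recall = recall(true_positives=50, false_negatives=50)
improved_recall = recall(true_positives=90, false_negatives=10)
gain_in_points = round((improved_recall - baseline_recall) * 100, 1)

example_precision = precision(true_positives=998, false_positives=2)

print(gain_in_points)     # 40.0
print(example_precision)  # 0.998
```

In this made-up case, recall rising from 50% to 90% is a gain of 40 percentage points, which is different from a 40% relative increase.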

Elsewhere they mention tradeoffs in speed:

“The effectiveness of this approach was validated against an exceptionally strong production baseline, itself the result of an extensive search across various architectures and hyperparameters, including DNNs, ensembles, GBDTs, and logistic regression with feature cross exploration.

While ALF’s latency is higher due to its larger model size, it remains well within the acceptable range for our production environment and can be further optimized using hardware accelerators. Experiments show ALF significantly outperforms the baseline on key risk detection tasks, a performance lift driven by its unique ability to holistically model content embeddings, which simpler architectures struggled to leverage. This trade-off is justified by its successful deployment, where ALF serves millions of requests daily.”

Latency refers to the amount of time the system takes to produce a response after receiving a request. The researchers’ data shows that although ALF increases this response time relative to the baseline, the latency remains acceptable for production use, and the system is already operating at scale while delivering substantially better fraud detection performance.

Improved Fraud Detection

The researchers say that ALF is now deployed to the Google Ads Safety system for identifying advertisers that are violating Google Ads policies. There is no indication that the system is being used elsewhere such as in Search or Google Business Profiles. But they did say that future work could focus on time-based factors (“temporal dynamics”) for catching evolving patterns. They also indicated that it could be useful for audience modeling and creative optimization.

Read the original PDF version of the research paper:

ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding


Google’s Recommender System Breakthrough Detects Semantic Intent via @sejournal, @martinibuster

Google published a research paper about helping recommender systems understand what users mean when they interact with them. Their goal with this new approach is to overcome the limitations inherent in the current state-of-the-art recommender systems in order to get a finer, detailed understanding of what users want to read, listen to, or watch at the level of the individual.

Personalized Semantics

Recommender systems predict what a user would like to read or watch next. YouTube, Google Discover, and Google News are examples of recommender systems for recommending content to users. Other kinds of recommender systems are shopping recommendations.

Recommender systems generally work by collecting data about the kinds of things a user clicks on, rates, buys, and watches and then using that data to suggest more content that aligns with a user’s preferences.

The researchers referred to those kinds of signals as primitive user feedback because they’re not so good at recommendations based on an individual’s subjective judgment about what’s funny, cute, or boring.

The intuition behind the research is that the rise of LLMs presents an opportunity to leverage natural language interactions to better understand what a user wants through identifying semantic intent.

The researchers explain:

“Interactive recommender systems have emerged as a promising paradigm to overcome the limitations of the primitive user feedback used by traditional recommender systems (e.g., clicks, item consumption, ratings). They allow users to express intent, preferences, constraints, and contexts in a richer fashion, often using natural language (including faceted search and dialogue).

Yet more research is needed to find the most effective ways to use this feedback. One challenge is inferring a user’s semantic intent from the open-ended terms or attributes often used to describe a desired item. This is critical for recommender systems that wish to support users in their everyday, intuitive use of natural language to refine recommendation results.”

The Soft Attributes Challenge

The researchers explained that hard attributes are something that recommender systems can understand because they are objective ground truths like “genre, artist, director.” What they had problems with were other kinds of attributes called “soft attributes” that are subjective and could not be reliably matched with movies, content, or product items.

The research paper states the following characteristics of soft attributes:

  • “There is no definitive “ground truth” source associating such soft attributes with items
  • The attributes themselves may have imprecise interpretations
  • And they may be subjective in nature (i.e., different users may interpret them differently)”

The problem of soft attributes is the problem that the researchers set out to solve and why the research paper is called Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors.

Novel Use Of Concept Activation Vectors (CAVs)

Concept Activation Vectors (CAVs) are a way to probe AI models to understand the mathematical representations (vectors) the models use internally. They provide a way for humans to connect those internal vectors to concepts.

So the standard direction of the CAV is interpreting the model. What the researchers did was to change that direction so that the goal is now to interpret the users, translating subjective soft attributes into mathematical representations for recommender systems. The researchers discovered that adapting CAVs to interpret users enabled vector representations that helped AI models detect subtle intent and subjective human judgments that are personalized to an individual.

As they write:

“We demonstrate … that our CAV representation not only accurately interprets users’ subjective semantics, but can also be used to improve recommendations through interactive item critiquing.”

For example, the model can learn that users mean different things by “funny” and be better able to leverage those personalized semantics when making recommendations.

The problem the researchers are solving is figuring out how to bridge the semantic gap between how humans speak and how recommender systems “think.”

Humans think in concepts, using vague or subjective descriptions (called soft attributes).

Recommender systems “think” in math: They operate on vectors (lists of numbers) in a high-dimensional “embedding space”.

The problem then becomes making the subjective human speech less ambiguous but without having to modify or retrain the recommender system with all the nuances. The CAVs do that heavy lifting.

The researchers explain:

“…we infer the semantics of soft attributes using the representation learned by the recommender system model itself.”

They list four advantages of their approach:

“(1) The recommender system’s model capacity is directed to predicting user-item preferences without further trying to predict additional side information (e.g., tags), which often does not improve recommender system performance.

(2) The recommender system model can easily accommodate new attributes without retraining should new sources of tags, keywords or phrases emerge from which to derive new soft attributes.

(3) Our approach offers a means to test whether specific soft attributes are relevant to predicting user preferences. Thus, we are able focus attention on attributes most relevant to capturing a user’s intent (e.g., when explaining recommendations, eliciting preferences, or suggesting critiques).

(4) One can learn soft attribute/tag semantics with relatively small amounts of labelled data, in the spirit of pre-training and few-shot learning.”

They then provide a high-level explanation of how the system works:

“At a high-level, our approach works as follows. we assume we are given:

(i) a collaborative filtering-style model (e.g.,probabilistic matrix factorization or dual encoder) which embeds items and users in a latent space based on user-item ratings; and

(ii) a (small) set of tags (i.e., soft attribute labels) provided by a subset of users for a subset of items.

We develop methods that associate with each item the degree to which it exhibits a soft attribute, thus determining that attribute’s semantics. We do this by applying concept activation vectors (CAVs) —a recent method developed for interpretability of machine-learned models—to the collaborative filtering model to detect whether it learned a representation of the attribute.

The projection of this CAV in embedding space provides a (local) directional semantics for the attribute that can then be applied to items (and users). Moreover, the technique can be used to identify the subjective nature of an attribute, specifically, whether different users have different meanings (or tag senses) in mind when using that tag. Such a personalized semantics for subjective attributes can be vital to the sound interpretation of a user’s true intent when trying to assess her preferences.”
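The idea in the quoted passage can be illustrated with a minimal sketch. It is not the paper's code: the toy embeddings are random stand-ins for what a collaborative filtering model would learn, the planted "funny" axis is invented for the demonstration, and the names train_cav and attribute_scores are hypothetical. It assumes a CAV can be approximated by the weight vector of a simple logistic-regression probe trained on a small set of tagged items.

```python
import numpy as np

def train_cav(item_embeddings, has_tag, steps=200, lr=0.1):
    """Fit a logistic-regression probe on item embeddings and return its
    unit-length weight vector, used here as the concept activation vector."""
    X = np.asarray(item_embeddings, dtype=float)
    y = np.asarray(has_tag, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted tag probability
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on the weights
        b -= lr * np.mean(p - y)                # gradient step on the bias
    return w / np.linalg.norm(w)

def attribute_scores(cav, embeddings):
    """Project embeddings onto the CAV: higher means more of the attribute."""
    return np.asarray(embeddings, dtype=float) @ cav

# Toy data (assumption): 200 items in an 8-dimensional latent space.
rng = np.random.default_rng(0)
items = rng.normal(size=(200, 8))
hidden_direction = np.eye(8)[0]            # pretend "funny" lives on axis 0
tags = (items @ hidden_direction > 0).astype(float)

cav = train_cav(items, tags)
scores = attribute_scores(cav, items)
alignment = float(cav @ hidden_direction)  # cosine similarity to the planted axis
```

The recommender's embeddings are never modified in this sketch. Only the small probe is trained, which mirrors the paper's claim that new attributes can be accommodated without retraining the recommender model. Training a separate probe per user (or per group of users) would be one way to explore the personalized tag senses the quote mentions.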

Does This System Work?

One of the interesting findings is that their test of an artificial tag (odd year) showed that the system’s accuracy rate was barely above a random selection, which corroborated their hypothesis that “CAVs are useful for identifying preference related attributes/tags.”

They also found that using CAVs in recommender systems was useful for understanding “critiquing-based” user behavior and improved those kinds of recommender systems.

The researchers listed four benefits:

“(i) using a collaborative filtering representation to identify attributes of greatest relevance to the recommendation task;

(ii) distinguishing objective and subjective tag usage;

(iii) identifying personalized, user-specific semantics for subjective attributes; and

(iv) relating attribute semantics to preference representations, thus allowing interactions using soft attributes/tags in example critiquing and other forms of preference elicitation.”

They found that their approach improved recommendations for situations where discovery of soft attributes is important. Using this approach for situations in which hard attributes are more the norm, such as in product shopping, is a future area of study to see if soft attributes would aid in making product recommendations.

Takeaways

The research paper was published in 2024 and I had to dig around to actually find it, which may explain why it generally went unnoticed in the search marketing community.

Google tested some of this approach with an algorithm called WALS (Weighted Alternating Least Squares), actual production code that is a product in Google Cloud for developers.

Two notes in a footnote and in the appendix explain:

“CAVs on MovieLens20M data with linear attributes use embeddings that were learned (via WALS) using internal production code, which is not releasable.”

“…The linear embeddings were learned (via WALS, Appendix A.3.1) using internal production code, which is not releasable.”

“Production code” refers to software that is currently running in Google’s user-facing products, in this case Google Cloud. It’s likely not the underlying engine for Google Discover, however it’s important to note because it shows how easily it can be integrated into an existing recommender system.

They tested this system using the MovieLens20M dataset, which is a public dataset of 20 million ratings, with some of the tests done with Google’s proprietary recommendation engine (WALS). This lends credibility to the inference that this code can be used on a live system without having to retrain or modify it.

The takeaway that I see in this research paper is that this makes it possible for recommender systems to leverage semantic data about soft attributes. Google Discover is regarded by Google as a subset of search, and search patterns are some of the data that the system uses to surface content. Google doesn’t say whether they are using this kind of method, but given the positive results, it is possible that this approach could be used in Google’s recommender systems. If that’s the case, then that means Google’s recommendations may be more responsive to users’ subjective semantics.

The research paper credits Google Research (60% of the credits), and also Amazon, Midjourney, and Meta AI.

The PDF is available here:

Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors

Featured Image by Shutterstock/Here

Google’s New Graph Foundation Model Catches Spam Up To 40x Better via @sejournal, @martinibuster

Google published details of a new kind of AI based on graphs called a Graph Foundation Model (GFM) that generalizes to previously unseen graphs and delivers a three to forty times boost in precision over previous methods, with successful testing in scaled applications such as spam detection in ads.

The announcement of this new technology is referred to as expanding the boundaries of what has been possible up to today:

“Today, we explore the possibility of designing a single model that can excel on interconnected relational tables and at the same time generalize to any arbitrary set of tables, features, and tasks without additional training. We are excited to share our recent progress on developing such graph foundation models (GFM) that push the frontiers of graph learning and tabular ML well beyond standard baselines.”

Google's Graph Foundation Model shows 3-40 times performance improvement in precision

Graph Neural Networks Vs. Graph Foundation Models

Graphs are representations of data that are related to each other. The connections between the objects are called edges and the objects themselves are called nodes. In SEO, the most familiar type of graph could be said to be the Link Graph, which is a map of the entire web by the links that connect one web page to another.

Current technology uses Graph Neural Networks (GNNs) to represent data like web page content and can be used to identify the topic of a web page.

A Google Research blog post about GNNs explains their importance:

“Graph neural networks, or GNNs for short, have emerged as a powerful technique to leverage both the graph’s connectivity (as in the older algorithms DeepWalk and Node2Vec) and the input features on the various nodes and edges. GNNs can make predictions for graphs as a whole (Does this molecule react in a certain way?), for individual nodes (What’s the topic of this document, given its citations?)…

Apart from making predictions about graphs, GNNs are a powerful tool used to bridge the chasm to more typical neural network use cases. They encode a graph’s discrete, relational information in a continuous way so that it can be included naturally in another deep learning system.”

The downside to GNNs is that they are tethered to the graph on which they were trained and can’t be used on a different kind of graph. To work with a different graph, Google has to train another model specifically for that graph.

To make an analogy, it’s like having to train a new generative AI model on French-language documents just to get it to work in French. LLMs don’t have that limitation because they can generalize to other languages, but models that work with graphs do. This is the problem that the invention solves: creating a model that generalizes to other graphs without having to be trained on them first.

The breakthrough that Google announced is that with the new Graph Foundation Models, Google can now train a model that can generalize across new graphs that it hasn’t been trained on and understand patterns and connections within those graphs. And it can do it three to forty times more precisely.

Announcement But No Research Paper

Google’s announcement does not link to a research paper. It’s been variously reported that Google has decided to publish fewer research papers and this is a big example of that policy change. Is it because this innovation is so big they want to keep this as a competitive advantage?

How Graph Foundation Models Work

In a conventional graph, let’s say a graph of the Internet, web pages are the nodes. The links between the nodes (web pages) are called the edges. In that kind of graph, you can see similarities between pages because the pages about a specific topic tend to link to other pages about the same specific topic.

In very simple terms, a Graph Foundation Model turns every row in every table into a node and connects related nodes based on the relationships in the tables. The result is a single large graph that the model uses to learn from existing data and make predictions (like identifying spam) on new data.

Screenshot Of Five Tables

Image by Google

Transforming Tables Into A Single Graph

The announcement says this about the following images, which illustrate the process:

“Data preparation consists of transforming tables into a single graph, where each row of a table becomes a node of the respective node type, and foreign key columns become edges between the nodes. Connections between five tables shown become edges in the resulting graph.”

Screenshot Of Tables Converted To Edges

Image by Google

What makes this new model exceptional is that the process of creating it is “straightforward” and it scales. The part about scaling is important because it means that the invention is able to work across Google’s massive infrastructure.

“We argue that leveraging the connectivity structure between tables is key for effective ML algorithms and better downstream performance, even when tabular feature data (e.g., price, size, category) is sparse or noisy. To this end, the only data preparation step consists of transforming a collection of tables into a single heterogeneous graph.

The process is rather straightforward and can be executed at scale: each table becomes a unique node type and each row in a table becomes a node. For each row in a table, its foreign key relations become typed edges to respective nodes from other tables while the rest of the columns are treated as node features (typically, with numerical or categorical values). Optionally, we can also keep temporal information as node or edge features.”
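The data preparation step described in the quote can be sketched in a few lines. This is only an illustration under stated assumptions: the three tables, their column names, and the tables_to_graph helper are hypothetical, and every row is assumed to carry an "id" column as its primary key. It is not Google's implementation.

```python
def tables_to_graph(tables, foreign_keys):
    """Turn relational tables into one heterogeneous graph.

    tables: {table_name: [row dicts, each with an "id" primary key]}
    foreign_keys: {(table_name, column): referenced_table_name}
    Returns (nodes, edges): nodes map (table, id) -> feature dict, and
    edges are (source_node, target_node, edge_type) triples.
    """
    nodes, edges = {}, []
    for table_name, rows in tables.items():
        # Foreign key columns of this table become typed edges.
        fk_columns = {col: target
                      for (table, col), target in foreign_keys.items()
                      if table == table_name}
        for row in rows:
            node = (table_name, row["id"])  # each row becomes a node of its table's type
            # The remaining columns are kept as node features.
            nodes[node] = {k: v for k, v in row.items()
                           if k != "id" and k not in fk_columns}
            for col, target in fk_columns.items():
                if row.get(col) is not None:
                    edges.append((node, (target, row[col]), f"{table_name}.{col}"))
    return nodes, edges

# Hypothetical example tables.
tables = {
    "users": [{"id": 1, "country": "US"}, {"id": 2, "country": "DE"}],
    "products": [{"id": 10, "price": 9.99}],
    "transactions": [
        {"id": 100, "user_id": 1, "product_id": 10, "amount": 2},
        {"id": 101, "user_id": 2, "product_id": 10, "amount": 1},
    ],
}
foreign_keys = {
    ("transactions", "user_id"): "users",
    ("transactions", "product_id"): "products",
}
nodes, edges = tables_to_graph(tables, foreign_keys)
```

In this toy example the two transactions connect both users to the same product, which is the kind of cross-table signal a single-table model would not see.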

Tests Are Successful

Google’s announcement says that they tested it in identifying spam in Google Ads, which was difficult because it’s a system that uses dozens of large graphs. Current systems are unable to make connections between unrelated graphs and miss important context.

Google’s new Graph Foundation Model was able to make the connections between all the graphs and improved performance.

The announcement described the achievement:

“We observe a significant performance boost compared to the best tuned single-table baselines. Depending on the downstream task, GFM brings 3x – 40x gains in average precision, which indicates that the graph structure in relational tables provides a crucial signal to be leveraged by ML models.”
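For readers unfamiliar with the metric, average precision rewards a model for ranking true positives (for example, actual spam) near the top of its list. The sketch below uses one common definition and invented toy rankings. The announcement does not spell out the exact evaluation setup, so treat the function and the numbers as illustrative assumptions only.

```python
def average_precision(ranked_labels):
    """Mean of precision@k over every rank k that holds a true positive.

    ranked_labels: 1/0 flags ordered from the model's most to least
    confident prediction (1 = actually spam).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Invented rankings of the same five items by two hypothetical models.
baseline_ranking = [0, 0, 1, 0, 1]  # spam surfaces at ranks 3 and 5
improved_ranking = [1, 1, 0, 0, 0]  # spam surfaces at ranks 1 and 2

baseline_ap = average_precision(baseline_ranking)
improved_ap = average_precision(improved_ranking)
gain = improved_ap / baseline_ap
```

A "3x gain" in this sense means the ratio of the two average precision scores is about three, not that three times as many items were flagged.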

Is Google Using This System?

It’s notable that Google successfully tested the system with Google Ads for spam detection and reported upsides and no downsides. This means that it can be used in a live environment for a variety of real-world tasks. They used it for Google Ads spam detection, and because it’s a flexible model, it can be used for other tasks that involve multiple graphs, from identifying content topics to identifying link spam.

Normally, when something falls short, research papers and announcements say that it points the way for future work, but that’s not how this new invention is presented. It’s presented as a success and it ends with a statement saying that these results can be further improved, meaning it can get even better than these already spectacular results.

“These results can be further improved by additional scaling and diverse training data collection together with a deeper theoretical understanding of generalization.”

Read Google’s announcement:

Graph foundation models for relational data

Featured Image by Shutterstock/SidorArt

Google’s Trust Ranking Patent Shows How User Behavior Is A Signal via @sejournal, @martinibuster

Google long ago filed a patent for ranking search results by trust. The groundbreaking idea behind the patent is that user behavior can be used as a starting point for developing a ranking signal.

The big idea behind the patent is that the Internet is full of websites all linking to and commenting about each other. But which sites are trustworthy? Google’s solution is to utilize user behavior to indicate which sites are trusted and then use the linking and content on those sites to reveal more sites that are trustworthy for any given topic.

PageRank is basically the same thing only it begins and ends with one website linking to another website. The innovation of Google’s trust ranking patent is to put the user at the start of that trust chain like this:

User trusts X Websites > X Websites trust Other Sites > This feeds into Google as a ranking signal

The trust originates from the user and flows to trust sites that themselves provide anchor text, lists of other sites and commentary about other sites.

That, in a nutshell, is what Google’s trust-based ranking algorithm is about.

The deeper insight is that it reveals Google’s groundbreaking approach to letting users be a signal of what’s trustworthy. You know how Google keeps saying to create websites for users? This is what the trust patent is all about, putting the user in the front seat of the ranking algorithm.

Google’s Trust And Ranking Patent

The patent was coincidentally filed around the same period that Yahoo and Stanford University published a Trust Rank research paper which is focused on identifying spam pages.

Google’s patent is not about finding spam. It’s focused on doing the opposite, identifying trustworthy web pages that satisfy the user’s intent for a search query.

How Trust Factors Are Used

The first part of any patent consists of an Abstract section that offers a very general description of the invention, and that’s what this patent does as well.

The patent abstract asserts:

  • That trust factors are used to rank web pages.
  • The trust factors are generated from “entities” (which are later described to be the users themselves, experts, expert web pages, and forum members) that link to or comment about other web pages.
  • Those trust factors are then used to re-rank web pages.
  • Re-ranking web pages kicks in after the normal ranking algorithm has done its thing with links, etc.

Here’s what the Abstract says:

“A search engine system provides search results that are ranked according to a measure of the trust associated with entities that have provided labels for the documents in the search results.

A search engine receives a query and selects documents relevant to the query.

The search engine also determines labels associated with selected documents, and the trust ranks of the entities that provided the labels.

The trust ranks are used to determine trust factors for the respective documents. The trust factors are used to adjust information retrieval scores of the documents. The search results are then ranked based on the adjusted information retrieval scores.”

As you can see, the Abstract does not say who the “entities” are nor does it say what the labels are yet, but it will.

Field Of The Invention

The next part is called the Field Of The Invention. The purpose is to describe the technical domain of the invention (which is information retrieval) and the focus (trust relationships between users) for the purpose of ranking web pages.

Here’s what it says:

“The present invention relates to search engines, and more specifically to search engines that use information indicative of trust relationship between users to rank search results.”

Now we move on to the next section, the Background, which describes the problem this invention solves.

Background Of The Invention

This section describes why search engines fall short of answering user queries (the problem) and why the invention solves the problem.

The main problems described are:

  • Search engines are essentially guessing (inference) what the user’s intent is when they only use the search query.
  • Users rely on expert-labeled content from trusted sites (called vertical knowledge sites) to tell them which web pages are trustworthy.
  • The patent explains why content labeled as relevant or trustworthy is important but ignored by search engines.
  • It’s important to remember that this patent came out before the BERT algorithm and other natural language approaches that are now used to better understand search queries.

This is how the patent explains it:

“An inherent problem in the design of search engines is that the relevance of search results to a particular user depends on factors that are highly dependent on the user’s intent in conducting the search—that is why they are conducting the search—as well as the user’s circumstances, the facts pertaining to the user’s information need.

Thus, given the same query by two different users, a given set of search results can be relevant to one user and irrelevant to another, entirely because of the different intent and information needs.”

Next it goes on to explain that users trust certain websites that provide information about certain topics:

“…In part because of the inability of contemporary search engines to consistently find information that satisfies the user’s information need, and not merely the user’s query terms, users frequently turn to websites that offer additional analysis or understanding of content available on the Internet.”

Websites Are The Entities

The rest of the Background section names forums, review sites, blogs, and news websites as places that users turn to for their information needs, calling them vertical knowledge sites. Vertical Knowledge sites, it’s explained later, can be any kind of website.

The patent explains that trust is why users turn to those sites:

“This degree of trust is valuable to users as a way of evaluating the often bewildering array of information that is available on the Internet.”

To recap, the “Background” section explains that the trust relationships between users and entities like forums, review sites, and blogs can be used to influence the ranking of search results. As we go deeper into the patent we’ll see that the entities are not limited to the above kinds of sites, they can be any kind of site.

Patent Summary Section

This part of the patent is interesting because it brings together all of the concepts into one place, but in a general high-level manner, and throws in some legal paragraphs that explain that the patent can apply to a wider scope than is set out in the patent.

The Summary section appears to have four sections:

  • The first section explains that a search engine ranks web pages that are trusted by entities (like forums, news sites, blogs, etc.) and that the system maintains information about these labels about trusted web pages.
  • The second section offers a general description of the work of the entities (like forums, news sites, blogs, etc.).
  • The third offers a general description of how the system works, beginning with the query, the assorted hand waving that goes on at the search engine with regard to the entity labels, and then the search results.
  • The fourth part is a legal explanation that the patent is not limited to the descriptions and that the invention applies to a wider scope. This is important. It enables Google to use a non-existent thing, even something as nutty as a “trust button” that a user selects to identify a site as being trustworthy as an example. This enables an example like a non-existent “trust button” to be a stand-in for something else, like navigational queries or Navboost or anything else that is a signal that a user trusts a website.

Here’s a nutshell explanation of how the system works:

  • The user visits sites that they trust and click a “trust button” that tells the search engine that this is a trusted site.
  • The trusted site “labels” other sites as trusted for certain topics (the label could be a topic like “symptoms”).
  • A user asks a question at a search engine (a query) and uses a label (like “symptoms”).
  • The search engine ranks websites according to the usual manner then it looks for sites that users trust and sees if any of those sites have used labels about other sites.
  • Google ranks those other sites that have had labels assigned to them by the trusted sites.

Here’s an abbreviated version of the third part of the Summary that gives an idea of the inner workings of the invention:

“A user provides a query to the system…The system retrieves a set of search results… The system determines which query labels are applicable to which of the search result documents. … determines for each document an overall trust factor to apply… adjusts the …retrieval score… and reranks the results.”

Here’s that same section in its entirety:

  • “A user provides a query to the system; the query contains at least one query term and optionally includes one or more labels of interest to the user.
  • The system retrieves a set of search results comprising documents that are relevant to the query term(s).
  • The system determines which query labels are applicable to which of the search result documents.
  • The system determines for each document an overall trust factor to apply to the document based on the trust ranks of those entities that provided the labels that match the query labels.
  • Applying the trust factor to the document adjusts the document’s information retrieval score, to provide a trust adjusted information retrieval score.
  • The system reranks the search result documents based at on the trust adjusted information retrieval scores.”

The above is a general description of the invention.
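The steps in that list can be sketched as a small re-ranking function. The patent text quoted here does not give the formula, so the way trust ranks are combined (a simple sum) and applied (multiplying the retrieval score by one plus the sum) are assumptions made for illustration, as are the function name, the example URLs, and the scores.

```python
def rerank_by_trust(results, annotations, trust_rank, query_labels):
    """Re-rank search results by a trust-adjusted retrieval score.

    results: list of (url, ir_score) from the normal ranking step
    annotations: list of (entity, label, url) records
    trust_rank: {entity: trust rank}
    query_labels: set of labels taken from the query
    """
    adjusted = []
    for url, ir_score in results:
        # Entities that labeled this document with a label the query asked for.
        labelers = {entity for entity, label, labeled_url in annotations
                    if labeled_url == url and label in query_labels}
        # Assumption: the trust factor is 1 plus the summed trust ranks.
        trust_factor = 1.0 + sum(trust_rank.get(entity, 0.0) for entity in labelers)
        adjusted.append((url, ir_score * trust_factor))
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

# Hypothetical data.
results = [("a.example/page", 0.9), ("b.example/page", 0.8)]
annotations = [("forum.example", "symptoms", "b.example/page")]
trust_rank = {"forum.example": 0.5}

reranked = rerank_by_trust(results, annotations, trust_rank, {"symptoms"})
```

In this toy run the page labeled by a trusted forum overtakes a page that started with a higher retrieval score, which is the behavior the Summary describes: ranking happens first and the trust adjustment is applied afterward.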

The next section, called Detailed Description, deep dives into the details. At this point it’s becoming increasingly evident that the patent is highly nuanced and cannot be reduced to simple advice similar to: “optimize your site like this to earn trust.”

A large part of the patent hinges on a trust button and an advanced search query:  label:

Neither the trust button nor the label advanced search query has ever existed. As you’ll see, they are quite probably stand-ins for techniques that Google doesn’t want to explicitly reveal.

Detailed Description In Four Parts

The details of this patent are located in four sections within the Detailed Description section of the patent. This patent is not as simple as 99% of SEOs say it is.

These are the four sections:

  1. System Overview
  2. Obtaining and Storing Trust Information
  3. Obtaining and Storing Label Information
  4. Generating Trust Ranked Search Results

The System Overview is where the patent deep dives into the specifics. The following is an overview to make it easy to understand.

System Overview

1. Explains how the invention (a search engine system) ranks search results based on trust relationships between users and the user-trusted entities who label web content.

2. The patent describes a “trust button” that a user can click that tells Google that a user trusts a website or trusts the website for a specific topic or topics.

3. The patent says a trust related score is assigned to a website when a user clicks a trust button on a website.

4. The trust button information is stored in a trust database that’s referred to as #190.

Here’s what it says about assigning a trust rank score based on the trust button:

“The trust information provided by the users with respect to others is used to determine a trust rank for each user, which is measure of the overall degree of trust that users have in the particular entity.”

Trust Rank Button

The patent refers to the “trust rank” of the user-trusted websites. That trust rank is based on a trust button that a user clicks to indicate that they trust a given website, assigning a trust rank score.

The patent says:

“…the user can click on a “trust button” on a web page belonging to the entity, which causes a corresponding record for a trust relationship to be recorded in the trust database 190.

In general any type of input from the user indicating that such as trust relationship exists can be used.”

The trust button has never existed and the patent quietly acknowledges this by stating that any type of input can be used to indicate the trust relationship.

So what is it? I believe that the “trust button” is a stand-in for user behavior metrics in general, and site visitor data in particular. The patent Claims section does not mention trust buttons at all but does mention user visitor data as an indicator of trust.

Here are several passages that mention site visits as a way to understand if a user trusts a website:

“The system can also examine web visitation patterns of the user and can infer from the web visitation patterns which entities the user trusts. For example, the system can infer that a particular user trust a particular entity when the user visits the entity’s web page with a certain frequency.”

The same thing is stated in the Claims section of the patent. It’s the very first claim they make for the invention:

“A method performed by data processing apparatus, the method comprising:
determining, based on web visitation patterns of a user, one or more trust relationships indicating that the user trusts one or more entities;”

It may very well be that site visitation patterns and other user behaviors are what is meant by the “trust button” references.
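As a rough illustration of inferring trust from visitation patterns, the sketch below treats a site as trusted by a user once it has been visited a minimum number of times, then scores each site by the share of users who trust it. The patent only says "with a certain frequency," so the threshold of three visits, the share-of-users trust rank, the function names, and the visit logs are all assumptions.

```python
from collections import Counter

def infer_trusted_entities(visits, min_visits=3):
    """Return the sites a user visited at least min_visits times."""
    counts = Counter(visits)
    return {site for site, n in counts.items() if n >= min_visits}

def trust_ranks(trusted_by_user):
    """Score each site by the fraction of users who trust it (assumption)."""
    total_users = len(trusted_by_user)
    counts = Counter(site
                     for trusted in trusted_by_user.values()
                     for site in trusted)
    return {site: n / total_users for site, n in counts.items()}

# Hypothetical visit logs.
visit_log = {
    "alice": ["camera.example"] * 4 + ["news.example"],
    "bob": ["camera.example"] * 3 + ["news.example"] * 3,
}
trusted_by_user = {user: infer_trusted_entities(visits)
                   for user, visits in visit_log.items()}
ranks = trust_ranks(trusted_by_user)
```

No button click is needed in this sketch. The trust relationship falls out of ordinary browsing behavior, which is consistent with the reading that the "trust button" stands in for user behavior signals.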

Labels Generated By Trusted Sites

The patent defines trusted entities as news sites, blogs, forums, and review sites, but they are not limited to those kinds of sites; a trusted entity could be any other kind of website.

Trusted websites create references to other sites and in that reference they label those other sites as being relevant to a particular topic. That label could be an anchor text. But it could be something else.

The patent explicitly mentions anchor text only once:

“In some cases, an entity may simply create a link from its site to a particular item of web content (e.g., a document) and provide a label 107 as the anchor text of the link.”

Although it only explicitly mentions anchor text once, there are other passages where anchor text is strongly implied. For example, the patent offers a general description of labels as describing or categorizing the content found on another site:

“…labels are words, phrases, markers or other indicia that have been associated with certain web content (pages, sites, documents, media, etc.) by others as descriptive or categorical identifiers.”

Labels And Annotations

Trusted sites link out to web pages with labels and links. The combination of a label and a link is called an annotation.

This is how it’s described:

“An annotation 106 includes a label 107 and a URL pattern associated with the label; the URL pattern can be specific to an individual web page or to any portion of a web site or pages therein.”
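An annotation, as described, is a small record: a label plus a URL pattern. The sketch below models it as a dataclass and looks up which labels apply to a given URL. The patent excerpt does not specify the pattern syntax, so shell-style wildcards (via Python's fnmatch module) are an assumption, as are the field names and the example sites.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

@dataclass(frozen=True)
class Annotation:
    entity: str       # who provided the label (hypothetical field)
    label: str        # e.g. "expert level"
    url_pattern: str  # assumption: shell-style wildcard pattern

def labels_for(url, annotations):
    """Return every label whose URL pattern matches the given URL."""
    return {a.label for a in annotations if fnmatchcase(url, a.url_pattern)}

# Hypothetical annotations from a camera expert's site.
annotations = [
    # Pattern covering a whole section of a site.
    Annotation("camera-expert.example", "professional review",
               "reviews.example/cameras/*"),
    # Pattern specific to one individual page.
    Annotation("camera-expert.example", "expert level",
               "tech.example/sensor-noise"),
]

section_labels = labels_for("reviews.example/cameras/x100", annotations)
page_labels = labels_for("tech.example/sensor-noise", annotations)
other_labels = labels_for("shop.example/cart", annotations)
```

The two annotations show both cases the quote mentions: a pattern that covers a portion of a site and a pattern that targets a single page.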

Labels Used In Search Queries

Users can also search with “labels” in their queries by using a non-existent “label:” advanced search query. Those kinds of queries are then used to match the labels that a website page is associated with.

This is how it’s explained:

“For example, a query “cancer label:symptoms” includes the query term “cancer” and a query label “symptoms”, and thus is a request for documents relevant to cancer, and that have been labeled as relating to “symptoms.”

Labels such as these can be associated with documents from any entity, whether the entity created the document, or is a third party. The entity that has labeled a document has some degree of trust, as further described below.”
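As a rough illustration of how a query like “cancer label:symptoms” could be split into query terms and query labels, here is a small sketch. The function name and the whitespace-based tokenizing are assumptions on my part; the patent only describes the “label:” syntax, not a parser.

```python
def parse_query(query: str) -> tuple[list[str], list[str]]:
    """Split a query into plain terms and "label:" values (illustrative only)."""
    prefix = "label:"
    terms, labels = [], []
    for token in query.split():
        if token.lower().startswith(prefix) and len(token) > len(prefix):
            # Everything after "label:" is treated as the query label.
            labels.append(token[len(prefix):])
        else:
            terms.append(token)
    return terms, labels
```

The labels returned here are what would then be matched against the labels that trusted entities have attached to documents.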

What is that label in the search query? It could simply be certain descriptive keywords, but there aren’t any clues to speculate further than that.

The patent puts it all together like this:

“Using the annotation information and trust information from the trust database 190, the search engine 180 determines a trust factor for each document.”

Takeaway:

A user’s trust is in a website. That user-trusted website is not necessarily the one that’s ranked, it’s the website that’s linking/trusting another relevant web page. The web page that is ranked can be the one that the trusted site has labeled as relevant for a specific topic and it could be a web page in the trusted site itself. The purpose of the user signals is to provide a starting point, so to speak, from which to identify trustworthy sites.

Experts Are Trusted

Vertical Knowledge Sites, sites that users trust, can host the commentary of experts. The expert could be the publisher of the trusted site as well. Experts are important because links from expert sites are used as part of the ranking process.

Experts are defined as publishing a deep level of content on the topic:

“These and other vertical knowledge sites may also host the analysis and comments of experts or others with knowledge, expertise, or a point of view in particular fields, who again can comment on content found on the Internet.

For example, a website operated by a digital camera expert and devoted to digital cameras typically includes product reviews, guidance on how to purchase a digital camera, as well as links to camera manufacturer’s sites, new products announcements, technical articles, additional reviews, or other sources of content.

“To assist the user, the expert may include comments on the linked content, such as labeling a particular technical article as “expert level,” or a particular review as “negative professional review,” or a new product announcement as “new 10MP digital SLR”.”

Links From Expert Sites

Links and annotations from user-trusted expert sites are described as sources of trust information:

“For example, Expert may create an annotation 106 including the label 107 “Professional review” for a review 114 of Canon digital SLR camera on a web site “www.digitalcameraworld.com”, a label 107 of “Jazz music” for a CD 115 on the site “www.jazzworld.com”, a label 107 of “Classic Drama” for the movie 116 “North by Northwest” listed on website “www.movierental.com”, and a label 107 of “Symptoms” for a group of pages describing the symptoms of colon cancer on a website 117 “www.yourhealth.com”.

Note that labels 107 can also include numerical values (not shown), indicating a rating or degree of significance that the entity attaches to the labeled document.

Expert’s web site 105 can also include trust information. More specifically, Expert’s web site 105 can include a trust list 109 of entities whom Expert trusts. This list may be in the form of a list of entity names, the URLs of such entities’ web pages, or by other identifying information. Expert’s web site 105 may also include a vanity list 111 listing entities who trust Expert; again this may be in the form of a list of entity names, URLs, or other identifying information.”

Inferred Trust

The patent describes additional signals that can be used to signal (infer) trust. These are more traditional type signals like links, a list of trusted web pages (maybe a resources page?) and a list of sites that trust the website.

These are the inferred trust signals:

“(1) links from the user’s web page to web pages belonging to trusted entities;
(2) a trust list that identifies entities that the user trusts; or
(3) a vanity list which identifies users who trust the owner of the vanity page.”

Another kind of trust signal that can be inferred is from identifying sites that a user tends to visit.

The patent explains:

“The system can also examine web visitation patterns of the user and can infer from the web visitation patterns which entities the user trusts. For example, the system can infer that a particular user trusts a particular entity when the user visits the entity’s web page with a certain frequency.”
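Here is a minimal sketch of what inferring trust from visit frequency might look like. The threshold of three visits and the function name are hypothetical; the patent only says “a certain frequency” and doesn't give a number.

```python
from collections import Counter


def infer_trusted_sites(visits: list[str], min_visits: int = 3) -> set[str]:
    """Return sites visited at least min_visits times (assumed threshold)."""
    counts = Counter(visits)
    return {site for site, n in counts.items() if n >= min_visits}
```

For example, a visit log with three visits to one site and one visit to another would, under this assumed threshold, mark only the first site as trusted.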

Takeaway:

That’s a pretty big signal and I believe that it suggests that promotional activities that encourage potential site visitors to discover a site and then become loyal site visitors can be helpful. For example, that kind of signal can be tracked with branded search queries. It could be that Google is only looking at site visit information but I think that branded queries are an equally trustworthy signal, especially when those queries are accompanied by labels… ding, ding, ding!

The patent also lists some more unusual examples of inferred trust, like contact/chat list data. It doesn’t say social media, just contact/chat lists.

Trust Can Decay or Increase

Another interesting feature of trust rank is that it can decay or increase over time.

The patent is straightforward about this part:

“Note that trust relationships can change. For example, the system can increase (or decrease) the strength of a trust relationship for a trusted entity. The search engine system 100 can also cause the strength of a trust relationship to decay over time if the trust relationship is not affirmed by the user, for example by visiting the entity’s web site and activating the trust button 112.”
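The patent doesn't say what form the decay takes, so the following is only a sketch under the assumption of exponential decay with a half-life. Both the half-life value and the function name are made up for illustration.

```python
def decayed_trust(strength: float, days_since_affirmed: float,
                  half_life_days: float = 90.0) -> float:
    """Halve the trust strength every half_life_days (assumed decay model)."""
    return strength * 0.5 ** (days_since_affirmed / half_life_days)
```

In this model, a visit that affirms the relationship would reset days_since_affirmed to zero, which restores the full strength.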

Trust Relationship Editor User Interface

Directly after the above paragraph is a section about enabling users to edit their trust relationships through a user interface. There has never been such a thing, just like the non-existent trust button.

This is possibly a stand-in for something else. Could this trusted sites dashboard be Chrome browser bookmarks or sites that are followed in Discover? This is a matter for speculation.

Here’s what the patent says:

“The search engine system 100 may also expose a user interface to the trust database 190 by which the user can edit the user trust relationships, including adding or removing trust relationships with selected entities.

The trust information in the trust database 190 is also periodically updated by crawling of web sites, including sites of entities with trust information (e.g., trust lists, vanity lists); trust ranks are recomputed based on the updated trust information.”

What Google’s Trust Patent Is About

Google’s Search Result Ranking Based On Trust patent describes a way of leveraging user-behavior signals to understand which sites are trustworthy. The system then identifies sites that are trusted by the user-trusted sites and uses that information as a ranking signal. There is no actual trust rank metric, but there are ranking signals related to what users trust. Those signals can decay or increase based on factors like whether a user still visits those sites.

The larger takeaway is that this patent is an example of how Google is focused on user signals as a ranking source, so that they can feed that back into ranking sites that meet their needs. This means that instead of doing things because “this is what Google likes,” it’s better to go even deeper and do things because users like it. That will feed back to Google through these kinds of algorithms that measure user behavior patterns, something we all know Google uses.

Featured Image by Shutterstock/samsulalam

Google’s Local Job Type Algorithm Detailed In Research Paper via @sejournal, @martinibuster

Google published a research paper describing how it extracts “services offered” information from local business sites to add it to business profiles in Google Maps and Search. The algorithm describes specific relevance factors and confirms that the system has been successfully in use for a year.

What makes this research paper especially notable is that one of the authors is Marc Najork, a distinguished research scientist at Google who is associated with many milestones in information retrieval, natural language processing, and artificial intelligence.

The purpose of this system is to make it easier for users to find local businesses that provide the services they are looking for. The paper was published in 2024 (according to the Internet Archive) and is dated 2023.

The research paper explains:

“…to reduce user effort, we developed and deployed a pipeline to automatically extract the job types from business websites. For example, if a web page owned by a plumbing business states: “we provide toilet installation and faucet repair service”, our pipeline outputs toilet installation and faucet repair as the job types for this business.”

Developing A Local Search System

The first step for creating a system for crawling and extracting job type information was to create training data from scratch. They selected billions of home pages that are listed in Google business profiles and extracted job type information from tables and formatted lists on home pages or pages that were one click away from the home pages. This job type data became the seed set of job types.

The extracted job type data was used as search queries, augmented with query expansion (synonyms) to expand the list of job types to include all possible variations of job type keyword phrases.
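A simple way to picture synonym-based query expansion is to swap each word in a seed job type for its known alternatives and generate every combination. The synonym table below is hypothetical, and this is a sketch of the general idea rather than the method used in the paper.

```python
from itertools import product

# Hypothetical synonym table; the paper does not publish its expansion data.
SYNONYMS = {
    "faucet": ["faucet", "tap"],
    "repair": ["repair", "fix"],
}


def expand_job_type(job_type: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Generate keyword-phrase variations by substituting synonyms per word."""
    options = [synonyms.get(word, [word]) for word in job_type.split()]
    return [" ".join(combo) for combo in product(*options)]
```

With this table, the seed phrase "faucet repair" expands to four variations, including the original.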

Second Step: Fixing A Relevance Problem

Google’s researchers applied their system on the billions of pages and it didn’t work as intended because many pages had job type phrases that were not describing services offered.

The research paper explains:

“We found that many pages mention job type names for other purposes like giving life tips. For example, a web page that teaches readers to deal with bed bugs might contain a sentence like a solution is to call home cleaning services if you find bed bugs in your home. They usually provide services like bed bug control. Though this page mentions multiple job type names, the page is not provided by a home cleaning business.”

Limiting the crawling and indexing to identifying job type keyword phrases resulted in false positives. The solution was to incorporate sentences that surrounded the keyword phrases so that they could better understand the context of the job type keyword phrases.

The success of using surrounding text is explained:

“As shown in Table 2, JobModelSurround performs significantly better than JobModel, which suggests that the surrounding words could indeed explain the intent of the seed job type mentions. This successfully improves the semantic understanding without processing the entire text of each page, keeping our models efficient.”
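The idea of keeping only the text around a job type mention can be sketched like this. The sentence splitting, the one-sentence window, and the function name are my assumptions; the paper's actual models are not reproduced here.

```python
import re


def mentions_with_context(text: str, job_type: str, window: int = 1) -> list[str]:
    """Return each sentence mentioning job_type plus `window` neighbors per side."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    snippets = []
    for i, sentence in enumerate(sentences):
        if job_type.lower() in sentence.lower():
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            snippets.append(" ".join(sentences[start:end]))
    return snippets
```

A classifier would then look at each snippet, not the whole page, to decide whether the mention describes a service the business actually offers.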

SEO Insight
The described local search algorithm is purposely excluding all information on the page and zeroing in on job type keyword phrases and surrounding words and phrases around those keywords. This shows the importance of how the words around important keyword phrases can provide context for the keyword phrases and make it easier for Google’s crawlers to understand what the page is about without having to process the entire web page.

SEO Insight
Another insight is that Google is not indexing the entire web page for the limited purpose of identifying job type keyword phrases. The algorithm is hunting for the keyword phrase and surrounding keyword phrases.

SEO Insight
The concept of analyzing only a part of a page is similar to Google’s Centerpiece Annotation where a section of content is identified as the main topic of the page. I’m not saying these are related. I’m just pointing out one feature out of many where a Google algorithm zeroes in on just a section of a page.

The System Uses BERT

Google used the BERT language model to classify whether phrases extracted from business websites describe actual job types. BERT was fine-tuned on labeled examples and given additional context such as website structure, URL patterns, and business category to improve precision without sacrificing scalability.

The Extraction System Can Be Generalized To Other Contexts

An interesting finding detailed by the research paper is that the system they developed can be used in areas (domains) other than local businesses, such as “expertise finding, legal and medical information extraction.”

They write:

“The lessons we shared in developing the largescale extraction pipeline from scratch can generalize to other information extraction or machine learning tasks. They have direct applications to domain-specific extraction tasks, exemplified by expertise finding, legal and medical information extraction.

Three most important lessons are:

(1) utilizing the data properties such as structured content could alleviate the cold start problem of data annotation;

(2) formulating the task as a retrieval problem could help researchers and practitioners deal with a large dataset;

(3) the context information could improve the model quality without sacrificing its scalability.”

Job Type Extract Is A Success

The research paper says that the system is a success: it has a high level of precision (accuracy) and it is scalable. The research paper says that it has already been in use for a year. The research is dated 2023 but according to the Internet Archive (Wayback Machine), it was published sometime in July 2024.

The researchers write:

“Our pipeline is executed periodically to keep the extracted content up-to-date. It is currently deployed in production, and the output job types are surfaced to millions of Google Search and Maps users.”

Takeaways

  • Google’s Algorithm That Extracts Job Types from Webpages
    Google developed an algorithm that extracts “job types” (i.e., services offered) from business websites to display in Google Maps and Search.
  • Pipeline Extracts From Unstructured Content
    Instead of relying on structured HTML elements, the algorithm reads free-text content, making it effective even when services are buried in paragraphs.
  • Contextual Relevance Is Important
    The system evaluates surrounding words to confirm that service-related terms are actually relevant to the business, improving accuracy.
  • Model Generalization Potential
    The approach can be applied to other fields like legal or medical information extraction, showing that it generalizes beyond local business data.
  • High Accuracy and Scalability
    The system has been deployed for over a year and delivers scalable, high-precision results across billions of webpages.

Google published a research paper about an algorithm that automatically extracts service descriptions from local business websites by analyzing keyword phrases and their surrounding context, enabling more accurate and up-to-date listings in Google Maps and Search. This technique avoids dependence on HTML structure and can be adapted for use in other industries where extracting information from unstructured text is needed.

Read the research paper abstract and download the PDF version here:

Job Type Extraction for Service Businesses

Featured Image by Shutterstock/ViDI Studio

Google Patent On Using Contextual Signals Beyond Query Semantics via @sejournal, @martinibuster

A patent recently filed by Google outlines how an AI assistant may use at least five real-world contextual signals, including identifying related intents, to influence answers and generate natural dialog. It’s an example of how AI-assisted search modifies responses to engage users with contextually relevant questions and dialog, expanding beyond keyword-based systems.

The patent describes a system that generates relevant dialog and answers using signals such as environmental context, dialog intent, user data, and conversation history. These factors go beyond using the semantic data in the user’s query and show how AI-assisted search is moving toward more natural, human-like interactions.

In general, the purpose of filing a patent is to obtain legal protection and exclusivity for an invention and the act of filing doesn’t indicate that Google is actually using it.

The patent uses examples of spoken dialog but it also states the invention is not limited to audio input:

“Notably, during a given dialog session, a user can interact with the automated assistant using various input modalities, including, but not limited to, spoken input, typed input, and/or touch input.”

The name of the patent is Using Large Language Model(s) In Generating Automated Assistant response(s). The patent applies to a wide range of AI assistants that receive typed, touch, and speech inputs.

There are five factors that influence the LLM modified responses:

  1. Time, Location, And Environmental Context
  2. User-Specific Context
  3. Dialog Intent & Prior Interactions
  4. Inputs (text, touch, and speech)
  5. System & Device Context

The first four factors influence the answers that the automated assistant provides and the fifth one determines whether to turn off the LLM-assisted part and revert to standard AI answers.

Time, Location, And Environmental

There are three contextual factors (time, location, and environment) that provide context that does not exist in keywords and that influence how the AI assistant responds. While these contextual factors, as described in the patent, aren’t strictly related to AI Overviews or AI Mode, they do show how AI-assisted interactions with data can change.

The patent uses the example of a person who tells their assistant they’re going surfing. A standard AI response would be a boilerplate comment to have fun or to enjoy the day. The LLM-assisted response described in the patent would use the geographic location and time to generate a comment about the weather, like the potential for rain. These are called modified assistant outputs.

The patent describes it like this:

“…the assistant outputs included in the set of modified assistant outputs include assistant outputs that do drive the dialog session in manner that further engages the user of the client device in the dialog session by asking contextually relevant questions (e.g., “how long have you been surfing?”), that provide contextually relevant information (e.g., “but if you’re going to Example Beach again, be prepared for some light showers”), and/or that otherwise resonate with the user of the client device within the context of the dialog session.”

User-Specific Context

The patent describes multiple user-specific contexts that the LLM may use to generate a modified output:

  • User profile data, such as preferences (like food or types of activity).
  • Software application data (such as apps currently or recently in use).
  • Dialog history of the ongoing and/or previous assistant sessions.

Here’s a snippet that talks about various user profile related contextual signals:

“Moreover, the context of the dialog session can be determined based on one or more contextual signals that include, for example, ambient noise detected in an environment of the client device, user profile data, software application data, ….dialog history of the dialog session between the user and the automated assistant, and/or other contextual signals.”

Related Intents

An interesting part of the patent describes how a user’s food preference can be used to determine a related intent to a query.

“For example, …one or more of the LLMs can determine an intent associated with the given assistant query… Further, the one or more of the LLMs can identify, based on the intent associated with the given assistant query, at least one related intent that is related to the intent associated with the given assistant query… Moreover, the one or more of the LLMs can generate the additional assistant query based on the at least one related intent.”

The patent illustrates this with the example of a user saying that they’re hungry. The LLM will then identify related contexts such as what type of cuisine the user enjoys and the intent of eating at a restaurant.

The patent explains:

“In this example, the additional assistant query can correspond to, for example, “what types of cuisine has the user indicated he/she prefers?” (e.g., reflecting a related cuisine type intent associated with the intent of the user indicating he/she would like to eat), “what restaurants nearby are open?” (e.g., reflecting a related restaurant lookup intent associated with the intent of the user indicating he/she would like to eat)… In these implementations, additional assistant output can be determined based on processing the additional assistant query.”
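To show the shape of that flow (intent, then related intents, then additional queries), here is a toy sketch. In the patent an LLM does this work; the lookup tables below are a hard-coded stand-in, and the intent names are hypothetical labels I made up. Only the two query strings come from the patent's example.

```python
# Hypothetical intent names; a lookup table stands in for the LLM here.
RELATED_INTENTS = {
    "eat": ["cuisine_type", "restaurant_lookup"],
}

# The query wording is taken from the patent's example.
QUERY_TEMPLATES = {
    "cuisine_type": "what types of cuisine has the user indicated he/she prefers?",
    "restaurant_lookup": "what restaurants nearby are open?",
}


def additional_queries(intent: str) -> list[str]:
    """Map an intent to the additional assistant queries for its related intents."""
    return [QUERY_TEMPLATES[r] for r in RELATED_INTENTS.get(intent, [])]
```

The answers to those additional queries are what feed the additional assistant output described in the quote.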

System & Device Context

The system and device context part of the patent is interesting because it enables the AI to detect if the context of the device is that it’s low on batteries, and if so, it will turn off the LLM-modified responses. There are other factors such as whether the user is walking away from the device, computational costs, etc.
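A gate like that could be sketched as a simple check before the LLM step runs. The field names and the 20% battery threshold are assumptions for illustration; the patent doesn't give specific values, and it lists other factors (like computational cost) that are left out here.

```python
from dataclasses import dataclass


@dataclass
class DeviceContext:
    battery_fraction: float  # 0.0 to 1.0 (assumed representation)
    user_present: bool       # False if the user has walked away


def use_llm_response(ctx: DeviceContext, min_battery: float = 0.2) -> bool:
    """Decide whether to generate an LLM-modified response (assumed rule)."""
    return ctx.user_present and ctx.battery_fraction >= min_battery
```

When the check fails, the assistant would fall back to its standard, non-LLM response.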

Takeaways

  • AI Query Responses Use Contextual Signals
    Google’s patent describes how automated assistants can use real-world context to generate more relevant and human-like answers and dialog.
  • Contextual Factors Influence Responses
    These include time/location/environment, user-specific data, dialog history and intent, system/device conditions, and input type (text, speech, or touch).
  • LLM-Modified Responses Enhance Engagement
    Large language models (LLMs) use these contexts to create personalized responses or follow-up questions, like referencing weather or past interactions.
  • Examples Show Practical Impact
    Scenarios like recommending food based on user preferences or commenting on local weather during outdoor plans demonstrate how real-world contexts can influence how AI responds to user queries.

This patent is important because millions of people are increasingly engaging with AI assistants, which makes it relevant to publishers, ecommerce stores, local businesses and SEOs.

It outlines how Google’s AI-assisted systems can generate personalized, context-aware responses by using real-world signals. This enables assistants to go beyond keyword-based answers and respond with relevant information or follow-up questions, such as suggesting restaurants a user might like or commenting on weather conditions before a planned activity.

Read the patent here:

Using Large Language Model(s) In Generating Automated Assistant response(s).

Featured Image by Shutterstock/Visual Unit

Marketing To Machines Is The Future – Research Shows Why via @sejournal, @martinibuster

A new research paper explores how AI agents interact with online advertising and what shapes their decision-making. The researchers tested three leading LLMs to understand which kinds of ads influence AI agents most and what this means for digital marketing. As more people rely on AI agents to research purchases, advertisers may need to rethink strategy for a machine-readable, AI-centric world and embrace the emerging paradigm of “marketing to machines.”

Although the researchers were testing if AI agents interacted with advertising and what kinds influenced them the most, their findings also show that well-structured on-page information, like pricing data, is highly influential, which opens up areas to think about in terms of AI-friendly design.

An AI agent (also called agentic AI) is an autonomous AI assistant that performs tasks like researching content on the web, comparing hotel prices based on star ratings or proximity to landmarks, and then presenting that information to a human, who then uses it to make decisions.

AI Agents And Advertising

The research is titled Are AI Agents Interacting With Online Ads? and was conducted at the University of Applied Sciences Upper Austria. The research paper cites previous research on the interaction between AI agents and online advertising that explores the emerging relationships between agentic AI and the machines driving display advertising.

Previous research on AI agents and advertising focused on:

  • Pop-up Vulnerabilities
    Vision-language AI agents that aren’t programmed to avoid advertising can be tricked into clicking on pop-up ads at a rate of 86%.
  • Advertising Model Disruption
    This research concluded that AI agents bypassed sponsored and banner ads but forecast disruption in advertising as merchants figure out how to get AI agents to click on their ads to win more sales.
  • Machine-Readable Marketing
    This paper makes the argument that marketing has to evolve toward “machine-to-machine” interactions and “API-driven marketing.”

The research paper offers the following observations about AI agents and advertising:

“These studies underscore both the potential and pitfalls of AI agents in online advertising contexts. On one hand, agents offer the prospect of more rational, data-driven decisions. On the other hand, existing research reveals numerous vulnerabilities and challenges, from deceptive pop-up exploitation to the threat of rendering current advertising revenue models obsolete.

This paper contributes to the literature by examining these challenges, specifically within hotel booking portals, offering further insight into how advertisers and platform owners can adapt to an AI-centric digital environment.”

The researchers investigated how AI agents interact with online ads, focusing specifically on hotel and travel booking platforms. They used a custom-built travel booking platform to perform the testing, examining whether AI agents incorporate ads into their decision-making and which ad formats (like banners or native ads) influence their choices.

How The Researchers Conducted The Tests

The researchers conducted the experiments using two AI agent systems: OpenAI’s Operator and the open-source Browser Use framework. Operator, a closed system built by OpenAI, relies on screenshots to perceive web pages and is likely powered by GPT-4o, though the specific model was not disclosed.

Browser Use allowed the researchers to control for the model used for the testing by connecting three different LLMs via API:

  • GPT-4o
  • Claude Sonnet 3.7
  • Gemini 2.0 Flash

The setup with Browser Use made testing consistent across models by letting them use the page’s rendered HTML structure (DOM tree) and by recording their decision-making behavior.

These AI agents were tasked with completing hotel booking requests on a simulated travel site. Each prompt was designed to reflect realistic user intent and tested the agent’s ability to evaluate listings, interact with ads, and complete a booking.

By using APIs to plug in the three large language models, the researchers were able to isolate differences in how each model responded to page data and advertising cues, to observe how AI agents behave in web-based decision-making tasks.

These are the ten prompts used for testing purposes:

  1. Book a romantic holiday with my girlfriend.
  2. Book me a cheap romantic holiday with my boyfriend.
  3. Book me the cheapest romantic holiday.
  4. Book me a nice holiday with my husband.
  5. Book a romantic luxury holiday for me.
  6. Please book a romantic Valentine’s Day holiday for my wife and me.
  7. Find me a nice hotel for a nice Valentine’s Day.
  8. Find me a nice romantic holiday in a wellness hotel.
  9. Look for a romantic hotel for a 5-star wellness holiday.
  10. Book me a hotel for a holiday for two in Paris.

What the Researchers Discovered

Engagement With Ads

The study found that AI agents don’t ignore online advertisements, but their engagement with ads and the extent to which those ads influence decision-making vary depending on the large language model.

OpenAI’s GPT-4o and Operator were the most decisive, consistently selecting a single hotel and completing the booking process in nearly all test cases.

Anthropic’s Claude Sonnet 3.7 showed moderate consistency, making specific booking selections in most trials but occasionally returning lists of options without initiating a reservation.

Google’s Gemini 2.0 Flash was the least decisive, frequently presenting multiple hotel options and completing significantly fewer bookings than the other models.

Banner ads were the most frequently clicked ad format across all agents. However, the presence of relevant keywords had a greater impact on outcomes than visuals alone.

Ads with keywords embedded in visible text influenced model behavior more effectively than those with image-based text, which some agents overlooked. GPT-4o and Claude were more responsive to keyword-based ad content, with Claude integrating more promotional language into its output.

Use Of Filtering And Sorting Features

The models also differed in how they used interactive web page filtering and sorting tools.

  • Gemini applied filters extensively, often combining multiple filter types across trials.
  • GPT-4o used filters rarely, interacting with them only in a few cases.
  • Claude used filters more frequently than GPT-4o, but not as systematically as Gemini.

Consistency Of AI Agents

The researchers also tested for consistency of how often agents, when given the same prompt multiple times, picked the same hotel or offered the same selection behavior.

In terms of booking consistency, both GPT-4o (with Browser Use) and Operator (OpenAI’s proprietary agent) consistently selected the same hotel when given the same prompt.

Claude showed moderately high consistency in how often it selected the same hotel for the same prompt, though it chose from a slightly wider pool of hotels compared to GPT-4o or Operator.

Gemini was the least consistent, producing a wider range of hotel choices and less predictable results across repeated queries.
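One simple way to put a number on this kind of consistency is the share of repeated runs that landed on the most frequently chosen hotel. This is my own illustrative definition, not necessarily the formula the researchers used.

```python
from collections import Counter


def consistency(choices: list[str]) -> float:
    """Share of runs that picked the most common hotel (assumed definition)."""
    if not choices:
        return 0.0
    top_count = Counter(choices).most_common(1)[0][1]
    return top_count / len(choices)
```

An agent that always books the same hotel for a given prompt scores 1.0, and the score drops as its choices spread across more hotels.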

Specificity Of AI Agents

They also tested for specificity, which is how often the agent chose a specific hotel and committed to it, rather than giving multiple options or vague suggestions. Specificity reflects how decisive the agent is in completing a booking task. A higher specificity score means the agent more often committed to a single choice, while a lower score means it tended to return multiple options or respond less definitively.

  • Gemini had the lowest specificity score at 60%, frequently offering several hotels or vague selections rather than committing to one.
  • GPT-4o had the highest specificity score at 95%, almost always making a single, clear hotel recommendation.
  • Claude scored 74%, usually selecting a single hotel, but with more variation than GPT-4o.
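Based on the description above, a specificity score can be sketched as the fraction of runs in which the agent committed to exactly one hotel. The data layout (a list of hotels per run) and the function name are assumptions; the paper may count edge cases differently.

```python
def specificity(runs: list[list[str]]) -> float:
    """Fraction of runs that ended with exactly one hotel selected."""
    if not runs:
        return 0.0
    single = sum(1 for hotels in runs if len(hotels) == 1)
    return single / len(runs)
```

Under this definition, a run that returns several options or none at all counts against the score.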

The findings suggest that advertising strategies may need to shift toward structured, keyword-rich formats that align with how AI agents process and evaluate information, rather than relying on traditional visual design or emotional appeal.

What It All Means

This study investigated how AI agents powered by three language models (GPT-4o, Claude Sonnet 3.7, and Gemini 2.0 Flash) interact with online advertisements during web-based hotel booking tasks. Each model received the same prompts and completed the same types of booking tasks.

Banner ads received more clicks than sponsored or native ad formats, but the most important factor in ad effectiveness was whether the ad contained relevant keywords in visible text. Ads with text-based content outperformed those with embedded text in images. GPT-4o and Claude were the most responsive to these keyword cues, and Claude was also the most likely among the tested models to quote ad language in its responses.

According to the research paper:

“Another significant finding was the varying degree to which each model incorporated advertisement language. Anthropic’s Claude Sonnet 3.7 when used in ‘Browser Use’ demonstrated the highest advertisement keyword integration, reproducing on average 35.79% of the tracked promotional language elements from the Boutique Hotel L’Amour advertisement in responses where this hotel was recommended.”
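A keyword integration rate like the one quoted can be approximated as the share of tracked promotional phrases that show up in the agent's response. The phrases in the usage note are invented for illustration, and the case-insensitive substring matching is an assumption; the paper's tracking method may differ.

```python
def keyword_integration(response: str, tracked_phrases: list[str]) -> float:
    """Share of tracked ad phrases that appear in the response text."""
    if not tracked_phrases:
        return 0.0
    text = response.lower()
    hits = sum(1 for phrase in tracked_phrases if phrase.lower() in text)
    return hits / len(tracked_phrases)
```

For example, if an ad tracked four phrases and the agent's recommendation repeated two of them, the rate for that response would be 50%.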

In terms of decision-making, GPT-4o was the most decisive, usually selecting a single hotel and completing the booking. Claude was generally clear in its selections but sometimes presented multiple options. Gemini tended to frequently offer several hotel options and completed fewer bookings overall.

The agents showed different behavior in how they used a booking site’s interactive filters. Gemini applied filters heavily. GPT-4o used filters occasionally. Claude’s behavior was between the two, using filters more than GPT-4o but not as consistently as Gemini.

When it came to consistency—how often the same hotel was selected when the same prompt was repeated—GPT-4o and Operator showed the most stable behavior. Claude showed moderate consistency, drawing from a slightly broader pool of hotels, while Gemini produced the most varied results.

The researchers also measured specificity, or how often agents made a single, clear hotel recommendation. GPT-4o was the most specific, with a 95% rate of choosing one option. Claude scored 74%, and Gemini was again the least decisive, with a specificity score of 60%.

What does this all mean? In my opinion, these findings suggest that digital advertising will need to adapt to AI agents. That means keyword-rich formats are more effective than visual or emotional appeals, especially as machines increasingly are the ones interacting with ad content. Lastly, the research paper references structured data, but not in the context of Schema.org structured data. Structured data in the context of the research paper means on-page data like prices and locations and it’s this kind of data that AI agents engage best with.

The most important takeaway from the research paper is:

“Our findings suggest that for optimizing online advertisements targeted at AI agents, textual content should be closely aligned with anticipated user queries and tasks. At the same time, visual elements play a secondary role in effectiveness.”

That means that, for advertisers, designing for clarity and machine readability may soon become as important as designing for human engagement.

Read the research paper:

Are AI Agents interacting with Online Ads?

Featured Image by Shutterstock/Creativa Images