Sundar Pichai, Google’s CEO, was interviewed by Andrew Ross Sorkin at the New York Times DealBook Summit, where he discussed what to expect from Google Search in 2025 but also struggled to articulate Google’s concern for content creators.
When asked to compare where Google is today relative to the rest of the industry, and whether Google should be the “default winner,” Pichai reminded the interviewer that these are “the earliest stages of a profound shift” and underlined that Google is a leader in AI, not a follower. Much of the modern AI industry is built on top of Google research that was subsequently open sourced, particularly transformers, without which the field would not exist as it does today.
Pichai answered:
“Look, it’s such a dynamic moment in the industry. When I look at what’s coming ahead, we are in the earliest stages of a profound shift. We have taken such a deep full stack approach to AI.
…we do world class research. We are the most cited, when you look at gen AI, the most cited… institution in the world, foundational research, we build AI infrastructure and when I’m saying AI infrastructure all the way from silicon, we are in our sixth generation of tensor processing units. You mentioned our product reach, we have 15 products at half a billion users, we are building foundational models, and we use it internally, we provide it to over three million developers and it’s a deep full stack investment.
We are getting ready for our next generation of models, I just think there’s so much innovation ahead, we are committed to being at the state of the art in this field and I think we are. Just coming today, we announced groundbreaking research on a text and image prompt creating a 3D scene. And so the frontier is moving pretty fast, so looking forward to 2025.”
Blue Link Economy And AI
It was pointed out by the interviewer that Google was the first mover on AI and then it wasn’t (a reference to OpenAI’s breakout in 2022 and subsequent runaway success). He asked Pichai how much of that was Google protecting the “blue link economy” so as not “to hurt or cannibalize that business” which is worth hundreds of billions of dollars.
Pichai answered that, of all the projects at Google, AI was applied the most to Search, citing BERT, MUM, and multimodal search as helping close gaps in search quality. Something that some in the search industry fail to understand is that AI has been a part of Google since 2012, when it used deep neural networks for image identification and speech recognition, and 2014, when it introduced the world to sequence to sequence learning (PDF) for understanding strings of text. In 2015, Google introduced RankBrain, an AI system directly related to ranking search results.
Pichai answered:
“The area where we applied AI the most aggressively, if anything in the company was in search, the gaps in search quality was all based on Transformers internally. We call it BERT and MUM and you know, we made search multimodal, the search quality improvements, we were improving the language understanding of search. That’s why we built Transformers in the company.
So and if you look at the last couple of years, we have with AI overviews, Gemini is being used by over a billion users in search alone.”
Search Will Change Profoundly In 2025
Pichai continued his answer, stating directly that Search will change profoundly not just in 2025 but in early 2025. He also said that progress is going to get harder because the low-hanging fruit has already been picked.
He said:
“And I just feel like we are getting started. Search itself will continue to change profoundly in 2025. I think we are going to be able to tackle more complex questions than ever before. You know, I think we’ll be surprised even early in 2025, the kind of newer things search can do compared to where it is today… “
Pichai also said that progress wouldn’t be easy:
“I think the progress is going to get harder when I look at 2025, the low hanging fruit is gone.
But I think where the breakthroughs need to come from, where the differentiation needs to come from, is your ability to achieve technical breakthroughs, algorithmic breakthroughs, how do you make the systems work, you know, from a planning standpoint or from a reasoning standpoint, how do you make these systems better? Those are the technical breakthroughs ahead.”
Is Search Going Away?
The interviewer asked Pichai if Google has leaned into AI enough, quoting an author who suggested that Google’s “core business is under siege” because people are increasingly getting answers from AI and other platforms outside of search, and that the value of search would be “deteriorating” because so much of the content online will be AI-generated.
He answered that it’s precisely in a scenario where the Internet is flooded with inauthentic content that search becomes even more valuable.
Pichai answered:
“In a world in which you’re flooded with like lot of content …if anything, something like search becomes more valuable. In a world in which you’re inundated with content, you’re trying to find trustworthy content, content that makes sense to you in a way reliably you can use it, I think it becomes more valuable.
To your previous part about there’s a lot of information out there, people are getting it in many different ways. Look, information is the essence of humanity. We’ve been on a curve on information… when Facebook came around, people had an entirely new way of getting information, YouTube, Facebook, Tik… I can keep going on and on.
…I think the problem with a lot of those constructs is they are zero sum in their inherent outlook. They just feel like people are consuming information in a certain limited way and people are all dividing that up. But that’s not the reality of what people are doing. “
Pichai Stumbles On Question About Impact On Creators
The interviewer next asked if content is being devalued. He used the example of someone who researches a topic for a book, reads twenty books, cites those sources in the bibliography, and then gets it published, whereas Google ingests everything and then “spits” out content all day long, undercutting the human who in earlier times would have written that book.
Andrew Ross Sorkin said:
“You get to spit it out a million times. A million times a day. And I just wonder what the economics of that should be for the folks that create it in the beginning.”
Sundar Pichai defended Google by saying that it spends a lot of time thinking about the impact on the “ecosystem” of publishers and how much traffic it sends to them. The interviewer listened to Sundar’s answer without mentioning the elephant in the room: search results stuffed with Reddit and advertising that crowds out content created by actual experts, and the de-prioritization of news content, which has negatively impacted traffic to news organizations around the world.
It was at this point that Pichai appeared to stumble as he tried to find the words to respond. He avoided mentioning websites, speaking in the abstract about the “ecosystem,” and when he ran out of things to say, he changed course and began speaking about how Google compensates copyright holders who sign up for YouTube’s Content ID program.
He answered:
“Look I… uh… It’s a… very important question… uhm… look I… I… think… I think more than any other company… look you know… we for a long time through… you know… be it in search making sure… while it’s often debated, we spend a lot of time thinking about the traffic we send to the ecosystem.
Even through the moment through the transition over the past couple of years. It’s an important priority for us.”
At this point he started talking about Google’s content platform YouTube and how they use “Content ID” which is used to identify copyright-protected content. Content ID is a program that benefits the corporate music, film, and television industries, copyright owners who “own exclusive rights to a substantial body of original material that is frequently uploaded to YouTube.”
Pichai continued:
“In YouTube we put a lot of effort into understanding and you know identifying content and with content ID and uh creating monetization for creators.
I think… I think those are important principles, right. I think um… there’s always going to be a balance between understanding what is fair use uh… when new technology comes versus how do you… give value back proportionate to the value of the IP, the hard work people have put in.”
Insightful Interview Of Alphabet’s CEO
The interviewer did a great job of asking the hard questions, but I think many in the search marketing community who are more familiar with the search results would have asked follow-up questions about content creators who are not on Google’s YouTube platform, or about the non-expert content that pushes down content by actual experts.
Featured Image by Shutterstock/Shutterstock AI Generator (no irony intended)
Our question comes from Madeline, who asked during a recent webinar:
“How does the metadata on photos help increase rankings?”
That’s a great question, and it’s something that is overlooked in SEO.
What Is Image Metadata?
For anyone in SEO, the concept of “metadata” will be familiar to you – it’s information that describes aspects of the page.
In SEO, we talk about page titles, page descriptions, and other information in the <head> of the page as “metadata.”
Images also have metadata.
This information describes the aspects of the image. It includes the name of the image creator, credits, and any copyright associated with it.
People can use it to understand more about the image they are looking at. It also helps to convey that information to the search engines.
Types Of Metadata
There are several different ways to communicate information about the image. The following are methods of labeling or conveying information used specifically for images.
Structured Data
As with any structured data you would use for other elements on your webpages, image structured data can be in JSON-LD, Microdata, or RDFa format.
It is contained on the page itself, rather than in the image file, and should be used on every page the image appears on.
Just using the structured data markup on one page does not guarantee that Google will know to use it again for another page where the image appears.
The type to use is ImageObject. From there, Google requires the following property to be used: contentUrl.
In addition to this, you must use one of the following properties:
creator, or
creditText, or
copyrightNotice, or
license.
Google also recommends using the following properties:
acquireLicensePage.
creator.
name.
creditText.
copyrightNotice.
license.
I want to clarify something about this structured markup information.
As with structured data used elsewhere on a page, it’s really not used for ranking purposes. It is used more to help search engines understand information about images so that they can enhance the image SERP results.
One example is the “licensable” label that appears over some images in Google’s Image SERPs, which allows Google to display the license conditions for that image.
When clicking on the image, the side panel then extends to give an opportunity for the user to visit the site and also find more information about this image. This information is captured through structured data.
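To make that concrete, here is a minimal sketch of what ImageObject markup with the required and recommended properties could look like. The URLs, names, and license pages are hypothetical placeholders; the snippet builds the JSON-LD in Python and prints it, and on a live page the output would sit inside a script tag of type application/ld+json.

```python
import json

# Hypothetical ImageObject markup: contentUrl is required, and at least one of
# creator, creditText, copyrightNotice, or license should accompany it.
image_markup = {
    "@context": "https://schema.org",
    "@type": "ImageObject",
    "contentUrl": "https://www.example.com/images/harbour-sunrise.jpg",  # placeholder URL
    "creator": {
        "@type": "Person",
        "name": "Jane Example",  # placeholder photographer
    },
    "creditText": "Jane Example / Example Photos",
    "copyrightNotice": "© 2024 Example Photos",
    "license": "https://www.example.com/image-license",
    "acquireLicensePage": "https://www.example.com/licensing",
}

# On the page, this JSON would be embedded in a <script type="application/ld+json"> tag.
print(json.dumps(image_markup, indent=2, ensure_ascii=False))
```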
IPTC Photo Metadata
According to the IPTC, its photo metadata standard “is the most widely used standard because of its universal acceptance among photographers, distributors, news organizations, archivists, and developers. The schema defines metadata structure, properties, and fields, so that images are optimally described and easily accessed later.”
Google has announced in the past that it will use IPTC metadata to identify and signal that an image has been created using artificial intelligence.
Using this metadata could make an image eligible to display an “AI-generated” label in Google Images.
EXIF Data
EXIF (Exchangeable Image File Format) is a data standard covering more specific information about how an image was captured.
For example, the camera settings, pixel dimensions, location information, and the date/time the photo was captured.
In fact, if you look at the photos you have taken on your phone, you will likely see some of this EXIF data for yourself.
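If you’d like to inspect that data yourself, here is a minimal sketch using the Pillow library; the file path is a hypothetical placeholder for any JPEG you have on hand.

```python
from PIL import Image
from PIL.ExifTags import TAGS

# Hypothetical local file; swap in any JPEG taken with a phone or camera.
with Image.open("photo.jpg") as img:
    exif = img.getexif()

# Map numeric EXIF tag IDs to human-readable names and print them.
for tag_id, value in exif.items():
    tag_name = TAGS.get(tag_id, tag_id)
    print(f"{tag_name}: {value}")
```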
Back in 2014, Matt Cutts (then head of Google’s webspam team) said Google “reserve[s] the right to use [EXIF] in rankings.”
However, there is no evidence that Google ever acted on that right, and over the years little has changed in its public statements about using EXIF in rankings.
More recently, reports from the SMX Advanced conference in September 2024 suggested that Google’s Martin Splitt denied its use in rankings.
How Does It Affect Rankings?
So, now that we’ve covered what image metadata is, let’s get back to the question. Does metadata affect rankings?
No. Not directly.
But there is some nuance to that answer.
Because of the way metadata can enhance an image in Google Image SERPs, it may have an effect on click-through. That alone can be enough of a reason to utilize it.
After all, clicking on the website to view or license an image is likely the goal of optimizing it.
There is, of course, the suggestion that user behavior affects Google’s rankings. If the metadata-inspired labels on the images in the SERPs cause improved click-through, then arguably, there is a link to rankings improvements.
What Affects Image Ranking?
If you want to make the most of the images on your site, then utilizing metadata is a smart move – especially if your images are your products.
If you want your images to rank purely to give your product or service more exposure to potential customers, then you may want to focus more on aspects that will directly impact their ranking.
There are a range of factors that you will want to consider, including choosing the right image file type: JPEG, WebP, PNG, BMP, GIF, or SVG.
For a full guide on how to optimize your images for ranking, take a look at these tips.
As an absolute minimum, the following aspects are worth considering whenever you add an image to a page.
Speed
As with everything on a webpage these days, load speed matters. If your image is slow to load and render, it is likely to affect the Core Web Vitals of the page it’s on.
Alt Text
The alt attribute (often called the alt tag) provides a text alternative to an image. If the image doesn’t display, or a screen reader is being used, the alt text can be read instead of the image being viewed.
Search engines have historically struggled to understand the content of images.
Although they have taken significant leaps forward in this regard, the alt attribute is still used to explain what’s in an image in a way search engines can reliably understand.
As such, it is a good place to accurately describe your image while using language that searchers will likely use.
File Name
Didn’t expect the name you save your photo under to have a ranking impact?
Well, surprisingly, it does.
Don’t settle for Helen-save-1 or IMG1239. Instead, consider using a similar file name to the alt text. Essentially, give the search bots another clue as to what the image is of.
In Summary: Image Metadata Matters For SEO
There is little to say that metadata has a direct ranking impact. However, as with any factor that may or may not have an impact, I suggest you test it where you can!
Although there may be little impetus to add metadata for ranking purposes, there are many other reasons, even SEO ones, why you should consider what metadata you are or aren’t using with your imagery.
Most marketers don’t realize that Google has been losing search market share in EU countries.
Image Credit: Kevin Indig
The drop in market share comes at a time when Google’s business is under siege:
The DoJ recommended separating Google from Chrome and Android amid a lawsuit against Alphabet. (I summarized the lawsuit and potential outcomes in Monopoly.)
The Justice Department runs a separate lawsuit against Google’s advertising business.
Canada just sued Google over anti-competitive practices in online ads.
ChatGPT, Perplexity & Co are growing mind and market share. (I covered the meteoric rise of ChatGPT in ChatGPT Search.)
Google faces heavy regulation in the EU from the DMA (Digital Markets Act), which I wrote about in 2 Internets.
So, the question is two-fold: How much does the drop in market share matter, and what is the driver?
The short answer is that the drop matters more than Alphabet might like to admit.
It gives oxygen to competitors and weakens the body in the fight against external agents. Google’s revenue is still strong, but advertising market share is declining.
A mix of regulation, competitors, and negative sentiment toward Google seem responsible for the drop.
The implication is that marketers increasingly need to track and optimize for more search engines, but a more fragmented playing field could also be an opportunity for more referral traffic from search engines to websites.
What Is Going On With Google?
Image Credit: Kevin Indig
Google’s market share over the last 10 years dipped by 5.6 pp (percentage points) in France and 3.3 pp in Germany.
StatCounter has never recorded such a low share since it began measuring data in January 2009.
France and Germany are not the only ones. Most EU countries saw Google’s market share drop over the last five years (mobile):
Austria: -4.1 pp.
Poland: -3.1 pp.
Switzerland: -2.3 pp.
Netherlands: -2.1 pp.
Denmark: -1.5 pp.
Zooming further in also doesn’t change things. Google market share over the last 12 months (mobile):
France: -4.6 pp.
Austria: -3.2 pp.
Poland: -2.4 pp.
Germany: -2.1 pp.
Switzerland: -1.3 pp.
Netherlands: -1.0 pp.
Denmark: -1.0 pp.
What’s going on? The picture becomes clearer when we look at when the trend changes. There are two inflection points: November 2018 and April 2024.
Image Credit: Kevin Indig
The data shows a shift away from Google starting around April, a month after Android and Apple introduced choice screens for browsers and search engines.
In other words, Google can no longer be the default search engine on mobile and desktop devices. We’re starting to see the results.
However, not all countries see a dip. Why?
Why Are Some Countries Flat?
Google’s market share isn’t down in every EU country, e.g.:
Portugal.
Spain.
Italy.
Ireland.
How come? These countries are part of Europe, and users see a choice screen.
The answer is devices. The countries listed above lost market share on desktop but not mobile.
Image Credit: Kevin Indig
This happens everywhere in the EU. Over the last five years, Google lost 2.1% market share on mobile compared to 10% on desktop in the EU.
Why?
A big part of the reason is the exclusive distribution agreement with Apple.
Windows is the dominant desktop operating system, with over 75% share in the EU, largely because of its domination in corporate computing. macOS has only 15.1%.
On mobile, Android (Google’s operating system) holds the majority with 66.5%, while Apple’s iOS has 33%.
And since Google pays Apple a reported $20 billion a year to be the default search engine on its devices, Google’s position in the EU was more solid on mobile – until the DMA forced choice screens in March.
Image Credit: Kevin Indig
But what about countries that show a decline in Google’s market share before March? Way before!
Why Does The Dip Start Earlier In Some Countries?
Image Credit: Kevin Indig
Google lost market share in countries like Germany and Portugal as early as November 2018. So, there must be something else going on besides choice screens and device-specific dynamics.
Two things happened in 2018: First, GDPR, the European data protection law, came into effect in May 2018. Second, the EU fined Alphabet €4.34 billion for antitrust violations related to Android’s market dominance.
Neither event directly decreased Google’s market share, but together they set off a period of mistrust of Google that gave space to smaller competitors like DuckDuckGo and Bing.
Europeans are much more privacy-sensitive, which means regulatory fines and privacy laws influence consumer behavior much more than in the U.S.
For example, the European privacy search engine StartPage gets 56% of searches from the EU and 21% from the U.S.
Users visit Google less because of privacy concerns. In November 2018, France decided to stop using Google as the default search engine for some ministries.
Choice screens and public perception are the biggest drivers behind Google’s decline. Google sends less referral traffic to websites. So, what is the effect?
Who Wins What Google Loses?
Image Credit: Kevin Indig
The biggest winner of Google’s decline is Bing, the perennial runner-up.
It’s very possible that ChatGPT and its close affiliation with Microsoft gave Bing a bigger boost in Europe than originally assumed, but Bing is also simply the second choice in consumers’ minds.
Now, these numbers are still peanuts, and search engines like DuckDuckGo, Ecosia, and QWANT license search results from Bing and Google. So, you could say that Google and Bing win, after all.
However, Ecosia and QWANT are working on a joint web index to become independent from other search engines.
How much longer until DuckDuckGo and others announce their own index as well? When the alpha gets weaker, the smaller animals smell the opportunity.
Despite the decline in market share, Google’s search revenue is still growing impressively fast at its scale. Why?
Market share doesn’t have to correlate with search volume or monetizable queries.
There are more mobile than desktop searches, and mobile searches drop to a smaller degree.
Google still dominates in other markets – the EU alone might not put a dent in Google’s revenue that the company can’t compensate for elsewhere.
Google has become more aggressive in search monetization, enough to outpace the drop in market share.
Relative ad revenue growth, which is predicted to fall below 50% next year, could be a better indicator than absolute growth.
I also want to point out a caveat in the data: StatCounter gathers data by measuring referral traffic on 1.5 million sites. There is a chance that Google sending less traffic to websites and keeping it to itself affects the numbers.
What Are The Implications?
Google’s dropping market share in the EU, combined with potential antitrust remedies (like a forced end to the distribution agreement with Apple) and more competition, will likely fragment Search further.
In other words, we might optimize for more search engines (again). Most of them might function similarly in ranking but might need site owners to take dedicated indexing actions, such as integrating with Bing’s IndexNow.
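As a rough illustration, an IndexNow submission might look something like the sketch below. The host, key, and URLs are hypothetical placeholders, and you should verify the request format against the IndexNow documentation before relying on it.

```python
import requests

# Hypothetical values: the key is a token you generate and also host as a text
# file on your own domain so the endpoint can verify ownership.
payload = {
    "host": "www.example.com",
    "key": "0123456789abcdef0123456789abcdef",
    "keyLocation": "https://www.example.com/0123456789abcdef0123456789abcdef.txt",
    "urlList": [
        "https://www.example.com/new-product-page",
        "https://www.example.com/updated-category-page",
    ],
}

# Submit the changed URLs; participating engines share the submission queue.
response = requests.post("https://api.indexnow.org/indexnow", json=payload, timeout=10)
print(response.status_code)  # 200/202 generally indicates the submission was accepted
```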
We’ve already dusted off our Bing Webmaster Tools when it turned out ChatGPT is using Bing results for its search feature. What’s next? Perplexity webmaster tools? Given Bing’s growing market share, SEO professionals should pay more attention to it.
Other search engines don’t have webmaster tools yet – to my surprise. What better way to foster a relationship with site owners than a portal? But with increasingly independent indices, that could become a reality soon.
Ironically, the monopoly lawsuit against Google comes just as the company gets more competition. A 1% market share of a giant like Alphabet can create a unicorn with $1.75 billion in ARR.
Browsers play a critical role in the search engine wars. The DoJ is pushing for Google to divest Chrome, and OpenAI is working on its own browser.
In my opinion, OpenAI should buy Arc. Either way, browsers are the ultimate internet user interface and offer more user information than search engines can chew.
I want to be clear that I don’t think Google is doomed to fail. Google has all the ingredients to come out on top in the “new AI world.” The only reason it will fail is by standing in its own way.
The structured data landscape has undergone significant transformation in 2024, driven by the rise of AI-powered search, the growing importance of machine-readable content, and the need to ground large language models in factual data.
The latest edition of the HTTP Archive’s Web Almanac, which analyzes structured data across 16.9 million websites, reveals a clear shift from traditional SEO implementation to more sophisticated knowledge graph development that powers AI discovery systems.
While Google deprecated certain rich results like FAQs and HowTos in 2023, it simultaneously introduced an unprecedented number of new structured data types, including vehicle listings, course info, vacation rentals, profile pages, and 3D product models.
This rapid evolution signals a maturing ecosystem where structured data serves not just search visibility but also forms the foundation for factual AI responses, training language models, and enhanced digital product experiences.
Analysis And Methodology
The insights presented in this article are based on the 2024 edition of the Structured Data chapter of the HTTP Archive’s Web Almanac. The annual report analyzes the state of the web by evaluating structured data implementation across 16.9 million websites. The datasets are publicly queryable on BigQuery in the `httparchive.all.*` tables (for the date = '2024-06-01'), and the analysis relies on tools like WebPageTest, Lighthouse, and Wappalyzer to capture metrics on structured data formats, adoption trends, and performance.
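For readers who want to poke at the raw data, a query through the BigQuery Python client might look roughly like the sketch below. The column names (`client`, `page`) are assumptions based on the public `httparchive.all` schema, so double-check them against the dataset documentation, and note that the table is large, so queries can incur costs.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials are configured

# Count distinct pages in the June 2024 mobile crawl; column names are assumptions
# based on the public httparchive.all schema and may need adjusting.
query = """
    SELECT COUNT(DISTINCT page) AS pages
    FROM `httparchive.all.pages`
    WHERE date = '2024-06-01'
      AND client = 'mobile'
"""

for row in client.query(query).result():
    print(f"Pages analyzed: {row.pages}")
```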
Structured Data Adoption Trends
The analysis reveals compelling growth across major structured data formats:
RDFa maintains leadership with 66% presence (+3% YoY).
Open Graph implementation grows to 64% (+5% YoY).
X (Twitter) meta tag usage increases to 45% (+8% YoY).
This widespread adoption indicates that organizations are investing in structured data not just for search visibility, but also to enable AI and crawlers to understand and enhance their digital experiences.
AI Discovery And Knowledge Graphs
The relationship between structured data and AI systems is evolving in complex ways.
While many generative AI search engines are still developing their approach to leveraging structured data, established platforms like Bing Copilot, Google Gemini, and specialized tools like SearchGPT already seem to demonstrate the value of entity-based understanding, particularly for local queries and factual validation.
Training And Entity Understanding
Generative AI search engines are trained on vast datasets that include structured data markup, influencing how they:
Recognize and categorize entities (products, locations, organizations).
Ground responses. We see this in systems like DataGemma that use structured data to ground responses in verifiable facts.
Understand relationships between different data points. This is particularly evident when schema.org is used for aggregating datasets from authoritative sources worldwide.
Process specific query types, like local business and product searches.
This training shapes how AI systems interpret and respond to queries, particularly visible in:
Local business queries where entity attributes match structured data patterns.
Product queries that reflect merchant-provided structured data.
Knowledge panel information that aligns with entity definitions.
Search Engine Integration
Different platforms demonstrate structured data influence through:
Traditional Search: Rich results and knowledge panels directly powered by structured data.
AI Search Integration:
Bing Copilot showing enhanced results for structured entities.
Google Gemini reflecting knowledge graph information.
Specialized engines like Perplexity.ai demonstrating entity understanding in location queries.
Google’s latest experiment: an AI Sales Assistant integrated into the SERP for shopping queries (this is huge; spotted on X by SERP Alert).
WordLift’s Entity Knowledge Graph Panel on Google Search – Foundation Year.
Asking “When was WordLift founded?” to Google Gemini.
Here is an example of Gemini and Google Search sharing the same factoid.
AI Sales Assistant through a ‘Shop’ CTA on branded sitelinks.
Data Validation And Verification
Structured data provides verification mechanisms through:
Knowledge Graphs: Systems like Google’s Data Commons use structured data for fact verification.
Training Sets: Schema.org markup creates reliable training examples for entity recognition.
Validation Pipelines: Content generation tools, like WordLift, use structured data to verify AI outputs.
The key distinction is that structured data doesn’t directly influence LLM responses, but rather shapes AI search engines through:
Training data that includes structured markup.
Entity class definitions that guide understanding.
Integration with traditional search rich results.
This makes structured data implementation increasingly important for visibility across both traditional and AI-powered search platforms.
As we enter this new era of AI Discovery, investing in structured data isn’t just about SEO anymore – it’s about building the semantic layer that enables machines to truly understand and accurately represent who you are.
Semantic SEO Evolution: From Structured Data To Semantic Data
The practice of SEO has evolved into Semantic SEO, going beyond traditional keyword optimization to embrace semantic understanding:
Entity-Based Optimization
Focus on clear entity definitions and relationships.
Implementation of comprehensive entity attributes.
Strategic use of sameAs properties for entity disambiguation.
Content Networks
Development of interconnected content clusters.
Clear attribution and authorship markup.
Rich media relationship definitions.
Key Implementation Patterns In JSON-LD
Content Publishing
Analysis of structured data patterns across millions of websites reveals three dominant implementation trends for content publishers.
JSON-LD patterns for content publishers. (Image from author, November 2024)
Website Structure & Navigation (+6 Million Implementations)
The dominance of WebPage → isPartOf → WebSite (5.8 million) and WebPage → breadcrumb → BreadcrumbList (4.8 million) relationships demonstrates that major websites prioritize clear site architecture and navigation paths.
Site structure remains the foundation of structured data implementation, suggesting that search engines heavily rely on these signals for understanding content hierarchy.
Content Attribution & Authority
Strong patterns emerge around content attribution:
Article → author → Person (925,000).
Article → publisher → Organization (597,000).
BlogPosting → author → Person (217,000).
This focus on authorship and organizational attribution reflects the increasing importance of E-E-A-T signals and content authority in search algorithms.
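As an illustration, here is a minimal sketch of the Article → author → Person and Article → publisher → Organization pattern; the names and URLs are placeholders, and the printed JSON-LD would be embedded in a script tag of type application/ld+json on the article page.

```python
import json

# Hypothetical article markup showing the attribution pattern described above.
article_markup = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example Headline About Structured Data",
    "author": {
        "@type": "Person",
        "name": "Jane Example",                        # placeholder author
        "url": "https://www.example.com/authors/jane",  # supports entity disambiguation
    },
    "publisher": {
        "@type": "Organization",
        "name": "Example Publishing",
        "logo": {
            "@type": "ImageObject",
            "url": "https://www.example.com/logo.png",
        },
    },
    "image": "https://www.example.com/images/article-hero.jpg",
    "datePublished": "2024-11-01",
}

print(json.dumps(article_markup, indent=2))
```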
Rich Media Integration
Consistent implementation of image markup across content types.
The high frequency of media relationships indicates that publishers recognize the value of structured visual content for both search visibility and user experience.
The data suggests publishers are moving beyond basic SEO markup to create comprehensive machine-readable content graphs that support both traditional search and emerging AI discovery systems.
Local Business & Retail
Analysis of local business structured data implementation reveals three critical pattern groups that dominate location-based markup.
JSON-LD patterns for local business and retail. (Image from author, November 2024)
Location & Accessibility (+1.4 Million Implementations)
High adoption of physical location markup demonstrates its fundamental importance.
Trust-building elements, while less frequently implemented, create richer local business entities that support both search visibility and user decision-making.
Ecommerce (Expanded List)
Analysis of ecommerce structured data reveals sophisticated implementation patterns that focus on product discovery and conversion optimization.
JSON-LD patterns for ecommerce websites. (Image from author, November 2024)
Core Product Information (+4.7 Million Implementations)
The dominance of basic product markup shows its fundamental importance.
This layered approach to product attributes creates comprehensive product entities that support both search visibility and user decision-making.
Future Outlook
The role of structured data is expanding beyond its traditional function as an SEO tool for powering rich snippets and specific search features. In the age of AI discovery, structured data is becoming a critical enabler for machine understanding, transforming how content is interpreted and connected across the web. This shift is driving the industry to think beyond Google-centric optimization, embracing structured data as a core component of a semantic and AI-integrated web.
Structured data provides the scaffolding for creating interconnected, machine-readable frameworks, which are vital for emerging AI applications such as conversational search, knowledge graphs, and (Graph) retrieval-augmented generation (GraphRAG or RAG) systems. This evolution calls for a dual approach: leveraging actionable schema types for immediate SEO benefits (rich results) while investing in comprehensive, descriptive schemas that build a broader data ecosystem.
The future lies in the intersection of structured data, semantic modeling, and AI-driven content discovery systems. By adopting a more holistic view, organizations can move from using structured data as a tactical SEO addition to positioning it as a strategic layer for powering AI interactions and ensuring findability across diverse platforms.
Credits And Acknowledgements
This analysis wouldn’t be possible without the dedicated work of the HTTP Archive team and Web Almanac contributors.
The complete Web Almanac Structured Data chapter offers even deeper insights into the evolving landscape of structured data implementation.
As we move toward an AI-powered future, the strategic importance of structured data will continue to grow.
An analysis of 140,000 sites hosted on managed WordPress host Kinsta revealed the WordPress plugins that users judge to be the best. These findings highlight how publishers prioritize performance, SEO, and user experience.
10. Schema.org Structured Data – Schema Pro – 1.75%
Adding structured data is critical for SEO and in general for making it clear for search engines and AI what the content is about. Only 1.75% of the 140,000 sites scanned by Kinsta use a standalone Schema plugin. The reason may be that users are satisfied with the structured data functionalities offered by SEO Plugins.
The Schema Pro WordPress plugin offers a wider selection of structured data types than most SEO plugins and it also offers the capability to add custom structured data automatically across the entire site targeted to specific kinds of posts or at the individual page level.
9. XML Sitemap Generator for Google Plugin – 2.17%
Sitemaps are helpful for encouraging search engines to crawl web pages efficiently in a timely manner. But only 2% of sites use it, likely because a basic version of this functionality is native to the WordPress core and it’s provided by all WordPress SEO Plugins.
Like the dedicated Schema Pro structured data plugin, the XML Sitemap Generator for Google plugin offers greater flexibility than the built-in XML sitemap generators found in most SEO plugins. But with only 2.17% usage, it’s clear that SEO plugins are a good enough fit for most WordPress users, and the added flexibility likely only matters for edge cases.
8. Broken Link Checker – 3.27%
The Broken Link Checker is a plugin that checks for broken links but is not commonly used in this sample of sites. Google Search Console offers a report of 404 errors discovered by Googlebot which indicates broken internal and external links. The broken link check can also be accomplished with a software app like Screaming Frog.
The Broken Link Checker plugin offers a cloud-based scanner and a local checker that uses website server resources to monitor the entire website for broken links.
7. SEOPress – 4.81%
SEOPress is the seventh most popular plugin in the sample of 140,000 sites hosted on Kinsta. It’s a fairly popular all-in-one SEO plugin, with 300,000+ installations, that facilitates content optimization, schema implementation, and redirect management.
6. All in One SEO – 5.11%
The sixth most popular plugin is highly popular online, with 3+ million installations. On Kinsta it’s installed on 5.11% of sites in this sample.
5. Imagify – 11.62%
Imagify is an image optimizer that reduces image file sizes to improve website loading times. The popularity of these kinds of plugins may reflect a lack of image optimization skills among average WordPress users, since it’s easy to optimize an image before uploading it. Less than 12% of sites on Kinsta have installed it.
4. Rank Math – 18.32%
Rank Math is a highly popular SEO plugin with over 3 million installations worldwide. So it’s not surprising to see that almost 20% of sites hosted on Kinsta use it.
3. WP Rocket – 19.10%
The #3 most popular plugin is for performance optimization, demonstrating how important website performance is to publishers. WP Rocket performs file minification (making code smaller by removing blank spaces), lazy loading, and database optimization. WP Rocket made its plugin compatible with Kinsta so that caching is handled at the server level by Kinsta instead of at the PHP level by the plugin. Handling caching at the server level is faster and uses fewer server resources, and it’s one of the benefits of a managed WordPress host.
2. Redirection – 26.85%
The Redirection plugin is used by almost 27% of users, which is curious because redirection is something that can be handled in the built-in redirection manager tool in Kinsta’s dashboard. I use the Redirection plugin on some of my sites and it does a lot more than redirects. The plugin features 404 error reporting which alerts users to a problem like a typo in the URL of an external or internal link, which can be fixed by redirecting the typo to the correct URL. The plugin can also set security headers, which is useful for strengthening site security.
1. Yoast – 57.95%
Yoast is the most popular plugin installed on sites, according to the scan Kinsta performed on 140,000 sites. In a way, it’s not surprising, because Yoast is installed on more than 10 million websites and is a trusted brand.
Takeaways:
The choice of plugins suggests what concerns WordPress users the most: SEO, website performance, and proper site functioning.
Search Optimization
SEO is a strong concern for WordPress users, with a combined total of 86.19% of sites employing an SEO plugin. The small percentage of users that install a dedicated structured data plugin (1.75%) or XML sitemap generator (2.17%) indicates that most users are satisfied with the built-in features of their SEO plugins.
Website Performance
Over 30% of managed WordPress host users in the surveyed sample of 140,000 sites are concerned enough about performance optimization to install plugins (WP Rocket 19.10% and Imagify 11.62%).
Site Health Maintenance
Over 30% of site owners care about keeping their sites working properly, as evidenced by the number of users that install the Redirection and Broken Link Checker plugins.
Brand Trustworthiness
Over 95% of WordPress users turn to a name brand plugin:
Yoast 57.95%
WP Rocket 19.10%
Rank Math 18.32%
These findings suggest that trust and reliability, comprehensive functionality, and ease of use are important factors guiding the choice of WordPress plugins. It’s possible that trustworthy word of mouth recommendations and brand awareness also play a role in plugin choices.
Notable in this survey of plugins that users consider the best is that security plugins did not make the list. This is likely because Kinsta provides built-in WordPress security, including two firewalls and enterprise-level DDoS protection.
Your web hosting is a cornerstone of your business’s online success that impacts everything from site speed and uptime to customer trust and overall branding.
Yet, many businesses stick with subpar hosting providers, often unaware of how much it’s costing them in time, money, and lost opportunities.
The reality is that bad hosting doesn’t just frustrate you. It frustrates your customers, hurts conversions, and can even damage your brand reputation.
The good news?
Choosing the right host can turn hosting into an investment that works for you, not against you.
Let’s explore how hosting affects your bottom line, identify common problems, and discuss what features you should look for to maximize your return on investment.
1. Start By Auditing Your Website’s Hosting Provider
The wrong hosting provider can quickly eat away at your time & efficiency.
In fact, time is the biggest cost of an insufficient hosting provider.
To start out, ask yourself:
Is Your Bounce Rate High?
Are Customers Not Converting?
Is Revenue Down?
If you answered yes to any of those questions, and no amount of on-page optimization seems to make a difference, it may be time to audit your website host.
Why Audit Your Web Host?
Frequent downtime, poor support, and slow server response times can disrupt workflows and create frustration for both your team and your visitors.
From an SEO & marketing perspective, a sluggish website often leads to:
Increased bounce rates.
Missed customer opportunities.
Wasted time troubleshooting technical issues.
Could you find workarounds for some of these problems? Sure. But they take time and money, too.
The more dashboards and tools you use, the more time you spend managing it all, and the more opportunities you’ll miss out on.
Bluehost’s integrated domain services simplify website management by bringing all your hosting and domain tools into one intuitive platform.
2. Check If Your Hosting Provider Is Causing Slow Site Load Speeds
Your website is often the first interaction a customer has with your brand.
A fast, reliable website reflects professionalism and trustworthiness.
Customers associate smooth experiences with strong brands, while frequent glitches or outages send a message that you’re not dependable.
Your hosting provider should enhance your brand’s reputation, not detract from it.
How To Identify & Measure Slow Page Load Speeds
Identifying and measuring slow site and page loading speeds starts with using tools designed to analyze performance, such as Google PageSpeed Insights, GTmetrix, or Lighthouse.
These tools provide metrics like First Contentful Paint (FCP) and Largest Contentful Paint (LCP), which help you see how quickly key elements of your page load.
Pay attention to your site’s Time to First Byte (TTFB), a critical indicator of how fast your server responds to requests.
Regularly test your site’s performance across different devices, browsers, and internet connections to identify bottlenecks. High bounce rates or short average session durations in analytics reports can also hint at speed issues.
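For a quick spot check between full audits, a rough sketch like the one below can approximate server response time from Python; the URL is a placeholder, and requests’ elapsed timer (with streaming enabled) measures time to the response headers, which is close to TTFB but not a substitute for Lighthouse or field data.

```python
import requests

url = "https://www.example.com/"  # hypothetical page to test

# stream=True defers the body download, so `elapsed` approximates time-to-first-byte
# (DNS + connect + server think time) rather than full page load.
response = requests.get(url, stream=True, timeout=30)
ttfb_ms = response.elapsed.total_seconds() * 1000

print(f"{url} responded {response.status_code} in ~{ttfb_ms:.0f} ms (approximate TTFB)")
response.close()
```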
Bandwidth limitations can create bottlenecks for growing websites, especially during traffic spikes.
How To Find A Fast Hosting Provider
Opt for hosting providers that offer unmetered or scalable bandwidth to ensure seamless performance even during periods of high demand.
Cloud hosting is designed to deliver exceptional site and page load speeds, ensuring a seamless experience for your visitors and boosting your site’s SEO.
With advanced caching technology and optimized server configurations, Bluehost Cloud accelerates content delivery to provide fast, reliable performance even during high-traffic periods.
Its scalable infrastructure ensures your website maintains consistent speeds as your business grows, while a global Content Delivery Network (CDN) helps reduce latency for users around the world.
With Bluehost Cloud, you can trust that your site will load quickly and keep your audience engaged.
3. Check If Your Site Has Frequent Or Prolonged Downtime
Measuring and identifying downtime starts with having the right tools and a clear understanding of your site’s performance.
Tools like uptime monitoring services can track when your site is accessible and alert you to outages in real time.
You should also look at patterns.
Frequent interruptions or prolonged periods of unavailability are red flags. Check your server logs for error codes and timestamps that indicate when the site was down.
Tracking how quickly your hosting provider responds and resolves issues is also helpful, as slow resolutions can compound the problem.
Remember, even a few minutes of downtime during peak traffic hours can lead to lost revenue and customer trust, so understanding and monitoring downtime is critical for keeping your site reliable.
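If you don’t have a monitoring service in place yet, even a simple script like the sketch below (hypothetical URL and interval) can log when your site stops responding, though a dedicated uptime tool remains the better long-term answer.

```python
import time
from datetime import datetime

import requests

URL = "https://www.example.com/"   # hypothetical site to monitor
CHECK_EVERY_SECONDS = 60           # arbitrary interval for this sketch

while True:
    timestamp = datetime.now().isoformat(timespec="seconds")
    try:
        response = requests.get(URL, timeout=10)
        status = "UP" if response.ok else f"DOWN (HTTP {response.status_code})"
    except requests.RequestException as exc:
        status = f"DOWN ({exc.__class__.__name__})"
    # Append to a simple log so outages and their duration can be reviewed later.
    with open("uptime.log", "a") as log:
        log.write(f"{timestamp} {status}\n")
    time.sleep(CHECK_EVERY_SECONDS)
```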
No matter how feature-packed your hosting provider is, unreliable uptime or poor support can undermine its value. These two factors are critical for ensuring a high-performing, efficient website.
What Your Hosting Server Should Have For Guaranteed Uptime
A Service Level Agreement (SLA) guarantees uptime, response time, and resolution time, ensuring that your site remains online and functional. Look for hosting providers that back their promises with a 100% uptime SLA.
Bluehost Cloud offers a 100% uptime SLA and 24/7 priority support, giving you peace of mind that your website will remain operational and any issues will be addressed promptly.
Our team of WordPress experts ensures quick resolutions to technical challenges, reducing downtime and optimizing your hosting ROI.
4. Check Your Host For Security Efficacy
Strong security measures protect your customers and show them you value their privacy and trust.
A single security breach can ruin your brand’s image, especially if customer data is compromised.
Hosts that lack built-in security features like SSL certificates, malware scanning, and regular backups leave your site vulnerable.
How Hosting Impacts Security
Security breaches don’t just affect your website. They affect your customers.
Whether it’s stolen data, phishing attacks, or malware, these breaches can erode trust and cause long-term damage to your business.
Recovering from a security breach is expensive and time-consuming. It often involves hiring specialists, paying fines, and repairing the damage to your reputation.
Is Your Hosting Provider Lacking Proactive Security Measures?
Assessing and measuring security vulnerabilities or a lack of proactive protection measures begins with a thorough evaluation of your hosting provider’s features and practices.
Review Included Security Tools
Start by reviewing whether your provider includes essential security tools such as SSL certificates, malware scanning, firewalls, and automated backups in their standard offerings.
If these are missing or come as costly add-ons, your site may already be at risk.
Leverage Vulnerability Scanning Tools To Check For Weaknesses
Next, use website vulnerability scanning tools like Sucuri, Qualys SSL Labs, or SiteLock to identify potential weaknesses, such as outdated software, unpatched plugins, or misconfigured settings.
These tools can flag issues like weak encryption, exposed directories, or malware infections.
Monitor your site for unusual activity, such as unexpected traffic spikes or changes to critical files, which could signal a breach.
Make Sure The Host Also Routinely Scans For & Eliminates Threats
It’s also crucial to evaluate how your hosting provider handles updates and threat prevention.
Do they offer automatic updates to patch vulnerabilities?
Do they monitor for emerging threats and take steps to block them proactively?
A good hosting provider takes a proactive approach to security, offering built-in protections that reduce your risks.
Look for hosting providers that include automatic SSL encryption, regular malware scans, and daily backups. These features not only protect your site but also give you peace of mind.
Bluehost offers robust security tools as part of its standard WordPress hosting package, ensuring your site stays protected without extra costs. With built-in SSL certificates and daily backups, Bluehost Cloud keeps your site secure and your customers’ trust intact.
5. Audit Your WordPress Hosting Provider’s Customer Support
Is your host delivering limited or inconsistent customer support?
Limited or inconsistent customer support can turn minor issues into major roadblocks. When hosting providers fail to offer timely, knowledgeable assistance, you’re left scrambling to resolve problems that could have been easily fixed.
Delayed responses or unhelpful support can lead to prolonged downtime, slower page speeds, and unresolved security concerns, all of which impact your business and reputation.
Reliable hosting providers should offer 24/7 priority support through multiple channels, such as chat and phone, so you can get expert help whenever you need it.
Consistent, high-quality support is essential for keeping your website running smoothly and minimizing disruptions.
Bluehost takes customer service to the next level with 24/7 priority support available via phone, chat, and email. Our team of knowledgeable experts specializes in WordPress, providing quick and effective solutions to keep your site running smoothly.
Whether you’re troubleshooting an issue, setting up your site, or optimizing performance, Bluehost’s dedicated support ensures you’re never left navigating challenges alone.
Bonus: Check Your Host For Hidden Costs For Essential Hosting Features
Hidden costs for essential hosting features can quickly erode the value of a seemingly affordable hosting plan. These include:
Backups.
SSL certificates.
Additional bandwidth.
What Does This Look Like?
For example, daily backups, which are vital for recovery after data loss or cyberattacks, may come with an unexpected monthly fee.
Similarly, SSL certificates, which are essential for encrypting data and maintaining trust with visitors, are often sold as expensive add-ons.
If your site experiences traffic spikes, additional bandwidth charges can catch you off guard, adding to your monthly costs.
Many providers, as you likely have seen, lure customers in with low entry prices, only to charge extra for services that are critical to your website’s functionality and security.
These hidden expenses not only strain your budget but also create unnecessary complexity in managing your site.
A reliable hosting provider includes these features as part of their standard offering, ensuring you have the tools you need without the surprise bills.
Which Hosting Provider Does Not Charge For Essential Features?
Bluehost is a great option, as their pricing is upfront.
Bluehost includes crucial tools like daily automated backups, SSL certificates, and unmetered bandwidth in their standard plans.
This means you won’t face surprise fees for the basic functionalities your website needs to operate securely and effectively.
Whether you’re safeguarding your site from potential data loss, ensuring encrypted, trustworthy connections for your visitors, or relying on unmetered bandwidth to handle traffic surges without penalty, you’ll gain the flexibility to scale without worrying about extra charges.
We even give WordPress users the option to bundle premium plugins together to help you save even more.
By including these features upfront, Bluehost simplifies your WordPress hosting experience and helps you maintain a predictable budget, freeing you to focus on growing your business instead of worrying about unexpected hosting costs.
Transitioning To A Better Hosting Solution: What To Consider
Switching hosting providers might seem daunting, but the right provider can make the process simple and cost-effective. Here are key considerations for transitioning to a better hosting solution:
Migration Challenges
Migrating your site to a new host can involve technical hurdles, including transferring content, preserving configurations, and minimizing downtime. A hosting provider with dedicated migration support can make this process seamless.
Cost of Switching Providers
Many businesses hesitate to switch hosts due to the cost of ending a contract early. To offset these expenses, search for hosting providers that offer migration incentives, such as contract buyouts or credit for remaining fees.
Why Bluehost Cloud Stands Out
Bluehost Cloud provides comprehensive migration support, handling every detail of the transfer to ensure a smooth transition.
Plus, our migration promotion includes $0 switching costs and credit for remaining contracts, making the move to Bluehost not only hassle-free but also financially advantageous.
Your hosting provider plays a pivotal role in the success of your WordPress site. By addressing performance issues, integrating essential features, and offering reliable support, you can maximize your hosting ROI and create a foundation for long-term success.
If your current hosting provider is falling short, it’s time to evaluate your options. Bluehost Cloud delivers performance-focused features, 100% uptime, premium support, and cost-effective migration services, ensuring your WordPress site runs smoothly and efficiently.
In addition, Bluehost has been a trusted partner of WordPress since 2005, working closely to create a hosting platform tailored to the unique needs of WordPress websites.
Beyond hosting, Bluehost empowers users through education, offering webinars, masterclasses, and resources like the WordPress Academy to help you maximize your WordPress experience and build successful websites.
Take control of your website’s performance and ROI. Visit the Bluehost Migration Page to learn how Bluehost Cloud can elevate your hosting experience.
This article has been sponsored by Bluehost, and the views presented herein represent the sponsor’s perspective.
In the world of ecommerce platforms, plugins, and shopping carts, there are a lot of technology options. WooCommerce for WordPress leads the way in terms of market share.
All of the various ecommerce platforms have their own pros and cons in terms of features, content management, and overall integration with your business.
Many of the benefits of WooCommerce come from the fact that it is a plugin for WordPress, which is also the most popular website platform technology in the world.
My website team utilizes WooCommerce with WordPress for the work we do for clients, and we continue to invest in our processes centered around that technology for digital marketing and driving sales for our clients’ businesses.
We’ve used it for over a decade, and while other popular platforms have emerged, we find that it has the flexibility and opportunities we need to implement SEO tactics in alignment with our broader SEO strategies.
Why Does Any Of This Matter?
You may already be using WooCommerce or another ecommerce platform.
I’m all for whatever platform works best for you. There are definite SEO ceilings that you’ll hit in what you can do on different platforms.
WooCommerce will have ceilings, too, if you aren’t leveraging how you can set it up, how you handle your WordPress optimization as a whole, and how your overall SEO strategy is defined.
I hope that if you’re in WooCommerce or are deciding which platform to choose and have SEO in mind, this article will help you on that journey.
What Makes WooCommerce SEO Unique
WooCommerce SEO is unique because it is within WordPress. Much of what you’ll do to optimize a WooCommerce ecommerce site falls in line with what you’d do for a WordPress site overall.
Overall, SEO-friendly benefits of WooCommerce within WordPress out of the box or with light configuration include:
Analytics: WooCommerce has extensive analytics and connects easily to Google Analytics, so you can blend first and third-party visitor data.
That includes managing the technical, on-page, and off-page aspects of ecommerce SEO within an overall strategy and at a tactical level.
If you’re new to SEO or want to ensure you’re not missing anything, I recommend checking out SEJ’s SEO intro guide.
Getting Started
Before you optimize, you’ll want to ensure you’re ready.
I highly recommend working on developing your action plan and goals before you start.
Knowing your current performance and researching what keywords and topics you want to target are big parts of both.
WooCommerce Analytics
I recommend using Google Analytics (GA4) as your primary analytics data source and platform for WordPress.
Going deeper and specifically into ecommerce analytics that you can integrate into GA4 from WooCommerce, the GTM4WP plugin is a great way to get that data.
Don’t skip out on measuring the data you want and need from your site for your SEO and broader marketing goal tracking.
I recommend prioritizing data before you get deep into optimizing so you can capture baseline data to measure against if you don’t already have it in a good place.
Transactional Emails
Another foundational thing you’ll want to do is set up transactional emails. Several email platforms integrate with WordPress and WooCommerce.
A favorite of my team’s for ease of use and doing the job well is Mailchimp’s transactional email functionality.
It was formerly called Mandrill and can handle post-purchase email communications like order and shipping confirmations.
Mailchimp can also be used to create automated email campaigns based on customer journey or shopping behavior, such as cart abandonment emails, win back, etc.
Functionality like this is essential to get the most out of your SEO investment and the traffic you work hard to drive to the site and into the shopping cart.
Keyword Research
Knowing what words, phrases, topics, and terms are related to the subject matter you want to rank for is critical. Beyond that, validate that people searching for those topics are your potential desired audience.
The leading keyword research tools are paid, with varying subscription levels. They have their respective strengths in helping you research topics that align with your content, products, and categories, and in diving deep into the right targets for your SEO plan.
Build your lists, map them out to your content, and use them as context as you work through the optimization best practices to follow.
Technical SEO
Like with any site, and to follow broader UX best practices, you want your site to load quickly, be indexable, and not have anything holding it back.
Several specific technical factors you need to consider, configure, and monitor can hold back or unlock your opportunity for rankings compared to peer sites.
Indexing
It is essential to have your content found.
That starts by ensuring you have a clean XML sitemap and robots.txt file. Plus, go into Google Search Console, Bing Webmaster Tools, and third-party validation tools to ensure everything is as intended.
Use the Yoast plugin (or similar) to adjust settings for your XML sitemap and robots.txt files.
Yoast is great at giving you options to include or remove from those files, so you don’t have to touch the code or manually adjust those files at all. You can get the settings to your liking and then submit them for validation through the Console/Webmaster Tools.
Image from author, November 2024
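If you want a quick, scriptable sanity check alongside Search Console and Bing Webmaster Tools, a small script can confirm that your robots.txt and XML sitemap respond correctly. This is a minimal sketch using the third-party requests library and a hypothetical store URL; Yoast often generates its sitemap index at /sitemap_index.xml, so adjust the path to match your setup.

import requests  # third-party: pip install requests

SITE = "https://www.example.com"  # hypothetical store URL

robots = requests.get(f"{SITE}/robots.txt", timeout=10)
print("robots.txt status:", robots.status_code)
print("sitemap declared in robots.txt:", "sitemap:" in robots.text.lower())

# Yoast typically generates the sitemap index at /sitemap_index.xml;
# change the path to wherever your sitemap actually lives.
sitemap = requests.get(f"{SITE}/sitemap_index.xml", timeout=10)
print("sitemap status:", sitemap.status_code)
print("looks like XML:", sitemap.text.lstrip().startswith("<?xml"))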
Page Experience
There are a lot of data points and best practices on page load times, site speed, and other factors that Google looks at for “page experience.”
Overall, you want to pay attention to core web vitals and page load times to ensure that you have fast-loading pages that don’t harm image quality and content richness for users.
Imagify and WP Rocket are recommended plugins for image optimization and caching to improve page load times and overall site performance.
Screenshot from Imagify, November 2024
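If you want to spot-check Core Web Vitals from the command line rather than the browser, Google’s PageSpeed Insights API can return Lighthouse lab metrics for a URL. The sketch below assumes the v5 runPagespeed endpoint and its lighthouseResult payload, uses a hypothetical page URL, and relies on the third-party requests library; treat it as a starting point, not a replacement for field data in Search Console.

import requests  # third-party: pip install requests

# Hypothetical page to test; an API key is recommended for automated use.
PAGE = "https://www.example.com/shop/"
API = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

resp = requests.get(API, params={"url": PAGE, "strategy": "mobile"}, timeout=60)
audits = resp.json()["lighthouseResult"]["audits"]

# Print the lab metrics most closely tied to Core Web Vitals.
for metric in ("largest-contentful-paint", "cumulative-layout-shift", "total-blocking-time"):
    print(metric, "->", audits[metric]["displayValue"])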
Accessibility
Making your content accessible to all, including those with visual impairments, is important.
That includes coding to common ADA standards and ensuring that alt attributes and other cues are included.
Not a plugin recommendation here – I recommend using a third-party tool like PowerMapper.com to audit pages to get the helpful information you need to adjust page elements to meet the standard that your legal counsel advises (I’m not a lawyer).
Structured Data
Structured data gives you extra context cues and opportunities to categorize, catalog, and mark up your subject matter. Leverage it where possible to surface information specific to your industry, especially product attributes.
Again, you can tap into the power of the Yoast plugin to add basic schema markup to pages on your site.
I recommend reading more about Schema and how it works before diving into the implementation if it is a new concept.
Screenshot from WordPress, November 2024
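To make the concept concrete, here is a minimal sketch of what Product markup can look like once it is rendered into a page. The product details are hypothetical, and in practice WooCommerce and Yoast generate this JSON-LD for you, so treat it as an illustration of the end result rather than something you need to hand-code.

import json

# Hypothetical product data pulled from your catalog.
product_schema = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "description": "A sturdy example widget for demonstration purposes.",
    "sku": "WID-001",
    "brand": {"@type": "Brand", "name": "Example Brand"},
    "offers": {
        "@type": "Offer",
        "url": "https://www.example.com/product/example-widget/",
        "price": "29.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}

# Emit the JSON-LD block you would expect to find in the page source.
print('<script type="application/ld+json">')
print(json.dumps(product_schema, indent=2))
print("</script>")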
Canonical URLs And Permalinks
Web stores inherently can have complexities and struggles with duplicate content.
Whether you have a product that appears in multiple categories or are simply dealing with the “out of the box” way that WordPress and WooCommerce generate many separate URLs for a single page, you need to designate a single “canonical” version for search engines to index, show in the search results, and consolidate all link value to.
I recommend Yoast here again for handling canonicals.
I also recommend the Redirection plugin if you have pages that move, discontinued products, or need to permanently 301 redirect a specific page to another.
Be mindful of how you use canonicals and redirects, and always validate with tools like Screaming Frog or other lightweight redirect testing tools.
You want to avoid conflicts between multiple plugins that can send the wrong signal to the search engines or provide a bad experience for your users (sending to 404s, redirect loops, etc.).
Screenshot from WordPress, November 2024
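When validating canonicals and redirects, a small script can complement Screaming Frog for quick one-off checks. This is a rough sketch using the third-party requests and beautifulsoup4 libraries against a hypothetical URL: it prints any redirect hops the request follows and the canonical the final page declares.

import requests
from bs4 import BeautifulSoup  # third-party: pip install requests beautifulsoup4

url = "http://www.example.com/product/example-widget"  # hypothetical URL to check

resp = requests.get(url, allow_redirects=True, timeout=10)

# Any 301/302 hops the request followed on the way to the final URL.
for hop in resp.history:
    print(f"{hop.status_code}: {hop.url}")
print(f"Final URL ({resp.status_code}): {resp.url}")

# The canonical URL the final page declares, if any.
soup = BeautifulSoup(resp.text, "html.parser")
canonical = soup.find("link", rel="canonical")
print("Canonical:", canonical["href"] if canonical else "none found")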
Breadcrumbs
Breadcrumbs are links on interior pages that show a user (as well as a search engine) where they are on a site in terms of the navigational path or depth.
They allow users to see how far they are drilled down into a specific product category, blog category, or other interior section and a way to click to go back upstream.
They are typically coded into your WordPress site theme as a default element. The Yoast plugin is great for adding schema markup to them for WordPress/WooCommerce.
Screenshot from WordPress, November 2024
On-Page SEO
On-page ranking factors and SEO aspects for ecommerce SEO that you’ll want to have covered in your WooCommerce site include:
URLs
Beyond the technical aspects of implementing canonical tags and trying to manage duplicate content to get the search engines to index and rank a single version of your pages – including categories and products – you don’t want to miss the opportunity to include important contextual keywords in your URLs.
Use WordPress’ native page naming conventions and tools to put meaningful keywords (without going overboard or stuffing) into the URL string.
Titles And Meta Descriptions
Like any large or enterprise site, if you have many products, find ways to scale tag creation with data-driven content where possible.
Use Yoast to create custom titles and meta descriptions on each page.
Much like copy and URLs, though, also look at how the defaults are set up to pull in dynamic elements and set any that you can use.
That way, you can build formulas for how the tags are created, so you don’t have to write custom tags for each page to keep every page’s tags unique.
Screenshot from WordPress, November 2024
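To show what a formula-driven approach can look like outside of Yoast’s template settings, here is a minimal sketch that builds title tags and meta descriptions from structured product data. The product records, templates, and character limit are hypothetical; the point is that dynamic elements let you generate unique tags at scale and only hand-write the exceptions.

# Hypothetical product records; in practice these would come from your
# WooCommerce catalog export or database.
products = [
    {"name": "Example Widget", "category": "Widgets", "brand": "Example Brand"},
    {"name": "Deluxe Gadget", "category": "Gadgets", "brand": "Example Brand"},
]

TITLE_TEMPLATE = "{name} | {category} | {brand}"
META_TEMPLATE = "Shop the {name} from {brand}. Browse our full range of {category} with fast shipping."

for product in products:
    title = TITLE_TEMPLATE.format(**product)
    meta = META_TEMPLATE.format(**product)
    print(title)
    print(meta[:155])  # keep descriptions near typical display limits
    print()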
Copy
Unique, optimized copy can be a challenge for ecommerce sites.
Much like tags, you might have trouble doing it at scale. Or, you may have a lot of similar products.
Find ways to invest in the manual time to write to best practices, avoid duplicate content, and scale it programmatically where possible while maintaining high quality.
Images
Image file attributes are an area where you can include relevant, contextual keywords describing the image’s subject matter.
This is important for product images, product category-level images, and any content on your site.
They are important for meeting accessibility standards and for helping search engines understand the context of an image.
Manage these in the media center in WordPress at upload or later by editing images through the media tab or going into the page and clicking on the image to review and edit properties.
Screenshot from WordPress, November 2024
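If your catalog is large, auditing alt text by hand gets tedious, so a quick scripted crawl of a page can flag images that are missing it. This is a simple sketch using the third-party requests and beautifulsoup4 libraries against a hypothetical category URL; point it at your own templates to see where alt attributes are being dropped.

import requests
from bs4 import BeautifulSoup  # third-party: pip install requests beautifulsoup4

page = "https://www.example.com/product-category/widgets/"  # hypothetical page

soup = BeautifulSoup(requests.get(page, timeout=10).text, "html.parser")

# Collect image sources where the alt attribute is missing or empty.
missing = [img.get("src") for img in soup.find_all("img") if not img.get("alt")]

print(f"{len(missing)} image(s) missing alt text:")
for src in missing:
    print(" ", src)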
Product Reviews
User-generated, unique content can help add contextual copy, supplementing the copy on a product page.
Reviews also add context and another type of potential schema element to product pages.
My team leverages and recommends the stamped.io plugin for easy management and implementation of reviews.
However, many great review management plugins are available, and they vary in cost, implementation ease, and complexity.
As a bonus, Stamped will also send out post-purchase requests for reviews.
Screenshot from WordPress, November 2024
Screenshot from WordPress, November 2024
Off-Page SEO
Ecommerce SEO, like most SEO, requires off-page factors to build upon your technical and on-page/content-focused tactics.
These factors are more general and less tied specifically to WooCommerce, but they shouldn’t be left out of your SEO plan:
Links
Seek high-quality, industry/context-relevant inbound links to your products, categories, and content.
That includes natural associations like manufacturers, partners, affiliates, PR-related mentions, and other quality natural sources.
Also, link to your site from social media content to build context and connections, and seek out areas of opportunity across the social media landscape to gain links and mentions.
Engagement
Seek out other opportunities for engagement and mentions online.
Whether part of a PR plan, influencer strategy, or other ways your brand gets mentioned, leverage them.
Seek them out, and look for opportunities to get high-quality content to reference yours.
Popular SEO Plugins For WooCommerce
You can boost WooCommerce with other WordPress plugins, many of which are free.
Here’s a recap of the plugins I noted that are related to individual items you’ll want to optimize.
My team’s recommended WordPress plugins to use with WooCommerce (and in many cases in general for WordPress) SEO include:
Yoast: SEO plugin that creates editable XML sitemap and robots.txt files, lets you edit product metadata from product pages, adds basic schema, and handles canonicalization, breadcrumbs, etc.
Imagify: For image optimization for page load time and site speed optimization.
WP Rocket: For caching to improve site performance.
Redirection: For creating any 301 redirects you need as part of an SEO strategy.
Stamped.io (Or similar service): For managing customer product reviews.
GTM4WP: For implementing enhanced ecommerce tracking in Google Analytics.
The great thing, for the most part, about these plugins is that if you have some WordPress experience, you may not need a developer to set them up.
As with any plugin, your WordPress infrastructure might affect your access level and any custom work required to implement them, depending on how they interact with other plugins or functionality.
Wrapping Up
At this point, it is probably pretty clear that a lot of the great things about SEO that we can manage in WordPress also translate over to WooCommerce.
And more broadly, you can implement ecommerce SEO best practices in WooCommerce as a whole.
I made it clear that my team uses WordPress and WooCommerce pretty exclusively right now.
We have had plenty of experiences with Magento, Shopify, and other platforms that left us frustrated as there were things locked down that we couldn’t control or optimize.
Or, as an admin or user, we weren’t able to edit content and manage the site as efficiently as we could with the more user-friendly controls within WordPress.
I’m not saying the other platforms aren’t right for you and your business. I would put each of them through an honest test before you create a new store or consider re-platforming.
There are definitely pros and cons to any platform, and my goal is for you to find the right one. If it is WooCommerce, great – and happy optimizing with the information I shared in this guide!
As we enter the holiday season, October’s data reveals significant shifts and stabilization across industries in AI Overviews (AIOs). Critical insights from October reveal growth in certain sectors, stability in others, and strategic changes in content types and sources. These insights offer actionable strategies for marketers aiming to optimize for AIOs during this critical period.
YouTube Citations In AI Overviews: September Through October
YouTube AI Overviews citations surged in September by 400–450% over the August baseline, the month when YouTube citations were first tracked. The level then stabilized in October at about 110% to 115% of the August baseline, which gives the impression that this level of YouTube AIO citations may represent a new normal.
The kinds of video content that Google AIO tended to cite were:
How-to’s
In-depth reviews
Product comparisons
BrightEdge’s report observed that YouTube AIO citations in November continued to be stable:
Current State (November): Stabilized at approximately 115-120% with minimal day-to-day variation (±3%).
The next few months will show how satisfied users are with YouTube citations. Presumably Google tested YouTube citations before rolling them out, so expectations for a dramatic change should be kept in check. The volatility of YouTube AIO citations was low, indicating that Google may have found the sweet spot for these kinds of citations, so don’t expect this level of YouTube citations to drop, although anything is possible.
This trend highlights the continued importance of a YouTube channel as a way to expand reach and the ongoing evolution away from purely text content. If you embed video on web pages, it’s important to use Schema.org video structured data (VideoObject).
Massive Growth In Travel Industry AIO Citations
Travel AIO citations surged by 700% from September through October. This may reflect Google’s confidence in AI for making travel recommendations.
BrightEdge offered this advice:
“To capture AIO visibility, travel brands should optimize content around seasonal travel, local events, and specific activities. Many of the keywords that are part of this surge start with “Things to do” which then triggers an unordered list.”
Localized and Activity-Specific Travel Queries
Google AIO is showing citations for more localized, specific, and long-tail travel queries, which may mean that AI Overviews is handling more local travel queries that drill down to the neighborhood level, as opposed to broad destination queries.
BrightEdge explained:
“Initially, travel AIOs were dominated by broad, general queries focused on major tourist destinations. However, as the month progressed, there was an increase in more localized, activity specific, and seasonal travel searches, reflecting a deeper level of user intent. By November, AIOs were increasingly focused on niche travel queries covering smaller cities, specific neighborhoods, and unique local activities.”
Examples of the pattern of travel queries that triggered AIO are:
Top attractions in
Things to do in
Family friendly activities in
Fall festivals in
AIO Is Stabilizing And Maturing
Another interesting insight from the BrightEdge data is that the daily growth of AIO citations slowed down to 1.3%, indicating that we are now entering a more stable phase.
BrightEdge offers this insight:
“We are now six months into the AIO era and seeing macro-changes in AI overviews that are getting smaller and smaller.”
Another statistic confirming that AIOs are here to stay is that volatility in AIO citations decreased by 42%, another sign of stability. This is good news because it means more predictability about which keyword phrases will trigger AIO citations.
BrightEdge notes:
“The stabilization in AIO appearance allows brands to optimize for a consistent presence, particularly for evergreen holiday keywords. This benefits campaigns where a steady AIO presence can drive significant traffic and conversions. As AIOs stabilize, planning and incorporating them into strategies becomes easier. This is a pivotal insight for marketers who wish to make AI Overviews part of their 2025 strategy.”
Education Topic Performance
Education topics were on a steady growth trajectory, with a 5% increase in keywords that trigger AIOs, representing 45-50% of keywords. The growth was seen in more complex educational queries like:
cybersecurity certification prerequisites
career options with a psychology degree
psyd vs phd comparison
B2B queries experienced modest growth of 2%, representing 45-50% of keywords, with less volatility in October than in September. Healthcare AIO citations were similarly stable, with only a 1% change in October and 73-75% of keywords triggering AIO citations.
Google Chrome collects site engagement metrics, and Chromium project documentation explains exactly what they are and how they are used.
Site Engagement Metrics
The documentation for the Site Engagement Metrics shares that typing the following into the browser address bar exposes the metrics:
chrome://site-engagement/
What shows up is a list of sites that the browser has visited and Site Engagement Metrics.
Site Engagement Metrics
The Site Engagement Metrics documentation explains that the metrics measure user engagement with a site and that the primary factor used is active time spent. It also offers examples of other signals that may contribute to the measurement.
This is what documentation says:
“The Site Engagement Service provides information about how engaged a user is with a site. The primary signal is the amount of active time the user spends on the site but various other signals may be incorporated (e.g whether a site is added to the homescreen).”
It also shares the following properties of the Chrome Site Engagement Scores:
The score is a double from 0-100. The highest number in the range represents a site the user engages with heavily, and the lowest number represents zero engagement.
Scores are keyed by origin.
Activity on a site increases its score, up to some maximum amount per day.
After a period of inactivity the score will start to decay.
What Chrome Site Engagement Scores Are Used For
Google is transparent about the Chrome Site Engagement metrics because the Chromium Project is open source. The documentation explicitly outlines what the site engagement metrics are, the signals used, how they are calculated, and their intended purposes. There is no ambiguity about their function or use. It’s all laid out in detail.
There are three main uses for the site engagement scores and all three are explicitly for improving the user experience within Chromium-based browsers.
Site engagement metrics are used internally by the browser for these three purposes:
Prioritize Resources: Allocate resources like storage or background sync to sites with higher engagement.
Enable Features: Determine thresholds for enabling specific browser features (e.g., app banners, autoplay).
Sort Sites: Organize lists, such as the most-used sites on the New Tab Page or which tabs to discard when memory is low, based on engagement levels.
The documentation states that the engagement scores were specifically designed for the above three use cases.
Prioritize Resources
Google’s documentation explains that Chrome allocates resources (such as storage space) to websites based on their site engagement levels. Sites with higher user engagement scores are given a greater share of these resources within their browser. The purpose is so that the browser prioritizes sites that are more important or frequently used by the user.
This is what the documentation says:
“Allocating resources based on the proportion of overall engagement a site has (e.g storage, background sync)”
Takeaway: One of the reasons for the site engagement score is to prioritize resources to improve the browser user experience.
Role Of Engagement Metrics For Enabling Features
This part of the documentation explains that Chromium uses site engagement scores to determine whether certain browser features are enabled for a website. Examples of features are app banners and video autoplay.
The site engagement metrics are used to determine whether to let videos autoplay on a given site, if the site is above a specific threshold of engagement. This improves the user experience by preventing annoying video autoplay on sites that have low engagement scores.
This is what the documentation states:
“Setting engagement cutoff points for features (e.g app banner, video autoplay, window.alert())”
Takeaway: The site engagement metrics play a role in determining whether certain features like video autoplay are enabled. The purpose of this metric is to improve the browser user experience.
Sort Sites
The document explicitly says that site engagement scores are used to rank sites for browser functions like tab discarding (when memory is tight) or creating lists of the most-used sites on the New Tab Page (NTP).
“Sorting or prioritizing sites in order of engagement (e.g tab discarding, most used list on NTP)”
Takeaway: Sorting sites based on engagement ensures that the user’s most important and frequently interacted-with sites are prioritized in their browser. It also improves usability through tab management and quick access so that it matches user behavior and preferences.
Privacy
There is absolutely nothing that implies that Google Search uses these site engagement metrics. There is nothing in the documentation that explicitly mentions or implicitly alludes to any other purpose for the site engagement metrics except for improving the user experience and usability of the Chrome browser and Chromium-based devices like the Chromebook.
The engagement scores are limited to a device. The scores aren’t shared between the devices of a single user.
The documentation states:
“The user engagement scores are not synced, so decisions made on a given device are made based on the users’ activity on that device alone.”
The user engagement scores are further isolated when users are in Incognito Mode:
“When in incognito mode, site engagement will be copied from the original profile and then allowed to decay and grow independently. There will be no information flow from the incognito profile back to the original profile. Incognito information is deleted when the browser is shut down.”
User engagement scores are deleted when the browser history is cleared:
“Engagement scores are cleared with browsing history.
Origins are deleted when the history service deletes URLs and subsequently reports zero URLs belonging to that origin are left in history.”
The engagement score for a website decreases over time if the user doesn’t interact with the site; this gradual drop is called “decay.” Because stale scores are eventually forgotten, the data stays relevant to how the user currently browses, which helps the browser optimize itself for usability and the user experience.
When scores “decay to zero,” the associated URLs are removed from the engagement records entirely:
“URLs are cleared when scores decay to zero.”
Takeaway: What Could Google Do With This Data?
It’s understandable that some people, when presented with the facts about Chrome site engagement metrics, will ask, “What if Google is using it?”
Asking “what if” is a powerful way to innovate and explore how a service or a product can be improved or invented. However, basing business decisions on speculative ‘what if’ questions that contradict established facts is counterproductive.
These metrics are solely for improving browser user experience and usability: the scores are not synced and are limited to the device, they are further isolated in Incognito Mode, and they are completely erased when users stop interacting with a site.
That means that the question, “What if Chrome shared site engagement signals with Google?” has no basis in fact. The purpose of these signals and their documented use cases are fully transparent and well understood to be limited to browser usability.
Robots.txt just turned 30 – cue the existential crisis! Like many hitting the big 3-0, it’s wondering if it’s still relevant in today’s world of AI and advanced search algorithms.
Spoiler alert: It definitely is!
Let’s take a look at how this file still plays a key role in managing how search engines crawl your site, how to leverage it correctly, and common pitfalls to avoid.
What Is A Robots.txt File?
A robots.txt file provides crawlers like Googlebot and Bingbot with guidelines for crawling your site. Like a map or directory at the entrance of a museum, it acts as a set of instructions at the entrance of the website, including details on:
What crawlers are/aren’t allowed to enter?
Any restricted areas (pages) that shouldn’t be crawled.
Priority pages to crawl – via the XML sitemap declaration.
Its primary role is to manage crawler access to certain areas of a website by specifying which parts of the site are “off-limits.” This helps ensure that crawlers focus on the most relevant content rather than wasting the crawl budget on low-value content.
While a robots.txt guides crawlers, it’s important to note that not all bots follow its instructions, especially malicious ones. But for most legitimate search engines, adhering to the robots.txt directives is standard practice.
What Is Included In A Robots.txt File?
Robots.txt files consist of lines of directives for search engine crawlers and other bots.
Valid lines in a robots.txt file consist of a field, a colon, and a value.
Robots.txt files also commonly include blank lines to improve readability and comments to help website owners keep track of directives.
Image from author, November 2024
To get a better understanding of what is typically included in a robots.txt file and how different sites leverage it, I looked at robots.txt files for 60 domains with a high share of voice across health, financial services, retail, and high-tech.
Excluding comments and blank lines, the average number of lines across 60 robots.txt files was 152.
Large publishers and aggregators, such as hotels.com, forbes.com, and nytimes.com, typically had longer files, while hospitals like pennmedicine.org and hopkinsmedicine.com typically had shorter files. Retail sites’ robots.txt files typically fall close to the average of 152.
All sites analyzed include the fields user-agent and disallow within their robots.txt files, and 77% of sites included a sitemap declaration with the field sitemap.
Fields leveraged less frequently were allow (used by 60% of sites) and crawl-delay (used by 20% of sites).
Field – % of sites leveraging it:
user-agent – 100%
disallow – 100%
sitemap – 77%
allow – 60%
crawl-delay – 20%
Robots.txt Syntax
Now that we’ve covered what types of fields are typically included in a robots.txt, we can dive deeper into what each one means and how to use it.
User-Agent
The user-agent field specifies which crawler the directives (disallow, allow) apply to. You can use the user-agent field to create rules that apply to specific bots/crawlers or use a wildcard to indicate rules that apply to all crawlers.
For example, the below syntax indicates that any of the following directives only apply to Googlebot.
user-agent: Googlebot
If you want to create rules that apply to all crawlers, you can use a wildcard instead of naming a specific crawler.
user-agent: *
You can include multiple user-agent fields within your robots.txt to provide specific rules for different crawlers or groups of crawlers, for example:
user-agent: *
#Rules here would apply to all crawlers
user-agent: Googlebot
#Rules here would only apply to Googlebot
user-agent: otherbot1
user-agent: otherbot2
user-agent: otherbot3
#Rules here would apply to otherbot1, otherbot2, and otherbot3
Disallow And Allow
The disallow field specifies paths that designated crawlers should not access. The allow field specifies paths that designated crawlers can access.
Because Googlebot and other crawlers will assume they can access any URLs that aren’t specifically disallowed, many sites keep it simple and only specify what paths should not be accessed using the disallow field.
For example, the below syntax would tell all crawlers not to access URLs matching the path /do-not-enter.
user-agent: *
disallow: /do-not-enter
#All crawlers are blocked from crawling pages with the path /do-not-enter
If you’re using both allow and disallow fields within your robots.txt, make sure to read the section on order of precedence for rules in Google’s documentation.
Generally, in the case of conflicting rules, Google will use the more specific rule.
For example, in the below case, Google won’t crawl pages with the path /do-not-enter because the disallow rule is more specific than the allow rule.
user-agent: *
allow: /
disallow: /do-not-enter
If neither rule is more specific, Google will default to using the less restrictive rule.
In the instance below, Google would crawl pages with the path /do-not-enter because the allow rule is less restrictive than the disallow rule.
user-agent: *
allow: /do-not-enter
disallow: /do-not-enter
Note that if there is no path specified for the allow or disallow fields, the rule will be ignored.
user-agent: *
disallow:
This is very different from only including a forward slash (/) as the value for the disallow field, which would match the root domain and any lower-level URL (translation: every page on your site).
If you want your site to show up in search results, make sure you don’t have the following code. It will block all search engines from crawling all pages on your site.
user-agent: *
disallow: /
This might seem obvious, but believe me, I’ve seen it happen.
URL Paths
URL paths are the portion of the URL after the protocol, subdomain, and domain beginning with a forward slash (/). For the example URL https://www.example.com/guides/technical/robots-txt, the path would be /guides/technical/robots-txt.
Image from author, November 2024
URL paths are case-sensitive, so be sure to double-check that the use of uppercase and lowercase characters in the robots.txt aligns with the intended URL path.
Special Characters
A special character is a symbol that has a unique function or meaning instead of just representing a regular letter or number. Special characters supported by Google in robots.txt are:
Asterisk (*) – matches 0 or more instances of any character.
Dollar sign ($) – designates the end of the URL.
To illustrate how these special characters work, assume we have a small site at https://www.example.com with an internal search page (/search) and a handful of guides published under /guides/technical/ and /guides/content/.
Example Scenario 1: Block Internal Site Search Results
A common use of robots.txt is to block internal site search results, as these pages typically aren’t valuable for organic search results.
For this example, assume when users conduct a search on https://www.example.com/search, their query is appended to the URL.
If a user searched “xml sitemap guide,” the new URL for the search results page would be https://www.example.com/search?search-query=xml-sitemap-guide.
When you specify a URL path in the robots.txt, it matches any URLs with that path, not just the exact URL. So, to block both the URLs above, using a wildcard isn’t necessary.
The following rule would match both https://www.example.com/search and https://www.example.com/search?search-query=xml-sitemap-guide.
user-agent: *
disallow: /search
#All crawlers are blocked from crawling pages with the path /search
If a wildcard (*) were added, the results would be the same.
user-agent: *
disallow: /search*
#All crawlers are blocked from crawling pages with the path /search
Example Scenario 2: Block PDF files
In some cases, you may want to use the robots.txt file to block specific types of files.
Imagine the site decided to create PDF versions of each guide to make it easy for users to print. The result is two URLs with exactly the same content, so the site owner may want to block search engines from crawling the PDF versions of each guide.
In this case, using a wildcard (*) would be helpful to match the URLs where the path starts with /guides/ and ends with .pdf, but the characters in between vary.
user-agent: *
disallow: /guides/*.pdf
#All crawlers are blocked from crawling pages with URL paths that contain: /guides/, 0 or more instances of any character, and .pdf
The above directive would prevent search engines from crawling the PDF versions of the guides (any URL whose path starts with /guides/ and contains .pdf).
Example Scenario 3: Block Category Pages
For the last example, assume the site created category pages for technical and content guides to make it easier for users to browse content in the future.
However, since the site only has three guides published right now, these pages aren’t providing much value to users or search engines.
The site owner may want to temporarily prevent search engines from crawling the category page only (e.g., https://www.example.com/guides/technical), not the guides within the category (e.g., https://www.example.com/guides/technical/robots-txt).
To accomplish this, we can leverage “$” to designate the end of the URL path.
user-agent: *
disallow: /guides/technical$
disallow: /guides/content$
#All crawlers are blocked from crawling pages with URL paths that end with /guides/technical and /guides/content
The above syntax would prevent the category page URLs themselves (/guides/technical and /guides/content) from being crawled, while still allowing the individual guides beneath them, such as /guides/technical/robots-txt, to be crawled.
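If you want to reason about how the * and $ patterns match before touching your live file, the matching logic can be approximated with regular expressions. This is a rough illustration only, not Google’s actual implementation, and note that Python’s built-in robots.txt parser (used later in this article) does not understand these extensions.

import re

def pattern_to_regex(path_pattern: str) -> re.Pattern:
    # Escape regex metacharacters, then restore robots.txt semantics:
    # * matches any sequence of characters, a trailing $ anchors the end.
    escaped = re.escape(path_pattern)
    escaped = escaped.replace(r"\*", ".*")
    if escaped.endswith(r"\$"):
        escaped = escaped[:-2] + "$"
    return re.compile("^" + escaped)

rules = ["/guides/*.pdf", "/guides/technical$"]
paths = [
    "/guides/technical/robots-txt.pdf",
    "/guides/technical",
    "/guides/technical/robots-txt",
]

for rule in rules:
    regex = pattern_to_regex(rule)
    for path in paths:
        verdict = "match (blocked)" if regex.match(path) else "no match"
        print(f"{rule} vs {path}: {verdict}")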
Sitemap
The sitemap field is used to provide search engines with a link to one or more XML sitemaps.
While not required, it’s a best practice to include XML sitemaps within the robots.txt file to provide search engines with a list of priority URLs to crawl.
The value of the sitemap field should be an absolute URL (e.g., https://www.example.com/sitemap.xml), not a relative URL (e.g., /sitemap.xml). If you have multiple XML sitemaps, you can include multiple sitemap fields.
Example robots.txt with a single XML sitemap:
user-agent: *
disallow: /do-not-enter
sitemap: https://www.example.com/sitemap.xml
Example robots.txt with multiple XML sitemaps:
user-agent: *
disallow: /do-not-enter
sitemap: https://www.example.com/sitemap-1.xml
sitemap: https://www.example.com/sitemap-2.xml
sitemap: https://www.example.com/sitemap-3.xml
Crawl-Delay
As mentioned above, 20% of sites also include the crawl-delay field within their robots.txt file.
The crawl-delay field tells bots how fast they can crawl the site and is typically used to slow down crawling to avoid overloading servers.
The value for crawl-delay is the number of seconds crawlers should wait to request a new page. The below rule would tell the specified crawler to wait five seconds after each request before requesting another URL.
user-agent: FastCrawlingBot
crawl-delay: 5
Google has stated that it does not support the crawl-delay field, and it will be ignored.
Other major search engines like Bing and Yahoo respect crawl-delay directives for their web crawlers.
Search engine (primary user-agent for search) – respects crawl-delay?
Google (Googlebot) – No
Bing (Bingbot) – Yes
Yahoo (Slurp) – Yes
Yandex (YandexBot) – Yes
Baidu (Baiduspider) – No
Sites most commonly include crawl-delay directives for all user agents (using user-agent: *), for the search engine crawlers mentioned above that respect crawl-delay, and for crawlers of SEO tools like AhrefsBot and SemrushBot.
The number of seconds crawlers were instructed to wait before requesting another URL ranged from one second to 20 seconds, but crawl-delay values of five seconds and 10 seconds were the most common across the 60 sites analyzed.
Testing Robots.txt Files
Any time you’re creating or updating a robots.txt file, make sure to test directives, syntax, and structure before publishing.
The below example shows that Googlebot smartphone is allowed to crawl the tested URL.
Image from author, November 2024
If the tested URL is blocked, the tool will highlight the specific rule that prevents the selected user agent from crawling it.
Image from author, November 2024
To test new rules before they are published, switch to “Editor” and paste your rules into the text box before testing.
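If you prefer to script this kind of check, Python’s standard-library urllib.robotparser can evaluate a draft rule set against sample URLs before you publish it. A caveat for this sketch: the standard-library parser handles simple prefix rules but does not implement Google’s * and $ pattern extensions, so keep testing wildcard rules in a dedicated robots.txt tester.

import urllib.robotparser

# Hypothetical draft rules you are about to publish.
draft_rules = [
    "user-agent: *",
    "disallow: /search",
    "disallow: /do-not-enter",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(draft_rules)

for url in (
    "https://www.example.com/search?search-query=xml-sitemap-guide",
    "https://www.example.com/guides/technical/robots-txt",
):
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
    print(f"{url} -> {verdict}")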
Common Uses Of A Robots.txt File
While what is included in a robots.txt file varies greatly by website, analyzing 60 robots.txt files revealed some commonalities in how it is leveraged and what types of content webmasters commonly block search engines from crawling.
Preventing Search Engines From Crawling Low-Value Content
Many websites, especially large ones like ecommerce or content-heavy platforms, often generate “low-value pages” as a byproduct of features designed to improve the user experience.
For example, internal search pages and faceted navigation options (filters and sorts) help users find what they’re looking for quickly and easily.
While these features are essential for usability, they can result in duplicate or low-value URLs that aren’t valuable for search.
The robots.txt is typically leveraged to block these low-value pages from being crawled.
Common types of content blocked via the robots.txt include:
Parameterized URLs: URLs with tracking parameters, session IDs, or other dynamic variables are blocked because they often lead to the same content, which can create duplicate content issues and waste the crawl budget. Blocking these URLs ensures search engines only index the primary, clean URL.
Filters and sorts: Blocking filter and sort URLs (e.g., product pages sorted by price or filtered by category) helps avoid indexing multiple versions of the same page. This reduces the risk of duplicate content and keeps search engines focused on the most important version of the page.
Internal search results: Internal search result pages are often blocked because they generate content that doesn’t offer unique value. If a user’s search query is injected into the URL, page content, and meta elements, sites might even risk some inappropriate, user-generated content getting crawled and indexed (see the sample screenshot in this post by Matt Tutt). Blocking them prevents this low-quality – and potentially inappropriate – content from appearing in search.
User profiles: Profile pages may be blocked to protect privacy, reduce the crawling of low-value pages, or ensure focus on more important content, like product pages or blog posts.
Testing, staging, or development environments: Staging, development, or test environments are often blocked to ensure that non-public content is not crawled by search engines.
Campaign sub-folders: Landing pages created for paid media campaigns are often blocked when they aren’t relevant to a broader search audience (i.e., a direct mail landing page that prompts users to enter a redemption code).
Checkout and confirmation pages: Checkout pages are blocked to prevent users from landing on them directly through search engines, enhancing user experience and protecting sensitive information during the transaction process.
User-generated and sponsored content: Sponsored content or user-generated content created via reviews, questions, comments, etc., are often blocked from being crawled by search engines.
Media files (images, videos): Media files are sometimes blocked from being crawled to conserve bandwidth and reduce the visibility of proprietary content in search engines. It ensures that only relevant web pages, not standalone files, appear in search results.
APIs: APIs are often blocked to prevent them from being crawled or indexed because they are designed for machine-to-machine communication, not for end-user search results. Blocking APIs protects their usage and reduces unnecessary server load from bots trying to access them.
Blocking “Bad” Bots
Bad bots are web crawlers that engage in unwanted or malicious activities such as scraping content and, in extreme cases, looking for vulnerabilities to steal sensitive information.
Other bots without any malicious intent may still be considered “bad” if they flood websites with too many requests, overloading servers.
Additionally, webmasters may simply not want certain crawlers accessing their site because they don’t stand to gain anything from it.
For example, you may choose to block Baidu if you don’t serve customers in China and don’t want to risk requests from Baidu impacting your server.
Though some of these “bad” bots may disregard the instructions outlined in a robots.txt file, websites still commonly include rules to disallow them.
Out of the 60 robots.txt files analyzed, 100% disallowed at least one user agent from accessing all content on the site (via the disallow: /).
Blocking AI Crawlers
Across sites analyzed, the most blocked crawler was GPTBot, with 23% of sites blocking GPTBot from crawling any content on the site.
Originality.ai’s live dashboard that tracks how many of the top 1,000 websites are blocking specific AI web crawlers found similar results, with 27% of the top 1,000 sites blocking GPTBot as of November 2024.
Reasons for blocking AI web crawlers may vary – from concerns over data control and privacy to simply not wanting your data used in AI training models without compensation.
The decision on whether or not to block AI bots via the robots.txt should be evaluated on a case-by-case basis.
If you don’t want your site’s content to be used to train AI but also want to maximize visibility, you’re in luck. OpenAI is transparent on how it uses GPTBot and other web crawlers.
At a minimum, sites should consider allowing OAI-SearchBot, which is used to feature and link to websites in SearchGPT, ChatGPT’s recently launched real-time search feature.
Blocking OAI-SearchBot is far less common than blocking GPTBot, with only 2.9% of the top 1,000 sites blocking the SearchGPT-focused crawler.
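For example, a site that wants to opt out of model training while remaining eligible to appear in SearchGPT might use rules along these lines (GPTBot and OAI-SearchBot are the user-agent tokens OpenAI documents for these crawlers; adapt the rules to your own policy):

user-agent: GPTBot
disallow: /
#Blocks OpenAI’s model-training crawler from the entire site

user-agent: OAI-SearchBot
allow: /
#Explicitly allows the crawler that powers SearchGPT citations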
Getting Creative
In addition to being an important tool in controlling how web crawlers access your site, the robots.txt file can also be an opportunity for sites to show their “creative” side.
While sifting through files from over 60 sites, I also came across some delightful surprises, like the playful illustrations hidden in the comments on Marriott and Cloudflare’s robots.txt files.
Screenshot of marriott.com/robots.txt, November 2024
Screenshot of cloudflare.com/robots.txt, November 2024
Multiple companies are even turning these files into unique recruitment tools.
TripAdvisor’s robots.txt doubles as a job posting with a clever message included in the comments:
“If you’re sniffing around this file, and you’re not a robot, we’re looking to meet curious folks such as yourself…
Run – don’t crawl – to apply to join TripAdvisor’s elite SEO team[.]”
If you’re looking for a new career opportunity, you might want to consider browsing robots.txt files in addition to LinkedIn.
How To Audit Robots.txt
Auditing your Robots.txt file is an essential part of most technical SEO audits.
Conducting a thorough robots.txt audit ensures that your file is optimized to enhance site visibility without inadvertently restricting important pages.
To audit your Robots.txt file:
Crawl the site using your preferred crawler. (I typically use Screaming Frog, but any web crawler should do the trick.)
Filter crawl for any pages flagged as “blocked by robots.txt.” In Screaming Frog, you can find this information by going to the response codes tab and filtering by “blocked by robots.txt.”
Review the list of URLs blocked by the robots.txt to determine whether they should be blocked. Refer to the above list of common types of content blocked by robots.txt to help you determine whether the blocked URLs should be accessible to search engines.
Open your robots.txt file and conduct additional checks to make sure your robots.txt file follows SEO best practices (and avoids common pitfalls) detailed below.
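The review step can also be partially scripted: Python’s standard-library urllib.robotparser can re-check a list of crawled URLs against your live robots.txt as a second opinion alongside your crawler’s report. The URLs below are hypothetical, and the same caveat as before applies: the standard-library parser ignores Google’s * and $ extensions, so verify wildcard rules in a dedicated tester.

import urllib.robotparser

# Hypothetical URLs exported from the crawl in the first step.
crawled_urls = [
    "https://www.example.com/",
    "https://www.example.com/search?q=widgets",
    "https://www.example.com/checkout/",
]

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the live file

blocked = [url for url in crawled_urls if not rp.can_fetch("Googlebot", url)]

print("Blocked by robots.txt:")
for url in blocked:
    print(" ", url)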
Image from author, November 2024
Robots.txt Best Practices (And Pitfalls To Avoid)
The robots.txt is a powerful tool when used effectively, but there are some common pitfalls to steer clear of if you don’t want to harm the site unintentionally.
The following best practices will help you set yourself up for success and avoid unintentionally blocking search engines from crawling important content:
Create a robots.txt file for each subdomain. Each subdomain on your site (e.g., blog.yoursite.com, shop.yoursite.com) should have its own robots.txt file to manage crawling rules specific to that subdomain. Search engines treat subdomains as separate sites, so a unique file ensures proper control over what content is crawled or indexed.
Don’t block important pages on the site. Make sure priority content, such as product and service pages, contact information, and blog content, are accessible to search engines. Additionally, make sure that blocked pages aren’t preventing search engines from accessing links to content you want to be crawled and indexed.
Don’t block essential resources. Blocking JavaScript (JS), CSS, or image files can prevent search engines from rendering your site correctly. Ensure that important resources required for a proper display of the site are not disallowed.
Include a sitemap reference. Always include a reference to your sitemap in the robots.txt file. This makes it easier for search engines to locate and crawl your important pages more efficiently.
Don’t only allow specific bots to access your site. If you disallow all bots from crawling your site, except for specific search engines like Googlebot and Bingbot, you may unintentionally block bots that could benefit your site. Example bots include:
FacebookExternalHit – used to retrieve Open Graph protocol data.
Googlebot-News – used for the News tab in Google Search and the Google News app.
AdsBot-Google – used to check webpage ad quality.
Don’t block URLs that you want removed from the index. Blocking a URL in robots.txt only prevents search engines from crawling it, not from indexing it if the URL is already known. To remove pages from the index, use other methods like the “noindex” tag or URL removal tools, ensuring they’re properly excluded from search results.
Don’t block Google and other major search engines from crawling your entire site. Just don’t do it.
TL;DR
A robots.txt file guides search engine crawlers on which areas of a website to access or avoid, optimizing crawl efficiency by focusing on high-value pages.
Key fields include “User-agent” to specify the target crawler, “Disallow” for restricted areas, and “Sitemap” for priority pages. The file can also include directives like “Allow” and “Crawl-delay.”
Websites commonly leverage robots.txt to block internal search results, low-value pages (e.g., filters, sort options), or sensitive areas like checkout pages and APIs.
An increasing number of websites are blocking AI crawlers like GPTBot, though this might not be the best strategy for sites looking to gain traffic from additional sources. To maximize site visibility, consider allowing OAI-SearchBot at a minimum.
To set your site up for success, ensure each subdomain has its own robots.txt file, test directives before publishing, include an XML sitemap declaration, and avoid accidentally blocking key content.