Google stresses the importance of using actual user data to assess Core Web Vitals instead of relying only on lab data from tools like PageSpeed Insights (PSI) and Lighthouse.
This reminder comes as the company prepares to update the throttling settings in PSI. These updates are expected to increase the performance scores of websites in Lighthouse.
Field Data vs. Lab Data
Core Web Vitals measure a website’s performance in terms of loading speed, interactivity, and visual stability from the user’s perspective.
Field data shows users’ actual experiences, while lab data comes from tests done in controlled environments using tools like Lighthouse.
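Field data for your own pages appears in the top section of PageSpeed Insights, and it can also be pulled programmatically from the Chrome UX Report (CrUX) API. Below is a rough Python sketch, assuming you have created a CrUX API key and installed the requests library; the example URL is a placeholder.

import requests

API_KEY = "YOUR_CRUX_API_KEY"  # assumption: you have created a CrUX API key
ENDPOINT = f"https://chromeuxreport.googleapis.com/v1/records:queryRecord?key={API_KEY}"

# Ask for real-user (field) data for a single URL on phones.
payload = {"url": "https://www.example.com/", "formFactor": "PHONE"}
record = requests.post(ENDPOINT, json=payload, timeout=10).json().get("record", {})

# Each Core Web Vital includes a 75th-percentile value measured from real Chrome users.
for metric in ("largest_contentful_paint", "interaction_to_next_paint", "cumulative_layout_shift"):
    data = record.get("metrics", {}).get(metric)
    if data:
        print(metric, data["percentiles"]["p75"])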
Barry Pollard, a Web Performance Developer Advocate at Google, recently emphasized focusing on field data.
“You should concentrate on your field Core Web Vitals (the top part of PageSpeed Insights), and only use the lab Lighthouse Score as a very rough guide of whether Lighthouse has recommendations to improve performance or not…
The Lighthouse Score is best for comparing two tests made on the same Lighthouse (e.g. to test and compare fixes).
Performance IS—and hence LH Scores also ARE—highly variable. LH is particularly affected by where it is run from (PSI, DevTools, CI…), but also on lots of other factors.
Lighthouse is a GREAT tool but it also can only test some things, under certain conditions.
So while it’s great to see people interested in improving webperf, make sure you’re doing just that (improve performance) and not just improving the score”
Upcoming Changes To PageSpeed Insights
Pollard discussed user concerns about PageSpeed Insights’ slow servers, which can cause Lighthouse tests to take longer than expected.
To fix this, Google is changing the throttling settings in PageSpeed Insights, which should lead to better performance scores when the update is released in the coming weeks.
These changes will affect both the web interface and the API but will not impact other versions of Lighthouse.
However, Pollard reminds users that “a score of 100 doesn’t mean perfect; it just means Lighthouse can’t help anymore.”
Goodhart’s Law & Web Performance
Pollard referenced Goodhart’s Law, which says that when a measure becomes a goal, it stops being a good measure.
In the web performance context, focusing only on improving Lighthouse scores may not improve actual user experience.
Lighthouse is a helpful tool, but it can only assess certain aspects of performance in specific situations.
Alon Kochba, Web Performance and Software Engineer at Wix, added context to the update, stating:
“Lighthouse scores may not be the most important – but this is a big deal for Lighthouse scores in PageSpeed Insights.
4x -> 1.2x CPU throttling for Mobile device simulation, which was way off for quite a while.”
In a recent episode of Google’s Search Off the Record podcast, Allan Scott from the “Dups” team explained how Google decides which URL to consider as the main one when there are duplicate pages.
He revealed that Google looks at about 40 different signals to pick the main URL from a group of similar pages.
Around 40 Signals For Canonical URL Selection
Duplicate content is a common problem for search engines because many websites have multiple pages with the same or similar content.
To solve this, Google uses a process called canonicalization. This process allows Google to pick one URL as the main version to index and show in search results.
Google has discussed the importance of using signals like rel=”canonical” tags, sitemaps, and 301 redirects for canonicalization. However, the number of signals involved in this process is more than you may expect.
Scott revealed during the podcast:
“I’m not sure what the exact number is right now because it goes up and down, but I suspect it’s somewhere in the neighborhood of 40.”
Some of the known signals mentioned include:
rel=”canonical” tags
301 redirects
HTTPS vs. HTTP
Sitemaps
Internal linking
URL length
The weight and importance of each signal may vary, and some signals, like rel=”canonical” tags, can influence both the clustering and canonicalization process.
Balancing Signals
With so many signals at play, Allan acknowledged the challenges in determining the canonical URL when signals conflict.
He stated:
“If your signals conflict with each other, what’s going to happen is the system will start falling back on lesser signals.”
This means that while strong signals like rel=”canonical” tags and 301 redirects are crucial, other factors can come into play when these signals are unclear or contradictory.
As a result, Google’s canonicalization process involves a delicate balancing act to determine the most appropriate canonical URL.
Best Practices For Canonicalization
Clear signals help Google identify the preferred canonical URL.
Best practices include:
Use rel=”canonical” tags correctly.
Implement 301 redirects for permanently moved content.
Ensure HTTPS versions of pages are accessible and linked.
Submit sitemaps with preferred canonical URLs.
Keep internal linking consistent.
These signals help Google find the correct canonical URLs, improving your site’s crawling, indexing, and search visibility.
Mistakes To Avoid
Here are a few common mistakes to watch out for.
1. Incorrect or conflicting canonical tags:
Pointing to non-existent or 404 pages
Multiple canonical tags with different URLs on one page
Pointing to a different domain entirely
Fix: Double-check canonical tags, use only one per page, and use absolute URLs.
2. Canonical chains or loops
When Page A points to Page B as canonical, but Page B points back to A or another page, creating a loop.
Fix: Ensure canonical URLs always point to the final, preferred version of the page.
3. Using noindex and canonical tags together
Sending mixed signals to search engines. Noindex means don’t index the page at all, making canonicals irrelevant.
Fix: Use canonical tags for consolidation and noindex for exclusion.
4. Canonicalizing to redirect or noindex pages
Pointing canonicals to redirected or noindex pages confuses search engines.
Fix: Canonical URLs should return a 200 status and be indexable (see the sketch after this list).
5. Ignoring case sensitivity
Inconsistent URL casing can cause duplicate content issues.
Fix: Keep URL and canonical tag casing consistent.
6. Overlooking pagination and parameters
Paginated content and parameter-heavy URLs can cause duplication if mishandled.
Fix: Use canonical tags pointing to the first page or “View All” for pagination, and keep parameters consistent.
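For mistakes like #1, #3, and #4 above, a quick automated check can help. Here is a rough Python sketch, assuming the requests and beautifulsoup4 libraries are installed and using a hypothetical page URL, that verifies a page has exactly one absolute canonical tag and that the canonical target returns a 200 status and isn’t marked noindex:

import requests
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

def check_canonical(page_url):
    # Fetch the page and collect its rel="canonical" link tags.
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    canonicals = [tag for tag in soup.find_all("link") if "canonical" in (tag.get("rel") or [])]

    if len(canonicals) != 1 or not canonicals[0].get("href", "").startswith("http"):
        return f"{page_url}: expected exactly one canonical tag with an absolute URL"

    target = canonicals[0]["href"]
    # The canonical target should return 200 directly (no redirect) and be indexable.
    # (A fuller check would also parse the meta robots tag for noindex.)
    resp = requests.get(target, timeout=10, allow_redirects=False)
    if resp.status_code != 200:
        return f"{page_url}: canonical target {target} returned {resp.status_code}"
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        return f"{page_url}: canonical target {target} is marked noindex"
    return f"{page_url}: canonical looks OK -> {target}"

print(check_canonical("https://www.example.com/some-page"))  # hypothetical URL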
Key Takeaways
It’s unlikely the complete list of 40+ signals used to determine canonical URLs will be made publicly available.
However, this was still an insightful discussion worth highlighting.
Here are the key takeaways:
Google uses approximately 40 different signals to determine canonical URLs, with rel=”canonical” tags and 301 redirects being among the strongest indicators
When signals conflict, Google falls back on secondary signals to make its determination
Clear, consistent implementation of canonicalization signals (tags, redirects, sitemaps, internal linking) is crucial
Common mistakes like canonical chains, mixed signals, or incorrect implementations can confuse search engines
Google’s “Search Off the Record” podcast recently highlighted an SEO issue that can make web pages disappear from search results.
In the latest episode, Google Search team member Allan Scott discussed “marauding black holes” formed by grouping similar-looking error pages.
Google’s system can accidentally cluster error pages that look alike, causing regular pages to get included in these groups.
This means Google may not crawl these pages again, which can lead to them being de-indexed, even after fixing the errors.
The podcast explained how this happens, its effects on search traffic, and how website owners can keep their pages from getting lost.
How Google Handles Duplicate Content
To understand content black holes, you must first know how Google handles duplicate content.
Scott explains this happens in two steps:
Clustering: Google groups pages that have the same or very similar content.
Canonicalization: Google then chooses the best URL from each group.
After clustering, Google stops re-crawling these pages. This saves resources and avoids unnecessary indexing of duplicate content.
How Error Pages Create Black Holes
The black hole problem happens when error pages group together because they have similar content, such as generic “Page Not Found” messages. Regular pages with occasional errors or temporary outages can get stuck in these error clusters.
The duplication system prevents the re-crawling of pages within a cluster. This makes it hard for mistakenly grouped pages to escape the “black hole,” even after fixing the initial errors. As a result, these pages can get de-indexed, leading to a loss of organic search traffic.
Scott explained:
“Only the things that are very towards the top of the cluster are likely to get back out. Where this really worries me is sites with transient errors… If those fail to fetch, they might break your render, in which case we’ll look at your page, and we’ll think it’s broken.”
How To Avoid Black Holes
To avoid problems with duplicate content black holes, Scott shared the following advice:
Use the Right HTTP Status Codes: For error pages, use proper status codes (like 404, 403, and 503) instead of a 200 OK status. Only pages marked as 200 OK may be grouped together.
Create Unique Content for Custom Error Pages: If you have custom error pages that use a 200 OK status (common in single-page apps), make sure these pages contain specific content to prevent grouping. For example, include the error code and name in the text.
Caution with Noindex Tags: Do not use noindex tags on error pages unless you want them permanently removed from search results. This tag strongly indicates that you want the pages removed, more so than using error status codes.
Following these tips can help ensure regular pages aren’t accidentally mixed with error pages, keeping them in Google’s index.
Regularly checking your site’s crawl coverage and indexation can help catch duplication issues early.
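As an illustration of the first two recommendations, here is a minimal sketch of a custom 404 handler in a Python/Flask app (a hypothetical setup, not something described in the podcast) that returns a real 404 status and includes the error code plus the requested path so error pages don’t all look identical:

from flask import Flask, request
from markupsafe import escape

app = Flask(__name__)

@app.errorhandler(404)
def page_not_found(error):
    # Return a real 404 status code (not 200 OK) so search engines treat this as an error page,
    # and include the error code and requested path so error pages aren't identical to each other.
    body = (
        "<h1>404 - Page Not Found</h1>"
        f"<p>The URL {escape(request.path)} does not exist on this site.</p>"
    )
    return body, 404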
In Summary
Google’s “Search Off the Record” podcast highlighted a potential SEO issue where error pages can be seen as duplicate content. This can cause regular pages to be grouped with errors and removed from Google’s index, even if the errors are fixed.
To prevent duplicate content issues, website owners should:
Use the correct HTTP status codes for error pages.
Ensure custom error pages have unique content.
Monitor their site’s crawl coverage and indexation.
Following technical SEO best practices is essential for maintaining strong search performance, as emphasized by Google’s Search team.
In a recent YouTube video, Google’s Martin Splitt explained the differences between the “noindex” tag in robots meta tags and the “disallow” command in robots.txt files.
Splitt, a Developer Advocate at Google, pointed out that both methods help manage how search engine crawlers work with a website.
However, they have different purposes and shouldn’t be used in place of each other.
When To Use Noindex
The “noindex” directive tells search engines not to include a specific page in their search results. You can add this instruction in the HTML head section using the robots meta tag or the X-Robots-Tag HTTP header.
Use “noindex” when you want to keep a page from showing up in search results but still allow search engines to read the page’s content. This is helpful for pages that users can see but that you don’t want search engines to display, like thank-you pages or internal search result pages.
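For example, here is a minimal Python/Flask sketch (the route and page are hypothetical) that serves a thank-you page to users while sending the X-Robots-Tag noindex header to crawlers:

from flask import Flask, make_response

app = Flask(__name__)

@app.route("/thank-you")  # hypothetical page users see after checkout
def thank_you():
    resp = make_response("<h1>Thanks! Your order is confirmed.</h1>")
    # The page stays accessible and crawlable, but asks search engines not to index it.
    resp.headers["X-Robots-Tag"] = "noindex"
    return resp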
When To Use Disallow
The “disallow” directive in a website’s robots.txt file stops search engine crawlers from accessing specific URLs or patterns. When a page is disallowed, search engines will not crawl or index its content.
Splitt advises using “disallow” when you want to block search engines completely from retrieving or processing a page. This is suitable for sensitive information, like private user data, or for pages that aren’t relevant to search engines.
Common Mistakes to Avoid
One common mistake website owners make is using “noindex” and “disallow” for the same page. Splitt advises against this because it can cause problems.
If a page is disallowed in the robots.txt file, search engines cannot see the “noindex” command in the page’s meta tag or X-Robots-Tag header. As a result, the page might still get indexed, but with limited information.
To stop a page from appearing in search results, Splitt recommends using the “noindex” command without disallowing the page in the robots.txt file.
Google provides a robots.txt report in Google Search Console to test and monitor how robots.txt files affect search engine indexing.
Why This Matters
Understanding the proper use of “noindex” and “disallow” directives is essential for SEO professionals.
Following Google’s advice and using the available testing tools will help ensure your content appears in search results as intended.
Google Search Central has launched a new series called “Crawling December” to provide insights into how Googlebot crawls and indexes webpages.
Google will publish a new article each week this month exploring various aspects of the crawling process that are not often discussed but can significantly impact website crawling.
The first post in the series covers the basics of crawling and sheds light on essential yet lesser-known details about how Googlebot handles page resources and manages crawl budgets.
Crawling Basics
Today’s websites are complex due to advanced JavaScript and CSS, making them harder to crawl than old HTML-only pages. Googlebot works like a web browser but on a different schedule.
When Googlebot visits a webpage, it first downloads the HTML from the main URL, which may link to JavaScript, CSS, images, and videos. Then, Google’s Web Rendering Service (WRS) uses Googlebot to download these resources to create the final page view.
Here are the steps in order:
Initial HTML download
Processing by the Web Rendering Service
Resource fetching
Final page construction
Crawl Budget Management
Crawling extra resources can reduce the main website’s crawl budget. To help with this, Google says that “WRS tries to cache every resource (JavaScript and CSS) used in the pages it renders.”
It’s important to note that the WRS cache lasts up to 30 days and is not influenced by the HTTP caching rules set by developers.
This caching strategy helps to save a site’s crawl budget.
Recommendations
This post gives site owners tips on how to optimize their crawl budget:
Reduce Resource Use: Use fewer resources to create a good user experience. This helps save crawl budget when rendering a page.
Host Resources Separately: Place resources on a different hostname, like a CDN or subdomain. This can help shift the crawl budget burden away from your main site.
Use Cache-Busting Parameters Wisely: Be careful with cache-busting parameters. Changing resource URLs can make Google recheck them, even if the content is the same. This can waste your crawl budget.
Also, Google warns that blocking resource crawling with robots.txt can be risky.
If Google can’t access a necessary resource for rendering, it may have trouble getting the page content and ranking it properly.
Monitoring Tools
The Search Central team says the best way to see what resources Googlebot is crawling is by checking a site’s raw access logs.
You can identify Googlebot by its IP address using the ranges published in Google’s developer documentation.
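As a rough Python sketch, you could check whether an IP address from your access logs falls within Google’s published Googlebot ranges. The JSON file location below is where Google currently publishes the list; treat the exact URL and response shape as assumptions and confirm them against the documentation.

import ipaddress
import requests

# Google publishes Googlebot's IP ranges as a JSON file in its developer documentation.
# The URL below reflects the current location and may change.
RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def is_googlebot_ip(ip):
    prefixes = requests.get(RANGES_URL, timeout=10).json()["prefixes"]
    networks = [
        ipaddress.ip_network(entry.get("ipv4Prefix") or entry.get("ipv6Prefix"))
        for entry in prefixes
    ]
    return any(ipaddress.ip_address(ip) in net for net in networks)

# Check an address pulled from a raw access log line.
print(is_googlebot_ip("66.249.66.1"))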
Why This Matters
This post clarifies three key points that impact how Google finds and processes your site’s content:
Resource management directly affects your crawl budget, so hosting scripts and styles on CDNs can help preserve it.
Google caches resources for 30 days regardless of your HTTP cache settings, which helps conserve your crawl budget.
Blocking critical resources in robots.txt can backfire by preventing Google from properly rendering your pages.
Understanding these mechanics helps SEOs and developers make better decisions about resource hosting and accessibility – choices that directly impact how well Google can crawl and index their sites.
Your hosting provider is a cornerstone of your business’s online success, impacting everything from site speed and uptime to customer trust and overall branding.
Yet, many businesses stick with subpar hosting providers, often unaware of how much it’s costing them in time, money, and lost opportunities.
The reality is that bad hosting doesn’t just frustrate you. It frustrates your customers, hurts conversions, and can even damage your brand reputation.
The good news?
Choosing the right host can turn hosting into an investment that works for you, not against you.
Let’s explore how hosting affects your bottom line, identify common problems, and discuss what features you should look for to maximize your return on investment.
1. Start By Auditing Your Website’s Hosting Provider
The wrong hosting provider can quickly eat away at your time & efficiency.
In fact, time is the biggest cost of an insufficient hosting provider.
To start out, ask yourself:
Is Your Bounce Rate High?
Are Customers Not Converting?
Is Revenue Down?
If you answered yes to any of those questions, and no amount of on-page optimization seems to make a difference, it may be time to audit your website host.
Why Audit Your Web Host?
Frequent downtime, poor support, and slow server response times can disrupt workflows and create frustration for both your team and your visitors.
From an SEO & marketing perspective, a sluggish website often leads to:
Increased bounce rates.
Missed customer opportunities.
Wasted time troubleshooting technical issues.
Could you find workarounds for some of these problems? Sure. But they take time and money, too.
The more dashboards and tools you use, the more time you spend managing it all, and the more opportunities you’ll miss out on.
Bluehost’s integrated domain services simplify website management by bringing all your hosting and domain tools into one intuitive platform.
2. Check If Your Hosting Provider Is Causing Slow Site Load Speeds
Your website is often the first interaction a customer has with your brand.
A fast, reliable website reflects professionalism and trustworthiness.
Customers associate smooth experiences with strong brands, while frequent glitches or outages send a message that you’re not dependable.
Your hosting provider should enhance your brand’s reputation, not detract from it.
How To Identify & Measure Slow Page Load Speeds
Identifying and measuring slow site and page loading speeds starts with using tools designed to analyze performance, such as Google PageSpeed Insights, GTmetrix, or Lighthouse.
These tools provide metrics like First Contentful Paint (FCP) and Largest Contentful Paint (LCP), which help you see how quickly key elements of your page load.
Pay attention to your site’s Time to First Byte (TTFB), a critical indicator of how fast your server responds to requests.
Regularly test your site’s performance across different devices, browsers, and internet connections to identify bottlenecks. High bounce rates or short average session durations in analytics reports can also hint at speed issues.
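If you want a quick, scriptable check alongside those tools, a rough Python sketch using the requests library can approximate TTFB for a handful of URLs (the URLs below are placeholders):

import requests

def rough_ttfb(url):
    # requests' elapsed covers the time from sending the request until the response
    # headers arrive, which is a reasonable approximation of server response time.
    resp = requests.get(url, stream=True, timeout=30)
    return resp.elapsed.total_seconds()

for url in ("https://www.example.com/", "https://www.example.com/blog/"):
    print(url, f"{rough_ttfb(url):.3f}s")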
Bandwidth limitations can create bottlenecks for growing websites, especially during traffic spikes.
How To Find A Fast Hosting Provider
Opt for hosting providers that offer unmetered or scalable bandwidth to ensure seamless performance even during periods of high demand.
Cloud hosting is designed to deliver exceptional site and page load speeds, ensuring a seamless experience for your visitors and boosting your site’s SEO.
With advanced caching technology and optimized server configurations, Bluehost Cloud accelerates content delivery to provide fast, reliable performance even during high-traffic periods.
Its scalable infrastructure ensures your website maintains consistent speeds as your business grows, while a global Content Delivery Network (CDN) helps reduce latency for users around the world.
With Bluehost Cloud, you can trust that your site will load quickly and keep your audience engaged.
3. Check If Your Site Has Frequent Or Prolonged Downtime
Measuring and identifying downtime starts with having the right tools and a clear understanding of your site’s performance.
Tools like uptime monitoring services can track when your site is accessible and alert you to outages in real time.
You should also look at patterns.
Frequent interruptions or prolonged periods of unavailability are red flags. Check your server logs for error codes and timestamps that indicate when the site was down.
Tracking how quickly your hosting provider responds and resolves issues is also helpful, as slow resolutions can compound the problem.
Remember, even a few minutes of downtime during peak traffic hours can lead to lost revenue and customer trust, so understanding and monitoring downtime is critical for keeping your site reliable.
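Dedicated uptime monitoring services are the most reliable option, but even a simple script can reveal patterns. Here is a minimal Python sketch (the URL and check interval are placeholders) that logs whether the site responds:

import time
import requests

SITE = "https://www.example.com/"  # the site you want to monitor (placeholder)
CHECK_INTERVAL = 60  # seconds between checks

while True:
    try:
        status = requests.get(SITE, timeout=10).status_code
        is_up = status == 200
    except requests.RequestException as exc:
        is_up, status = False, type(exc).__name__
    # Log a timestamped line; reviewing these over time reveals frequent or prolonged outages.
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {'UP' if is_up else 'DOWN'} ({status})")
    time.sleep(CHECK_INTERVAL)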
No matter how feature-packed your hosting provider is, unreliable uptime or poor support can undermine its value. These two factors are critical for ensuring a high-performing, efficient website.
What Your Hosting Server Should Have For Guaranteed Uptime
A Service Level Agreement (SLA) guarantees uptime, response time, and resolution time, ensuring that your site remains online and functional. Look for hosting providers that back their promises with a 100% uptime SLA.
Bluehost Cloud offers a 100% uptime SLA and 24/7 priority support, giving you peace of mind that your website will remain operational and any issues will be addressed promptly.
Our team of WordPress experts ensures quick resolutions to technical challenges, reducing downtime and optimizing your hosting ROI.
4. Check Your Host For Security Efficacy
Strong security measures protect your customers and show them you value their privacy and trust.
A single security breach can ruin your brand’s image, especially if customer data is compromised.
Hosts that lack built-in security features like SSL certificates, malware scanning, and regular backups leave your site vulnerable.
How Hosting Impacts Security
Security breaches don’t just affect your website. They affect your customers.
Whether it’s stolen data, phishing attacks, or malware, these breaches can erode trust and cause long-term damage to your business.
Recovering from a security breach is expensive and time-consuming. It often involves hiring specialists, paying fines, and repairing the damage to your reputation.
Is Your Hosting Provider Lacking Proactive Security Measures?
Assessing and measuring security vulnerabilities or a lack of proactive protection measures begins with a thorough evaluation of your hosting provider’s features and practices.
Review Included Security Tools
Start by reviewing whether your provider includes essential security tools such as SSL certificates, malware scanning, firewalls, and automated backups in their standard offerings.
If these are missing or come as costly add-ons, your site may already be at risk.
Leverage Scanning Tools To Check For Vulnerabilities
Next, use website vulnerability scanning tools like Sucuri, Qualys SSL Labs, or SiteLock to identify potential weaknesses, such as outdated software, unpatched plugins, or misconfigured settings.
These tools can flag issues like weak encryption, exposed directories, or malware infections.
Monitor your site for unusual activity, such as unexpected traffic spikes or changes to critical files, which could signal a breach.
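One small, scriptable piece of this is keeping an eye on SSL certificate expiry. Here is a minimal Python sketch using only the standard library (the hostname is a placeholder):

import socket
import ssl
import time

def cert_days_remaining(hostname):
    # Connect over TLS, read the certificate, and report how many days remain before it expires.
    context = ssl.create_default_context()
    with socket.create_connection((hostname, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)

print(cert_days_remaining("www.example.com"))  # placeholder hostname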
Make Sure The Host Also Routinely Scans For & Eliminates Threats
It’s also crucial to evaluate how your hosting provider handles updates and threat prevention.
Do they offer automatic updates to patch vulnerabilities?
Do they monitor for emerging threats and take steps to block them proactively?
A good hosting provider takes a proactive approach to security, offering built-in protections that reduce your risks.
Look for hosting providers that include automatic SSL encryption, regular malware scans, and daily backups. These features not only protect your site but also give you peace of mind.
Bluehost offers robust security tools as part of its standard WordPress hosting package, ensuring your site stays protected without extra costs. With built-in SSL certificates and daily backups, Bluehost Cloud keeps your site secure and your customers’ trust intact.
5. Audit Your WordPress Hosting Provider’s Customer Support
Is your host delivering limited or inconsistent customer support?
Limited or inconsistent customer support can turn minor issues into major roadblocks. When hosting providers fail to offer timely, knowledgeable assistance, you’re left scrambling to resolve problems that could have been easily fixed.
Delayed responses or unhelpful support can lead to prolonged downtime, slower page speeds, and unresolved security concerns, all of which impact your business and reputation.
Reliable hosting providers should offer 24/7 priority support through multiple channels, such as chat and phone, so you can get expert help whenever you need it.
Consistent, high-quality support is essential for keeping your website running smoothly and minimizing disruptions.
Bluehost takes customer service to the next level with 24/7 priority support available via phone, chat, and email. Our team of knowledgeable experts specializes in WordPress, providing quick and effective solutions to keep your site running smoothly.
Whether you’re troubleshooting an issue, setting up your site, or optimizing performance, Bluehost’s dedicated support ensures you’re never left navigating challenges alone.
Bonus: Check Your Host For Hidden Costs For Essential Hosting Features
Hidden costs for essential hosting features, such as backups, SSL certificates, and additional bandwidth, can quickly erode the value of a seemingly affordable hosting plan.
What Does This Look Like?
For example, daily backups, which are vital for recovery after data loss or cyberattacks, may come with an unexpected monthly fee.
Similarly, SSL certificates, which are essential for encrypting data and maintaining trust with visitors, are often sold as expensive add-ons.
If your site experiences traffic spikes, additional bandwidth charges can catch you off guard, adding to your monthly costs.
Many providers, as you likely have seen, lure customers in with low entry prices, only to charge extra for services that are critical to your website’s functionality and security.
These hidden expenses not only strain your budget but also create unnecessary complexity in managing your site.
A reliable hosting provider includes these features as part of their standard offering, ensuring you have the tools you need without the surprise bills.
Which Hosting Provider Does Not Charge For Essential Features?
Bluehost is a great option, as their pricing is upfront.
Bluehost includes crucial tools like daily automated backups, SSL certificates, and unmetered bandwidth in their standard plans.
This means you won’t face surprise fees for the basic functionalities your website needs to operate securely and effectively.
Whether you’re safeguarding your site from potential data loss, ensuring encrypted, trustworthy connections for your visitors, or handling traffic surges without penalty thanks to unmetered bandwidth, you’ll gain the flexibility to scale without worrying about extra charges.
We even give WordPress users the option to bundle premium plugins together to help you save even more.
By including these features upfront, Bluehost simplifies your WordPress hosting experience and helps you maintain a predictable budget, freeing you to focus on growing your business instead of worrying about unexpected hosting costs.
Transitioning To A Better Hosting Solution: What To Consider
Switching hosting providers might seem daunting, but the right provider can make the process simple and cost-effective. Here are key considerations for transitioning to a better hosting solution:
Migration Challenges
Migrating your site to a new host can involve technical hurdles, including transferring content, preserving configurations, and minimizing downtime. A hosting provider with dedicated migration support can make this process seamless.
Cost of Switching Providers
Many businesses hesitate to switch hosts due to the cost of ending a contract early. To offset these expenses, search for hosting providers that offer migration incentives, such as contract buyouts or credit for remaining fees.
Why Bluehost Cloud Stands Out
Bluehost Cloud provides comprehensive migration support, handling every detail of the transfer to ensure a smooth transition.
Plus, our migration promotion includes $0 switching costs and credit for remaining contracts, making the move to Bluehost not only hassle-free but also financially advantageous.
Your hosting provider plays a pivotal role in the success of your WordPress site. By addressing performance issues, integrating essential features, and offering reliable support, you can maximize your hosting ROI and create a foundation for long-term success.
If your current hosting provider is falling short, it’s time to evaluate your options. Bluehost Cloud delivers performance-focused features, 100% uptime, premium support, and cost-effective migration services, ensuring your WordPress site runs smoothly and efficiently.
In addition, Bluehost has been a trusted partner of WordPress since 2005, working closely to create a hosting platform tailored to the unique needs of WordPress websites.
Beyond hosting, Bluehost empowers users through education, offering webinars, masterclasses, and resources like the WordPress Academy to help you maximize your WordPress experience and build successful websites.
Take control of your website’s performance and ROI. Visit the Bluehost Migration Page to learn how Bluehost Cloud can elevate your hosting experience.
This article has been sponsored by Bluehost, and the views presented herein represent the sponsor’s perspective.
Robots.txt just turned 30 – cue the existential crisis! Like many hitting the big 3-0, it’s wondering if it’s still relevant in today’s world of AI and advanced search algorithms.
Spoiler alert: It definitely is!
Let’s take a look at how this file still plays a key role in managing how search engines crawl your site, how to leverage it correctly, and common pitfalls to avoid.
What Is A Robots.txt File?
A robots.txt file provides crawlers like Googlebot and Bingbot with guidelines for crawling your site. Like a map or directory at the entrance of a museum, it acts as a set of instructions at the entrance of the website, including details on:
Which crawlers are and aren’t allowed to enter.
Any restricted areas (pages) that shouldn’t be crawled.
Priority pages to crawl – via the XML sitemap declaration.
Its primary role is to manage crawler access to certain areas of a website by specifying which parts of the site are “off-limits.” This helps ensure that crawlers focus on the most relevant content rather than wasting the crawl budget on low-value content.
While a robots.txt guides crawlers, it’s important to note that not all bots follow its instructions, especially malicious ones. But for most legitimate search engines, adhering to the robots.txt directives is standard practice.
What Is Included In A Robots.txt File?
Robots.txt files consist of lines of directives for search engine crawlers and other bots.
Valid lines in a robots.txt file consist of a field, a colon, and a value.
Robots.txt files also commonly include blank lines to improve readability and comments to help website owners keep track of directives.
To get a better understanding of what is typically included in a robots.txt file and how different sites leverage it, I looked at robots.txt files for 60 domains with a high share of voice across health, financial services, retail, and high-tech.
Excluding comments and blank lines, the average number of lines across 60 robots.txt files was 152.
Large publishers and aggregators, such as hotels.com, forbes.com, and nytimes.com, typically had longer files, while hospitals like pennmedicine.org and hopkinsmedicine.org typically had shorter files. Retail sites’ robots.txt files typically fell close to the average of 152.
All sites analyzed include the fields user-agent and disallow within their robots.txt files, and 77% of sites included a sitemap declaration with the field sitemap.
Fields leveraged less frequently were allow (used by 60% of sites) and crawl-delay (used by 20% of sites).
user-agent – used by 100% of sites
disallow – used by 100% of sites
sitemap – used by 77% of sites
allow – used by 60% of sites
crawl-delay – used by 20% of sites
Robots.txt Syntax
Now that we’ve covered what types of fields are typically included in a robots.txt, we can dive deeper into what each one means and how to use it.
User-Agent
The user-agent field specifies what crawler the directives (disallow, allow) apply to. You can use the user-agent field to create rules that apply to specific bots/crawlers or use a wildcard to indicate rules that apply to all crawlers.
For example, the below syntax indicates that any of the following directives only apply to Googlebot.
user-agent: Googlebot
If you want to create rules that apply to all crawlers, you can use a wildcard instead of naming a specific crawler.
user-agent: *
You can include multiple user-agent fields within your robots.txt to provide specific rules for different crawlers or groups of crawlers, for example:
user-agent: *
#Rules here would apply to all crawlers
user-agent: Googlebot
#Rules here would only apply to Googlebot
user-agent: otherbot1
user-agent: otherbot2
user-agent: otherbot3
#Rules here would apply to otherbot1, otherbot2, and otherbot3
Disallow And Allow
The disallow field specifies paths that designated crawlers should not access. The allow field specifies paths that designated crawlers can access.
Because Googlebot and other crawlers will assume they can access any URLs that aren’t specifically disallowed, many sites keep it simple and only specify what paths should not be accessed using the disallow field.
For example, the below syntax would tell all crawlers not to access URLs matching the path /do-not-enter.
user-agent: *
disallow: /do-not-enter
#All crawlers are blocked from crawling pages with the path /do-not-enter
If you’re using both allow and disallow fields within your robots.txt, make sure to read the section on order of precedence for rules in Google’s documentation.
Generally, in the case of conflicting rules, Google will use the more specific rule.
For example, in the below case, Google won’t crawl pages with the path /do-not-enter because the disallow rule is more specific than the allow rule.
user-agent: *
allow: /
disallow: /do-not-enter
If neither rule is more specific, Google will default to using the less restrictive rule.
In the instance below, Google would crawl pages with the path /do-not-enter because the allow rule is less restrictive than the disallow rule.
user-agent: *
allow: /do-not-enter
disallow: /do-not-enter
Note that if there is no path specified for the allow or disallow fields, the rule will be ignored.
user-agent: *
disallow:
This is very different from only including a forward slash (/) as the value for the disallow field, which would match the root domain and any lower-level URL (translation: every page on your site).
If you want your site to show up in search results, make sure you don’t have the following code. It will block all search engines from crawling all pages on your site.
user-agent: *
disallow: /
This might seem obvious, but believe me, I’ve seen it happen.
URL Paths
URL paths are the portion of the URL after the protocol, subdomain, and domain beginning with a forward slash (/). For the example URL https://www.example.com/guides/technical/robots-txt, the path would be /guides/technical/robots-txt.
URL paths are case-sensitive, so be sure to double-check that the use of capitals and lower cases in the robots.txt aligns with the intended URL path.
Special Characters
A special character is a symbol that has a unique function or meaning instead of just representing a regular letter or number. Special characters supported by Google in robots.txt are:
Asterisk (*) – matches 0 or more instances of any character.
Dollar sign ($) – designates the end of the URL.
To illustrate how these special characters work, assume we have a small site with guide pages under /guides/ and an internal search page at /search.
Example Scenario 1: Block Internal Site Search Results
A common use of robots.txt is to block internal site search results, as these pages typically aren’t valuable for organic search results.
For this example, assume when users conduct a search on https://www.example.com/search, their query is appended to the URL.
If a user searched “xml sitemap guide,” the new URL for the search results page would be https://www.example.com/search?search-query=xml-sitemap-guide.
When you specify a URL path in the robots.txt, it matches any URLs with that path, not just the exact URL. So, to block both the URLs above, using a wildcard isn’t necessary.
The following rule would match both https://www.example.com/search and https://www.example.com/search?search-query=xml-sitemap-guide.
user-agent: *
disallow: /search
#All crawlers are blocked from crawling pages with the path /search
If a wildcard (*) were added, the results would be the same.
user-agent: *
disallow: /search*
#All crawlers are blocked from crawling pages with the path /search
Example Scenario 2: Block PDF files
In some cases, you may want to use the robots.txt file to block specific types of files.
Imagine the site decided to create PDF versions of each guide to make it easy for users to print. The result is two URLs with exactly the same content, so the site owner may want to block search engines from crawling the PDF versions of each guide.
In this case, using a wildcard (*) would be helpful to match the URLs where the path starts with /guides/ and ends with .pdf, but the characters in between vary.
user-agent: *
disallow: /guides/*.pdf
#All crawlers are blocked from crawling pages with URL paths that contain: /guides/, 0 or more instances of any character, and .pdf
The above directive would prevent search engines from crawling the PDF versions of the guides, such as https://www.example.com/guides/technical/robots-txt.pdf.
Example Scenario 3: Block Category Pages
For the last example, assume the site created category pages for technical and content guides to make it easier for users to browse content in the future.
However, since the site only has three guides published right now, these pages aren’t providing much value to users or search engines.
The site owner may want to temporarily prevent search engines from crawling the category page only (e.g., https://www.example.com/guides/technical), not the guides within the category (e.g., https://www.example.com/guides/technical/robots-txt).
To accomplish this, we can leverage “$” to designate the end of the URL path.
user-agent: *
disallow: /guides/technical$
disallow: /guides/content$
#All crawlers are blocked from crawling pages with URL paths that end with /guides/technical and /guides/content
The above syntax would prevent the category pages themselves (https://www.example.com/guides/technical and https://www.example.com/guides/content) from being crawled, while the individual guides beneath them would remain crawlable.
Sitemap
The sitemap field is used to provide search engines with a link to one or more XML sitemaps.
While not required, it’s a best practice to include XML sitemaps within the robots.txt file to provide search engines with a list of priority URLs to crawl.
The value of the sitemap field should be an absolute URL (e.g., https://www.example.com/sitemap.xml), not a relative URL (e.g., /sitemap.xml). If you have multiple XML sitemaps, you can include multiple sitemap fields.
Example robots.txt with a single XML sitemap:
user-agent: *
disallow: /do-not-enter
sitemap: https://www.example.com/sitemap.xml
Example robots.txt with multiple XML sitemaps:
user-agent: *
disallow: /do-not-enter
sitemap: https://www.example.com/sitemap-1.xml
sitemap: https://www.example.com/sitemap-2.xml
sitemap: https://www.example.com/sitemap-3.xml
Crawl-Delay
As mentioned above, 20% of sites also include the crawl-delay field within their robots.txt file.
The crawl delay field tells bots how fast they can crawl the site and is typically used to slow down crawling to avoid overloading servers.
The value for crawl-delay is the number of seconds crawlers should wait to request a new page. The below rule would tell the specified crawler to wait five seconds after each request before requesting another URL.
user-agent: FastCrawlingBot
crawl-delay: 5
Google has stated that it does not support the crawl-delay field, and it will be ignored.
Other major search engines like Bing and Yahoo respect crawl-delay directives for their web crawlers.
Google (Googlebot) – does not respect crawl-delay
Bing (Bingbot) – respects crawl-delay
Yahoo (Slurp) – respects crawl-delay
Yandex (YandexBot) – respects crawl-delay
Baidu (Baiduspider) – does not respect crawl-delay
Sites most commonly include crawl-delay directives for all user agents (using user-agent: *), search engine crawlers mentioned above that respect crawl-delay, and crawlers for SEO tools like AhrefsBot and SemrushBot.
The number of seconds crawlers were instructed to wait before requesting another URL ranged from one second to 20 seconds, but crawl-delay values of five seconds and 10 seconds were the most common across the 60 sites analyzed.
Testing Robots.txt Files
Any time you’re creating or updating a robots.txt file, make sure to test directives, syntax, and structure before publishing.
A robots.txt testing tool will show whether a specific user agent (such as Googlebot smartphone) is allowed to crawl the tested URL.
If the tested URL is blocked, the tool will highlight the specific rule that prevents the selected user agent from crawling it.
To test new rules before they are published, paste them into the tool’s editor and rerun the test.
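If you prefer to script this step, Python’s built-in robots.txt parser can check proposed rules against a list of URLs before you publish them. Here is a minimal sketch; the rules and URLs mirror the example scenarios above and are placeholders.

from urllib import robotparser

# Proposed rules to check before publishing. Note: the standard-library parser does only
# simple prefix matching and does not understand Google's wildcard (*) or end-of-URL ($)
# extensions, so test those rules with a Google-specific tool instead.
PROPOSED_RULES = """\
user-agent: *
disallow: /search
disallow: /do-not-enter
"""

URLS_TO_TEST = [
    "https://www.example.com/search?search-query=xml-sitemap-guide",
    "https://www.example.com/guides/technical/robots-txt",
    "https://www.example.com/do-not-enter",
]

parser = robotparser.RobotFileParser()
parser.parse(PROPOSED_RULES.splitlines())

for url in URLS_TO_TEST:
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(f"{verdict}: {url}")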
Common Uses Of A Robots.txt File
While what is included in a robots.txt file varies greatly by website, analyzing 60 robots.txt files revealed some commonalities in how it is leveraged and what types of content webmasters commonly block search engines from crawling.
Preventing Search Engines From Crawling Low-Value Content
Many websites, especially large ones like ecommerce or content-heavy platforms, often generate “low-value pages” as a byproduct of features designed to improve the user experience.
For example, internal search pages and faceted navigation options (filters and sorts) help users find what they’re looking for quickly and easily.
While these features are essential for usability, they can result in duplicate or low-value URLs that aren’t valuable for search.
The robots.txt is typically leveraged to block these low-value pages from being crawled.
Common types of content blocked via the robots.txt include:
Parameterized URLs: URLs with tracking parameters, session IDs, or other dynamic variables are blocked because they often lead to the same content, which can create duplicate content issues and waste the crawl budget. Blocking these URLs ensures search engines only index the primary, clean URL.
Filters and sorts: Blocking filter and sort URLs (e.g., product pages sorted by price or filtered by category) helps avoid indexing multiple versions of the same page. This reduces the risk of duplicate content and keeps search engines focused on the most important version of the page.
Internal search results: Internal search result pages are often blocked because they generate content that doesn’t offer unique value. If a user’s search query is injected into the URL, page content, and meta elements, sites might even risk some inappropriate, user-generated content getting crawled and indexed (see the sample screenshot in this post by Matt Tutt). Blocking them prevents this low-quality – and potentially inappropriate – content from appearing in search.
User profiles: Profile pages may be blocked to protect privacy, reduce the crawling of low-value pages, or ensure focus on more important content, like product pages or blog posts.
Testing, staging, or development environments: Staging, development, or test environments are often blocked to ensure that non-public content is not crawled by search engines.
Campaign sub-folders: Landing pages created for paid media campaigns are often blocked when they aren’t relevant to a broader search audience (i.e., a direct mail landing page that prompts users to enter a redemption code).
Checkout and confirmation pages: Checkout pages are blocked to prevent users from landing on them directly through search engines, enhancing user experience and protecting sensitive information during the transaction process.
User-generated and sponsored content: Sponsored content or user-generated content created via reviews, questions, comments, etc., are often blocked from being crawled by search engines.
Media files (images, videos): Media files are sometimes blocked from being crawled to conserve bandwidth and reduce the visibility of proprietary content in search engines. It ensures that only relevant web pages, not standalone files, appear in search results.
APIs: APIs are often blocked to prevent them from being crawled or indexed because they are designed for machine-to-machine communication, not for end-user search results. Blocking APIs protects their usage and reduces unnecessary server load from bots trying to access them.
Blocking “Bad” Bots
Bad bots are web crawlers that engage in unwanted or malicious activities such as scraping content and, in extreme cases, looking for vulnerabilities to steal sensitive information.
Other bots without any malicious intent may still be considered “bad” if they flood websites with too many requests, overloading servers.
Additionally, webmasters may simply not want certain crawlers accessing their site because they don’t stand to gain anything from it.
For example, you may choose to block Baidu if you don’t serve customers in China and don’t want to risk requests from Baidu impacting your server.
Though some of these “bad” bots may disregard the instructions outlined in a robots.txt file, websites still commonly include rules to disallow them.
Out of the 60 robots.txt files analyzed, 100% disallowed at least one user agent from accessing all content on the site (via the disallow: /).
Blocking AI Crawlers
Across sites analyzed, the most blocked crawler was GPTBot, with 23% of sites blocking GPTBot from crawling any content on the site.
Originality.ai’s live dashboard that tracks how many of the top 1,000 websites are blocking specific AI web crawlers found similar results, with 27% of the top 1,000 sites blocking GPTBot as of November 2024.
Reasons for blocking AI web crawlers may vary – from concerns over data control and privacy to simply not wanting your data used in AI training models without compensation.
The decision on whether or not to block AI bots via the robots.txt should be evaluated on a case-by-case basis.
If you don’t want your site’s content to be used to train AI but also want to maximize visibility, you’re in luck. OpenAI is transparent on how it uses GPTBot and other web crawlers.
At a minimum, sites should consider allowing OAI-SearchBot, which is used to feature and link to websites in the SearchGPT – ChatGPT’s recently launched real-time search feature.
Blocking OAI-SearchBot is far less common than blocking GPTBot, with only 2.9% of the top 1,000 sites blocking the SearchGPT-focused crawler.
Getting Creative
In addition to being an important tool in controlling how web crawlers access your site, the robots.txt file can also be an opportunity for sites to show their “creative” side.
While sifting through files from over 60 sites, I also came across some delightful surprises, like the playful illustrations hidden in the comments on Marriott and Cloudflare’s robots.txt files.
Screenshot of marriott.com/robots.txt, November 2024
Screenshot of cloudflare.com/robots.txt, November 2024
Multiple companies are even turning these files into unique recruitment tools.
TripAdvisor’s robots.txt doubles as a job posting with a clever message included in the comments:
“If you’re sniffing around this file, and you’re not a robot, we’re looking to meet curious folks such as yourself…
Run – don’t crawl – to apply to join TripAdvisor’s elite SEO team[.]”
If you’re looking for a new career opportunity, you might want to consider browsing robots.txt files in addition to LinkedIn.
How To Audit Robots.txt
Auditing your robots.txt file is an essential part of most technical SEO audits.
Conducting a thorough robots.txt audit ensures that your file is optimized to enhance site visibility without inadvertently restricting important pages.
To audit your robots.txt file:
Crawl the site using your preferred crawler. (I typically use Screaming Frog, but any web crawler should do the trick.)
Filter crawl for any pages flagged as “blocked by robots.txt.” In Screaming Frog, you can find this information by going to the response codes tab and filtering by “blocked by robots.txt.”
Review the list of URLs blocked by the robots.txt to determine whether they should be blocked. Refer to the above list of common types of content blocked by robots.txt to help you determine whether the blocked URLs should be accessible to search engines.
Open your robots.txt file and conduct additional checks to make sure your robots.txt file follows SEO best practices (and avoids common pitfalls) detailed below.
Robots.txt Best Practices (And Pitfalls To Avoid)
The robots.txt is a powerful tool when used effectively, but there are some common pitfalls to steer clear of if you don’t want to harm the site unintentionally.
The following best practices will help set yourself up for success and avoid unintentionally blocking search engines from crawling important content:
Create a robots.txt file for each subdomain. Each subdomain on your site (e.g., blog.yoursite.com, shop.yoursite.com) should have its own robots.txt file to manage crawling rules specific to that subdomain. Search engines treat subdomains as separate sites, so a unique file ensures proper control over what content is crawled or indexed.
Don’t block important pages on the site. Make sure priority content, such as product and service pages, contact information, and blog content, are accessible to search engines. Additionally, make sure that blocked pages aren’t preventing search engines from accessing links to content you want to be crawled and indexed.
Don’t block essential resources. Blocking JavaScript (JS), CSS, or image files can prevent search engines from rendering your site correctly. Ensure that important resources required for a proper display of the site are not disallowed.
Include a sitemap reference. Always include a reference to your sitemap in the robots.txt file. This makes it easier for search engines to locate and crawl your important pages more efficiently.
Don’t only allow specific bots to access your site. If you disallow all bots from crawling your site, except for specific search engines like Googlebot and Bingbot, you may unintentionally block bots that could benefit your site. Example bots include:
FacebookExternalHit – used to retrieve Open Graph metadata when pages are shared.
Googlebot-News – used for the News tab in Google Search and the Google News app.
AdsBot-Google – used to check webpage ad quality.
Don’t block URLs that you want removed from the index. Blocking a URL in robots.txt only prevents search engines from crawling it, not from indexing it if the URL is already known. To remove pages from the index, use other methods like the “noindex” tag or URL removal tools, ensuring they’re properly excluded from search results.
Don’t block Google and other major search engines from crawling your entire site. Just don’t do it.
TL;DR
A robots.txt file guides search engine crawlers on which areas of a website to access or avoid, optimizing crawl efficiency by focusing on high-value pages.
Key fields include “User-agent” to specify the target crawler, “Disallow” for restricted areas, and “Sitemap” for priority pages. The file can also include directives like “Allow” and “Crawl-delay.”
Websites commonly leverage robots.txt to block internal search results, low-value pages (e.g., filters, sort options), or sensitive areas like checkout pages and APIs.
An increasing number of websites are blocking AI crawlers like GPTBot, though this might not be the best strategy for sites looking to gain traffic from additional sources. To maximize site visibility, consider allowing OAI-SearchBot at a minimum.
To set your site up for success, ensure each subdomain has its own robots.txt file, test directives before publishing, include an XML sitemap declaration, and avoid accidentally blocking key content.
Here are seven essential features to look for in an SEO-friendly WordPress host:
1. Reliable Uptime & Speed for Consistent Performance
A website’s uptime and speed can significantly influence your site’s rankings and the success of your SEO strategies.
Users don’t like sites that suffer from significant downtime or sluggish load speeds. Not only are these sites inconvenient, but they also reflect negatively on the brand and their products and services, making them appear less trustworthy and of lower quality.
For these reasons, Google values websites that load quickly and reliably. So, if your site suffers from significant downtime or sluggish load times, it can negatively affect your site’s position in search results as well as frustrate users.
Reliable hosting with minimal downtime and fast server response times helps ensure that both users and search engines can access your content seamlessly.
Performance-focused infrastructure, optimized for fast server responses, is essential for delivering a smooth and engaging user experience.
When evaluating hosting providers, look for high uptime guarantees through a robust Service Level Agreement (SLA), which assures site availability and speed.
Bluehost Cloud, for instance, offers a 100% SLA for uptime, response time, and resolution time.
Built specifically with WordPress users in mind, Bluehost Cloud leverages an infrastructure optimized to deliver the speed and reliability that WordPress sites require, enhancing both SEO performance and user satisfaction. This guarantee provides you with peace of mind.
Your site will remain accessible and perform optimally around the clock, and you’ll spend less time troubleshooting and dealing with your host’s support team trying to get your site back online.
2. Data Center Locations & CDN Options For Global Reach
Fast load times are crucial not only for providing a better user experience but also for reducing bounce rates and boosting SEO rankings.
Since Google prioritizes websites that load quickly for users everywhere, having data centers in multiple locations and Content Delivery Network (CDN) integration is essential for WordPress sites with a global audience.
To ensure your site loads quickly for all users, no matter where they are, choose a WordPress host with a distributed network of data centers and CDN support. Consider whether it offers CDN options and data center locations that align with your audience’s geographic distribution.
This setup allows your content to reach users swiftly across different regions, enhancing both user satisfaction and search engine performance.
Bluehost Cloud integrates with a CDN to accelerate content delivery across the globe. This means that whether your visitors are in North America, Europe, or Asia, they’ll experience faster load times.
By leveraging global data centers and a CDN, Bluehost Cloud ensures your site’s SEO remains strong, delivering a consistent experience for users around the world.
3. Built-In Security Features To Protect From SEO-Damaging Attacks
Security is essential for your brand, your SEO, and overall site health.
Websites that experience security breaches, malware, or frequent hacking attempts can be penalized by search engines, potentially suffering from ranking drops or even removal from search indexes.
Therefore, it’s critical to select a host that offers strong built-in security features to safeguard your website and its SEO performance.
When evaluating hosting providers, look for options that include additional security features.
Bluehost Cloud, for example, offers comprehensive security features designed to protect WordPress sites, including free SSL certificates to encrypt data, automated daily backups, and regular malware scans.
These features help maintain a secure environment, preventing security issues from impacting your potential customers, your site’s SEO, and ultimately, your bottom line.
With Bluehost Cloud, your site’s visitors, data, and search engine rankings remain secure, providing you with peace of mind and a safe foundation for SEO success.
4. Optimized Database & File Management For Fast Site Performance
A poorly managed database can slow down site performance, which affects load times and visitor experience. Therefore, efficient data handling and optimized file management are essential for fast site performance.
Choose a host with advanced database and file management tools, as well as caching solutions that enhance site speed. Bluehost Cloud supports WordPress sites with advanced database optimization, ensuring quick, efficient data handling even as your site grows.
With features like server-level caching and optimized databases, Bluehost Cloud is built to handle WordPress’ unique requirements, enabling your site to perform smoothly without additional plugins or manual adjustments.
Bluehost Cloud contributes to a better user experience and a stronger SEO foundation by keeping your WordPress site fast and efficient.
5. SEO-Friendly, Scalable Bandwidth For Growing Sites
As your site’s popularity grows, so do its bandwidth requirements. Scalable or unmetered bandwidth is vital to handle traffic spikes without slowing down your site and impacting your SERP performance.
High-growth websites, in particular, benefit from hosting providers that offer flexible bandwidth options, ensuring consistent speed and availability even during peak traffic.
To avoid disaster, select a hosting provider that offers scalable or unmetered bandwidth as part of its package. Bluehost Cloud’s unmetered bandwidth, for instance, is designed to accommodate high-traffic sites without affecting load times or user experience.
This ensures that your site remains responsive and accessible during high-traffic periods, supporting your growth and helping you maintain your SEO rankings.
For websites anticipating growth, unmetered bandwidth with Bluehost Cloud provides a reliable, flexible solution to ensure long-term performance.
6. WordPress-Specific Support & SEO Optimization Tools
WordPress has unique needs when it comes to SEO, making specialized hosting support essential.
Hosts that cater specifically to WordPress provide an added advantage by offering tools and configurations, such as staging environments and one-click installations, built for the platform.
WordPress-specific hosting providers also have an entire team of knowledgeable support and technical experts who can help you significantly improve your WordPress site’s performance.
Bluehost Cloud is a WordPress-focused hosting solution that offers priority, 24/7 support from WordPress experts, ensuring any issue you encounter is dealt with effectively.
Additionally, Bluehost’s staging environments enable you to test changes and updates before going live, reducing the risk of SEO-impacting errors.
Switching to Bluehost is easy, affordable, and stress-free, too.
Bluehost offers a seamless migration service that handles the switch for you. Our dedicated migration support team manages the entire transfer process, ensuring your WordPress site’s content, settings, and configurations are moved safely and accurately.
Currently, Bluehost also covers all migration costs, so you can make the switch with zero out-of-pocket expenses. We’ll credit the remaining cost of your existing contract, making the transition financially advantageous.
You can actually save money or even gain credit by switching.
7. Integrated Domain & Site Management For Simplified SEO Administration
SEO often involves managing domain settings, redirects, DNS configurations, and SSL updates, which can become complicated without centralized management.
An integrated hosting provider that allows you to manage your domain and hosting in one place simplifies these SEO tasks and makes it easier to maintain a strong SEO foundation.
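As one concrete example of these tasks, the sketch below shows how a permanent (301) redirect, the kind search engines follow when consolidating ranking signals onto a new URL, might be handled at the server level. It is an illustration only: the paths are hypothetical, and a WordPress site would more commonly configure this through .htaccess, the web server config, or a redirect plugin.

```typescript
// Minimal sketch of a server-level 301 redirect (hypothetical paths).
// A 301 signals a permanent move, so search engines consolidate
// ranking signals on the new URL.
import { createServer } from "node:http";

const REDIRECTS: Record<string, string> = {
  "/old-blog-post": "/blog/new-post-slug", // hypothetical mapping
};

createServer((req, res) => {
  const target = req.url ? REDIRECTS[req.url] : undefined;
  if (target) {
    res.writeHead(301, { Location: target });
    res.end();
    return;
  }
  res.writeHead(404, { "Content-Type": "text/plain" });
  res.end("Not found");
}).listen(8080);
```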
When selecting a host, look for providers that integrate domain management with hosting. Bluehost offers a streamlined experience, allowing you to manage both domains and hosting from a single dashboard.
SEO-related site administration becomes more manageable, and you can focus on the things you do best: growth and optimization.
Find An SEO-Friendly WordPress Host
Choosing an SEO-friendly WordPress host can have a significant impact on your website’s search engine performance, user experience, and long-term growth.
By focusing on uptime, global data distribution, robust security, optimized database management, scalable bandwidth, WordPress-specific support, and integrated domain management, you create a solid foundation that supports both SEO and usability.
Ready to make the switch?
As a trusted WordPress partner with over 20 years of experience, Bluehost offers a hosting solution designed to meet the unique demands of WordPress sites big and small.
Our dedicated migration support team handles every detail of your transfer, ensuring your site’s content, settings, and configurations are moved accurately and securely.
Plus, we offer eligible customers a credit toward their remaining contracts, making the transition to Bluehost not only seamless but also cost-effective.
Learn how Bluehost Cloud can elevate your WordPress site. Visit us today to get started.
HTTP Archive published 12 chapters of its annual Web Almanac, revealing disparities between mobile and desktop web performance.
The Almanac analyzes data from millions of sites to track trends in web technologies, performance metrics, and user experience.
This year’s Almanac details changes in technology adoption patterns that will impact businesses and users.
Key Highlights
Mobile Performance Gap
The most significant finding centers on the growing performance gap between desktop and mobile experiences.
With the introduction of Google’s newest Core Web Vitals metric, Interaction to Next Paint (INP), the gap has become wider than ever.
“Web performance is tied to what devices and networks people can afford,” the report notes, highlighting the socioeconomic implications of this growing divide.
The data shows that while desktop performance remains strong, mobile users—particularly those with lower-end devices—face challenges:
Desktop sites achieve 97% “good” INP scores
Mobile sites lag at 74% “good” INP scores
Mobile median Total Blocking Time is 18 times higher than desktop
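Teams that want to see where their own users fall against these thresholds can collect INP in the field with Google’s open-source web-vitals library. The snippet below is a minimal sketch; the /analytics endpoint is a placeholder for whatever collection URL you use.

```typescript
// Minimal field-data collection for INP using the web-vitals library.
// The /analytics endpoint is a placeholder; swap in your own collector.
import { onINP, type Metric } from "web-vitals";

function sendToAnalytics(metric: Metric): void {
  const body = JSON.stringify({
    name: metric.name,     // "INP"
    value: metric.value,   // interaction latency in milliseconds
    rating: metric.rating, // "good" | "needs-improvement" | "poor"
  });
  // sendBeacon survives page unload better than fetch for last-moment reports.
  navigator.sendBeacon("/analytics", body);
}

// Reports the page's INP as the user interacts and when the page is hidden.
onINP(sendToAnalytics);
```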
Third-Party Tracking
The report found that tracking remains pervasive across the web.
“We find that 61% of cookies are set in a third-party context,” the report states, noting that these cookies can be used for cross-site tracking and targeted advertising.
Key privacy findings include:
Google’s DoubleClick sets cookies on 44% of top websites
Only 6% of third-party cookies use partitioning for privacy protection
11% of first-party cookies have SameSite set to None, potentially enabling tracking
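To make the partitioning figure more concrete, the sketch below shows roughly what setting a partitioned (CHIPS) third-party cookie looks like from a Node server. The cookie name and server setup are illustrative and not drawn from the report.

```typescript
// Illustrative server setting a cross-site cookie with partitioning (CHIPS).
// "Partitioned" keys the cookie to the top-level site, preventing its use
// for cross-site tracking; Secure and SameSite=None are required alongside it.
import { createServer } from "node:http";

createServer((req, res) => {
  res.setHeader(
    "Set-Cookie",
    "embed_session=abc123; Secure; HttpOnly; SameSite=None; Partitioned; Path=/"
  );
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("cookie set");
}).listen(8080);
```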
CMS Market Share
In the content management space, WordPress continues its dominance, with the report stating:
“Of the over 16 million mobile sites in this year’s crawl, WordPress is used by 5.7 million sites for a total of 36% of sites.”
However, among the top 1,000 most-visited websites, only 8% use identifiable CMS platforms, suggesting larger organizations opt for custom solutions.
In the ecommerce sector, WooCommerce leads with 38% market share, followed by Shopify at 18%.
The report found that “OpenCart is the last of the 362 detected shop systems that manage to secure a share above 1% of the market.”
PayPal remains the most detected payment method (3.5% of sites), followed by Apple Pay and Shop Pay.
Performance By Platform
Some platforms markedly improved Core Web Vitals scores over the past year.
Squarespace’s share of good scores rose from 33% in 2022 to 60% in 2024, while platforms like Magento and WooCommerce continue to face performance challenges.
Structured Data Trends
The deprecation of FAQ and HowTo rich results by Google hasn’t significantly impacted their implementation.
This suggests website owners find value in these features beyond search.
Google also expanded structured data support to additional verticals, including vehicles, courses, and vacation rentals.
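As an illustration of one of these types, the sketch below assembles a minimal schema.org Course item as JSON-LD and injects it into the page; the course name, description, and provider are placeholder values.

```typescript
// Minimal sketch: build and inject schema.org Course structured data (JSON-LD).
// The course name, description, and provider are placeholder values.
const courseJsonLd = {
  "@context": "https://schema.org",
  "@type": "Course",
  name: "Intro to Web Performance",                      // placeholder
  description: "A beginner course on Core Web Vitals.",  // placeholder
  provider: {
    "@type": "Organization",
    name: "Example Academy",                             // placeholder
    url: "https://www.example.com",
  },
};

// Inject the JSON-LD as a script tag so crawlers can read it.
const script = document.createElement("script");
script.type = "application/ld+json";
script.textContent = JSON.stringify(courseJsonLd);
document.head.appendChild(script);
```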
Why This Matters
These findings highlight that mobile optimization remains a challenge for developers and businesses.
HTTP Archive researchers noted in the report:
“These results highlight the ongoing need for focused optimization efforts, particularly in mobile experience.
The performance gap between devices suggests that many users, especially those on lower-end mobile devices, may be experiencing a significantly degraded web experience.”
Additionally, as privacy concerns grow, the industry faces pressure to balance user tracking with privacy protection.
Businesses reliant on third-party tracking mechanisms may need to adapt their marketing and analytics strategies accordingly.
The 2024 Web Almanac is available on HTTP Archive’s website; the remaining chapters are expected to be published in the coming weeks.