The Expert SEO Guide To URL Parameter Handling via @sejournal, @jes_scholz

In the world of SEO, URL parameters pose a significant problem.

While developers and data analysts may appreciate their utility, these query strings are an SEO headache.

Countless parameter combinations can split a single user intent across thousands of URL variations. This can cause complications for crawling, indexing, visibility and, ultimately, lead to lower traffic.

The issue is we can’t simply wish them away, which means it’s crucial to master how to manage URL parameters in an SEO-friendly way.

To do so, we will explore what URL parameters are, the SEO issues they cause, how to assess the extent of your parameter problem, and the solutions available to handle them.

What Are URL Parameters?

URL parameter elements (Image created by author)

URL parameters, also known as query strings or URI variables, are the portion of a URL that follows the ‘?’ symbol. They are comprised of a key and a value pair, separated by an ‘=’ sign. Multiple parameters can be added to a single page when separated by an ‘&’.

The most common use cases for parameters are:

  • Tracking – For example ?utm_medium=social, ?sessionid=123 or ?affiliateid=abc
  • Reordering – For example ?sort=lowest-price, ?order=highest-rated or ?so=latest
  • Filtering – For example ?type=widget, ?colour=purple or ?price-range=20-50
  • Identifying – For example ?product=small-purple-widget, ?categoryid=124 or ?itemid=24AU
  • Paginating – For example, ?page=2, ?p=2 or ?viewItems=10-30
  • Searching – For example, ?query=users-query, ?q=users-query or ?search=drop-down-option
  • Translating – For example, ?lang=fr or ?language=de

SEO Issues With URL Parameters

1. Parameters Create Duplicate Content

Often, URL parameters make no significant change to the content of a page.

A re-ordered version of the page is often not so different from the original. A page URL with tracking tags or a session ID is identical to the original.

For example, the following URLs would all return a collection of widgets.

  • Static URL: https://www.example.com/widgets
  • Tracking parameter: https://www.example.com/widgets?sessionID=32764
  • Reordering parameter: https://www.example.com/widgets?sort=latest
  • Identifying parameter: https://www.example.com?category=widgets
  • Searching parameter: https://www.example.com/products?search=widget

That’s quite a few URLs for what is effectively the same content – now imagine this over every category on your site. It can really add up.

The challenge is that search engines treat every parameter-based URL as a new page. So, they see multiple variations of the same page, all serving duplicate content and all targeting the same search intent or semantic topic.

While such duplication is unlikely to cause a website to be completely filtered out of the search results, it does lead to keyword cannibalization and could downgrade Google’s view of your overall site quality, as these additional URLs add no real value.

2. Parameters Reduce Crawl Efficacy

Crawling redundant parameter pages distracts Googlebot, reducing your site’s ability to index SEO-relevant pages and increasing server load.

Google sums up this point perfectly.

“Overly complex URLs, especially those containing multiple parameters, can cause problems for crawlers by creating unnecessarily high numbers of URLs that point to identical or similar content on your site.

As a result, Googlebot may consume much more bandwidth than necessary, or may be unable to completely index all the content on your site.”

3. Parameters Split Page Ranking Signals

If you have multiple permutations of the same page content, links and social shares may be coming in on various versions.

This dilutes your ranking signals. When you confuse a crawler, it becomes unsure which of the competing pages to index for the search query.

4. Parameters Make URLs Less Clickable

Parameter-based URL clickability (Image created by author)

Let’s face it: parameter URLs are unsightly. They’re hard to read. They don’t seem as trustworthy. As such, they are slightly less likely to be clicked.

This may impact page performance. Not only because CTR influences rankings, but also because the URL is less clickable in AI chatbots, on social media, in emails, when copy-pasted into forums, or anywhere else the full URL may be displayed.

While this may only have a fractional impact on a single page’s amplification, every tweet, like, share, email, link, and mention matters for the domain.

Poor URL readability could contribute to a decrease in brand engagement.

Assess The Extent Of Your Parameter Problem

It’s important to know every parameter used on your website. But chances are your developers don’t keep an up-to-date list.

So how do you find all the parameters that need handling? Or understand how search engines crawl and index such pages? Or know the value they bring to users?

Follow these five steps:

  • Run a crawler: With a tool like Screaming Frog, you can search for “?” in the URL.
  • Review your log files: See if Googlebot is crawling parameter-based URLs.
  • Look in the Google Search Console page indexing report: In the samples of indexed pages and relevant non-indexed exclusions, search for ‘?’ in the URL.
  • Search with site: and inurl: advanced operators: Know how Google is indexing the parameters you found by putting the key in a site:example.com inurl:key combination query.
  • Look in the Google Analytics all pages report: Search for “?” to see how each of the parameters you found is used by users. Be sure to check that URL query parameters have not been excluded in the view settings.

Armed with this data, you can now decide how to best handle each of your website’s parameters.

SEO Solutions To Tame URL Parameters

You have six tools in your SEO arsenal to deal with URL parameters on a strategic level.

Limit Parameter-based URLs

A simple review of how and why parameters are generated can provide an SEO quick win.

You will often find ways to reduce the number of parameter URLs and thus minimize the negative SEO impact. There are four common issues to begin your review.

1. Eliminate Unnecessary Parameters

Remove unnecessary parameters (Image created by author)

Ask your developer for a list of every parameter used on your website and its function. Chances are, you will discover parameters that no longer perform a valuable function.

For example, users can be better identified by cookies than sessionIDs. Yet the sessionID parameter may still exist on your website as it was used historically.

Or you may discover that a filter in your faceted navigation is rarely applied by your users.

Any parameters caused by technical debt should be eliminated immediately.

2. Prevent Empty Values

No empty parameter values (Image created by author)

URL parameters should be added to a URL only when they have a function. Don’t permit parameter keys to be added if the value is blank.

In the above example, key2 and key3 add no value, both literally and figuratively.

3. Use Keys Only Once

Single key usage (Image created by author)

Avoid applying multiple parameters with the same parameter name and a different value.

For multi-select options, it is better to combine the values after a single key.

4. Order URL Parameters

Order URL parameters (Image created by author)

If the same URL parameters are rearranged, search engines interpret the pages as equal.

As such, parameter order doesn’t matter from a duplicate content perspective. But each of those combinations burns crawl budget and splits ranking signals.

Avoid these issues by asking your developer to write a script to always place parameters in a consistent order, regardless of how the user selected them.

In my opinion, you should start with any translating parameters, followed by identifying, then pagination, then layering on filtering and reordering or search parameters, and finally tracking.
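As a rough sketch of what such a script could look like, here is a minimal Python example. The priority list and the canonicalize_query helper are hypothetical illustrations, not part of any particular platform; adapt the parameter names to your own site.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical priority order: translating, identifying, paginating,
# then filtering/reordering/searching, and tracking parameters last.
PARAM_ORDER = ["lang", "category", "product", "page", "type", "colour", "sort", "query", "utm_medium"]
RANK = {key: i for i, key in enumerate(PARAM_ORDER)}

def canonicalize_query(url: str) -> str:
    # Rebuild the query string with parameters in a consistent order.
    scheme, netloc, path, query, fragment = urlsplit(url)
    pairs = parse_qsl(query)  # drops empty values by default
    pairs.sort(key=lambda kv: (RANK.get(kv[0], len(PARAM_ORDER)), kv[0]))
    return urlunsplit((scheme, netloc, path, urlencode(pairs), fragment))

# Both variants resolve to the same, consistently ordered URL.
print(canonicalize_query("https://www.example.com/widgets?sort=latest&colour=purple&page=2"))
print(canonicalize_query("https://www.example.com/widgets?page=2&colour=purple&sort=latest"))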

Pros:

  • Ensures more efficient crawling.
  • Reduces duplicate content issues.
  • Consolidates ranking signals to fewer pages.
  • Suitable for all parameter types.

Cons:

  • Moderate technical implementation time.

Rel=”Canonical” Link Attribute

Rel=canonical for parameter handling (Image created by author)

The rel=”canonical” link attribute calls out that a page has identical or similar content to another. This encourages search engines to consolidate the ranking signals to the URL specified as canonical.

You can rel=canonical your parameter-based URLs to your SEO-friendly URL for tracking, identifying, or reordering parameters.

But this tactic is not suitable when the parameter page content is not close enough to the canonical, such as pagination, searching, translating, or some filtering parameters.
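For illustration, the reordered page from the earlier example could reference the clean category URL in its <head> like this (a minimal sketch using the hypothetical example.com URLs from above):

<!-- In the <head> of https://www.example.com/widgets?sort=latest -->
<link rel="canonical" href="https://www.example.com/widgets" />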

Pros:

  • Relatively easy technical implementation.
  • Very likely to safeguard against duplicate content issues.
  • Consolidates ranking signals to the canonical URL.

Cons:

  • Wastes crawling on parameter pages.
  • Not suitable for all parameter types.
  • Interpreted by search engines as a strong hint, not a directive.

Meta Robots Noindex Tag

Meta robots noindex tag for parameter handling (Image created by author)

Set a noindex directive for any parameter-based page that doesn’t add SEO value. This tag will prevent search engines from indexing the page.

URLs with a “noindex” tag are also likely to be crawled less frequently, and if the tag is present for a long time, it will eventually lead Google to nofollow the page’s links.
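For example, a session ID page you want kept out of the index could carry the following tag in its <head> (a minimal sketch, again using the hypothetical example.com URLs):

<!-- On https://www.example.com/widgets?sessionID=32764 -->
<meta name="robots" content="noindex" />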

Pros:

  • Relatively easy technical implementation.
  • Very likely to safeguard against duplicate content issues.
  • Suitable for all parameter types you do not wish to be indexed.
  • Removes existing parameter-based URLs from the index.

Cons:

  • Won’t prevent search engines from crawling URLs, but will encourage them to do so less frequently.
  • Doesn’t consolidate ranking signals.
  • Interpreted by search engines as a strong hint, not a directive.

Robots.txt Disallow

Robots.txt disallow for parameter handling (Image created by author)

The robots.txt file is what search engines look at first before crawling your site. If they see something is disallowed, they won’t even go there.

You can use this file to block crawler access to every parameter-based URL (with Disallow: /*?*) or only to specific query strings you don’t want to be indexed.
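As a sketch, a robots.txt that blocks every query string sitewide, or only selected tracking and session parameters, could look like this (the parameter names are placeholders; adapt them to your own site):

User-agent: *
# Block every parameter-based URL:
Disallow: /*?*

# Or block only specific query strings:
# Disallow: /*?*sessionID=
# Disallow: /*?*utm_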

Pros:

  • Simple technical implementation.
  • Allows more efficient crawling.
  • Avoids duplicate content issues.
  • Suitable for all parameter types you do not wish to be crawled.

Cons:

  • Doesn’t consolidate ranking signals.
  • Doesn’t remove existing URLs from the index.

Move From Dynamic To Static URLs

Many people think the optimal way to handle URL parameters is to simply avoid them in the first place.

After all, subfolders surpass parameters in helping Google understand site structure, and static, keyword-based URLs have always been a cornerstone of on-page SEO.

To achieve this, you can use server-side URL rewrites to convert parameters into subfolder URLs.

For example, the URL:

www.example.com/view-product?id=482794

Would become:

www.example.com/widgets/purple

This approach works well for descriptive keyword-based parameters, such as those that identify categories, products, or filters for search engine-relevant attributes. It is also effective for translated content.
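As a sketch of how such a rewrite might be configured with Apache mod_rewrite, using the hypothetical URLs above (your actual rules will depend on your server stack and URL scheme):

# .htaccess: serve the static path from the underlying parameter URL
RewriteEngine On
RewriteRule ^widgets/purple/?$ /view-product?id=482794 [L]

You would also typically 301 redirect the old parameter URL to the new static path so that existing signals are consolidated.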

But it becomes problematic for non-keyword-relevant elements of faceted navigation, such as an exact price. Having such a filter as a static, indexable URL offers no SEO value.

It’s also an issue for searching parameters, as every user-generated query would create a static page that vies for ranking against the canonical – or, worse, presents crawlers with low-quality content pages whenever a user searches for an item you don’t offer.

It’s somewhat odd when applied to pagination (although not uncommon due to WordPress), which would give a URL such as

www.example.com/widgets/purple/page2

Very odd for reordering, which would give a URL such as

www.example.com/widgets/purple/lowest-price

And is often not a viable option for tracking. Google Analytics will not acknowledge a static version of the UTM parameter.

More to the point: Replacing dynamic parameters with static URLs for things like pagination, on-site search box results, or sorting does not address duplicate content, crawl budget, or internal link equity dilution.

Having all the combinations of filters from your faceted navigation as indexable URLs often results in thin content issues. Especially if you offer multi-select filters.

Many SEO pros argue it’s possible to provide the same user experience without impacting the URL. For example, by using POST rather than GET requests to modify the page content. Thus, preserving the user experience and avoiding SEO problems.

But stripping out parameters in this manner would remove the possibility for your audience to bookmark or share a link to that specific page – and is obviously not feasible for tracking parameters and not optimal for pagination.

The crux of the matter is that for many websites, completely avoiding parameters is simply not possible if you want to provide the ideal user experience. Nor would it be best practice SEO.

So we are left with this. For parameters that you don’t want to be indexed in search results (paginating, reordering, tracking, etc) implement them as query strings. For parameters that you do want to be indexed, use static URL paths.

Pros:

  • Shifts crawler focus from parameter-based to static URLs, which have a higher likelihood of ranking.

Cons:

  • Significant investment of development time for URL rewrites and 301 redirects.
  • Doesn’t prevent duplicate content issues.
  • Doesn’t consolidate ranking signals.
  • Not suitable for all parameter types.
  • May lead to thin content issues.
  • Doesn’t always provide a linkable or bookmarkable URL.

Best Practices For URL Parameter Handling For SEO

So which of these six SEO tactics should you implement?

The answer can’t be all of them.

Not only would that create unnecessary complexity, but often, the SEO solutions actively conflict with one another.

For example, if you implement robots.txt disallow, Google would not be able to see any meta noindex tags. You also shouldn’t combine a meta noindex tag with a rel=canonical link attribute.

Google’s John Mueller, Gary Illyes, and Lizzi Sassman couldn’t even decide on an approach. In a Search Off The Record episode, they discussed the challenges that parameters present for crawling.

They even suggest bringing back a parameter handling tool in Google Search Console. Google, if you are reading this, please do bring it back!

What becomes clear is there isn’t one perfect solution. There are occasions when crawling efficiency is more important than consolidating authority signals.

Ultimately, what’s right for your website will depend on your priorities.

URL parameter handling options: pros and cons (Image created by author)

Personally, I take the following plan of attack for SEO-friendly parameter handling:

  • Research user intents to understand what parameters should be search engine friendly, static URLs.
  • Implement effective pagination handling using a ?page= parameter.
  • For all remaining parameter-based URLs, block crawling with a robots.txt disallow and add a noindex tag as backup.
  • Double-check that no parameter-based URLs are being submitted in the XML sitemap.

No matter what parameter handling strategy you choose to implement, be sure to document the impact of your efforts on KPIs.

7 common technical ecommerce SEO mistakes to prevent

Ecommerce mistakes are as common as shopping carts filled to the brim during a Black Friday sale. Why is that, you ask? Well, the ecommerce world is complex. It’s easy for online retailers to trip over unexpected — but also quite common — hurdles. But fear not! Identifying and correcting these technical SEO mistakes helps turn your online store into a success story.


1. Poorly structured online stores

A well-organized site structure is fundamental for both user experience and SEO. It helps search engines crawl and index your site efficiently while guiding users seamlessly through your products and categories. Yet, many ecommerce retailers make the mistake of using very complex site structures.

Common issues:

  • Inconsistent URL structures: URLs that lack a clear hierarchy or use random character strings can confuse users and search engines.
  • Lack of breadcrumbs: Without breadcrumbs, users might struggle to navigate back to previous categories or search results, leading to a frustrating user experience.
  • Ineffective internal linking: Poor internal linking can prevent search engines from understanding the relationship between pages and dilute the distribution of page authority.

Solutions:

  • Organize URLs: Develop a clear and logical URL structure that reflects your site hierarchy. For example, use /category/subcategory/product-name instead of non-descriptive strings. This structure helps search engines understand the importance and context of each page.
  • Implement breadcrumbs: Breadcrumb navigation provides a secondary navigation aid, showing users their current location within the site hierarchy. Ensure breadcrumbs are visible on all pages and structured consistently.
  • Optimize internal linking: Create a strategic internal linking plan to connect related products and categories. Use descriptive anchor text to improve keyword relevance. Check for 404 errors and fix them where necessary.

More tips:

  • Faceted navigation: If your site uses faceted navigation for filtering products, ensure it’s implemented to avoid creating duplicate content or excessive crawl paths. Use canonical tags or noindex directives where necessary.
  • Site architecture depth: Keep your site architecture shallow, ideally allowing users and search engines to reach any page within as few clicks as possible from the homepage. This enhances crawlability and improves user experience.
  • XML sitemaps: Regularly update your XML sitemap to reflect your site’s current structure and submit it to search engines. This ensures all important pages are indexed efficiently.

2. Ignoring mobile optimization

It seems strange to say this in 2024, but mobile is where it’s at. Today, if your online store isn’t optimized for mobile, you’re missing out on a big chunk of potential customers. Mobile shopping is not just a trend; it’s the norm. Remember, your customers are swiping, tapping, and buying on their phones — don’t make the mistake of focusing your ecommerce business on desktop users only.

Common issues:

  • Slow mobile load times: Mobile users demand quick access to content. Slow-loading pages can drive potential customers away and negatively impact your performance.
  • Poor user interface and user experience: A website design that doesn’t adapt well to mobile screens can make navigation difficult and frustrate users.
  • Scaling issues: Websites that aren’t responsive can appear cluttered or require excessive zooming and scrolling on mobile devices.

Solutions:

  • Responsive design: Ensure your website uses a responsive design that automatically adjusts the layout, images, and text to fit any screen size seamlessly.
  • Optimize mobile performance: Improve load times by compressing images, minifying CSS and JavaScript files, and using asynchronous loading for scripts. Use tools like Google PageSpeed Insights to identify speed bottlenecks and fix mobile SEO.
  • Improve navigation: Make your mobile navigation intuitive and easy to use. Consider implementing a fixed navigation bar for easy access and use clear, concise labels for menu items.
  • Streamline the checkout process: Simplify the mobile checkout process by reducing the required fields, enabling auto-fill for forms, and providing multiple payment options such as digital wallets. Ensure that call-to-action buttons are prominently displayed and easy to tap.
  • Test across devices: Regularly test your site on various devices and screen sizes to ensure a consistent and smooth user experience.

More tips:

  • Leverage mobile-specific features: Use device-specific capabilities like geolocation to enhance user experience. For example, offer location-based services or promotions.
  • Optimize images for mobile: Use responsive images that adjust in size depending on the device. Implement lazy loading so images only load when they appear in the viewport, improving initial load times.

3. Neglecting Schema structured data markup

Schema structured data markup is code that helps search engines understand the context of your content. Among other things, search engines use this to help display rich snippets in search results. Despite its benefits, many ecommerce sites underestimate its potential, missing out on enhanced visibility and click-through rates.

Common issues:

  • Lack of implementation: Many ecommerce sites make the mistake of failing to implement schema markup, leaving potential enhancements in search results untapped.
  • Incorrect or incomplete markup: When schema markup is applied incorrectly or not fully utilized, it can lead to missed opportunities for rich snippets.
  • Overlooking updates: Schema standards and best practices evolve, and failing to update markup can result in outdated, incomplete, or ineffective data.

Solutions:

  • Implement product schema: Use product schema markup to display key details such as price and availability directly in search results. This can make your listings more attractive and informative to potential customers.
  • Utilize review and rating schema: Highlight customer reviews and ratings with schema markup to increase trust and engagement. Rich snippets featuring star ratings can significantly improve click-through rates.
  • Add breadcrumb schema: Implement breadcrumb schema to enhance navigation paths in search results. This makes it easier for search engines to understand your site’s structure.
  • Use structured data testing tools: Use Google’s Rich Results Test and Schema Markup Validator to ensure your markup is correctly implemented and eligible for rich results. Address any errors or warnings promptly.

More tips:

  • Expand to additional schema types: Beyond product and review schema, consider using additional types relevant to your site, such as FAQ schema for common questions or product variant schema.
  • JSON-LD format: Implement schema markup using JSON-LD format, which is Google’s recommended method. It’s easier to read and maintain, especially for complex data sets.
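As an illustration, a minimal Product snippet in JSON-LD might look like this (all values are placeholders; validate your own markup with the Rich Results Test):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Small Purple Widget",
  "image": "https://www.example.com/images/small-purple-widget.jpg",
  "offers": {
    "@type": "Offer",
    "price": "19.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
</script>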

4. Inadequate page speed optimization

Page speed is critical to user experience and SEO. Google’s emphasis on Core Web Vitals and page experience underscores the importance of fast, smooth, and stable web experiences. Slow pages frustrate users and negatively impact SEO, which can reduce visibility and traffic.

Common issues:

  • Large, unoptimized images: High-resolution images that aren’t compressed can significantly increase load times. This is one of the biggest and most common ecommerce mistakes.
  • Render-blocking resources: CSS and JavaScript files that prevent a page from rendering quickly can delay visible content.
  • Poor server response times: Slow server responses can delay the initial loading of a page, affecting LCP.

Solutions:

  • Optimize images: Compress images using formats like WebP and utilize responsive image techniques to serve appropriately sized images for different devices.
  • Eliminate render-blocking resources:
    • Defer non-critical CSS: Load essential CSS inline for above-the-fold content and defer the rest.
    • Async JavaScript: Use the async attribute for scripts that can load asynchronously or defer for scripts that are not crucial for initial rendering.
  • Improve server response times:
    • Use a Content Delivery Network (CDN) to cache content closer to users.
    • Optimize server performance by upgrading hosting plans, using faster DNS providers, and implementing HTTP/2 — or the upcoming HTTP/3 standard — for faster data transfer.
  • Enhance Core Web Vitals:
    • For LCP: Optimize server performance, prioritize loading of critical resources, and consider preloading key assets.
    • For INP: Focus on minimizing CPU processing by running code asynchronously, breaking up long tasks, and ensuring the main thread is not blocked by heavy JavaScript execution.
    • For CLS: Set explicit dimensions for images and ads, avoid inserting content above existing content, and ensure fonts load without causing layout shifts.
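For example, reserving space for images with explicit width and height attributes is a simple way to prevent layout shifts (a sketch; the file name and dimensions are placeholders):

<img src="/images/hero-banner.webp" width="1200" height="600" alt="Seasonal sale banner" />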

More tips:

  • Lazy loading: This method delays the loading of off-screen images and videos until they are needed, reducing initial load times.
  • Minimise HTTP requests: Reduce the number of requests by combining CSS and JavaScript files, using CSS sprites, and eliminating unnecessary plugins or widgets.
  • Use browser caching: Implement caching policies to store static resources on users’ browsers, reducing load times for returning visitors.
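As an example, a long-lived caching policy for static assets can be expressed with a response header like this (a sketch; the exact directives depend on your server or CDN setup):

Cache-Control: public, max-age=31536000, immutable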

5. Improper canonicalization

Canonicalization helps search engines understand the preferred version of a webpage when multiple versions exist. This is important for ecommerce sites, where similar or duplicate content is common due to variations in products, categories, and pagination. Improper canonicalization can lead to duplicate content issues.

Common issues:

  • Duplicate content: Multiple URLs displaying the same or similar content can confuse search engines, leading to reduced performance.
  • Product variations: Different URLs for product variations (e.g., size, color) can create duplicate content without proper canonical tags.
  • Pagination: Paginated pages without canonical tags can lead to indexing issues.

Solutions:

  • Use canonical tags: Implement canonical tags to specify the preferred version of a page. This tells search engines which URL to index and attribute full authority to, helping to consolidate ranking signals.
  • Handle product variations: For product pages with variations, use canonical tags to point to the main product page.
  • Optimize pagination: Link pages sequentially with clear, unique URLs (e.g., using “?page=n”) and avoid using URL fragment identifiers for pagination. Make sure that paginated URLs are not indexed if they include filters or alternative sort orders by using the “noindex” robots meta tag or robots.txt file.

More tips:

  • Consistent URL structures: To minimize the potential for duplicate content, maintain a consistent URL structure across your site. Use descriptive, keyword-rich URLs that clearly convey the page’s content.
  • Avoid session IDs in URLs: If your ecommerce platform uses session IDs, configure your system to avoid appending them to URLs, as this can create multiple versions of the same page.
  • Monitor changes: Track any changes to your site architecture or content management system that could affect canonicalization.

6. Not using XML sitemaps and robots.txt effectively

XML sitemaps and robots.txt files help guide search engines through your ecommerce site. These tools ensure that search engines can efficiently crawl and index your pages. Missing or misconfigured files can lead to indexing issues, negatively impacting your site’s SEO performance.

Common issues:

  • Incomplete or outdated XML sitemaps: Many sites fail to update their sitemaps, which excludes important pages from search engine indexing.
  • Misconfigured robots.txt files: Incorrect directives in robots.txt files can inadvertently block search engines from accessing critical pages.
  • Exclusion of important pages: Errors in these files can result in important pages, like product or category pages, being overlooked by search engines.

Solutions:

  • Create and update XML sitemaps: Generate an XML sitemap that includes all relevant pages, such as product, category, and blog pages. Regularly update the sitemap to reflect new content or structure changes and submit it to search engines via tools like Google Search Console.
  • Optimize robots.txt files: Ensure your robots.txt file includes directives to block irrelevant or duplicate pages (e.g., admin pages, filtering parameters) while allowing access to important sections. Use the file to guide crawlers efficiently through your site without wasting crawl budget.
  • Use sitemap index files: For large ecommerce sites, consider using a sitemap index file to organize multiple sitemap files. This approach helps manage extensive product catalogs and ensures comprehensive coverage.
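A sitemap index file simply lists the individual sitemaps, for example (the file names are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-products-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-categories.xml</loc>
  </sitemap>
</sitemapindex>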

More tips:

  • Dynamic sitemaps for ecommerce: Implement dynamic sitemaps that automatically update to reflect inventory changes, ensuring that new products are indexed quickly.
  • Leverage hreflang in sitemaps: If your site targets multiple languages or regions, include hreflang annotations within your XML sitemaps to help search engines identify the correct version for each user.
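Hreflang annotations in a sitemap use the xhtml:link element inside each url entry, with xmlns:xhtml="http://www.w3.org/1999/xhtml" declared on the urlset element. A sketch with placeholder URLs and language codes:

<url>
  <loc>https://www.example.com/us/widgets</loc>
  <xhtml:link rel="alternate" hreflang="en-us" href="https://www.example.com/us/widgets" />
  <xhtml:link rel="alternate" hreflang="fr-fr" href="https://www.example.com/fr/widgets" />
</url>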

7. Failing to optimize for international SEO

For ecommerce retailers targeting global markets, international SEO is crucial for reaching audiences in different countries and languages. A good international SEO strategy ensures your site is accessible and relevant to users worldwide. International ecommerce SEO is all about maximizing your global reach and sales potential — it would be a mistake to forget about this.

Common issues:

  • Incorrect hreflang implementation: Misconfigured hreflang tags can display incorrect language versions to users, confusing search engines and visitors.
  • Uniform content across regions: Using the same content for regions without localization can make your site less appealing to local audiences.
  • Ignoring local search engines: Focusing solely on Google can mean missing out on significant portions of the market in countries where other search engines dominate.

Solutions:

  • Implement hreflang tags correctly: Use hreflang tags to indicate language and regional targeting for your pages. This helps search engines serve users the correct language version. Ensure each hreflang tag references a self-referential URL and includes all language versions.
  • Create region-specific content: Localize your content to reflect cultural nuances, currency, units of measure, and local terminology. This enhances user experience and increases relevance and engagement.
  • Optimize for local search engines: In markets where other search engines like Baidu (China) or Yandex (Russia) are prevalent, optimize your site according to their specific algorithms and requirements.
  • Use geotargeted URLs: Implement subdirectories (e.g., example.com/us, example.com/fr) or country-code top-level domains (ccTLDs) such as example.fr to effectively target specific countries.

More tips:

  • Local backlinks: Build backlinks from local websites to improve visibility in specific regions. Set up local PR campaigns and partnerships to build credibility and authority.
  • CDNs and server locations: Host your site on servers within your target market to improve load times and performance for users in that region.
  • Structured data for local business: Use structured data markup to highlight local business information, such as location and contact details.

8. Not using Yoast SEO for Shopify

This is a bit of a sneaky addition, but number eight in this list of common ecommerce mistakes to avoid is not using Yoast SEO for Shopify or our WooCommerce SEO add-on. Both come with solutions for many of the common errors described in this article. Our tools make it easier to manage ecommerce SEO and Shopify SEO so you can focus on improving many other aspects of your online store. We provide solid technical SEO groundwork and tools to improve your product content. We even have AI features that make these tasks easier than ever.

Be sure to check out Yoast SEO for Shopify and WooCommerce SEO!

Common ecommerce mistakes to avoid

Regular technical audits and improvements are crucial if you want your online store to keep performing well. Addressing these technical issues helps ecommerce retailers improve search visibility, user experience, and conversion rates. Just be sure not to make these common ecommerce SEO mistakes again!


9 Tips To Optimize Crawl Budget For SEO via @sejournal, @ab80

Crawl budget is a vital SEO concept for large websites with millions of pages or medium-sized websites with a few thousand pages that change daily.

An example of a website with millions of pages would be eBay.com, and websites with tens of thousands of pages that update frequently would be user reviews and rating websites similar to Gamespot.com.

There are so many tasks and issues an SEO expert has to consider that crawling is often put on the back burner.

But crawl budget can and should be optimized.

In this article, you will learn:

  • How to improve your crawl budget.
  • How crawl budget as a concept has changed in the last couple of years.

(Note: If you have a website with just a few hundred pages, and pages are not indexed, we recommend reading our article on common issues causing indexing problems, as it is certainly not because of crawl budget.)

What Is Crawl Budget?

Crawl budget refers to the number of pages that search engine crawlers (i.e., spiders and bots) visit within a certain timeframe.

There are certain considerations that go into crawl budget, such as a tentative balance between Googlebot’s attempts to not overload your server and Google’s overall desire to crawl your domain.

Crawl budget optimization is a series of steps you can take to increase efficiency and the rate at which search engines’ bots visit your pages.

Why Is Crawl Budget Optimization Important?

Crawling is the first step to appearing in search. Without being crawled, new pages and page updates won’t be added to search engine indexes.

The more often that crawlers visit your pages, the quicker updates and new pages appear in the index. Consequently, your optimization efforts will take less time to take hold and start affecting your rankings.

Google’s index contains hundreds of billions of pages and is growing each day. It costs search engines to crawl each URL, and with the growing number of websites, they want to reduce computational and storage costs by reducing the crawl rate and indexation of URLs.

There is also a growing urgency to reduce carbon emissions for climate change, and Google has a long-term strategy to improve sustainability and reduce carbon emissions.

These priorities could make it difficult for websites to be crawled effectively in the future. While crawl budget isn’t something you need to worry about with small websites with a few hundred pages, resource management becomes an important issue for massive websites. Optimizing crawl budget means having Google crawl your website by spending as few resources as possible.

So, let’s discuss how you can optimize your crawl budget in today’s world.

1. Disallow Crawling Of Action URLs In Robots.Txt

You may be surprised, but Google has confirmed that disallowing URLs will not affect your crawl budget. This means Google will still crawl your website at the same rate. So why do we discuss it here?

Well, if you disallow URLs that are not important, you basically tell Google to crawl useful parts of your website at a higher rate.

For example, if your website has an internal search feature with query parameters like /?q=google, Google will crawl these URLs if they are linked from somewhere.

Similarly, in an e-commerce site, you might have facet filters generating URLs like /?color=red&size=s.

These query string parameters can create an infinite number of unique URL combinations that Google may try to crawl.

Those URLs basically don’t have unique content and just filter the data you have, which is great for user experience but not for Googlebot.

Allowing Google to crawl these URLs wastes crawl budget and affects your website’s overall crawlability. By blocking them via robots.txt rules, Google will focus its crawl efforts on more useful pages on your site.

Here is how to block internal search, facets, or any URLs containing query strings via robots.txt:

Disallow: *?*s=*
Disallow: *?*color=*
Disallow: *?*size=*

Each rule disallows any URL containing the respective query parameter, regardless of other parameters that may be present.

  • * (asterisk): Matches any sequence of characters (including none).
  • ? (question mark): Indicates the beginning of a query string.
  • =*: Matches the = sign and any subsequent characters.

This approach helps avoid redundancy and ensures that URLs with these specific query parameters are blocked from being crawled by search engines.

Note, however, that this method ensures any URLs containing the indicated characters will be disallowed no matter where the characters appear, which can lead to unintended disallows. For example, a rule built around a single-character parameter will block any URL containing that character regardless of where it appears: if you disallow ‘s’, URLs containing ‘/?pages=2’ will be blocked because *?*s= also matches ‘?pages=’. If you want to disallow URLs with a specific single-character parameter, you can use a combination of rules:

Disallow: *?s=*
Disallow: *&s=*

The critical change is that there is no asterisk ‘*’ between the ‘?’ and ‘s’ characters. This method allows you to disallow URLs with the exact ‘s’ parameter, but you’ll need to add each variation individually.

Apply these rules to your specific use cases for any URLs that don’t provide unique content. For example, in case you have wishlist buttons with “?add_to_wishlist=1” URLs, you need to disallow them by the rule:

Disallow: /*?*add_to_wishlist=*

This is a no-brainer and a natural first and most important step recommended by Google.

An example below shows how blocking those parameters helped to reduce the crawling of pages with query strings. Google was trying to crawl tens of thousands of URLs with different parameter values that didn’t make sense, leading to non-existent pages.

Reduced crawl rate of URLs with parameters after blocking via robots.txt.

However, sometimes disallowed URLs might still be crawled and indexed by search engines. This may seem strange, but it isn’t generally cause for alarm. It usually means that other websites link to those URLs.

Indexing spiked because Google indexed internal search URLs after they were blocked via robots.txt.

Google confirmed that the crawling activity will drop over time in these cases.

Google’s comment on Reddit, July 2024

Another important benefit of blocking these URLs via robots.txt is saving your server resources. When a URL contains parameters that indicate the presence of dynamic content, requests will go to the server instead of the cache. This increases the load on your server with every page crawled.

Please remember not to use a noindex meta tag for blocking, since Googlebot has to perform a request to see the meta tag or HTTP response code, wasting crawl budget.

1.2. Disallow Unimportant Resource URLs In Robots.txt

Besides disallowing action URLs, you may want to disallow JavaScript files that are not part of the website layout or rendering.

For example, if you have JavaScript files responsible for opening images in a popup when users click, you can disallow them in robots.txt so Google doesn’t waste budget crawling them.

Here is an example of a disallow rule for a JavaScript file:

Disallow: /assets/js/popup.js

However, you should never disallow resources that are part of rendering. For example, if your content is dynamically loaded via JavaScript, Google needs to crawl the JS files to index the content they load.

Another example is REST API endpoints for form submissions. Say you have a form with action URL “/rest-api/form-submissions/”.

Potentially, Google may crawl them. Those URLs are in no way related to rendering, and it would be good practice to block them.

Disallow: /rest-api/form-submissions/

However, headless CMSs often use REST APIs to load content dynamically, so make sure you don’t block those endpoints.

In a nutshell, look at whatever isn’t related to rendering and block it.

2. Watch Out For Redirect Chains

Redirect chains occur when multiple URLs redirect to other URLs that also redirect. If this goes on for too long, crawlers may abandon the chain before reaching the final destination.

URL 1 redirects to URL 2, which redirects to URL 3, and so on. Chains can also take the form of infinite loops when URLs redirect to one another.

Avoiding these is a common-sense approach to website health.

Ideally, you would be able to avoid having even a single redirect chain on your entire domain.

But it may be an impossible task for a large website – 301 and 302 redirects are bound to appear, and you can’t fix redirects from inbound backlinks simply because you don’t have control over external websites.

One or two redirects here and there might not hurt much, but long chains and loops can become problematic.

To troubleshoot redirect chains, you can use one of the SEO tools like Screaming Frog, Lumar, or Oncrawl.

When you discover a chain, the best way to fix it is to remove all the URLs between the first page and the final page. If you have a chain that passes through seven pages, then redirect the first URL directly to the seventh.

Another great way to reduce redirect chains is to replace internal URLs that redirect with final destinations in your CMS.

Depending on your CMS, there may be different solutions in place; for example, you can use this plugin for WordPress. If you have a different CMS, you may need to use a custom solution or ask your dev team to do it.

3. Use Server Side Rendering (HTML) Whenever Possible

Now, if we’re talking about Google, its crawler uses the latest version of Chrome and is able to see content loaded by JavaScript just fine.

But let’s think critically. What does that mean? Googlebot crawls a page and resources such as JavaScript, then spends more computational resources to render them.

Remember, computational costs are important for Google, and it wants to reduce them as much as possible.

So why render content via JavaScript (client side) and add extra computational cost for Google to crawl your pages?

Because of that, whenever possible, you should stick to HTML.

That way, you’re not hurting your chances with any crawler.

4. Improve Page Speed

As we discussed above, Googlebot crawls and renders pages with JavaScript. The fewer resources it has to spend rendering your webpages, the easier it is for it to crawl them, and that largely depends on how well optimized your website speed is.

Google says:

Google’s crawling is limited by bandwidth, time, and availability of Googlebot instances. If your server responds to requests quicker, we might be able to crawl more pages on your site.

So using server-side rendering is already a great step towards improving page speed, but you need to make sure your Core Web Vital metrics are optimized, especially server response time.

5. Take Care of Your Internal Links

Google crawls URLs that are on the page, and always keep in mind that different URLs are counted by crawlers as separate pages.

If your website uses the ‘www’ version, make sure your internal URLs, especially in navigation, point to that canonical version, and vice versa if you use the non-‘www’ version.

Another common mistake is missing a trailing slash. If your URLs have a trailing slash at the end, make sure your internal URLs also have it.

Otherwise, unnecessary redirects, for example, from “https://www.example.com/sample-page” to “https://www.example.com/sample-page/”, will result in two crawls per URL.

Another important aspect is to avoid broken internal links and soft 404 pages, which can eat into your crawl budget.

And if that wasn’t bad enough, they also hurt your user experience!

In this case, again, I’m in favor of using a tool for website audit.

WebSite Auditor, Screaming Frog, Lumar or Oncrawl, and SE Ranking are examples of great tools for a website audit.

6. Update Your Sitemap

Once again, it’s a real win-win to take care of your XML sitemap.

The bots will have a much better and easier time understanding where the internal links lead.

Use only the URLs that are canonical for your sitemap.

Also, make sure that it corresponds to the newest uploaded version of robots.txt and loads fast.

7. Implement 304 Status Code

When crawling a URL, Googlebot sends a date via the “If-Modified-Since” header, which is additional information about the last time it crawled the given URL.

If your webpage hasn’t changed since then (specified in “If-Modified-Since“), you may return the “304 Not Modified” status code with no response body. This tells search engines that the webpage content didn’t change, and Googlebot can use the version it has on file from its last visit.
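In practice, the exchange looks roughly like this (a simplified sketch of the relevant request and response headers). The request Googlebot sends:

GET /widgets HTTP/1.1
Host: www.example.com
If-Modified-Since: Tue, 01 Jul 2025 10:00:00 GMT

And the response when nothing has changed:

HTTP/1.1 304 Not Modified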

A simple explanation of how the 304 Not Modified HTTP status code works.

Imagine how many server resources you can save while helping Googlebot save resources when you have millions of webpages. Quite big, isn’t it?

However, there is a caveat when implementing 304 status code, pointed out by Gary Illyes.

Gary Illyes on LinkedIn

So be cautious. Server errors serving empty pages with a 200 status can cause crawlers to stop recrawling, leading to long-lasting indexing issues.

8. Hreflang Tags Are Vital

In order to analyze your localized pages, crawlers employ hreflang tags. You should be telling Google about localized versions of your pages as clearly as possible.

First off, use the <link rel="alternate" hreflang="lang_code" href="url_of_page" /> element in your page’s header, where “lang_code” is a code for a supported language.

You should use this element for every localized version of a URL. That way, you can point to the localized versions of a page.
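For example, a page available in English, French, and German might declare (the URLs are placeholders):

<link rel="alternate" hreflang="en" href="https://www.example.com/en/page" />
<link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page" />
<link rel="alternate" hreflang="de" href="https://www.example.com/de/page" />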

Read: 6 Common Hreflang Tag Mistakes Sabotaging Your International SEO

9. Monitoring and Maintenance

Check your server logs and Google Search Console’s Crawl Stats report to monitor crawl anomalies and identify potential problems.

If you notice periodic crawl spikes of 404 pages, in 99% of cases this is caused by infinite crawl spaces, which we have discussed above, or it indicates other problems your website may be experiencing.

Crawl rate spikes

Often, you may want to combine server log information with Search Console data to identify the root cause.

Summary

So, if you were wondering whether crawl budget optimization is still important for your website, the answer is clearly yes.

Crawl budget is, was, and probably will be an important thing to keep in mind for every SEO professional.

Hopefully, these tips will help you optimize your crawl budget and improve your SEO performance – but remember, getting your pages crawled doesn’t mean they will be indexed.

In case you face indexation issues, I suggest reading the following articles:


All screenshots taken by author

What Is Largest Contentful Paint: An Easy Explanation via @sejournal, @vahandev

Largest Contentful Paint (LCP) is a Google user experience metric integrated into ranking systems in 2021.

LCP is one of the three Core Web Vitals (CWV) metrics that track technical performance metrics that impact user experience.

Core Web Vitals exist paradoxically, with Google providing guidance highlighting their importance but downplaying their impact on rankings.

LCP, like the other CWV signals, is useful for diagnosing technical issues and ensuring your website meets a base level of functionality for users.

What Is Largest Contentful Paint?

LCP is a measurement of how long it takes for the main content of a page to download and be ready to be interacted with.

Specifically, the time it takes from page load initiation to the rendering of the largest image or block of text within the user viewport. Anything below the fold doesn’t count.

Images, video poster images, background images, and block-level text elements like paragraph tags are typical elements measured.

LCP consists of the following sub-metrics:

  • Time to first byte (TTFB).
  • Resource load delay.
  • Resource load duration.
  • Element render delay.

Optimizing for LCP means optimizing for each of these metrics, so it takes less than 2.5 seconds to load and display LCP resources.

Here is a threshold scale for your reference:

LCP thresholds

Let’s dive into what these sub-metrics mean and how you can improve.

Time To First Byte (TTFB)

TTFB is the server response time and measures the time it takes for the user’s browser to receive the first byte of data from your server. This includes DNS lookup time, the time the server takes to process requests, and redirects.

Optimizing TTFB can significantly reduce the overall load time and improve LCP.

Server response time largely depends on:

  • Database queries.
  • CDN cache misses.
  • Inefficient server-side rendering.
  • Hosting.

Let’s review each:

1. Database Queries

If your response time is high, try to identify the source.

For example, it may be due to poorly optimized queries or a high volume of queries slowing down the server’s response time. If you have a MySQL database, you can log slow queries to find which queries are slow.

If you have a WordPress website, you can use the Query Monitor plugin to see how much time SQL queries take.

Other great tools are Blackfire or Newrelic, which do not depend on the CMS or stack you use, but require installation on your hosting/server.

2. CDN Cache Misses

A CDN cache miss occurs when a requested resource is not found in the CDN’s cache, and the request is forwarded to fetch from the origin server. This process takes more time, leading to increased latency and longer load times for the end user.

Usually, your CDN provider has a report on how many cache misses you have.

Example of CDN cache report

If you observe a high percentage (>10%) of cache misses, you may need to contact your CDN provider, or your hosting support if you have managed hosting with caching integrated, to solve the issue.

One reason that may cause cache misses is when you have a search spam attack.

For example, a dozen spammy domains link to your internal search pages with random spammy queries like [/?q=甘肃代], which are not cached because the search term is different each time. The issue is that Googlebot aggressively crawls them, which may cause high server response times and cache misses.

In that case, and overall, it is a good practice to block search or facet URLs via robots.txt. But once you block them via robots.txt, you may find those URLs indexed anyway because they have backlinks from low-quality websites.

However, don’t be afraid. John Mueller said it would be cleared in time.

Here is a real-life example from the search console of high server response time (TTFB) caused by cache misses:

Crawl spike of 404 search pages that have high server response time

3. Inefficient Server Side Rendering

You may have certain components on your website that depend on third-party APIs.

For example, you’ve seen reads and shares numbers on SEJ’s articles. We fetch those numbers from different APIs, but instead of fetching them when a request is made to the server, we prefetch them and store them in our database for faster display.

Imagine if we connect to share count and GA4 APIs when a request is made to the server. Each request takes about 300-500 ms to execute, and we would add about ~1,000 ms delay due to inefficient server-side rendering. So, make sure your backend is optimized.

4. Hosting

Be aware that hosting is highly important for low TTFB. By choosing the right hosting, you may be able to reduce your TTFB by two to three times.

Choose hosting with CDN and caching integrated into the system. This will help you avoid purchasing a CDN separately and save time maintaining it.

So, investing in the right hosting will pay off.

Read more detailed guidance:

Now, let’s look into other metrics mentioned above that contribute to LCP.

Resource Load Delay

Resource load delay is the time it takes for the browser to locate and start downloading the LCP resource.

For example, if you have a background image on your hero section that requires CSS files to load to be identified, there will be a delay equal to the time the browser needs to download the CSS file to start downloading the LCP image.

In the case when the LCP element is a text block, this time is zero.

By optimizing how quickly these resources are identified and loaded, you can improve the time it takes to display critical content. One way to do this is to preload both the CSS file and the LCP image, setting fetchpriority=”high” on the image so it starts downloading as early as possible.
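A sketch of what that preloading could look like in the page <head> (the file names are placeholders):

<link rel="preload" href="/assets/css/above-the-fold.css" as="style" />
<link rel="preload" href="/images/hero.webp" as="image" fetchpriority="high" />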



But a better approach – if you have enough control over the website – is to inline the critical CSS required for above the fold, so the browser doesn’t spend time downloading the CSS file. This saves bandwidth and will preload only the image.

Of course, it’s even better if you design webpages to avoid hero images or sliders, as those usually don’t add value, and users tend to scroll past them since they are distracting.

Another major factor contributing to load delay is redirects.

If you have external backlinks with redirects, there’s not much you can do. But you have control over your internal links, so try to find internal links with redirects, usually because of missing trailing slashes, non-WWW versions, or changed URLs, and replace them with actual destinations.

There are a number of technical SEO tools you can use to crawl your website and find redirects to be replaced.

Resource Load Duration

Resource load duration refers to the actual time spent downloading the LCP resource.

Even if the browser quickly finds and starts downloading resources, slow download speeds can still affect LCP negatively. It depends on the size of the resources, the server’s network connection speed, and the user’s network conditions.

You can reduce resource load duration by implementing:

  • WebP format.
  • Properly sized images (make the intrinsic size of the image match the visible size).
  • Load prioritization.
  • CDN.

Element Render Delay

Element render delay is the time it takes for the browser to process and render the LCP element.

This metric is influenced by the complexity of your HTML, CSS, and JavaScript.

Minimizing render-blocking resources and optimizing your code can help reduce this delay. However, it may happen that you have heavy JavaScript scripting running, which blocks the main thread, and the rendering of the LCP element is delayed until those tasks are completed.

Here is where low values of the Total Blocking Time (TBT) metric are important, as it measures the total time during which the main thread is blocked by long tasks on page load, indicating the presence of heavy scripts that can delay rendering and responsiveness.

One way you can improve not only load duration and delay but all CWV metrics overall, when users navigate within your website, is to implement the Speculation Rules API for future navigations. By prerendering pages as users mouse over links, or prerendering the pages they will most likely navigate to next, you can make your pages load instantaneously.
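A basic speculation rules snippet with an explicit URL list looks like this (the listed URLs are placeholders; the API also supports document-wide rules and eagerness settings):

<script type="speculationrules">
{
  "prerender": [
    { "source": "list", "urls": ["/widgets", "/widgets/purple"] }
  ]
}
</script>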

Beware These Scoring “Gotchas”

All elements in the user’s screen (the viewport) are used to calculate LCP. That means that images rendered off-screen and then shifted into the layout, once rendered, may not count as part of the Largest Contentful Paint score.

On the opposite end, elements starting in the user viewport and then getting pushed off-screen may be counted as part of the LCP calculation.

How To Measure The LCP Score

There are two kinds of scoring tools. The first is called Field Tools, and the second is called Lab Tools.

Field tools are actual measurements of a site.

Lab tools give a virtual score based on a simulated crawl using algorithms that approximate Internet conditions that a typical mobile phone user might encounter.

One way you can find LCP resources and measure the time to display them is via the DevTools > Performance report.

You can read more in our in-depth guide on how to measure CWV metrics, where you can learn how to troubleshoot not only LCP but other metrics altogether.

LCP Optimization Is A Much More In-Depth Subject

Improving LCP is a crucial step toward improving CWV, but it can be the most challenging CWV metric to optimize.

LCP consists of multiple layers of sub-metrics, each requiring a thorough understanding for effective optimization.

This guide has given you a basic idea of improving LCP, and the insights you’ve gained thus far will help you make significant improvements.

But there’s still more to learn. Optimizing each sub-metric is a nuanced science. Stay tuned, as we’ll publish in-depth guides dedicated to optimizing each sub-metric.

More resources:


Featured image credit: BestForBest/Shutterstock

13 Steps To Boost Your Site’s Crawlability And Indexability via @sejournal, @MattGSouthern

One of the most important elements of search engine optimization, often overlooked, is how easily search engines can discover and understand your website.

This process, known as crawling and indexing, is fundamental to your site’s visibility in search results. Without being crawled, your pages cannot be indexed, and if they are not indexed, they won’t rank or display in SERPs.

In this article, we’ll explore 13 practical steps to improve your website’s crawlability and indexability. By implementing these strategies, you can help search engines like Google better navigate and catalog your site, potentially boosting your search rankings and online visibility.

Whether you’re new to SEO or looking to refine your existing strategy, these tips will help ensure that your website is as search-engine-friendly as possible.

Let’s dive in and discover how to make your site more accessible to search engine bots.

1. Improve Page Loading Speed

Page loading speed is crucial to user experience and search engine crawlability. To improve your page speed, consider the following:

  • Upgrade your hosting plan or server to ensure optimal performance.
  • Minify CSS, JavaScript, and HTML files to reduce their size and improve loading times.
  • Optimize images by compressing them and using appropriate formats (e.g., JPEG for photographs, PNG for transparent graphics).
  • Leverage browser caching to store frequently accessed resources locally on users’ devices.
  • Reduce the number of redirects and eliminate any unnecessary ones.
  • Remove any unnecessary third-party scripts or plugins.

2. Measure & Optimize Core Web Vitals

In addition to general page speed optimizations, focus on improving your Core Web Vitals scores. Core Web Vitals are specific factors that Google considers essential in a webpage’s user experience.

These include:

  • Largest Contentful Paint (LCP): Loading performance.
  • Interaction to Next Paint (INP): Responsiveness (replacing First Input Delay).
  • Cumulative Layout Shift (CLS): Visual stability.

To identify issues related to Core Web Vitals, use tools like Google Search Console’s Core Web Vitals report, Google PageSpeed Insights, or Lighthouse. These tools provide detailed insights into your page’s performance and offer suggestions for improvement.

Some ways to optimize for Core Web Vitals include:

  • Minimize main thread work by reducing JavaScript execution time.
  • Avoid significant layout shifts by setting explicit width and height attributes on media elements and preloading fonts.
  • Improve server response times by optimizing your server, routing users to nearby CDN locations, or caching content.

By focusing on both general page speed optimizations and Core Web Vitals improvements, you can create a faster, more user-friendly experience that search engine crawlers can easily navigate and index.

3. Optimize Crawl Budget

Crawl budget refers to the number of pages Google will crawl on your site within a given timeframe. This budget is determined by factors such as your site’s size, health, and popularity.

If your site has many pages, it’s necessary to ensure that Google crawls and indexes the most important ones. Here are some ways to optimize for crawl budget:

  • Using a clear hierarchy, ensure your site’s structure is clean and easy to navigate.
  • Identify and eliminate any duplicate content, as this can waste crawl budget on redundant pages.
  • Use the robots.txt file to block Google from crawling unimportant pages, such as staging environments or admin pages.
  • Implement canonicalization to consolidate signals from multiple versions of a page (e.g., with and without query parameters) into a single canonical URL.
  • Monitor your site’s crawl stats in Google Search Console to identify any unusual spikes or drops in crawl activity, which may indicate issues with your site’s health or structure.
  • Regularly update and resubmit your XML sitemap to ensure Google has an up-to-date list of your site’s pages.

4. Strengthen Internal Link Structure

A good site structure and internal linking are foundational elements of a successful SEO strategy. A disorganized website is difficult for search engines to crawl, which makes internal linking one of the most important things you can get right.

But don’t just take our word for it. Here’s what Google’s search advocate, John Mueller, had to say about it:

“Internal linking is super critical for SEO. I think it’s one of the biggest things that you can do on a website to kind of guide Google and guide visitors to the pages that you think are important.”

If your internal linking is poor, you also risk orphaned pages, i.e., pages that no other page on your website links to. Because no links point to these pages, search engines can only find them through your sitemap.

To eliminate this problem and others caused by poor structure, create a logical internal structure for your site.

Your homepage should link to subpages supported by pages further down the pyramid. These subpages should then have contextual links that feel natural.

Another thing to keep an eye on is broken links, including internal links with typos in the URL, which lead to the dreaded 404 error. In other words, page not found.

The problem is that broken links don’t just fail to help your crawlability; they actively harm it.

Double-check your URLs, particularly if you’ve recently undergone a site migration, bulk delete, or structure change. And make sure you’re not linking to old or deleted URLs.

Other best practices for internal linking include using anchor text instead of linked images, and adding a “reasonable number” of links on a page (what counts as reasonable varies by niche, but adding too many links can be seen as a negative signal).

Oh yeah, and ensure you’re using follow links for internal links.

5. Submit Your Sitemap To Google

Given enough time, and assuming you haven’t told it not to, Google will crawl your site. And that’s great, but it’s not helping your search ranking while you wait.

If you recently made changes to your content and want Google to know about them immediately, you should submit a sitemap to Google Search Console.

A sitemap is a file that lives in your root directory. It serves as a roadmap for search engines with direct links to every page on your site.

This benefits indexability because it allows Google to learn about multiple pages simultaneously. A crawler may have to follow five internal links to discover a deep page, but by submitting an XML sitemap, it can find all of your pages with a single visit to your sitemap file.

Submitting your sitemap to Google is particularly useful if you have a deep website, frequently add new pages or content, or your site does not have good internal linking.

6. Update Robots.txt Files

You’ll want to have a robots.txt file for your website. It’s a plain text file in your website’s root directory that tells search engines how you would like them to crawl your site. Its primary use is to manage bot traffic and keep your site from being overloaded with requests.

Where this comes in handy in terms of crawlability is limiting which pages Google crawls and indexes. For example, you probably don’t want pages like directories, shopping carts, and tags in Google’s index.

Of course, this helpful text file can also negatively impact your crawlability. It’s well worth looking at your robots.txt file (or having an expert do it if you’re not confident in your abilities) to see if you’re inadvertently blocking crawler access to your pages.
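
A quick way to sanity-check this is Python’s built-in robots.txt parser. The sketch below (with placeholder URLs) reports whether specific pages are blocked for Googlebot:

# Check whether key URLs are blocked by robots.txt, using only the standard library.
# The domain and URL list are placeholders for your own pages.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

for url in ["https://www.example.com/", "https://www.example.com/category/widgets"]:
    allowed = rp.can_fetch("Googlebot", url)
    print(f"{url} -> {'allowed' if allowed else 'BLOCKED'}")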

Some common mistakes in robots.txt files include:

  • Robots.txt is not in the root directory.
  • Poor use of wildcards.
  • Noindex in robots.txt.
  • Blocked scripts, stylesheets, and images.
  • No sitemap URL.

For an in-depth examination of each of these issues, and tips for resolving them, read this article.

7. Check Your Canonicalization

A canonical tag indicates to Google which page should be treated as the main version and given authority when you have two or more pages that are similar, or even duplicate. Note that this is only a hint, not a directive, and Google doesn’t always apply it.

Canonicals can be a helpful way to tell Google to index the pages you want while skipping duplicates and outdated versions.

But this opens the door for rogue canonical tags. These refer to older versions of a page that no longer exist, leading to search engines indexing the wrong pages and leaving your preferred pages invisible.

To eliminate this problem, use a URL inspection tool to scan for rogue tags and remove them.

If your website is geared towards international traffic, i.e., if you direct users in different countries to different canonical pages, you need to have canonical tags for each language. This ensures your pages are indexed in each language your site uses.

8. Perform A Site Audit

Now that you’ve performed all these other steps, there’s still one final thing you need to do to ensure your site is optimized for crawling and indexing: a site audit.

That starts with checking the percentage of pages Google has indexed for your site.

Check Your Indexability Rate

Your indexability rate is the number of pages in Google’s index divided by the number of pages on your website.

You can find out how many pages are in Google’s index from Google Search Console’s Indexing > “Pages” report, and compare that with the number of pages on your website from your CMS admin panel.

There’s a good chance your site will have some pages you don’t want indexed, so this number likely won’t be 100%. However, if the indexability rate is below 90%, you have issues that need investigation.
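
The calculation itself is simple; here is a tiny sketch with placeholder numbers:

# Indexability rate with placeholder numbers: take indexed pages from Search Console's
# Pages report and the total page count from your CMS.
indexed_pages = 8_650
total_pages = 10_000

indexability_rate = indexed_pages / total_pages * 100
print(f"Indexability rate: {indexability_rate:.1f}%")  # 86.5% - below 90%, so worth investigating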

You can export the non-indexed URLs from Search Console and run an audit on them. This can help you understand what is causing the issue.

Another helpful site auditing tool included in Google Search Console is the URL Inspection Tool. This allows you to see what Google spiders see, which you can then compare to actual webpages to understand what Google is unable to render.

Audit (And Request Indexing) Newly Published Pages

Any time you publish new pages to your website or update your most important pages, you should ensure they’re being indexed. Go into Google Search Console and use the inspection tool to make sure they’re all showing up. If not, request indexing on the page and see if this takes effect – usually within a few hours to a day.

If you’re still having issues, an audit can also give you insight into which other parts of your SEO strategy are falling short, so it’s a double win. Scale your audit process with tools like:

9. Check For Duplicate Content

Duplicate content is another reason bots can get hung up while crawling your site. Essentially, your site structure confuses the crawler, and it doesn’t know which version of a page to index. This can be caused by things like session IDs, redundant content elements, and pagination issues.

Sometimes, this will trigger an alert in Google Search Console, telling you Google is encountering more URLs than it thinks it should. If you haven’t received one, check your crawl results for duplicate or missing tags or URLs with extra characters that could be creating extra work for bots.

Correct these issues by fixing tags, removing pages, or adjusting Google’s access.

10. Eliminate Redirect Chains And Internal Redirects

As websites evolve, redirects are a natural byproduct, directing visitors from one page to a newer or more relevant one. But while they’re common on most sites, if you’re mishandling them, you could inadvertently sabotage your indexing.

You can make several mistakes when creating redirects, but one of the most common is redirect chains. These occur when there’s more than one redirect between the link clicked on and the destination. Google doesn’t consider this a positive signal.

In more extreme cases, you may initiate a redirect loop, in which a page redirects to another page, which redirects to another page, and so on, until it eventually redirects back to the first page. In other words, you’ve created a never-ending loop that goes nowhere.

Check your site’s redirects using Screaming Frog, Redirect-Checker.org, or a similar tool.
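
If you prefer to script a quick check yourself, the hedged sketch below uses the requests library to follow each URL and report any chains; the URL list is a placeholder, and dedicated crawlers do this at scale:

# Spot redirect chains for a handful of URLs. requests records each intermediate
# redirect response in resp.history.
import requests

urls = ["https://www.example.com/old-page", "https://www.example.com/blog"]

for url in urls:
    resp = requests.get(url, allow_redirects=True, timeout=10)
    hops = [r.url for r in resp.history]  # URLs that returned a redirect
    if len(hops) > 1:
        print(f"Redirect chain ({len(hops)} hops): {' -> '.join(hops)} -> {resp.url}")
    elif hops:
        print(f"Single redirect: {hops[0]} -> {resp.url}")
    else:
        print(f"No redirect: {url}")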

11. Fix Broken Links

Similarly, broken links can wreak havoc on your site’s crawlability. You should regularly check your site to ensure you don’t have broken links, as this will hurt your SEO results and frustrate human users.

There are a number of ways you can find broken links on your site, including manually evaluating every link on your site (header, footer, navigation, in-text, etc.), or you can use Google Search Console, Analytics, or Screaming Frog to find 404 errors.

Once you’ve found broken links, you have three options for fixing them: redirecting them (see the section above for caveats), updating them, or removing them.

12. IndexNow

IndexNow is a protocol that allows websites to proactively inform search engines about content changes, ensuring faster indexing of new, updated, or removed content. By strategically using IndexNow, you can boost your site’s crawlability and indexability.

However, using IndexNow judiciously and only for meaningful content updates that substantially enhance your website’s value is crucial. Examples of significant changes include:

  • For ecommerce sites: Product availability changes, new product launches, and pricing updates.
  • For news websites: Publishing new articles, issuing corrections, and removing outdated content.
  • For dynamic websites: Updating financial data at critical intervals, changing sports scores and statistics, and modifying auction statuses.

Avoid overusing IndexNow by submitting duplicate URLs too frequently within a short timeframe, as this can negatively impact trust and rankings. Also, ensure that your content is fully live on your website before notifying IndexNow.

If possible, integrate IndexNow with your content management system (CMS) for seamless updates. If you’re manually handling IndexNow notifications, follow best practices and notify search engines of both new/updated content and removed content.
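
For manual notifications, a minimal sketch might look like the following. It is based on my understanding of the protocol’s documented JSON submission format, and the key, key file location, and URLs are placeholders:

# A hedged sketch of submitting changed URLs to the shared IndexNow endpoint.
import requests

payload = {
    "host": "www.example.com",
    "key": "your-indexnow-key",
    "keyLocation": "https://www.example.com/your-indexnow-key.txt",
    "urlList": [
        "https://www.example.com/new-product",
        "https://www.example.com/updated-pricing",
    ],
}

resp = requests.post(
    "https://api.indexnow.org/indexnow",
    json=payload,
    headers={"Content-Type": "application/json; charset=utf-8"},
)
print(resp.status_code)  # 200 or 202 generally means the submission was accepted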

By incorporating IndexNow into your content update strategy, you can ensure that search engines have the most current version of your site’s content, improving crawlability, indexability, and, ultimately, your search visibility.

13. Implement Structured Data To Enhance Content Understanding

Structured data is a standardized format for providing information about a page and classifying its content.

By adding structured data to your website, you can help search engines better understand and contextualize your content, improving your chances of appearing in rich results and enhancing your visibility in search.

There are several structured data vocabularies and formats, including:

  • Schema.org: A collaborative effort by Google, Bing, Yandex, and Yahoo! to create a unified vocabulary for structured data markup.
  • JSON-LD: A JavaScript-based format for encoding structured data that can be embedded in a web page’s <head> or <body> (see the sketch after this list).
  • Microdata: An HTML specification used to nest structured data within HTML content.
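
As a small illustration of the JSON-LD route, the sketch below builds Article markup with Python’s standard library. The property values are placeholders, and you should check the required properties for the schema type you actually use:

# Generate Article markup as JSON-LD. The output would be embedded in the page
# inside a <script type="application/ld+json"> tag.
import json

article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "13 Steps To Boost Your Site's Crawlability And Indexability",
    "author": {"@type": "Person", "name": "Matt G. Southern"},
    "datePublished": "2024-07-01",  # placeholder date
    "image": "https://www.example.com/featured-image.jpg",  # placeholder URL
}

print(json.dumps(article_schema, indent=2))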

To implement structured data on your site, follow these steps:

  • Identify the type of content on your page (e.g., article, product, event) and select the appropriate schema.
  • Mark up your content using the schema’s vocabulary, ensuring that you include all required properties and follow the recommended format.
  • Test your structured data using tools like Google’s Rich Results Test or Schema.org’s Validator to ensure it’s correctly implemented and free of errors.
  • Monitor your structured data performance using Google Search Console’s Rich Results report. This report shows which rich results your site is eligible for and any issues with your implementation.

Some common types of content that can benefit from structured data include:

  • Articles and blog posts.
  • Products and reviews.
  • Events and ticketing information.
  • Recipes and cooking instructions.
  • Person and organization profiles.

By implementing structured data, you can provide search engines with more context about your content, making it easier for them to understand and index your pages accurately.

This can improve search results visibility, mainly through rich results like featured snippets, carousels, and knowledge panels.

Wrapping Up

By following these 13 steps, you can make it easier for search engines to discover, understand, and index your content.

Remember, this process isn’t a one-time task. Regularly check your site’s performance, fix any issues that arise, and stay up-to-date with search engine guidelines.

With consistent effort, you’ll create a more search-engine-friendly website with a better chance of ranking well in search results.

Don’t be discouraged if you find areas that need improvement. Every step to enhance your site’s crawlability and indexability is a step towards better search performance.

Start with the basics, like improving page speed and optimizing your site structure, and gradually work your way through more advanced techniques.

By making your website more accessible to search engines, you’re not just improving your chances of ranking higher – you’re also creating a better experience for your human visitors.

So roll up your sleeves, implement these tips, and watch as your website becomes more visible and valuable in the digital landscape.

More Resources:


Featured Image: BestForBest/Shutterstock

SEO Aspects of Content Syndication

Syndicating content to other platforms can generate more views and increase a brand’s visibility.

Popular syndication platforms include Medium, Substack, and LinkedIn. Many established media outlets allow the placement of quality and relevant content on their websites.

Publishers who syndicate content have two ways to point it to their sites as the source:

  • A link to their original article. This is a weak signal to Google (plus, the links are typically nofollow) but can direct some traffic from the syndicated content back to your site.
  • A rel=“canonical” link element pointing to the source is a stronger signal and may send external link equity back to your article. Not all sites offer this option, however. LinkedIn and Substack, for example, do not allow canonicals.

I prefer using both links and canonicals where possible, for search engines and referral traffic. However, even with both in place, Google may choose to index and rank the syndicated content instead of the original article.

Google Decides

We’ve long known that rel=“canonical” is not a directive. Even for internal duplicate content, Google will decide which page to index and rank based on internal links and other signals (such as content depth and relevancy).

The same applies to cross-site canonical tags. Based on domain authority and external links, Google may rank a non-original version of syndicated content. Google’s John Mueller confirmed this. When asked why Google often ranks syndicated content over the original, Mueller stated:

In general, when you syndicate or republish your content across platforms, you’re trading the extra visibility within that platform with the possibility that the other platform will appear in the search results above your website…

…the rel=canonical is not a directive, even within the same site. And if the pages are different, it doesn’t make sense for search engines to treat the pages as being equivalent. If you want to make sure that “your” version is the one shown in the search, you need to use `noindex` on the alternate versions.

Unfortunately, I’m unaware of syndication platforms that would noindex a page on their site.

In the same thread, Mueller cautioned website owners against noindexing their own content for fear of a duplicate content penalty. If you cannot noindex the syndicated copies, let Google decide, he said, adding that there’s no such thing as a duplicate content penalty.

SEO Implications

Content syndication is not a tactic for search engine optimization, but it could benefit content and brand exposure.

There’s no reliable way to ensure that Google will perceive your site as the original source and rank the content accordingly.

Yet there are a few ways to make content syndication more SEO-friendly:

  • Pick syndication partners that allow rel=“canonical” tags pointing back to your site (which Google may or may not follow).
  • To keep your site’s content original, create different versions of an article when syndicating. This is time-consuming and only possible when you syndicate your content manually. It doesn’t ensure that your article will outrank syndicated versions. Nonetheless, many search engine optimizers (including me) still recommend it.

In short, syndicated content can reach a much wider audience and is thus a helpful marketing tactic. It does little, however, for search engine optimization.

3-Tiered Index Hints SEO Link Value

Leaked documentation from Google suggests that where the search giant stores a link to a web page determines, at least in part, the page’s authority.

Google stores its index of web pages on a three-tiered hardware system, each with different processing speeds. The leak from Google’s engineering documents, first reported by consultant Mike King, suggests a link’s authority depends on its storage tier — faster tiers equate to more authority.

“It was something we already suspected, but data in the leak suggests that the value of a link relates to where the page that the link exists on is stored in Google’s index,” says consultant Barry Adams, who helps publishers optimize articles for the Google News ecosystem, including Top Stories and Discover. He also noted that Google’s News algorithm processes content much faster than universal search.

Image: computer RAM sticks on a motherboard. Google stores its index of web page links across three hardware tiers; links on the RAM tier likely have higher authority.

“When an article is published, Google wants to show it in Top Stories as soon as possible because people want the latest news and updates. A lot of the processes that Google would perform as part of regular indexing and ranking of content don’t necessarily apply to News. To process News search results, Google has to take a few shortcuts along the way,” says Adams.

Ted Kubaitis at SEO Tool Lab has tested how quickly Google recognizes and incorporates new links into its rankings. So far, Google has taken at least three weeks to recognize and apply the authority of new backlinks to the rankings of web pages. News cycles are much faster.

Google overtook Yahoo years ago as the leading search engine by treating links as recommendations from web users and ranking pages accordingly. The more recommendations, or links, a web page has from other domains, the higher that page tends to rank. Google calls this “citation indexing.” The logic is recursive: if all other ranking factors are equal, a link from a web page that itself has many trustworthy links pointing to it carries more authority than a link from a page with none.
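
To make that recursive idea concrete, here is a toy sketch of a simplified PageRank-style calculation over a made-up four-page link graph. It illustrates the concept only and is not Google’s actual algorithm:

# A toy, simplified PageRank-style calculation: authority flows along links, and pages
# that receive links from well-linked pages end up with higher scores.
links = {                      # page -> pages it links to (hypothetical site)
    "home": ["blog", "product"],
    "blog": ["product"],
    "product": ["home"],
    "lonely-page": ["home"],   # links out, but nothing links to it
}

pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):            # iterate until the scores settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")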

Storage Tiers

Google’s index of web pages resides on three hardware tiers:

  • RAM: This is the fastest and most valuable tier, reserved for popular and highly ranked web pages. RAM storage includes popular News stories, which suggests backlinks from News articles pass more authority than those from niche blogs or low-traffic sites.
  • Solid state drives: The middle tier offers faster retrieval than traditional rotating hard drives but is slower than RAM. Pages stored here are less frequently accessed than those in RAM but still hold significant value.
  • Hard disk drives: The slowest tier, rotating hard disks, stores older or less frequently accessed pages. These pages are less valuable in terms of link value.

News Publishers

Engagement is a long-suspected ranking signal, especially in Google’s Discover feed. The leak data confirms what I and other SEO consultants already surmised: Google’s ranking algorithm likely tracks user interaction metrics.

That means search-result listings must generate clicks and engagement to last. To secure a top ranking, pages must appear in search results and receive clicks from users who then stick around for at least a few seconds. If they bounce (leave right away) and visit another listing for longer, a page won’t likely maintain its rankings.

News publishers with rigid paywalls experience low engagement metrics, resulting in lower rankings in Google Discover feeds and thus fewer external citations.

To mitigate this, many publishers implement a “leaky” paywall wherein visitors can view a few free pages before subscribing. This approach encourages engagement, enhancing the content’s potential inclusion in Top Stories and Discover feeds. However, excessive ads, pop-up overlays, and interstitials can increase bounce rates and weaken engagement signals.

SEO Implications

For search optimizers, this three-tiered revelation necessitates a shift in link-building tactics. Traditional methods, such as acquiring links from niche sites or older content likely stored on less valuable tiers, may have a diminishing impact on rankings.

Thus link-building tactics should prioritize web pages stored on Google’s RAM tier. This means securing mentions in high-traffic news articles and publications, which pass a higher link value.

Barry Adams states, “You need creative link-building campaigns with creative content marketing to generate news stories. You also need what I call serious PR to try and get press coverage in news stories.”

Google’s recent addition of Site Reputation Abuse to its web spam policies — “when third-party pages are published with little or no first-party oversight or involvement” — could eventually result in the search engine ignoring press release pickups, particularly if absentee publishers aggregate company news releases on a no-index subdomain.

As of this writing, I still see positive results from search-engine-optimized press releases, whether they generate original news coverage or get aggregated on a publisher’s subdomain. But publishers and press release distributors appear to be scrutinizing the content much more diligently.

The Right Way to Pass PageRank

PageRank is a component of Google’s search algorithm that assigns a value to web pages based on the number and quality of inbound links. The term is sometimes called “link equity.”

Google offers many tools to manage PageRank. Some transfer equity from one page to the next; others prevent it. All the tools are requests to Google, not directives.

Here is how to use the tools correctly.

301 Redirects

A 301 redirect is Google’s strongest signal for passing link equity. In 2013, Matt Cutts (then head of Google’s webspam team) confirmed that 301 redirects pass most PageRank to a destination page but not all. In 2016, Google’s Gary Illyes tweeted that 301 and 302 redirects retain all PageRank.

However, Google has repeatedly stated that 301 redirects may pass no link equity for anything other than 1:1 URL replacements, such as redesigns or replatforms. If the destination page differs from the original, Google may treat it as a soft 404 (i.e., a 200 OK response code for a nonexistent page) and pass no link equity. Google’s John Mueller tweeted as much in 2017.

Image: screenshot of a 2017 tweet in which Google’s John Mueller stated 301s are best for 1:1 URL replacements.

Rel=”canonical” Tag

Rel=”canonical” is a link tag to inform search engines of the original content URL without redirecting users to it. It is often used for content syndication or internal duplicate content.

Rel=”canonical” passes PageRank to the original page. But like 301 redirects, Google may ignore it and choose a representative URL based on other signals.

Nofollow Attribute

Search optimizers use the nofollow link attribute to prevent the passing of PageRank, such as for links from sponsored posts and ads. It used to be a strong directive to Google, but, like 301s and canonical tags, it is now a request.

Still, websites should include either rel=nofollow or rel=sponsored attributes for paid or affiliate links. Similarly, seek the removal of links from low-grade or spammy sites to yours, even if nofollowed.

However, nofollow meta tags (versus link attributes) apply to an entire page and are strictly honored by Google. So no equity will pass to your site from links on pages that include the following nofollow meta tag in their header code:

<meta name="robots" content="nofollow">
Disavow Files

Search optimizers use disavow files to prevent PageRank from passing to their sites from spammy ones. Disavow is a request to Google and is best used only if the site has received a manual link-related penalty.

Interaction To Next Paint (INP): Everything You Need To Know via @sejournal, @BrianHarnish

The SEO field has no shortage of acronyms.

From SEO to FID to INP – these are some of the more common ones you will run into when it comes to page speed.

There’s a new metric in the mix: INP, which stands for Interaction to Next Paint. It reflects how the page responds to user interactions and can be measured with both lab and field data in Google Chrome.

What, Exactly, Is Interaction To Next Paint?

Interaction to Next Paint, or INP, is a new Core Web Vitals metric designed to represent the overall interaction delay of a page throughout the user journey.

For example, when you click the Add to Cart button on a product page, it measures how long it takes for the button’s visual state to update, such as changing the color of the button on click.

If you have heavy scripts running that take a long time to complete, they may cause the page to freeze temporarily, negatively impacting the INP metric.

Here is the example video illustrating how it looks in real life:

Notice how the first button responds visually instantly, whereas it takes a couple of seconds for the second button to update its visual state.

How Is INP Different From FID?

The main difference between INP and First Input Delay, or FID, is that FID considers only the first interaction on the page. It measures the input delay metric only and doesn’t consider how long it takes for the browser to respond to the interaction.

In contrast, INP considers all page interactions and measures the full time the browser needs to respond to them. Specifically, INP takes into account the following types of interactions:

  • Any mouse click of an interactive element.
  • Any tap of an interactive element on any device that includes a touchscreen.
  • The press of a key on a physical or onscreen keyboard.

What Is A Good INP Value?

According to Google, a good INP value is around 200 milliseconds or less. It has the following thresholds:

  • 200 milliseconds or less: Good responsiveness.
  • Above 200 milliseconds and up to 500 milliseconds: Moderate; needs improvement.
  • Above 500 milliseconds: Poor responsiveness.

Google also notes that INP is still experimental and that the guidance it recommends regarding this metric is likely to change.

How Is INP Measured?

Google measures INP anonymously from real Chrome users, based on a sample of the longest interactions that happen when a user visits the page.

Each interaction has a few phases: input delay, processing time, and presentation delay. The interaction’s total latency is the time it takes for all three phases to complete.

If a page has fewer than 50 total interactions, INP is the interaction with the worst latency; if it has more than 50, INP ignores the single highest interaction for every 50 interactions and reports the worst of what remains.
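
A simplified sketch of that selection logic (not Chrome’s implementation) might look like this:

# Approximate the INP selection described above: total each interaction's three phases,
# then ignore one worst outlier per 50 interactions and report the worst remaining latency.
def approximate_inp(interactions):
    # interactions: list of (input_delay, processing_time, presentation_delay) in ms
    latencies = sorted(sum(phases) for phases in interactions)
    outliers_to_ignore = len(interactions) // 50
    candidates = latencies[:len(latencies) - outliers_to_ignore] or latencies
    return candidates[-1]  # the worst remaining interaction

sample = [(10, 30, 15), (207, 102, 12), (5, 20, 8), (40, 60, 25)]
print(approximate_inp(sample), "ms")  # fewer than 50 interactions, so the worst one: 321 ms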

When the user leaves the page, these measurements are sent to the Chrome User Experience Report (CrUX), which aggregates the performance data to provide insights into real-world user experiences, known as field data.

What Are The Common Reasons Causing High INPs?

Understanding the underlying causes of high INPs is crucial for optimizing your website’s performance. Here are the common causes:

  • Long tasks that can block the main thread, delaying user interactions.
  • Synchronous event listeners on click events, as we saw in the example video above.
  • DOM changes that trigger multiple reflows and repaints, which usually happens when the DOM is too large (more than roughly 1,500 HTML elements).

How To Troubleshoot INP Issues?

First, read our guide on how to measure CWV metrics and try the troubleshooting techniques offered there. But if that still doesn’t help you find what interactions cause high INP, this is where the “Performance” report of the Chrome (or, better, Canary) browser can help.

  • Go to the webpage you want to analyze.
  • Open DevTools in your Canary browser, which doesn’t have browser extensions that could skew measurements (usually by pressing F12 or Ctrl+Shift+I).
  • Switch to the Performance tab.
  • Disable cache from the Network tab.
  • Choose mobile emulator.
  • Click the Record button and interact with the page elements as you normally would.
  • Stop the recording once you’ve captured the interaction you’re interested in.

Throttle the CPU by 4x using the CPU “slowdown” dropdown to simulate an average mobile device, and choose a 4G network, which is what roughly 90% of mobile devices use when users are outdoors. If you don’t change these settings, you will run your simulation on your PC’s powerful CPU, which is not representative of mobile devices.

This is an important nuance because Google uses field data gathered from real users’ devices. You may not experience INP issues on a powerful device, which is precisely what makes INP tricky to debug. By choosing these settings, you bring your emulated environment as close as possible to the state of a real device.

Here is a video guide that shows the whole process. I highly recommend you try this as you read the article to gain experience.

What we spotted in the video is that long tasks cause the interaction to take longer, along with a list of the JavaScript files responsible for those tasks.

If you expand the Interactions section, you can see a detailed breakdown of the long task associated with that interaction. Clicking the script URLs opens the JavaScript lines responsible for the delay, which you can use to optimize your code.

In this example, the 321 ms interaction consists of:

  • Input delay: 207 ms.
  • Processing duration: 102 ms.
  • Presentation delay: 12 ms.

Below in the main thread timeline, you’ll see a long red bar representing the total duration of the long task.

Underneath the long red taskbar, you can see a yellow bar labeled “Evaluate Script,” indicating that the long task was primarily caused by JavaScript execution.

In the first screenshot, the time between point 1 and point 2 is the delay caused by the long (red) task, which is due to script evaluation.

What Is Script Evaluation?

Script evaluation is a necessary step for JavaScript execution. During this crucial stage, the browser executes the code line by line, which includes assigning values to variables, defining functions, and registering event listeners.

Users might interact with a partially rendered page while JavaScript files are still being loaded, parsed, compiled, and evaluated.

When a user interacts with an element (clicks, taps, etc.) and the browser is in the stage of evaluating a script that contains an event listener attached to the interaction, it may delay the interaction until the script evaluation is complete.

This ensures that the event listener is properly registered and can respond to the interaction.

In the screenshot (point 2), the 207 ms delay likely occurred because the browser was still evaluating the script that contained the event listener for the click.

This is where Total Blocking Time (TBT) comes in, which measures the total amount of time that long tasks (longer than 50 ms) block the main thread until the page becomes interactive.

If that time is long and users interact with the website as soon as the page renders, the browser may not be able to respond promptly to the user interaction.

TBT is not part of the CWV metrics, but it often correlates with high INP. So, in order to optimize for the INP metric, you should aim to lower your TBT.
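
As a quick illustration of how TBT accumulates, here is a tiny sketch with made-up main-thread task durations; only the portion of each task beyond 50 ms counts as blocking:

# Hypothetical main-thread tasks (in ms) observed during page load.
task_durations_ms = [30, 120, 55, 250, 48]

tbt = sum(max(0, duration - 50) for duration in task_durations_ms)
print(f"Total Blocking Time: {tbt} ms")  # (120-50) + (55-50) + (250-50) = 275 ms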

What Are Common JavaScripts That Cause High TBT?

Analytics and other third-party scripts – such as Google Analytics 4, tracking pixels, Google reCAPTCHA, or AdSense ads – usually cause high script evaluation time, thus contributing to TBT.

An example of a website where ads and analytics scripts cause high JavaScript execution time.

One strategy you may want to implement to reduce TBT is to delay the loading of non-essential scripts until after the initial page content has finished loading.

Another important point is that when delaying scripts, it’s essential to prioritize them based on their impact on user experience. Critical scripts (e.g., those essential for key interactions) should be loaded earlier than less critical ones.

Improving Your INP Is Not A Silver Bullet

It’s important to note that improving your INP is not a silver bullet that guarantees instant SEO success.

Instead, it is one item among many that may need to be completed as part of a batch of quality changes that can help make a difference in your overall SEO performance.

These include optimizing your content, building high-quality backlinks, enhancing meta tags and descriptions, using structured data, improving site architecture, addressing any crawl errors, and many others.

More resources:


Featured Image: BestForBest/Shutterstock

How To Use Python To Test SEO Theories (And Why You Should) via @sejournal, @andreasvoniatis

When working on sites with traffic, there is as much to lose as there is to gain from implementing SEO recommendations.

The downside risk of an SEO implementation gone wrong can be mitigated using machine learning models to pre-test search engine rank factors.

Pre-testing aside, split testing is the most reliable way to validate SEO theories before making the call to roll out the implementation sitewide or not.

We will go through the steps required to use Python to test your SEO theories.

Choose Rank Positions

One of the challenges of testing SEO theories is the large sample sizes required to make the test conclusions statistically valid.

Split tests – popularized by Will Critchlow of SearchPilot – favor traffic-based metrics such as clicks, which is fine if your company is enterprise-level or has copious traffic.

If your site doesn’t have that enviable luxury, then traffic as an outcome metric is likely to be a relatively rare event, which means your experiments will take too long to run and test.

Instead, consider rank positions. Quite often, small- to mid-size companies looking to grow have pages that rank for their target keywords but not yet high enough to earn traffic.

Over the timeframe of your test, for each point in time (for example, day, week, or month), there are likely to be multiple rank position data points across multiple keywords. In comparison, a traffic metric is likely to have much less data per page per date, so using rank position reduces the time period required to reach a minimum sample size.

Thus, rank position is great for non-enterprise-sized clients looking to conduct SEO split tests who can attain insights much faster.

Google Search Console Is Your Friend

Deciding to use rank positions makes Google Search Console (GSC) the straightforward (and conveniently low-cost) data source, assuming it’s set up.

GSC is a good fit here because it has an API that allows you to extract thousands of data points over time and filter for URL strings.

While the data may not be the gospel truth, it will at least be consistent, which is good enough.

Filling In Missing Data

GSC only reports rows for dates on which a URL actually received impressions, so you’ll need to create rows for the missing dates and fill in the data.

In Python, you’d combine pandas’ merge() function (think of the VLOOKUP function in Excel) to add the missing rows per URL with a fill step that inputs the values you want for those missing dates on those URLs.

For traffic metrics, that’ll be zero, whereas for rank positions, that’ll be either the median (if you’re going to assume the URL was ranking when no impressions were generated) or 100 (to assume it wasn’t ranking).

The code is given here.
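
If you just want the gist, here is a minimal pandas sketch of that gap-filling step. The column names (url, date, position, clicks) are assumptions based on a typical GSC export, and the tiny DataFrame is a stand-in for your real extract:

# Expand the GSC extract to every URL x date combination, then fill the gaps.
import pandas as pd

gsc = pd.DataFrame({
    "url": ["/page-a", "/page-a", "/page-b"],
    "date": pd.to_datetime(["2024-07-01", "2024-07-03", "2024-07-01"]),
    "position": [8.0, 9.0, 15.0],
    "clicks": [3, 1, 0],
})

# Build the full scaffold of URL x date rows, then merge the GSC rows onto it (like a VLOOKUP).
all_dates = pd.date_range(gsc["date"].min(), gsc["date"].max(), freq="D")
scaffold = pd.MultiIndex.from_product(
    [gsc["url"].unique(), all_dates], names=["url", "date"]
).to_frame(index=False)
expanded = scaffold.merge(gsc, on=["url", "date"], how="left")

# Fill the gaps: zero clicks, and either the median position or 100 for "not ranking".
expanded["clicks"] = expanded["clicks"].fillna(0)
expanded["position"] = expanded["position"].fillna(100)  # or gsc["position"].median()
print(expanded)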

Check The Distribution And Select Model

The distribution of any data describes its nature: where the most popular value (mode) of a given metric, rank position in our case, sits for a given sample population.

The distribution will also tell us how close the rest of the data points are to the middle (mean or median), i.e., how spread out (or distributed) the rank positions are in the dataset.

This is critical as it will affect the choice of model when evaluating your SEO theory test.

Using Python, this can be done both visually and analytically; visually by executing this code:

# Assumes plotnine is installed and ab_expanded is the expanded GSC DataFrame built
# earlier, with a numeric 'position' column.
from plotnine import (ggplot, aes, geom_histogram, geom_vline, labs,
                      scale_y_continuous, coord_flip, theme_light, theme,
                      element_text, element_blank)

ab_dist_box_plt = (
    ggplot(ab_expanded.loc[ab_expanded['position'].between(1, 90)],
           aes(x='position')) +
    geom_histogram(alpha=0.9, bins=30, fill="#b5de2b") +
    # Mark the median rank position with a vertical red line
    geom_vline(xintercept=ab_expanded['position'].median(), color="red", alpha=0.8, size=2) +
    labs(y='# Frequency \n', x='\nGoogle Position') +
    scale_y_continuous(labels=lambda x: ['{:,.0f}'.format(label) for label in x]) +
    # coord_flip() +
    theme_light() +
    theme(legend_position='bottom',
          axis_text_y=element_text(rotation=0, hjust=1, size=12),
          legend_title=element_blank())
)

ab_dist_box_plt

Image from author, July 2024

The chart above shows that the distribution is positively skewed (think skewer pointing right), meaning most of the keywords rank in the higher-ranked positions (shown towards the left of the red median line).

Now, we know which test statistic to use to discern whether the SEO theory is worth pursuing. In this case, there is a selection of models appropriate for this type of distribution.

Minimum Sample Size

The selected model can also be used to determine the minimum sample size required.

The required minimum sample size ensures that any observed differences between groups (if any) are real and not random luck.

That is, the difference as a result of your SEO experiment or hypothesis is statistically significant, and the probability of the test correctly reporting the difference is high (known as power).

This is achieved by simulating a number of random distributions that fit the above pattern for both test and control groups, and then running the statistical test on each simulation.

The code is given here.

When running the code, we see the following:

(0.0, 0.05) 0

(9.667, 1.0) 10000

(17.0, 1.0) 20000

(23.0, 1.0) 30000

(28.333, 1.0) 40000

(38.0, 1.0) 50000

(39.333, 1.0) 60000

(41.667, 1.0) 70000

(54.333, 1.0) 80000

(51.333, 1.0) 90000

(59.667, 1.0) 100000

(63.0, 1.0) 110000

(68.333, 1.0) 120000

(72.333, 1.0) 130000

(76.333, 1.0) 140000

(79.667, 1.0) 150000

(81.667, 1.0) 160000

(82.667, 1.0) 170000

(85.333, 1.0) 180000

(91.0, 1.0) 190000

(88.667, 1.0) 200000

(90.0, 1.0) 210000

(90.0, 1.0) 220000

(92.0, 1.0) 230000

To break it down, here is what the numbers represent, taking the row “(39.333, 1.0) 60000” as an example:

  • 39.333: The proportion (%) of simulation runs or experiments in which significance will be reached, i.e., the consistency and robustness of reaching significance.
  • 1.0: The statistical power, the probability that the test correctly rejects the null hypothesis, i.e., the experiment is designed in such a way that a difference will be correctly detected at this sample size level.
  • 60000: The sample size.

The above is interesting and potentially confusing to non-statisticians. On the one hand, it suggests that we’ll need 230,000 data points (made of rank data points over a time period) to have a 92% chance of observing SEO experiments that reach statistical significance. Yet, on the other hand, with 10,000 data points we’ll sometimes reach statistical significance. So, what should we do?

Experience has taught me that you can reach significance prematurely, so you’ll want to aim for a sample size that’s likely to hold at least 90% of the time – 220,000 data points are what we’ll need.

This is a really important point: having trained a few enterprise SEO teams, I’ve heard all of them complain about seemingly conclusive tests that didn’t produce the desired results when the winning test changes were rolled out.

Hence, the above process will avoid all that heartache, wasted time, resources and injured credibility from not knowing the minimum sample size and stopping tests too early.
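
To make the simulation approach above concrete, here is a simplified sketch of the idea (it is not the author’s linked code): generate positively skewed rank-position data for control and test groups at growing sample sizes, apply a modest improvement to the test group, and record how often a Mann-Whitney test reaches significance.

# A simplified power simulation. The gamma distribution, effect size, and run count are
# illustrative assumptions; larger sample sizes will take a moment to run.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

def share_significant(sample_size, runs=100, effect=0.9):
    hits = 0
    for _ in range(runs):
        control = np.clip(rng.gamma(shape=2.0, scale=10.0, size=sample_size), 1, 100)
        test = np.clip(rng.gamma(shape=2.0, scale=10.0, size=sample_size) * effect, 1, 100)
        _, p = mannwhitneyu(test, control, alternative="less")  # lower positions are better
        hits += p < 0.05
    return hits / runs

for n in [1_000, 10_000, 50_000]:
    print(n, share_significant(n))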

Assign And Implement

With that in mind, we can now start assigning URLs between test and control to test our SEO theory.

In Python, we’d use the np.where() function (think of an advanced IF function in Excel). It gives us several options for partitioning our subjects: by URL string pattern, content type, keywords in the title, or whatever else fits the SEO theory you’re looking to validate.

Use the Python code given here.

Strictly speaking, you would run this to collect data going forward as part of a new experiment. But you could test your theory retrospectively, assuming that there were no other changes that could interact with the hypothesis and change the validity of the test.

Something to keep in mind, as that’s a bit of an assumption!
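
As a toy illustration of the np.where() assignment described above (placeholder URLs and pattern, not the article’s linked code):

# Split URLs into test and control groups based on a URL-pattern rule.
import numpy as np
import pandas as pd

df = pd.DataFrame({"url": ["/blog/guide-1", "/product/widget-a", "/blog/guide-2", "/product/widget-b"]})

# Everything under /blog/ becomes the test group; the rest is control.
df["group"] = np.where(df["url"].str.startswith("/blog/"), "test", "control")
print(df)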

Test

Once the data has been collected, or you’re confident you have the historical data, then you’re ready to run the test.

In our rank position case, we will likely use a model like the Mann-Whitney test, as it suits the skewed, non-normal distribution of the data.
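
As a minimal, self-contained sketch of what that looks like with SciPy (the tiny DataFrame below stands in for the expanded GSC data with its group and position columns):

# Run a Mann-Whitney U test comparing rank positions between test and control groups.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.DataFrame({
    "group": ["test", "test", "test", "control", "control", "control"],
    "position": [4.0, 6.5, 7.0, 18.0, 25.0, 31.0],
})

test_positions = df.loc[df["group"] == "test", "position"]
control_positions = df.loc[df["group"] == "control", "position"]

stat, p_value = mannwhitneyu(test_positions, control_positions, alternative="two-sided")
print(f"MWU Statistic: {stat}")
print(f"P-Value: {p_value}")
print(f"Test Group: n={len(test_positions)}, mean={test_positions.mean():.2f}")
print(f"Control Group: n={len(control_positions)}, mean={control_positions.mean():.2f}")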

However, if you’re using another metric, such as clicks (which are Poisson-distributed, for example), then you’ll need a different statistical model entirely.

The code to run the test is given here.

Once run, you can print the output of the test results:

Mann-Whitney U Test Results

MWU Statistic: 6870.0

P-Value: 0.013576443923420183

Additional Summary Statistics:

Test Group: n=122, mean=5.87, std=2.37

Control Group: n=3340, mean=22.58, std=20.59

The above is the output of an experiment I ran, which showed the impact of commercial landing pages with supporting blog guides internally linking to the former versus unsupported landing pages.

In this case, we showed that offer pages supported by content marketing enjoy a higher Google rank by 17 positions (22.58 – 5.87) on average. The difference is significant, too, at 98%!

However, we need more time to get more data; in this case, another 210,000 data points. With the current sample size, we can only expect a difference like this to reach significance less than 10% of the time, so the result isn’t yet reliably repeatable.

Split Testing Can Demonstrate Skills, Knowledge And Experience

In this article, we walked through the process of testing your SEO hypotheses, covering the thinking and data requirements to conduct a valid SEO test.

By now, you may come to appreciate there is much to unpack and consider when designing, running and evaluating SEO tests. My Data Science for SEO video course goes much deeper (with more code) on the science of SEO tests, including split A/A and split A/B.

As SEO professionals, we may take certain knowledge for granted, such as the impact content marketing has on SEO performance.

Clients, on the other hand, will often challenge our knowledge, so split test methods can be most handy in demonstrating your SEO skills, knowledge, and experience!

More resources: 


Featured Image: UnderhilStudio/Shutterstock