Google’s John Mueller answered a question on LinkedIn about how Google chooses canonicals, offering advice about what SEOs and publishers can do to encourage Google to pick the right URL.
What Is A Canonical URL?
In the situation where multiple URLs (the addresses of multiple web pages) have the same content, Google will choose one URL to represent all of those pages. The chosen URL is referred to as the canonical URL.
Google Search Central has published documentation that explains how SEOs and publishers can communicate their preference for which URL to use. None of these methods force Google to choose the preferred URL; they mainly serve as strong hints.
There are three ways to indicate the canonical URL (a quick way to check these signals is sketched after the list):
Redirect duplicate pages to the preferred URL (a strong signal)
Use the rel=canonical link attribute to specify the preferred URL (a strong signal)
List the preferred URL in the sitemap (a weak signal)
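To see which of these signals a given page is actually sending, here is a minimal sketch that assumes the third-party requests and beautifulsoup4 Python packages; the URL is a placeholder, and this is an illustration rather than an official Google tool:

```python
# Minimal sketch (not an official Google tool) that checks two of the signals
# above for one URL: whether it redirects, and what its rel=canonical points to.
# Assumes the third-party requests and beautifulsoup4 packages are installed.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/product?color=blue"  # hypothetical URL

response = requests.get(url, timeout=10, allow_redirects=True)

# Signal 1: redirects (a strong signal). response.history lists any hops.
for hop in response.history:
    print(f"Redirect: {hop.url} -> {hop.headers.get('Location')}")

# Signal 2: the rel=canonical link element in the head (a strong signal).
soup = BeautifulSoup(response.text, "html.parser")
canonical = soup.find("link", rel="canonical")
if canonical:
    print("rel=canonical points to:", canonical.get("href"))
else:
    print("No rel=canonical link element found.")

# Signal 3 (the sitemap, a weak signal) would require fetching the sitemap
# and checking which version of the URL is listed there.
```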
Some of Google’s canonicalization documentation incorrectly refers to rel=canonical as a link element. The link tag, <link>, is the element; rel=canonical is an attribute of that link element. Google also calls rel=canonical an annotation, which might be how Google refers to it internally, but it’s not the proper way to refer to rel=canonical (it’s an HTML attribute of the link element).
There are two important things you need to know about HTML elements and attributes:
HTML elements are the building blocks for creating a web page.
An HTML attribute is something that adds more information about that building block (the HTML element).
The Mozilla Developer Network HTML documentation (an authoritative source for HTML specifications) notes that “link” is an HTML element and that “rel” is an attribute of the link element.
Person Read The Manual But Still Has Questions
The person who read Google’s documentation, which lists the above three ways to specify a canonical, still had questions, so he asked them on LinkedIn.
He referred to the documentation as “doc” in his question:
“The mentioned doc suggests several ways to specify a canonical URL.
1. Adding <link rel="canonical"> tag in <head> section of the page, and another, 2. Through sitemap, etc.
So, if we consider only point 2 of the above.
Which means the sitemap—Technically it contains all the canonical links of a website.
Then why in some cases, a couple of the URLs in the sitemap throws: “Duplicate without user-selected canonical.” ?”
As I pointed out above, Google’s documentation says that the sitemap is a weak signal.
Google Uses More Signals For Canonicalization
John Mueller’s answer reveals that Google uses more factors or signals than what is officially documented.
He explained:
“If Google’s systems can tell that pages are similar enough that one of them could be focused on, then we use the factors listed in that document (and more) to try to determine which one to focus on.”
Internal Linking Is A Canonical Factor
Mueller next explained that internal links can be used to give Google a strong signal of which URL is the preferred one.
This is how Mueller answered:
“If you have a strong preference, it’s best to make that preference very obvious, by making sure everything on your site expresses that preference – including the link-rel-canonical in the head, sitemaps, internal links, etc. “
He then followed up with:
“When it comes to search, which one of the pages Google’s systems focus on doesn’t matter so much, they’d all be shown similarly in search. The exact URL shown is mostly just a matter for the user (who might see it) and for the site-owner (who might want to monitor & track that URL).”
Takeaways
In my experience, it’s not uncommon for a large website to contain old internal links that point to the wrong URL. Sometimes the cause isn’t old internal links but 301 redirects from an old page to another URL that is not the preferred canonical. That, too, can lead to Google choosing a URL that is not preferred by the publisher.
If Google is choosing the wrong URL, it may be useful to crawl the entire site (for example, with Screaming Frog) and review the internal linking patterns as well as the redirects. It may very well be that forgotten internal links hidden deep within the website, or chained redirects that resolve to the wrong URL, are steering Google’s choice.
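Redirect chains can be audited the same way. Below is a minimal sketch, again assuming the third-party requests package and using hypothetical URLs, that flags chained redirects and shows where each old URL finally resolves:

```python
# Minimal sketch: follow each old URL's redirect chain and flag chains that
# resolve somewhere other than the preferred canonical. URLs are placeholders.
import requests

preferred_canonical = "https://www.example.com/widgets/"
old_urls = [
    "https://www.example.com/widgets.html",
    "https://example.com/widgets",
]

for old_url in old_urls:
    response = requests.get(old_url, timeout=10, allow_redirects=True)
    hops = [hop.url for hop in response.history] + [response.url]
    if len(response.history) > 1:
        print(f"Chained redirect ({len(response.history)} hops): {' -> '.join(hops)}")
    if response.url != preferred_canonical:
        print(f"{old_url} resolves to {response.url}, not the preferred canonical.")
```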
Google’s documentation also notes that external links to the wrong page could influence which page Google chooses as the canonical, so that’s one more thing that needs to be checked for debugging why the wrong URL is being ranked.
The important takeaway here is that if the standard ways of specifying the canonical are not working, it’s possible that an external link, unintentional internal linking, or a forgotten redirect is causing Google to choose the wrong URL. Or, as John Mueller suggested, increasing the number of internal links to the preferred URL may help Google choose it.
The concept of Compressibility as a quality signal is not widely known, but SEOs should be aware of it. Search engines can use web page compressibility to identify duplicate pages, doorway pages with similar content, and pages with repetitive keywords, making it useful knowledge for SEO.
Although the following research paper demonstrates a successful use of on-page features for detecting spam, the deliberate lack of transparency by search engines makes it difficult to say with certainty if search engines are applying this or similar techniques.
What Is Compressibility?
In computing, compressibility refers to how much a file (data) can be reduced in size while retaining essential information, typically to maximize storage space or to allow more data to be transmitted over the Internet.
TL;DR Of Compression
Compression replaces repeated words and phrases with shorter references, reducing the file size by significant margins. Search engines typically compress indexed web pages to maximize storage space, reduce bandwidth, and improve retrieval speed, among other reasons.
This is a simplified explanation of how compression works (a short demonstration follows the list):
Identify Patterns: A compression algorithm scans the text to find repeated words, phrases, and patterns.
Shorter References Use Fewer Bits: The “code” that essentially symbolizes the replaced words and phrases uses less data than the originals.
Shorter Codes Take Up Less Space: The codes and symbols use less storage space than the original words and phrases, which results in a smaller file size.
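As a rough demonstration of the idea (not a description of any search engine’s actual implementation), the snippet below uses Python’s built-in zlib module to show how much more a block of repeated phrases shrinks than non-repetitive data of the same length:

```python
# Rough demonstration: repeated phrases compress far better than data with
# little repetition. Uses only the standard library; the sample text is made up.
import os
import zlib

repetitive = ("best cheap widgets buy cheap widgets online " * 200).encode("utf-8")
varied = os.urandom(len(repetitive))  # same length, essentially no repetition

print(len(repetitive), "->", len(zlib.compress(repetitive)))  # shrinks dramatically
print(len(varied), "->", len(zlib.compress(varied)))  # does not shrink (may even grow)
```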
A bonus effect of using compression is that it can also be used to identify duplicate pages, doorway pages with similar content, and pages with repetitive keywords.
Research Paper About Detecting Spam
This research paper is significant because it was authored by distinguished computer scientists known for breakthroughs in AI, distributed computing, information retrieval, and other fields.
One of the co-authors is Dennis Fetterly, currently a software engineer at Google. He is listed as a co-inventor in a patent for a ranking algorithm that uses links, and is known for his research in distributed computing and information retrieval.
Fetterly is just one of the distinguished researchers listed as co-authors of the 2006 Microsoft research paper about identifying spam through on-page content features. Among the several on-page content features the research paper analyzes is compressibility, which they discovered can be used as a classifier indicating that a web page is spammy.
Detecting Spam Web Pages Through Content Analysis
Although the research paper was authored in 2006, its findings remain relevant today.
Then, as now, people attempted to rank hundreds or thousands of location-based web pages that were essentially duplicate content aside from city, region, or state names. Then, as now, SEOs often created web pages for search engines by excessively repeating keywords within titles, meta descriptions, headings, internal anchor text, and within the content to improve rankings.
Section 4.6 of the research paper explains:
“Some search engines give higher weight to pages containing the query keywords several times. For example, for a given query term, a page that contains it ten times may be higher ranked than a page that contains it only once. To take advantage of such engines, some spam pages replicate their content several times in an attempt to rank higher.”
The research paper explains that search engines compress web pages and use the compressed version to reference the original web page. They note that excessive amounts of redundant words result in a higher level of compressibility, so they set about testing whether there’s a correlation between a high level of compressibility and spam.
They write:
“Our approach in this section to locating redundant content within a page is to compress the page; to save space and disk time, search engines often compress web pages after indexing them, but before adding them to a page cache.
…We measure the redundancy of web pages by the compression ratio, the size of the uncompressed page divided by the size of the compressed page. We used GZIP …to compress pages, a fast and effective compression algorithm.”
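Following that definition, the compression ratio is easy to compute. The snippet below is an illustration using Python’s built-in gzip module on a made-up page; the 4.0 threshold mentioned in the comment comes from the study’s findings discussed below, not from any published Google rule:

```python
# Illustration of the paper's metric: compression ratio =
# uncompressed size / compressed size, using gzip as the researchers did.
# The HTML string is a made-up stand-in for a fetched page.
import gzip

html = "<html><body>" + "<p>Best cheap widgets in Springfield.</p>" * 500 + "</body></html>"
raw = html.encode("utf-8")

compression_ratio = len(raw) / len(gzip.compress(raw))
print(f"Compression ratio: {compression_ratio:.1f}")

# In the study, pages with a ratio of at least 4.0 were predominantly spam.
if compression_ratio >= 4.0:
    print("Highly redundant content (would have been flagged in the study).")
```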
High Compressibility Correlates To Spam
The results of the research showed that web pages with a compression ratio of at least 4.0 tended to be low-quality web pages, i.e., spam. However, at the highest compression ratios the results became less consistent because there were fewer data points, making them harder to interpret.
Figure 9: Prevalence of spam relative to compressibility of page.
The researchers concluded:
“70% of all sampled pages with a compression ratio of at least 4.0 were judged to be spam.”
But they also discovered that using the compression ratio by itself still resulted in false positives, where non-spam pages were incorrectly identified as spam:
“The compression ratio heuristic described in Section 4.6 fared best, correctly identifying 660 (27.9%) of the spam pages in our collection, while misidentifying 2,068 (12.0%) of all judged pages.
Using all of the aforementioned features, the classification accuracy after the ten-fold cross validation process is encouraging:
95.4% of our judged pages were classified correctly, while 4.6% were classified incorrectly.
More specifically, for the spam class 1,940 out of the 2,364 pages were classified correctly. For the non-spam class, 14,440 out of the 14,804 pages were classified correctly. Consequently, 788 pages were classified incorrectly.”
The next section describes an interesting discovery about how to increase the accuracy of using on-page signals for identifying spam.
Insight Into Quality Rankings
The research paper examined multiple on-page signals, including compressibility. They discovered that each individual signal (classifier) was able to find some spam, but that relying on any one signal on its own resulted in flagging non-spam pages as spam, which is commonly referred to as a false positive.
The researchers made an important discovery that everyone interested in SEO should know, which is that using multiple classifiers increased the accuracy of detecting spam and decreased the likelihood of false positives. Just as important, the compressibility signal only identifies one kind of spam but not the full range of spam.
The takeaway is that compressibility is a good way to identify one kind of spam, but there are other kinds of spam that aren’t caught with this one signal.
This is the part that every SEO and publisher should be aware of:
“In the previous section, we presented a number of heuristics for assaying spam web pages. That is, we measured several characteristics of web pages, and found ranges of those characteristics which correlated with a page being spam. Nevertheless, when used individually, no technique uncovers most of the spam in our data set without flagging many non-spam pages as spam.
For example, considering the compression ratio heuristic described in Section 4.6, one of our most promising methods, the average probability of spam for ratios of 4.2 and higher is 72%. But only about 1.5% of all pages fall in this range. This number is far below the 13.8% of spam pages that we identified in our data set.”
So, even though compressibility was one of the better signals for identifying spam, it still was unable to uncover the full range of spam within the dataset the researchers used to test the signals.
Combining Multiple Signals
The above results indicated that individual signals of low quality are less accurate on their own, so they tested using multiple signals. What they discovered was that combining multiple on-page signals for detecting spam resulted in a better accuracy rate with fewer pages misclassified as spam.
The researchers explained that they tested the use of multiple signals:
“One way of combining our heuristic methods is to view the spam detection problem as a classification problem. In this case, we want to create a classification model (or classifier) which, given a web page, will use the page’s features jointly in order to (correctly, we hope) classify it in one of two classes: spam and non-spam.”
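As a rough illustration of that idea, the sketch below combines a few on-page features into a single decision-tree classifier. C4.5 itself isn’t available in scikit-learn, so DecisionTreeClassifier serves as a stand-in, and the feature values, labels, and the two non-compressibility features are made up for demonstration:

```python
# Rough illustration of combining several on-page signals into one classifier.
# C4.5 is not in scikit-learn; DecisionTreeClassifier is a stand-in.
# The feature values and labels below are made up for demonstration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [compression_ratio, fraction_of_words_in_title, avg_word_length]
X = [
    [1.8, 0.05, 5.1],  # normal page
    [2.1, 0.07, 4.9],  # normal page
    [4.6, 0.35, 4.2],  # redundant doorway-style page
    [5.2, 0.40, 3.9],  # redundant doorway-style page
]
y = [0, 0, 1, 1]  # 0 = non-spam, 1 = spam (hand-labeled in the study)

classifier = DecisionTreeClassifier(random_state=0).fit(X, y)
print(classifier.predict([[4.8, 0.30, 4.0]]))  # -> [1], i.e. spam-like
```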
These are their conclusions about using multiple signals:
“We have studied various aspects of content-based spam on the web using a real-world data set from the MSNSearch crawler. We have presented a number of heuristic methods for detecting content based spam. Some of our spam detection methods are more effective than others, however when used in isolation our methods may not identify all of the spam pages. For this reason, we combined our spam-detection methods to create a highly accurate C4.5 classifier. Our classifier can correctly identify 86.2% of all spam pages, while flagging very few legitimate pages as spam.”
Key Insight:
Flagging “very few legitimate pages as spam” was a significant breakthrough. The important insight that everyone involved with SEO should take away from this is that one signal by itself can result in false positives. Using multiple signals increases the accuracy.
What this means is that SEO tests of isolated ranking or quality signals will not yield reliable results that can be trusted for making strategy or business decisions.
Takeaways
We don’t know for certain whether search engines use compressibility, but it’s an easy-to-use signal that, combined with others, could catch simple kinds of spam, like thousands of city-name doorway pages with similar content. Even if search engines don’t use this signal, it shows how easy it is to catch that kind of search engine manipulation and that it’s something search engines are well able to handle today.
Here are the key points of this article to keep in mind:
Doorway pages with duplicate content are easy to catch because they compress at a higher ratio than normal web pages.
Groups of web pages with a compression ratio above 4.0 were predominantly spam.
Negative quality signals used by themselves to catch spam can lead to false positives.
In this particular test, they discovered that on-page negative quality signals only catch specific types of spam.
When used alone, the compressibility signal only catches redundancy-type spam, fails to detect other forms of spam, and leads to false positives.
Google Search Central published new documentation on Google Trends, explaining how to use it for search marketing. This guide serves as an easy to understand introduction for newcomers and a helpful refresher for experienced search marketers and publishers.
The new guide has six sections:
About Google Trends
Tutorial on monitoring trends
How to do keyword research with the tool
How to prioritize content with Trends data
How to use Google Trends for competitor research
How to use Google Trends for analyzing brand awareness and sentiment
The section about monitoring trends explains that there are two kinds of rising trends, general and specific, which can be useful for developing content to publish on a site.
Using the Explore tool, you can leave the search box empty and view the current rising trends worldwide, or use a drop-down menu to focus on trends in a specific country. Users can further filter rising trends by time period, category, and type of search. The results show rising trends by topic and by keyword.
To search for specific trends, users just need to enter the specific queries and then filter them by country, time, category, and type of search.
The section called Content Calendar describes how to use Google Trends to understand which content topics to prioritize.
Google explains:
“Google Trends can be helpful not only to get ideas on what to write, but also to prioritize when to publish it. To help you better prioritize which topics to focus on, try to find seasonal trends in the data. With that information, you can plan ahead to have high quality content available on your site a little before people are searching for it, so that when they do, your content is ready for them.”
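Google Trends is a web UI, but readers who want to pull the same interest-over-time data into a script sometimes use the unofficial, third-party pytrends library. The sketch below is one way to look for seasonal peaks; it is not an official Google API, and the keyword is a placeholder:

```python
# Sketch using the unofficial pytrends library (pip install pytrends) to pull
# five years of interest-over-time data and spot seasonal peaks for planning.
# This is not an official Google API; the keyword is a placeholder.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(["pumpkin spice"], timeframe="today 5-y", geo="US")

interest = pytrends.interest_over_time()  # pandas DataFrame indexed by date
monthly_avg = interest["pumpkin spice"].groupby(interest.index.month).mean()
print(monthly_avg.sort_values(ascending=False).head(3))  # peak months
```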
Google’s Search Advocate, John Mueller, shared insights on diagnosing widespread crawling issues.
This guidance was shared in response to a disruption reported by Adrian Schmidt on LinkedIn. Google’s crawler stopped accessing several of his domains at the same time.
Despite the interruption, Schmidt noted that live tests via Search Console continued to function without error messages.
Investigations indicated no increase in 5xx errors or issues with robots.txt requests.
What could the problem be?
Mueller’s Response
Addressing the situation, Mueller pointed to shared infrastructure as the likely cause:
“If it shared across a bunch of domains and focuses on something like crawling, it’s probably an issue with a shared piece of infrastructure. If it’s already recovering, at least it’s not urgent anymore and you have a bit of time to poke at recent changes / infrastructure logs.”
Infrastructure Investigation
All affected sites used Cloudflare as their CDN, which raised some eyebrows.
When asked about debugging, Mueller recommended checking Search Console data to determine whether DNS or failed requests were causing the problem.
Mueller stated:
“The crawl stats in Search Console will also show a bit more, perhaps help decide between say DNS vs requests failing.”
He also pointed out that the timing was a key clue:
“If it’s all at exactly the same time, it wouldn’t be robots.txt, and probably not DNS.”
Impact on Search Results
Regarding search visibility concerns, Mueller offered reassurance that this type of disruption wouldn’t cause any visible issues:
“If this is from today, and it just lasted a few hours, I wouldn’t expect any visible issues in search.”
Why This Matters
When Googlebot suddenly stops crawling across numerous sites simultaneously, it can be challenging to identify the root cause.
While temporary crawling pauses might not immediately impact search rankings, they can disrupt Google’s ability to discover and index new content.
The incident highlights a vulnerability organizations might face without realizing it, especially those relying on shared infrastructure.
How This Can Help You
If Googlebot suddenly stops crawling your sites:
Check if the problem hits multiple sites at once
Look at your shared infrastructure first
Use Search Console data to narrow down the cause
Don’t rule out DNS just because regular traffic looks fine
Keep an eye on your logs (a minimal log-parsing sketch follows this list)
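As an example of that last point, the sketch below scans a common-format access log for Googlebot requests and tallies them by day and status code, which makes a sudden stop or a spike in errors easy to spot. The log path and layout are assumptions about a typical setup, and user-agent matching alone doesn’t verify that requests really come from Google:

```python
# Minimal sketch: tally Googlebot requests per day and status code from an
# access log in common/combined format. The path and layout are assumptions;
# for real verification of Googlebot, also confirm the IPs belong to Google.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path
# e.g. 66.249.66.1 - - [10/Nov/2024:06:25:14 +0000] "GET / HTTP/1.1" 200 ...
line_pattern = re.compile(r'\[(\d{2}/\w{3}/\d{4}):.*?\] "[A-Z]+ \S+ \S+" (\d{3})')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        match = line_pattern.search(line)
        if match:
            day, status = match.groups()
            counts[(day, status)] += 1

for (day, status), total in sorted(counts.items()):
    print(f"{day}  status {status}: {total} Googlebot requests")
```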
For anyone running multiple sites behind a CDN, make sure you:
Have good logging set up
Watch your crawl rates
Know who to call when things go sideways
Keep tabs on your infrastructure provider
In a recent update to its Search Central documentation, Google has added specific guidelines for URL parameter formatting.
The update brings parameter formatting recommendations from a faceted navigation blog post into the main URL structure documentation, making these guidelines more accessible.
Key Updates
The new documentation specifies that developers should use the following (a short example appears after the lists below):
Equal signs (=) to separate key-value pairs
Ampersands (&) to connect multiple parameters
Google recommends against using alternative separators such as:
Colons and brackets
Single or double commas
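As a quick illustration of the recommended key=value pairs joined by ampersands, here’s a short sketch using Python’s standard library; the domain and parameter names are made up:

```python
# Recommended format: key=value pairs separated by "&", which standard
# libraries can build and parse reliably. Parameter names are made up.
from urllib.parse import urlencode, parse_qs, urlparse

params = {"category": "widgets", "color": "blue", "sort": "price-asc"}

recommended = "https://www.example.com/products?" + urlencode(params)
print(recommended)
# https://www.example.com/products?category=widgets&color=blue&sort=price-asc

# Standard parsers recover the key-value pairs cleanly:
print(parse_qs(urlparse(recommended).query))

# A non-standard style like the one Google recommends against, for example
# /products?category:widgets,color:blue, cannot be parsed this way.
```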
Why This Matters
URL parameters play a role in website functionality, particularly for e-commerce sites and content management systems.
They control everything from product filtering and sorting to tracking codes and session IDs.
While powerful, they can create SEO challenges like duplicate content and crawl budget waste.
Proper parameter formatting ensures better crawling efficiency and can help prevent common indexing issues that affect search performance.
The documentation addresses broader URL parameter challenges, such as managing dynamic content generation, handling session IDs, and effectively implementing sorting parameters.
Previous Guidance
Before this update, developers had to reference an old blog post about faceted navigation to find specific URL parameter formatting guidelines.
Consolidating this information into the main guidelines makes it easier to find.
The updated documentation can be found in Google’s Search Central documentation under the Crawling and Indexing section.
Looking Ahead
If you’re using non-standard parameter formats, start planning a migration to the standard format. Ensure proper redirects, and monitor your crawl stats during the switch.
While Google has not said non-standard parameters will hurt rankings, this update clarifies what they prefer. New sites and redesigns should adhere to the standard format to avoid future headaches.
Google Search Central updated their favicon documentation to recommend higher-resolution images, exceeding the previous minimum standard. Be aware of the changes described below, as they may impact how your site appears in search results.
Favicon
A favicon is a custom icon that is shown in browser tabs, browser bookmarks, browser favorites and sometimes in the search results. The word “favicon” is short for Favorites Icon.
An attractive favicon is useful for making it easier for users to find links to your site from their bookmarks, folders and browser tabs and can (in theory) help increase clicks from the search results. Thus, a high quality favicon that meets Google’s requirements is important in order to maximize popularity, user interactions and engagements, and visits from the search engine results pages (SERPs).
What Changed?
One of the changes to Google’s documentation is to make it clearer that a favicon must have a square aspect ratio. The other important change is to strongly emphasize that publishers use a favicon that’s at least 48×48 pixels. Eight by eight pixels is still the minimum acceptable size for a favicon, but publishers will probably miss out on the opportunity for a better presentation in the search results by going with an 8×8 pixel favicon.
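To check whether an existing favicon file meets the square, at-least-48×48 guidance, a quick sketch using the third-party Pillow library works; the file path is a placeholder:

```python
# Quick check (using the third-party Pillow library) that a favicon file is
# square and at least 48x48, per the updated guidance. The path is a placeholder.
from PIL import Image

with Image.open("favicon.png") as icon:
    width, height = icon.size

if width != height:
    print(f"Not square: {width}x{height} (a 1:1 aspect ratio is required).")
elif width < 8:
    print(f"{width}x{height} is below the 8x8 minimum.")
elif width < 48:
    print(f"{width}x{height} meets the 8x8 minimum but is below the recommended 48x48.")
else:
    print(f"{width}x{height} looks good.")
```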
This is the part of the documentation that changed:
Previous version:
“Your favicon must be a multiple of 48px square, for example: 48x48px, 96x96px, 144x144px and so on (or SVG with a 1:1 (square) aspect ratio).”
New version:
“Your favicon must be a square (1:1 aspect ratio) that’s at least 8x8px. While the minimum size requirement is 8x8px, we recommend using a favicon that’s larger than 48x48px so that it looks good on various surfaces.”
Comparison Of Favicon Sizes
Reason For Documentation Changes
Google’s changelog for the documentation says that the change was made to clarify Google’s requirements. This is an example of Google reviewing its documentation to see how it can be improved. It’s the kind of thing that all publishers, even ecommerce merchants, should do at least once a year to identify whether they overlooked an opportunity to communicate a clearer message. Even ecommerce or local merchants can benefit from a yearly content review, because things change or customer feedback can reveal a gap in necessary information.
This is Google’s official explanation for the change:
“What: Updated the favicon guidelines to state that favicons must have a 1:1 aspect ratio and be at least 8x8px in size, with a strong recommendation for using a higher resolution favicon of at least 48x48px.
Why: To reflect the actual requirements for favicons.”
Google Ads is enhancing its Performance Max campaigns with new AI-driven features.
These updates are focused on asset testing, video optimization, and campaign management.
The features arrive as advertisers gear up for the holiday shopping season.
Key Updates
New Asset Testing Capabilities
Starting in early November, retailers will gain access to new experimental features within Performance Max.
A key addition is the ability to measure the impact of supplementary assets beyond product feeds.
That means advertisers can measure the effectiveness of adding images, text, and video content to product-feed campaigns.
Google is also implementing Final URL expansion testing. This allows advertisers to evaluate whether alternative landing pages can drive better conversion rates by matching user intent.
Advanced Image Generation
Google is integrating Imagen 3, its latest text-to-image AI model, into the Google Ads platform.
This update aims to generate higher-performing visuals across Performance Max, Demand Gen, App, and Display campaigns.
The model has been trained on advertising performance data to create more effective commercial imagery.
Video Enhancement Tools
Google Ads is introducing automated video optimization features that include:
Automatic aspect ratio adaptation for different YouTube formats
Smart video shortening while preserving key messages
Granular control over enhanced video assets
These features roll out with built-in quality controls and opt-out options at the campaign and individual asset levels.
While most features are immediately available, video shortening for Demand Gen campaigns will launch in 2025.
Campaign Hierarchy Changes
There is a significant change in how Performance Max and Standard Shopping campaigns interact.
Instead of automatic prioritization for Performance Max campaigns, Google is introducing an Ad Rank-based system.
This new system determines which ads to serve when both campaign types target the same products within an account.
Improved Collaboration Features
Google is expanding shareable ad previews to Performance Max campaigns that include product feeds and travel objectives.
This simplifies the creative review process by allowing preview access without requiring Google Ads credentials.
Context
These updates demonstrate Google’s commitment to AI-driven advertising, particularly as businesses prepare for seasonal peaks.
This timely release suggests Google Ads is focusing on providing advanced tools for optimizing holiday marketing campaigns.
Looking Ahead
For advertisers currently using Performance Max, these updates provide new opportunities to optimize campaign performance with experimental features and improved creative capabilities.
The rollout starts immediately for most features. Specific tools, such as retail asset testing, will be available in early November, and video shortening for Demand Gen campaigns is expected to launch in 2025.
Google’s John Mueller has pushed back on the idea that site performance changes cause dramatic drops in rankings:
“We’ve been pretty clear that Core Web Vitals are not giant factors in ranking, and I doubt you’d see a big drop just because of that.”
The main benefit of improving website performance is providing a better user experience.
A poor experience could naturally decrease traffic by discouraging return visitors, regardless of how they initially found the site.
Mueller continues:
“Having a website that provides a good experience for users is worthwhile, because if users are so annoyed that they don’t want to come back, you’re just wasting the first-time visitors to your site, regardless of where they come from.”
Small Sites’ Competitive Edge
Mueller believes smaller websites have a unique advantage when it comes to implementing SEO changes.
Recalling his experience of trying to get a big company to change a robots.txt line, he explains:
“Smaller sites have a gigantic advantage when it comes to being able to take advantage of changes – they can be so much more nimble.”
Mueller noted that larger organizations may need extensive processes for simple changes, while smaller sites can update things like robots.txt in just 30 minutes.
He adds:
“None of this is easy, you still need to figure out what to change to adapt to a dynamic ecosystem online, but I bet if you want to change your site’s robots.txt (for example), it’s a matter of 30 minutes at most.”
Context
Mueller’s response followed research presented by Andrew Mcleod, who documented consistent patterns across multiple websites indicating rapid ranking changes after performance modifications.
In one case, a site with over 50,000 monthly visitors experienced a drop in traffic within 72 hours of implementing advertisements.
Mcleod’s analysis, which included five controlled experiments over three months, showed:
Traffic drops of up to 20% within 48 hours of enabling ads
Recovery periods of 1-2 weeks after removing ads
Consistent patterns across various test cases
Previous Statements
This latest guidance aligns with Mueller’s previous statements on Core Web Vitals.
In a March podcast, Mueller confirmed that Core Web Vitals are used in “ranking systems or in Search systems,” but emphasized that perfect scores won’t notably affect search results.
Mueller’s consistent message is clear: while Core Web Vitals are important for user experience and are part of Google’s ranking systems, you should prioritize content quality rather than focus on metrics.
Looking Ahead
Per Mueller, Core Web Vitals are part of Google’s ranking systems but are not major factors, and they are unlikely to cause big ranking drops on their own.
While Google’s stance on ranking factors remains unchanged, the reality is that technical performance and user experience work together to influence traffic.