Canonical links are frequently misunderstood by developers, marketers, content authors and SEOs. When used correctly, rel=canonical is a powerful tool that can improve search results for visitors. However, when used incorrectly, canonical links can be detrimental to search results, and may in some cases be ignored by search engines entirely.

What is Canonicalization?

Canonical links allow authors to specify a preferred version of a web page. In other words, these links tell search engines which URL should appear in search results for a particular piece of content.

When authoring content on the web, it is common to present the same material at different URLs. This frequently occurs when a site is accessible via more than one protocol or subdomain and when index files are directly accessible. For example:

  • http://example.com
  • http://example.com/index.html
  • http://www.example.com
  • http://www.example.com/index.html
  • https://example.com
  • https://example.com/index.html
  • https://www.example.com
  • https://www.example.com/index.html

However, this may also occur when content is organized into different categories or collections, and when all or part of a body of content is syndicated on other sites.
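As a rough illustration, the protocol, subdomain, and index-file variants listed above can be collapsed to a single preferred URL in a few lines of Python. This is a minimal sketch, not a complete normalizer: the preferred scheme and host, the list of index file names, and the function name `preferred_url` are all assumptions made for the example, and every input is assumed to belong to the same site.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this sketch; use whichever scheme and host your site prefers.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"
INDEX_FILES = ("index.html", "index.htm", "index.php")


def preferred_url(url: str) -> str:
    """Map a protocol, subdomain, or index-file variant to one preferred URL.

    Assumes the input URL belongs to the same site as PREFERRED_HOST.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    # Drop a trailing index file so /index.html and / resolve to the same path.
    head, _, tail = path.rpartition("/")
    if tail.lower() in INDEX_FILES:
        path = head + "/"
    # Keep the query string, discard any fragment.
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, parts.query, ""))
```

The value returned here is what a canonical link or a redirect for any of the variants would point to.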

This duplicate content presents a problem for search engines, which aim to improve user experience. If a search engine were to index each of the URLs above, a visitor could be presented with repetitive content in search results, leading to a poor user experience. Therefore, search engines “[try] hard to index and show pages with distinct information.”

When a search engine encounters “substantive blocks” of duplicate content, the search engine must select a single version of the content to show in search results.

Note: The concept of a “duplicate content penalty” is largely a myth. As Google notes:

There’s no such thing as a “duplicate content penalty.” At least, not in the way most people mean when they say that.

Demystifying the “duplicate content penalty”

Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results.

…Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.

Google Search Console Help: Duplicate content

Through the use of canonical links, authors can ask search engines to index content at only one URL.

The rel=canonical link may be added via either an HTML tag or HTTP header.

<link rel="canonical" href="http://www.example.com/canonical.html" />
Link: <http://www.example.com/canonical.html>; rel="canonical"

Note: Any HTML link may be specified as an HTTP header according to RFC 5988, Section 5.
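Both forms carry the same declaration, so they can be generated from a single URL. The sketch below shows one way to do that in Python; the helper names `canonical_tag` and `canonical_header` are hypothetical, and the header helper assumes the URL is already percent-encoded.

```python
from html import escape


def canonical_tag(url: str) -> str:
    """Build the HTML form, escaping the URL for use inside an attribute."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}" />'


def canonical_header(url: str) -> str:
    """Build the HTTP Link header form of the same declaration.

    Assumes the URL is already percent-encoded and contains no '>' character.
    """
    return f'Link: <{url}>; rel="canonical"'
```

Generating both from one value helps keep the HTML and HTTP declarations from drifting apart if a site uses both.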

Canonical Indexing

The rel=canonical link is an SEO hint, not a directive. While canonicalized URLs are not supposed to be indexed or show up in search results, there are some circumstances where search engines may choose not to honor canonical links.

Are Non-Canonical Pages Indexed?

For all practical purposes – no. If Google honors a rel=canonical tag, then the non-canonical page is not eligible for ranking. It will not have a unique cached copy, and it will not appear in the public index via a “site:” search.

Rel=Confused? Answers to Your Rel=Canonical Questions

Nevertheless, Google may ignore a canonical URL if:

  • Content is significantly different on the canonicalized URL.
  • Multiple canonical links exist on a URL.
  • A canonical loop exists wherein a canonicalized page links back to the original page.
  • The canonicalized page is redirected.
  • The target does not exist.
    • The URL is empty, broken, or results in an error or “soft 404”.
  • Google believes the link is malicious.
  • The link does not appear in the <head>, or if the <head> contains unusual content.

Note: The above list outlines some of the known instances in which Google may elect to disregard a canonical link. Other examples may exist.
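Of the cases listed above, the canonical loop is one that can be checked mechanically before a search engine ever sees it. The following is a minimal sketch: it assumes you have already collected each page's declared canonical target into a dictionary (the `declared` mapping and the function name are hypothetical) and simply follows the chain until it ends or repeats.

```python
from typing import Dict, Optional


def resolve_canonical(start: str, declared: Dict[str, str]) -> Optional[str]:
    """Follow canonical declarations from `start`.

    Returns the final URL in the chain, or None if the chain loops back
    on itself. A page that declares itself as canonical ends the chain.
    """
    seen = {start}
    current = start
    while current in declared and declared[current] != current:
        current = declared[current]
        if current in seen:
            return None  # canonical loop, e.g. A -> B -> A
        seen.add(current)
    return current
```

For example, `resolve_canonical("/a", {"/a": "/b", "/b": "/a"})` returns `None`, which flags the pair for review.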

Content Discrepancies

Canonical links are designed to connect content that is substantively similar. If the canonicalized content does not closely match the original content, search engines are unlikely to respect the link.

A large portion of the duplicate page’s content should be present on the canonical version… for example, if [two pages are] only topically similar but not extremely close in exact words, the canonical designation might be disregarded by search engines.

5 common mistakes with rel=canonical

Note: While not recommended, it is possible to use rel=canonical on dissimilar content; doing so may cause search engines to stop trusting your site’s canonical links.

The number of canonical links that exist for a specific URL is significant. When multiple canonical declarations exist for a URL, Google will “likely ignore all the rel=canonical hints”.

If multiple canonical links do exist for a URL, be careful to ensure that the links do not contradict one another by verifying that all canonical links point to the same destination.

Tip: Use the HTML or the HTTP method of declaring canonical links consistently across a site to avoid overlap and accidental conflicts.

The position of canonical links in the HTML document is also significant. Ensure that the rel=canonical tag appears as close to the beginning of the document as possible, and strive to maintain valid HTML – particularly in the document head.

The rel=canonical link tag should only appear in the <head> of an HTML document. Additionally, to avoid HTML parsing issues, it’s good to include the rel=canonical as early as possible in the <head>. When we encounter a rel=canonical designation in the <body>, it’s disregarded.

5 common mistakes with rel=canonical

Matt Cutts explains why this is the case:

If Google trusted rel=canonical in the HTML body, we’d see far more attacks where people would drop a rel=canonical on part of a web page to try to hijack it.

…so now we come to another corner case… we probably won’t trust a rel=canonical if we see weird stuff in your HEAD section… [because] we may assume that someone forgot to close the HEAD section[, and] we don’t allow rel=canonical in the BODY.

A rel=canonical corner case
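The placement and multiplicity rules discussed above can be audited with Python's standard `html.parser` module. The sketch below records whether each rel=canonical link appears inside or outside the head element and whether the head declares more than one distinct target. It is an approximation under stated assumptions: `HTMLParser` is a simple tokenizer and does not reproduce how a browser or search engine implicitly closes the head when it meets unexpected content, and the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect rel=canonical hrefs, split by whether they sit inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_links = []
        self.outside_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link":
            attr = dict(attrs)
            rels = (attr.get("rel") or "").lower().split()
            if "canonical" in rels and attr.get("href"):
                bucket = self.head_links if self.in_head else self.outside_links
                bucket.append(attr["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit_canonical(html: str) -> dict:
    """Summarize where canonical links appear and whether they disagree."""
    finder = CanonicalFinder()
    finder.feed(html)
    return {
        "head": finder.head_links,
        "outside_head": finder.outside_links,
        "conflict": len(set(finder.head_links)) > 1,
    }
```

Any entry under `outside_head`, or a `conflict` value of `True`, points to a declaration that is likely to be disregarded.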

An even poorer user experience than indexing duplicate content would be indexing the wrong content, no content, or malicious content. Matt Cutts outlines scenarios in which Google would ignore such links:

We take rel=canonical URLs as a strong hint, but in some cases we won’t use them:

  • For example, if we think you’re shooting yourself in the foot by accident (pointing a rel=canonical toward a non-existent/404 page), we’d reserve the right not to use the destination URL you specify with rel=canonical.
  • Another example where we might not go with your rel=canonical preference: if we think your website has been hacked and the hacker added a malicious rel=canonical.

Canonicalization vs Redirection

Because search engines can choose to ignore canonical links, redirection is preferred; as Matt Cutts explains, “regarding 301 redirects vs rel=canonical, in general I would use 301 redirects… they’re more widely supported.” He goes on to say, “the rel=canonical is more appropriate for when you can’t get to the server headers… if you can do 301 redirects… do [that]. If you don’t have the ability or option to do 301 redirects… rel=canonical makes sense.”

Google echoes this in its Search Console Help article “Use canonical URLs,” which recommends using “301 redirects to send traffic from [undesirable] URLs to your preferred URL. A server-side 301 redirect is the best way to ensure that users and search engines are directed to the correct page.”
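To make the server-side option concrete, here is a minimal sketch of a 301 redirect written as a plain WSGI application in Python. The preferred host and the function name are assumptions for illustration, and the scheme check relies on `wsgi.url_scheme`, which may report `http` behind a TLS-terminating proxy; in practice the same rule is often expressed in web server configuration instead.

```python
PREFERRED_HOST = "www.example.com"  # assumed preferred host for this sketch


def redirect_to_preferred(environ, start_response):
    """WSGI app: 301 requests on a non-preferred host or scheme to the preferred URL."""
    host = environ.get("HTTP_HOST", "")
    scheme = environ.get("wsgi.url_scheme", "http")
    path = environ.get("PATH_INFO", "/") or "/"
    query = environ.get("QUERY_STRING", "")
    if host != PREFERRED_HOST or scheme != "https":
        location = f"https://{PREFERRED_HOST}{path}" + (f"?{query}" if query else "")
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    # Already on the preferred URL: serve a placeholder response.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"preferred URL\n"]
```

Unlike a canonical link, this response leaves the visitor and the crawler with only one URL to deal with.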

Note: Because canonicalized pages are actually separate entities, visitor behavior is recorded separately for each unique URL in analytics tools. This may or may not be desirable.

Barring any technical reasons preventing redirection, another acceptable reason to use a canonical link over a redirect is if redirection would negatively impact user experience (e.g. categorization with breadcrumbs, pagination).

Canonicalized Pagination

A common mistake when canonicalizing pages is to canonicalize paginated content to the first page of content. Because the content of the first page is not representative of or substantively similar to the remaining paginated content, linking these pages to the first page essentially instructs search engines not to index a majority of the paginated content. This may lead search engines to ignore the canonical links.

Because “searchers commonly prefer to view a whole article or category on a single page,” Google recommends linking to a “View All” version of the content, if one exists and does not present any user experience issues. In fact, “if [Google thinks] this is what the searcher is looking for, [they] try to show the View All page in search results.”
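The choice of canonical target for a paginated series can be reduced to a small rule, sketched below in Python. The function name and parameters are hypothetical. Falling back to a self-referencing canonical when no View All page exists is an assumption of this sketch, chosen only because it avoids pointing every page at page one.

```python
from typing import Optional


def pagination_canonical(page_url: str, view_all_url: Optional[str] = None) -> str:
    """Pick the canonical target for one page of a paginated series."""
    # Prefer the View All page when one exists. Otherwise (an assumption of
    # this sketch) let the page reference itself rather than page one.
    return view_all_url if view_all_url else page_url
```

For instance, page three of an article with a View All version would declare the View All URL, while page three of a series without one would declare its own URL.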

Canonicalizing Facets

While facets can and likely should be canonicalized, Google recommends using the Google Search Console Parameter Handling tool to “tell Google about any parameters you would like ignored.” As Google notes, “ignoring certain parameters can reduce duplicate content in Google’s index, and make your site more crawlable.”

Note: Search engines impose a crawl budget on sites (i.e. content is crawled and updated over time, not immediately). Providing search engines with valuable information about how your site is structured will allow search engines to index content more efficiently.
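If facets are canonicalized in the page itself, the canonical URL is typically the faceted URL with the facet parameters removed. The Python sketch below shows one way to derive it; the parameter names in `IGNORED_PARAMS` and the function name are hypothetical and would need to match the parameters a given site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical facet and tracking parameters; substitute your site's own.
IGNORED_PARAMS = {"sort", "color", "size", "utm_source", "utm_medium"}


def facet_canonical(url: str) -> str:
    """Return the URL with ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # Rebuild the URL with the remaining parameters and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Parameters that change the content substantively, such as a page number, are deliberately left in place here.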

Combining with Robots Directives

Implementing rel=canonical in combination with robots directives like noindex is not advised, because noindex prevents pages from passing PageRank to the canonical version. As Google’s John Mueller reports:

You should not combine the noindex with a rel=canonical pointing at an indexable URL (the rel=canonical says they’re equivalent, the noindex says they’re pretty much opposites)… pick one, but not both.

John Mueller’s comment on canonical

And again:

[When using rel=canonical,] only use the rel=canonical link element… One reason for this is that we sometimes find a non-canonical URL first. If this URL has a noindex robots meta tag, we might decide not to index anything until we crawl and index the canonical URL.

John Mueller’s answer to Canonical conflicts with NOINDEX?
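A page that sends both signals can be detected with a short check. The sketch below uses Python's standard `html.parser` to look for a robots meta tag containing noindex alongside a rel=canonical link that points at a different URL. The class and function names are hypothetical, only the robots meta tag is inspected (not the X-Robots-Tag header), and URLs are compared as plain strings.

```python
from html.parser import HTMLParser


class RobotsCanonicalCheck(HTMLParser):
    """Record whether a page declares noindex and which canonical it names."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = (attr.get("content") or "").lower()
            directives = [d.strip() for d in content.split(",")]
            self.noindex = self.noindex or "noindex" in directives
        elif tag == "link" and "canonical" in (attr.get("rel") or "").lower().split():
            # Keep the first canonical link encountered.
            self.canonical = self.canonical or attr.get("href")


def has_mixed_signals(html: str, page_url: str) -> bool:
    """True when a page is noindex yet names a different URL as canonical."""
    check = RobotsCanonicalCheck()
    check.feed(html)
    return check.noindex and check.canonical not in (None, page_url)
```

Pages flagged by this check are candidates for dropping one of the two signals, in line with the advice quoted above.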