Search engines like Google have a problem – it’s called ‘duplicate content’. Duplicate content means that similar content appears at multiple locations (URLs) on the web, and as a result search engines don’t know which URL to show in the search results. This can hurt the ranking of a webpage, and the problem only gets worse when people start linking to the different versions of the same content. This article will help you to understand the various causes of duplicate content, and to find the solution to each of them.
What is duplicate content?
Duplicate content is content which is available on multiple URLs on the web. Because more than one URL shows the same content, search engines don’t know which URL to list higher in the search results. Therefore they might rank both URLs lower and give preference to other webpages.
In this article, we’ll mostly focus on the technical causes of duplicate content and their solutions. If you’d like to get a broader perspective on duplicate content and learn how it relates to copied or scraped content or even keyword cannibalization.
Let’s illustrate this with an example
Duplicate content can be likened to being at a crossroads where road signs point in two different directions for the same destination: Which road should you take? To make matters worse, the final destination is different too, but only ever so slightly. As a reader, you don’t mind because you get the content you came for, but a search engine has to pick which page to show in the search results because, of course, it doesn’t want to show the same content twice.
Let’s say your article about ‘keyword x’ appears at http://www.example.com/keyword-x/
and the same content also appears at http://www.example.com/article-category/keyword-x/
. This situation is not fictitious: it happens in lots of modern Content Management Systems. Then let’s say your article has been picked up by several bloggers and some of them link to the first URL, while others link to the second. This is when the search engine’s problem shows its true nature: it’s your problem. The duplicate content is your problem because those links both promote different URLs. If they were all linking to the same URL, your chances of ranking for ‘keyword x’ would be higher.
Causes of duplicate content
There are dozens of reasons for duplicate content. Most of them are technical: it’s not very often that a human decides to put the same content in two different places without making clear which is the original – it feels unnatural to most of us. There are many technical reasons though and it mostly happens because developers don’t think like a browser or even a user, let alone a search engine spider – they think like a programmer. Take that article we mentioned earlier, that appears on http://www.example.com/keyword-x/
and http://www.example.com/article-category/keyword-x/
. If you ask the developer, they will say it only exists once.
Misunderstanding the concept of a URL
No, that developer hasn’t gone mad, they are just speaking a different language. A CMS will probably power the website, and in that database there’s only one article, but the website’s software just allows for that same article in the database to be retrieved through several URLs. That’s because, in the eyes of the developer, the unique identifier for that article is the ID that article has in the database, not the URL. But for the search engine, the URL is the unique identifier for a piece of content. If you explain that to a developer, they will begin to get the problem. And after reading this article, you’ll even be able to provide them with a solution right away.
Session IDs
You often want to keep track of your visitors and allow them, for instance, to store items they want to buy in a shopping cart. In order to do that, you have to give them a ‘session.’ A session is a brief history of what the visitor did on your site and can contain things like the items in their shopping cart. To maintain that session as a visitor clicks from one page to another, the unique identifier for that session – called the Session ID – needs to be stored somewhere. The most common solution is to do that with cookies. However, search engines don’t usually store cookies.
At that point, some systems fall back to using Session IDs in the URL. This means that every internal link on the website gets that Session ID added to its URL, and because that Session ID is unique to that session, it creates a new URL, and therefore duplicate content.
URL parameters used for tracking and sorting
Another cause of duplicate content is using URL parameters that do not change the content of a page, for instance in tracking links. You see, to a search engine, http://www.example.com/keyword-x/
and http://www.example.com/keyword-x/?source=rss
are not the same URL. The latter might allow you to track what source people came from, but it might also make it harder for you to rank well – very much an unwanted side effect!
This doesn’t just go for tracking parameters, of course. It goes for every parameter you can add to a URL that doesn’t change the vital piece of content, whether that parameter is for ‘changing the sorting on a set of products’ or for ‘showing another sidebar’: all of them cause duplicate content.
Scrapers and content syndication
Most of the reasons for duplicate content are either the ‘fault’ of you or your website. Sometimes, however, other websites use your content, with or without your consent. They don’t always link to your original article, and therefore the search engine doesn’t ‘get’ it and has to deal with yet another version of the same article. The more popular your site becomes, the more scrapers you’ll get, making this problem bigger and bigger.
Order of parameters
Another common cause is that a CMS doesn’t use nice clean URLs, but rather URLs like /?id=1&cat=2
, where ID refers to the article and cat refers to the category. The URL /?cat=2&id=1
will render the same results in most website systems, but they’re completely different for a search engine.
Comment pagination
In my beloved WordPress, but also in some other systems, there is an option to paginate your comments. This leads to the content being duplicated across the article URL, and the article URL + /comment-page-1/, /comment-page-2/ etc.
Printer-friendly pages
If your content management system creates printer-friendly pages and you link to those from your article pages, Google will usually find them, unless you specifically block them. Now, ask yourself: Which version do you want Google to show? The one with your ads and peripheral content, or the one that only shows your article?
WWW vs. non-WWW
This is one of the oldest in the book, but sometimes search engines still get it wrong: WWW vs. non-WWW duplicate content, when both versions of your site are accessible. Another, less common situation but one I’ve seen as well is HTTP vs. HTTPS duplicate content, where the same content is served out over both.
Conceptual solution: a ‘canonical’ URL
As we’ve already seen, the fact that several URLs lead to the same content is a problem, but it can be solved. One person who works at a publication will normally be able to tell you quite easily what the ‘correct’ URL for a certain article should be, but sometimes when you ask three people within the same company, you’ll get three different answers…
That’s a problem that needs addressing because, in the end, there can be only one (URL). That ‘correct’ URL for a piece of content is referred to as the Canonical URL by the search engines.
Identifying duplicate contents issues
You might not know whether you have a duplicate content issue on your site or with your content. Using Google is one of the easiest ways to spot duplicate content.
There are several search operators that are very helpful in cases like these. If you’d want to find all the URLs on your site that contain your keyword X article, you’d type the following search phrase into Google:
site:example.com intitle:"Keyword X"
Google will then show you all pages on example.com that contain that keyword. The more specific you make that intitle
part of the query, the easier it is to weed out duplicate content. You can use the same method to identify duplicate content across the web. Let’s say the full title of your article was ‘Keyword X – why it is awesome’, you’d search for:
intitle:"Keyword X - why it is awesome"
And Google would give you all sites that match that title. Sometimes it’s worth even searching for one or two complete sentences from your article, as some scrapers might change the title. In some cases, when you do a search like that, Google might show a notice.
Practical solutions for duplicate content
Once you’ve decided which URL is the canonical URL for your piece of content, you have to start a process of canonicalization (yeah we know, try saying that three times out loud fast). This means we have to tell search engines about the canonical version of a page and let them find it ASAP. There are four methods of solving the problem, in order of preference:
1. Not creating duplicate content
2. Redirecting duplicate content to the canonical URL
3. Adding a canonical link element to the duplicate page
4. Adding an HTML link from the duplicate page to the canonical page
Avoiding duplicate content
Some of the above causes for duplicate content have very simple fixes to them:
Are there Session ID’s in your URLs?
These can often just be disabled in your system’s settings.
Have you got duplicate printer friendly pages?
These are completely unnecessary: you should just use a print style sheet.
Are you using comment pagination in WordPress?
You should just disable this feature (under settings » discussion) on 99% of sites.
Are your parameters in a different order?
Tell your programmer to build a script to always put parameters in the same order (this is often referred to as a URL factory).
Are there tracking links issues?
In most cases, you can use hash tag based campaign tracking instead of parameter-based campaign tracking.
Have you got WWW vs. non-WWW issues?
Pick one and stick with it by redirecting the one to the other. You can also set a preference in Google Webmaster Tools, but you’ll have to claim both versions of the domain name.
If your problem isn’t that easily fixed, it might still be worth putting in the effort. The goal should be to prevent duplicate content from appearing altogether, because it’s by far the best solution to the problem.
301 Redirecting duplicate content
In some cases, it’s impossible to entirely prevent the system you’re using from creating wrong URLs for content, but sometimes it is possible to redirect them. If this isn’t logical to you, do keep it in mind while talking to your developers. If you do get rid of some of the duplicate content issues, make sure that you redirect all the old duplicate content URLs to the proper canonical URLs.
Using links
Sometimes you don’t want to or can’t get rid of a duplicate version of an article, even when you know that it’s the wrong URL. To solve this particular issue, the search engines have introduced the canonical link element. It’s placed in the <head> section of your site, and it looks like this:
<link rel="canonical" href="http://example.com/wordpress/seo-plugin/" />
In the href
section of the canonical link, you place the correct canonical URL for your article. When a search engine that supports canonical finds this link element, it performs a soft 301 redirect, transferring most of the link value gathered by that page to your canonical page.
This process is a bit slower than the 301 redirect though, so if you can just do a 301 redirect that would be preferable, as mentioned by Google’s John Mueller.
Linking back to the original content
If you can’t do any of the above, possibly because you don’t control the <head> section of the site your content appears on, adding a link back to the original article on top of or below the article is always a good idea. You might want to do this in your RSS feed by adding a link back to the article in it. Some scrapers will filter that link out, but others might leave it in. If Google encounters several links pointing to your original article, it will figure out soon enough that that’s the actual canonical version.