I am a search engine optimization consultant and the SEO…
Updated: December 26, 2012
Google, Bing and Yahoo cracked down hard on duplicate content starting December 2010. Penalties hit hardest on February 24, 2011 in the Google Panda algorithm update. Bing and Yahoo rankings followed suit.
How To: The SEO Tools and Process to Address Duplicate Content
One of our SEO services clients developed multiple websites for different brands but recycled the content. Instead of writing 100% unique text for each website, the client reused paragraphs, and sometimes whole pages, across multiple websites. There was no noticeable revenue loss, so despite existing duplicate content penalties on interior entry pages (not actual penalties; more accurately, wasted crawl budget and possibly divided link juice), the client decided that rewriting all the content was not a big enough priority … until now.
February search engine algorithm updates penalized entire websites that have pages similar to any other site that the search engine credits as the originator. Even if words are rearranged and the brand name is switched out, the Google algorithm is not fooled. Google chooses one website as the originator and penalizes the others.
In late 2010, various rankings started to slip. On February 24th, clients with duplicate or similar content across different websites saw a total drop off for #1 ranked keyword phrases. Google guidelines for duplicate content indicate that the algorithm perceives these similar pages as “deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic”.
By the way, these guidelines were [initially] updated March 20, 2011, less than a month after the [first] Panda algorithm update.
Systematic Process of Identifying and Addressing Duplicate Pages
If you are intimately familiar with your websites, like this search engine optimization consultant is, you already know which pages are similar and possibly causing duplicate content penalties. If you are an SEO agency taking on a new client with duplicate content issues, leaving it up to you to figure out where the duplicates are within their online properties, then you may need a few SEO tools to help identify possible duplicate pages.
UPDATES: Google Panda Filter Dates
- Panda Update 1.0: Feb. 24, 2011
- Panda Update 2.0: April 11, 2011 (about 7 weeks later)
- Panda Update 2.1: May 10, 2011 (about 4 weeks later)
- Panda Update 2.2: June 16, 2011 (about 5 weeks later)
- Panda Update 2.3: July 23, 2011 (about 5 weeks later)
- Panda Update 2.4: August 12, 2011 (about 3 weeks later)
- Panda Update 2.5: September 28, 2011 (about 7 weeks later)
- Panda Update 2.51: October 9, 2011 (minor filter update about 2 weeks later)
- Panda Update 2.52: October 13, 2011 (minor filter update)
- Panda Update 3.0: October 19, 2011 (3 weeks later)
- Panda Update 3.1: November 18, 2011 (3 weeks later)
- Panda Update 3.2: January 18, 2012 (8 weeks later)
- Panda Update 3.3: February 28, 2012 (refresh to update index 6 weeks later)
- Panda Update 3.4: March 23, 2012 (refresh to update index 3.5 weeks later)
- Panda Update 3.5: April 19, 2012 (refresh to update index 4 weeks later)
- Panda Update 3.6: April 27, 2012 (refresh to update index 8 days later)
- This has become crazy to update. Just check the SEOMoz Google Algorithm Change History.
To determine if your website fell victim to a Panda filter, check your traffic. Panda “penalizes” your website by dropping your rankings, not just on pages with duplicate or thin content, but universally across your website – including your homepage. If you were hit by the Panda Filter, you will see a significant traffic drop on one of the dates above. Subsequently, if you address the issue, your traffic should be adjusted the next time the Panda filter runs. Each time the filter runs, it updates the Google index.
Update: February and March 2012 updates were merely when the Panda Filter was run again in order to refresh the Google index, so that changes made to websites will be reflected in Google. In other words, if you implemented fixes to get out of the Panda Filter before February 28 or March 23, you can check traffic at those dates to see if your fixes did the trick.
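If you can export daily visits from your analytics package, a short script can run this check for you. Here is a minimal Python sketch, not a definitive method. The function names, the 14-day comparison window, and the 20% drop threshold are my own assumptions for illustration, and only three dates from the list above are included.

```python
from datetime import date, timedelta
from statistics import mean

# A few Panda dates from the list above; extend with the rest as needed.
PANDA_DATES = [date(2011, 2, 24), date(2011, 4, 11), date(2011, 5, 10)]


def window_mean(daily_visits, start, days):
    """Mean visits over `days` consecutive days from `start`, or None if no data."""
    values = [
        daily_visits[start + timedelta(days=i)]
        for i in range(days)
        if start + timedelta(days=i) in daily_visits
    ]
    return mean(values) if values else None


def traffic_change(daily_visits, update_date, window=14):
    """Fractional change in mean daily visits after `update_date`.

    `daily_visits` maps date -> visit count (a hypothetical analytics export).
    Returns None when either side of the date has no data.
    """
    before = window_mean(daily_visits, update_date - timedelta(days=window), window)
    after = window_mean(daily_visits, update_date, window)
    if before is None or after is None or before == 0:
        return None
    return (after - before) / before


def flag_panda_hits(daily_visits, threshold=-0.20, window=14):
    """Return (date, change) pairs where traffic fell by at least `threshold`."""
    hits = []
    for update_date in PANDA_DATES:
        change = traffic_change(daily_visits, update_date, window)
        if change is not None and change <= threshold:
            hits.append((update_date, change))
    return hits
```

A drop that lines up with one of these dates is only a hint. Seasonality or a site outage can produce the same pattern, so treat the output as a list of dates to investigate.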
Identify Duplicate Content
Copyscape is a paid tool that makes the process simple. It is designed for finding copyright and plagiarism offenders. Since Google generally penalizes the copier and not the original author, plagiarism is not an SEO issue. For this reason Copyscape is not a regular part of the SEO arsenal of tools. Don’t ask me why as an SEO I even know about it, but I’ve known about it for years (guess that’s part of the multidisciplinary Blend SEO approach).
To identify duplicate pages with the paid version, called Copysentry, feed it your non-penalized website and let it find the duplicate content among your penalized websites.
Using free tools takes a little more time and effort. Download and install A1 Sitemap Generator, a great sitemap generator program with a fully functional free 30-day trial.
- Run a scan of your penalized website. It will generate a list of pages of the website, ignoring those blocked by robots.txt, following any redirects or canonical tags – meaning you have a list of webpages that spiders crawl.
- Among the sitemap output choices, you can create a text file list of page URLs with each URL on a separate line.
- Paste this list into Excel and start your search for duplicate content. Use your intuition and Google to check blocks of text. If a different website ranks number one for any block of text, that website is credited as the originator. Mark this URL and the originator URL.
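If you would rather skip the copy and paste, the text file from the second step can be turned into a worksheet directly. This is a rough Python sketch under my own assumptions: the column names and the `urls.txt` and `worksheet.csv` filenames are hypothetical, and the originator column is left blank for you to fill in by hand after checking Google.

```python
import csv

FIELDS = ["page_url", "originator_url", "needs_rewrite"]


def build_worksheet(url_lines):
    """Turn a one-URL-per-line export into rows for manual review."""
    seen = set()
    rows = []
    for line in url_lines:
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and repeated URLs
        seen.add(url)
        rows.append({"page_url": url, "originator_url": "", "needs_rewrite": ""})
    return rows


def write_worksheet(rows, out):
    """Write the rows as CSV to any open text file object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Example usage (hypothetical filenames):
# with open("urls.txt") as src, open("worksheet.csv", "w", newline="") as dst:
#     write_worksheet(build_worksheet(src), dst)
```

The resulting CSV opens in Excel with one row per crawled page and empty columns for the originator URL and your rewrite decision.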
Test or Check Duplicate Content
Once you have a list of pages from your penalized website and their counterparts on the originator website, you will want to check their similarity. Are they similar enough to require a rewrite? Run the URLs through a webpage comparison or duplicate content checker tool. After you go through several pages between two sites, you will eventually get a feel for where the cut-off is for rewrite versus no rewrite. Any pages with similarity higher than your cut-off require a writer to take a look for the duplicate or similar language. Similar content on the penalized website must be completely rewritten.
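To get a feel for what such a checker does, here is a small Python sketch that scores two blocks of text by the word sequences they share. It is one common technique (Jaccard similarity over word shingles), not the method any particular tool or search engine uses. The function names, the shingle size, and the default cut-off are assumptions you should calibrate against your own pages.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping `size`-word sequences in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=5):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def needs_rewrite(text_a, text_b, cutoff=0.10, size=5):
    """True when similarity exceeds the cut-off (placeholder default)."""
    return similarity(text_a, text_b, size) > cutoff
```

Because shingles overlap, swapping out a brand name only breaks the few sequences that contain it. The rest of the sentence still matches, which is why a simple find and replace does not make recycled copy look unique.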
Update: Unfortunately, my favorite duplicate content diagnostic tool has been abandoned.
This first similarity tool is my favorite. Some tools give only a single percentage, leaving you to wonder how much of that similarity is due to non-visible code. This one tells you, without inundating you with too much information. There’s no captcha, so checking through your list is quick. I embedded the form below, so you can try it here.
This next tool, embedded below, simply gives you a single percentage. Depending on the templates of your penalized and originator websites, the number you get will seem pretty low. My cut-off with this tool was about 10%. Anything over 10% required a copywriter to rewrite the page or at least a section of the page.
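Since template markup and scripts can skew a single percentage, it can help to compare only the text a visitor actually sees. The following is a minimal Python sketch using the standard library's `html.parser`. The class name and the list of skipped tags are my own choices, and a real page may need more careful handling.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside of tags that browsers do not render as page copy."""

    SKIP = {"script", "style", "head", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Running both pages through `visible_text` before comparing them removes most of the non-visible code from the score. Shared navigation and footer text will still count, so your cut-off remains a judgment call.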
This comparison SEO tool by SEO Book compares the page titles, meta information, and common phrases occurring on different pages.
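The common phrases part of that comparison is easy to approximate yourself. This Python sketch is my own illustration, not how the SEO Book tool works. It lists the three-word phrases that appear on both pages, and the function names and phrase length are assumptions. It does not look at titles or meta information.

```python
import re
from collections import Counter


def phrases(text, size=3):
    """Count every run of `size` consecutive words in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(
        " ".join(words[i:i + size]) for i in range(len(words) - size + 1)
    )


def shared_phrases(text_a, text_b, size=3, top=10):
    """Return up to `top` phrases found in both texts, with their counts."""
    # Counter intersection keeps the smaller of the two counts per phrase.
    common = phrases(text_a, size) & phrases(text_b, size)
    return common.most_common(top)
```

The phrases at the top of the list are a good place to point your copywriter first.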
Here are several more alternatives you can try.