Why Web Scraping/Spinning is Back

This content has been archived. It may no longer be accurate or relevant.

From Plagiarism Today:

February 23, 2011, was a banner day in the fight against plagiarism and copyright infringement of blog/news content. It was the day that Google launched a major Panda/Farmer update that sought to reduce the presence of “low quality” content in search results.

Though the change was aimed at so-called “content farms”, sites that would pay human authors small amounts to churn out countless articles of questionable quality, it ultimately hit a variety of other unwanted content types including article spinning, article marketing and web scraping.

Prior to this update, many spammers found a great deal of success by simply taking content they found on other sites and uploading it elsewhere. This was done with or without attribution, with or without modification and almost always without permission.

However, after the update, there was a scramble to get away from all forms of questionable content marketing. Other equally questionable tactics rose up from the ashes, but web scraping seemed to be finished as a major concern for sites.

Unfortunately, nine years later (almost to the day), that is seemingly much less true. Now it’s easy to find scraped, plagiarized and otherwise copied articles in search results. To make matters worse, they often rank higher than the original.

So what happened? There doesn’t appear to be a clear answer. What is obvious is that Google (and other search engines) have a serious problem in front of them and the time to address it is now.

. . . .

In August, Jesselyn Cook at HuffPost wrote an article about “Bizarre Ripoff ‘News’ Sites” that were ripping off her work. There she provided several examples of her articles appearing on spammy sites with strange alterations to the text.

The alterations often made no sense. For example, “Bill Nye the Science Guy Goes OFF About Climate Change” became “Invoice Nye the Science Man goes OFF About Local Weather Change.”

To those familiar with article spinning, this is a well-worn tale. These sites are clearly using an automated tool to replace words with synonyms. The goal is to create content that appears, to Google at least, to be unique. Whether it’s human-readable is none of the site’s concern as long as they get those Google clicks (and some ad revenue). It’s a tactic that’s been around since at least 2004 and had a heyday during the late 2000s.
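
To make the mechanism concrete, here is a minimal sketch of context-blind synonym replacement, the kind of substitution that could produce the headline quoted above. It is an illustration only, not the tool these sites actually use: the `SYNONYMS` table and the `spin` function are hypothetical names invented for this example, and the table holds just the three swaps visible in the HuffPost headline.

```python
import re

# Hypothetical toy synonym table (an assumption for illustration);
# a real spinning tool would draw on a much larger thesaurus.
SYNONYMS = {
    "bill": "invoice",
    "guy": "man",
    "climate": "local weather",
}

def spin(text, synonyms=SYNONYMS):
    """Swap every known word for its listed synonym, ignoring context."""
    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word  # unknown words pass through unchanged
        # Carry over a leading capital so the result still looks like a headline.
        return replacement.title() if word[0].isupper() else replacement
    return re.sub(r"[A-Za-z]+", swap, text)

headline = "Bill Nye the Science Guy Goes OFF About Climate Change"
print(spin(headline))
# Invoice Nye the Science Man Goes OFF About Local Weather Change
```

Because the lookup has no idea that “Bill” is a first name here rather than an invoice, the output is technically different text that no human would mistake for careful writing. The quoted spam headline also lowercased “goes”, which this sketch does not try to reproduce.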

. . . .

The big question is “What changed?” Why is it that, after nearly a decade, these antiquated approaches to web spamming are back?

The real answer is that web scraping never really went away. The nature of spamming is that, even after a technique is defeated, people will continue to try it. The reason is fairly simple: Spam is a numbers game and, if you stop a technique 99.9% of the time, a spammer just has to try 1,000 times to have one success (on average).
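
The arithmetic behind that claim can be checked in a few lines. This is a rough sketch that assumes every attempt is independent and is blocked with the same 99.9% probability, which real spam filtering will not match exactly; the function names are made up for this example.

```python
def expected_successes(block_rate, attempts):
    """Average number of attempts that get through."""
    return attempts * (1 - block_rate)

def chance_of_at_least_one(block_rate, attempts):
    """Probability that at least one attempt gets through."""
    return 1 - block_rate ** attempts

block_rate = 0.999   # the technique is stopped 99.9% of the time
attempts = 1000

print(round(expected_successes(block_rate, attempts), 3))      # 1.0
print(round(chance_of_at_least_one(block_rate, attempts), 3))  # 0.632
```

So 1,000 tries yield one success on average, but under these assumptions any given batch of 1,000 has only about a 63% chance of producing at least one. For a spammer who can automate attempts cheaply, that is still a good bet.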

But that doesn’t explain why many people are noticing more of these sites in their search results, especially when looking for certain kinds of news.

Part of the answer may come from a September announcement by Richard Gingras, Google’s VP for News. There, he talked about efforts they were making to elevate “original reporting” in search results. According to the announcement, Google strongly favored the latest or most comprehensive reporting on a topic. They were going to try to change that algorithm to show more preference to original reporting, keeping those stories toward the top for longer.

Whether that change has materialized is up for debate. I, personally, regularly see duplicative articles rank well both in Google and Google News even today. That said, some of the sites I was monitoring last month when I started researching this topic have disappeared from Google News.

But, whether there’s been a significant change or not, it illustrates the problem. By increasingly favoring “new” content, Google opened a door for these spammers. After all, any scraped, copied or spun version of an original article will appear to be “new” when compared to the original.

Link to the rest at Plagiarism Today