
Publish a blog? Here’s why websites that scrape content are a pain. Why you shouldn’t do it. Why you shouldn’t ignore it, and some admittedly juvenile measures to take if it happens to you.

I originally wanted to title this post “Why Websites that Scrape Content are a Pain in the Ass”. Which is probably more accurate, albeit a little salty for a blog title. Because they are. A. Pain. In. The. Ass. As a professional who earns his living selling what amounts to intellectual property, I’m of the opinion that copyright is sacrosanct. For me, you and any other ‘creative’ who creates, well, anything. Logo design, illustration, photography, articles, whatever. Flipped around, I’m not terribly keen on people pinching our stuff, or our clients’ stuff, whether it be logos (especially, as is often the case, for submission to a logo design contest) or, as in the instance we’ll take a look at, a blog post article.

Blogging is all about the love

If you publish a blog, I’m sure you put a helluva lot of love and effort into keeping it fresh. You’re passionate about graphic design, photography, fishing, model trains, politics or whatever your personal obsession happens to be. As the studio blog of The Logo Factory, our humble ramblings are usually about logo design and matters related to the graphic design industry (even if it’s six degrees of separation). Have my special interests too. I’m a pretty vocal critic of logo design contests and what have you. I try not to be terribly frothing in my rants, attempting (you can judge how successfully) to write my anti-spec articles in an even-keeled manner. Quoting Ross Kimbarovsky (co-founder of Crowdspring, someone I rarely agree with when it comes to the design profession, and debate with quite often), I try to be a “critic” as opposed to a “hater”.

Think pieces take time.

Full Metal Jihads are easy to write. Level headed pieces that remain passionate, not so much. They take time. Sometimes an awful lot of time. They have to be edited several times over, lest I say something that wanders too close to libel. I need to check my facts. Not that I’m complaining – upkeep of this blog is part of my job as creative director of The Logo Factory and something I’m passionate about. I’m not the best writer in the world and articles often descend into meaningless blather, too long to read for everyone but the most interested viewer. On the other hand, there are posts that, while certainly not brilliant or Pulitzer caliber, I’m proud of. The Platitudes of Spec Work, for example, is probably one of my better articles. I took a great deal of time, and put in a lot of effort, to discuss a highly controversial topic of the day – advertising agency CP+B doling out one of their client’s logos via a design contest. Won’t rehash here, you can read at the link if interested.

One article. Three blogs. None ours.

Understanding all of this, imagine how thrilled I was to discover that this article had been copied and pasted onto somebody else’s blogs, not once, not twice, but three times. And that if you searched for this article in Google, the website on which the copied version sat outranked my original. And if that wasn’t enough, each of the copied articles hotlinked to two images on our blog, effectively stealing bandwidth and resources from our server. Some dopey amateur? Hardly. This was an A-list art director, head of his city’s Creative Director’s Guild and someone who claimed to have been a design professional for more than a decade. In essence, a competitor (as much as one graphic designer is a competitor against another) was using my article, my images and my server resources, in order to promote his company through search engines. Worse, if you looked for the article via Google, you wouldn’t find our site. You’d find someone else’s. The whole idea of our blog is to promote our design studio. Here was another design studio, using my work, to promote themselves. Not cool at all.

Avoiding the name & shame.

Gonna cut the guy some slack and not mention his name. Hell, I’ll even blur out the identifying info in the screengrabs (he removed the copied work once I contacted him so I want to be fair). This isn’t an isolated case, but an issue that we’ve been dealing with for years now. As have a lot of bloggers that I know. I think it’s worthwhile to look at the issue of content scraping (taking copy from another website in order to rank in search engines), from both a content producer’s and a content copier’s point of view. Why, when you find that someone’s duplicated your content, you should go after them. And why, if you’re thinking about copying someone’s stuff, you should probably think again.

Copyright applies to blog posts too.

Many people, some well-intentioned, others not so much, seem to believe that anything that’s published on the internet is somehow public domain, devoid of copyright protection. Alas, that’s demonstrably not true. Even if a copyright notice were needed (which it’s not), most blogs (including this one) feature one somewhere on the page (look w-a-y down at the bottom. It’s there).
If you didn’t know, copyright means, quite literally, the right to copy. And in most cases, the only person who has the right to copy is the person that wrote the article, designed the blog, took the photograph, drew the picture, designed the artwork. Just because a blog article is posted on the internet does not mean it comes with a “please put this article on your blog” or that copyright, which most people have a basic grasp of, doesn’t apply. This isn’t snootiness. Or being uncooperative. Or being some type of elitist who won’t get with the program. Most blog authors, myself included, are perfectly cool with someone taking a few paragraphs (with credit and link, natch). That’s the natural give-and-take of the internet. I’m perfectly fine, quite flattered actually, when people ask my permission to reproduce any of my verbal meanderings. On several occasions I’ve even rewritten the requested article, updating copy with additional or new information. On the other hand, taking an article whole cloth without permission is simply not cool. From the “letter of the law” department, taking an article and putting it on your own blog or website isn’t “stealing” per se. It’s copyright infringement. Whether it’s stealing under the “spirit of the law” is another matter entirely.

The copyright infringing, passive-aggressive, dating ritual.

Confronting someone who’s plagiarized your material is almost a dating ritual. This time was no different. Predictably, he went down the tired-and-true list of excuses that people who get caught pinching other people’s IP go down. “I don’t know how it got there” becomes “someone else [insert employee, girlfriend, boyfriend, some freelance guy that we fired last week] did it”. Which leads to me going down the tired-and-true reasons why they were categorically full of crap. In this particular instance, our blog has this JavaScript thingy (not really sure how it works) that automatically puts a “read more” text with a link at the bottom of a cut-and-paste.
Nothing cloak-and-dagger, but an attempt to get a back-link when someone posts a quote somewhere else. Sure enough, at the bottom of one of these blogs, there it was. This guy (we’ll call him Sam) hadn’t noticed that the link was there for over five months. Not that I needed ‘proof’ – I had the original – but that bit of text was a slam dunk. Further, when the article was first published, Sam had recommended it through his Twitter account, linking to our site, and the original article. That was easy to find too. That Tweet date was after the publication date of the original article, but before the publication on any of his blogs. Gig is up, sparky.

It’s all about the permission.

Now that we’ve established the “who owns what and who copied it” part of the relationship, we move on to the “justification” part of this happy little dance. Usually along the lines of “but I credited/will credit you” or “I gave/will give you a link”. That’s all nice and stuff, but links and/or credit aren’t a panacea against copyright. In fact, they’re completely and utterly irrelevant. Copyright is about permission and the right to copy. Links and/or credit are about being cool and hat-tipping the original author, designer or writer after you’ve already obtained permission and that right to copy. Or, you’ve found an interesting article, quoted a few bits and pieces and then linked to the original. Or, as in this case, “I’m a dick that copied your stuff and I’m trying to rationalize my bad behaviour”. A few problems with this. First, the article you wrote to promote your company or website is on someone else’s site and promoting their company or website. A credit, unless there’s some context (ie: “so-and-so is an expert at X, and we recommend you hire him here”) means nothing. Any link that is featured (which, if it’s on a blog platform, will probably carry a “nofollow” attribute, rendering any SEO benefit worthless) will either be too far down the page to be of any benefit or, for the readers who do find it, probably won’t be active. No clicky. No linky.

How copied articles can ruin your search engine rankings.

Remember this. No matter how hard people bleat that they copied your material to do you a favor, the simple truth is they nicked your stuff to promote themselves, more than likely through search engines. They’re trying to steal your traffic. And if you wrote that killer article to get eyeballs to your site, they’re trying to steal those eyeballs. They’ve gone to all this effort to get people to their site. They’re hardly going to invite their visitors to go to yours. The trouble is that these people are often using your content on very high-ranking domains like Blogger, WordPress or other “social-media” sites, and these sites will (as in this case) beat out your site in the search engine game. It’s highly likely that the copied material will come up higher in search results, and you’ll lose a hefty portion of the readers you spent hours creating that article to attract. In the case we’re using as an example, one of the copied versions beat out my original in a search, even when I typed in the somewhat convoluted title. Verbatim.
None of the sites told us who wrote the article (me), or where it came from originally (here). Of course, no-one is going to be using the keywords “Platitudes of Spec” to search for anything, but that’s the point. Even when I was looking specifically for that article, I still wouldn’t have found the original. And it gets worse. While there’s some debate about whether a ‘duplicate content’ Google filter exists, or how draconian it is, it’s quite possible that not only will the copied material beat you in ranking, but Google may remove your page from the search results entirely, the result of a duplicate content penalty. That’s enough of a reason not to ignore people when they pinch your stuff.

Getting material removed. The not-so-nice way.

Once we’ve gotten through this “don’t know how this happened” rite of passage, we move on to the “take it down” part of our courtship. Some suggest that a politely worded e-mail will do the trick. Oddly, people who’ve pinched stuff put up a surprising amount of resistance to this demand, regardless of how politely worded it is. Which is weird in one respect – it’s not their property – while perfectly understandable from another. They pinched your copy in order to get content for search engines. They found your copy through a search engine, via a keyword search that appeals to them. By now, they’ve probably achieved some of the rankings they’re after and aren’t too keen to give them up. Tough titty. They’ll take your material off because if they don’t, you’re going to complain loud. And often. And very publicly. Via the internet, the very place they’re using your content to market themselves. And if your site is ranking well enough to appeal to copy thieves, it’s ranking well enough to draw attention to their nefarious activities. Top illustrator Von Glitschka is a classic example of how this works. Von’s one of the most copied designers I know, and he keeps his Rogue’s Gallery and Twitter feed up to date with people who have absconded with his work. We kinda do the same thing with our Copycats section, but that isn’t quite as current (and awaiting reworking into the new format). I’ve found that “name and shame” works infinitely better than ANY DMCA takedown notice. As long as you use lots of “allegedly” and “looks like” and “could be argued”, you should stay clear of the lawyers’ gunsights. Truth is a remarkably effective antidote to bullshit.

Why all the copying?

Why all the content copying and plagiarism? That’s the easy part of the equation. Blogs are voracious consumers of content. Due to their time-sensitive nature, one of the SEO benefits of writing posts is that they’re indexed at a rapid clip by most search engines. A well written post can skyrocket to the top of the search engine results within minutes.
If the post has a ‘hook’, it can get picked up on Twitter and bounced around the internet for a few hours. Or days. That’s the upside. The downside is that this lift is often temporary, and the blog post will “settle down” through the organic search results as newer, similarly themed content, comes online. Twitter is extremely fleeting, so your killer blog post will be forgotten, just as quickly as it was retweeted by everyone and their brother.

It’s all about the money.

After that initial hoorah, it sometimes takes a few weeks for a blog post to get indexed permanently in search engines, so in the meantime we’ll have to launch another post. Which will get a boost and then settle. And so on. Trouble is, as we discussed earlier, writing actual articles, as well as assembling accompanying illustrations, takes time. Which equates to money. If you’re writing your own blog posts, that means less time spent on other money-making activities like servicing clients. If you’re paying someone to write your blogs, that equates to money because, well, you’re paying someone to write blogs. More blogs equals more money. Most commercial sites are hesitant to use Google AdSense to finance their blogs – why go to the trouble of getting people to a site, only to invite them to click through to another, often for mere pennies in ad revenue, when the real purpose is to get them to buy something for a whole lot more? One cost-cutting solution? Copy someone else’s material. If it’s from the same general area (graphic design for example) the SEO benefits will work quite nicely. It would be a great plan if it didn’t have one major flaw.

Finding copied stuff on the internet is easy. For now.

See, here’s the point. Finding out who’s honking on your material is a relatively easy thing to do. All it takes is typing a couple of keywords into a Google search bar. There’s even a service, Copyscape, which will troll the internet looking for instances of your prose (doesn’t have to be full pages, Copyscape will find sentences and paragraphs with uncanny accuracy). Even better, Copyscape will allow you to run a few searches for free. Have a go. If you publish a blog, you’ll probably be surprised by what you find.
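If you’d rather check a suspect page yourself, the sentence-and-paragraph matching described above can be roughly approximated by comparing overlapping runs of words (“shingles”). What follows is only a minimal sketch under my own assumptions, not how Copyscape actually works: the function names, the five-word window and the sample strings are all made up for illustration.

```python
import re


def shingles(text, size=5):
    """Return the set of lowercase word n-grams ("shingles") found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Texts shorter than `size` words produce an empty set.
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_fraction(original, suspect, size=5):
    """Fraction of the original's shingles that also appear in the suspect text."""
    source = shingles(original, size)
    if not source:
        return 0.0
    return len(source & shingles(suspect, size)) / len(source)


# Hypothetical sample texts, for illustration only.
original = ("Full Metal Jihads are easy to write. "
            "Level headed pieces that remain passionate, not so much.")
suspect = ("Welcome to my blog! Level headed pieces that remain passionate, "
           "not so much. Hire us today.")
unrelated = "Model trains are a relaxing hobby for people who enjoy building tiny landscapes."

print(copied_fraction(original, original))      # identical text scores 1.0
print(copied_fraction(original, suspect) > 0)   # a lifted sentence is detected
print(copied_fraction(original, unrelated))     # no shared five-word runs
```

A score near 1.0 suggests a wholesale copy, while a small non-zero score points at a lifted sentence or two. Heavily reworded copies will slip past a check this simple.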

Content scrapers are getting smarter.

Content scrapers are getting wise to anti-plagiarism tactics. There are tons of job postings on sites like Scriptlance and Getafreelancer requesting copy editors who can “get around Copyscape”. These so-called editors will take relevant blog posts and articles (lifted from other blogs) and rewrite them, word by word, stuffing one keyword after another into the new version. As many of these ‘copy-editors’ understand English as a second language, this rewording often ends up with some strangely mangled results. This isn’t limited to one-off copy editing jobs either. There’s a fairly high-profile logo design company who’ve started their own cottage industry of pinching people’s copy and rewording it, often into almost indecipherable gibberish, in order to generate content for their network of ‘shell blogs’. These blogs are set up to send link love to the Mothership website so whether the reworded content makes sense isn’t really the point. Which is good. Cause it doesn’t.
I found some of our material, pinched from our illustrative logos article, that was a thinly reworded version of the original. Trouble is, whoever was tweaking the original had given up half way through, and a hefty bit of the piece was published on their WordPress splog intact, which enabled me to find it and ask for its removal. And it’s not just our humble shop. I think designers would be amazed at how much copy, and how much of their artwork (particularly logos, lifted from gallery sites like Logopond and Brandstack), is re-purposed as content for blogs whose sole purpose is to inflate the SEO rankings of an online logo design company, and more recently, their design contest spec site.
But we’ll leave that for another time.

Hotlinking images on your blog? Not a good idea.

Hotlinking to an image (embedding an HTML link that resolves to the image on someone else’s server, as opposed to placing the image on a server you maintain and linking to that) is not a good idea. Two reasons. The first is that it’s not cool in a “can’t we all get along” kind of way. If you hotlink to an image, you’re using up the originating server’s bandwidth. On a commercial server with pretty high bandwidth limits, that’s not a big issue. On smaller co-hosted sites that might have bandwidth caps, you’re essentially stealing their bandwidth, and if a lot of people bring up your page, you might just get the original page disabled once their monthly bandwidth allocation has been used up. In a karmic sense, that falls on the ‘not cool’ side of the fence.
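To make the definition concrete, here’s a rough sketch of how you might spot hotlinked images in a page’s HTML: collect every image source and flag the ones that resolve to a different host than the page itself. The function and class names, and the example.com / example.org addresses, are placeholders I made up for illustration. It uses only Python’s standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def hotlinked_images(page_url, html):
    """Return absolute URLs of images served from a host other than the page's."""
    page_host = urlparse(page_url).hostname
    collector = ImageCollector()
    collector.feed(html)
    found = []
    for src in collector.sources:
        absolute = urljoin(page_url, src)  # resolves relative paths like /img/a.png
        if urlparse(absolute).hostname != page_host:
            found.append(absolute)
    return found


# Hypothetical page: one locally hosted image, one pulled from another server.
page = ('<p><img src="/img/local.png">'
        '<img src="https://original.example.org/art/logo.png"></p>')
print(hotlinked_images("https://copycat.example.com/post", page))
```

The host comparison is deliberately naive: it treats a site’s own image subdomain or CDN as a different server, so take the results as a list of candidates rather than a verdict.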

Hotlinking images gives someone access to your webpage.

Okay, let’s say you don’t give a shit about using someone else’s resources, karma or any namby-pamby “can’t we all play nice” vibe. Using the same logic as spammers, you’ll do what you can get away with, because you can get away with it. Nyah, nyah. Fair enough. But you shouldn’t hotlink images either. Hotlinking images gives the person who originally uploaded the images to their server a direct pipeline into your webpage. All they have to do is change the original image on their server, and voilà, it changes on yours. And that opens up all sorts of creative avenues. By spending one minute in Fireworks, changing two file names and then updating my original ‘Platitudes’ article, the cat who had pinched my stuff now announced to the world that he wasn’t the sharpest tool in the virtual shed. On all three of his blogs.
Juvenile? Perhaps. But that’s the point. By poaching someone’s images, you’ve left yourself open to all sorts of pranks, even if it’s by a cat (me) who really should start acting his age (guilty as charged). Should the websites that linked to the images have placed them on their own server and published locally instead? Not really. See, the image in question is a stock image. I paid for a license to use it on this blog and publishing it on another blog is violating the stock agency’s terms of service.

Taking it legal. How to file a DMCA takedown.

Let’s say you’ve run into a really persistent copycat. Won’t take your article down, won’t answer your e-mails, and doesn’t seem to mind that you’ve substituted a rather provocative image on his, or her, webpage. Take it up a notch, with a DMCA takedown, sent straight to Google. Copyscape offers a simple step-by-step approach, starting with, tah-dah, the polite e-mail.

[Footnote: This article was originally posted on our (now) Legacy Blog and moved to its current location for consistency and database functionality. While it was accurate at the time of publication, it is currently posted as part of our historical record and details may have changed.]