If you blog or run a website with a comment function, I am sure you’ll be familiar with the concept of comment spam. Most people presume these ever-so-personal (not) messages are generated by some kind of script and submitted by robotic means but, believe it or not, they are mostly posted by people. This is largely because comment systems are getting better at detecting and tripping up comment spam scripts, so a commenter has to prove they are human before they can comment. I’m sure you’ve been frustrated by having to enter those weird combinations of odd words under the title of “CAPTCHA”.

Photograph of a can of SPAM
Image by MarcusQ on Flickr – CC:By-SA

So the people who want comments placed (usually because they contain links to their own or their clients’ sites) pay others to leave them. Care to take a guess where in the world these people live? Yep, in the same places we in the west have always exploited, the same places we have people work in sweatshops to make our trainers, the same places we outsource our call centres to, the same ones we have always exploited (yes, I know I said that twice): the countries where there are poor people who are driven to such tasks by their desperation to survive. Unlike in the call centres, the workers here don’t even have to speak the language they are commenting in. They will often use a script designed to create comments for them. Originally they were just provided with text to copy and paste, but because the comment systems began to detect the same text and block it, scripts were developed to create seemingly random but slightly coherent text to paste. It’s a fairly trivial task for a developer. I did one (very rushed and not particularly good) to create the platitudes for my platitude generator.

I had a comment posted today which made me smile and which also shows exactly what I am talking about. What the person posting the comment has done is paste the wrong text. Instead of posting the output of the script, they’ve somehow got hold of the text the output is generated from. It gives a good insight into this kind of spam and I thought I’d share it. I’ve truncated this, by the way, as the full comment was over 10 times as long.

So we start with

{I have|I’ve} been {surfing|browsing} online more than {three|3|2|4} hours today, yet I never found any interesting article like yours.

So in the first line you can see {I have|I’ve}. The { } indicates an option for the script, and the choices within that option are separated by |. In this case it means the script will randomly choose “I have” or “I’ve” for the comment text. You can also see how comments will vary between “more than 2 hours” and “more than three hours” etc. Again, this makes things harder for the spam-detecting filters. Whereas you and I can read a sentence like “I have been surfing online more than three hours today” and know it is very similar to “I’ve been browsing online more than 4 hours today”, a filter system will find that harder. Let’s look at the rest. See if you can spot the options and the choices.

{It’s|It is} pretty worth enough for me. {In my opinion|Personally|In my view}, if
all {webmasters|site owners|website owners|web owners} and bloggers
made good content as you did, the {internet|net|web} will be {much
more|a lot more} useful than ever before.|
I {couldn’t|could not} {resist|refrain from} commenting. {Very well|Perfectly|Well|Exceptionally well} written!|
{I will|I’ll} {right away|immediately} {take hold of|grab|clutch|grasp|seize|snatch} your {rss|rss feed}
as I {can not|can’t} {in finding|find|to find} your {email|e-mail} subscription {link|hyperlink} or {newsletter|e-newsletter}
service. Do {you have|you’ve} any? {Please|Kindly} {allow|permit|let} me {realize|recognize|understand|recognise|know} {so that|in
order that} I {may just|may|could} subscribe.
{Thanks|Blessings|Thank you}.
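
To show how little work the generating side takes, here is a minimal sketch in Python of a script that expands this { } and | notation. It is my own guess at the approach, not the spammers' actual tool: the names `spin`, `INNERMOST` and `TEMPLATE` are made up for illustration, and it assumes the braces are balanced and that nested options should be resolved from the inside out.

```python
import random
import re

# Matches an option that contains no further braces, i.e. the innermost one.
INNERMOST = re.compile(r"\{([^{}]*)\}")

# First line of the template quoted above (plain apostrophe used for simplicity).
TEMPLATE = (
    "{I have|I've} been {surfing|browsing} online "
    "more than {three|3|2|4} hours today"
)


def spin(template, rng=None):
    """Pick one choice per {a|b|c} option, resolving nested options first."""
    rng = rng or random.Random()

    def pick(match):
        # Split the option body on | and choose one alternative at random.
        return rng.choice(match.group(1).split("|"))

    while True:
        expanded, replaced = INNERMOST.subn(pick, template)
        if replaced == 0:
            # No options left (or only unbalanced braces), so we are done.
            return expanded
        template = expanded


if __name__ == "__main__":
    for _ in range(3):
        print(spin(TEMPLATE))
```

Each run prints three variations of the sentence. That one line alone has 2 × 2 × 4 = 16 possible outputs, and every extra option multiplies the total again.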

This kind of randomness might seem quite easy for us to detect, but to a comment-checking system it means the patterns are harder to stop and – more importantly – harder to distinguish from a real comment. To make it even harder for the filter, the spam-script source text is harvested from real comments or other text written by humans. Hopefully this will give you an insight into why sometimes even the best spam filter will let through comments which seem blatantly obvious to us.
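
As a rough illustration of why the same-text check stops working, here is a small Python sketch comparing the two example sentences from earlier. It is not how any real spam filter works, just two toy checks of my own: `fingerprint` stands in for a naive "have I seen this exact text before?" test, and `jaccard` is a simple word-overlap score. All the names are invented for the example.

```python
import hashlib
import re

FIRST = "I have been surfing online more than three hours today"
SECOND = "I've been browsing online more than 4 hours today"


def fingerprint(text):
    """Exact-match fingerprint: any change to the text gives a different hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def word_set(text):
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    print("same fingerprint:", fingerprint(FIRST) == fingerprint(SECOND))
    print("word overlap:", round(jaccard(FIRST, SECOND), 2))
```

The exact-match check sees two unrelated comments, while the overlap score comes out at roughly 0.46. That is high enough to look suspicious, but two genuine comments on the same topic could easily overlap that much, so choosing a cut-off that catches the spam without blocking real people is the hard part.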

  1. I’ve often had to resort to using a geo-IP based block – but unfortunately that often isn’t sufficient, as they will (or can) proxy traffic through UK-based servers. Then you start blocking content based on e.g. the HTTP Accept-Language header – which might work for a short while – or on certain keywords, or just not accepting hyperlinks.

    I know of one local company specialising in marketing/e-commerce/SEO that hires a large number of foreign bodies (in the Philippines) and pays them to create content which boosts the page rank of whatever their current customer is.

    Thankfully Akismet appears to do a good enough job on my own websites – combining that with manual comment approval stops it being a problem for me – but I’ve no idea how I’d cope if I was running a busy blog/site receiving >50 comments a day (I doubt I get more than 50 comments a year!).


