Blocking comment spam in BlogCFC (or it could be adapted to others)

Posted At : August 3, 2009 12:45 PM | Posted By : Charlie Arehart
Related Categories: blogging
Comments: (11)

Note: This blog post is from 2009. Some content may be outdated--though not necessarily. Same with links and subsequent comments from myself or others. Corrections are welcome, in the comments. And I may revise the content as necessary.

Want another tool to help battle blog comment spam? Here's an approach I use that may benefit others. I look for certain bad URLs being referenced in the comment, and if they exist I block the comment. Sure, there are other solutions. I've wondered for a while about sharing this code publicly like this, but I get enough people who've asked for it that I figure I may as well.

Update: Ray has clarified (in a comment) that BlogCFC does already have this functionality, in the "trackback spamlist" feature (on the Settings page of the BlogCFC Admin). I thought that had only to do with track backs, not comments. If you're using BlogCFC, you should use that feature to achieve what I describe here. But some of the thoughts and techniques may still interest some.

What's the problem, for bloggers and commenters, and why Captcha isn't enough

We all know that comment spam is the bane of our existence. How many times have we seen comments referring to wowgold or battery crap or some foreign characters we can't even read. Sure, captchas and other tools are intended to try to stop it. But some still gets by those. These are often real people typing this in, so they get by tools that try to block automated entries. (I appreciate that some tools do still more. Check out the link above to learn more of them.) For those still interested, press on.

These spammers are clever: they'll repeat words from earlier in the blog entry, or from some other commenter, or even from some entirely different blog entry, hoping the blog owner won't notice that a shifty URL has been planted in the text (or the URL field of the comment form), all trying to get a little Google pagerank love for the URL they're pimping.

So I wanted to come up with my own solution that simply detected and blocked any comments with references to those bad urls. What I did works for BlogCFC (admittedly an old edition), but the concept can be of value to you regardless of the blogging software you may use.

And to be clear, this bane of blog comment spam is not just an annoyance for bloggers themselves, but also any who are blog commenters. Most blog software is setup to send us commenters a copy of any other comment someone posts. Even if a blogger is diligent about catching and deleting such comments (so they get no pagerank love from being posted), some of the damage is done in that the fellow commenters on that entry did get the email.

My solution

Again, I wanted a solution that let me detect and prevent submissions of spammy URL references. There's no blacklist for keywords in the version of BlogCFC I have.

Even then, I realize some don't like doing blacklists of keywords anyway, since you can get false positives. Then there's the challenge that if you look for some words, the spammers just change them. But for the problem above, their goal is to get their URL listed.

So I was interested in looking only for URLs, not just any "words". Further, I want to check in both the content field and the URL field of the comment. (And if it meant I blocked someone who was merely mentioning one of these spammy URLs, in a helpful way, I'm willing to risk that false positive.)

My approach

So the way I do it is that I created a file to track the bad urls. When I get a comment that's got content that's spam, I put any domains it refers to into that file (and then delete the comment, of course).

Then before accepting any new comment, my code reads that file (yes, on each comment submission. I could optimize things, of course, reading the file for a cached period. I could also offer an interface to more easily add URLs to the badurl list file. I just haven't gotten to that. For now, I just edit it, maybe a few times a month after having gotten most of the common crap URLs under control.)

About the blacklisted urls file

Rather than post the badurls.txt file here, you can leave me a comment (which will ask for your email address which is not shared and your URL. Tell me the URL of your CF blog), and I'll send it to you directly. Don't want to give away intel to the spammers, plus by me sending it along you'll get the latest.

Another thing I could do is create a service where the badurls file is kept and accessed/updated centrally. Again, just haven't gone to that yet. Nor even creating a Riaforge project for this. I'll wait to see what people think.

The badurls file is really just one big long list (comma-separated) of bad domains. Here's just a sampel of the first few entries (it's all just on one line):

dedikodulu.net,acrobatajans.com,dosyapaylasim.net

Note that I don't bother using the full url, and I even leave off the www. part, since some spammers use sobdomains. Of course, I wouldn't add to the list a domain that looked like it could be legit. But if it looks suspicious, it's black listed.

What do the spammers see?

I don't tell the spammers that I'm rejecting them because of the spammy URL. I just report "Invalid request" as an error. I also happen to email myself when people attempt to send comments (in case they have problems with the captcha or for some other reason their comment doesn't make it), so I have fun watching how the spammers flail about trying again and again to get their crap in. :-)

I figure if it was a false positive and someone REALLY sincerely felt that their comment should be let in, despite their referring to one of these urls and getting rejected, they could just contact me directly (as I offer a contact link on my blog, or they may think to enter a plain comment. Again, these are rare instances, I think.) The benefit for cutting down on spam comments has far outweighed the risk.

Update: With regard to the BlogCFC "trackback spamlist" feature, I'll note that it doesn't offer any feedback at all if a comment has a blacklisted keyword/url. It just closes the form as if it took, but the comment is not posted.

What do I do with the badurls file? Show me some code.

I drop the badurls.txt file into the blog root directory (typically blog/client in blogcfc), in the same directory with the addcomment.cfm template. In that file, I make just the following 3 edits to that addcomment.cfm template.

First, I add the following that reads the file in:

And in the addcomment.cfm test I place some more code for adding a comment which should go inside this line:

<cfif isDefined("form.addcomment") and entry.allowcomments>)

and after the first IF test for:

I added this:

Sure, I could have done that in CFSCRIPT. Same with the next chunk coming up. Feel free to change it if that suits you. :-)

Needed (and created) a new UDF, FindList

You'll notice this calls a udf, findlist, which does something that surprisingly no built-in function does: searching one string for any of several items in a list. (For an explanation of how it differs from listfind and listcontains, see the version posted at CFLib. That udf is a little more complicated, as I expanded it based on some feedback from others.)

Hope all that may help someone. Feel free to comment.

For more content like this from Charlie Arehart:

Signup to get his blog posts by email:

Follow his blog RSS feed

View the rest of his blog posts

View his blog posts on the Adobe CF portal

Need more help with problems?

If you may prefer direct help, rather than digging around here/elsewhere or via comments, he can help via his online consulting services

See that page for more on how he can help a) over the web, safely and securely, b) usually very quickly, c) teaching you along the way, and d) with satisfaction guaranteed

Comments (11)

Related Blog Entries

Simplifying the captcha graphic in Lyla Captcha (and BlogCFC) (August 17, 2006)
Captchas: making them simpler, and dialing down the angst against them (August 17, 2006)

Comments

Years ago, I read an article (by you?) mentioning how it often made sense to use cfinclude instead of cffile to get the contents of a file that didn't contain any CF code as cfinclude would be much faster to execute.

Would that not apply here? I would think that would make getting the bad URL list quite fast.

# Posted By Steve Bryant | 8/3/09 1:55 PM

Hi Charlie,

I'd be curious to know how often you see "repeat offenders"...that is, comments that are blocked because of your blacklist. I like this idea, but I'd guess that the majority of spam comments would get past this because they change domains and create new ones so often. Even so, if it helps block some spam, it's a valuable addition to the arsenal.

# Posted By Jake Munson | 8/3/09 3:22 PM

Curious - Charlie, why didn't you just add the URLs to the main BlogCFC spam list? I've been doing that for years now. In fact, a lot of my spam 'word' list is actually just the bad URLs. That requires no modding at all. :)

# Posted By Raymond Camden | 8/3/09 3:37 PM

Let me address the 3 in 1:

@Steve, I did do an article in the past on some "secret powers of CFINCLUDEs", but I don't recall asserting that it was faster than CFFILE, no. There were some interesting other surprises that I wrote about. I'm not aware of any assertion that CFINCLUDing a file is faster than reading it with CFFILE. Plus, doing the CFINCLUDE wouldn't turn it into a variable that could be tested. You'd have to wrap it in a CFSAVECONTENT. Just doesn't seem worth it to me. :-)

@Jake, you ask how often I see offenders. I had said something to that effect (though I just updated it to fix a missing word and add the parenthetical statement):

"I also happen to email myself when people attempt to send comments (in case they have problems with the captcha or for some other reason their comment doesn't make it), so I have fun watching how the spammers flail about trying again and again to get their crap in."

You then say, "I'd guess that the majority of spam comments would get past this because they change domains and create new ones so often". Well, I suppose that could be true, but all I care about is that once they've been identified (by me) as a spam URL, I block them, and they never get in again. I don't worry about evaluating if a new one is a replacement of an old one. I never (ever) visit their sites so don't care to do that comparison. I'm not saying that you think I should, but do you see why I just don't worry about it?

But I think you're trying to say "this seems a sysyphian task, constantly pushing the ball up the hill of adding URLs, only to have it roll down the other side under the weight of new ones". I just don't see it that way. It works: it blocks the spammers once I identify them.

It's kind of like what I wrote in my earlier blog entries on captcha a couple years ago (I'll add them as related entries above the comments): I don't regard it as a door lock (that can be broken with modest effort) but more like a screen door, just keeping out the random pests. So I don't worry if this solution works perfectly. :-) Like you say, "if it helps block some spam, it's a valuable addition to the arsenal." Amen.

@Ray, I'm running 5.005. The closest thing I see in the settings is a "trackback spamlist". I assumed that had something to do with trackbacks, so never assumed it solved this problem. I didn't spend too much time digging into the code to see what that may do. As you can see, mine were just a few lines of code.

More curious, though, if indeed that's what it's for, I'll note that I have proposed this solution to other blogCFC users (directly and on a list, though not your forums) and no one had ever suggested that as a solution. I sure wish they had. :-)

Anyway, the good news is that the code may help someone else (and it caused me to create that findlist UDF, which I was so surprised to see no one ever had, and that there was no equivalent in CFML.)

# Posted By Charlie Arehart | 8/3/09 4:08 PM

Yes indeed - the TB spam list _also_ applies to comments. I should see if the latest admin makes that more clear - it might not.

# Posted By Raymond Camden | 8/3/09 4:22 PM

Are you running cfformprotect? I was not because I was on an older version of BlogCFC and I used to get ton of comment spam. After updating I can tell you that cfformprotect has virtually eliminated comment & send spam. This is my first suggestion to anyone using forms, it really does help!

# Posted By Dan Vega | 8/3/09 4:52 PM

Hi Dan, no, I do not. I have of course known of it. I didn't want to try to figure out how to incorporate it into the very old version of BlogCFC I had, and again, this was just a few lines of code.

I'll note that I did mention in the opening of my entry here that of course there are other solutions, and I linked to my CF411 category for them, where certainly CFFormProtect is offered.

I also said, "I appreciate that some tools do still more." :-) For those interested in more on CFFormProtect, which does a lot to stop both automated and even human spammers, see http://cfformprotect...

If anyone else is running 5.005 of BlogCFC and has implemented it, I wouldn't mind a heads up of where and how you did. I suppose it could be in the same place I've done what I did above. Otherwise, I may well find some time to try to dig in and add it myself.

Better still, some day I'll just update my old BlogCFC version and get its many improved benefits (CFFormProtect was added to it late last year, per http://techfeed.net/...).

# Posted By Charlie Arehart | 8/3/09 7:23 PM

I did see your comment in your spam post about the spammers flailing against your protections. I guess I was wondering more what percentage of your spam comments are blocked by this protection. I am not really saying that you (or others) should not use this method...as you probably know, CFFormProtect is made up of numerous prevention methods that individually are not super strong (well, except for Akismet), but added together they are very effective. So I was just curious how effective this method is.

Also, if you don't mind, I might reference this method in my CFUnited presentation (giving you credit, of course).

# Posted By Jake Munson | 8/3/09 9:20 PM

I rick roll comment spam. It has been very effective.

http://www.myinterne...

# Posted By Gerald Guido | 8/4/09 10:17 AM

Charlie,

You sure didn't make that claim. Not sure why I remembered it that way.

Having brought up the topic, though, I decided to find out which one is faster:
http://www.bryantweb...

# Posted By Steve Bryant | 8/6/09 10:51 AM

@Jake, I don't really have numbers to confirm. All I know is that once I slam the door in the face of one of these pests, they no longer come in. But I don't doubt the value of CFFormProtect (Jake's tool). I recommend people look into it. It tries to block both automated and human spammers. More at http://cfformprotect...

@Gerald, sounds interesting. Of course, I think most won't bother to take you up on the testing, since your other blog entries on the topic suggest you will take some retribution. :-) But if you do ever document what you do, feel free to post that here and I'll add it to the CF411 list of CF-based spam-catching tools.

@Steve, thanks for sharing. I see that you say there that the CFFILE was in fact faster. That's what I would have expected, myself.

We've all been there, coming to a conclusion (or taking what someone says, or we think they said) and holding to it for some time before really confirming it. Your tests are one way to go.

Then again, as you say in your opening paragraph, even such observations aren't always conclusive. Many factors can influence whether the observation would always be true. More than that, there's the very impact of micro-benchmarking (testing something in a loop) which may not really equate to what would happen in the real world. This has been debated often (in the CF world and wider IT world). I don't really want to open that debate here, and in fact I chose not to offer this on your blog entry for the same reason.

Not picking on you at all, Steve. :-) Just wanted to put out there that it's something to be cautious about. You may have had it in mind in your thoughts in that first paragraph. (For others interested, a google of microbenchmark will find many articles identifying the problems, though not focused on CF specifically.)

# Posted By Charlie Arehart | 8/6/09 12:33 PM

Sun	Mon	Tue	Wed	Thu	Fri	Sat
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31