
Blocking comment spam in BlogCFC (or it could be adapted to others)

Want another tool to help battle blog comment spam? Here's an approach I use that may benefit others. I look for certain bad URLs being referenced in the comment, and if they exist I block the comment. Sure, there are other solutions. I've wondered for a while about sharing this code publicly like this, but I get enough people who've asked for it that I figure I may as well.

Update: Ray has clarified (in a comment) that BlogCFC does already have this functionality, in the "trackback spamlist" feature (on the Settings page of the BlogCFC Admin). I thought that had only to do with trackbacks, not comments. If you're using BlogCFC, you should use that feature to achieve what I describe here. But some of the thoughts and techniques may still interest some.

What's the problem, for bloggers and commenters, and why Captcha isn't enough

We all know that comment spam is the bane of our existence. How many times have we seen comments referring to wowgold or battery crap or some foreign characters we can't even read? Sure, captchas and other tools are intended to stop it, but some still gets by. These are often real people typing this in, so they get past tools that try to block automated entries. (I appreciate that some tools do still more. Check out the link above to learn more about them.) For those still interested, press on.

These spammers are clever: they'll repeat words from earlier in the blog entry, or from some other commenter, or even from some entirely different blog entry, hoping the blog owner won't notice that a shifty URL has been planted in the text (or the URL field of the comment form), all trying to get a little Google pagerank love for the URL they're pimping.

So I wanted to come up with my own solution that simply detected and blocked any comments with references to those bad urls. What I did works for BlogCFC (admittedly an old edition), but the concept can be of value to you regardless of the blogging software you may use.

And to be clear, this bane of blog comment spam is not just an annoyance for bloggers themselves, but also for blog commenters. Most blog software is set up to send us commenters a copy of any other comment someone posts. Even if a blogger is diligent about catching and deleting such comments (so they get no pagerank love from being posted), some of the damage is already done, since the fellow commenters on that entry did get the email.

My solution

Again, I wanted a solution that let me detect and prevent submissions of spammy URL references. There's no blacklist for keywords in the version of BlogCFC I have.

Even then, I realize some don't like doing blacklists of keywords anyway, since you can get false positives. Then there's the challenge that if you look for some words, the spammers just change them. But for the problem above, their goal is to get their URL listed.

So I was interested in looking only for URLs, not just any "words". Further, I want to check in both the content field and the URL field of the comment. (And if it meant I blocked someone who was merely mentioning one of these spammy URLs, in a helpful way, I'm willing to risk that false positive.)

My approach

So the way I do it is that I created a file to track the bad urls. When I get a comment that's got content that's spam, I put any domains it refers to into that file (and then delete the comment, of course).

Then before accepting any new comment, my code reads that file. Yes, it does so on each comment submission. I could optimize things, of course, by caching the file's contents for a period. I could also offer an interface to more easily add URLs to the badurl list file. I just haven't gotten to that. For now, I just edit it by hand, maybe a few times a month, now that I've gotten most of the common crap URLs under control.

About the blacklisted urls file

Rather than post the badurls.txt file here, I'll send it to you directly if you leave me a comment. (The comment form asks for your email address, which is not shared, and your URL. Tell me the URL of your CF blog.) I don't want to give away intel to the spammers, plus by my sending it along you'll get the latest.

Another thing I could do is create a service where the badurls file is kept and accessed/updated centrally. Again, I just haven't gotten to that yet, nor have I created a RIAForge project for this. I'll wait to see what people think.

The badurls file is really just one big long comma-separated list of bad domains. Here's just a sample of the first few entries (it's all on one line):

dedikodulu.net,acrobatajans.com,dosyapaylasim.net

Note that I don't bother using the full URL, and I even leave off the www. part, since some spammers use subdomains. Of course, I wouldn't add to the list a domain that looked like it could be legit. But if it looks suspicious, it's blacklisted.

What do the spammers see?

I don't tell the spammers that I'm rejecting them because of the spammy URL. I just report "Invalid request" as an error. I also happen to email myself when people attempt to send comments (in case they have problems with the captcha or for some other reason their comment doesn't make it), so I have fun watching how the spammers flail about trying again and again to get their crap in. :-)

I figure if it was a false positive and someone REALLY sincerely felt that their comment should be let in, despite their referring to one of these URLs and getting rejected, they could just contact me directly (I offer a contact link on my blog), or they may think to enter a plain comment. Again, these are rare instances, I think. The benefit of cutting down on spam comments has far outweighed the risk.

Update: With regard to the BlogCFC "trackback spamlist" feature, I'll note that it doesn't offer any feedback at all if a comment has a blacklisted keyword/url. It just closes the form as if it took, but the comment is not posted.

What do I do with the badurls file? Show me some code.

I drop the badurls.txt file into the blog root directory (typically blog/client in BlogCFC), in the same directory as the addcomment.cfm template. I then make just the following 3 edits to that addcomment.cfm template.

First, I add the following that reads the file in:

<!--- default to an empty list so the later check still runs if the read fails --->
<cfset badurllist = "">
<cftry>
   <!--- ought to cache this and refresh when file changes --->
   <cffile action="read" file="#expandPath('badurls.txt')#" variable="badurllist">
   <!--- the next line is just to test if the data in the file is in fact a valid CF list. if not, email me --->
   <cfset listerrcnt = listLen(badurllist)>
   <cfcatch>
      <!--- type="html" so the cfdump output renders in the email --->
      <cfmail to="whoever" from="whoever" subject="failure during blog addcomment, badurl list processing" type="html"><cfdump var="#cfcatch#"></cfmail>
   </cfcatch>
</cftry>
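
As for the caching mentioned in that code comment, here's a rough sketch of one way it might be done, in place of the plain file read. It assumes ColdFusion 8 or later (for getFileInfo) and an enabled application scope. The application.badurllist and application.badurlstamp names are made up for this example and are not part of BlogCFC. I haven't run this on my own blog, so treat it as a starting point only.

```cfml
<!--- Sketch only: keep the list in the application scope and re-read the file
      only when its last-modified time changes. The application.badurl* names
      are hypothetical. --->
<cfset badurlfile = expandPath("badurls.txt")>
<cfset badurlinfo = getFileInfo(badurlfile)>
<cfif not structKeyExists(application, "badurllist")
      or dateCompare(application.badurlstamp, badurlinfo.lastmodified) neq 0>
   <cffile action="read" file="#badurlfile#" variable="freshlist">
   <cflock scope="application" type="exclusive" timeout="10">
      <!--- set the timestamp first, so the list never exists without it --->
      <cfset application.badurlstamp = badurlinfo.lastmodified>
      <cfset application.badurllist = freshlist>
   </cflock>
</cfif>
<!--- the rest of the template keeps using the same variable name --->
<cfset badurllist = application.badurllist>
```

This still checks the file's timestamp on each comment, but that's cheaper than reading the whole file, and any edit to the file takes effect right away.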

Second, in addcomment.cfm, inside the block that handles adding a comment, which begins with this line:

<cfif isDefined("form.addcomment") and entry.allowcomments>

and after the first IF test that follows this comment:

<!--- tests to block spammers --->

I added this:

<cfif findList(badurllist,trim(GetHostFromURL(form.website))) or findList(badurllist,form.comments)>
   <cfset errorStr = errorStr & "- " & "Invalid request" & "<br>">
</cfif>
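
That test also calls GetHostFromURL, a UDF whose code isn't shown in this entry. If you don't already have such a function, here's a minimal stand-in as a sketch. The name hostFromURL is hypothetical (chosen so it won't clash with a real GetHostFromURL). It does not handle every URL form, such as ones with a user name before the host.

```cfml
<cffunction name="hostFromURL" returntype="string" output="false">
   <!--- hypothetical stand-in for GetHostFromURL, for illustration only --->
   <cfargument name="theurl" required="yes" type="string">
   <!--- strip a leading scheme such as http:// or https:// --->
   <cfset var host = reReplaceNoCase(trim(arguments.theurl), "^[a-z][a-z0-9+.-]*://", "")>
   <!--- keep everything up to the first slash, colon, question mark, or pound sign --->
   <cfreturn listFirst(host, "/:?##")>
</cffunction>
```

Since findList only looks for each bad domain anywhere inside the string it's given, you could also pass form.website to it directly. Pulling out the host first just narrows what gets searched.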

Sure, I could have done that in CFSCRIPT. Same with the next chunk coming up. Feel free to change it if that suits you. :-)

Needed (and created) a new UDF, FindList

You'll notice this calls a udf, findlist, which does something that surprisingly no built-in function does: searching one string for any of several items in a list. (For an explanation of how it differs from listfind and listcontains, see the version posted at CFLib. That udf is a little more complicated, as I expanded it based on some feedback from others.)

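<cffunction name="findList" output="false">
   <!--- FindList, from Charlie Arehart--->
   <cfargument name="valuelist" required="Yes" type="string">
   <cfargument name="stringtocompare" required="Yes" type="string">
   <cfset var found=0>
   <!--- var-scope the loop index so it doesn't leak into the variables scope --->
   <cfset var x="">
   <cfloop list="#arguments.valuelist#" index="x">
      <!--- trim each item, and skip blank ones, so stray whitespace or a trailing newline in the file causes no trouble --->
      <cfif len(trim(x)) and findnocase(trim(x),arguments.stringtocompare)>
         <cfset found=1>
         <!--- no need to check the remaining items --->
         <cfbreak>
      </cfif>
   </cfloop>
   <cfreturn found>
</cffunction>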
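
Here's a quick usage example of that UDF, using the three sample domains shown earlier. The comment text is made up for the illustration. Note how the first call matches even though the URL has the www. part, which is why the list holds only bare domains.

```cfml
<!--- sample list: the three domains shown earlier; the comment text is invented --->
<cfset samplelist = "dedikodulu.net,acrobatajans.com,dosyapaylasim.net">
<cfoutput>
   <!--- outputs 1, since acrobatajans.com appears within the text --->
   #findList(samplelist, "Nice post! See http://www.acrobatajans.com/deals")#<br>
   <!--- outputs 0, since none of the listed domains appear --->
   #findList(samplelist, "Nice post, thanks for sharing.")#
</cfoutput>
```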

Hope all that may help someone. Feel free to comment.

8 Adobe AIR Apps that DON'T Suck

Looking for some demonstrations of effective use of AIR? Check out 8 Adobe AIR Apps that DON'T Suck. Sure, it's got the oft-mentioned Ebay one, but lots more, including ones for Google Analytics, Pownce, Twitter, and Digg, that you may have missed.

Of course, many of those same examples (and more, though not all) are offered at the Adobe AIR showcase page. Still, the showcase is just a handful of highlighted examples, and some, such as the WebKut page clipping tool and the XDrive app, aren't listed. You can instead find those and dozens more at the AIR Marketplace. Curiously, a few of the makeuseof linked examples aren't even listed in the Marketplace site, including the Google Analytics (still in beta), Pownce, Twhirl, and Digg ones, to name a few. (That may be old news to avid AIR fans, but I thought I'd point it out for those on the periphery.)

Given that, and for those who don't look at the Showcase or Marketplace pages regularly anyway, I thought this blog entry worth highlighting. In any case, it's nice to see outsiders picking up the baton in the AIR race.

BTW, I'm a huge fan of the site it's posted at, makeuseof.com, as they offer really valuable resource links on a wide range of tech topics every day.

Want to simplify your BlogCFC (or other Lyla-based) captcha? Here's the XML file.

Want to simplify your BlogCFC (or other Lyla-based) captcha? Just grab this updated xml file:

right-click and save this updated xml file

If you're using BlogCFC, you can just drop it into your /blog/client/includes directory (saving your old one to restore if needed). You may also need to run the query string option "?reinit=1" to reload BlogCFC settings.

Your captcha will instantly change from this:

to this:

I've confirmed that the original captcha.xml is the same between releases 5.005 and 5.5 beta 1, where Ray is now including the changed XML file in the product itself.

For those curious about what I mean by "simplifying", a few weeks ago I wrote an entry explaining how you could simplify your captcha to just a couple of letters, with a much easier-to-read background and format. I also proposed why I think it's ok. We bloggers don't need to keep out really determined hackers (with a double-keyed deadbolt lock), we just need to keep out the annoying pests (with a screen door).

Since that post, many bloggers have indeed taken up the suggestion, but I have seen blogs where some commenters have pointed out my older entry, with the blogger saying, "I do plan to get to it". That other entry offered the specific steps to change the captcha.xml file, but if you haven't changed it yourself since implementing BlogCFC, just drop this in. Of course, if you want to do a comparison to make sure, there are lots of good compare tools. My favorite is BeyondCompare.

Do you blog? Do you identify yourself on your blog? Please do!

I'm so surprised by how many blogs I come across where the blogger has not identified themselves in any way: no name, no bio, no email link. I suppose some may do it intentionally, as some form of anonymity (and I do realize why some may not want to list their email), but I honestly think most just hadn't thought about whether to list their name or anything more.

I'd like to put out a plea to at least consider listing your name, either in your title ("clever name - by blogger name"), or just in some text below it, or in your toolbar. Better still would be a small bio, or a link to a page that has one. (Maybe it would help if blog software offered an "about" pod that made you think of it more readily.) A photo would be nice, too. And for reasons (and with cautions) I propose below, I recommend you also list your email address.

Why bother with name, bio, and/or email? Because it's in your interest!

There are a couple of reasons to consider it, and they help both you and your readers.

First, as for listing at least your name, a good reason is simply to associate yourself with all the value you create by your blog. Why not get credit for your work? Plus, many would really like to know who you are. (And if your blog software puts a tiny "by" under each blog entry, I'll argue that's not enough. I've missed that myself on more than one site.) Again, whether in the title, below it, or in the toolbar, just put it somewhere! :-)

As for a bio, again, even just a couple sentences about yourself (below the title or in the toolbar) can really personalize the blog. Don't assume everyone knows your background, even if they know you by name. Many readers will appreciate knowing more about where you work, where you're from, etc. Such details can also lend perspective to what you write about. (For instance, if you're a fan or a foe of something where that would color all of your posts, it can be helpful for people to realize, "oh, he works for them" (or on that open source project, or with that tool, etc.). Let people know where you're coming from.) But at least consider offering some background, even a single sentence.

Finally, as for your email address, someone may want to contact you to offer feedback that's not specific to a post. They may want to offer you work (and not want to announce that in a blog comment)--and even then, which post should they enter such a generic note to you in, anyway? Keep in mind that not all readers realize that you get notified of all comments by email, so they may give up trying to contact you.

Heck, they may even have trouble posting a comment, and therefore need *some* way to contact you. I've certainly seen that before.

But isn't it bad to post your email address online?

OK, I realize you may not want to offer your email, as spambots will capture it. But you've probably noticed more and more people listing their addresses as "name (at) domain". The thinking is that people can figure that out, but spambots (at least the dumber ones) will not. I'll grant that they'll eventually catch on. You just need to weigh how important the benefits are against the pain of more spam. (You do have a spam catching program, I hope? I love the one I use, Cloudmark Desktop. No, it's not free, but there are certainly many of them you can check out.)

Be careful using that (at) trick with Mailto links

If you do decide to use the (at) approach, but you also offer a mailto link, like:

be careful: you need to list the "anti-spammer" address in the mailto (used to launch the email) as well as between the a tags (as shown to the user). Spambots grab all the text on your page, not just what's "visible". This is a pain, because then in the email that's opened the user must notice that you've done this and change it, or the mail will fail to get to you. What I do is explain that to the user by forcing some body text into the mail that's opened. Did you know that was possible?

<a href="mailto:charlie (at) carehart.org?body=please change the spam-fighting email address format I filled in for you, replacing the (at)!">charlie (at) carehart.org</a>

And for those who maybe already knew about it, did you know that you could also force a line break within such content in an HTML email link? See:

http://tipicalcharlie.blog-city.com/forcing_a_line_break_in_an_html_email_link.htm

(This is from another blog of mine, typicalcharlie.com, which is for generic, non-CF tips.)
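
For the line-break trick, one common approach is to put a URL-encoded carriage return and line feed, %0D%0A, into the body text. This is only a sketch, and not necessarily what that other entry describes. The body wording here is made up, and how the break is shown can vary by mail client.

```cfml
<!--- %0D%0A is a URL-encoded CR+LF; the body text here is just an example --->
<a href="mailto:charlie (at) carehart.org?body=Please replace the (at) in the address above.%0D%0AThen type your message here.">charlie (at) carehart.org</a>
```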

So please, bloggers, step up and identify yourself. We'll all appreciate it!

Simplifying the captcha graphic in Lyla Captcha (and BlogCFC)

Wish you could simplify your captchas? If you use Peter Farrell's Lyla Captcha, as I do because it's embedded in Ray's BlogCFC, I'll show a few quick changes you can make that will make them much easier for your users to read.

Sound counter-intuitive? Aren't captchas supposed to be difficult to read, to hamper spammers? In my last entry, I made a call for simplifying captchas and explained why they aren't all bad. As a blog owner who uses them to weed out the random spambots who would otherwise clog my comments and feedback mechanisms, I like captchas, and I'm grateful for the work Peter's done.

That said, I have to admit that as I've encountered them in the blogs of others, I've grown a tad weary of their complexity. They require the user to type several characters and have several swirly ovals, random lines, and a wavy background. Frankly they're quite hard to read, and it would be a shame to lose commenters for that reason.

hard captcha

Again, the intent is to make it hard for some spammer to scan the captcha request somehow and figure out what's being requested so as to automate around it. Fair enough, but as I said in my last entry I'm really not that concerned about protecting my site from determined break-ins. I'm not a bank. I just want to keep out the automated pests.

With just a couple of changes to Lyla's captcha.xml file, you'll have a much simplified captcha, if you want one.

hard captcha

Lyla is highly customizable

On a lark, I decided to try to find out if Lyla might just be modifiable to dial down the intensity. Turns out it is, by simple changes in the lyla captcha.xml file, as documented in this PDF. Thanks again, Peter! :-)

After a few simple tweaks, I reduced my captcha to just asking for 3 characters, all lowercase, without all the swirly ovals, lines, and wavy background.

Changing Lyla's captcha.xml

In BlogCFC, the captcha.xml file is located in blog\client\includes (or just \includes if you've installed the blog client directory as your webroot.)

To effect the change I wanted, I ended up with the following values for the following entries. Again, see the docs for more info:

<config name="randStrType" value="alphaLcase"/>
<config name="randStrLen" value="3"/>
<config name="fontColor" value="dark"/>
<config name="backgroundColor" value="light"/>
<config name="useGradientBackground" value="false"/>
<config name="backgroundColorUseCyclic" value="false"/>
<config name="useOvals" value="false"/>
<config name="useBackgroundLines" value="false"/>
<config name="useForegroundLines" value="false"/>

You can change them to suit your taste. Note that if you do change the randStrLen, the value selected represents the "average" length of the string that users will be asked to enter, and may vary by +/- 1 from that.

Make the changes, and check 'em out for yourself. Note that with Ray's BlogCFC, you need to reinitialize the blog (add ?reinit=1 to your blog URL) to see the changes. What I did was keep one browser page open to do that, and another sitting on a blog comment form. After running the reinit, I could then just reload the comment page to see the impact. (If there's a still-simpler way to test changes to the captcha.xml, let me know.)

If you don't use BlogCFC, then you have to re-instantiate the captcha object after making changes to the XML file. If you've stored it in a shared scope (like application), you need to run some code that reloads it. Of course, restarting ColdFusion will also reload the CFC in whatever scope you stored it in.
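
As a rough sketch of that reload code: the component path, the config file path, and the url.reloadcaptcha flag are all assumptions here, and the init and setup calls are written as I recall them from the Lyla docs, so check those docs against your own install before relying on this.

```cfml
<!--- Sketch only. "captchaService", "captcha.xml", and the url.reloadcaptcha
      flag are assumptions; adjust them to match your own setup. --->
<cfif not structKeyExists(application, "captcha") or structKeyExists(url, "reloadcaptcha")>
   <cflock scope="application" type="exclusive" timeout="10">
      <!--- re-create the object so it re-reads the changed XML file --->
      <cfset application.captcha = createObject("component", "captchaService").init(configFile="captcha.xml")>
      <cfset application.captcha.setup()>
   </cflock>
</cfif>
```

With something like that in place, adding ?reloadcaptcha=1 to a page request would pick up the XML changes without a restart. On a public site you'd want to protect that flag so that not just anyone can trigger the reload.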

Conclusion

Making these changes won't solve the accessibility problems some have with captchas, and it certainly could increase the risk of a determined spammer more easily breaking your captcha. As I said in the last entry, I doubt that's a real concern for most of us. If it proves to be so, then you can dial the intensity back up.

I just want to keep from annoying my readers, and I hope others will consider these changes to keep from annoying us all. :-)

PS: I do realize that one could skip the captcha graphic entirely and just go to prompting the user for a random string. That may be just a bit "too" easy for a spambot to get around. To each his own.

Thanking Peter

One last note: while Peter certainly appreciates your kind comments (and do share them, as I'm sure many don't bother), those who REALLY appreciate his work should note that he gratefully accepts contributions by way of his Amazon Wishlist, or you may make a donation with PayPal, using his address, pjf@maestropublishing.com.

Captchas: making them simpler, and dialing down the angst against them

Most by now understand what captchas are. Some love 'em, some hate 'em. I want to dial down the rhetoric some with this perspective: as a blog owner fighting frequent spam in comments and trackbacks, captchas (in some form, not necessarily a graphic) have their place to keep out spambots, and they can indeed be simplified (even the graphics ones) and at no loss of benefit. My bottom line: I don't use them as a double-key deadbolt lock to keep out intruders, I just use them as a screendoor to keep out random pests.

If you use Peter Farrell's Lyla Captcha, which I use because it's embedded in Ray's BlogCFC, in the next entry I'll show a few quick changes you could make in the Lyla captcha.xml file to make them much easier to read, going from this (hard captcha) to this (simple captcha).

Before that, I just want to expand on those thoughts above on the general angst against captchas, and why I think it's ok to make them easier to read.

The Haters

I realize that some have gone to great lengths to decry captchas primarily because they are not "accessible" (to those using screenreaders), though audio ones help solve that.

Others simply hate them because they're too darned difficult to read. I've surely seen that, even in the ones created by default in Lyla (thus my next entry on addressing that).

Now, while most use a graphic that a user must read, it's not the only approach. As the previous link discusses, alternatives include simpler ones like asking the reader to add some numbers or answer a question (that only a human could reasonably do).

But the other complaint is that they give those who use them a false sense of security, because they can be easily broken, even the graphic ones.

But my Blog is Not a Bank

Here's the thing: my blog is not a bank. While the difficulty in breaking a captcha may be important to a bank or commercial site trying to use them for authentication, I just want to make it hard for an automated spambot to post crap in my blog comments and trackback forms. If you have any similar kind of input form on a publicly accessible site, you may suffer similar problems.

I really can't believe anyone would go to the lengths of scanning and breaking the captcha on my site (random as it is) to get a crap spam comment into my lil' ol' blog. And some of the comments are just nonsense; it's not like they're trying to drive traffic to another site or something--so the popularity of my (or your) site isn't the issue. It's just the annoyance factor (both to me as I get notified of comments and to readers who would have to sift through them if I didn't delete them as I do now).

Having made the case for why a simpler captcha may suffice for some purposes, in the next entry I'll show how to control the degree of difficulty in reading them for captchas built using Lyla Captcha.

BlogCFC was created by Raymond Camden. This blog is running version 5.005.
