Some code to throttle rapid requests to your CF server from one IP address
It's just a rough cut. I haven't thought it through thoroughly (wow, how's that for an alliteration!). Still, while I know there are a couple of concerns that will come to mind for some readers, and I try to address those at the end, it does work for me and has helped improve my server's stability and reliability.
Background: do you need to care? Perhaps more than you realize
As background, in my consulting to help people troubleshoot CF server problems, one of the most common surprises I help people discover is that their servers are often being bombarded by spiders, bots, hackers, people grabbing their content, RSS readers, or even just their own internal/external ping tools (monitoring whether the server is up). It can be that there are many more than they expect, that they come more often than they expect, or that they come extremely fast (even many times a second). This throttle tool helps deal with the latter.
Why you can't "just use robots.txt and call it a day"
Yes, I do know that there is a robots.txt standard (or "robots exclusion protocol") which, if implemented on your server, robots should follow so as not to abuse your site. And it does offer a crawl-delay option. The first problem is that some of the things I allude to above aren't bots in the classic sense (such as RSS readers, ping tools). They don't "crawl" your site, so they don't consider that they need to be told how/where to look. They're just coming looking for a given page.
The second problem is that some bots simply ignore the robots.txt, or don't honor all of it. For instance, while Google honors the file in terms of what it should look at, my understanding is that it instead requires you to implement the webmaster toolkit for your site to control its crawl rate.
Then, too, if you may have multiple sites on your server, the spider or bot may not consider that in deciding to send a wave of requests to your server. It may say "I'll only send requests to domain x at a rate of 1 per second", but it may not realize that it's sending requests to domains x, y, z (and a, b, and c) all of which are one server/cluster, which could lead a single server to in fact be hit far more than once a second (in that scenario). It may seem that's an edge case, but honestly it's not that unusual from what I've observed.
Finally, another reason all this becomes a concern is that of course there can be many spiders, bots, and other automated requests all hitting your server at once sometimes. My tool can't help with that, but it can at least help with the other points above.
(As with so much in IT and this very space, things do change, so what's true today may change, or one may have old knowledge, so as always I welcome feedback.)
The code
So I hope I've made the case for why you should consider some throttling, such that too many requests from one IP address are rejected. I've done it in a two-fold approach, sending both plain text and an http header that is appropriate for this sort of "slow down" kind of rejection. You can certainly change it to your taste.
I've just implemented it as a UDF (user-defined function). Yes, I could have also written it all in CFScript (which would run in any release, as there's nothing in that code that couldn't be written in script--well, except the CFLOG, which could be removed). But since CF6 added the ability to define UDFs with tags, and to keep things simplest for the most people, I've just done it as tags. Feel free to modify it to all script if you'd like. It's just a starting point.
I simply drop the UDF into my application.cfm (or application.cfc, as appropriate). Yes, one could include it, or implement it as a CFC method if they wished.
<cffunction name="limiter">
<!---
Written by Charlie Arehart, charlie@carehart.org, in 2009
- Throttles requests made more than "count" times within "duration" seconds.
- sends 503 status code for bots to consider as well as text for humans to read
- also logs to a new "limiter.log" that is created automatically in cf logs directory, tracking when limits are hit, to help fine tune
- note that since it relies on the application scope, you need to place the call to it AFTER a cfapplication tag in application.cfm
--->
<cfargument name="duration" type="numeric" default="3">
<cfargument name="count" type="numeric" default="3">
<cfif not IsDefined("application.rate_limiter")>
<cfset application.rate_limiter = StructNew()>
<cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
<cfelse>
<cfif StructKeyExists(application.rate_limiter, CGI.REMOTE_ADDR) and DateDiff("s",application.rate_limiter[CGI.REMOTE_ADDR].last_attempt,Now()) LT arguments.duration>
<cfif application.rate_limiter[CGI.REMOTE_ADDR].attempts GT arguments.count>
<cfoutput><p>You are making too many requests too fast, please slow down and wait #arguments.duration# seconds</p></cfoutput>
<cfheader statuscode="503" statustext="Service Unavailable">
<cfheader name="Retry-After" value="#arguments.duration#">
<cflog file="limiter" text="#cgi.remote_addr# #application.rate_limiter[CGI.REMOTE_ADDR].attempts# #cgi.request_method# #cgi.SCRIPT_NAME# #cgi.QUERY_STRING# #cgi.http_user_agent# #application.rate_limiter[CGI.REMOTE_ADDR].last_attempt#">
<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
<cfabort>
<cfelse>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
</cfif>
<cfelse>
<cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
</cfif>
</cfif>
</cffunction>
Then I call the UDF, using simply cfset limiter(), as shown below. That's it. No arguments need be passed to it, unless you want to override the defaults of limiting things to 3 requests from one IP address within 3 seconds.
<cfset limiter()>
Note that since the UDF relies on the application scope, you need to place the call to it AFTER a cfapplication tag if using application.cfm.
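To override the defaults, the two arguments the UDF defines (duration and count) can be passed by name. This is just a sketch: the application name and the values 5 and 10 are made up for illustration, so tune them against what your limiter.log shows.

```cfml
<!--- application.cfm: the cfapplication tag comes first, since the UDF uses the application scope --->
<!--- "myApp" is a placeholder application name --->
<cfapplication name="myApp" sessionmanagement="yes">

<!--- Illustrative values only: a duration of 5 seconds and a count of 10 --->
<cfset limiter(duration=5, count=10)>
```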
Caveats and more
There are definitely a few points to consider, and some concerns/observations that readers may have.
- First, BlueDragon fans will want to point out that they don't need to code a solution at all (or use this), because it's had a CFTHROTTLE tag for several years. Indeed it has. I do wish Adobe would implement it in CF (I'm not aware of it existing in Railo). Until then, perhaps this will help others as it has me.
- More important, some will be quick to point out that a potential flaw in the approach of throttling by IP address is that you may have some visitors who are behind a proxy, where they appear to your server to all be coming from one IP address. Fair enough. This is a dilemma that requires more handling. For instance, the BD CFThrottle tag implements this with a TOKEN attribute allowing you to key on yet another field in the request headers. I didn't choose to bother with that, as in my case (on my site), I just am not that worried about the problem. You may need to, so beware. Again, the log will help you determine how much it's doing any work at all.
- And some may recommend (and others may want to consider) instead doing this throttling at the servlet filter level, rather than CFML (something I've written about before.) Yep, since CF runs atop a servlet engine (JRun by default), you could indeed do that, which could apply then to all applications on your entire CF server (rather than implemented per application like above.) And there are indeed throttling servlet filters, such as this one. Again, I offer this for those who aren't interested in that.
- And of course, an inevitable question/concern some may have is, "but if you slow down a bot, might that not affect what they think about your site? Might they stop crawling entirely?" I suppose that's a consideration that each will have to make for themselves. I implemented this several months ago and haven't noticed any change either in my page ranks, my own search results, etc. That's all just anecdotal, of course. And again, things can change. I'll say that of course you use this at your own risk. I just offer it for those who may want to consider it, and want to save a little time trying to code up a solution. Again, I welcome feedback if it could be improved.
- Now, one other gotcha to consider, if you implement this and try to test it: some browsers have a built-in throttling mechanism of their own and they won't send more than x requests to a given domain from the browser at a time. I've spoken on this before, and you can read more from yslow creator Steve Souders. So while you may think you can just hit refresh 4 times to force this, it may not quite work that way. What I have found is that if you wait for each request to finish and then do the refresh (and do that 4 times), you'll get the expected message. Again, use the logs for real verification of whether the throttling is really working for real users, and to what extent.
- There is of course another nasty effect of spiders, bots, and other automated requests, and that's the risk of an explosion of sessions which could eat away at your Java heap space. People often accuse CF of a memory leak, when it's really just this issue. I've written on it before (see the related entries at the bottom here, above the comments). This suggestion about throttling requests may help a little with that, but it really is a bigger problem with other solutions, which I allude to in the other entries.
- Finally, yes, I realize I could and should post this to the wonderful CFlib repository, and I surely will. I wouldn't mind getting some feedback if anyone sees any issues with it. I'm sure there's some improvement that could be made. I just wanted to get it out, as is, given that it works for me and may help others.
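On the proxy caveat above, one possible variation is to key the counter on more than the IP address. What follows is only a sketch under my own assumptions: the variable name throttleKey is made up, combining the address with a hash of the user agent is just one option, and I'm not suggesting this is how BlueDragon's TOKEN attribute works.

```cfml
<!--- Hypothetical variation: key on IP address plus a hash of the user agent,
      so different clients behind one proxy are less likely to share a counter --->
<cfset throttleKey = CGI.REMOTE_ADDR & "|" & Hash(CGI.HTTP_USER_AGENT)>

<!--- Inside the UDF, throttleKey would then replace CGI.REMOTE_ADDR
      wherever application.rate_limiter is indexed, for example: --->
<cfif IsDefined("application.rate_limiter") and StructKeyExists(application.rate_limiter, throttleKey)>
    <cfset application.rate_limiter[throttleKey].attempts = application.rate_limiter[throttleKey].attempts + 1>
</cfif>
```

Keep in mind that user agents are trivially changed, so a misbehaving client could sidestep the limit by varying that header.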



As for "cfcombine", I have to say you've stumped me there. I've never heard of it. I realize it's not a tag, but you must have something else in mind, perhaps some project or tool. I googled but could find nothing obvious. Can you share a little more?
http://combine.riafo...
This project "combines multiple javascript or CSS files into a single, compressed, HTTP request."
You'd want to be careful to exclude scripts like this as maximum limits could be met within a single page load by visitors if not properly configured.
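One way to exclude such scripts would be to skip the limiter call for certain script paths. This is a rough sketch: the paths in the list are placeholders rather than real file names from either project, and the variable name throttleExempt is made up.

```cfml
<!--- Hypothetical list of script paths that should bypass the limiter --->
<cfset throttleExempt = "/combine.cfm,/thumbnail.cfm">

<!--- ListFindNoCase returns 0 when the current script is not in the list --->
<cfif not ListFindNoCase(throttleExempt, CGI.SCRIPT_NAME)>
    <cfset limiter()>
</cfif>
```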
On another note: To deal with abusive requests, I've written a SpiderCheck custom tag to identify both good and bad/abusive spiders. Identified abusive spiders receive an automatic "403 Forbidden" status and a message. I've also written a "BlackListIP" script that blocks POST requests by known exploited IPs and works with Project HoneyPot. I haven't published any of my internal projects/scripts before because I hadn't had much time. I primarily communicate on third-party blogs on topics of interest and don't attend many conferences. (I hope this doesn't make me a troll.) I wouldn't mind sharing my scripts if someone is interested in reviewing them and distributing them. (I personally don't have time to provide any customer support.)
Thanks for all you do.
As for your scripts, I'm sure many would appreciate seeing them. And as I wrote, I too just would hand out my script on request so just finally decided to offer it here.
But you don't need a blog to post it, of course. You could post them onto RIAForge, and then you don't have to "support it" yourself. A community of enthusiastic users may well rally around the tool--and even if not, no one "expects" full support of things posted on riaforge.
I'd think it worth putting it out there just to "run it up the flag pole and see who salutes". I'm sure Ray or others would help you if you're unsure how to go about getting them posted. (I don't know that you should expect too many volunteers from readers here, though. My blog is not among the more widely read/shared/promoted, but maybe someone will see this discussion and offer to help.)
I myself have been meaning to get around to posting something to riaforge also, just to see what the experience is like. I'm sure someone will pipe in to say "it's incredibly easy". I don't doubt that. Just haven't had the right project at the right time (when I could explore it) to see for myself.
But back to your tools, there are certainly a lot of ways to solve the spider dilemma. I've been reluctant to do a check of either those or blacklisted IPs just because of the overhead of checking them on each page request. I imagine those are big lists! :-) But certainly if someone is suffering because of them (or fears that), then it may be worth doing. Again, I'm sure many would appreciate seeing your tools. Hope you'll share them. :-)
--
Mike
Thank you very much for this timely post. @cfchris and I were just discussing this problem last week and this is exactly what he was recommending. I'm curious as to the other steps that you recommend when dealing with this?
Here are a few things we've tried and our experience:
We've adopted a suggestion from the community: sniffing the user agent and forcing the session to time out for known bots. Specifically, what we were seeing is that each bot would throw away the cookie, generating a new session for each request. At standard session timeouts of 15+ minutes, this quickly added up and consumed all of the server's memory. The obvious challenge with that is keeping up with the known bots out there. For instance, a Russian search index http://www.yandex.co... indexes many of our client pages and completely ignores the robots.txt recommendation.
In general, with every problem we've experienced with search engines taking down our servers it had to do with abandoned sessions slowly eating up memory and ultimately crashing the server.
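The user-agent approach described above can be sketched roughly as follows. The pattern, the timeout values, and the application name are all assumptions made for illustration, not the actual code we run. A real bot list would be longer and would need ongoing upkeep.

```cfml
<!--- Assumption: a simple pattern match on the user agent stands in for a maintained bot list --->
<cfif REFindNoCase("bot|spider|crawl|slurp", CGI.HTTP_USER_AGENT)>
    <!--- Likely a bot: let its session expire after 10 seconds --->
    <cfset sessionLife = CreateTimeSpan(0, 0, 0, 10)>
<cfelse>
    <!--- Everyone else gets a more typical 20-minute session --->
    <cfset sessionLife = CreateTimeSpan(0, 0, 20, 0)>
</cfif>

<!--- "myApp" is a placeholder application name --->
<cfapplication name="myApp" sessionmanagement="yes" sessiontimeout="#sessionLife#">
```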
Robots.txt crawl rate - One thing we employ although not many respect is the Robot 'rate request'. If you set this to the same as the suggested crawl rate in this code say, once every five seconds or Request-rate: 1/5 in Robots.txt, that might help them sync up.
Challenges / Suggestions:
The one challenge I see with this code is that many of our clients' images are dynamically re-sized using a ColdFusion call that checks to see if the thumbnail exists, generates it if necessary and then redirects to the file. This in theory would cause pages with heavy dynamic images to trigger this code. However, we may need to revisit that solution in general as the very fact that it does take up a ColdFusion thread for the processing is ultimately what causes the load from our BOT servers to crash the server.
Ultimately you addressed this already, potential impacts on SEO. I thought about this for a while, but the experience right now is that rogue bots that don't play nice also get the same treatment as first class engines such as google, yahoo, msn, and ultimately the same as real users. One other addition might be to add a 'safe user agent list'. For instance, if you notice it's a 'trusted user agent', you simply exclude them from this check. Obvious problems are that user agents are sometimes faked by crawlers to be Firefox or IE, but playing around with that might also prevent this from taking down real users and priority bots, while still keeping most of the ones that don't play well away.
I'm going to play with this on our staging environment for a while but plan on launching this into a few sites later this week. I'll let you know my findings.
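A 'safe user agent list' along those lines might look something like this. It's only a sketch: the pattern below is an assumed example rather than a vetted list, and the variable name trustedAgents is made up.

```cfml
<!--- Hypothetical pattern of user agents to trust; remember that user agents can be faked --->
<cfset trustedAgents = "googlebot|slurp|msnbot">

<!--- REFindNoCase returns 0 when the user agent matches none of the trusted names --->
<cfif not REFindNoCase(trustedAgents, CGI.HTTP_USER_AGENT)>
    <cfset limiter()>
</cfif>
```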
Thanks again for your help on this topic,
Roy
So I have just added a new bullet to my caveats/notes above, and I point people to the two older entries which are listed as "related entries", shown just above the comments here.
It's without question a huge problem, and as I note in the caveat, this limiter wasn't really focused on solving that problem, though it will help. Your ideas are among many that have been considered and that I discuss some in the other entries. Thanks for bringing it up, though.
Wouldn't bots/crawlers or individual internet browsers be assigned their own session? That way you wouldn't have to worry about affecting multiple users in an IP address and wouldn't have to check for CFToken.
So every visit they make creates a new session (CF creates a new session for each page visit they make)--and therefore each new page visit would not be able to access the session scope created in the previous visit.
What you're thinking could make sense for "typical" browsers perhaps, but not for this problem of automated requests. Why not try it out for yourself, though, if you're really interested in this topic. It can be a compelling learning experience.
I've written about it previously a few years ago here: http://tinyurl.com/s...