Some time ago I implemented some code on my own site to throttle when any single IP address (bot, spider, hacker, user) made too many requests at once. I've mentioned it occasionally and people have often asked me to share it, which I've happily done by email. Today with another request I decided to post it and of course seek any feedback.
It's just a rough cut. I haven't thought it through thoroughly (wow, how's that for an alliteration!). Still, while I know there are a couple of concerns that will come to mind for some readers (and I try to address those at the end), it does work for me and has helped improve my server's stability and reliability.
Background: do you need to care? Perhaps more than you realize
As background, in my consulting to help people troubleshoot CF server problems, one of the most common surprises I help people discover is that their servers are often being bombarded by spiders, bots, hackers, people grabbing their content, RSS readers, or even just their own internal/external ping tools (monitoring whether the server is up).
It may be that there are many more of them than people expect, that they come more often than expected, or that they hit the server extremely fast (even many times a second). This throttle tool helps deal with the latter.
Why you can't "just use robots.txt and call it a day"
Yes, I do know that there is a robots.txt standard (or "robots exclusion protocol") which, if implemented on your server, robots should follow so as not to abuse your site. And it does offer a crawl-delay option.
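For illustration only, a minimal robots.txt using that option might look like the following. The 10-second delay and the /admin/ path are made-up values, and Crawl-delay is an extension rather than part of the core standard, so support varies by crawler.

```text
User-agent: *
Crawl-delay: 10
Disallow: /admin/
```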
The first problem is that some of the things I allude to above aren't bots in the classic sense (such as RSS readers, ping tools). They don't "crawl" your site, so they don't regard that they need to be told how/where to look. They're just coming looking for a given page.
The second problem is that some bots simply ignore the robots.txt, or don't honor all of it. For instance, while Google honors the file in terms of what it should look at, my understanding is that it instead requires you to implement the webmaster toolkit for your site to control its crawl rate.
Then, too, if you have multiple sites on your server, the spider or bot may not take that into account when deciding to send a wave of requests. It may say "I'll only send requests to domain x at a rate of 1 per second", but it may not realize that it's sending requests to domains x, y, z (and a, b, and c), all of which are on one server/cluster. That could lead a single server to be hit far more than once a second. It may seem that's an edge case, but honestly it's not that unusual from what I've observed.
Finally, another reason all this becomes a concern is that of course there can sometimes be many spiders, bots, and other automated requests all hitting your server at once. My tool can't help with that, but it can at least help with the other points above.
(As with so much in IT and this very space, things do change, so what's true today may change, or one may have old knowledge, so as always I welcome feedback.)
The code
So I hope I've made the case for why you should consider some throttling, such that too many requests from one IP address are rejected. I've done it in a two-fold approach, sending both plain text and an http header that is appropriate for this sort of "slow down" kind of rejection. You can certainly change it to your taste.
I've just implemented it as a UDF (user-defined function). Yes, I could have also written it all in CFScript (which would run in any release, as there is nothing in that code that couldn't be written in script, except the CFLOG, which could be removed). But since CF6 added the ability to define UDFs with tags, and to keep things simplest for the most people, I've just done it as tags. Feel free to modify it to all script if you'd like. It's just a starting point.
I simply drop the UDF into my application.cfm (or application.cfc, as appropriate). Yes, one could include it, or implement it as a CFC method if they wished.
<cffunction name="limiter">
    <!---
        Written by Charlie Arehart, charlie@carehart.org, in 2009
        - Throttles requests made more than "count" times within "duration" seconds.
        - Sends 503 status code for bots to consider as well as text for humans to read.
        - Also logs to a new "limiter.log" that is created automatically in the CF logs directory, tracking when limits are hit, to help fine tune.
        - Note that since it relies on the application scope, you need to place the call to it AFTER a cfapplication tag in application.cfm.
    --->
    <cfargument name="duration" type="numeric" default="3">
    <cfargument name="count" type="numeric" default="3">

    <cfif not IsDefined("application.rate_limiter")>
        <!--- First request ever seen: create the tracking struct and this IP's entry --->
        <cfset application.rate_limiter = StructNew()>
        <cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
        <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
        <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
    <cfelse>
        <cfif StructKeyExists(application.rate_limiter, CGI.REMOTE_ADDR) and DateDiff("s",application.rate_limiter[CGI.REMOTE_ADDR].last_attempt,Now()) LT arguments.duration>
            <cfif application.rate_limiter[CGI.REMOTE_ADDR].attempts GT arguments.count>
                <!--- Over the limit: tell the client to slow down, log it, and stop the request --->
                <cfoutput><p>You are making too many requests too fast, please slow down and wait #arguments.duration# seconds</p></cfoutput>
                <cfheader statuscode="503" statustext="Service Unavailable">
                <cfheader name="Retry-After" value="#arguments.duration#">
                <cflog file="limiter" text="#cgi.remote_addr# #application.rate_limiter[CGI.REMOTE_ADDR].attempts# #cgi.request_method# #cgi.SCRIPT_NAME# #cgi.QUERY_STRING# #cgi.http_user_agent# #application.rate_limiter[CGI.REMOTE_ADDR].last_attempt#">
                <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
                <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
                <cfabort>
            <cfelse>
                <!--- Still under the limit: just count this request --->
                <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
                <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
            </cfif>
        <cfelse>
            <!--- New IP, or enough time has passed since its last request: start counting afresh --->
            <cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
            <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
            <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
        </cfif>
    </cfif>
</cffunction>
Then I call the UDF, using simply cfset limiter(), as shown below. That's it. No arguments need be passed to it, unless you want to override the defaults of limiting things to 3 requests from one IP address within 3 seconds.
<!--- the following must be done after cfapplication --->
<cfset limiter()>
Note that since the UDF relies on the application scope, you need to place the call to it AFTER a cfapplication tag if using application.cfm.
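If you do want to override the defaults, a call passing the arguments by name might look like the sketch below. The application name "mySite" and the values (a count of 5 and a duration of 10 seconds) are arbitrary placeholders, not recommendations, and the sketch assumes the limiter UDF is defined in the same application.cfm.

```cfml
<!--- Hypothetical application.cfm fragment; "mySite" is a placeholder name --->
<cfapplication name="mySite">

<!--- Must come after cfapplication, since limiter() uses the application scope --->
<cfset limiter(duration=10, count=5)>
```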
Caveats and more
There are definitely a few points to consider, and some concerns/observations that readers may have.
- First, BlueDragon fans will want to point out that they don't need to code a solution at all (or use this), because BlueDragon has had a CFTHROTTLE tag for several years. Indeed it has. I do wish Adobe would implement it in CF (I'm not aware of it existing in Railo). Until then, perhaps this will help others as it has me.
- More important, some will be quick to point out that a potential flaw in the approach of throttling by IP address is that you may have some visitors who are behind a proxy, where they appear to your server to all be coming from one IP address. Fair enough. This is a dilemma that requires more handling. For instance, the BD CFThrottle tag implements this with a TOKEN attribute allowing you to key on yet another field in the request headers. I didn't choose to bother with that, as in my case (on my site), I just am not that worried about the problem. You may need to, so beware. Again, the log will help you determine how much it's doing any work at all.
- And some may recommend (and others may want to consider) instead doing this throttling at the servlet filter level, rather than in CFML (something I've written about before). Yep, since CF runs atop a servlet engine (JRun by default), you could indeed do that, which could then apply to all applications on your entire CF server (rather than being implemented per application like above). And there are indeed throttling servlet filters, such as this one. Again, I offer this for those who aren't interested in that.
- And of course, an inevitable question/concern some may have is, "but if you slow down a bot, might that not affect what it thinks about your site? Might it stop crawling entirely?" I suppose that's a consideration that each will have to make for themselves. I implemented this several months ago and haven't noticed any change either in my page ranks, my own search results, etc. That's all just anecdotal, of course. And again, things can change. I'll say that of course you use this at your own risk. I just offer it for those who may want to consider it, and want to save a little time trying to code up a solution. Again, I welcome feedback if it could be improved.
- Now, one other gotcha to consider, if you implement this and try to test it: some browsers have a built-in throttling mechanism of their own and they won't send more than x requests to a given domain from the browser at a time. I've spoken on this before, and you can read more from yslow creator Steve Souders. So while you may think you can just hit refresh 4 times to force this, it may not quite work that way. What I have found is that if you wait for each request to finish and then do the refresh (and do that 4 times), you'll get the expected message. Again, use the logs for real verification of whether the throttling is really working for real users, and to what extent.
- There is of course another nasty effect of spiders, bots, and other automated requests, and that's the risk of an explosion of sessions which could eat away at your Java heap space. People often accuse CF of a memory leak, when it's really just this issue. I've written on it before (see the related entries at the bottom here, above the comments). This suggestion about throttling requests may help a little with that, but it really is a bigger problem with other solutions, which I allude to in the other entries.
- Finally, yes, I realize I could and should post this to the wonderful CFlib repository, and I surely will. I wouldn't mind getting some feedback if anyone sees any issues with it. I'm sure there's some improvement that could be made. I just wanted to get it out, as is, given that it works for me and may help others.
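On the proxy caveat in the list above, one possible direction is to key the throttle on something other than the raw IP address. The sketch below is not part of the UDF as posted: limiterKey is a hypothetical helper name, and it assumes a proxy in front of your server sets the X-Forwarded-For request header.

```cfml
<!--- Hypothetical helper, not part of the original UDF --->
<cffunction name="limiterKey" returntype="string" output="false">
    <cfset var headers = GetHttpRequestData().headers>
    <!--- Assumption: a trusted proxy supplies X-Forwarded-For; the first entry is taken as the client --->
    <cfif StructKeyExists(headers, "X-Forwarded-For") and Len(Trim(headers["X-Forwarded-For"]))>
        <cfreturn Trim(ListFirst(headers["X-Forwarded-For"]))>
    </cfif>
    <!--- Otherwise fall back to the address the UDF already uses --->
    <cfreturn CGI.REMOTE_ADDR>
</cffunction>
```

To try it, you would replace each CGI.REMOTE_ADDR in limiter() with a variable set from limiterKey(). Since the header is supplied by the client side of the connection, it can be spoofed, so this only makes sense if you trust the proxy that sets it.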
Besides feedback/corrections/suggestions, please do also let me know here if it's helpful for you.