
Some code to throttle rapid requests to your CF server from one IP address

Some time ago I implemented some code on my own site to throttle when any single IP address (bot, spider, hacker, user) made too many requests at once. I've mentioned it occasionally and people have often asked me to share it, which I've happily done by email. Today with another request I decided to post it and of course seek any feedback.

It's a first cut. While there are a couple of concerns that will come to mind for some readers (and I try to address those at the end), it does work for me, has helped improve my server's stability and reliability, and has been used by many others.

Background: do you need to care? Perhaps more than you realize

As background, in my consulting to help people troubleshoot CF server problems, one of the most common surprises I help people discover is that their servers are often being bombarded by spiders, bots, hackers, people grabbing their content, RSS readers, or even just their own internal/external ping tools (monitoring whether the server is up).

Sometimes there are many more of them than expected, coming more often than expected; sometimes they hit your server extremely fast (even many times a second). This throttle tool can help deal with the latter.

Why you can't "just use robots.txt and call it a day"

Yes, I do know that there is a robots.txt standard (or "robots exclusion protocol") which, if implemented on your server, robots should follow so as not to abuse your site. And it does offer a crawl-delay option.

The first problem is that some of the things I allude to above aren't bots in the classic sense (such as RSS readers and ping tools). They don't "crawl" your site, so they don't consider themselves bound by instructions about how or where to look. They're just coming looking for a given page.

A second problem is that crawl-delay is not honored by all spiders.

The third problem is that some bots simply ignore robots.txt, or don't honor all of it. For instance, while Google honors the file in terms of what it should look at, my understanding is that it does not regard it with respect to how often it should come. Instead, Google requires you to use its webmaster tools for your site to control its crawl rate.

Then, too, if you have multiple sites on your server, the spider or bot may not consider that in deciding to send a wave of requests. It may say "I'll only send requests to domain x at a rate of 1 per second", but it may not realize that it's sending requests to domains x, y, and z (and a, b, and c), all of which live on one server/cluster, which could lead a single server to be hit far more than once a second. That may seem like an edge case, but honestly it's not that unusual from what I've observed.

Finally, another reason all this becomes a concern is that there can of course be many spiders, bots, and other automated requests all hitting your server at once. My tool can't help with that, but it can at least help with the other points above.

(As with so much in IT and this very space, things do change, so what's true today may change, or one may have old knowledge, so as always I welcome feedback.)

The code

So I hope I've made the case for why you should consider some sort of throttling, such that too many requests from one IP address are rejected. I've done it in a two-fold approach, sending both a plain text warning message and an http header that is appropriate for this sort of "slow down" kind of rejection. You can certainly change it to your taste.

I've just implemented it as a UDF (user-defined function). Yes, I could have also written it entirely in CFScript (which would run in any release, as there's nothing in that code that couldn't be written in script--well, except the CFLOG, which could be removed). But since CF6 added the ability to define UDFs with tags, and to keep things simplest for the most people, I've just done it as tags. Feel free to convert it to all script if you'd like. It's just a starting point.

I simply drop the UDF into my application.cfm (or application.cfc, as appropriate). Yes, you could instead include it from a separate file, or implement it as a CFC method if you wished.

<cffunction name="limiter">
   <!---
      Written by Charlie Arehart, charlie@carehart.org, in 2009, updated 2012
      - Throttles requests made more than "count" times within "duration" seconds from single IP.
      - sends 503 status code for bots to consider as well as text for humans to read
      - also logs to a new "limiter.log" that is created automatically in cf logs directory, tracking when limits are hit, to help fine tune
      - note that since it relies on the application scope, you need to place the call to it AFTER a cfapplication tag in application.cfm
      - updated 10/16/12: now adds a test around the actual throttling code, so that it applies only to requests that present no cookie, so should only impact spiders, bots, and other automated requests. A "legit" user in a regular browser will be given a cookie by CF after their first visit and so would no longer be throttled.
      - I also tweaked the cflog output to be more like a csv-format output
   --->

   <cfargument name="count" type="numeric" default="3">
   <cfargument name="duration" type="numeric" default="3">

   <cfif not IsDefined("application.rate_limiter")>
      <cfset application.rate_limiter = StructNew()>
      <cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
      <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
      <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
   <cfelse>
      <cfif cgi.http_cookie is "">
         <cfif StructKeyExists(application.rate_limiter, CGI.REMOTE_ADDR) and DateDiff("s",application.rate_limiter[CGI.REMOTE_ADDR].last_attempt,Now()) LT arguments.duration>
            <cfif application.rate_limiter[CGI.REMOTE_ADDR].attempts GT arguments.count>
               <cfoutput><p>You are making too many requests too fast, please slow down and wait #arguments.duration# seconds</p></cfoutput>
               <cfheader statuscode="503" statustext="Service Unavailable">
               <cfheader name="Retry-After" value="#arguments.duration#">
               <cflog file="limiter" text="'limiter invoked for:','#cgi.remote_addr#',#application.rate_limiter[CGI.REMOTE_ADDR].attempts#,#cgi.request_method#,'#cgi.SCRIPT_NAME#', '#cgi.QUERY_STRING#','#cgi.http_user_agent#','#application.rate_limiter[CGI.REMOTE_ADDR].last_attempt#',#listlen(cgi.http_cookie,";")#">
               <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
               <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
               <cfabort>
            <cfelse>
               <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
               <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
            </cfif>
         <cfelse>
            <cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
            <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
            <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
         </cfif>
      </cfif>
   </cfif>
</cffunction>

Then I call the UDF, using simply cfset limiter(), as shown below. That's it. No arguments need be passed to it unless you want to override the defaults, which limit things to 3 requests from one IP address within 3 seconds (the example below overrides the duration to 5 seconds).

<!--- the following must be done after cfapplication --->
<cfset limiter(count=3,duration=5)>

Note that since the UDF relies on the application scope, you need to place the call to it AFTER a cfapplication tag if using application.cfm.
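And if you use Application.cfc instead, one reasonable placement (just a sketch on my part; the application name and included file name here are purely illustrative) is to include the UDF in the component body and call it from onRequestStart:

<!--- Application.cfc: illustrative placement only --->
<cfcomponent>
   <cfset this.name = "mySite">
   <cfset this.sessionManagement = true>

   <!--- pull in the file holding the limiter() UDF shown above --->
   <cfinclude template="limiter.cfm">

   <cffunction name="onRequestStart" returntype="boolean" output="true">
      <cfargument name="targetPage" type="string" required="true">
      <!--- throttle before any page processing; the application scope exists by this point --->
      <cfset limiter(count=3,duration=5)>
      <cfreturn true>
   </cffunction>
</cfcomponent>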

Caveats

There are definitely a few caveats to consider, and some concerns/observations that readers may have. The first couple have to do with the whole idea of doing this throttling by IP address:

  • First, some will be quick to point out that a potential flaw in throttling by IP address is that you may have some visitors behind a proxy, who all appear to your server to be coming from one IP address. This is a dilemma that requires more handling. For instance, one idea would be to key on yet another field in the request headers (like the user agent), so that you use two keys to identify "a user" rather than just the IP address (see the sketch just after this list). If you think that's an issue for you, feel free to tweak it and report back here for others to benefit. I chose not to bother with that, as in my case (on my site) I'm just not that worried about the problem. Note that the log that I create will help you determine if/when the UDF is doing any work at all.
  • Other folks will want to be sure I point out that many spiders and other automated request tools may now come to your site from different IP addresses, still within that short timespan. My code would not detect them. For now, I have not put in anything to address this (it wouldn't be trivial). But the percentage of hits you'd fail to block because of this problem may be relatively low. Still, doing anything is better than doing nothing.
  • Speaking of the frequency with which this code would run, someone might reasonably propose that this sort of "check" might only need to be done for requests that look like spiders and bots. As I've talked about elsewhere, spiders and bots tend not to present any cookies, and so you could add a test near the top to only pay attention to requests that have no cookie (cgi.http_cookie is ""). I'll leave you to do that if you think it worthwhile. Since there's a chance that some non-spider requesters could also make such frequent requests, I'll leave such a test out for now. (Update: I changed this on 10/16/12 to add just that test, so the code above now only blocks such requests that "look like spiders". A legit browser visitor would get the cookie set by CF on the first request, so won't be impacted by this limiter.)
  • Someone may fear that this could cause spiders and bots to index the phrase "You are making too many requests too fast, please slow down and wait" (or whatever value you use) as your page content. But I will note that I have searched Google, Bing, and Yahoo for this phrase and not found it shown as the result for any page on any site that may have implemented this code. (Since I return a 503 status code, I think that's why they don't store it as the result.)
  • Here's a related gotcha to consider, if you implement this, try to test it from your browser, and find you can't ever seem to get the error to show, even when you refresh the page often. Here's the explanation: some browsers have a built-in throttling mechanism of their own and won't send more than x requests to a given domain at a time. I've spoken on this before, and you can read more from YSlow creator Steve Souders. So while you may think you can just hit refresh 4 times to force this, it may not quite work that way. What I have found is that if you wait for each request to finish and then do the refresh (and do that 4 times), you'll get the expected message. Again, use the logs for real verification of whether the throttling is really working for real users, and to what extent. (Separately, after the update above on 10/16/12 to only limit spiders/bots/requests without a cookie, that's another reason you'll never be throttled by this in a regular browser.)
  • Finally, someone may note that technically I ought to be doing a CFLOCK since I am updating a shared-scope (application) variable. The situation in which this code runs is certainly susceptible to a "race condition" (two or more threads running at once, updating the same variable). But in this case, it's not the end of the world if two requests modify the data at once. And I'd rather not have code like this doing any CFLOCKing since it potentially runs on every request.
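Here's the sketch mentioned in the first bullet above: a rough, untested illustration of keying on both the IP address and a hash of the user agent (the variable name is just illustrative); you'd then use that key everywhere the UDF above uses CGI.REMOTE_ADDR alone.

<!--- sketch: track by IP address plus (hashed) user agent, rather than IP alone --->
<cfset requestKey = cgi.remote_addr & "_" & hash(cgi.http_user_agent)>
<cfif not StructKeyExists(application.rate_limiter, requestKey)>
   <cfset application.rate_limiter[requestKey] = StructNew()>
   <cfset application.rate_limiter[requestKey].attempts = 1>
   <cfset application.rate_limiter[requestKey].last_attempt = Now()>
</cfif>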

Some other thoughts

Beyond those caveats, there are a few more points about this idea that you may want to consider:

  • Of course, an inevitable question/concern some may have is, "but if you slow down a bot, might that not affect what it thinks about your site? Might it stop crawling entirely?" I suppose that's a consideration each will have to make for themselves. I implemented this several months ago and haven't noticed any change in my page ranks, my own search results, etc. That's all just anecdotal, of course. And again, things can change. I'll say that of course you use this at your own risk. I just offer it for those who may want to consider it, and want to save a little time trying to code up a solution. Again, I welcome feedback if it could be improved.
  • Some may recommend (and others may want to consider) instead that this sort of throttling could/should be done at the servlet filter level, rather than in CFML (filters are something I've written about before). Yep, since CF runs atop a servlet engine (JRun by default), you could indeed do that, which would then apply to all applications on your entire CF server (rather than being implemented per application, as above). And there are indeed throttling servlet filters, such as this one. Again, I offer this UDF for those who aren't interested in trying to implement such a filter. If you do, and want to share your experience here, please do.
  • BlueDragon fans will want to point out that they don't need to code a solution at all (or use this), because it has had a CFTHROTTLE tag for several years. Indeed it has. I do wish Adobe would implement it in CF (I'm not aware of it existing in Railo). Until then, perhaps this will help others as it has me. (The BD CFThrottle tag also addresses the problem of visitors behind a proxy, with a TOKEN attribute allowing you to key on yet another field in the request headers.)
  • There is another nasty effect of spiders, bots, and other automated requests, and that's the risk of an explosion of sessions, which can eat away at your Java heap space. People often accuse CF of a memory leak when it's really just this issue. I've written on it before (see the related entries at the bottom here, above the comments). This suggestion about throttling requests may help a little with that, but it's really a bigger problem with other solutions, which I allude to in the other entries.
  • It would probably be wise to add some additional code to purge entries from this application-scoped struct, lest they grow in size forever over the life of a CF server. It's only really necessary to keep entries that are less than a minute old, since any older than that would not trigger the throttle mechanism (since it's based on x requests in y seconds). It may not be wise to do this check on every request, but it may be wise to add another function that could be called, perhaps as a scheduled task, to purge any "old" entries (see the sketch just after this list).
  • Finally, yes, I realize I could and should post this UDF to the wonderful CFlib repository, and I surely will. I wouldn't mind getting some feedback if anyone sees any issues with it. I'm sure there's some improvement that could be made. I just wanted to get it out, as is, given that it works for me and may help others.
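Here's the purge-function sketch promised a couple of bullets up: a rough, untested take (the function name and default are just illustrative) that a scheduled task could call to clear out stale entries.

<cffunction name="purgeLimiter">
   <!--- removes rate_limiter entries whose last_attempt is older than maxAge seconds --->
   <cfargument name="maxAge" type="numeric" default="60">
   <cfset var keys = "">
   <cfset var ip = "">
   <cfif IsDefined("application.rate_limiter")>
      <!--- copy the keys first, so we aren't deleting from the struct while looping over it --->
      <cfset keys = StructKeyList(application.rate_limiter)>
      <cfloop list="#keys#" index="ip">
         <cfif StructKeyExists(application.rate_limiter, ip) and DateDiff("s", application.rate_limiter[ip].last_attempt, Now()) GT arguments.maxAge>
            <cfset StructDelete(application.rate_limiter, ip)>
         </cfif>
      </cfloop>
   </cfif>
</cffunction>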

Besides feedback/corrections/suggestions, please do also let me know here if it's helpful for you.

Comments
A potential gotcha would be any support scripts like cfcombine or cf-powered asynchronous Ajax calls that are performed during page load. You could add a bypass rule based on either script name or URL/POST variable so that these requests are negatively affected.
# Posted By James Moberg | 5/21/10 2:06 PM
I meant to state "aren't negatively affected."
# Posted By James Moberg | 5/21/10 2:07 PM
Thanks, James, and sure, that's why I alluded to how the BD CFThrottle tag added just such a token option to let one pick an additional item to key on, and that may be useful here too, but I do leave that as an exercise for the reader. If anyone's interested in proposing a tweak, feel free.

As for "cfcombine", I have to say you've stumped me there. I've never heard of it. I realize it's not a tag, but you must have something else in mind, perhaps some project or tool. I googled but could find nothing obvious. Can you share a little more?
# Posted By Charlie Arehart | 5/21/10 2:52 PM
Sorry... it was "Combine.cfc". (I didn't check before I posted.)
http://combine.riafo...
This project "combines multiple javascript or CSS files into a single, compressed, HTTP request."

You'd want to be careful to exclude scripts like this as maximum limits could be met within a single page load by visitors if not properly configured.

On another note: To deal with abusive requests, I've written a SpiderCheck custom tag to identify both good and bad/abusive spiders. Identified abusive spiders receive an automatic "403 Forbidden" status and a message. I've also written a "BlackListIP" script that blocks POST requests by known exploited IPs and works with Project HoneyPot. I haven't published any of my internal projects/scripts before because I hadn't had much time. I primarily communicate on third-party blogs on topics of interest and don't attend many conferences. (I hope this doesn't make me a troll.) I wouldn't mind sharing my scripts if someone is interested in reviewing them and distributing them. (I personally don't have time to provide any customer support.)

Thanks for all you do.
# Posted By James Moberg | 5/21/10 3:11 PM
Thanks for those, James.

As for your scripts, I'm sure many would appreciate seeing them. And as I wrote, I too would just hand out my script on request, so I finally decided to offer it here.

But you don't need a blog to post it, of course. You could post them onto RIAForge, and then you don't have to "support it" yourself. A community of enthusiastic users may well rally around the tool--and even if not, no one "expects" full support of things posted on riaforge.

I'd think it worth putting it out there just to "run it up the flag pole and see who salutes". I'm sure Ray or others would help you if you're unsure how to go about getting them posted. (I don't know that you should expect too many volunteers from readers here, though. My blog is not among the more widely read/shared/promoted, but maybe someone will see this discussion and offer to help.)

I myself have been meaning to get around to posting something to RIAForge also, just to see what the experience is like. I'm sure someone will pipe in to say "it's incredibly easy". I don't doubt that. I just haven't had the right project at the right time (when I could explore it) to see for myself.

But back to your tools, there are certainly a lot of ways to solve the spider dilemma. I've been reluctant to do a check of either those or blacklisted IPs just because of the overhead of checking them on each page request. I imagine those are big lists! :-) But certainly if someone is suffering because of them (or fears that), then it may be worth doing. Again, I'm sure many would appreciate seeing your tools. Hope you'll share them. :-)
# Posted By Charlie Arehart | 5/21/10 3:21 PM
Interesting stuff, Charlie. May well give it a go as it's always a problem and often adds hours to a developer's day as they try and find a bug that is crashing the site... but which isn't there!
--
Mike
# Posted By Michael Horne | 5/22/10 5:31 AM
Charlie,

Thank you very much for this timely post. @cfchris and I were just discussing this problem last week and this is exactly what he was recommending. I'm curious as to the other steps that you recommend when dealing with this?

Here are a few things we've tried and our experience:
We've adopted a suggestion from the community and sniffed the user agent, forcing the session to time out quickly for known bots. Specifically, what we were seeing is that each bot would throw away the cookie, generating a new session for each request. At standard session timeout rates (15+ minutes), this quickly added up and overtook all of the server's memory. The obvious challenge with that is keeping up with the known bots out there. For instance, a Russian search index http://www.yandex.co... indexes many of our client pages and completely ignores the robots.txt recommendation.

In general, with every problem we've experienced with search engines taking down our servers it had to do with abandoned sessions slowly eating up memory and ultimately crashing the server.

Robots.txt crawl rate - One thing we employ, although not many bots respect it, is the robots 'Request-rate' directive. If you set this to the same as the rate suggested in this code, say once every five seconds, or Request-rate: 1/5 in robots.txt, that might help them sync up.

Challenges / Suggestions:
The one challenge I see with this code is that many of our clients' images are dynamically resized using a ColdFusion call that checks to see if the thumbnail exists, generates it if necessary, and then redirects to the file. In theory this would cause pages with heavy dynamic images to trigger this code. However, we may need to revisit that solution in general, as the very fact that it takes up a ColdFusion thread for the processing is ultimately what causes the load from bots to crash the server.

Ultimately you addressed this already: potential impacts on SEO. I thought about this for a while, but the experience right now is that rogue bots that don't play nice get the same treatment as first-class engines such as Google, Yahoo, and MSN, and ultimately the same as real users. One other addition might be a 'safe user agent list'. For instance, if you notice it's a 'trusted user agent', you simply exclude it from this check. The obvious problem is that user agents are sometimes faked by crawlers to be Firefox or IE, but playing around with that might also prevent this from taking down real users and priority bots, while still keeping away most of the ones that don't play well.

I'm going to play with this on our staging environment for a while but plan on launching this into a few sites later this week. I'll let you know my findings.

Thanks again for your help on this topic,
Roy
# Posted By Roy Martin | 5/24/10 5:38 PM
Thanks, Roy. I really should have addressed that point as I'm definitely not unaware. In fact, I've written on it before, and so I was making a bit of a presumption that people would connect that dot, but that's of course a problem if someone hears of this entry or comes to my blog later.

So I have just added a new bullet to my caveats/notes above, and I point people to the two older entries which are listed as "related entries", shown just above the comments here.

It's without question a huge problem, and as I note in the caveat, this limiter wasn't really focused on solving that problem, though it will help. Your ideas are among many that have been considered and that I discuss some in the other entries. Thanks for bringing it up, though.
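For what it's worth, the basic shape of the "short sessions for bot-like requests" idea is something like this, set in the Application.cfc pseudo-constructor (just an untested sketch; the user agent test and timeout values are purely illustrative, and the effective minimum timeout is subject to what's configured in the CF Administrator):

<cfcomponent>
   <cfset this.name = "mySite">
   <cfset this.sessionManagement = true>
   <!--- normal visitors get the usual timeout --->
   <cfset this.sessionTimeout = createTimeSpan(0,0,30,0)>
   <!--- requests that present no cookie, or that look like bots, get a very short session --->
   <cfif cgi.http_cookie is "" or findNoCase("bot", cgi.http_user_agent) or findNoCase("spider", cgi.http_user_agent)>
      <cfset this.sessionTimeout = createTimeSpan(0,0,2,0)>
   </cfif>
</cfcomponent>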
# Posted By Charlie Arehart | 5/24/10 5:57 PM
Why not utilize something of this sort at the session level? Instead of filtering by IP address at the application level, why not specify a session-scoped timer and utilize that instead?

Wouldn't bots/crawlers or individual internet browsers be assigned their own session? That way you wouldn't have to worry about affecting multiple users behind one IP address and wouldn't have to check for CFToken.
# Posted By Braden Lake | 4/5/11 11:44 AM
Nope, Braden, that wouldn't work. In my experience, most bots/crawlers/spiders and other automated tools (which generate the kind of load of concern) do NOT honor the cookie sent from CF to track a session.

So every visit they make creates a new session (CF creates a new session for each page visit they make)--and therefore each new page visit would not be able to access the session scope created in the previous visit.

What you're thinking could make sense for "typical" browsers perhaps, but not for this problem of automated requests. Why not try it out for yourself, though, if you're really interested in this topic. It can be a compelling learning experience.

I've written about it previously a few years ago here: http://tinyurl.com/s...
# Posted By Charlie Arehart | 4/5/11 10:12 PM
Thanks for this function! It ended up being a very good starting point for me.

One thing I wanted to point out is that this function is a bit more aggressive in blocking bots than might at first be apparent. For instance, I set it up to allow up to 6 requests every 3 seconds. I found immediately that Googlebot was being blocked even though it was making only one request per second. The reason is that the "last_attempt" value is updated at every request so that the number of attempts keeps going up. One way around this is to only update the "last_attempt" value when the number of requests is reset.

Incidentally, I ended up rewriting the function a bit and changing it to use ColdFusion's new built-in cache functionality to automatically prune old IP addresses. Thought I'd throw it out there:
https://www.modernsi...
# Posted By David Hammond | 4/2/12 3:24 PM
Hey David, thanks for that, and I'm sorry that I missed this when you posted it a few weeks ago.

Good stuff. Thanks for sharing. I'll do some testing of your recommendation about last_attempt and will tweak the code above after that.
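In the meantime, for anyone who wants to experiment before I do, the change David describes would make the inner check look roughly like this (untested, and with the cflog line omitted for brevity), so that the window is measured from its first request rather than from the most recent one:

<cfif StructKeyExists(application.rate_limiter, CGI.REMOTE_ADDR) and DateDiff("s",application.rate_limiter[CGI.REMOTE_ADDR].last_attempt,Now()) LT arguments.duration>
   <!--- within the window: count the attempt, but leave last_attempt at the start of the window --->
   <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
   <cfif application.rate_limiter[CGI.REMOTE_ADDR].attempts GT arguments.count>
      <cfoutput><p>You are making too many requests too fast, please slow down and wait #arguments.duration# seconds</p></cfoutput>
      <cfheader statuscode="503" statustext="Service Unavailable">
      <cfheader name="Retry-After" value="#arguments.duration#">
      <cfabort>
   </cfif>
<cfelse>
   <!--- window expired (or first visit from this IP): start a fresh window --->
   <cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
   <cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
   <cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
</cfif>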

Finally, the use of CF9 caching is nice, though I did want to write something that worked more generically, across CF versions and CFML engines.

Still, folks should definitely check out your variation. Two heads are better than one! :-)

PS I suppose we may want to consider posting this somewhere it can be better shared, tweaked, etc., but in this case I think yours would have been a fork, being so different; there would still end up being two out there, so I don't know if it's really worth our bothering. :-)
# Posted By charlie arehart | 5/7/12 11:19 AM
I'll add as an update here that there seems to be some debate in the IT community about whether such a rate limiting scheme should return a 503 or 403, or some subcode of those (like 403.8), or even 429. There seems to be no unanimous answer.

One thing I've learned is that 500-level errors are meant more for "server" errors, while 400-level ones are meant more for "client" errors, as in the client made a mistake (as in this case, where the client is making too many requests).

It has been encouraging to see other tools implementing this same approach I've outlined above.

For instance, it seems there is an IIS setting for this, in IIS 7's "Dynamic IP Restrictions" module. More at http://learn.iis.net..., in the section "Blocking of IP addresses based on number of requests over time" (where they show returning a 403.8).

And also, here is a Ruby implementation, http://datagraph.rub..., which also discusses the debate between returning a 403 or 503 code.

I guess we'll see as the industry moves toward more of a consensus on the best status code, but at least it validates the approach I put forth. :-)
# Posted By charlie arehart | 5/14/12 10:41 PM
