
Some code to throttle rapid requests to your CF server from one IP address

Note: This blog post is from 2010. Some content may be outdated--though not necessarily. Same with links and subsequent comments from myself or others. Corrections are welcome, in the comments. And I may revise the content as necessary.
Some time ago I implemented some code on my own site to throttle when any single IP address (bot, spider, hacker, user) made too many requests at once. I've mentioned it occasionally and people have often asked me to share it, which I've happily done by email. Today with another request I decided to post it and of course seek any feedback.

It's a first cut. While there are a couple of concerns that will come to mind for some readers (and I try to address those at the end), it does work for me, has helped improve my server's stability and reliability, and has been used by many others.

Update in 2020: I have changed the 503 status code below to 429, as that has become the norm for such throttles. I had acknowledged it as an option originally; I just want to change it now, in case someone grabs the code without reading it all, or the comments. Speaking of comments, do see the discussion below with thoughts from others, especially from James Moberg, who created his own variant addressing some concerns, offered on GitHub, and the conversation that followed about it, including yet another later variant.
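(For anyone skimming: the snippet below is NOT the code from this post, just a bare-bones illustration of the idea it describes--count requests per IP over a short window, and return a 429 with a Retry-After header once the limit is exceeded. The names, limits, and lack of locking here are purely illustrative; see the actual code further below and the variants discussed in the comments.)

<!--- bare-bones illustration only: track per-IP request counts in the application scope
      (locking omitted for brevity); the limits and key names are made up --->
<cfif not structKeyExists(application, "rate_limiter")>
    <cfset application.rate_limiter = structNew()>
</cfif>
<cfset ip = cgi.remote_addr>
<cfset maxAttempts = 5>
<cfset windowSeconds = 10>
<cfif not structKeyExists(application.rate_limiter, ip)
      or dateDiff("s", application.rate_limiter[ip].last_attempt, now()) gt windowSeconds>
    <!--- first request from this IP, or its window has expired: start a new window --->
    <cfset application.rate_limiter[ip] = {attempts = 1, last_attempt = now()}>
<cfelse>
    <cfset application.rate_limiter[ip].attempts = application.rate_limiter[ip].attempts + 1>
    <cfif application.rate_limiter[ip].attempts gt maxAttempts>
        <!--- too many requests in the window: ask the client to back off and stop processing --->
        <cfheader statuscode="429" statustext="Too Many Requests">
        <cfheader name="Retry-After" value="#windowSeconds#">
        <cfoutput>You are making requests too quickly. Please slow down and try again shortly.</cfoutput>
        <cfabort>
    </cfif>
</cfif>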

Update in 2021: Rather than use my code, perhaps you would rather have this throttling done by your web server or another proxy. It is now a feature offered in IIS, Apache, and others. I discuss those in a new section below.

Background: do you need to care about throttling? Perhaps more than you realize

[....Continue Reading....]

Comments
A potential gotcha would be any support scripts like cfcombine or cf-powered asynchronous Ajax calls that are performed during page load. You could add a bypass rule based on either script name or URL/POST variable so that these requests are negatively affected.
# Posted By James Moberg | 5/21/10 2:06 PM
I meant to state "aren't negatively affected."
# Posted By James Moberg | 5/21/10 2:07 PM
Thanks, James, and sure, that's why I alluded to how the BD CFThrottle tag added just such a token option to let one pick an additional item to key on, and that may be useful here too, but I do leave that as an exercise for the reader. If anyone's interested in proposing a tweak, feel free.
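(For anyone wanting to take up that exercise, here is one rough way a script-name bypass could look. It's purely illustrative: the excluded list is made up, and checkRequestRate() is just a placeholder for however you invoke the throttle.)

<!--- purely illustrative: skip the throttle for support scripts (combine.cfc, ajax endpoints, etc.)
      so they don't count against a visitor's limit during a single page load --->
<cfset excludedScripts = "combine.cfc,ajaxproxy.cfm">
<cfif not listFindNoCase(excludedScripts, listLast(cgi.script_name, "/"))>
    <!--- checkRequestRate() stands in for however you call the throttle --->
    <cfset checkRequestRate()>
</cfif>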

As for "cfcombine", I have to say you've stumped me there. I've never heard of it. I realize it's not a tag, but you must have something else in mind, perhaps some project or tool. I googled but could find nothing obvious. Can you share a little more?
Sorry... it was "Combine.cfc". (I didn't check before I posted.)
http://combine.riafo...
This project "combines multiple javascript or CSS files into a single, compressed, HTTP request."

You'd want to be careful to exclude scripts like this as maximum limits could be met within a single page load by visitors if not properly configured.

On another note: To deal with abusive requests, I've written a SpiderCheck custom tag to identify both good and bad/abusive spiders. Identified abusive spiders receive an automatic "403 Forbidden" status and a message. I've also written a "BlackListIP" script that blocks POST requests by known exploited IPs and works with Project HoneyPot. I haven't published any of my internal projects/scripts before because I hadn't had much time. I primarily communicate on third-party blogs on topics of interest and don't attend many conferences. (I hope this doesn't make me a troll.) I wouldn't mind sharing my scripts if someone is interested in reviewing them and distributing them. (I personally don't have time to provide any customer support.)

Thanks for all you do.
# Posted By James Moberg | 5/21/10 3:11 PM
Thanks for those, James.

As for your scripts, I'm sure many would appreciate seeing them. And as I wrote, I too would just hand out my script on request, so I finally decided to offer it here.

But you don't need a blog to post them, of course. You could post them on RIAForge, and then you don't have to "support" them yourself. A community of enthusiastic users may well rally around the tools--and even if not, no one "expects" full support of things posted on RIAForge.

I'd think it worth putting them out there just to "run it up the flagpole and see who salutes". I'm sure Ray or others would help you if you're unsure how to go about getting them posted. (I don't know that you should expect too many volunteers from readers here, though. My blog is not among the more widely read/shared/promoted, but maybe someone will see this discussion and offer to help.)

I myself have been meaning to get around to posting something to RIAForge also, just to see what the experience is like. I'm sure someone will pipe in to say "it's incredibly easy". I don't doubt that. I just haven't had the right project at the right time (when I could explore it) to see for myself.

But back to your tools, there are certainly a lot of ways to solve the spider dilemma. I've been reluctant to do a check of either those or blacklisted IPs just because of the overhead of checking them on each page request. I imagine those are big lists! :-) But certainly if someone is suffering because of them (or fears that), then it may be worth doing. Again, I'm sure many would appreciate seeing your tools. Hope you'll share them. :-)
Interesting stuff, Charlie. May well give it a go, as it's always a problem and often adds hours to a developer's day as they try to find a bug that is crashing the site... but which isn't there!
--
Mike
# Posted By Michael Horne | 5/22/10 5:31 AM
Charlie,

Thank you very much for this timely post. @cfchris and I were just discussing this problem last week, and this is exactly what he was recommending. I'm curious about the other steps you recommend when dealing with this.

Here are a few things we've tried and our experience:
We've adopted a suggestion from the community: sniffing the user agent and forcing the session to time out quickly for known bots (see the rough sketch after these notes). Specifically, what we were seeing is that each bot would throw away the cookie, generating a new session for each request. At standard session timeout rates (15+ minutes), this quickly added up and overtook all of the server's memory. The obvious challenge with that is keeping up with the known bots out there. For instance, a Russian search index, http://www.yandex.co..., indexes many of our client pages and completely ignores the robots.txt recommendation.

In general, with every problem we've experienced with search engines taking down our servers it had to do with abandoned sessions slowly eating up memory and ultimately crashing the server.

Robots.txt crawl rate - One thing we employ, although not many bots respect it, is the robots.txt 'rate request'. If you set this to match the suggested crawl rate in this code (say, once every five seconds, or Request-rate: 1/5 in robots.txt), that might help them sync up.
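(A rough sketch of the shortened-session-for-bots idea from the first note above, purely illustrative: the user-agent patterns and timeout values are made up, and any such list needs ongoing maintenance, as noted.)

<!--- Application.cfc (excerpt), illustrative only: bots typically discard the session cookie,
      creating a new session on every request, so give likely bots a very short session --->
<cfcomponent output="false">
    <cfset this.name = "mySite">
    <cfset this.sessionManagement = true>
    <cfif not len(cgi.http_user_agent) or reFindNoCase("bot|spider|crawl|slurp|yandex", cgi.http_user_agent)>
        <cfset this.sessionTimeout = createTimeSpan(0, 0, 0, 30)>  <!--- 30 seconds for likely bots --->
    <cfelse>
        <cfset this.sessionTimeout = createTimeSpan(0, 0, 20, 0)>  <!--- 20 minutes for everyone else --->
    </cfif>
</cfcomponent>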

Challenges / Suggestions:
The one challenge I see with this code is that many of our clients' images are dynamically resized using a ColdFusion call that checks to see if the thumbnail exists, generates it if necessary, and then redirects to the file. In theory, pages with many dynamic images would trigger this code. However, we may need to revisit that solution in general, as the very fact that it takes up a ColdFusion thread for the processing is ultimately what causes the load from bots to crash the server.

Ultimately, you addressed this already: potential impacts on SEO. I thought about this for a while, but the experience right now is that rogue bots that don't play nice get the same treatment as first-class engines such as Google, Yahoo, and MSN, and ultimately the same as real users. One other addition might be a 'safe user agent list': if you notice it's a 'trusted user agent', you simply exclude it from this check. The obvious problem is that user agents are sometimes faked by crawlers to look like Firefox or IE, but playing around with that might also prevent this from taking down real users and priority bots, while still keeping away most of the ones that don't play well.
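(And a rough sketch of that 'safe user agent list' idea, again purely illustrative: the trusted patterns are made up, and checkRequestRate() stands in for however the throttle is invoked.)

<!--- purely illustrative: skip the rate-limit check for agents you choose to trust,
      accepting the caveat above that user agents can be faked --->
<cfset trustedAgents = "googlebot|bingbot|msnbot|slurp">
<cfif not reFindNoCase(trustedAgents, cgi.http_user_agent)>
    <cfset checkRequestRate()>
</cfif>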

I'm going to play with this on our staging environment for a while but plan on launching it into a few sites later this week. I'll let you know my findings.

Thanks again for your help on this topic,
Roy
Thanks, Roy. I really should have addressed that point as I'm definitely not unaware. In fact, I've written on it before, and so I was making a bit of a presumption that people would connect that dot, but that's of course a problem if someone hears of this entry or comes to my blog later.

So I have just added a new bullet to my caveats/notes above, and I point people to the two older entries which are listed as "related entries", shown just above the comments here.

It's without question a huge problem, and as I note in the caveat, this limiter wasn't really focused on solving that problem, though it will help. Your ideas are among many that have been considered and that I discuss some in the other entries. Thanks for bringing it up, though.
Why not utilize something of this sort at the session level? Instead of filtering by IP address at the application level, why not specify a session-scoped timer and utilize that instead?

Wouldn't bots/crawlers or individual internet browsers be assigned their own session? That way you wouldn't have to worry about affecting multiple users in an IP address and wouldn't have to check for CFToken.
# Posted By Braden Lake | 4/5/11 11:44 AM
Nope, Braden that wouldn't work. In my experience, most bots/crawlers/spiders and other automated tools (which generate the kind of load of concern) do NOT honor the cookie sent from CF to track a session.

So every visit they make creates a new session (CF creates one for each page visit they make)--and therefore each new page visit cannot access the session scope created in the previous visit.

What you're thinking could make sense for "typical" browsers perhaps, but not for this problem of automated requests. Why not try it out for yourself, though, if you're really interested in this topic? It can be a compelling learning experience.

I've written about it previously a few years ago here: http://tinyurl.com/s...
Thanks for this function! It ended up being a very good starting point for me.

One thing I wanted to point out is that this function is a bit more aggressive in blocking bots than might at first be apparent. For instance, I set it up to allow up to 6 requests every 3 seconds. I found immediately that Googlebot was being blocked even though it was making only one request per second. The reason is that the "last_attempt" value is updated at every request so that the number of attempts keeps going up. One way around this is to only update the "last_attempt" value when the number of requests is reset.
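(Roughly, the change described here amounts to something like the following. The field names follow the discussion above, but the surrounding code is illustrative rather than the actual function.)

<!--- illustrative only: reset the window (and its timestamp) only once it has expired,
      rather than refreshing last_attempt on every request --->
<cfif dateDiff("s", application.rate_limiter[ip].last_attempt, now()) gt windowSeconds>
    <!--- window expired: start counting again from now --->
    <cfset application.rate_limiter[ip] = {attempts = 1, last_attempt = now()}>
<cfelse>
    <!--- window still open: count the request but leave last_attempt alone, so a client making,
          say, one request per second doesn't keep accumulating attempts and get blocked --->
    <cfset application.rate_limiter[ip].attempts = application.rate_limiter[ip].attempts + 1>
</cfif>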

Incidentally, I ended up rewriting the function a bit and changing it to use ColdFusion's new built-in cache functionality to automatically prune old IP addresses. Thought I'd throw it out there:
https://www.modernsi...
# Posted By David Hammond | 4/2/12 3:24 PM
Hey David, thanks for that, and I'm sorry that I missed this when you posted it a few weeks ago.

Good stuff. Thanks for sharing. I'll do some testing of your recommendation about last_attempt and will tweak the code above after that.

Finally, the use of CF9 caching is nice, though I did want to write something that worked more generically, across CF versions and CFML engines.

Still, folks should definitely check out your variation. Two heads are better than one! :-)

PS: I suppose we may want to consider posting this somewhere it can be better shared, tweaked, etc., but in this case I think yours would have been a fork, being so different, so there would still end up being two out there--I don't know that it's really worth our bothering. :-)
I'll add as an update here that there seems to be some debate in the IT community about whether such a rate limiting scheme should return a 503 or 403, or some subpoint of that (like 403.8), or even 429. There's seemingly no unanimous answer.

One thing I've learned is that 500-level errors are meant more for "server" errors, while 400-level ones are meant more for "client" errors, where the client made a mistake (as in this case, where the client is making too many requests).

It has been encouraging to see other tools referring to this same approach I've outlined above.

For instance, there is now an IIS setting for this, in IIS 7's "Dynamic IP Restrictions" module. More at http://learn.iis.net..., in the section "Blocking of IP addresses based on number of requests over time" (where they show returning a 403.8).

[As an update in 2021, regarding that last mention of the IIS feature, in IIS 8 and above it's now built into IIS. See my discussion of this above, as a new section in the blog post, "Rather than use my code, you may want to have your web server do the throttling". I wanted to update this comment here, rather than offer a new one that would appear well below it, to help folks using IIS 8 and above not get that older module. And though long links within comments can look ugly, here is a link to that new section above. https://www.carehart...]

Finally, here is a Ruby implementation, http://datagraph.rub..., which also discusses the debate between returning a 403 or 503 code.

So clearly it's not a new idea, and I guess we'll see as the industry moves toward more of a consensus on the best status code as well as other aspects of such blocking, but at least it validates the approach I put forth. :-)
Adding to this 2010 blog entry on throttling rapid requests by frequently-visiting clients, I'll note that there is an IETF standard which discusses the exact same concept, including the use of the 503 status code and retry-after header that I used in that code:

http://tools.ietf.or...

And I was made aware of it in reviewing the API for Harvest (my preferred timesheet/invoicing system), who use the same approach for throttling rapid requests. Just nice to see I wasn't off in left field, and that even today this seems to be the preferred approach for throttling. (If anyone has another, I'm open to ideas.)
Those considering my rate limiting code here may be interested to hear (as I discovered just tonight) that the code has been embedded in ContentBox (since early-to-mid 2015), https://github.com/O..., and for those using ContentBox, its use is controlled in the CB Admin, under the System > Settings menu, in its Security Options section.

And it seems that it came to be because Brad Wood had found and used the code in his own PiBox blog, which he discussed here: http://pi.bradwood.c...

Glad to see the code helping still more people beyond just as shared here.
Charlie, I made some minor tweaks to your limiter.cfc to use CacheGet/Put (for automatic collection flushing), return HTTP status code 429 "Too Many Requests", and enable a "none|all|list" cookie filter.
https://gist.github....
# Posted By James Moberg | 2/10/16 9:37 AM
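(The snippet below is not James's gist, just a rough illustration of what a cacheGet/cachePut-based counter can look like, so that entries for idle IPs age out of the cache on their own. It assumes maxAttempts and windowSeconds are defined as in the earlier sketches.)

<!--- rough illustration only: keep each IP's counter in the built-in cache so idle entries
      are flushed automatically instead of accumulating in the application scope --->
<cfset cacheId = "ratelimit_" & cgi.remote_addr>
<cfset entry = cacheGet(cacheId)>
<cfif isNull(entry) or dateDiff("s", entry.windowStart, now()) gt windowSeconds>
    <cfset entry = {windowStart = now(), attempts = 0}>
</cfif>
<cfset entry.attempts = entry.attempts + 1>
<cfif entry.attempts gt maxAttempts>
    <cfheader statuscode="429" statustext="Too Many Requests">
    <cfheader name="Retry-After" value="#windowSeconds#">
    <cfabort>
</cfif>
<cfset cachePut(cacheId, entry, createTimeSpan(0, 1, 0, 0))>  <!--- entry expires an hour after this IP's last request --->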
James, thanks for that, both putting it into github and making the tweaks you did. A couple of thoughts:

First, I should note that David Hammond had done much the same thing, changing it to use ehcache, as discussed at https://www.modernsi... (and mentioned it in a comment above, both in 2012).

And he hadn't put it into github. Of course back then and indeed in 2010 when I first posted this entry, it wasn't as much "the thing", but surely putting it there is helpful, not only to make it easier for folks to tweak and discuss, but even just to get the code (versus using copy/paste from blog posts like this). I'd made a comment to that point in reply to his, but neither of us got around to doing it, it seems.

Second, besides comments from others and myself here, there are surely many other ideas for improvements that I foresaw (and mentioned in the blog post). Folks who may be motivated to "take this ball and run with it" and try to implement still more tweaks should review both my post and the comments (since those don't appear in the github repo).

For instance, I see you changed it to use status code 429, versus the 503 I used initially. You may have noticed I discussed that very point in a comment (also in 2012): http://www.carehart...., where I showed different tools like this (far more popular ones, like those embedded in web servers and app servers) that used a variety of values: 503, 403, 429. I then added the next comment after it, in 2014, pointing to an IETF standard suggesting 503 and a retry-after header.

But you say in your gist comments that you chose 429 based on "best practices". I'd love to hear if you found something more definitive on this, since it was up in the air in years past. :-)

As for ehcache, it just wasn't an option when I first wrote the code in 2009 (CF9 came out later that year), and I didn't think of it when I posted the entry in 2010. Even so, lots of people would still have been on CF8 or earlier at the time, so I was leaving it generic enough to work not only on multiple versions of CF but also on other CFML engines of the time (Railo, BlueDragon, and since then Lucee).

BTW, James, speaking of older comments here, did you ever get around to posting the related scripts you'd mentioned in a comment in 2010 (http://www.carehart....)? If not, I'm sure those could be useful for folks also.

I do see that you have over 100 gists posted there on github, but it only lets us see 10 at a time, and doesn't seem to offer searching just in a particular person's gists, so I couldn't readily tell on my own.

Keep up the great work, and I hope to find time to get involved in using and perhaps contributing to the version you've posted.
We've been stripping ".cfm" from our URLs using IIS Rewrite Module and using CFLocation w/301 to redirect users (and bots) to new URLs. To avoid being negatively impacted by the rate limiter, I wrote a CF_NewLocation tag that mimics the CFLocation tag and rolls back the "attempts" count by "1". This will prevent automatic/immediate redirects from triggering the rate limiter script and blocking access when users come to the website using outdated links.
https://gist.github....
# Posted By James Moberg | 5/18/17 12:20 PM
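(This isn't James's gist, just a rough idea of what such a wrapper might look like, assuming the same application-scope structure used in the sketches above; the tag name and file path are illustrative.)

<!--- customtags/newlocation.cfm, illustrative only: roll back this request's count before
      redirecting, so an immediate 301 doesn't count against the visitor's limit --->
<cfparam name="attributes.url" type="string">
<cfset ip = cgi.remote_addr>
<cfif structKeyExists(application, "rate_limiter") and structKeyExists(application.rate_limiter, ip)>
    <cfset application.rate_limiter[ip].attempts = max(application.rate_limiter[ip].attempts - 1, 0)>
</cfif>
<cfheader statuscode="301" statustext="Moved Permanently">
<cfheader name="Location" value="#attributes.url#">
<cfabort>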
Cool stuff, James. Thanks for the heads-up and your work to improve and to share things.