CF911: Lies, Damned Lies, and CF Request Timeouts...What You May Not Realize
Note: This blog post is from 2010. Some content, links and indeed comments from others may be outdated--though not necessarily. Corrections are welcome, in the comments. I may revise the content if necessary.How often have you seen (or seen others complain of getting) a CF page running longer than it's "supposed to" by a timeout you have set. Maybe you've set the CF Admin "request timeout" (first setting on first page of the Admin), or used the cfsetting requesttimeout tag or the timeout attribute on some specific tag if it's available, trying to get the request to "end" in 60 seconds, and yet you see a request running for 3 minutes, 3 hours, or 3 days! How can that happen?
Or same with if you've set the request to timeout using an alerting feature in a CF monitor like CF Enterprise server monitor, FusionReactor, or SeeFusion.
And perhaps you've seen this error from ColdFusion, in your logs or on-screen:
Do you know what this means? It's usually not what you think, and it may appear as I said 3 hours after a request was "supposed to timeout" in 60 seconds. I've even seen experienced CF developers who get thrown by this challenge. It's not new (and for those reading this even in the CF2016 era, it still happens). And it's not so much a "bug" (in either CF or the monitor tools) but just a situation that you need to understand, and there can be some ways to resolve things.
In this entry I'll try to help explain this surprisingly common problem and I hope to correct some equally common misconceptions. I'll even contend that the info in this error message is often useless and indeed misleading (and therefore the feature producing it ought not be relied upon completely, and should perhaps even be turned off for many). More important, again, there may be a way to "really" kill such a long-running request. Along the way, I'll share some things that I've not seen documented elsewhere.
I also share a solution that may work for query processing but it's NOT about a tag attribute but rather a CF Admin setting in the datasource "advanced settings" to set a "query timeout". This was added in CF 9, but many never noticed. If that's your problem and you want to skip to more on that here, feel free. but you may want to come back and read the rest as it is STILL not a perfect solution.
Strap on your seatbelts. We're going for a bit of a ride (if this situation was easy to understood in the length of a tweet, then perhaps everyone would already understand it and not find it challenging!) As always, I welcome feedback.
A) What the error does and does not mean, typically
So people are often mystified by an error like the above, wondering, "Why in the heck would a CFOUTPUT take a long time?"
Or perhaps they're a little more savvy as to what's happening, and they assume, "No, it's just that the CF timeout time was reached when it got to the CFOUTPUT". That could be.
Sadly though, in most cases, neither is what has happened. CF is usually NOT reporting that "here where the app timed out".
"OK, smarty-pants. What does the error really mean? Are you saying that CF is lying to me?" Well, often, yes, I'm afraid so, but it's not something nefarious.
Rather, what's more typically the explanation is that some previous activity in the page/request, such as a CFQUERY, CFHTTP, invocation of a web service, or the like is what really took a "long time".
If it was this which caused the request to exceed the timeout (either as defined in the CF Admin Settings page, or using CFSETTING RequestTimeout, or a Timeout attribute on a tag), you'd of course expect CF to report it then and there. The problem is that, often, it cannot report it "right then". And it's not its fault.
B) There are simply some operations which CF/the JVM cannot interrupt
So the crux of the problem is that CF (and the JVM) cannot interrupt a request while it's processing what's called a "native method". That is quite typically the mode that a request is in while it's waiting for a reply from a CFQUERY, CFHTTP, and so on. These operations talk to something outside of CF (like a database with CFQUERY, or another server with CFHTTP or web service call--which could even be requesting a page from the same CF instance, but technically the underlying Java httpclient process doesn't know that.) It could also happen with file or network operations.
So the request will wait for this long-running operation to finish. It can't stop it, not with the CF Admin request timeout, not with CFSETTING RequestTimeout, not with the kill features in the CF Server Monitor, FusionReactor, and SeeFusion. Nothing. It's like the Anti-Terminator: "it absolutely will not stop" (can't be terminated) until its task is completed.
C) So what happens when the long-running operation finishes? Is that when the request times out? An example, and more
"Ok, I got it. The long-running operation (CFQUERY, CFHTTP, whatever) will not stop until it's finished. What happens then?"
Well, you see, that's where the confusion comes in. Let's use an example to make things crystal clear.
Say that the CF admin timeout is 60 seconds (not at all uncommon), or perhaps you set the timeout to 60 for a given template using CFSETTING. Anyway, let's say that the request in question gets 2 seconds into processing when it starts running a long-running query (for example). Let's say that query then takes 75 seconds. When the query is done, the request has now run for 77 seconds, which is 17 seconds beyond the timeout time.
We already know that it won't stop the CFQUERY itself (at least until the query is finished). But guess what: it also will NOT report that the timeout has been exceeded on that line (whatever it was, CFQUERY, CFHTTP, etc.) From my experience, CF doesn't check the time against the timeout at the end of operations, but rather at the beginning.
So instead, it will proceed to the next line of code. You'd think, "ok, then, it will stop on whatever is the next line of code and give you the error there, right?" Sadly, not necessarily, and it only adds to the confusion of the timeout message.
CF checks the time at the start of the next operation, but sadly only on SOME tags
So it's bad enough that it won't report the error on the tag that DID run long. Instead, we saw that it will proceed to the next tag/function. But curiously (tragically), CF will often NOT stop on THE next line of code.
Instead, I've observed that it only seems to check the time (against the timeout) at the beginning of CERTAIN tags, such as CFOUTPUT, CFLOOP, CFQUERY, and so on. Yes, I'm saying that I've confirmed that it will skip over various other tags (such as CFSET, or CFSCRIPT code, and more). I've not yet found any documentation as to the details of this.
So this is where the error gets confusing
So the bottom line is that not only does the the request NOT stop on the tag/function that took a long time, it doesn't even stop on "the next line" after that, which can make things all the more confusing/challenging to resolve.
Indeed, this is why you often see the error reporting as having occurred on a tag other than what was really the problem, and why you also can't just look at whatever was *the* line of code preceding that.
So what can you do with this information?
I don't mean to paint an entirely bleak picture. All is not lost. It's just a little more challenging than it should be.
At least first of all you can now know that when you see this error, you should NOT assume that it's reporting the line that really caused the problem. You can and should consider whether some earlier operation in the code could have taken a long time. In my experience, this is usually the situation.
I'll talk in a moment about some other tools that can help you understand where the time is really being taken. First, I do want to offer a clarification, lest anyone read my meaning too literally.
Are you saying the error message is always lying?
Well, no. You'll notice that I peppered my opening paragraphs with "usually", because it's certainly possible that a request could indeed be stopped on the very line that DID exceed the timeout.
Consider in our example that if the long-running query had run for only 57 seconds. Now, since it had taken 2 seconds before that, it now is one second short of timing out. Let's say the request then proceeds to loop over a query resultset or do some other operations that might take it a couple more seconds. When it does finally exceed the timeout, it may well happen right on the very tag that CF Reports as having "crossed" the timeout time.
But given the problem of how it only reports that on some tags (and not all), it could still be in this situation that it reports the wrong line of code. Just consider all the above as you evaluate what to make of the situation.
D) So how can I know what tag did take a long time? either while the request is running or after the fact
So we know we have a long-running request. We understand now that its some tag or function or statement or other feature that is somehow itself hung up (typically).
So how can you know what specifically CF is waiting for? or perhaps why it DID take a long time, if it finished in the past? This is a bit more challenging. The good news is that there are tools that can help, including the CF Enterprise Server Monitor, FusionReactor, and SeeFusion.
Let's focus first on using these tools to catch requests while they're still running, which could be valuable if your server is hanging up because of some long-running requests. Then we'll talk about using the tools to capture the same information and make it available by email to review later.
FusionReactor and SeeFusion might easily show what a request is waiting for
Both FusionReactor and SeeFusion let you drill into a given request while it's running to see some details about it.
(Update: In the case of FusionReactor, it automatically monitors each CF datasource as of FR 5. SeeFusion requires configuration of each datasource to be monitored, even as of SF 5.1)
And with each, you can show you clearly if a given request is running a query. Since that's about the most common cause of hung requests, just having that information alone is valuable. And both tools also keep a history of recent slow requests, so you can also use these to see why some recent slow requests were slow.
[Update: later releases of FusionReactor also track separately the time spent waiting for a CFHTTP as well as some other things. See the "relations" tab within a request's details, whether while it's running or afterward.)
While the CF Server Monitor can also show if a request is running a query, it only does that if you turn on "start profiling". I will talk later here why that's generally not a good idea for busy sites, and it's indeed off by default. The FR and SF means of doing such query monitoring is far less impactful by default.
Shine a spotlight on what the request is doing AT A GIVEN MOMENT, via stack tracing
But sometimes these tools will help you see that the reason a request is running slow is NOT because of a query (or cfhttp either in the case of FR).
In that case, and in all three tools, it IS possible to identify exactly what a request is doing at a point in time, by "stack tracing" that request.
This is actually a feature built-into the JVM, which is exposed easily by these tools, but missed entirely by many. More than that, some misunderstand stack tracing as something only shown at the bottom of error pages. (That is indeed a stack trace, but it's not nearly as useful as what I'm referring to here, which is for getting information on request while they're running, not when they have had an error.)
Stack tracing a running request will allow you to see exactly what line of CFML (if any) is running at that very moment, which again can be vital for resolving problems of long-running requests.
Stack tracing in CF Enterprise Server Monitor
First, if you run CF Enterprise (8 or 9), you can use the CF Server Monitor to watch requests while running and see more details about them. If you use the available "start monitoring" button, you can see what requests are running in "Active Requests". Further, if you enable "start profiling", then if you double-click a running request, you can see in the middle of the next page a "stack trace", which shows the exact line of code that was executing at the time you double-clicked the request.
(Again, yes, I'm aware of the potential overhead of using various features the Server Monitor, though some people do over-state it in my experience. I'll point to other resources I've done on the Monitor, where I discuss its pros and cons, in a moment.)
Of course, viewing that stack trace at a random point in time during the life of a request could well mean simply that you'd see it executing just any random line, where perhaps a millisecond later CF will have moved on to another. The key is to refresh the stack trace, to see if CF indeed HAS moved on to a new line. If not, that line would be a smoking gun to investigate. Sadly the refresh icon in the Monitor doesn't update the details while viewing a running request. You need to go back to the list of active requests, open the request again, and repeat your observation.
Stack tracing with FusionReactor and SeeFusion
Fortunately, tools like FusionReactor and SeeFusion make that refresh a lot easier, to obtain a stack trace while request is running. Each offers a button to take a stack trace of a running request. From the page they show you can usually determine the line of CFML code that's running (they each offer a little more stack trace detail than the CF Server Monitor does, but I'll point you soon to a resource to help you better understand them.)
More important, each of these tools offer a refresh button to refresh the stack trace, so that you can properly determine if the line that's executing has changed while you're refreshing.
That said, I will note that FusionReactor offers an important advantage with respect to that refresh operation: it ties the stack trace display to the specific CFML page that was being viewed (in Running Requests page) when you selected it. So if that request ends while you're looking at its stack trace, and you refresh it, FusionReactor will report that it's finished.
SeeFusion, on the other hand, would not. It knows only the thread id on which the requested page was running, so that if the request ends and you refresh the stack trace, it only knows to refresh the stack trace for whatever request is running on that thread. It can't (and won't) tell you if the given request has in fact ended, so you could now be looking at a new (and different) request, which could be quite confusing in this situation. It's incumbent upon you to notice (when using SeeFusion) that the stack trace you see in indeed for the same request you started with. (FR gives each request its own internal request id, which is how it avoids that problem.)
Catching running requests details when you're not watching the monitor tools
Of course, it's only possible to use the stack tracing features above if you can be on the server running the monitor tools when the problem occurs, right?
Well, not exactly: all three tools offer features to watch for a long-running request which can then send you notification by email of the details that would include a thread dump, which is a stack trace of all running requests.
In the CF Server Monitor, these are called Alerts. FusionReactor refers to them as Crash Protection notifications, and SeeFusion refers to this as "Active Monitoring Rules". See the documentation for each tool to find more information.
Learning more about stack tracking and the monitor tools
For more on all this, I discuss the idea of taking stack traces and thread dumps (which is a list of all stack traces for all current threads) in another blog entry.
I also discuss the CF Server Monitor, FusionReactor, and SeeFusion in several blog entries. The links just used are to the respective categories about each here in my blog. I've also discussed these topics (monitoring, stack tracing, and more) in various articles and presentations I've done.
The step debugger
Some may point out that you can also get an idea of the time spent on any tag/function within a request if you use the interactive Step Debugger (whether that built into CFBuilder or the commercial FusionDebug alternative [update:, or as available via the web UI in FR 6 Ultimate). As you step through the code, it would be clear if you got "hung up" on a line, though I don't know that I'd favor this as a solution here. Still, I've discussed these also in various blog entries, articles, and presentations.
(Sadly, you can't rely on the typical end-of-page debugging output, as enabled in the CF Admin, because that output is only shown if the request completes. We're referring here to pages that end in error.)
E) Can't I force CF to timeout some specific tags?
Again, in my experience (as I focus on CF server troubleshooting as a consultant), the root cause of problems in most "long-running" requests in CF is that some one tag or function is running long.
So could we perhaps force CF to timeout that specific tag? Well, yes and no.
You can, in fact, (and should) consider whether the tag in question might have its own TIMEOUT attribute or feature (and whether it will really help, as I'll explain.) Let's look at each of them.
Setting Timeout on CFHTTP, CFINVOKE, and others
There is indeed a TIMEOUT attribute on CFHTTP. Unfortunately, it won't ALWAYS keep the operation from exceeding that timeout. I've not quite put my finger on it (just haven't experimented completely), but if I had to guess, I'd say that it could be that if the operation is in the midst of returning data (from the server to CF), then it could perhaps time it out, whereas if it's waiting for the output then it may not be able to. Anyone know for sure?
There is also a TIMEOUT on CFINVOKE (for use when calling web services). Curiously, though, there is no TIMEOUT for use with CFOBJECT when calling web services (try it, it won't work, and none is documented). More curious still is that there IS a timeout available for use with createObject() (when calling a web service), though only by way of an argstruct argument that's new in CF8, which I have blogged about. Note as well that, according to the docs, that only times out the process of obtaining the WSDL, not the execution of any method in the web service.
There are also timeouts on various other operations that talk to something outside of CF (cfmail, cfftp on open/close operations, cfldap, cfpop, cffeed), though again it seems reasonable to expect that these may not always honor the timeout at the exact time given, as discussed above.
What about the elephant in the room, CFQUERY? Well, yes, it does have a TIMEOUT attribute, but many have found that it often does not timeout the query. Like the CFHTTP, I wonder if it may be a question of whether it's waiting for output (which likely can't be interrupted) or starting to receive it (which likely can be).
I will note that there's some promise in this regard, though for now not from CFML itself, but rather from the updated JDBC drivers in CF9.
And it is enabled via the addition of a new timeout option in the CF Admin Datasource Advanced Settings page. You'll see that there is a new "query timeout" option that was not in CF before 9. I have blogged about it in more detail.
It's not perfect: first, note that it's only available for the CF-provided db drivers (which are themselves licensed from DataDirect, which added the option, and the CF admin exposes it as setting.) Second, people are reporting different experiences with it (see the comments in the blog entry), and note (more important) that for now there seems no corresponding connection between this and the CFQUERY TIMEOUT attribute. (As I note there, I have raised a bug about this.) Still, it may be better than nothing and could help many, if you're on CF9 or above.
F) So is there really nothing I can do for the hung requests?
OK, so we've explained why the requests don't timeout, often because they're talking to some remote process that is not responding. But what CAN you do when you're in this boat? Well, other than trying to add timeouts to the code as discussed above, generally nothing, at least for the requests that are already running.
And certainly a restart of CF will kill them off, or at least stop CF trying to talk to the remote process. (Of course, it's possible that upon restart, new requests will come in and try to connect to the same non-responsive or slow-responding remote process, so it could come right back.)
Stop the request on the remote server
But while you can't do much from WITHIN CF for these hung requests, there's one other way you may be able to stop the madness: stop the request on the remote server.
Once you can determine exactly what tag it is that's hung up (with the stack tracing tools above), you could then target whatever it was waiting for: the database server, a remote page called via CFHTTP, an exchange server using CFLDAP, etc.
Since the tools that let you stack trace the running request also show you the time the request started, you could use that info to go to the administrator of whatever service you're calling and ask if THEY may be able to kill the request. As soon as what you're waiting for stops, the CF request will continue. (Of course, it may only continue for a few milliseconds before it will be timed out by CF, as I discussed above, which is why I'm no fan of the CF request timeout feature, and think it should be turned off. More on that in a moment.)
Beware: you may not always find the remote server still "hung up"
Back to this issue of finding and killing the remote process that CF may be waiting for (that's causing your hung request), I should note that there may be times when you would go to the remote administrator and say, "look, I have this long-running CF requests that's waiting for this process (query, ldap request, web page, etc.) that is waiting forever for something that is running long on your server". And they may look and see nothing on their ends that's running long. Doh!
Yep, it can happen, for various reasons, so just be sensitive to this. You may really then have no way at all to kill the hung request. But note, again, that you may be able to use this observation to do something more to prevent the problem in the future, perhaps on the remote server side.
For instance, I've heard some describe problems where CFQUERY processing has hung talking to an Oracle database and (if I've got it right) the problem is an inconsistency between the CF datasource connection timeout and Oracle's "session" timeout. If anyone has more details on that, please do share.
But my point is simply that CF may be "waiting" for a call that will never be answered and can't be terminated from the other end. Again, in such cases, you can only kill them by restarting CF, and then you need to investigate how/why the call to the remote server are getting hung up in the first place. That's where logging information for diagnostic purposes may really come in handy, as is discussed next.
Logging what CF is getting hung up, to show to the remote administrator
If this problem (of calls to remote servers that take too long or get hung up) is happening often, and/or you can't always be logged in to see when it's happening using the tools above, another idea is to log for yourself whenever you make such a call to a remote server (that you know tends to hang up), such as putting a CFLOG statement before and after the CFHTTP, CFLDAP, CFQUERY, etc.
At least then you'll be able to see when it does and doesn't take a long time. The log would also help you by showing when it logs a start but no stop.
You could also code it so that it only logs when it's slow, but being able to confirm that it's generally fast and only sometimes slow may be itself useful diagnostic info.
Note that CF 9.0.1 by default adds new logging that does automatically log the start and end of calls to cfhttp, cffeed, and more to corresponding new logs (http.log, feed.log, etc.), which could also help.
Finally, as for logging the queries, you can get that from FusionReactor and SeeFusion automatically, as their "jdbc wrapper" features allow you to log every query (or optionally only those slower than a certain time). There is also a new "log activity" feature in the "advanced settings" of a CF datasource definition that could also log DB activity, though it is quite verbose and a tad unwieldy (not one line/row per query like the other two tools).
G) Bottom line: I'm no fan of request timeout features
So all that said, I'll repeat and clarify that I'm no fan of request timeout features, not that in the CF Admin, nor that offered in CF monitoring tools that offer to "kill requests" automatically, like the CF Server Monitor Alerts, FusionReactor's crash protection, and SeeFusion's active monitoring rules. I don't think they should be used, personally.
Let me be clear: I do love those tools and use them and help people use them daily. And I do love and highly recommend the features in those tools for sending you *alerts* when requests exceed a given time. What I don't like is them trying to kill them automatically, for all the reasons I outlined above. So I tell clients to turn off the "timeout requests" feature (though it does still make sense to use TIMEOUT attributes on certain tags, or may make sense to implement the CFSETTING RequestTimeOut on some page where you know that the reason it runs long is not one of these things that can't be killed anyway.)
Instead, I recommend (and help my clients daily) to use the alert info (from the CF Server Monitor, or FR or SF) to be notified if/when requests ARE taking too long--and NOT to kill them. Note that these tools all send the notice *as soon as* the request takes too long (whatever time you set), whereas CF's "log slow requests" feature only logs when requests end--and that's only IF they do end without failing.
So yes, get notified that requests are taking too long. Use the info in the alerts, which includes the stack trace info I discuss above. Do find and resolve the problem. Don't rely on (or in my opinion even use) auto-kill features, when in fact they nearly never are able to kill really problematic requests anyway.
Yes, yes, I do realize that there are some requests that CAN be interrupted by these timeout/kill features, but I'll assert that such requests are far less commonly the cause of any serious problems. Your mileage may vary, of course. But I make my statement based on several hundred instances of helping folks solve typical CF server problems.
So why is the "timeout requests" setting there?
One last thought worth considering: someone might reasonably ask, "Charlie, why are you such a hater of the setting? If Adobe has it there, it must be for a good reason."
Here's what I'd say to that: sure, when CF originally ran on C++ (prior to CF 6), perhaps this setting could be reasonably relied upon to ensure that requests would not run any longer than the set time. (I don't recall, but perhaps even then there may have been at least SOME tags that it couldn't interrupt.) But clearly since CF 6, in the Java model, this is no longer the case.
And yet if you read the Admin page, or its help, or the docs, or the comments from nearly anyone who considers the setting, the presumption is that this WILL stop requests from running longer than the x number of seconds indicated.
Why am I so impassioned/manic about this?
I hope I've made clear in this entry why I think that's not only wrong to conclude (in nearly all cases), but worse it sets up a tragic misconception of how CF works. If you think this should and will stop long requests (or that the alert features of the monitors will kill them), then you're going to be in for a shock when requests do hang up for an extended period of time. What are the implications?
- You may totally under-estimate how many simultaneous request threads you should enable.
- You may never pay attention to tools like CFSTAT or jrun metrics (to observe at least "how many requests are running" at any given time), which will help you see if/when requests are hung.
- You may never bother to learn how to use the CF Server Monitor (or FR or SF), all of which can go still further and show not just how many requests are running (possibly hung) but a) how long, b) what the URL is, c) what the IP address is, and so much more, which can help you find and resolve problems.
- You may never bother to learn how to do the stack tracing that I discuss above, which is often vital to understanding where and why any given request is hung (or was at the time an alert was thrown)
- You may never bother to analyze logs that show the activity patterns (how many requests are running at periodic intervals, such as the FusionReactor "resource log" reports.) It's really THAT information that is vital to your understanding what to set for your simultaneous requests setting.
- and so on
All of this info (and understanding) is VITAL to a very important and common class of CF server troubleshooting: why is CF up but not responding? It may be that requests are hung.
But if you assume, "well, they can't be running any more than x seconds", then you'll start to think "so it must be something else", and you figure you may as well just restart CF. Or you start reading about how someone suggests you change your JVM settings (which may have NOTHING TO DO with this problem, and not only not solve it but could cause new ones), and so on.
Again, I see this all the time. I hope by this entry to have helped avoid some of the very common misunderstandings on this subject that I frequently see either on lists, or in emails to me, or in my consulting engagements. If I seem passionate about it, it's because I am. Same with the memory issues I discuss in the related and similarly titled entry, CF911: Lies, damned lies, and when memory problems not be at all what they seem, Part 1.
H) Need More Help?
I mentioned above that I provide CF Server Troubleshooting consulting. If you need some help understanding how to apply the information above to your specific problem (or need help with any CF server, or CFBuilder, problem), I'm happy to help.
I don't need to come on-site, nor do you need to give me remote access. Instead, we can work easily and securely right over the web using Adobe Connect.
And I don't have any minimum time-block requirement--and I even offer a satisfaction guarantee. To learn more, including rate plans, see my consulting page. (I hope some will forgive this brief commercial here. I don't generally mention it, but since some say that they didn't know I offer such services, it seemed an appropriate point to mention it.)
So phew, another really long blog entry. But I hope it may help some people (and help some who help others).
As always, I welcome your feedback, corrections, additions, etc. Really, I ask for your feedback. If it helped, please say so. My blog doesn't get the traffic of many others. I often see that hundreds of people have read things, but few ever comment. I can't know if it's that I've answered every question (I can hope so), or that you weren't impressed. Like the guy said in Dirty Harry, "I gots to know". :-) Sometimes, all it takes is a few people to "prime the pump" and start commenting to lead others to do so. Why not grab the handle? :-) And if you think this would be helpful info for others, please do share it (tweet about it, mention it on mailing lists/forums when you see the problem raised, etc.)
I'm planning to better organize and package CF server troubleshooting resources (mine and others). We have a lot of great info out there for those solving CF problems. It can just be a challenge to sort through it all. I hope to help solve that. Look for more news to come on that front in time.
For more like this: