Middleware Decoded: WebLogic Stuck Threads 101

WebLogic's thread monitoring can be confusing for people who are just getting started with this application server.

From experience, the indicator triggering the most alarm is the stuck thread warning. I'm not saying that we shouldn't pay attention to it. On the contrary, we definitely should, but we also need to understand what we're dealing with and what it really means.

There is extensive literature available on this topic, but articles tend to focus on just a few aspects of it, so you end up searching for the rest elsewhere. I'll try to compile as much information as possible, focusing on the following key aspects:

Monitoring
What it really means
What can you do

Monitoring

In the WebLogic console [under server → monitoring → threads] we can find the thread monitoring page. Here we can access and monitor all the threads present in the system and their corresponding states. This page shows the following thread indicators (amongst others):

Active Execute Threads - number of threads currently processing requests.
Standby Thread Count - number of threads visible from the JVM thread dump but not yet available to process client requests (waiting to be marked 'eligible').
Idle Execute Thread Count - number of Active threads currently “available” to process a client request.
Hogging Thread Count - number of threads taking more time to finish than average. The current average execution time is calculated by WebLogic.
Stuck Threads - number of threads that have exceeded the execution time configured for stuck threads (we'll discuss them in detail next).

Whenever checking the system, one should take the whole picture into consideration. A key balance to watch is whether there are enough standby and idle threads available, so that no requests are left pending or rejected due to overload.

Stuck threads only become relevant within this context. If the server is not configured to perform any action (such as a shutdown or restart) once a specific number of stuck threads has been detected, their importance is relative. For instance, we may have a system where requests are being rejected because neither standby nor idle threads are available, yet no stuck threads are detected.
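To make that last point concrete, here is a minimal Java sketch of reading the indicators together rather than looking at the stuck count alone. The class, field, and enum names are hypothetical and only model the counters described above; this is not a WebLogic API, and the thresholds are assumptions chosen for illustration.

```java
/**
 * Hypothetical snapshot of the thread pool indicators described above.
 * Names are illustrative only; they are not WebLogic API names.
 */
final class ThreadPoolSnapshot {
    enum Assessment { OK, STUCK_BUT_SERVING, STARVED_NO_STUCK, STARVED_WITH_STUCK }

    final int idleThreads;
    final int standbyThreads;
    final int stuckThreads;
    final int pendingRequests;

    ThreadPoolSnapshot(int idleThreads, int standbyThreads, int stuckThreads, int pendingRequests) {
        this.idleThreads = idleThreads;
        this.standbyThreads = standbyThreads;
        this.stuckThreads = stuckThreads;
        this.pendingRequests = pendingRequests;
    }

    /** Combines the counters instead of reacting to the stuck count in isolation. */
    Assessment assess() {
        // Assumption: "starved" means work is queued and no idle or standby thread can pick it up.
        boolean starved = pendingRequests > 0 && idleThreads == 0 && standbyThreads == 0;
        if (starved) {
            return stuckThreads > 0 ? Assessment.STARVED_WITH_STUCK : Assessment.STARVED_NO_STUCK;
        }
        // Stuck threads exist, but the pool still has capacity to serve requests.
        return stuckThreads > 0 ? Assessment.STUCK_BUT_SERVING : Assessment.OK;
    }
}
```

For example, `new ThreadPoolSnapshot(0, 0, 0, 12).assess()` yields `STARVED_NO_STUCK`: the pool is in trouble even though the stuck thread counter is zero.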

So, what's causing these stuck threads and how does the system detect them?

What it really means?

A thread can be marked as stuck due to a combination of two factors:

A sequence of instructions running for some time
The amount of time a thread needs to be running before WebLogic marks it as stuck (which can be configured, see picture below)

According to the configuration above, if a thread is busy for longer than 10 minutes it will be marked as stuck. This does not mean stopped or blocked. People often worry that these threads are deadlocked or blocked for some reason, but that may not be the case. It only means that the thread has been running for a long time. Stuck threads may still be doing useful work, and they may go away as soon as the task is done.
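The rule boils down to a plain time comparison. The following Java sketch shows the idea under stated assumptions: the class and method names are hypothetical, and this is not WebLogic's actual implementation, only a model of "busy for longer than the configured limit".

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical model of time-based stuck detection; not WebLogic's own code. */
final class StuckThreadCheck {
    private StuckThreadCheck() {}

    /**
     * A thread is flagged once it has been busy with the same request
     * for longer than the configured maximum. Nothing here inspects
     * whether the thread is blocked, deadlocked, or doing useful work.
     */
    static boolean isStuck(Instant requestStarted, Instant now, Duration maxStuckTime) {
        Duration busyFor = Duration.between(requestStarted, now);
        return busyFor.compareTo(maxStuckTime) > 0;
    }
}
```

With a 10 minute limit, a request that started 9 minutes ago is not flagged, while one that started 11 minutes ago is. Note that the check only looks at elapsed time, which is exactly why a "stuck" thread may be perfectly healthy.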

When stuck threads are detected, a warning sign will be visible on the "System Status" and in the thread monitoring "Health" column. You can also identify the stuck thread in the thread pool table: it will be the one with true in the "Stuck" column.

What can you do?

Let's say you've got an alert saying your WebLogic server instance is in a critical state due to stuck threads. This is where a key point comes into play: you need to know the system you're analysing and what is running on it.

If you're expecting some tasks to run for longer than the max stuck thread time, perhaps the best option would be to increase this value or create a dedicated work manager, which could then be set to ignore these long-running tasks.
On the other hand, if this is not expected, then the next step should be to identify the stuck thread and pinpoint the root cause. A thread stack dump can help show which lines of code that thread is executing and get you closer to the cause of the problem.
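If you can run code inside the server JVM, the standard `java.lang.management` API offers one way to capture stack traces for selected threads. The sketch below is a minimal example under assumptions: the class name is hypothetical, and filtering on the text `[STUCK]` assumes the server marks stuck threads that way in their thread names, which you should verify against a real dump from your own environment.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that dumps stack traces of threads whose name contains a filter. */
final class StuckThreadDump {
    private StuckThreadDump() {}

    static List<String> dump(String nameFilter) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<String> result = new ArrayList<>();
        // false, false: skip lock details, which not every JVM supports.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info == null || !info.getThreadName().contains(nameFilter)) {
                continue;
            }
            StringBuilder text = new StringBuilder();
            text.append('"').append(info.getThreadName()).append("\" ")
                .append(info.getThreadState()).append('\n');
            // Each frame shows the class, method, and line the thread is executing.
            for (StackTraceElement frame : info.getStackTrace()) {
                text.append("\tat ").append(frame).append('\n');
            }
            result.add(text.toString());
        }
        return result;
    }
}
```

Calling `StuckThreadDump.dump("[STUCK]")` would return one entry per matching thread. From outside the JVM, the JDK's `jstack <pid>` tool produces a comparable dump without any code changes.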

thread dump sample:

Conclusion

Having a system on which stuck threads are constantly being created and never finish may lead to resource exhaustion, as these threads keep consuming server resources, which can then cause the system to stop responding. Hence the urgency to complete a diagnosis as soon as possible.

But it is also possible that these threads only appear from time to time and go away after a while, which may point to system slowness or to a mismatch between the configuration and the system's actual behaviour.

The important thing is not to overreact: the system may still be working as normal. Keep a rational approach and plan your work to identify the root cause:

Review system health (see whether the stuck thread is causing other issues)
Check configuration for stuck threads max time
If this behaviour is expected: make the necessary adjustments (work manager/stuck thread configuration)
If there's no reason for the thread to still be running: identify the problematic threads, generate a thread dump, and fix the code
