Concurrency is very important in any modern system, and this is one topic many software engineers struggle to have a good grasp. The complexity in concurrency programming stems from the fact that the threads often need to operate on the common data. Each thread has its own sequence of execution, but accesses common data.
Debugging concurrency issues and fixing any thread starvation, dead lock, and contention require skills and experience to identify and reproduce these hard to resolve issues. Here are some techniques to detect concurrency issues.
How would you detect thread safety issues?
- Manually reviewing the code for any obvious thread-safety issues. There are static analysis tools like Sonar, ThreadCheck, etc for catching concurrency bugs at compile-time by analyzing their byte code.
- List all possible causes and add extensive log statements and write test cases to prove or disprove your theories.
- Thread dumps are very useful for diagnosing synchronization problems such as deadlocks. The trick is to take 5 or 6 sets of thread dumps at an interval of 5 seconds between each to have a log file that has 25 to 30 seconds worth of runtime action. For thread dumps, use kill -3
in Unix and CTRL+BREAK in Windows. There are tools like Thread Dump Analyzer (TDA), Samurai, etc. to derive useful information from the thread dumps to find where the problem is. For example, Samurai colors idle threads in grey, blocked threads in red, and running threads in green. You must pay more attention to those red threads. - There are tools like JDB (i.e. Java DeBugger) where a “watch” can be set up on the suspected variable. When ever the application is modifying that variable, a thread dump will be printed.
- There are dynamic analysis tools like jstack and JConsole, which is a JMX compliant GUI tool to get a thread dump on the fly. The JConsole GUI tool does have handy features like “detect deadlock” button to perform deadlock detection operations and ability to inspect the threads and objects in error states. Similar tools are available for other languages as well.
Here are some best practices to keep in mind relating to writing concurrent programs.
- Favor immutable objects as they are inherently thread-safe.
- If you need to use mutable objects, and share them among threads, then a key element of thread-safety is locking access to shared data while it is being operated on by a thread. For example, in Java you can use the synchronized keyword.
- Generally try to keep your locking for as shorter duration as possible to minimize any thread contention issues if you have many threads running. Putting a big, fat lock right at the start of the function and unlocking it at the end of the function is useful on functions that are rarely called, but can adversely impact performance on frequently called functions. Putting one or many larger locks in the function around the data that actually need protection is a finer grained approach that works better than the coarse grained approach, especially when there are only a few places in the function that actually need protection and there are larger areas that are thread-safe and can be carried out concurrently.
- Use proven concurrency libraries (e.g. java.util.concurrency) as opposed to writing your own. Well written concurrency libraries provide concurrent access to reads, while restricting concurrent writes.
Fixing production issues
There could be general runtime production issues that either slow down or make a system to hang. In these situations, the general approach for troubleshooting would be to analyze the thread dumps to isolate the threads which are causing the slow-down or hang. For example, a Java thread dump gives you a snapshot of all threads running inside a Java Virtual Machine. There are graphical tools like Samurai to help you analyze the thread dumps more effectively.
- Application seems to consume 100% CPU and throughput has drastically reduced – Get a series of thread dumps, say 7 to 10 at a particular interval, say 5 -8 seconds and analyze these thread dumps by inspecting closely the “runnable” threads to ensure that if a particular thread is progressing well. If a particular thread is executing the same method through all the thread dumps, then that particular method could be the root cause. You can now continue your investigation by inspecting the code.
- Application consumes very less CPU and response times are very poor due to heavy I/O operations like file or database read/write operations– Get a series of thread dumps and inspect for threads that are in “blocked” status. This analysis can also be used for situations where the application server hangs due to running out of all runnable threads due to a deadlock or a thread is holding a lock on an object and never returns it while other threads are waiting for the same lock.