This is a story about extreme debugging.
Once I worked as a software engineer at a large eCommerce company, and we had a serious problem. We had a large farm of Java webapp servers, and many were dying suddenly due to HotSpot errors. This was causing the loss of transactions, and I was assigned to find a solution as a top priority.
Normally when a fatal error occurs you can look at the log and figure out what’s going on. In this case, all I had was an “Internal Error” with a long hexadecimal string as an identifier, and all the Googling I did turned up nothing to explain it.
The first best practice in service management is to ask, “What changed?” We looked at the logs to see when the crashes started and tried to correlate the onset with a software deployment. Unfortunately the company was releasing changes at such a rapid pace that this meant reviewing thousands of lines of code, and I had no clue what I was looking for due to the vague HotSpot Internal Error Id.
What information do we have?
Not a lot… because this was an “internal error” there were no details, and the docs said ‘In general this “error string” is useful only to engineers working on the HotSpot Virtual Machine.’ Huh… I guess I need to find the source code to the JVM. This was before the JVM was open source, so it took a lot of Googling before I finally found a download. Then I needed to dust off my C/C++ skills and dive in. The first step was to find the function that generated the error string and work out how to decode it; that gave me the source module and line number where the error originated. I was damn lucky that the source code I downloaded was roughly the same version as what we were running in production. So… I found the problematic code, but it was a pure virtual function that had something to do with Thread creation. Weird.
Where can I get help?
A pure virtual function has no implementation except in derived classes, so I had nothing indicating where to look next. Except the function did suggest that the error occurred during thread construction.
We were partners with Sun Microsystems but didn’t have a contract for Java support. I went back to Google and somehow found the contact info for an engineer there and sent him the info I had, as a Hail Mary. He actually replied and said “that should never happen!” I got excited and replied asking for help, but he never responded again.
How can I gather more information?
There was a signal we could detect when a JVM was about to crash. I gave our service operations center instructions for pausing the JVM process before it crashed, which was tricky since there was a very small time window before the process exited. Then they took the JVM out of the load balancing pool, and I was able to attach GDB and debug it. I don’t remember the details of what I found, but the native stack traces definitely reinforced that the error was occurring during thread construction at a very low level. Question… has anyone else ever used GDB in prod? It’s certainly not a best practice, but we were grasping at straws.
Next I went back to reviewing changes that were deployed around the time the problem started. I looked closely at any code that was creating Thread objects but didn’t find anything substantial. Then I tried a whole bunch of experimental JVM startup flags, including PrintCompilation, which I’d used before for performance tuning. This time, I was stunned to find the java.lang.Thread constructor listed in the report. That was odd because HotSpot only compiles bytecode to native code after about 10,000 executions of a method by default (or some number in that range). Who the hell was creating so many Threads?!?
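To make the oddity concrete: PrintCompilation only lists a method once it has crossed the compile threshold, so seeing the Thread constructor in the report implied on the order of ten thousand constructions. A minimal sketch of the kind of churn that would cause it (the class name and the 20,000 count are illustrative, not our actual code):

```java
public class ThreadChurn {
    // Construct `count` Thread objects without ever starting them;
    // each `new Thread(...)` still invokes java.lang.Thread.<init>.
    static int churn(int count) {
        for (int i = 0; i < count; i++) {
            new Thread(() -> {});
        }
        return count;
    }

    public static void main(String[] args) {
        // Run with: java -XX:+PrintCompilation ThreadChurn
        // and watch java.lang.Thread::<init> appear in the compilation log
        // once the constructor crosses HotSpot's compile threshold.
        System.out.println("constructed " + churn(20_000) + " threads");
    }
}
```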
This is when I went rogue again and convinced the Operations team to let me deploy a modified class file for java.lang.Thread with a one-line change to print a stack trace each time the constructor was invoked. We deployed it to a single JVM, using an option that loads classes ahead of the standard ClassLoader to introduce my change. Lo and behold, I quickly found the culprit by analyzing the resulting stack traces. Ironically, it was a monitoring system that was collecting JMX stats once per minute. But it was creating a new pool of 100 Threads every minute to collect the stats quickly. In the end it was a simple fix: create the thread pool once and keep it in the target JVM as a static member.
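The fix can be sketched like this; the class name, pool size, and placeholder “stat fetch” are hypothetical stand-ins for the real monitoring code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class StatsCollector {
    // Before the fix, each once-per-minute collection cycle did roughly:
    //   ExecutorService pool = Executors.newFixedThreadPool(100);
    // creating 100 fresh Threads every minute, forever. The fix: create
    // the pool once as a static member and reuse it for the life of the JVM.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // Collect `count` stats concurrently on the shared pool; returns how
    // many tasks completed.
    static int collect(int count) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        CountDownLatch latch = new CountDownLatch(count);
        for (int i = 0; i < count; i++) {
            POOL.execute(() -> {
                done.incrementAndGet(); // stand-in for a real JMX attribute fetch
                latch.countDown();
            });
        }
        latch.await();
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("collected " + collect(10) + " stats");
        POOL.shutdown();
        POOL.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Because the pool is static and final, Thread construction happens a handful of times at startup rather than a hundred times per minute, which is exactly why the constructor stops showing up in the compilation logs after the fix.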
The reason we didn’t find the problematic change earlier was that the monitoring system code lived in a different repo from the webapp code and had a different deployment schedule. I’m not sure what to conclude, except that if you manage an app and someone gives you a library or jar file to include, make sure your release notes highlight the change. This is also a great example of a design flaw that a code review could have caught. But again, at this company we were releasing tons of code very rapidly due to business drivers, and reviewing every line would have been prohibitively expensive.
Next time, I’ll tell the story about how a major version upgrade of the JVM caused our apps to hang on startup… a fun story about the difference between /dev/random and /dev/urandom!