Categories
Uncategorized

HotSpot error from Hell

This is a story about extreme debugging.

Once I worked as a software engineer at a large eCommerce company, and we had a serious problem. We had a large farm of Java webapp servers and many were dying suddenly due to HotSpot errors. This was causing the loss of transactions and I was assigned to find a solution as a top priority.

Normally if a fatal error occurs you can look at the log and figure out what’s going on. In this case, all I had was an “Internal Error” with a long hexadecimal string for an identifier, and all the Googling I did turned up nothing to explain it.

What changed?

The first best practice to follow in terms of service management is to ask, “what changed?” We looked at the logs to see when these crashes started to happen and tried to correlate the onset with a software deployment. Unfortunately the company was releasing changes at such a rapid pace that it involved reviewing thousands of lines of code, and I had no clue what I was looking for due to the vague HotSpot Internal Error Id.

What information do we have?

Not a lot… because this was an “internal error” there were no details and the docs said ‘In general this “error string” is useful only to engineers working on the HotSpot Virtual Machine.’ Huh… I guess I need to find the source code to the JVM. This was before the JVM was open source, so it took a lot of Googling before I finally found a download. Then, I needed to dust off my C/C++ skills and dive in. The first step was to find the function that generated the error string, and decode it. That gave me the source module and line number where the error originated. I was damn lucky that the source code I downloaded was roughly the same version as what we were running in production. So… I found the problematic code, but it was a pure virtual function that had something to do with Thread creation. Weird.

Where can I get help?

There is no implementation of a pure virtual function except in derived classes, and I had no information indicating where to look next. Except the function did suggest that the error occurred during thread construction.

We were partners with Sun Microsystems but didn’t have a contract for Java support. I went back to Google and somehow found the contact info for an engineer there and sent him the info I had, as a Hail Mary. He actually replied and said “that should never happen!” I got excited and replied asking for help, but he never responded again.

How can I gather more information?

There was a signal that we could detect when a JVM was about to crash. I gave instructions to our service operations center for pausing the JVM process before it crashed, which was tricky since there was a very small time window before the process exited. Then they took the JVM out of the load balancing pool, and I was able to attach GDB and debug it. I don’t remember the details of what I found, but the native stack traces definitely reinforced that the error was due to thread construction at a very low level. Question… has anyone else ever used GDB in prod? It’s certainly not a best practice but we were grasping at straws.

Next I went back to reviewing changes that were deployed around the time when the problem started. I looked closely at any code that was creating Thread objects but didn’t find anything substantial. Then I looked at a whole bunch of experimental JVM startup flags, including PrintCompilation, which I’d used before for performance tuning. In this case, I was stunned to find the java.lang.Thread constructor listed in the report. This was odd because HotSpot compilation of bytecode to native code only occurs after 10k executions by default (or some number in that range). Who the hell was creating so many Threads?!?

This is when I went rogue again and convinced the Operations team to let me deploy a modified Class file for java.lang.Thread with a one line change to print a stack trace each time the constructor was invoked. We deployed to a single JVM using a pre-ClassLoader option to introduce my change. Lo and behold, I quickly found the culprit by analyzing the resulting stack traces. Ironically it was a monitoring system that was collecting JMX stats once per minute. But it was creating a new pool of 100 Threads every minute to collect the stats quickly. So in the end it was a simple fix to just create the thread pool once and leave it in the target JVM as a static member.
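Here is a minimal sketch of that fix, not the original monitoring code. The class name `StatsCollector`, its method names, and the placeholder task standing in for a JMX read are all made up for illustration. The thread counter is only there to make the difference between the two approaches visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the monitoring agent; all names are illustrative.
class StatsCollector {
    static final int POOL_SIZE = 100;

    // Counts every thread created through the factory below.
    static final AtomicInteger THREADS_CREATED = new AtomicInteger();

    static final ThreadFactory COUNTING_FACTORY = runnable -> {
        THREADS_CREATED.incrementAndGet();
        Thread t = new Thread(runnable, "stats-collector");
        t.setDaemon(true); // do not keep the JVM alive just for monitoring
        return t;
    };

    // The fix: one pool, created once, kept as a static member.
    private static final ExecutorService SHARED_POOL =
            Executors.newFixedThreadPool(POOL_SIZE, COUNTING_FACTORY);

    // The problem: a brand new pool (and new threads) on every collection cycle.
    static int collectWithNewPool(int statCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE, COUNTING_FACTORY);
        try {
            return collect(pool, statCount);
        } finally {
            pool.shutdown();
        }
    }

    // Same work, but the threads are reused from cycle to cycle.
    static int collectWithSharedPool(int statCount) throws Exception {
        return collect(SHARED_POOL, statCount);
    }

    private static int collect(ExecutorService pool, int statCount) throws Exception {
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < statCount; i++) {
            final int id = i;
            Callable<Integer> readStat = () -> id; // placeholder for one JMX attribute read
            results.add(pool.submit(readStat));
        }
        int collected = 0;
        for (Future<Integer> result : results) {
            result.get(); // wait for each read to finish
            collected++;
        }
        return collected;
    }
}
```

With one cycle per minute, the first method constructs up to 100 new Thread objects every minute for as long as the JVM runs, while the second constructs at most 100 in total. That steady stream of constructor calls is how java.lang.Thread could end up hot enough to show up in the PrintCompilation output.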

Conclusion

The reason we didn’t find the problematic change earlier was that the monitoring system code was in a different repo apart from the webapp code. It also had a different deployment schedule. I’m not sure what to conclude except that if you manage an app, and someone gives you a library or jar file to include, make sure your release notes highlight the change. Also this is a great example where a code review could have caught the design flaw. But again, at this company we were releasing tons of code very rapidly due to business drivers, and reviewing every line of code would have been prohibitively expensive.

Next time, I’ll tell the story about how a major version upgrade of the JVM caused our apps to hang on startup… a fun story about the difference between /dev/random and /dev/urandom!

Math and Music

My parents instilled an appreciation for music in me and my siblings. As the youngest of five, I was fortunate to learn a bunch of instruments. I feel that my experience learning to play music directly correlates to excelling in math and later computer science. So here’s a rundown of my history inspired by many Facebook posts I’ve seen recently…

Keyboards

Mom was a classically trained pianist and even played in a radio broadcast once. So I took piano lessons and learned to play some simple songs, but damn it’s a challenging instrument. I have an electric piano now with lots of nice features but it takes a lot of time to compose something. Later I stayed at a home with an organ. It had foot-pedals, two keyboards, knobs etc. I composed a 3-part song on it that still runs through my head sometimes so eventually I’ll record it. Today I also have a Pocket Piano which is lots of fun since I’ve always enjoyed bands like Kraftwerk and DEVO.

Brass

The first instrument I played in a school band was the trumpet. It is a powerful instrument but you also have to learn how to maintain it by keeping the valves lubed, polishing it so the oil from your hands doesn’t affect the brass, and things like that. My uncle liked to borrow it when my parents hosted parties and he’d play Herb Alpert songs. He also liked to drink bourbon so I’d have to do some cleaning and liberal use of the spit valve so I wouldn’t blow bourbon breath on my bandmates later. Next I switched to slide trombone. If you like jazz it’s awesome. When I see a concert like Preservation Hall Jazz Band I always focus on the trombone player.

Woodwinds

Next I switched to bass clarinet. Not sure why… I think our music teacher just needed one to fill out the band. This is another favorite due to the unique sound. I had fun in marching band and we’d walk around the streets surrounding the school for practice. I liked to learn how to make weird squeaking and honking noises to crack jokes. After that I tried my sister’s soprano clarinet and that was fun but it doesn’t sound nearly as cool as a bass clarinet. Next up was my brother’s alto saxophone… I never had proper training on it and just played by feel. I didn’t spend too much time on it since I was soon to move on to guitars.

Guitars

I was fortunate to make friends with a local garage band that had a practice space in the basement of a house. I asked one of my friends if he could teach me how to play electric guitar and he was game. However once we started he saw I was having trouble because my fingers are long. So we switched to the bass guitar and I was hooked. I ended up being one of the bass players in that band and we even made a demo tape in a professional studio. I played in a couple other bands too and even did a couple shows at Gabe’s Oasis in Iowa City and other venues. I still have a couple basses and plan to eventually record some tracks combined with the Pocket Piano perhaps. Next I was determined to learn electric guitar. Nothing fancy, just basic 3 chord rock. I think I achieved my goal but never advanced to playing solos and that sort of thing.

Drums

While I was in bands as a bass player I needed to observe the drummer at all times to keep the foundation solid. Along the way I wanted to play the drums myself so taking advantage of the band’s downtime I’d noodle away. Playing drums is like walking and chewing gum at the same time, yadda yadda, so the key was to start with slow tempo and keep things simple. Once I got the hang of it I was probably qualified to play for a decent garage band.

Strings

I’ve played a fretless bass in the past and never loved the sound, but I also noodled with a friend’s cello. Matt Haimovitz is so awesome so I’d love to try cello again someday.

Vocals

Singing is hard for me but like anything else, with training you improve. It was always hard for me to sing and play an instrument at the same time. I have had some well received karaoke performances however.

Summary

One of my Dad’s principles was “never stop learning”. That may be why I kept trying new instruments. I also took a music theory course in college which really, really reinforced the idea that math and music are related very closely. So if you have a child who likes computer science and things like that, encourage them to learn how to play music. If you google “math and music” there’s a lot of good reading out there.

GOTO Chicago

I’ve really enjoyed the Zoom meetings hosted by Trifork supporting the GOTO conference. They remind me of how much I enjoyed doing a presentation there.

Check them out at https://www.meetup.com/goto-nights-chicago

Disrupting the Hospitality Industry

I’m back! This blog has been dormant for a long time since the advent of Twitter. But now I have something to share that is a bit longer than 140 characters…

About a year and 4 months ago I accepted the position of Chief Technology Officer at Hyatt Hotels Corporation. It was really a no-brainer decision since I now have the pleasure of working with Alex Zoghlin again. I’ve known Alex for a long time, ever since I joined his first company as a software engineer at Neoglyphics.

We have convinced the senior leadership team at Hyatt that technology will be a strategic advantage that will enable us to disrupt the hospitality industry. As software eats the world, all companies in all industries will eventually become technology companies, or they will perish. At Hyatt, for strategic initiatives this means that we are reversing the trend of outsourcing technology and treating it like a cost center. We are hiring a plethora of technologists in many disciplines from product, to dev, to ops, to help us develop platform and application products in a Continuous Delivery model. Check out hyatt.jobs for details.

Especially in my area, we are hiring technologists with some or all of these characteristics:

  • You are just as skilled at communicating with humans as you are with machines.
  • You are a software craftsman who values quality over quantity, but you are not a zealot or perfectionist.
  • You like Agile. You like DevOps. Thus you like Continuous Delivery. However, you understand that Agile doesn’t mean you can skip planning, and DevOps and ITIL can coexist in harmony.
  • You like Clouds, providing agility, provided by automation, made possible by standards, discipline, and pragmatic governance.
  • You like to build platforms composed of loosely coupled, contractually obligated services. Terms like API Façade make you smile.
  • You love Open Source and are willing to contribute back to the communities.
  • You want to help software take over the world, and help provide authentic hospitality in the process.
  • You want Mobile apps to be first class citizens in the software world that anticipate your needs.
  • You like to laugh in the face of adversity.
  • You are a maker and you are driven by the thought of seeing your creation in the hands of millions of customers.
  • You like the challenge of simplifying complex systems, and you always consider the big picture even when acting locally.
  • You are a pleasure to work with and value a great company culture.

We have so much interesting work ahead of us, and the best part is that you can actually visit one of our hotel properties and see how your work enhances the experience of our guests and hotel associates.  So as they say in infomercials… don’t wait, act now! -> hyatt.jobs <- Do it, do it!

-Matt

Erlang Workshop at Flourish 2009

I attended an excellent Erlang workshop presented by Martin Logan Friday morning at the Flourish conference hosted by UIC, my alma mater.  Martin is a great presenter who is a lead developer of the Erlware open source project as well as an author of an upcoming Erlang book.  I recorded parts of the workshop using the Flip Mino HD.  If you missed this event you might want to check the upcoming Erlang Factory conference where Martin will be presenting again.  Otherwise, check out the videos at the end of this post.  There was a great turnout at this event.

Is it worth your while to learn a new language with such a strange syntax?  IMHO, it certainly is!  I was first convinced after reading The End Of The Free Lunch which explains the paradigm shift in processor design from higher speed to multi-core and the subsequent need for concurrency oriented programming.  I continued to read up on concurrency oriented languages and the Actor model, and I found out about all the fuss about Erlang.  I have had far too much experience with Java applications that crash under load due to concurrency issues related to the Java shared memory model, so Erlang really piqued my interest.

It was pretty easy for me to commit to Erlang/OTP for new distributed services middleware when I worked at Orbitz Worldwide, especially since Martin is employed there as a Technical Manager.  He mentored a very small team of developers who wrote an awesome RESTful web services reverse proxy using Erlang/OTP.  It provides for robust and fault tolerant service registration, request routing and monitoring in only a few hundred lines of code.  Congrats to the team at Orbitz for recently deploying this app to production!  I plan to apply the same design for Sears Holdings’ Online Division as we continue to build out our platform. 😉

I have a couple more videos that I’ll upload later.

p.s. We’re hiring! If you are interested email me for details @ matt at mattokeefe dot com, or DM me.

JavaOne 2009

Last Thursday I received this notification from Sun regarding a JavaOne technical session proposal:

Congratulations! Your submission entitled ‘RESTful Protocol Buffers’ has been accepted by the JavaOne[sm] Conference Program Committee as an ALTERNATE session for the 2009 JavaOne conference in San Francisco, California, June 2-5, 2009.

As an alternate speaker, your badge will allow you full access to the Conference sessions, BOFs, Hands-On Labs, and the Pavilion.

It is really exciting that I might be called upon to present again.  Last year I learned a lot about how to prepare a technical session, and Complex Event Processing at Orbitz was very well received.

Here is the abstract for our proposed presentation:

At Orbitz, Jini has served us well, but at the cost of tight coupling due in part to shared code and Java serialization rules. In order to improve agility, we are migrating to a RESTful web services architecture using Protocol Buffers to define message formats. The result is loosely coupled services with autonomous life cycles supporting evolvability and innovative mashup-style development.

This session is intended for experienced architects and tech leads that are familiar with distributed systems and data encoding methods.

What you will get from this session:

– using document schemas to constitute language neutral contracts

– using standard HTTP plumbing and intermediaries

– implementing a reverse proxy for request routing based on RESTful URLs

– applying OLAs for governance and service isolation

– writing automated service layer tests to ensure backward compatibility

I’ll see you at JavaOne, with Alex Antonov!

Hello World

One of my New Year’s resolutions is to blog on a regular basis.  I plan to write about my professional interests including the Internet, distributed systems, application monitoring and management, event driven architecture, complex event processing and customer driven innovation.  I am involved with a couple of open source projects now, ERMA and Graphite, and I’d love to share some experiences that might motivate you to check them out.