Three detailed HipChat changes we needed to help us scale during hyper-growth


In today’s software market, it’s increasingly important to be agile and able to release fast and often. The recent launch of the new HipChat web client is a neat example of why. In a recent blog post, our HipChat web engineers detail the issues they saw when they brought their new architecture online at scale for the first time.

Lessons learned

Because the new web client became popular almost instantly, HipChat saw some performance issues after the initial launch. Problems like mass reconnections deadlocking servers and redundant database connections might never show up in a local environment, but they become very real under load and at scale.

Client connection attempts

Previously, HipChat's web client attempted reconnection every 10–30 seconds following a disconnection. This time around, we wanted a better experience: reconnecting as "automatically" as possible, hoping users never noticed a thing.

To do this, we decreased the connection retry interval from 10–30 seconds down to 2 seconds. That drastically shorter interval, combined with a surge of new users, strained our system.

The initial reconnection attempts were too aggressive for the amount of traffic we saw. So, our first action was to quickly update the back-off rate and initial poll time to be more reasonable.
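As a rough illustration of the knobs involved, here is a minimal sketch (in TypeScript, with purely illustrative names and values rather than HipChat's actual settings) of a reconnect schedule with a configurable initial poll delay and back-off rate:

```typescript
// A sketch of the kind of settings being tuned: a saner initial poll delay plus a
// multiplicative back-off applied after each failed attempt. Names and values are
// illustrative, not HipChat's production configuration.

interface RetryConfig {
  initialDelayMs: number; // delay before the first reconnection attempt
  backoffFactor: number;  // multiplier applied after each failed attempt
  maxDelayMs: number;     // ceiling so delays never grow unbounded
}

const config: RetryConfig = {
  initialDelayMs: 5_000,
  backoffFactor: 2,
  maxDelayMs: 120_000,
};

// Delay (in ms) before reconnection attempt number `attempt` (0-based).
function reconnectDelayMs(attempt: number, cfg: RetryConfig = config): number {
  const delay = cfg.initialDelayMs * Math.pow(cfg.backoffFactor, attempt);
  return Math.min(delay, cfg.maxDelayMs);
}
```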

Being able to release fast and often made this an easy fix.

Then a related issue popped up when a session node failed and all the disconnected users tried to reconnect at once.

As always, things get complicated when we consider this at scale (webscale). Let's say a large number of clients become disconnected at once due to a BOSH node failure.

All of the clients on that node then try to reconnect at the exact same time, based on the connection retry setting outlined above.

We've effectively bunched all the reconnection requests into a series of incredibly high-load windows in which all of the clients compete with each other. What we really want is more randomness, so we implemented a heavily jittered back-off algorithm. This minimizes the number of clients competing at any given moment and encourages the clients to back off over time.
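The post doesn't publish the exact algorithm, but one common way to get this behavior is a "full jitter" back-off, where each client waits a random amount of time between zero and an exponentially growing ceiling. A minimal sketch, with illustrative names and values:

```typescript
// A heavily jittered back-off in the spirit of "full jitter": clients dropped by
// the same node failure spread their reconnection attempts out randomly instead of
// arriving in synchronized waves, and the ceiling grows with each failed attempt.

function jitteredDelayMs(attempt: number, baseMs = 2_000, maxMs = 120_000): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.random() * ceiling; // uniform over [0, ceiling)
}

async function reconnectWithJitter(connect: () => Promise<void>): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await connect();
      return; // connected successfully
    } catch {
      // Wait a randomized, growing delay before trying again.
      await new Promise<void>((resolve) => setTimeout(resolve, jitteredDelayMs(attempt)));
    }
  }
}
```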

Again, being able to see this in real time and push out a fix quickly minimized the chance of it recurring.

Cache hits

Finally, the last issue cropped up through routine monitoring: engineers noticed that the load seemed to be double what was normal.

Since we knew session acquisition was our biggest pain point, we combed through our connection code, looking for ways to make it less expensive. We noticed that it was double-hitting Redis in some cases. A fix was quickly deployed.
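The post doesn't show the code, but the general shape of this kind of fix is to fetch the session data once and pass the result around, rather than letting separate code paths each issue their own lookup. A hypothetical sketch, with made-up names and a generic store interface standing in for Redis:

```typescript
// Fetch the session once per connection attempt and reuse the result, instead of
// two code paths each doing their own GET. All names here are hypothetical.

interface SessionStore {
  get(key: string): Promise<string | null>; // e.g. backed by a Redis GET
}

async function acquireSession(store: SessionStore, sessionId: string): Promise<string | null> {
  // Single round trip to the store...
  const session = await store.get(`session:${sessionId}`);
  if (session === null) return null;

  // ...and both consumers reuse the same result rather than re-fetching the key.
  authorizeClient(session);  // hypothetical helper that previously re-read the key
  restoreChatState(session); // hypothetical helper that previously re-read the key
  return session;
}

function authorizeClient(session: string): void {
  // Validate the session payload (details omitted).
}

function restoreChatState(session: string): void {
  // Rebuild client-facing state from the session (details omitted).
}
```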

Results

Developing an architecture that runs at scale and stays rock-solid is hard. Adding hyper-growth to that scale makes it even harder. Being able to monitor your site in real time and deploy changes and fixes fast is now essential. Otherwise you risk driving users away with poor performance that doesn’t get fixed quickly.

Since we made these changes, the distribution of load across our system has been much improved.

For all the results and pretty charts, the full post is well worth the two-minute read on how we’re scaling a big, modern web service, one that now passes billions of messages per year for our users. Thanks to Atlassian’s Open Company, No Bullshit value, we can all learn from HipChat’s experiences and ability to be agile.