Scaling from 1k to 1M in 2 weeks

-- Artur, we have a problem -- those were the first words I heard over the phone -- we have grown from 1k to 100k online users in a few days and we expect to hit 1M online users within 2 weeks and probably much more when we go live.
In most cases, such growth would be called a great success. In fact it is a great success; you just need to make sure you can sustain it by not disappointing users with unreliable services. First impressions are what count...
The next day I was on an intercontinental flight to the office of a startup created by a group of students and recent graduates. Extremely friendly people and, more importantly, passionate about what they do. For them, it wasn't about money; it was about fun and the excitement of creating something new that would be used by millions of people. I was excited too: I like challenges, and this assignment sure looked like one.
Thanks to timezone differences and despite the freezing temperatures I showed up at the office the same evening straight from the plane. It was dark outside already, but the small place was lit pleasantly. 8 or 10 desks were spread in a chaotic order and cables were all over the place. It sure looked like a place where work rather than tidiness was the priority. They welcomed me with smiles but also some degree of mistrust. I was already familiar with those smiles and that mistrust. It wasn't the first time I was called like a fireman to put out a wildfire, both literally and figuratively....
-- What's up? -- I asked and looked around. Not all faces were new to me; I had already spent 10 days with them a few months earlier, helping with code migration to Tigase. -- Can you give some details before I start?
-- We had less than 1k online users a few days ago, just our friends and other people who wanted to help us test during the development time -- They started to describe the situation. -- Then, last Tuesday we switched to open Beta. Interest exceeded our expectations. We now have almost 1M registered accounts and up to 100k online users. Our servers are on their knees. We want to go live in the next 2 weeks but we're afraid that we can't handle the load.
-- Can we? -- They asked.
-- Of course we can. -- I replied without hesitation. -- We just need to do some work.
I had no doubts Tigase could handle 1M and more online users. However, no software works for large installations out of the box; it always needs some custom optimizations. No two installations that I ever worked on were the same with regard to traffic and user behavior. This is why we implemented the clustering strategy framework: it allows for such customizations and for applying different clustering logics (strategies) adjusted to a use-case.
Before we could start any work I needed to know what exactly was going on inside Tigase in that particular installation and where the bottlenecks were. This was necessary to prepare a plan and to prioritize work. I collected server statistics and talked to the team about any custom code they had created. It was well after midnight when we decided to take a break and continue the next day.
With a head full of new information and system performance metrics I drove to the hotel. It was dark and very cold, though there wasn't any snow yet. It was a pleasant drive without any traffic so I could still think about the problem.
It turned out they had lots of custom code embedded in Tigase, all of which was unoptimized and did not even use the Tigase API. Instead they had just modified the original Tigase code all over the place. Their modifications to the system ensured that 0% of messages were lost and that all IQ packets went through the database. That was very nice, but it created all sorts of performance issues. On top of this, they were attempting to build a system that would deliver messages even if the XMPP client was not running on the mobile device. They combined push technologies for iOS, Android, BB and... SMS if there was no other option.
The next day at 6AM, I had a plan ready. Folks told me they come to the office at about 9AM, so I had a few hours to prepare. I reviewed the Tigase code making note of which API should be used in the future and also looked into possible changes that could be made to the Tigase core to make integration with the client code easier.
-- So, what do you think? -- I was asked the moment I got to the office. -- Can we do it? Can we do it in 2 weeks?
-- I have no doubts we can. -- I replied without hesitation. -- However, we can't get everything done in time so we must prioritize the critical stuff and prepare the code so that it's easy to add the next elements.
The plan was to:

  1. Write a custom clustering strategy and run Tigase in cluster mode.
  2. Extract any custom code out of the Tigase core into plugins and components.
  3. Move DB-intensive tasks (the whole QoS system) into a component, then deploy it as several external components to distribute the load.
  4. Take all "push" code which interacts with iOS, Android, etc. and implement it as components, so that it too can be deployed as external components if necessary.
  5. Optimize the slow code based on the system metrics.
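
Items 3 and 4 of the plan rely on spreading work across several external component instances. The story does not say how requests were assigned to instances, so the following is only a minimal sketch of one common approach: a stable hash of the user's bare JID picks the instance. The function name `pick_component` and the host names are hypothetical and are not part of the Tigase API.

```python
import hashlib


def pick_component(jid: str, instances: list[str]) -> str:
    """Map a bare JID to one component instance using a stable hash.

    Hypothetical helper for illustration only, not Tigase API.
    """
    if not instances:
        raise ValueError("no component instances configured")
    # Lowercase the JID (a simplification of real JID normalization) so the
    # same account always hashes to the same value.
    digest = hashlib.sha256(jid.lower().encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and wrap it
    # around the number of available instances.
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


# Assumed example hosts for three instances of the DB-heavy component.
instances = ["qos-1.example.com", "qos-2.example.com", "qos-3.example.com"]
target = pick_component("alice@example.com", instances)
```

With plain modulo hashing, changing the number of instances remaps most users to a different instance. If that matters, a consistent-hashing ring is the usual alternative.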

The team was eager to work on the code and they all were very dedicated. My main role was to teach them the Tigase API, the Tigase architecture, and the whole XMPP concept, and to instruct and help them design the best way to implement certain features. I also reviewed the code to make sure that it integrated correctly with the Tigase core code. If time allowed, I also did some coding myself.
The results exceeded our expectations. Though not fully polished, everything was ready in 10 days! During that time the online user count rose to 300k which gave us a good testing ground. Everything went so well that they decided to go live on the 12th day....
Indeed, about 6 hours after publishing the new service, user registration skyrocketed. The registration frequency became so high that we ran into trouble with DB performance. We quickly put in a fix to throttle user registration requests so that they stayed at acceptable levels. Soon there were a couple million new user accounts in the DB and about 500k online users.
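
The story does not describe how the registration throttle was built. As an illustration only, here is a minimal token-bucket sketch, one common way to cap the rate of requests reaching a database. The class name `TokenBucket` and its parameters are assumptions made for this example, not Tigase code.

```python
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`.

    Illustrative sketch only; not the throttle used in the project.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum number of stored tokens
        self.clock = clock          # injectable clock, handy for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed now, False if it must wait."""
        now = self.clock()
        # Refill according to the time elapsed since the last call,
        # never exceeding the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Example: at most 200 registrations per second, bursts of up to 500.
registration_limiter = TokenBucket(rate=200.0, capacity=500)
accepted = registration_limiter.allow()
```

A rejected request can be answered with a "try again later" error or queued, depending on how much delay new users are expected to tolerate.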
The main reward for all our hard work in those last days was that we could now easily scale the whole installation up and down by simply adding or removing machines, while keeping the load well distributed. Tigase cluster nodes could be brought up without touching the rest of the system, and we could add more external components to deal with DB-intensive tasks as well as more external components for message pushing. The code was refactored and prepared for easy addition of new features. I also taught them how to read and understand Tigase metrics to detect bottlenecks and slow code, so they would know how to optimize and fix any potential problems.
The whole project was a great success, and even after several years the system still works great and serves hundreds of millions of users. I must say, I am really happy to see Tigase software doing hard work.
