When we think about quality at Plivo, it is often in conjunction with our growing customer base and our promise to them of 99.95% guaranteed uptime, minimal latency, and pre-agreed SMS throughput rates. Last year, our customer base grew by 400%, with SMS and Voice traffic increasing by 500% and 150%, respectively.
More customers means more traffic, which naturally puts pressure on our infrastructure. That pressure intensifies when customers running large-scale voice and SMS campaigns cause sudden, unpredictable traffic spikes. This is not hypothetical: Plivo Short Code customers alone can send out 10-15x their promised throughput of 40 SMS/second.
In such instances, the automatic scaling we have in place may not kick in quickly enough, resulting in a system overload. This would adversely affect the customers in question, and also slow down the API requests of other customers. Over-provisioning our infrastructure in anticipation of such spikes would not make business sense, so we had to take other steps to prevent these issues. We also had to respect the throughput limits imposed by our carriers when sending out our customers' SMS and voice requests.
We love a good challenge. We love attacking a potential problem head on and coming up with a solution that really works. So that’s exactly what we did. We knew we had to introduce a robust queuing system for our customers, but we needed one that would meet Plivo’s exacting requirements. The result was SHARQ, a flexible, rate-limited queuing system designed for deployment in a scalable, highly available manner. The architecture of SHARQ is based on the leaky bucket algorithm, which enforces a steady output at the average rate, no matter how high or sporadic the spikes in traffic are.
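To give a feel for the idea, here is a minimal leaky-bucket sketch in Python. This is an illustration of the algorithm, not SHARQ's implementation: jobs are accepted as fast as they arrive, but are released no faster than a fixed average rate, so bursts are smoothed into a steady output.

```python
import time
from collections import deque


class LeakyBucketQueue:
    """Accepts jobs in bursts but releases them at a fixed average rate."""

    def __init__(self, rate_per_sec, clock=time.monotonic):
        self.interval = 1.0 / rate_per_sec  # seconds between releases
        self.clock = clock                  # injectable clock, handy for tests
        self.ready_at = clock()             # earliest time of the next release
        self.jobs = deque()

    def enqueue(self, job):
        # Enqueueing is never throttled; only the output is.
        self.jobs.append(job)

    def dequeue(self):
        """Return the next job, or None if the rate limit says wait."""
        now = self.clock()
        if not self.jobs or now < self.ready_at:
            return None
        # Schedule the next release one interval after this one.
        self.ready_at = max(self.ready_at, now) + self.interval
        return self.jobs.popleft()
```

A consumer polls `dequeue()` in a loop: even if thousands of jobs land at once, they drain at `rate_per_sec`, which is exactly the smoothing behavior the leaky bucket provides.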
Why we built SHARQ from the ground up
As in most modern web services, there are various tasks in our architecture that are best handled by an asynchronous queue. For the past three years, we have been using Celery for tasks such as storing message/call detail records, sending registration emails, and sending invoices to customers. This has worked very well for those tasks, but for controlling unpredictable spikes in traffic, we needed a system that would dynamically create new queues for customers and adjust their rate limits in real time. Celery lacked this fine-grained control, so we decided to build our own queuing system to meet our specific needs. We built SHARQ from the ground up to allow us to control heavy SMS and voice traffic at the individual customer level.
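To make that requirement concrete, here is a hypothetical sketch of the kind of control we were after: per-customer queues that are created on first use and whose rate limits can be changed on the fly. The class and method names are illustrative only and do not reflect SHARQ's actual API.

```python
import time
from collections import defaultdict, deque


class RateLimitedQueues:
    """Per-customer queues with rate limits adjustable at runtime.

    Illustrative sketch only; not SHARQ's real interface.
    """

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.queues = defaultdict(deque)  # a queue appears on first enqueue
        self.interval = {}                # customer -> seconds between jobs
        self.ready_at = {}                # customer -> next allowed release

    def set_rate(self, customer, rate_per_sec):
        """Adjust a customer's throughput limit in real time."""
        self.interval[customer] = 1.0 / rate_per_sec

    def enqueue(self, customer, job, default_rate=1.0):
        # Unknown customers get a queue and a default rate automatically.
        if customer not in self.interval:
            self.set_rate(customer, default_rate)
        self.queues[customer].append(job)

    def dequeue(self, customer):
        """Return the customer's next job, or None if throttled or empty."""
        now = self.clock()
        if not self.queues[customer] or now < self.ready_at.get(customer, 0.0):
            return None
        self.ready_at[customer] = (
            max(self.ready_at.get(customer, 0.0), now) + self.interval[customer]
        )
        return self.queues[customer].popleft()
```

The key property is that `set_rate` takes effect immediately, without recreating the queue or restarting workers, which is the fine-grained, real-time control that off-the-shelf task queues did not give us.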
Tried and tested
Over the last three months, we have successfully battle-tested SHARQ in production with some customers who generate large volumes of traffic. Based on the outstanding results, we’ve decided to roll out SHARQ to all customers over the next few months.
Check out SHARQ for yourself
Being open source enthusiasts, we’ve made the code for SHARQ available to all. You can read about how to use SHARQ and find all relevant documentation at sharq.io. If you have a use case that calls for dynamic rate limiting at a more granular level, SHARQ gives you the flexibility to do so. It could be as simple as rate limiting large drip email campaigns, or as demanding as absorbing an unpredictable input pattern like the Twitter firehose. The possible uses for SHARQ are endless.
The future of SHARQ
In the coming weeks, we plan to improve SHARQ by adding new features such as bulk enqueue, the ability to check the status of jobs, and much more. If you have any ideas or suggestions to improve SHARQ, check out the open source code for yourself and send us your pull requests. Here is the slide presentation of our talk at PyCon India.
Product Engineer at Plivo and primary author of SHARQ