Microsoft

Hackbox Hackathon App

Scalable Backend Performance: Microsoft's Hackathon App Handles 60k Requests Per Hour

Background

When Microsoft needed a reliable, scalable internal web application built quickly, they turned to Formidable. The app, called Hackbox, is a web application for easily running hackathons within Microsoft. Hackbox is used to manage the annual company-wide hackathon, //oneweek, and also supports custom hackathons both big and small at any time.


Challenge

When Formidable joined the project, a prototype already existed as a starting point. However, the prototype could not handle the scale of a Microsoft-wide event, so Formidable took on the backend to keep load times minimal during periods of extremely high usage.

Shortly before launch, Hackbox caught the attention of Microsoft CEO Satya Nadella, which meant that traffic would be higher than anticipated. We needed the best reliability we could manage on short notice.

Solution

Specs

Hackbox is built entirely in Azure for flexibility and scalability. It consists of:

  • an Express front-end, running as an Azure App Service

  • a Node.js/hapi.js API layer, running as an Azure App Service

  • MySQL running on a Bitnami-provided Ubuntu Azure VM

API Infrastructure

During development, the API lived on a single small Azure instance. For peak traffic, the API layer was fluidly scaled to 5 Azure S3 instances (Standard - Large; 4 cores, 7GB RAM, 50GB storage). The API layer uses Azure load balancing to distribute traffic, and can be scaled to more or fewer instances using the Scale Out feature, as well as moved to larger or smaller instances using the Scale Up feature. To minimize variables, we disabled automatic Scale Out and instead left 5 instances running at all times. Each instance is configured to use a maximum of 100 database connections.

The MySQL VM is a single DS12_V2 Standard Azure VM (28GB RAM, 200GB local SSD). MySQL is configured to allow 505 database connections: 100 per instance, and 5 for direct access for admin or debugging work.
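The connection limit follows directly from the instance count. As a rough sketch (the function and parameter names are illustrative and not taken from the Hackbox codebase), the budget can be checked like this:

```python
def max_db_connections(api_instances: int, per_instance_pool: int, admin_reserve: int) -> int:
    """Connections MySQL must allow if every API instance fills its pool."""
    return api_instances * per_instance_pool + admin_reserve


# Figures from the text: 5 instances, 100 connections each, 5 reserved for admin work.
total = max_db_connections(api_instances=5, per_instance_pool=100, admin_reserve=5)
print(total)  # 505
```

The same arithmetic shows why scaling out needs care: adding API instances without raising the MySQL connection limit (or lowering the per-instance pool) would let the API layer request more connections than the database allows.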

  • 6.59m Requests Served

  • 99.1% Returned Successfully

  • 99.9% API Reliability

Results

Traffic Served

Over the course of the peak period, the API served the following traffic:

  • Unique Users: 47,912

  • User Sessions: 201,490

  • API Requests

    • (24 July - 30 July): 3.4m

    • (24 July - 12 August): 6.59m

  • An API request is logged whenever the application takes an action, such as performing a search or editing data.

Traffic was not evenly distributed over this period. Depending on the time of day, it ranged from fewer than 1,000 requests per hour to more than 60,000 requests per hour, with per-minute peaks at times exceeding 1,200 requests.
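To put those rates on a common scale, here is a small conversion sketch. The helper name is ours, and the inputs are the rounded figures quoted above rather than raw log data:

```python
def per_second(requests: float, window_seconds: float) -> float:
    """Average request rate over a window, in requests per second."""
    return requests / window_seconds


busy_hour_rps = per_second(60_000, 3600)   # busiest hours
peak_minute_rps = per_second(1_200, 60)    # short per-minute bursts
quiet_hour_rps = per_second(1_000, 3600)   # quietest hours

print(round(busy_hour_rps, 1), peak_minute_rps, round(quiet_hour_rps, 2))
# 16.7 20.0 0.28
```

In other words, the load varied by roughly a factor of 60 between quiet and busy hours, and short bursts ran about 20% above the busy-hour average.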

Errors & Performance

Of the 6.59 million requests served over the period in question, approximately 99.1% returned successfully (HTTP 2xx). Of the requests that returned an error code, approximately 90% were HTTP 4xx (i.e. requests that the server rejected as unauthorized or incorrect). The remaining errors were HTTP 503, indicating a problem with the API layer, for a final API reliability of 99.9%.
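The reliability figure can be reproduced from the percentages above. This is a sketch using the rounded numbers from the text, so the absolute request count it prints is only an approximation:

```python
total_requests = 6_590_000
success_rate = 0.991            # share of requests that returned successfully
client_share_of_errors = 0.90   # share of errors that were client-side rejections

error_rate = 1 - success_rate
# Only the errors that were NOT client-side count against the API layer.
server_error_rate = error_rate * (1 - client_share_of_errors)
reliability = 1 - server_error_rate

print(f"{reliability:.2%}")                        # 99.91%
print(round(total_requests * server_error_rate))   # 5931
```

Client-side rejections are excluded because they reflect a bad or unauthorized request rather than a failure of the service.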

Response times to API requests had a TP90 of <300ms, and a TP99 of <1s, indicating that the vast majority of users enjoyed a fast and responsive experience on the Hackbox webapp.
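For readers unfamiliar with the notation, TP90 is the response time that 90% of requests met or beat. The sketch below uses the nearest-rank method on made-up latencies; the monitoring tool behind the figures above may compute percentiles slightly differently.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds, for illustration only.
latencies = [80, 95, 110, 120, 140, 150, 180, 210, 250, 900]

tp90 = percentile(latencies, 90)
tp99 = percentile(latencies, 99)
print(tp90, tp99)  # 250 900
```

A single slow request barely moves TP90 but dominates TP99, which is why the two are reported together.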

Further digging indicates that site performance was even better for normal users than these data would suggest. A particular script used by administrators of the site was written in such a way that it queried many pages of results simultaneously, before MySQL had returned its initial pagination query, forcing the database to run many full-table-traversal queries in parallel.

When examining an hour of moderately busy traffic, we noticed that for just a couple of minutes the average response time jumped to well over 10s, with no corresponding increase in traffic. That spike in response time is directly attributable to the admin script being run.
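One way to avoid this access pattern is to request pages strictly one after another, starting only once the initial pagination query has returned. The sketch below is an assumption about how such a script could be structured, not the actual admin script or its fix; `fetch_page` is a hypothetical callable standing in for the API call.

```python
def fetch_all_pages(fetch_page, page_size=100):
    """Walk a paginated endpoint one page at a time.

    fetch_page(offset, limit) is a hypothetical callable that returns
    (rows, total_count). Each request is issued only after the previous
    one has returned, so the database handles one query at a time.
    """
    rows, total = fetch_page(0, page_size)  # initial pagination query
    results = list(rows)
    offset = page_size
    while offset < total:
        rows, _ = fetch_page(offset, page_size)
        results.extend(rows)
        offset += page_size
    return results


# Stand-in data source so the sketch runs without a real API.
DATA = list(range(250))
calls = []


def fake_fetch(offset, limit):
    calls.append(offset)
    return DATA[offset:offset + limit], len(DATA)


all_rows = fetch_all_pages(fake_fetch)
print(len(all_rows), calls)  # 250 [0, 100, 200]
```

The trade-off is a longer total run time for the script in exchange for a steady, predictable load on the database.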

Benefits

Capacity vs. Traffic

Statistics from the busiest hour of the peak period (~64k requests) suggest that the current setup has ample headroom for additional traffic.

  • API CPU usage: 8%

  • API TP90: <300ms

  • API Memory Usage: 7-12%, 11% average

  • Average DB Connections: 83/500 (16%)

  • Database IOPS: 7% of max

  • Database CPU Usage: 9%
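As a back-of-the-envelope illustration only, the metrics above can be turned into a rough headroom factor by finding the most heavily used resource. This assumes resource usage grows linearly with traffic, which real systems rarely do, so treat the result as an upper bound on optimism rather than a measured capacity.

```python
# Utilization during the busiest hour, taken from the list above.
utilization = {
    "api_cpu": 0.08,
    "api_memory": 0.12,          # upper end of the 7-12% range
    "db_connections": 83 / 500,
    "db_iops": 0.07,
    "db_cpu": 0.09,
}

# The resource closest to its limit bounds how far traffic could grow.
bottleneck = max(utilization, key=utilization.get)
headroom_factor = 1 / utilization[bottleneck]

print(bottleneck, round(headroom_factor, 1))  # db_connections 6.0
```

Under that simplifying assumption, database connections would be the first limit reached, at roughly six times the busiest hour's traffic.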

Conclusion

Formidable has ensured that Microsoft hackathon participants have a reliable experience using Hackbox, even under heavy load conditions. Microsoft can be confident that Hackbox will scale with its hackathons.
