Gliffy Offline: A 71-Hour Race Between The Tortoise And The Hare

Gliffy is a popular HTML5, cloud-based diagramming and flowchart web app. On March 21st, 2016, the startup was living through one of the worst nightmares a company can experience: the team had accidentally deleted a live production database from its system.

As a cloud service, Gliffy was in the process of migrating to an Infrastructure as a Service (IaaS) offering such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform. Its Atlassian add-on product, Gliffy Diagrams in JIRA Cloud, was built from the ground up on an IaaS and architected as a collection of microservices running in Docker.

The team had chosen to migrate Gliffy and its services to the new architecture gradually rather than through a wholesale rewrite, so Gliffy's core infrastructure still resided at its original hosting provider. The result was downtime caused by a simple human error.

While there are features built into cloud services that could have minimized the risk of such an error, Gliffy wasn't yet in a position to take advantage of them.

The issue started when the team at Gliffy found a problem with one of its backups and scheduled a maintenance window to fix it. While the team was working on the issue, an administrator accidentally deleted a live database in its entirety.

The service suddenly went offline, and more than two million registered users couldn't retrieve any of their data or diagrams from the system.

Gliffy quickly turned to its database backups in an attempt to restore the data, but their sheer size made the restoration slow to complete. Because the service is cloud-based, however, the team was confident that recovery would be possible with no data lost.

While the restoration work was underway, Gliffy asked users to switch to its Google Chrome app, which allowed them to store their work locally.

After roughly three days offline, on March 24th, 2016, Gliffy returned online with little to no damage. Chris Kohlhardt, Gliffy's CEO, apologized on behalf of the company.

The Story Of The Tortoise And The Hare

What began as a simple mistake caused more trouble than expected. It started on March 17th, when Gliffy's system raised an alert about replication in one of its databases: the secondary master node had fallen too far behind the primary master node. Because the fix required restarting replication, Gliffy scheduled a maintenance window over the weekend to repair the issue.

On March 20th, as part of re-seeding the replication, the system administrator was tasked with resolving the problem by running a command to drop all tables from the schema. Unfortunately, the administrator failed to sever the master-master link before executing the restore command. The inevitable happened: the drop replicated, and the tables were deleted on the primary node and its slaves as well.

In just seconds, the entire database was gone from the system.
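The article doesn't show the exact commands involved, but the missing isolation step is easy to illustrate. The following is a minimal sketch, assuming a stock MySQL master-master pair; the schema and table names are made up, and statements kept out of the binary log simply never replicate to the other master or the slaves.

```sql
-- Hypothetical sketch of isolating a node before a destructive re-seed.
-- Schema and table names are illustrative, not Gliffy's actual ones.
STOP SLAVE;                      -- stop applying events arriving from the other master
SET SESSION sql_log_bin = 0;     -- keep this session's statements out of the binary log,
                                 -- so they cannot replicate to the other master or slaves
DROP TABLE app_schema.diagrams;  -- the destructive step now stays local to this node
SET SESSION sql_log_bin = 1;     -- re-enable binary logging for subsequent work
```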

Gliffy's alerting system triggered once again, and the engineering team was immediately notified and dispatched. The team immediately initiated a restoration from its last daily backup, taken late Saturday evening, and thanks to its binlog retention policy, it was confident it could do a complete restore of all data up to the point of the disaster.

Based on past experience, the restoration should have taken 10-12 hours. For a service as popular as Gliffy, that amount of downtime is really inconvenient, but not catastrophic. But given Gliffy's recent switch to table-level compression, the sheer amount of data, and the fact that the MySQL restore process is single-threaded, the team estimated it would take 4+ days to complete the restoration.

"We left this process running and called this the 'turtle' (tortoise) in the race," said head of Gliffy's engineering, Eric Chiang.

To shorten the restoration, the team brainstormed several new ideas. One of them was to ship Gliffy's backup to AWS, use an Elastic Compute Cloud (EC2) instance to restore the data without compression, and then ship it back again. The team believed this would be significantly faster, labeling it the "hare".

So here the slow-moving tortoise had a head start, and the faster hare was just leaving the start line.

The tortoise had an advantage because it posed less risk, but 4+ days offline wasn't what Gliffy had in mind; the company was hoping the hare would reach the finish line sooner. So, leaving the tortoise running at its own pace, the team shifted its focus to the hare, letting the two run in parallel. But yet another issue awaited.

After the team got Gliffy's backup into AWS and began restoring the data, the system ran out of storage space. The storage had been sufficient while the data was compressed, but uncompressed it was far larger, and Gliffy wasn't prepared for that. The first two attempts both failed for the same reason.

On the third attempt at restoring the data on its EC2 instances, the process was significantly faster. But it took some investigation to find the exact binlog start position matching Saturday's backup, and to remove the offending DROP TABLE commands. The team made snapshots of Gliffy's data in AWS so they could begin testing binary-log replay to initiate the restoration, roughly as sketched below.
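The article doesn't describe the investigation in detail, but on a stock MySQL setup it would plausibly revolve around inspecting the binary logs to find two coordinates: the position recorded with Saturday's backup, and the position just before the offending DROP TABLE statements. A rough sketch, with the log file names and positions purely illustrative:

```sql
-- Hypothetical sketch: locating binlog coordinates for point-in-time recovery.
-- File names and positions below are placeholders, not Gliffy's actual values.
SHOW BINARY LOGS;   -- list the binlog files still retained on disk

-- Scan events near the coordinates recorded with Saturday's backup
SHOW BINLOG EVENTS IN 'mysql-bin.000321' FROM 4 LIMIT 50;

-- Scan forward to find the DROP TABLE statements from the incident;
-- replay (typically done with the mysqlbinlog utility) would then stop
-- at the position just before them, so the disaster is never re-applied.
SHOW BINLOG EVENTS IN 'mysql-bin.000325' LIMIT 500;
```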

On March 23rd, the data restoration was completed, and the team was able to copy the full dataset and restore it in its production environment. While several other processes were running in parallel in this race to reduce downtime, the team at first thought the "hare" would beat them all. In the end, though, they chose the less risky option of waiting for the tortoise to finish rather than letting the hare's "over-confidence" create more unexpected work.

So as the story goes, "the 'turtle' beats the 'hare' after all," said Chiang.

The team then spent the remaining time seeding other database nodes in order to reconfigure master-master-slave replication.
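The article doesn't spell out this step either, but pointing a freshly seeded MySQL node at its master generally comes down to a handful of statements like the following; the host, credentials, and binlog coordinates here are placeholders for whatever was captured when the seed was taken.

```sql
-- Hypothetical sketch: attaching a freshly seeded node to its master.
-- Host, credentials, and binlog coordinates are placeholders.
CHANGE MASTER TO
  MASTER_HOST     = 'db-primary.example.internal',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'mysql-bin.000326',  -- coordinates recorded with the seed
  MASTER_LOG_POS  = 4;
START SLAVE;          -- begin applying events from those coordinates
SHOW SLAVE STATUS;    -- confirm the IO and SQL threads are both running
```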

Gliffy's nightmare was caused by a simple mistake with huge consequences. The total downtime for Gliffy was 71 hours and 15 minutes.