The Extra Second That Crashes the Web


On Saturday, at midnight Greenwich Mean Time (GMT), as June turned into July 2012, an extra second was added by the Earth’s official time keepers: the International Earth Rotation and Reference Systems Service (IERS).

According to reports, some of the internet’s software platforms, including the Linux operating system and the Java application platform, were unable to cope with this extra second.

Depending on how quickly the Earth is spinning, the planet’s official timekeepers periodically add an extra second to the world’s atomic clocks to keep them in sync with the planet’s rotation. This can cause problems for computing systems that synchronize with these clocks and are not built to handle the extra second.

A leap second is a one-second adjustment that is occasionally applied to Coordinated Universal Time (UTC) in order to keep its time of day close to mean solar time. The UTC time standard, which is widely used for international timekeeping and as the reference for civil time in most countries, uses the International System of Units (SI) definition of the second, which is based on atomic clocks.

Many computing systems use Network Time Protocol, or NTP, to keep themselves in sync with the world’s atomic clocks, and when an extra second is added, some systems just don’t know how to handle it.
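One reason software struggles here can be shown in a few lines. The inserted second is labeled 23:59:60 UTC, but common time representations have no slot for it. The sketch below is only an illustration using Python's standard library, not code from any of the affected systems: it shows that the leap second and the following midnight map to the same Unix-style timestamp, and that Python's `datetime` type refuses a seconds value of 60. The helper name `can_represent_second_60` is made up for this example.

```python
import calendar
from datetime import datetime

# calendar.timegm does plain arithmetic on the fields, so the leap second
# 23:59:60 lands on the same integer as 00:00:00 of the next day.
leap_second = calendar.timegm((2012, 6, 30, 23, 59, 60))
next_midnight = calendar.timegm((2012, 7, 1, 0, 0, 0))
same_timestamp = leap_second == next_midnight


def can_represent_second_60():
    """Hypothetical helper: report whether datetime accepts second=60."""
    try:
        datetime(2012, 6, 30, 23, 59, 60)
        return True
    except ValueError:
        # datetime only allows seconds in the range 0..59.
        return False


print("leap second and midnight share a timestamp:", same_timestamp)
print("datetime can represent 23:59:60:", can_represent_second_60())
```

Because two distinct instants share one timestamp, a system clock has to repeat, step back, or stretch a second to stay aligned with UTC, and that is where untested code paths get exercised.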

The "leap second bug" hit just as the web was recovering from a major outage at Amazon Web Services, an online operation that runs as much as 1 percent of the internet. Some services, including the search giant Google, saw the leap second coming and prepared for it, but most others didn't.

Reddit, one of the most popular news aggregation and discussion sites, said in a post to Twitter that it was experiencing problems with "Java/Cassandra." Cassandra is an open-source database, written in Java, that was originally developed at Facebook and is now used across the web. Reddit did not immediately respond to a request for comment.

Meanwhile, Eric Ziegenhorn, a site reliability engineer with Mozilla, posted a bug report on the organization’s site saying that Mozilla was experiencing problems with Hadoop, another open-source platform built with Java. Ziegenhorn also blamed the leap second, since the problems had hit at midnight GMT.

Others that reported issues included Foursquare, LinkedIn, Yelp, Netflix, Gawker Media, and StumbleUpon. The day before the leap second, on Friday, Foursquare said in a post to Twitter that its site was down due to the massive outage that hit Amazon’s cloud services. But the company does not appear to have publicly acknowledged a leap second bug.

These sites ran into trouble because they use Linux servers to host their databases. Amadeus, an airline reservation system used by Qantas Airways, also reported a similar issue.

Marco Marongiu, a senior system administrator with Opera Software, discussed the general leap second issue in a blog post on June 1st, providing potential workarounds. But, as he notes, the leap second problem is nothing new. There have now been 25 leap seconds since they were first introduced to UTC in 1972.

In January 2009, for instance, the leap second reportedly caused problems with Sun Microsystems’ Solaris operating system and an Oracle software package.

"Almost every time we have a leap second, we find something," said Linux’s creator, Linus Torvalds. "It’s really annoying, because it’s a classic case of code that is basically never run, and thus not tested by users under their normal conditions."

Inside the Clock

The problem could be traced to a glitch in the Linux kernel, the core of the open source operating system. A Linux subsystem called "hrtimer," short for high-resolution timer, got confused by the time change and suddenly triggered a burst of activity on those servers that caused the machines' CPUs to lock up.

Hrtimer is a subsystem that is used when an application is "sleeping," waiting for the OS to complete some other task. In some cases, it sets a kind of alarm clock for these sleeping applications that will go off when the OS is taking too much time with its other work.

Judging from a mailing list post by John Stultz, a Linux kernel hacker, when the leap second hit, these hrtimers were suddenly a second ahead of the core OS, so they started ringing those alarm clocks early, waking up countless sleeping applications at once and overloading the machines’ CPUs.
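The effect described above can be imitated with a toy model. This is not kernel code and does not reproduce the actual hrtimer logic; it is a minimal simulation, under the assumption that a timer subsystem whose clock runs ahead of system time treats short sleep deadlines as already expired. The function name `count_wakeups` and all of its parameters are invented for this illustration, and times are whole milliseconds to keep the arithmetic exact.

```python
def count_wakeups(window_ms, sleep_ms, timer_skew_ms, wake_cost_ms=1):
    """Count wakeups of one looping sleeper during window_ms of system time.

    timer_skew_ms is how far the (hypothetical) timer clock runs ahead of
    the system clock. wake_cost_ms is the assumed cost of a single wakeup.
    """
    system_now = 0
    wakeups = 0
    while system_now < window_ms:
        # The application asks to sleep until this point in system time.
        deadline = system_now + sleep_ms
        # The timer fires when ITS clock reaches the deadline, which is
        # timer_skew_ms earlier in system time. It cannot fire in the past,
        # so the earliest wakeup is "now" plus the cost of waking up.
        system_now = max(deadline - timer_skew_ms, system_now + wake_cost_ms)
        wakeups += 1
    return wakeups


healthy = count_wakeups(window_ms=1000, sleep_ms=100, timer_skew_ms=0)
skewed = count_wakeups(window_ms=1000, sleep_ms=100, timer_skew_ms=1000)
print("wakeups per second, clocks in sync:", healthy)
print("wakeups per second, timer 1 s ahead:", skewed)
```

In this model a sleeper that should wake ten times a second instead wakes as fast as the machine allows once the timer clock is a full second ahead. Multiply that by every sleeping thread on a busy server and the CPU load described in the reports becomes plausible.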

Reddit, for one, saw something a little different. Its Cassandra servers, which are built with Java, run atop Linux. At the moment of the leap second, Cassandra was failing to pause Java processes, and these processes were caught in constantly spinning loops, eating up the CPU power on Reddit’s servers.

Eventually, Reddit solved the problem by rebooting its servers. The site was disabled for about 30 to 40 minutes, and it was entirely offline for about an hour and a half.

While Reddit was struggling with its Cassandra servers, Gawker Media had issues with its Tomcat servers, and Mozilla had trouble with Hadoop.

Other systems, however, experienced problems a day before the leap second arrived. On Friday, NTP began warning servers that this year’s leap second was on its way, and according to Marco Marongiu, at least some Opera servers started locking up when they received the announcement.
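The advance warning mentioned above travels in the NTP packet header. Under the NTP specification, the first byte of each packet packs a 2-bit leap indicator, a 3-bit version number and a 3-bit mode. The sketch below decodes that byte in plain Python. The function name `decode_ntp_flags` and the sample byte values are made up for illustration, and nothing here is taken from Opera's or anyone else's actual tooling.

```python
# Leap indicator values as defined for the NTP packet header.
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "clock unsynchronized",
}


def decode_ntp_flags(first_byte):
    """Split the first header byte into (leap_indicator, version, mode)."""
    leap_indicator = (first_byte >> 6) & 0b11   # top 2 bits
    version = (first_byte >> 3) & 0b111         # middle 3 bits
    mode = first_byte & 0b111                   # bottom 3 bits
    return leap_indicator, version, mode


# Illustrative bytes: an NTPv4 server reply (mode 4) without and with
# a pending positive leap second.
for sample in (0x24, 0x64):
    li, version, mode = decode_ntp_flags(sample)
    print(f"0x{sample:02x}: version={version} mode={mode} -> {LEAP_MEANINGS[li]}")
```

A client that sees leap indicator 1 knows a second will be inserted at the end of the current UTC day, which is why some machines reacted before the leap second itself arrived.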

Planning Ahead

No one can yet say when the next leap second will arrive, because it depends on how quickly the Earth spins. The Earth's rotation can slow down or speed up, depending on tides, weather and the flow of molten metal in the Earth’s core. But when the next leap second comes, there could be more problems.

"Whenever you mess around with time, things have a pretty good chance of going wrong," Torvalds says. "Developers may test their theory and plans beforehand, but it is difficult to predict how things will play out in the real world."

"Leap seconds and daylight savings time changes are particularly painful, though, because they have the added complexity without strict rules," he added. "And of those two, leap seconds are the even more painful of the two."

As Torvalds points out, syncing up the Earth with the time measured by atomic clocks is a tricky business. But, in general, the tech industry hasn’t had much experience with leap seconds over the past decade and a half. In fact, that may be part of the problem, says Steve Allen, a programmer with the Lick Observatory, just outside of San Jose, California. "From 1999 to 2005, there hadn’t been leap seconds. So all of the notions of cloud services and multiprocessors and so on came into existence during a period of time when leap seconds weren’t happening," he says.

Some have called for an end to leap seconds so that these problems can be avoided. In the meantime, others have proposed fixes that seek to hide the sudden time changes from systems such as Linux. Opera’s Marongiu suggests pausing a system’s NTP service for a second, rather than actually moving the system’s clock back.

"Basically, you trick NTP, so it won’t take that sudden step back, but still adds the extra second," says Marongiu. But he calls this a "poor man’s workaround." The better solution, he says, is the one used by Google.

In a blog post in September of last year, Google detailed how it deals with leap seconds. The web giant uses a technique it calls "leap smear," in which it gradually adds milliseconds to its system clocks ahead of the official arrival of the leap second.

Google said that by the midnight when the leap second happens, the added milliseconds total a full second. Because its clocks had already absorbed the change by skewing the time over the course of the day, Google stated, its servers were able to continue as normal, unaware that a leap second had just occurred.
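The idea of spreading the extra second over a window can be sketched in a few lines. The version below assumes a simple linear ramp over a 24-hour window, which is a simplification chosen for this example; the exact curve and window length Google uses may differ. The function name `smeared_fraction` and its parameters are hypothetical.

```python
def smeared_fraction(seconds_until_leap, window_s=86400.0):
    """Return how much of the extra second has been absorbed so far.

    Assumes a linear ramp (an illustrative choice, not Google's documented
    curve): 0.0 at the start of the window, 1.0 when the leap second occurs.
    """
    if seconds_until_leap >= window_s:
        return 0.0          # smear has not started yet
    if seconds_until_leap <= 0:
        return 1.0          # the whole second has been absorbed
    return 1.0 - seconds_until_leap / window_s


# Sample points: start of the window, halfway through, and at the leap second.
for remaining in (86400, 43200, 0):
    print(f"{remaining:>6} s before leap: {smeared_fraction(remaining):.3f} s absorbed")
```

With a ramp like this, the clock never jumps or repeats a second. It simply runs very slightly slow for the duration of the window, a rate change small enough that ordinary software does not notice.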

It has also been proposed that media clients using the Real-time Transport Protocol (RTP) inhibit generation or use of NTP timestamps during the leap second and the second preceding it.