Out in the Open: Ex-Googlers Building Cloud Software That’s Almost Impossible to Take Down

COCKROACHES ARE SOME of the most resilient creatures on earth. They can live for 45 minutes without air and over a month without food. Cutting their heads off won’t even kill them—at least not immediately. Their bodies can live on for several days without their heads.

At technology giants like Google, Amazon, and Facebook, engineers have pioneered techniques that help make their websites just as hard to kill.

If a server goes on the fritz, a series of servers shuts down, or even an entire data center goes dark, these sites are supposed to just keep chugging along. That’s vitally important, since every second of downtime means lost revenue.

Now, a team of open source developers wants to make it easier for just about any company to build the sort of resilient cloud computing systems that run online empires like Google. They call their project CockroachDB, billing it as a database with some serious staying power. That may sound like an odd name for a piece of software, but co-creator Spencer Kimball—a former Google engineer—says it’s only appropriate. “The name is representative of its two most important qualities: survivability, of course, and the ability to spread to the available hardware in an almost autonomous sense.”

Like so many other open source projects designed to drive large online operations, CockroachDB is based on ideas published in a Google research paper, in this case a detailed description of a massive system called Spanner.

Spanner is a sweeping software creation that could eventually allow Google to spread data across millions of servers in hundreds of data centers across the world, and it took Google over five years to build. Even with Google’s research paper in hand, the CockroachDB coders still have their work cut out for them. But it’s a noble ambition.

Ex-Googlers Everywhere

At the moment, the project is in the “alpha” development phase, and it’s nowhere near ready for use with production services. But if anyone is up for the challenge of rebuilding Spanner—one of the most impressive systems in the history of computing—it’s the CockroachDB team. Many of them were engineers at Google, though none of them worked on Spanner. Kimball and Peter Mattis—best known for co-creating the open source Photoshop alternative GIMP—helped build Google’s massive file storage system, known as Colossus. Ben Darnell worked on Google Reader. And Andy Bonventre worked on Chrome and Google Tasks.

Much of the team now works for payments startup Square—following its acquisition of photo-sharing startup Viewfinder last year—including Kimball, his brother Andy Kimball, Darnell, Mattis, and Shawn Morel. But Kimball makes a point of saying that CockroachDB is not backed by Square. He and his collaborators are developing it in their spare time. Some, such as Bonventre and Tobias Schottdorf, don’t work for Square at all.

Trying to rebuild Spanner on nights and weekends may sound like a bad idea—even with the best engineers on earth. Not every company needs to reach the sort of scale that Google does. But Kimball says the Viewfinder team could have used some of Google’s technology, and that he has run into situations at Square where it would have come in handy as well. And since there’s nothing on the market that does what Spanner does, Kimball and company have resolved to build it themselves.

CockroachDB isn’t trying to replicate the most unusual aspect of Spanner—a rather clever way of syncing time across a global network of data centers using atomic clocks. Considering that most online operations aren’t even approaching the size of Google—which likely runs tens of thousands of machines at the moment—they don’t need that. What companies do need, Kimball says, is a reliable way of automatically replicating their information across multiple data centers, so that one data center outage won’t bring things down—and so that the service can operate in one part of the world just as it operates in another. This is what CockroachDB aims to provide.
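To make that idea concrete, here is a minimal sketch of majority-quorum replication, the basic trick that lets data survive the loss of a whole data center: a write counts as successful only once a majority of copies have acknowledged it, so losing any minority of replicas loses nothing. Everything here (the replica type and the write and read helpers) is invented for illustration; it is not CockroachDB’s actual code, and the real system’s replication machinery is considerably more involved.

```go
package main

import "fmt"

// replica stands in for one copy of the data living in one data center.
// The type and helpers below are invented for illustration only.
type replica struct {
	datacenter string
	alive      bool
	value      string
}

// write stores a value on every reachable replica and reports success only
// if a majority acknowledged it, so the data survives any minority of
// data center failures.
func write(replicas []*replica, value string) bool {
	acks := 0
	for _, r := range replicas {
		if r.alive {
			r.value = value
			acks++
		}
	}
	return acks > len(replicas)/2
}

// read returns the value from any surviving replica.
func read(replicas []*replica) (string, bool) {
	for _, r := range replicas {
		if r.alive {
			return r.value, true
		}
	}
	return "", false
}

func main() {
	replicas := []*replica{
		{datacenter: "us-east", alive: true},
		{datacenter: "us-west", alive: true},
		{datacenter: "europe", alive: true},
	}
	fmt.Println("write ok:", write(replicas, "hello")) // true: 3 of 3 acked

	replicas[0].alive = false // an entire data center goes dark
	if v, ok := read(replicas); ok {
		fmt.Println("still readable:", v) // the data lives on
	}
}
```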

A Bigger Table

Spanner is a successor to another Google database called BigTable, which helped pioneer new ways of building highly scalable software by breaking with many longstanding traditions in the database world. After Google published a paper on BigTable in 2006, its ideas were quickly adapted into open source clones such as Cassandra and HBase—which are now core technologies at companies like Facebook, Twitter, and Netflix—kicking off the so-called “NoSQL” revolution.

But while NoSQL databases helped companies store information across a much larger number of machines, they also made life harder in some ways. A database like BigTable sacrificed an old-school database concept called consistency, which basically means that when you make changes in one part of the database, they won’t necessarily line up with what’s happening in another part.

The problem is that it’s relatively straightforward to maintain consistency when your database lives on just one server. But as you scale up and spread out across multiple data centers, consistency becomes much harder. For many applications—such as instant messaging—that’s not much of a problem. But if you’re doing something like online banking, it’s a very big deal. Part of your database may think someone has a ton of money in their account, not realizing that all of the money was withdrawn in another part. Plus, without consistency, you run into problems when one part of your database goes down.
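A toy example makes the hazard plain. In the sketch below, two data centers each hold a copy of an account balance, replication lags, and a reader in the lagging data center sees money that is already gone. The maps and names are invented for illustration; no real database is this simple.

```go
package main

import "fmt"

// Without consistency, each replica applies updates on its own schedule,
// so two parts of the database can disagree about the same account.
func main() {
	// Both data centers start out agreeing on the balance.
	east := map[string]int{"alice": 100}
	west := map[string]int{"alice": 100}

	// Alice withdraws everything; the write lands in the east first,
	// and replication to the west lags behind.
	east["alice"] = 0

	// An application reading from the west still sees money that is gone.
	fmt.Println("east thinks alice has:", east["alice"]) // 0
	fmt.Println("west thinks alice has:", west["alice"]) // 100 -- stale!
}
```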

Spanner solves these issues, and CockroachDB is following in its footsteps.

The Name Doesn’t Die, Either

Spanner ensures consistency across data centers without sacrificing (much) performance. Plus, through an additional layer called F1, it lets companies query a database using standard SQL commands, the lingua franca of information retrieval. Despite spanning thousands of servers, a Spanner database acts like a single database on a single machine. And if a data center goes down, an application can simply ping another data center to find the information it needs, because all of the data is seamlessly synced across data centers. CockroachDB will allow for something similar—though, without the atomic clocks, it may not operate as quickly or across quite as much data.
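That failover behavior can be pictured with a small client-side sketch: try each data center in turn and take the first answer, which only works because every data center holds the same synced data. The functions and data center names here are hypothetical, not part of Spanner’s or CockroachDB’s actual interfaces.

```go
package main

import (
	"errors"
	"fmt"
)

// queryDatacenter stands in for sending a query to one data center; here it
// simply fails for any center we mark as down. All names are invented.
func queryDatacenter(dc string, down map[string]bool) (string, error) {
	if down[dc] {
		return "", errors.New(dc + " is unreachable")
	}
	return "result from " + dc, nil
}

// query pings each data center in turn until one answers. Because the data
// is synced everywhere, any surviving data center can serve the request.
func query(datacenters []string, down map[string]bool) (string, error) {
	var lastErr error
	for _, dc := range datacenters {
		res, err := queryDatacenter(dc, down)
		if err == nil {
			return res, nil
		}
		lastErr = err
	}
	return "", lastErr
}

func main() {
	dcs := []string{"us-east", "us-west", "europe"}
	down := map[string]bool{"us-east": true} // primary data center is dark

	res, err := query(dcs, down)
	if err != nil {
		fmt.Println("all data centers down:", err)
		return
	}
	fmt.Println(res) // "result from us-west"
}
```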

That said, Kimball and crew aim to create something that’s far easier to set up than Google’s creation. Google infrastructure projects tend to depend on each other: Spanner requires Colossus, which in turn requires a lock service called Chubby. But the goal for CockroachDB is to make it a standalone system that doesn’t depend on any particular file system or system manager. The team also plans to add the SQL query tools of F1 to the project. And Kimball says that if Amazon and other cloud hosting companies started adding atomic clocks to their data centers, CockroachDB could eventually tap those as well.
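One way to picture that standalone goal is as a narrow storage interface the database programs against, so any file system or store satisfying the interface can sit underneath. The Engine interface below is a hypothetical illustration of that design choice, not CockroachDB’s real storage API.

```go
package main

import "fmt"

// Engine is a hypothetical stand-in for the kind of narrow storage
// interface that lets a database run on top of any file system or
// key-value store, rather than being welded to one particular system
// the way Spanner is welded to Colossus.
type Engine interface {
	Put(key, value string)
	Get(key string) (string, bool)
}

// memEngine is the simplest possible implementation: everything in memory.
// A disk-backed or cloud-backed engine could satisfy the same interface.
type memEngine struct{ data map[string]string }

func newMemEngine() *memEngine { return &memEngine{data: map[string]string{}} }

func (m *memEngine) Put(k, v string) { m.data[k] = v }

func (m *memEngine) Get(k string) (string, bool) {
	v, ok := m.data[k]
	return v, ok
}

func main() {
	// Swap in any Engine without touching the calling code.
	var store Engine = newMemEngine()
	store.Put("greeting", "hello")
	if v, ok := store.Get("greeting"); ok {
		fmt.Println(v)
	}
}
```

Replacing memEngine with a disk-backed or cloud-backed implementation would not change any calling code, which is exactly the kind of decoupling that frees a database from a Colossus-style dependency.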

Kimball says that eventually, if the database is going to catch on beyond a handful of large companies with the internal resources to manage it, some sort of commercial company will need to provide support for the software. But Kimball says it’s still way too early to start thinking about that. If that does happen, will the project need to find a more corporate-friendly name? Kimball doesn’t think so. “It’s well proven that people remember things better when there’s a strong positive or negative emotional context,” he says. “I’d love to find a name with a super-strong gut punch positive emotional context that you can remember, but I couldn’t find one. ‘RainbowDB’ sounds pretty lame.”

The original article was first published on Wired, authored by Klint Finley.

 
