Google’s Redundant, Fault-Tolerant System Worked with Cheap, Low-Quality, Failure-Prone Equipment

(p. 183) Google was a tough client for Exodus; no company had ever jammed so many servers into so small an area. The typical practice was to put between five and ten servers on a rack; Google managed to get eighty servers on each of its racks. The racks were so closely arranged that it was difficult for a human being to squeeze into the aisle between them. To get an extra rack in, Google had to get Exodus to temporarily remove the side wall of the cage. “The data centers had never worried about how much power and AC went into each cage, because it was never close to being maxed out,” says Reese. “Well, we completely maxed out. It was on an order of magnitude of a small suburban neighborhood.” Exodus had to scramble to install heavier circuitry. Its air-conditioning was also overwhelmed, and the colo bought a portable AC truck. They drove the eighteen-wheeler up to the colo, punched three holes in the wall, and pumped cold air into Google’s cage through PVC pipes.
. . .
The key to Google’s efficiency was buying low-quality equipment dirt cheap and applying brainpower to work around the inevitably high failure rate. It was an outgrowth of Google’s earliest days, when Page and Brin had built a server housed by Lego blocks. “Larry and Sergey proposed that we design and build our own servers as cheaply as we can– massive numbers of servers connected to a high-speed network,” says Reese. The conventional wisdom was that an equipment failure should be regarded as, well, a failure. Generally the server failure rate was between 4 and 10 percent. To keep the failures at the lower end of the range, technology companies paid for high-end equipment from Sun Microsystems or EMC. “Our idea was completely opposite,” says Reese. “We’re going to build hundreds and thousands of cheap servers knowing from the get-go that a certain percentage, maybe 10 percent, are going to fail.” Google’s first CIO, Douglas Merrill, once noted that the disk drives Google purchased were “poorer quality than you would put into your kid’s computer at home.”
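A back-of-the-envelope calculation makes the bet concrete. The numbers below are illustrative assumptions built around the 4 and 10 percent figures quoted above, not data from the book: if any one cheap machine is unavailable about 10 percent of the time but every piece of data lives on three machines, the chance that all copies are down at once becomes far smaller than the chance of losing a single premium server.

```python
# Illustrative arithmetic only -- the failure probabilities and replica count
# are assumptions for the example, not figures from the book.
cheap_failure_rate = 0.10    # assume ~10% of cheap servers are down at any moment
premium_failure_rate = 0.04  # assume the low end quoted for high-end gear
replicas = 3                 # assume each piece of data is kept on three machines

# Treating failures as independent, data is unreachable only if every copy is down.
print(f"one premium server, no copies : {premium_failure_rate:.4f}")          # 0.0400
print(f"three cheap copies, all down  : {cheap_failure_rate**replicas:.4f}")  # 0.0010
```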
(p. 184) But Google designed around the flaws. “We built capabilities into the software, the hardware, and the network– the way we hook them up, the load balancing, and so on– to build in redundancy, to make the system fault-tolerant,” says Reese. The Google File System, written by Jeff Dean and Sanjay Ghemawat, was invaluable in this process: it was designed to manage failure by “sharding” data, distributing it to multiple servers. If Google search called for certain information at one server and didn’t get a reply after a couple of milliseconds, there were two other Google servers that could fulfill the request.
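The retry-on-another-replica behavior described here can be sketched in a few lines. The snippet below is a rough illustration under stated assumptions, not the Google File System’s actual code: the server names, the two-millisecond reply budget, and the fetch_from_replicas and lookup helpers are all hypothetical.

```python
# Minimal sketch of "ask another copy if the first server is slow or dead".
# Everything here (names, the 2 ms budget, the toy lookup) is an assumption
# made for illustration, not Google's implementation.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

REPLY_BUDGET_S = 0.002  # give each replica roughly 2 ms before moving on


def fetch_from_replicas(shard_id, replicas, lookup):
    """Return the first reply for shard_id, trying each replica in turn."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        for address in replicas:
            pending = pool.submit(lookup, address, shard_id)
            try:
                return pending.result(timeout=REPLY_BUDGET_S)
            except FutureTimeout:
                continue  # this copy is slow or dead -- try another server
        raise RuntimeError(f"no replica of {shard_id} answered in time")
    finally:
        pool.shutdown(wait=False, cancel_futures=True)


# Toy demonstration: the first "server" hangs, the others answer instantly.
def lookup(address, shard_id):
    if address == "rack1-server07":
        time.sleep(1.0)  # simulate an overloaded or failed machine
    return f"{shard_id} served from {address}"


print(fetch_from_replicas(
    "index-shard-42",
    ["rack1-server07", "rack2-server31", "rack5-server02"],
    lookup,
))
```

The point of the pattern is that no single machine has to be reliable: a slow or dead server simply means the request moves on to another copy of the shard.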

Source:
Levy, Steven. In the Plex: How Google Thinks, Works, and Shapes Our Lives. New York: Simon & Schuster, 2011.
(Note: ellipsis added.)
