
How Hadoop Makes Short Work Of Big Data

NetApp

Hadoop was named after a toy elephant, sounds like a Dr. Seuss character, and is the hottest thing in big-data technology.

It's a software framework that makes short work of a tall task—managing really big data. In fact, it greases the gears of the cloud, keeping the data flowing for online giants like Google, Facebook, Yahoo and Twitter.

Visualization of Twitter social connections. Image courtesy Marc Smith, via Flickr (CC:BY)

They aren't the only ones who can benefit from Hadoop, however. Enterprises are also adopting it: "We're helping enterprises that are having problems managing their big data," says Richard Treadway, director of big-data solutions marketing for NetApp.

"We're taking technology from the big Internet companies and adapting it for the enterprise environment."

The inspiration for Hadoop came from Google's published work on MapReduce, a programming model for distributed computing: a big job is broken into many small "map" and "reduce" tasks that run in parallel on multiple server computers, each processing the slice of data stored closest to it. Doug Cutting, then working at Yahoo, saw its potential and created Hadoop, which runs on clusters of off-the-shelf hardware to manage data too big for conventional databases.
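
To make the model concrete, here is the canonical "word count" example written against Hadoop's Java MapReduce API, a minimal sketch along the lines of the standard Apache tutorial (the input and output paths passed on the command line are placeholders):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each worker scans its slice of the input and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups the emitted pairs by word;
  // each reducer then sums the counts for the words it receives.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally to cut network traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory; must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar and launched with Hadoop's own runner (for example, hadoop jar wordcount.jar WordCount /data/books /data/counts), the framework takes care of splitting the input, shuffling intermediate results between machines, and recovering from individual node failures.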

Cutting, who named Hadoop after his son's yellow stuffed elephant, left Yahoo in 2009 to join Cloudera, a company specializing in Hadoop.

NetApp works with both Cloudera and another Hadoop company, Hortonworks (note the Dr. Seuss elephant reference). The companies are collaborating to develop enterprise systems that bring the benefits of Hadoop to the corporate world—but without the need for enormous resources. "Management at Internet scale requires a lot of people, and that doesn't fit the investment approaches of enterprises," says Treadway.

"We use a unique configuration designed for enterprises so they can use Hadoop and get it up and running quickly," Treadway adds. Also reliably, with built in resilience—eliminating the need to keep multiple copies of data, as is the case with Internet-scale Hadoop deployments.

That gives NetApp's customers a much-needed sense of comfort. "It's one thing if you lose a photo on Facebook," says Treadway. "It's another if you're...late on an algorithm for trading and you lose a billion dollars."

One of Hadoop's biggest advantages is speed. A comprehensive report drawing on billions of records can be generated overnight, a job that previously would have taken weeks.

That means business intelligence winds up in the hands of those who can act on it almost instantly.

The possibilities for Hadoop in the enterprise are almost endless. It has the potential to make it much easier to manage images, video, and sound. For instance, a doctor could quickly search a healthcare company's data center for a patient's complete medical history, including X-rays and MRI scans.

The only major downside to Hadoop has been its inability to span data centers in different geographic locations. The reason is simple: information can't travel between facilities fast enough to be useful.

As a result, cloud computing companies like Facebook have been forced to build larger and larger facilities to accommodate all the status updates, comments, photos and other information posted by users (some 2.5 billion pieces of content each day).

Even that problem is about to go away, thanks to a new Facebook project called Prism, which automatically replicates data wherever it's needed in a geographically dispersed Hadoop network.

“[Prism] allows us to physically separate this massive center of data but still maintain a single logical view of all of it,” Facebook engineering guru Jay Parikh told Wired in a recent interview. “We can move the centers around, depending on cost or performance or technology. ... We’re not bound by the maximum amount of power we wire up to a single data center.”


Dave Einstein is a veteran print and digital journalist, having worked for The Associated Press, Los Angeles Times and Forbes.com. He currently writes the weekly Computing Q&A column for the San Francisco Chronicle.