Apache Hadoop
/
0 Comments
History
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.
What Is Apache Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Where did Hadoop come from?
The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.
What problems can Hadoop solve?
The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.
Hadoop: It's all about growth
Fortunately for Hadoop, specifically, and big data vendors in general, nearly every company sees that their data matters. They may not know what to do with it, as Gartner found (Figure A), but they know they can't give up:
Figure A
Top big data challenges.
The reason is growth. In a recent Gartner survey, 33% of respondents named growth as their top priority, which nearly equals the sum of the next three issues on the list of top strategic business priorities. If this were just a matter of replacing expensive data warehouses, no one would bother with learning Hadoop or any other big data technologies.
This is why we're seeing enterprises retreat from earlier expectations that Hadoop would be a "good enough and cheap" replacement technology for expensive, legacy infrastructure (Figure B):
Figure B
Data Warehouse Reference Survey.
But in this "retreat," organizations are actually advancing. While the media may love a good rip-and-replace story, it's not very interesting to replace clunky, legacy software. It's far more interesting, and far more important to a company's prospects, to embrace technology like Hadoop to invent the future
Who are the current users?
Yahoo!
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster and produces data that is used in every Yahoo! Web search query.
There are multiple Hadoop clusters at Yahoo! and no HDFS file systems or MapReduce jobs are split across multiple datacenters. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.
On June 10, 2009, Yahoo! made the source code of the version of Hadoop it runs in production available to the public. Yahoo! contributes all the work it does on Hadoop to the open-source community. The company's developers also fix bugs, provide stability improvements internally and release this patched source code so that other users may benefit from their effort.
In 2010 Facebook claimed that they had the largest Hadoop cluster in the world with 21 PB of storage. On June 13, 2012 they announced the data had grown to 100 PB. On November 8, 2012 they announced the data gathered in the warehouse grows by roughly half a PB per day.
How to install Hadoop ?