Top 5 Reasons Not to Use Hadoop for Analytics

Date: 2014-05-16

Original article: http://www.quantivo.com/blog/top-5-reasons-not-use-hadoop-analytics
As a former diehard fan of Hadoop, I LOVED the fact that you can work on up to petabytes of data. I loved the ability to scale to thousands of nodes to process a large computation job. I loved the ability to store and load data in a very flexible format. In many ways, I loved Hadoop, until I tried to deploy it for analytics. That’s when I became disillusioned with Hadoop (it just "ain't all that").

At Quantivo, we’ve explored many ways to deploy Hadoop to answer analytical queries (trust me – I made every attempt to include it in my day job). At the end of the day, it became an exercise much like trying to build a house with just a hammer: conceivably, it’s possible, but it’s unnecessarily painful and ridiculously cost-inefficient to do.

Let me share with you my top five reasons why Hadoop should not be used for analytics.

1 - Hadoop is a framework, not a solution – For many reasons, people have an expectation that Hadoop answers Big Data analytics questions right out of the box. For simple queries, this works. For harder analytics problems, Hadoop quickly falls flat and requires you to develop Map/Reduce code directly. For that reason, Hadoop is more like a J2EE programming environment than a business analytics solution.
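To give a feel for what "developing Map/Reduce code directly" means, here is a minimal sketch in plain Python that mimics the map, shuffle, and reduce steps for a question a SQL user would write as a single GROUP BY. It does not use Hadoop at all; the records and the function names (map_phase, shuffle, reduce_phase, run_job) are made up for illustration, and a real job would be written against the Hadoop API and run across a cluster.

```python
from collections import defaultdict

# Hypothetical input: (customer, amount) purchase records.
# SQL equivalent: SELECT customer, SUM(amount) FROM purchases GROUP BY customer
records = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 7.5),
    ("carol", 20.0),
    ("bob", 2.5),
]

def map_phase(record):
    """Emit (key, value) pairs; here, one pair per purchase."""
    customer, amount = record
    yield customer, amount

def shuffle(pairs):
    """Group values by key, the step the framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

totals = run_job(records)
print(totals)  # {'alice': 37.5, 'bob': 15.0, 'carol': 20.0}
```

Even in this toy form, a one-line question turns into three functions plus the glue to run them, and every new question means writing and testing another job.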

2 - Hive and Pig are good, but do not overcome architectural limitations – Both Hive and Pig are very well thought-out tools that enable the lay engineer to quickly become productive with Hadoop. After all, Hive and Pig are two tools that are used to translate analytics queries in common SQL or text into Java Map/Reduce jobs that can be deployed in a Hadoop environment. However, there are limitations in the Map/Reduce framework of Hadoop that prohibit efficient operation, especially when you require inter-node communications (as is the case with sorts and joins).
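The inter-node cost of a join can be sketched with a small single-process simulation. This is only an illustration under stated assumptions, not how Hive or Pig actually plan a query: the three-node cluster, the users and orders rows, and the helper names (partition, reduce_side_join) are all hypothetical. The idea is that rows sharing a join key must end up on the same node before they can be matched, so any row stored elsewhere has to be moved first.

```python
from collections import defaultdict
from zlib import crc32

NUM_NODES = 3  # hypothetical cluster size

def partition(key):
    """Pick the node that will handle this join key (deterministic hash)."""
    return crc32(key.encode("utf-8")) % NUM_NODES

# Hypothetical inputs: (node where the row is stored, join key, value).
users = [(0, "u1", "Alice"), (1, "u2", "Bob"), (2, "u3", "Carol")]
orders = [(0, "u2", 40), (1, "u3", 15), (2, "u1", 25), (2, "u2", 5)]

def reduce_side_join(users, orders):
    """Join users to orders by key and count rows that had to change nodes."""
    # inbox[node][key] collects both sides of the join for that key.
    inbox = defaultdict(lambda: defaultdict(lambda: {"user": [], "order": []}))
    moved = 0
    for side, rows in (("user", users), ("order", orders)):
        for node, key, value in rows:
            target = partition(key)
            if target != node:
                moved += 1  # this row crosses the network in a real cluster
            inbox[target][key][side].append(value)
    joined = []
    for per_key in inbox.values():
        for key, sides in per_key.items():
            for name in sides["user"]:
                for amount in sides["order"]:
                    joined.append((key, name, amount))
    return sorted(joined), moved

joined, moved = reduce_side_join(users, orders)
print(joined)
print(f"{moved} of {len(users) + len(orders)} rows moved between nodes")
```

How many rows move depends on the hash and on where the data happens to sit, but unless both inputs are already partitioned by the join key, some movement is unavoidable, and with real data volumes that shuffle tends to dominate the job.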

3 - Deployment is easy, fast and free, but very costly to maintain and develop – Hadoop is very popular because within an hour, an engineer can download, install, and issue a simple query. It’s also an open source project, so there are no software costs, which makes it a very attractive alternative to Oracle and Teradata. The true costs of Hadoop become obvious when you enter the maintenance and development phase. Since Hadoop is mostly a development framework, Hadoop-proficient engineers are required to develop an application as well as optimize it to execute efficiently in a Hadoop cluster. Again, it’s possible but very hard to do.

4 - Great for data pipelining and summarization, horrible for ad hoc analysis – Hadoop is great at analyzing large amounts of data and summarizing or “data pipelining” to transform the raw data into something more useful for another application (like search or text mining) – that’s what it’s built for. However, if you don’t know the analytics question you want to ask or if you want to explore the data for patterns, Hadoop becomes unmanageable very quickly. Hadoop is very flexible at answering many types of questions, as long as you spend the cycles to program and execute MapReduce code.

5 - Performance is great, except when it’s not – By all measures, if you want speed and need to analyze large quantities of data, Hadoop allows you to parallelize your computation across thousands of nodes. The potential is definitely there. But not all analytics jobs can easily be parallelized, especially when user interaction drives the analytics. So, unless the Hadoop application is designed and optimized for the question that you want to ask, performance can quickly become very slow, as each map/reduce job has to wait until the previous jobs are completed. Hadoop is always as slow as the
