Hadoop is disruptive

Hadoop is hard, but the right tools make it easier

Over the last five years, there have been few more disruptive forces in information technology than big data – and at the center of this trend is the Hadoop ecosystem. While everyone has a slightly different definition of big data, Hadoop is usually the first technology that comes to mind in big data discussions.

When organizations can effectively leverage Hadoop, putting frameworks like MapReduce, YARN, and Spark to work, the IT and business benefits can be particularly large. Over time, we’ve seen pioneering organizations achieve this type of success – and they’ve established some repeatable, value-added use case patterns along the way. Examples include optimizing data warehouses by offloading less frequently used data and heavy transformation workloads to Hadoop, as well as customer 360-degree view projects that blend operational data sources with big data to create on-demand intelligence across key customer touch points.

Organizations have achieved what can best be described as “order of magnitude” benefits in some of these scenarios, for instance:

  • Reducing ETL and data onboarding process times from many hours to less than an hour
  • Cutting millions of dollars in spending with traditional data warehouse vendors
  • Identifying fraudulent transactions or other customer behavior indicators 10 or more times faster

Given these potentially transformational results, you might ask – “Why isn’t every organization doing this today?” One major reason is simply that Hadoop is hard. As with any technology that is just beginning to mature, barriers to entry are high.

Specifically, some of the most common challenges to successfully implementing Hadoop for value-added analytics are:

  • A mismatch between the complex coding and scripting skill sets required to work with Hadoop and the SQL-centric skill sets most organizations possess
  • The high cost of hiring developers to work with Hadoop, coupled with the risk of having to interpret and maintain their code if they leave
  • The sheer amount of time and effort it takes to manually code, tune, and debug routines for Hadoop
  • Challenges integrating Hadoop into enterprise data architectures and making it “play nice” with existing databases, applications, and other systems
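
To make the first challenge in the list above concrete, here is a minimal sketch that computes the same per-customer totals twice: once as hand-written map, shuffle, and reduce steps, and once as a single SQL statement. It is plain Python that only mimics the MapReduce programming model and does not call any Hadoop API. The sample data and all function names are hypothetical, and SQLite stands in for whatever SQL engine an organization already uses.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer_id, amount) transactions.
transactions = [("c1", 20.0), ("c2", 5.0), ("c1", 7.5), ("c3", 12.0), ("c2", 1.0)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for customer_id, amount in records:
        yield customer_id, amount


def shuffle_phase(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce_style(records):
    """Hand-coded pipeline: three separate steps to write, tune, and debug."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def totals_sql(records):
    """The same aggregation expressed as one declarative SQL statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE transactions (customer_id TEXT, amount REAL)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"
        ).fetchall()
        return dict(rows)
    finally:
        conn.close()


if __name__ == "__main__":
    print(totals_mapreduce_style(transactions))
    print(totals_sql(transactions))
```

Both functions return the same totals, but the first asks the developer to think in terms of keys, grouping, and reducers, while the second relies on skills most data teams already have. A real Hadoop job adds cluster configuration, serialization, and tuning on top of this, which is where much of the manual effort described above goes.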

These are some of the most readily apparent reasons why Hadoop projects may fail, leaving IT organizations disillusioned when the massive return on investment (ROI) they expected fails to materialize. In fact, some experts expect the large majority of Hadoop projects to fall short of their business goals for these very reasons.

The good news is that traditional data integration software providers have begun to update their tools to help ease the pain of Hadoop, letting ETL developers and data analysts integrate and process data in a Hadoop environment with their existing skills. However, leveraging existing ETL skill sets alleviates just one part of a much larger set of big data challenges.

Reference: PENTAHO Hadoop and the Analytic Data Pipeline

Pentaho, a Hitachi Group company, is a leading data integration and business analytics company with an enterprise-class, open source-based platform for diverse big data deployments. Our mission is to help organizations across industries harness the value from all their data, including big data and Internet of Things (IoT), enabling them to find new revenue streams, operate more efficiently, deliver outstanding service and minimize risk.
