The Big Data Transformation

Once enterprises can pull a variety of data into Hadoop in a flexible and scalable fashion, the next step is to process, transform, and blend that data at scale on the Hadoop cluster.

This enables complete analytics, taking all relevant data into account, whether structured, semi-structured, or unstructured.

As touched on earlier, an intuitive, easy-to-use data integration product for designing and executing these workflows on the Hadoop cluster is essentially a “table stakes” requirement. Giving ETL developers and data analysts drag-and-drop Hadoop data integration tools lets enterprises avoid hiring expensive developers with Hadoop experience.

In a rapidly evolving big data world, IT departments also need to design and maintain data transformations without having to worry about changes to the underlying technology infrastructure. There needs to be a level of abstraction away from the underlying framework (whether Hadoop or something else), such that the development and maintenance of data-intensive applications can be democratized beyond a small group of expert coders.

This is possible with the combination of a highly portable data transformation engine (“write once, run anywhere”) and an intuitive graphical development environment for data integration and orchestration workflow. Ideally, this joint set of capabilities is encapsulated entirely within one software platform. Overall, this approach not only boosts IT productivity dramatically, but it also accelerates the delivery of actionable analytics to business decision makers.
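As a rough illustration of the “write once, run anywhere” idea, the sketch below defines a transformation once as a list of plain steps and hands the same definition to two interchangeable engines. Everything here is hypothetical and for illustration only: `LocalEngine`, `PartitionedEngine`, and the sample steps are not part of any real product, and the partitioned engine merely simulates how a cluster would split the work.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]
Step = Callable[[Row], Optional[Row]]  # a step returns None to drop a row


def run_steps(steps: List[Step], rows: Iterable[Row]) -> List[Row]:
    """Apply every step to every row, dropping rows that a step rejects."""
    out = []
    for row in rows:
        for step in steps:
            row = step(row)
            if row is None:
                break
        else:
            out.append(row)
    return out


class LocalEngine:
    """Hypothetical engine: runs the whole transformation in one process."""

    def run(self, steps: List[Step], rows: Iterable[Row]) -> List[Row]:
        return run_steps(steps, rows)


class PartitionedEngine:
    """Hypothetical stand-in for a cluster engine: splits the input into
    partitions and processes each one independently."""

    def __init__(self, partitions: int = 3) -> None:
        self.partitions = partitions

    def run(self, steps: List[Step], rows: Iterable[Row]) -> List[Row]:
        rows = list(rows)
        out = []
        for i in range(self.partitions):
            out.extend(run_steps(steps, rows[i::self.partitions]))
        return out


# The transformation itself knows nothing about where it will run.
def drop_missing_amount(row: Row) -> Optional[Row]:
    return row if row.get("amount") is not None else None


def add_amount_cents(row: Row) -> Row:
    return {**row, "amount_cents": round(row["amount"] * 100)}


STEPS = [drop_missing_amount, add_amount_cents]

SAMPLE = [
    {"id": 1, "amount": 12.5},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 0.99},
    {"id": 4, "amount": 7.0},
]

local_result = LocalEngine().run(STEPS, SAMPLE)
cluster_result = PartitionedEngine(partitions=3).run(STEPS, SAMPLE)
```

Both engines produce the same rows (possibly in a different order), which is the point: the transformation logic can stay unchanged while the execution layer underneath it is swapped.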

Ease of installation and configuration is a related factor in shortening time to value for Hadoop data integration and analytics projects. This is fairly intuitive: the more adapters, node-by-node installations, and separate Hadoop component configurations a solution requires, the longer it takes to get up and running. However, the underlying solution architecture, and by extension its configuration process, can have important additional operational implications.

For instance, as more node-by-node software is installed and more cluster variables are tuned, the approach becomes more likely to interfere with policies and rules set by Hadoop administrators.

Also, more onerous and cluster-invasive platform installation requirements can create problems including:

  • Repetitive manual installation interventions
  • Increased risk to change and reduced solution agility
  • Inability to work in a dynamic provisioning model
  • Reduced architectural flexibility
  • Lower cost effectiveness

Organizations taking a holistic approach to Hadoop data analytics will look beyond simply insulating traditional ETL developers from the complexity of Hadoop and will also give other roles the additional control and performance they need. If a broader base of Hadoop developers, admins, and data scientists is to be involved in the overall data pipeline, those roles need to be empowered to work productively with Hadoop as well. Enterprises should be wary of “black box” approaches to data transformation on Hadoop and should instead opt for an approach that combines ease of use with deeper control and visibility.

This includes native, transparent transformation execution via MapReduce, direct control over spinning cluster resources up or down via YARN, the ability to work with data in HBase, and integration with tools like Sqoop for bulk loads and Oozie for workflow management. It can also extend to orchestrating and reusing pre-existing scripts (Java, Pig, Hive, etc.) that organizations may still want to run in conjunction with visually designed jobs and transformations.
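To make the orchestration idea concrete, here is a minimal sketch of running pre-existing scripts and newer jobs in dependency order. It uses Python's standard `graphlib.TopologicalSorter`; the task names and the `run_workflow` helper are hypothetical, and each task only records that it ran. In a real pipeline a task would instead launch a Hive or Pig script, a bulk load, or a designed transformation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

run_log = []


def make_task(name):
    """Build a placeholder task that only records that it ran.
    A real task would launch a script or job here instead."""
    def task():
        run_log.append(name)
    return task


# Hypothetical workflow: each task maps to the set of tasks it depends on.
DEPENDS_ON = {
    "bulk_load": set(),
    "legacy_hive_script": {"bulk_load"},
    "visual_transformation": {"bulk_load"},
    "publish_report": {"legacy_hive_script", "visual_transformation"},
}

TASKS = {name: make_task(name) for name in DEPENDS_ON}


def run_workflow(tasks, depends_on):
    """Run every task once, after all of its dependencies; return the order."""
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        tasks[name]()
    return order


order = run_workflow(TASKS, DEPENDS_ON)
```

The legacy script and the visually designed transformation both wait for the bulk load, and the report waits for both. If the dependencies ever form a cycle, `TopologicalSorter` raises `graphlib.CycleError` instead of running anything.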

An alternative approach to big data integration involves the use of code generation tools, which output code that must then be run separately. Because these tools generate code, that code is often maintained, tuned, and debugged directly, which can create additional overhead for Hadoop projects. Code generators may provide fine-grained control, but they normally have a much steeper learning curve, and using them requires repeated access to highly skilled developers. As such, total cost of ownership (TCO) should be carefully evaluated.

Reference: PENTAHO Hadoop and the Analytic Data Pipeline

Pentaho, a Hitachi Group company, is a leading data integration and business analytics company with an enterprise-class, open source-based platform for diverse big data deployments. Our mission is to help organizations across industries harness the value from all their data, including big data and Internet of Things (IoT), enabling them to find new revenue streams, operate more efficiently, deliver outstanding service and minimize risk.
