Introduction to Data Mashups and Data Blending

Three years after the McKinsey Global Institute highlighted the value of data blending (see above), we are seeing how increasingly important it is for organizations to bring data from disparate sources together in an analytics-ready fashion.

Indeed, the majority of organizations (52%) surveyed in a 2015 Forrester Consulting study are, on average, blending 50 or more data sources, and 12% are blending over 1,000 sources. Highlighting the importance of emerging data to the mashup trend, two thirds of the companies surveyed use unstructured data from sources such as social media, 65% use Internet of Things or device/sensor data, and 58% use consumer mobile device data such as geolocation and wearables. This need to blend data to create value will only increase as new types and sources of data and information continue to emerge.

Yet, without a business goal to achieve, working with big data may just be a “science experiment.” How do businesses drive results? Increasingly, enterprise business issues are best addressed with a blended data approach. For instance, a telecom company might blend semi-structured network data with customer service data to understand the relationship between dropped calls and customer behavior across geographies. Wherever you are in adopting this approach, it’s important to understand the benefits and restrictions of different types of data architecture.

The most powerful insights come from blending data on demand and at the source of the data. It takes a well-architected, trusted process with IT and business collaboration underlying it to put data blending into production in an analytics-ready format. This goes well beyond the mere act of visualizing or reporting on a data set.

Blending Data on Demand and at the Source

An enterprise-grade data mashup strategy requires an architected and trusted approach, designed with full knowledge of underlying systems and constraints to utilize the most efficient point of processing. This is important in order to provide fast access to refined data and to avoid unnecessary staging in intermediary databases. On-demand data blending can mean using virtualized data sets to deliver blended data to production analytics applications quickly, while maintaining governance rules, semantics, and auditability.

Some visualization vendors in the market talk about blending, but it’s not an apples-to-apples comparison. Blending “at the glass,” i.e., blending done away from the source by end users or analysts who work with their own available data sets and have limited knowledge of the underlying corporate source systems, has the potential to deliver inaccurate or even completely incorrect results. For instance, there may be no way to ensure that the fields being matched for analysis are truly the same across different data sources.

To elaborate, consider what happens when someone matches two fields from different sources, both named “revenue,” in records that match on “customer,” but one is a monthly total and the other is a daily total. This won’t be apparent to an analyst, since the blending is done based on similar names. The analyst then runs a summation that adds the two together as the day’s total revenue from that customer. Unwittingly, the monthly figure is added into each day’s total, dramatically distorting the actual revenue generated from that customer.
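The distortion described above can be worked through with a small example. This is only an illustrative sketch in Python: the customer name, dates, revenue figures, and the FIELD_GRAIN metadata table are all made up for the illustration, and the check shown is just one possible guard, not a description of how any particular product validates fields.

```python
# Hypothetical data: one source reports revenue per day, the other per month.
daily_source = [
    {"customer": "acme", "date": "2015-08-01", "revenue": 100.0},
    {"customer": "acme", "date": "2015-08-02", "revenue": 120.0},
    {"customer": "acme", "date": "2015-08-03", "revenue": 80.0},
]
monthly_source = [
    {"customer": "acme", "month": "2015-08", "revenue": 3000.0},
]


def naive_blend(daily_rows, monthly_rows):
    """Match on customer and add the two 'revenue' fields, as a name-based blend would."""
    monthly_by_customer = {row["customer"]: row["revenue"] for row in monthly_rows}
    blended = []
    for row in daily_rows:
        # The monthly total is silently added to every single day.
        total = row["revenue"] + monthly_by_customer.get(row["customer"], 0.0)
        blended.append({"customer": row["customer"], "date": row["date"], "revenue": total})
    return blended


# Hypothetical metadata recording the time grain of each field (an assumption
# of this sketch; source-level knowledge like this is what an analyst lacks).
FIELD_GRAIN = {
    ("daily_source", "revenue"): "day",
    ("monthly_source", "revenue"): "month",
}


def check_same_grain(field_a, field_b):
    """Refuse to combine two fields whose recorded time grain differs."""
    grain_a, grain_b = FIELD_GRAIN[field_a], FIELD_GRAIN[field_b]
    if grain_a != grain_b:
        raise ValueError(f"cannot add {field_a} ({grain_a}) to {field_b} ({grain_b})")


if __name__ == "__main__":
    actual = sum(row["revenue"] for row in daily_source)
    distorted = sum(row["revenue"] for row in naive_blend(daily_source, monthly_source))
    print("actual three-day revenue:", actual)
    print("revenue after naive blend:", distorted)
```

With these made-up numbers, the customer's three days of revenue total 300.0, but the name-based blend reports 9300.0 because the 3000.0 monthly figure is counted once per day. Calling check_same_grain on the two revenue fields raises an error before the summation can run.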

The business then targets that customer as highly profitable and offers significant discounts to maintain their interest. Not only have you targeted the wrong customer and potentially ignored the real profitable customers — you’ve also now given undeserved discounts. The net result lowers your revenue from this customer, and potentially causes a loss of other profitable customers who were more deserving but left in favor of competitors offering them discounts. You’ve made the wrong decision because the analytics were based on a faulty process for bringing data together. Now imagine how this issue can be compounded when you have many different users dealing with over 50 different data sources (as suggested by the Forrester study).

Without tools that blend at the source, the only way to avoid these types of issues is to train every user and analyst on the detailed characteristics of the underlying data sources and systems. That approach is infeasible for most organizations, as it would take far too much time and expense while hurting productivity.

Even if you can take on this level of investment in training, you still face issues with the timeliness of the data, since other tools do not pull from the source systems on a controlled basis. It is impossible to know whether the data pulled is indeed the latest and therefore the most accurate. You need to be sure the analytics lead to accurate information so you can make the right decision.

You need architected data blending at the source. Ultimately, having a full understanding of the data you are working with, all the way from raw data and source systems to end user analytics insights, creates a higher degree of trust in the data being used. This is true both from the perspective of the IT team and from the perspective of line of business users. Furthermore, this type of trust, coupled with process transparency and auditability, makes it easier to ensure that data governance policies are being followed.

“When individual sources include automated and/or manual inputs, originate from disparate systems with different architectures, and are subject to different levels of governance, an effective integration process is essential.”

“Delivering Governed Data for analytics at scale”, an August 2015 commissioned study conducted by Forrester Consulting on behalf of Pentaho

Where else might this be relevant?

For example, a telecom company could use Pentaho to blend data so that analysts and customer call center agents get accurate, near-real-time information and can determine the best action to take regarding call quality. This is vital because quality of service changes rapidly with network conditions, for example whether the customer was able to connect at all. Only an on-demand data blend will suffice in this case. The telecom company can create architected, blended views across both the traditional Call Detail Records in the data warehouse and the network data streaming into a NoSQL store (like MongoDB), without sacrificing governance or performance.
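The idea of an on-demand blend in this telecom scenario can be sketched in a few lines of Python. This is not Pentaho's API: the in-memory lists below merely stand in for the data warehouse and the NoSQL store, and every name in the sketch (WAREHOUSE_CDRS, NETWORK_EVENTS, fetch_cdrs, fetch_network_events, blend_on_demand, the field names) is hypothetical. The point it illustrates is that both sources are queried at request time, so nothing is copied into an intermediary staging database.

```python
# Stand-in for Call Detail Records held in the data warehouse (made-up rows).
WAREHOUSE_CDRS = [
    {"call_id": "c1", "customer": "cust-42", "duration_s": 310},
    {"call_id": "c2", "customer": "cust-42", "duration_s": 12},
    {"call_id": "c3", "customer": "cust-7", "duration_s": 95},
]

# Stand-in for network events streaming into a NoSQL store (made-up rows).
NETWORK_EVENTS = [
    {"call_id": "c1", "dropped": False, "cell": "north-3"},
    {"call_id": "c2", "dropped": True, "cell": "north-3"},
]


def fetch_cdrs(customer):
    """Hypothetical warehouse query: all call records for one customer."""
    return [row for row in WAREHOUSE_CDRS if row["customer"] == customer]


def fetch_network_events(call_ids):
    """Hypothetical NoSQL query: network events for the given calls."""
    return [event for event in NETWORK_EVENTS if event["call_id"] in call_ids]


def blend_on_demand(customer, cdr_source=fetch_cdrs, event_source=fetch_network_events):
    """Build the blended view at request time by querying both sources."""
    cdrs = cdr_source(customer)
    events = {e["call_id"]: e for e in event_source({r["call_id"] for r in cdrs})}
    blended = []
    for record in cdrs:
        event = events.get(record["call_id"])
        blended.append({
            **record,
            # None marks calls with no network event yet, rather than guessing.
            "dropped": event["dropped"] if event else None,
            "cell": event["cell"] if event else None,
        })
    return blended


if __name__ == "__main__":
    for row in blend_on_demand("cust-42"):
        print(row)
```

Because the view is computed on each call, an event that arrives in the stand-in NoSQL list after one request is visible to the next request without any reload step. A real deployment would also need the governance, semantics, and auditing that the article describes, which this sketch leaves out.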

Pentaho, a Hitachi Group company, is a leading data integration and business analytics company with an enterprise-class, open source-based platform for diverse big data deployments. Our mission is to help organizations across industries harness the value from all their data, including big data and Internet of Things (IoT), enabling them to find new revenue streams, operate more efficiently, deliver outstanding service and minimize risk.
