Startup Dreamers
Innovation

Inside The Data Transformation Cement Mixer

By admin | October 17, 2023 | 6 Mins Read

Pipelines run the world. Whether it is petroleum, gas, hydrogen, sewage or some form of slurry (water mixed with a solid) or indeed the cabling pipelines that help to create the Internet, pipelines are pretty much at the heart of everything. As we move from the physical world to the digital fabric now underpinning society, we also talk about data pipelines.

What is a data pipeline?

An increasingly common term in the world of big data and cloud databases, a data pipeline describes the passage of data from its point of creation (where it is typically raw, unstructured or otherwise more jumbled than we would like) through successive stages of classification, deduplication, integration and standardization. A data pipeline also encompasses core processes related to analytics, storage, applications, management and maintenance. There is no faucet, valve or tap as such, but a data pipeline can be turned on and off just like a physical one.
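As a rough illustration of the idea (the stage names and record shapes here are invented for the example), a pipeline can be modeled as a sequence of functions applied to raw records, where switching a stage off is as simple as dropping it from the list:

```python
# Minimal sketch of a data pipeline as composed stages (illustrative only).

def classify(records):
    # Tag each raw record with a type based on a simple heuristic.
    return [{**r, "type": "numeric" if isinstance(r.get("value"), (int, float)) else "text"}
            for r in records]

def deduplicate(records):
    # Keep only the first record seen for each id.
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def standardize(records):
    # Strip whitespace and normalize casing on text values.
    return [{k: v.strip().lower() if isinstance(v, str) else v for k, v in r.items()}
            for r in records]

def run_pipeline(raw):
    stages = [classify, deduplicate, standardize]
    data = raw
    for stage in stages:   # the pipeline's "valve": skip a stage to turn it off
        data = stage(data)
    return data

raw = [{"id": 1, "value": " Hello "}, {"id": 1, "value": " Hello "}, {"id": 2, "value": 3.5}]
result = run_pipeline(raw)
```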

When we look at how we work with data pipelines today, many of the functions (inside the pipe, so to speak) are related to data transformation to get data from one state or place to another. Because not all data in the pipeline is in the right format to be used by modern data analytics tools, a degree of data transformation (not to be confused with digital transformation) runs through the whole process.

Some data transformation techniques change formats like JSON (JavaScript Object Notation – a text-based human-readable data interchange format used to exchange data) into tables, while others take tabular data and flatten or ‘denormalize’ it. Both allow the analytical database to process the data more easily and are done in a series of steps to extract, load and transform it.
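A sketch of that flattening step (the document shape and field names are hypothetical): a nested JSON record is denormalized into one flat row per line item, with the parent fields repeated on every row so an analytical store can scan them without re-assembling the nesting:

```python
import json

# A nested JSON document, as it might arrive from an API (hypothetical shape).
doc = json.loads(
    '{"order": 7, "customer": {"name": "Ada", "city": "London"},'
    ' "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}'
)

def flatten(record):
    # Denormalize: one flat row per line item, with the order and
    # customer fields repeated (redundantly but harmlessly) on each row.
    return [
        {
            "order": record["order"],
            "customer_name": record["customer"]["name"],
            "customer_city": record["customer"]["city"],
            "sku": item["sku"],
            "qty": item["qty"],
        }
        for item in record["items"]
    ]

rows = flatten(doc)  # two flat rows, ready for a tabular/columnar store
```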

What is data denormalization?

For clarification here, data denormalization is the action of adding precomputed redundant (but harmless and matching) data to a relational database in order to elevate and accelerate the database’s ability to ‘read’ information from its data store repository. Conversely, database normalization requires us to remove any redundant data and deduplicate the data set so that we only have a single version of each data record. All clear so far then, but what does life really look like in the data transformation zone and does the very act of doing so clog up the pipeline?
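To make the denormalization trade concrete (a toy sketch with invented tables): in the normalized form a read must look up the customer by id, while the denormalized form precomputes those redundant but matching fields onto each order row so a read touches a single row:

```python
# Normalized: customer details live in one place; every read needs a lookup/join.
customers = {101: {"name": "Ada", "tier": "gold"}}
orders = [{"order_id": 1, "customer_id": 101, "total": 40.0}]

# Denormalized: redundant (but matching) customer fields are precomputed
# onto each order row, accelerating reads at the cost of duplicated storage.
orders_denorm = [
    {**o,
     "customer_name": customers[o["customer_id"]]["name"],
     "customer_tier": customers[o["customer_id"]]["tier"]}
    for o in orders
]

# Reading the customer tier for an order is now a single-row access:
tier = orders_denorm[0]["customer_tier"]
```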

“These [data] transformation pipelines are required so that data consumers can query data efficiently, but the pipelines themselves often contribute to latency, stale data, complexity and higher costs,” said Andy Oliver, director of marketing at real-time analytics company CelerData. “Maintaining these pipelines and potentially scheduling them is complex, error-prone and labor-intensive to debug. In many datasets, a significant portion of data is never queried. Which data this will be isn’t known in advance, but transforming non-queried data is a waste of money. Real-time systems should avoid transformations to preserve freshness but all systems should weigh transformations against cost.”

Oliver suggests that many organizations turn to transformation pipelines in response to the limitations of query engines and databases. Many solutions for real-time analytics don’t handle joins efficiently and require data to be denormalized into flat tables. This essentially requires transformation pipelines to pre-join the data before it enters the database. Finding solutions that handle joins efficiently means finding a native-language query engine that uses new techniques like massively parallel processing (MPP), cost-based optimization and the Single Instruction/Multiple Data (SIMD) instruction set to do efficient joins across billions of rows in seconds or sub-second speed.

“In addition to joins, the need for aggregation is another major reason organizations turn to transformation pipelines,” explained CelerData’s Oliver. “Data is often pre-aggregated for summation, grouping, counts etc. Databases with efficient cost-based optimizers that support materialized views can avoid the pipeline by performing ‘Extract, Load, Transform’ (ELT) instead of traditional ‘Extract, Transform, Load’ (ETL). While this may still involve a transformation, it can be virtually invisible to the data infrastructure and applications as well as more cost efficient than external aggregation via tools like Spark or Flink.”
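A rough sketch of the ELT idea using SQLite as a stand-in for an analytical database (table and column names are invented, and an ordinary view stands in for a materialized view, which SQLite lacks): raw rows are loaded as-is, and the aggregation happens inside the database rather than in an external pipeline:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")

# Extract + Load: raw rows go straight into the database, untransformed.
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# Transform inside the database: a view plays the role of a materialized
# view here, pre-defining the aggregation for summation and counts.
con.execute("""CREATE VIEW spend_per_user AS
               SELECT user_id, SUM(amount) AS total, COUNT(*) AS n
               FROM events GROUP BY user_id""")

rows = con.execute(
    "SELECT user_id, total, n FROM spend_per_user ORDER BY user_id"
).fetchall()
```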

Besides the costs and complexities associated with denormalizing data in data pipelines, we may also need to consider real-time updates.

Append appendectomy

Having worked with a wide variety of organizations on these issues, Oliver and the CelerData team remind us that some types of data ‘change’ existing data. One way to deal with it is a structure that is ‘append only’, but that can lead to inefficiencies. If data is heavily processed through a pipeline, then updates must be similarly processed. Update pipelines not only introduce potential row-by-row latency but update and delete pipelines are even more complex than append pipelines.
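One common way to reconcile updates with an append-only structure (a sketch, not any particular product's design) is to append new versions and tombstones, then resolve latest-wins at read time, which is where the inefficiency creeps in: old versions accumulate until something compacts them:

```python
# Append-only log: updates and deletes are new rows, never in-place edits.
log = [
    {"id": 1, "status": "new",     "version": 1},
    {"id": 2, "status": "new",     "version": 1},
    {"id": 1, "status": "shipped", "version": 2},  # an "update"
    {"id": 2, "status": None,      "version": 2},  # a "delete" (tombstone)
]

def current_state(log):
    # Latest-wins compaction: the highest version per id is the truth;
    # tombstones (status None) drop the row entirely.
    latest = {}
    for row in log:
        if row["id"] not in latest or row["version"] > latest[row["id"]]["version"]:
            latest[row["id"]] = row
    return {k: v for k, v in latest.items() if v["status"] is not None}

state = current_state(log)
```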

“By using a newer query engine and data lakehouse technologies, it’s possible to avoid transformation pipelines entirely. There will still be sources and destinations, but no independent stages. Complex queries may not be as efficient as pre-flattened data, but the architecture will be simpler and there is no cost in terms of compute, memory, or storage to pre-transform data that is ultimately left unqueried. This ‘pipeline free’ architecture also ensures that data is fresher,” said Oliver.

But, he reminds us, going pipeline-free does not make sense for everything. For data that is always joined and has few unqueried rows, it may make more sense to pre-join the data through a pipeline. Many systems may be a hybrid of normalized and denormalized data. In Airbnb’s famed Minerva metric platform, for example, data is left normalized by default. Frequently queried dashboard data is still optimized into flatter tables, which balances query-time efficiency with flexibility.

“Analytics is changing,” surmised Oliver. “With AI comes the need for larger datasets. With modern customer profiling, personalization techniques and the Internet of Things comes the demand for lower latency. All of this requires new technologies that support larger amounts of more complex mutable [liable and/or capable of change] data. Newer query engines are required to support these requirements along with an eye towards continuous evolution and integration. Adopting new technologies allows data platform architects and engineers to escape the stranglehold of ever more complex data pipelines.”

As we create a new world of massive data (perhaps unintentionally or serendipitously coining the term as a progression onward past old-fashioned big data), we may need a new approach to data pipeline maintenance if we are going to be able to keep the flow flowing, keep the valves well-greased and keep the structure itself sound and free of cracks and fissures.
