Startup Dreamers
Innovation

Inside The Data Transformation Cement Mixer

By admin | October 17, 2023 | 6 Mins Read

Pipelines run the world. Whether it is petroleum, gas, hydrogen, sewage or some form of slurry (water mixed with a solid) or indeed the cabling pipelines that help to create the Internet, pipelines are pretty much at the heart of everything. As we move from the physical world to the digital fabric now underpinning society, we also talk about data pipelines.

What is a data pipeline?

An increasingly common term in the world of big data and cloud databases, a data pipeline describes the passage of data from its point of creation (where it is typically raw, unstructured or otherwise more jumbled than we would like) through successive stages of classification, deduplication, integration and standardization. A data pipeline also encompasses core processes related to analytics, storage, applications, management and maintenance. There is no faucet, valve or tap as such, but a data pipeline can be turned on and off just like a physical one.
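As a rough illustration of the idea (the stage names and record shapes here are invented for the example), a pipeline can be modeled as a sequence of functions applied to raw records, where switching a stage off is as simple as dropping it from the list:

```python
# Minimal sketch of a data pipeline as composed stages (illustrative only).

def classify(records):
    # Tag each raw record with a type based on a simple heuristic.
    return [{**r, "type": "numeric" if isinstance(r.get("value"), (int, float)) else "text"}
            for r in records]

def deduplicate(records):
    # Keep only the first record seen for each id.
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def standardize(records):
    # Strip whitespace and normalize casing on text values.
    return [{k: v.strip().lower() if isinstance(v, str) else v for k, v in r.items()}
            for r in records]

def run_pipeline(raw):
    stages = [classify, deduplicate, standardize]
    data = raw
    for stage in stages:   # the pipeline's "valve": skip a stage to turn it off
        data = stage(data)
    return data

raw = [{"id": 1, "value": " Hello "}, {"id": 1, "value": " Hello "}, {"id": 2, "value": 3.5}]
result = run_pipeline(raw)
```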

When we look at how we work with data pipelines today, many of the functions (inside the pipe, so to speak) are related to data transformation to get data from one state or place to another. Because not all data in the pipeline is in the right format to be used by modern data analytics tools, a degree of data transformation (not to be confused with digital transformation) runs through the whole process.

Some data transformation techniques change formats like JSON (JavaScript Object Notation – a text-based human-readable data interchange format used to exchange data) into tables, while others take tabular data and flatten or ‘denormalize’ it. Both allow the analytical database to process the data more easily and are done in a series of steps to extract, load and transform it.
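A sketch of that flattening step (the document shape and field names are hypothetical): a nested JSON record is denormalized into one flat row per line item, with the parent fields repeated on every row so an analytical store can scan them without re-assembling the nesting:

```python
import json

# A nested JSON document, as it might arrive from an API (hypothetical shape).
doc = json.loads(
    '{"order": 7, "customer": {"name": "Ada", "city": "London"},'
    ' "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}'
)

def flatten(record):
    # Denormalize: one flat row per line item, with the order and
    # customer fields repeated (redundantly but harmlessly) on each row.
    return [
        {
            "order": record["order"],
            "customer_name": record["customer"]["name"],
            "customer_city": record["customer"]["city"],
            "sku": item["sku"],
            "qty": item["qty"],
        }
        for item in record["items"]
    ]

rows = flatten(doc)  # two flat rows, ready for a tabular/columnar store
```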

What is data denormalization?

For clarification here, data denormalization is the action of adding precomputed redundant (but harmless and matching) data to a relational database in order to elevate and accelerate the database’s ability to ‘read’ information from its data store repository. Conversely, database normalization requires us to remove any redundant data and deduplicate the data set so that we only have a single version of each data record. All clear so far then, but what does life really look like in the data transformation zone and does the very act of doing so clog up the pipeline?
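To make the denormalization trade concrete (a toy sketch with invented tables): in the normalized form a read must look up the customer by id, while the denormalized form precomputes those redundant but matching fields onto each order row so a read touches a single row:

```python
# Normalized: customer details live in one place; every read needs a lookup/join.
customers = {101: {"name": "Ada", "tier": "gold"}}
orders = [{"order_id": 1, "customer_id": 101, "total": 40.0}]

# Denormalized: redundant (but matching) customer fields are precomputed
# onto each order row, accelerating reads at the cost of duplicated storage.
orders_denorm = [
    {**o,
     "customer_name": customers[o["customer_id"]]["name"],
     "customer_tier": customers[o["customer_id"]]["tier"]}
    for o in orders
]

# Reading the customer tier for an order is now a single-row access:
tier = orders_denorm[0]["customer_tier"]
```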

“These [data] transformation pipelines are required so that data consumers can query data efficiently, but the pipelines themselves often contribute to latency, stale data, complexity and higher costs,” said Andy Oliver, director of marketing at real-time analytics company CelerData. “Maintaining these pipelines and potentially scheduling them is complex, error-prone and labor-intensive to debug. In many datasets, a significant portion of data is never queried. Which data this will be isn’t known in advance, but transforming non-queried data is a waste of money. Real-time systems should avoid transformations to preserve freshness but all systems should weigh transformations against cost.”

Oliver suggests that many organizations turn to transformation pipelines in response to the limitations of query engines and databases. Many solutions for real-time analytics don’t handle joins efficiently and require data to be denormalized into flat tables. This essentially requires transformation pipelines to pre-join the data before it enters the database. Finding solutions that handle joins efficiently means finding a native-language query engine that uses new techniques like massively parallel processing (MPP), cost-based optimization and the Single Instruction/Multiple Data (SIMD) instruction set to do efficient joins across billions of rows in seconds or sub-second speed.

“In addition to joins, the need for aggregation is another major reason organizations turn to transformation pipelines,” explained CelerData’s Oliver. “Data is often pre-aggregated for summation, grouping, counts etc. Databases with efficient cost-based optimizers that support materialized views can avoid the pipeline by performing ‘Extract, Load, Transform’ (ELT) instead of traditional ‘Extract, Transform, Load’ (ETL). While this may still involve a transformation, it can be virtually invisible to the data infrastructure and applications as well as more cost efficient than external aggregation via tools like Spark or Flink.”
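A rough sketch of the ELT idea using SQLite as a stand-in for an analytical database (table and column names are invented, and an ordinary view stands in for a materialized view, which SQLite lacks): raw rows are loaded as-is, and the aggregation happens inside the database rather than in an external pipeline:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")

# Extract + Load: raw rows go straight into the database, untransformed.
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# Transform inside the database: a view plays the role of a materialized
# view here, pre-defining the aggregation for summation and counts.
con.execute("""CREATE VIEW spend_per_user AS
               SELECT user_id, SUM(amount) AS total, COUNT(*) AS n
               FROM events GROUP BY user_id""")

rows = con.execute(
    "SELECT user_id, total, n FROM spend_per_user ORDER BY user_id"
).fetchall()
```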

Besides the costs and complexities associated with denormalizing data in data pipelines, we may also need to consider real-time updates.

Append appendectomy

Having worked with a wide variety of organizations on these issues, Oliver and the CelerData team remind us that some types of data ‘change’ existing data. One way to deal with it is a structure that is ‘append only’, but that can lead to inefficiencies. If data is heavily processed through a pipeline, then updates must be similarly processed. Update pipelines not only introduce potential row-by-row latency but update and delete pipelines are even more complex than append pipelines.
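One common way to reconcile updates with an append-only structure (a sketch, not any particular product's design) is to append new versions and tombstones, then resolve latest-wins at read time, which is where the inefficiency creeps in: old versions accumulate until something compacts them:

```python
# Append-only log: updates and deletes are new rows, never in-place edits.
log = [
    {"id": 1, "status": "new",     "version": 1},
    {"id": 2, "status": "new",     "version": 1},
    {"id": 1, "status": "shipped", "version": 2},  # an "update"
    {"id": 2, "status": None,      "version": 2},  # a "delete" (tombstone)
]

def current_state(log):
    # Latest-wins compaction: the highest version per id is the truth;
    # tombstones (status None) drop the row entirely.
    latest = {}
    for row in log:
        if row["id"] not in latest or row["version"] > latest[row["id"]]["version"]:
            latest[row["id"]] = row
    return {k: v for k, v in latest.items() if v["status"] is not None}

state = current_state(log)
```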

“By using a newer query engine and data lakehouse technologies, it’s possible to avoid transformation pipelines entirely. There will still be sources and destinations, but no independent stages. Complex queries may not be as efficient as pre-flattened data, but the architecture will be simpler and there is no cost in terms of compute, memory, or storage to pre-transform data that is ultimately left unqueried. This ‘pipeline free’ architecture also ensures that data is fresher,” said Oliver.

But, he reminds us, going pipeline-free does not make sense for everything. For data that is always joined and has few unqueried rows, it may make more sense to pre-join the data through a pipeline. Many systems may be a hybrid of normalized and denormalized data. In Airbnb’s famed Minerva metric platform, for example, data is left normalized by default. Frequently queried dashboard data is still optimized into flatter tables, which balances query-time efficiency with flexibility.

“Analytics is changing,” surmised Oliver. “With AI comes the need for larger datasets. With modern customer profiling, personalization techniques and the Internet of Things comes the demand for lower latency. All of this requires new technologies that support larger amounts of more complex mutable [liable and/or capable of change] data. Newer query engines are required to support these requirements along with an eye towards continuous evolution and integration. Adopting new technologies allows data platform architects and engineers to escape the stranglehold of ever more complex data pipelines.”

As we create a new world of massive data (perhaps unintentionally or serendipitously coining the term as a progression onward past old-fashioned big data), we may need a new approach to data pipeline maintenance if we are going to be able to keep the flow flowing, keep the valves well-greased and keep the structure itself sound and free of cracks and fissures.
