Kunal Agarwal is the CEO of Unravel Data, specializing in data observability and FinOps for the modern data stack.
When gold was discovered in California in 1848, it triggered one of the largest migrations in U.S. history, accelerated a transportation revolution and helped revitalize the U.S. economy. There’s another kind of Gold Rush happening today: a mad dash to invest in artificial intelligence (AI) and machine learning (ML).
The speed at which AI-related technologies have been embraced by businesses means that companies can’t afford to sit on the sidelines. Companies also can’t afford to invest in models that fail to live up to their promises.
But AI comes with a cost. McKinsey estimates that developing a single generative AI model costs up to $200 million, that customizing an existing model with internal data costs up to $10 million and that deploying an off-the-shelf solution costs up to $2 million.
The volume of generative AI/ML workloads, and of the data pipelines that power them, has also grown exponentially as departments across the business apply this transformational technology to different use cases. Bloomberg Intelligence reports that the generative AI market is poised to explode, growing to $1.3 trillion over the next 10 years from a market size of just $40 billion in 2022. And every job, every workload and every data pipeline costs money.
Because of the cost factor, winning the AI race isn’t just about getting there first; it’s about making the best use of resources to achieve maximum business goals.
The Snowball Effect
There was a time when IT teams were the only ones using AI/ML models. Now, teams across the enterprise, from marketing and risk to finance, product and supply chain, are using AI in some capacity, and many of them lack the training and expertise to run these models efficiently.
AI/ML models process far more data than traditional workloads, requiring massive amounts of cloud compute and storage resources. That makes them expensive: A single training run for GPT-3 is estimated to cost $12 million.
Enterprises today may have upwards of tens—even hundreds—of thousands of pipelines running at any given time. Running sub-optimized pipelines in the cloud often causes costs to quickly spin out of control.
The most obvious culprit is oversized infrastructure, where users simply guess how much compute capacity they need rather than basing it on actual usage requirements. The same goes for storage, where teams may pay for more expensive tiers than necessary to hold huge amounts of data they rarely access.
But data quality problems and inefficient code often cause costs to soar even higher: data schema issues, data skew and load imbalances, idle time and a rash of other code-related issues make data pipelines take longer to run than necessary, or fail outright.
Like a snowball gathering size as it rolls down a mountain, the more data pipelines you have running, the more problems, headaches and, ultimately, costs you’re likely to have.
And it’s not just cloud costs that need to be considered. Modern data pipelines and AI workloads are complex. It takes a tremendous amount of troubleshooting expertise just to keep models working and meeting business SLAs—and that doesn’t factor in the costs of downtime or brand damage. For example, if a bank’s fraud detection app goes down for even a few minutes, how much would that cost the company?
Optimized Data Workloads On The Cloud
Optimizing cloud data costs is a business strategy that ensures a company’s resources are allocated appropriately and in the most cost-efficient manner. It’s fundamental to the success of an AI-driven company because it keeps cloud data budgets working effectively and delivering maximum ROI.
But business and IT leaders first need to understand exactly where resources are being used efficiently and where waste is occurring. To that end, keep the following in mind when developing a cloud data cost optimization strategy.
• Reuse building blocks. Everything on the cloud costs money. Every file you store, every record you access and every piece of code you run incurs a cost. Data processing can usually be broken down into a series of steps, and a smart data team should be able to reuse those steps for other processing. For example, code written to move data about a company’s sales records could be reused by the pricing and product teams rather than each team building its own code separately and incurring twice the cost (a minimal sketch of such a shared building block follows this list).
• Truly leverage cloud capabilities. The cloud allows you to quickly adjust the resources needed to process data workloads. Unfortunately, too many companies operate under “just in case” scenarios that lead to allocating more resources than they actually need. By understanding usage patterns and leveraging the cloud’s auto-scaling capabilities, companies can dynamically control how they scale up and, more importantly, create guardrails that cap the maximum (see the autoscaling sketch after this list).
• Analyze compute and storage spend by job and by user. The ability to dig down to the granular details of who is spending what on which project will likely yield a few surprises. You might find that the most expensive jobs are not the ones making your company millions. You may find that you’re paying far more for exploration than for data models that will be put to good use. Or you may find that the same group of users is responsible for the jobs with the biggest spend and the lowest ROI (in which case, it might be time to tighten up some processes). A rough sketch of this kind of spend breakdown also follows this list.
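To make the "reuse building blocks" idea concrete, here is a minimal sketch in Python, assuming a pandas-based pipeline; the file paths and column names are illustrative assumptions, not a prescribed schema. The point is that one shared, well-tested step is written and paid for once, then reused by multiple teams instead of being rebuilt and re-run separately.

```python
import pandas as pd

def load_sales_records(path: str) -> pd.DataFrame:
    """Shared building block: load and standardize raw sales records.

    Written and maintained once, then reused by any team that needs clean
    sales data, rather than each team paying to build and run its own copy.
    Column names here are illustrative assumptions.
    """
    sales = pd.read_csv(path, parse_dates=["order_date"])
    sales["region"] = sales["region"].str.upper().str.strip()
    sales["revenue"] = sales["quantity"] * sales["unit_price"]
    return sales

def pricing_team_view(path: str) -> pd.DataFrame:
    """Pricing team reuses the shared step for average revenue per product."""
    sales = load_sales_records(path)
    return sales.groupby("product_id", as_index=False)["revenue"].mean()

def product_team_view(path: str) -> pd.DataFrame:
    """Product team reuses the same step for demand by region."""
    sales = load_sales_records(path)
    return sales.groupby(["region", "product_id"], as_index=False)["quantity"].sum()
```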
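For the auto-scaling point, the sketch below shows the shape of a guardrail rather than any specific provider's API: a hypothetical cluster specification with a floor for quiet periods, a hard ceiling on scale-up and automatic termination of idle clusters. The field names are assumptions, not real configuration keys.

```python
# Hypothetical autoscaling spec for a data-processing cluster. Field names
# are illustrative and not tied to any specific cloud provider; the key
# idea is the guardrail: a hard ceiling on how far the cluster can scale.
cluster_spec = {
    "node_type": "standard-8-core",
    "autoscale": {
        "min_workers": 2,    # scale down when pipelines are quiet
        "max_workers": 20,   # guardrail: never exceed this, even at peak demand
    },
    "auto_terminate_minutes": 30,  # stop paying for clusters that sit idle
}
```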
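And for the last point, a rough sketch of a spend breakdown, assuming a cloud billing export flattened to one row per job run; the file name and columns (job_id, user, team, compute_cost, storage_cost) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical billing export: one row per job run, with the user who
# launched it and the compute and storage cost it incurred.
billing = pd.read_csv("billing_export.csv")

billing["total_cost"] = billing["compute_cost"] + billing["storage_cost"]

# Spend per job: which pipelines are actually driving the bill?
spend_by_job = (
    billing.groupby("job_id")["total_cost"].sum().sort_values(ascending=False)
)

# Spend per user and team: who is running the most expensive workloads?
spend_by_user = (
    billing.groupby(["team", "user"])["total_cost"].sum().sort_values(ascending=False)
)

print(spend_by_job.head(10))   # the 10 costliest jobs
print(spend_by_user.head(10))  # the 10 costliest user/team combinations
```

Pairing a breakdown like this with the business value of each job is what turns raw cost data into the ROI conversation described above.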
Given the data demands that generative AI models and use cases place on a company, business and IT leaders need to have a deep understanding of what’s going on under the proverbial hood. As generative AI evolves, business leaders will need to address new challenges. Keeping cloud costs under control shouldn’t be one of them.