Startup DreamersStartup Dreamers
  • Home
  • Startup
  • Money & Finance
  • Starting a Business
    • Branding
    • Business Ideas
    • Business Models
    • Business Plans
    • Fundraising
  • Growing a Business
  • More
    • Innovation
    • Leadership
Trending

Seagate HDDs For AI And Panmnesia’s Composable AI Infrastructure

July 15, 2025

How Much Money You Need to Be Wealthy: Survey

July 15, 2025

‘People Are Going to Die’: A Malnutrition Crisis Looms in the Wake of USAID Cuts

July 15, 2025
Facebook Twitter Instagram
  • Newsletter
  • Submit Articles
  • Privacy
  • Advertise
  • Contact
Facebook Twitter Instagram
Startup DreamersStartup Dreamers
  • Home
  • Startup
  • Money & Finance
  • Starting a Business
    • Branding
    • Business Ideas
    • Business Models
    • Business Plans
    • Fundraising
  • Growing a Business
  • More
    • Innovation
    • Leadership
Subscribe for Alerts
Startup DreamersStartup Dreamers
Home » Why The Future Of Generative AI Lies In A Company’s Own Data
Innovation

Why The Future Of Generative AI Lies In A Company’s Own Data

adminBy adminOctober 17, 20230 ViewsNo Comments5 Mins Read
Facebook Twitter Pinterest LinkedIn Tumblr Email

Alex Ratner is the co-founder and CEO at Snorkel AI, and an Affiliate Assistant Professor of Comput. Sci. at the University of Washington.

The age of large language models (LLMs) and generative AI has sparked excitement for business leaders. But those who want to launch their own LLM face many hurdles between wanting a production generative AI tool and developing one that delivers real business value and sustained advantage.

Foundation models themselves have quickly become commoditized. Any developer can build on the Google Bard or OpenAI APIs. More mature organizations can deploy models like Llama 2 in their own walled gardens. But if their competitors also use Llama 2, what advantage do they have? Proprietary data—and the knowledge of how to develop and use it—provides the only sustainable enterprise AI moat.

As generative AI has reached the peak of the Gartner AI Hype Cycle, enterprises are learning that off-the-shelf LLMs can’t solve every problem—particularly not unique, high-value problems. Proprietary data can close the gap, but only when properly curated and developed.

Your Data, Your Moat

Off-the-shelf LLMs yield fun experiments and demos, but using them in a business setting will rarely achieve the accuracy needed to deliver business value. Businesses don’t need chatbots that can discuss poetry as competently as they explain computer code—they need highly accurate specialists.

Private data is a moat—a potential competitive advantage. By leveraging your proprietary data and subject matter expertise, you can build generative models that work better for your domain, your chosen tasks and your customers.

Enterprises can gain these advantages in three ways:

1. Retrieval augmentation.

2. Fine-tuning with prompts and responses.

3. Self-supervised pre-training.

Let’s briefly look at each.

Retrieval Augmentation

Retrieval augmented generation, better known as RAG, allows your generative AI pipeline to enrich prompts with query-specific knowledge from a company’s proprietary databases or document archives. This generally yields better, more accurate answers, even with a standard LLM.

But this is similar to giving an intern access to your intranet: Even with all the information before them, the intern may misunderstand or miscommunicate. To get better performance, your data team needs a customized model.

Fine-Tuning With Prompts And Responses

Data-driven organizations can fine-tune LLMs with curated prompts and responses. This sharpens and improves the model’s output on the organization’s most important tasks. To use a metaphor, a doctor needs access to a patient’s medical chart (retrieval augmentation) and specialty training (fine-tuning) to render an accurate diagnosis. Data scientists can carefully choose the prompts and responses used to fine-tune the LLM to improve performance on a wide variety of tasks or greatly boost performance on a very narrow set of them.

Self-Supervised Pre-Training

Some organizations may want to take their LLM customization further and build a model from scratch. However, this can demand more effort than it’s worth. Firms with business vocabularies well-represented in the embedding spaces of off-the-shelf LLMs can often achieve the necessary performance gains through fine-tuning alone.

If an organization feels that it needs a model custom-built from the ground up, its data team first selects a model architecture and then trains it on unstructured text—initially on a large, generalized corpus, then on proprietary data. This teaches the model to understand the relationships between words in a way that’s specific to the company’s domain, history, positioning and products. The data team can then further train the model on prompts and responses to make it not just knowledgeable but task-oriented.

The Data Lift

The ideal deployment would incorporate all three of the above approaches, but that represents a heavy data labeling load. Studies from McKinsey and Appen show that a lack of high-quality labeled data blocks enterprise AI projects more often than any other factor—and, to be clear, all three of these approaches require labeled data.

Fine-tuning with prompts and responses requires data teams to identify and label prompts according to the task and then determine high-quality responses. Pre-training with self-supervised learning requires companies to carefully curate the unstructured data they feed the model. Training on lunch orders and payroll could degrade performance or cause sensitive data to leak internally.

Even retrieval augmentation benefits from data labeling. Although vector databases efficiently handle relevance metrics, they won’t know if a retrieved document is accurate and up to date. No company wants its internal chatbot to return out-of-date prices or recommend discontinued products.

Data Is Essential To Delivering Generative AI Value

Using your proprietary data to build your AI moat requires work, and that work rests heavily on data-centric approaches including data labeling and curation. Firms can outsource some labeling to crowd workers, but much of it will be too complex, specialized or sensitive for gig workers to handle. And it’s still time-consuming and expensive, similar to relying on internal experts for data labeling.

Data science teams can use active learning approaches such as label spreading to amplify the impact of internal labelers. Programmatic labeling is another option. Our researchers used those tools to build a better LLM.

Your data—properly prepared—is the most important thing your organization brings to AI and where your organization should spend the most time to extract the most value. Your data is your moat. Use it.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives. Do I qualify?

Read the full article here

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email

Related Articles

Seagate HDDs For AI And Panmnesia’s Composable AI Infrastructure

Innovation July 15, 2025

A Cybersecurity Primer For Businesses In 2025

Innovation July 14, 2025

Today’s Extra Clues And Answers

Innovation July 13, 2025

One Of The Best Action Movies Ever Made Lands On Netflix Today

Innovation July 12, 2025

Today’s NYT Mini Crossword Clues And Answers For Friday, July 11th

Innovation July 11, 2025

Taylor Vs. Serrano 3 Will Set A World Record—Here’s How To Watch

Innovation July 10, 2025
Add A Comment

Leave A Reply Cancel Reply

Editors Picks

Seagate HDDs For AI And Panmnesia’s Composable AI Infrastructure

July 15, 2025

How Much Money You Need to Be Wealthy: Survey

July 15, 2025

‘People Are Going to Die’: A Malnutrition Crisis Looms in the Wake of USAID Cuts

July 15, 2025

A Cybersecurity Primer For Businesses In 2025

July 14, 2025

Why Surcharging Is a Bad Move For Small Businesses — and What to Do Instead

July 14, 2025

Latest Posts

How to Build a Side Hustle That Stands on Its Own — Without Burning Out

July 14, 2025

Tornado Cash Made Crypto Anonymous. Now One of Its Creators Faces Trial

July 14, 2025

Today’s Extra Clues And Answers

July 13, 2025

‘Obvious’ Side Hustle: From $300k Monthly to $20M+ in 2025

July 13, 2025

The Smart Way to Scale From Single- to Multi-Unit Ownership

July 13, 2025
Advertisement
Demo

Startup Dreamers is your one-stop website for the latest news and updates about how to start a business, follow us now to get the news that matters to you.

Facebook Twitter Instagram Pinterest YouTube
Sections
  • Growing a Business
  • Innovation
  • Leadership
  • Money & Finance
  • Starting a Business
Trending Topics
  • Branding
  • Business Ideas
  • Business Models
  • Business Plans
  • Fundraising

Subscribe to Updates

Get the latest business and startup news and updates directly to your inbox.

© 2025 Startup Dreamers. All Rights Reserved.
  • Privacy Policy
  • Terms of use
  • Press Release
  • Advertise
  • Contact

Type above and press Enter to search. Press Esc to cancel.

GET $5000 NO CREDIT