Imagine trying to run an autonomous car through this traffic more than a century ago. Now imagine that there are locales exactly like this that an autonomous L4 car needs to traverse. The lifeblood of AI (at least until we reach a point where computers can actually think like humans) is data – a lot of it, spanning diverse weather conditions, driving patterns, driver/pedestrian behavior and, very importantly, edge cases, for training neural networks to perceive, plan a path, and execute actions.
One way to generate data is to physically collect and localize it by sending a fleet of human-driven cars to different locations to create maps and traffic scenarios. Waymo did exactly this – before launching a true L4 ride-sharing service in San Francisco, CA in 2024, it collected ~20M miles of data in the Bay Area over a 15-year period (Waymo’s parent, Google, started collecting data in 2010). The difficulty and costs associated with such massive road data collection are enormous (approaching billions of dollars), making it difficult for new entrants to disrupt the autonomy space. Similar constraints exist for other applications of physical AI – in construction, agriculture, mining, rail, robotics, maritime, and aerial vehicles like drones.
Another way is to generate relevant and realistic synthetic data. Waymo already does this using generative AI, and so do a host of other AV companies like Waabi, Aurora Tech and Zoox. DiffuseDrive, a California-based company, also uses generative AI, along with its proprietary algorithms, to create realistic data with minimal human involvement. This includes corner cases that are expensive to capture on the road and essentially rely on luck. The company has multiple customers in defense (where meaningful field data is even harder to generate) and in the commercial space (transportation, industrial autonomy and robotics).
DiffuseDrive – Data Richness with Generated Real World-Grade Data
Estimates project the market for synthetic vision data at $2B by 2030 (currently ~$425M). DiffuseDrive (DD), a Silicon Valley startup founded in 2023 by CEO Bálint Pásztor and CTO Roland Pintér (both Bosch alumni), is hoping to grab a slice of this pie. The company recently raised $4M in a seed round from Outlander, Presto Tech Horizons and NeuronVC. With previous funding from E2VC, this brings the total raised to date to $5M. DD currently has 12 people on staff (4 in the U.S. and 8 in Europe).
Figure 1 shows the data pipeline for DD’s platform.
The process starts with the customer’s existing camera image and/or video data collected in the field (already labeled).
The next step is statistical analysis of this data (data-mining). Figure 2 shows an exploded view of this block (shown as part of the data pipeline in Figure 1).
- Bounding Box Dimension Analysis: analyzes object size distributions across classes (e.g., width/height, pixel coverage) to answer questions such as whether the data represents near or far-off objects. This helps determine data gaps.
- Center-Point Distribution: analyzes the spatial positioning of objects within the frame to see whether they cluster at the center, and whether edge-of-frame objects are under-represented.
- Clustering / Feature Space Coverage: identifies well-covered (and duplicative) scenarios as well as sparse regions (data gaps).
- Co-occurrence Matrix: measures how often different classes appear in the same frame – for example, pedestrians and cars. This helps in understanding situational context and identifies rare combinations that could be critical edge cases.
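To make the data-mining step concrete, here is a minimal Python sketch of two of the analyses above – bounding-box dimension analysis and a class co-occurrence matrix – run over toy annotation records (the field names and numbers are illustrative, not DD’s actual schema):

```python
from collections import Counter
from itertools import combinations

# Toy annotations: one dict per frame, listing labeled boxes (class, w, h in pixels).
frames = [
    {"boxes": [("car", 120, 80), ("pedestrian", 30, 90)]},
    {"boxes": [("car", 40, 25)]},
    {"boxes": [("pedestrian", 25, 70), ("car", 200, 120), ("cyclist", 35, 85)]},
]

# 1. Bounding-box dimension analysis: mean pixel coverage per class
#    (small boxes suggest far-off objects, large boxes near ones).
areas = {}
for f in frames:
    for cls, w, h in f["boxes"]:
        areas.setdefault(cls, []).append(w * h)
mean_area = {cls: sum(a) / len(a) for cls, a in areas.items()}

# 2. Co-occurrence matrix: how often two classes appear in the same frame.
cooc = Counter()
for f in frames:
    classes = sorted({cls for cls, _, _ in f["boxes"]})
    for pair in combinations(classes, 2):
        cooc[pair] += 1

print(mean_area)                    # cars average far larger than pedestrians here
print(cooc[("car", "pedestrian")])  # 2 frames contain both classes
```

Sparse cells in the co-occurrence matrix (class pairs that rarely or never appear together) are exactly the rare combinations flagged as potential edge cases.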
Finally, the magic begins. Data generation is a matter of generating images within the distribution (based on Item 3, Clustering, above) and also expanding out of distribution. This is done using diffusion models – the same technique that powers generative AI text-to-image systems – which are trained by progressively injecting random noise into samples and then learning to reverse the process, filtering out the noise step by step to generate high-quality synthetic images. It sounds like black magic, but it has a rigorous mathematical basis. LLMs (Large Language Models) act as “directors” by converting user prompts into descriptive, structured text that guides the diffusion model to produce accurate, contextually relevant images, within and out of distribution. In DD’s case, the statistical analyses in Figure 2 are used in a structured process (via LLM prompts) to guide the customer to add scenarios that fill the gap areas. Different versions of the data generation modules are used for different verticals. For example, in the commercial automotive space, obstacle avoidance is the imperative, whereas for defense applications, being able to target and damage an adversary asset is the imperative.
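The noise-injection half of the process can be sketched in a few lines of NumPy. This is a generic DDPM-style forward process, not DD’s implementation; the reverse (generation) direction requires a trained neural network and is only noted in comments:

```python
import numpy as np

# Forward diffusion: progressively inject Gaussian noise into an image
# following a fixed variance schedule. Generation runs this in reverse,
# with a trained network estimating the noise to remove at each step
# (not shown -- that part needs a trained model).
rng = np.random.default_rng(0)

T = 1000                              # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)    # variance schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def noised_sample(x0, t):
    """Closed-form q(x_t | x_0): noisy version of image x0 at step t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.random((64, 64, 3))          # stand-in for a normalized image
x_early = noised_sample(x0, 10)       # mostly signal
x_late = noised_sample(x0, T - 1)     # nearly pure Gaussian noise

print(alpha_bar[T - 1])               # tiny: almost no signal remains at the last step
```

The “rigorous mathematical basis” is visible in the closed form: at any step t, the sample is a precisely weighted mix of the original image and Gaussian noise, which is what makes the reverse process learnable.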
Exception cases are generated through a structured scenario definition exercise (using LLMs) with the customer that allows them to:
- Specify environmental parameter ranges (terrain, weather, lighting, time-of-day)
- Control object types, distributions, interactions
- Introduce rare configurations (long tail – rare but plausible, say a tiger on the highway) or system-limiting stress conditions (edge cases like heavy rain, where cameras barely function). Edge and long-tail cases call for different autonomy strategies and actions.
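A scenario definition of this kind could be captured as a structured config. The sketch below is hypothetical (the class and field names are not DD’s actual API); it simply expands the specified parameter ranges into concrete scenario combinations that a generation pipeline would then render:

```python
from dataclasses import dataclass, field
from itertools import product

# Hypothetical scenario specification -- field names are illustrative only.
@dataclass
class ScenarioSpec:
    terrain: list
    weather: list
    time_of_day: list
    rare_objects: list = field(default_factory=list)  # long-tail injections

spec = ScenarioSpec(
    terrain=["highway", "urban"],
    weather=["clear", "heavy_rain"],   # heavy_rain: camera-limiting edge case
    time_of_day=["noon", "night"],
    rare_objects=["tiger"],            # rare but plausible long-tail case
)

# Expand the parameter ranges into concrete base scenarios.
scenarios = [
    {"terrain": t, "weather": w, "time_of_day": d}
    for t, w, d in product(spec.terrain, spec.weather, spec.time_of_day)
]
print(len(scenarios))   # 8 base scenarios before long-tail injection
```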
Once the customer provides the scenarios, diffusion algorithms can create tens of thousands – or millions – of unique but distribution-controlled scenarios that have never been observed in real-world datasets. One of the challenges with field data is that each set has to be manually curated and organized for a neural network to train and test on. DD provides auto-labeling of the generated synthetic data by using the definitions that were part of the scenario specification above. Customers can modify these auto-labels or label manually if they desire.
As mentioned above, other companies use generative AI and diffusion models as well. According to Bálint Pásztor, DiffuseDrive’s approach is different: “We focus on the underlying data problem that blocks autonomy across industries. Whether it is finding a target or avoiding a vulnerable road user, the core challenge is the same: missing coverage, weak edge cases, and slow iteration cycles. We work across defense, automotive, robotics, and industrial systems, enabling learning of patterns that generalize beyond a single use case.” This helps the scalability of synthetic data generation and reduces statistical brittleness in the training process.
To demonstrate the impact of adding high-quality synthetic data, DD tested a subset of a well-known open-source dataset, DOTA, that is widely used in the computer vision industry to characterize the quality of image recognition engines. DD limited the dataset to aerial images of planes and helicopters. A total of 2,961 real-world images (7% helicopters, 93% planes) were extracted for training and testing the neural network. Figure 3 shows examples.
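As an illustration of how such a subset can be extracted, the sketch below filters DOTA-style annotation records (eight polygon coordinates followed by a class name and a difficulty flag) down to the two classes of interest, using toy records in place of the real annotation files:

```python
# Classes to keep from the DOTA label set.
KEEP = {"plane", "helicopter"}

# Toy stand-ins for DOTA annotation files: one text line per object,
# "x1 y1 x2 y2 x3 y3 x4 y4 category difficult".
toy_annotations = {
    "P0001": ["10 10 50 10 50 40 10 40 plane 0",
              "60 60 90 60 90 80 60 80 helicopter 0"],
    "P0002": ["5 5 30 5 30 25 5 25 ship 0"],
    "P0003": ["0 0 40 0 40 30 0 30 plane 1"],
}

subset = {}
for image_id, lines in toy_annotations.items():
    # The 9th whitespace-separated field is the class name.
    kept = [ln for ln in lines if ln.split()[8] in KEEP]
    if kept:                   # keep only images containing target classes
        subset[image_id] = kept

print(sorted(subset))          # images containing at least one plane/helicopter
```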
Two performance metrics were tracked:
- Precision (P): ratio of correctly identified objects (say planes) to total predicted planes (which could contain false positives like classifying a helicopter as a plane). In essence, it measures the probability of a classification being true for a certain object class.
- Recall (R): ratio of correctly identified objects (say planes) to total number of actual planes in the data set. In essence, it measures the probability that an object in a certain class will be identified.
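The two metrics are straightforward to compute from ground-truth and predicted labels; the sketch below uses toy labels (the numbers are illustrative, not DD’s results):

```python
# Precision and recall for the "plane" class, from paired
# ground-truth / predicted labels.
ground_truth = ["plane", "plane", "plane", "helicopter", "helicopter"]
predicted    = ["plane", "plane", "helicopter", "plane", "helicopter"]

tp = sum(g == "plane" and p == "plane" for g, p in zip(ground_truth, predicted))
fp = sum(g != "plane" and p == "plane" for g, p in zip(ground_truth, predicted))
fn = sum(g == "plane" and p != "plane" for g, p in zip(ground_truth, predicted))

precision = tp / (tp + fp)   # of everything called "plane", how much was right
recall    = tp / (tp + fn)   # of all actual planes, how many were found

print(tp, fp, fn)            # 2 1 1
print(precision, recall)     # 0.667, 0.667
```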
DD compared P and R for the original real-world dataset, and for the same dataset augmented with synthetic images of both object classes generated by its data pipeline.
Main Conclusions (Refer to Figure 4):
- Using a real-world data set that is class-biased (93% planes, 7% helicopters) results in poor performance for helicopter recognition.
- Introducing DD’s synthetic data to correct the class bias improves helicopter recognition performance substantially when tested against real-world data. Recall for planes degrades, which is understandable since the neural network weights are now less biased towards planes.
- Combining all real-world and synthetic data for training and testing improves performance substantially for planes and helicopters.
Perspectives from two of DD’s customers are presented below. Their relationship with DD is confidential at this point, but the information is based on interviews with senior executives at these companies.
Customer 1: Global Automotive Tier 1 Supplier
This Tier 1’s engineering and R&D teams have been working with DD for the past 18 months on generating synthetic data (based on collected real-world driving data) to train their AI systems for ADAS (Advanced Driver Assistance Systems) and autonomous cars. Although Tier 1 suppliers do not generally engage in autonomous driving themselves, this activity informs the design of some of the hardware and software they supply to their OEM customers. Since their ability to run a large number of vehicles across various locations is limited, DD’s ability to generate and generalize synthetic data is very attractive. Additionally, with the advent of SDVs (Software Defined Vehicles), low ADAS system latency is critical. DD’s integration of synthetic data in the training pipeline builds more robust and efficient systems, making low-latency edge computing possible.
A critical issue for automotive companies is the General Data Protection Regulation (GDPR), a comprehensive EU law that governs data protection and privacy for individuals. Imaging data of cars and people on public roads is protected by GDPR. DD’s synthetic data (even if based on real-world testing data) makes compliance inherently possible, as opposed to real-world data that has to be manually edited to remove all personal information – for example, license plates and location information.
Another use case for this Tier 1 is to use road data to offer services to autonomy and transport companies for a given area. These could include re-fueling, cleaning, towing and repair. Understanding the local area and driving patterns using DD’s synthetic data helps in this endeavor.
Finally, all Tier 1 suppliers have different types of sensors integrated into multiple car models. This data can be used to assist local government bodies with urban planning and smart city infrastructure development. DD’s synthetic data pipeline enriches real-world camera image and video information.
Customer 2: U.S. Based Defense Contractor
Customer 2 works on software solutions that process sensor and GPS data from a variety of military platforms (ground, air, sea) and use it to create perception maps, identify threats (friend vs. foe) and provide targeting intelligence to defense forces for further action. The data is also used to coordinate autonomous movement (and missions) of various assets in a war or security-threatened situation (collaborative autonomy in challenging GPS- and communications-jammed environments, with enemy counter-measures). A previous article discussed the challenges of collaborative autonomy under these conditions. The key is to have enough data to train the systems beforehand on perception, path planning, platform control and warfare tactics. Collecting real-world data is incredibly challenging – at best, war-game scenarios can be used to complement sparse field data.
Enter DiffuseDrive.
The collaboration of this defense contractor with DD is to analyze satellite imagery of war zones and use it as the basis for generating synthetic data. Tools for labeling relevant data, and for finding edge and long-tail cases through DD’s data pipeline, are also exercised. DD is also organizing the pipeline to enable hierarchical labeling of the data (along groups of objects – for example, planes, helicopters, tanks, ships), with more detailed classifications within each group. This is important for understanding friend-vs-foe assets and their locations.
The collaboration with DD is relatively new (4Q2025) and is proceeding towards maturity and deployment.
Data is the lifeblood of physical AI. Collecting real-life data requires large numbers of moving platforms equipped with sensors and storage, and requires human operators. Organizing and labeling this data requires human curation. Timelines are long, privacy needs to be protected and operating costs are high. In spite of this effort, edge and long-tail cases can be missed, causing accidents in operation (for example, an autonomous vehicle speeding past an accident scene or a school bus with flashing lights). Using generative AI and diffusion algorithms to create synthetic, auto-labeled data in a scalable way, from limited amounts of real-world field data, can solve this bottleneck.