In an era marked by unprecedented advancements in AI and machine learning, Tesla’s Dojo Supercomputer is garnering a lot of attention following a positive writeup from Morgan Stanley, albeit attention shrouded in both skepticism and the usual Elon Musk-generated controversy. While the world’s leading tech giants such as NVIDIA, Intel, HPE, Lenovo, IBM, and Dell have been at the forefront of AI and supercomputer hardware, Tesla’s moves into chip design, supercomputing, and AI are more than corporate diversification; they offer insight into how Tesla, and more pertinently Elon Musk, thinks about innovation and vertical integration.
As of July, Tesla has sold 4,527,916 cars over its history, and every one of those vehicles transmits data back to Tesla to power the company’s efforts to develop full self-driving capabilities. Coupling a huge mobile sensor and camera network, backed by powerful edge computing, with an in-house supercomputer designed to learn from that data is a paradigm we haven’t seen before, and it elevates Tesla beyond being purely a vehicle manufacturer.
Setting the Context: Existing AI Processors and Supercomputers
To understand the significance of Dojo, one needs to examine the existing milieu of AI processors and supercomputing. Conventional supercomputers, such as systems built on NVIDIA’s A100 GPUs, IBM’s Summit, or HPE’s Cray exascale machines, have been vital in scientific research, complex simulations, and big data analytics. However, these systems are designed for a broad array of tasks rather than optimized for a single purpose, such as the real-world, data-driven computer vision workload Tesla is building Dojo for.
Tesla’s Dojo promises to revolutionize the AI processing landscape by focusing solely on improving the company’s FSD capabilities. With this vertical integration, Tesla aims to construct an ecosystem that encompasses hardware, data, and practical application, a trifecta that could usher in a new era of supercomputing explicitly designed for real-world data processing.
The Evolution of Tesla’s Supercomputing Capabilities
Historically, Tesla relied on NVIDIA GPU clusters to train the neural networks behind its Autopilot system. Despite the lack of clarity in performance metrics such as single- or double-precision floating-point throughput, Tesla claimed to operate a computing cluster that stood as the “fifth-largest supercomputer in the world” as of 2021. Details are hard to come by, but various commentators have put Tesla’s hoard of NVIDIA A100 GPUs at over 10,000 units, which by any measure makes Tesla’s one of the largest training systems globally, and the company has been at this for at least two years now.
Technological Nuances and Scalability
The architectural uniqueness of Dojo is evident in its building block, the D1 chip: manufactured by TSMC on a 7 nm process, with a large 645 mm² die carrying 50 billion transistors, and leveraging a RISC-V approach with custom instructions. The system scales by deploying multiple ExaPODs, each housing up to 1,062,000 cores and reaching 20 exaflops. This kind of scalability has never been more necessary given the gargantuan volume of data Tesla’s fleet generates. Furthermore, Dojo supports the PyTorch framework and introduces novel floating-point formats, CFloat8 and CFloat16, which reduce storage requirements and enable more efficient vector processing.
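Tesla has published only high-level details of CFloat8 and CFloat16, but the core idea, trading exponent range against mantissa precision within a small fixed bit budget, can be illustrated with a minimal sketch. The `quantize_cfloat` helper below and its default bit split are hypothetical for illustration, not Tesla’s actual encoding:

```python
import math

def quantize_cfloat(x: float, exp_bits: int = 4, man_bits: int = 3) -> float:
    """Round-trip a value through a hypothetical 8-bit float format
    (1 sign bit + exp_bits exponent bits + man_bits mantissa bits),
    showing how a configurable low-precision format trades range
    for precision."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    bias = 2 ** (exp_bits - 1) - 1
    # Clamp the exponent to the format's representable (normal) range.
    exp = max(min(math.floor(math.log2(mag)), bias), 1 - bias)
    # Round the significand, which lies in [1, 2), to man_bits bits.
    steps = 2 ** man_bits
    frac = round((mag / 2.0 ** exp) * steps) / steps
    return sign * frac * 2.0 ** exp

# A weight like 0.1734 survives only coarsely with 3 mantissa bits:
w = 0.1734
print(quantize_cfloat(w))        # 0.171875
print(quantize_cfloat(w, 5, 2))  # wider range, even less precision
```

Storing eight of these values in the space of two FP32 numbers is what makes such formats attractive for training at fleet-data scale, provided the model tolerates the precision loss.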
Looking Ahead
While some industry experts consider Dojo to be an incremental rather than a revolutionary change, and others downplay everything Musk is associated with as bluster, it’s important to contextualize Dojo within Tesla’s broader FSD ambitions and the future of AI applications. Being developed entirely in-house, Dojo provides Tesla with the kind of control that can lead to accelerated development cycles, thereby fast-tracking innovations in autonomous vehicles and possibly other domains of AI including the computer vision-powered robots Tesla is working on.
This singular approach by Tesla may be a bellwether for the wider AI domain: the norm becomes customized, vertically integrated solutions built for special purposes rather than a more general-purpose approach in which pre-trained models serve as a foundation and organizations fine-tune them with private data. It is too early to say, but this approach will certainly be available only to a select few with the requisite deep pockets to support it.
In summary, Dojo’s advent signals a shift in the supercomputing paradigm, one that leans toward edge-driven vertical integration, specialization, and scalable architecture. Whether it transforms supercomputing, only time will tell, but what is unequivocal is its potential to give Tesla a technical moat that more traditional auto manufacturers may never be able to close.
By focusing on specific real-world applications and demonstrating an architecture fundamentally different from conventional designs, Tesla’s Dojo emerges not merely as a tool for the company’s FSD aspirations but as a landmark development in the ever-converging worlds of AI and supercomputing.