Groq’s newly announced language processor, the Groq LPU, has demonstrated that it can run 70-billion-parameter enterprise-scale language models at a record speed of more than 100 tokens per second.
In a YouTube video, Mark Heaps, VP of Brand and Communications for Groq, uses a cell phone to show what 100 tokens per second looks like with the Groq LPU running Meta’s 70-billion-parameter Llama 2 model. At 100 tokens per second, Groq estimates that it has a 10x to 100x speed advantage compared to other systems.
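To put that rate in perspective, here is a back-of-envelope comparison of response latency; the slower rates of 1 and 10 tokens per second are illustrative assumptions standing in for 'other systems,' not measured benchmarks:

```python
# Rough latency arithmetic behind the 10x-100x claim.
# The 1 and 10 tokens/s rates are assumptions, not measured figures.
response_tokens = 300  # roughly a few paragraphs of output

for rate in (1, 10, 100):  # tokens generated per second
    seconds = response_tokens / rate
    print(f"{rate:>3} tokens/s -> {seconds:6.0f} s for a {response_tokens}-token reply")
```

At 100 tokens per second, a multi-paragraph answer arrives in about three seconds rather than in minutes.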
Groq chips are purpose-built to function as dedicated language processors. Large language models such as Llama 2 work by analyzing a sequence of words and then predicting the next term in the sequence. How accurately a model predicts the next word is a critical factor in judging its quality.
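To make that prediction loop concrete, here is a minimal sketch of greedy next-token generation. It uses the Hugging Face transformers API rather than Groq's stack, and the model name and 20-token cutoff are illustrative choices:

```python
# Minimal greedy decoding loop: the model repeatedly predicts the next
# token from everything generated so far. Illustrative only; runs on a
# generic framework, not on Groq hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumes access to the gated weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ids = tokenizer("Large language models generate text", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate 20 tokens, one at a time
        logits = model(ids).logits        # scores over the whole vocabulary
        next_id = logits[0, -1].argmax()  # greedy: pick the most probable token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Because each step depends on the token produced by the previous one, generation is inherently sequential, which is exactly the workload the LPU targets.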
Groq chips are optimized for the sequential nature of natural language and of other sequential data such as DNA, music and code. That specificity of design yields much better performance on language tasks than, for example, GPUs, which are optimized for parallel graphics processing.
Groq has proven it is no stranger to large language models. It has experimented with running various LLMs on its chips, including Meta's LLaMA 1 and LMSYS's Vicuna. Its engineers are now running Llama 2 at model sizes from 7 billion to 70 billion parameters.
Groq’s compiler plays an important role
Because Jonathan Ross, Groq’s founder and CEO, planned on the compiler being a cornerstone of the company’s technical capabilities, the design team spent its first six months with a focus on designing and building the compiler. Only after the team was satisfied with the compiler did it begin working on chip architecture.
Unlike traditional compilers, Groq’s does not rely on kernels or manual intervention. Through a software-first co-design approach for the compiler and hardware, Groq built its compiler to map models directly to the underlying architecture automatically. The automated compilation process allows the compiler to optimize model execution on the hardware without requiring manual kernel development or tuning.
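As a sketch of what that kernel-free workflow looks like from a developer's point of view, Groq's open-source GroqFlow package wraps the compiler behind a single call. The usage below follows GroqFlow's public examples, but the exact signature is my assumption, not something this article confirms:

```python
# Hedged sketch of compiler-driven model mapping via GroqFlow.
# The groqit() entry point follows GroqFlow's published examples;
# treat the details as an assumption and consult Groq's docs.
import torch
from groqflow import groqit  # assumed import path

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(512, 512)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel()
inputs = {"x": torch.randn(1, 512)}

# One call: the compiler traces the model and maps it onto the hardware.
# No hand-written kernels, no manual parallelism or memory annotations.
gmodel = groqit(model, inputs)
print(gmodel(**inputs))
```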
The compiler also makes it easy to add resources and scale up. So far, Groq has used this automated process to compile more than 500 AI models for experimental purposes.
When Groq ports a customer’s workload from GPUs to the Groq LPU, its first step is to remove non-portable, vendor-specific kernels targeted at GPUs, followed by any manual parallelism or memory semantics. With those non-essentials stripped away, the code that remains is much simpler and more elegant.
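A hypothetical before-and-after shows the kind of change involved; fused_attention_cuda below is an invented stand-in for a vendor-specific GPU kernel, not a real API:

```python
# Porting sketch: replace a device-specific fused kernel with plain
# framework ops that a graph compiler can analyze on its own.
# 'fused_attention_cuda' is a made-up name for illustration.
import torch

def attention_portable(q, k, v):
    # Pure-PyTorch attention: no custom kernels, no device-specific code.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Before (GPU-specific, non-portable):
#   out = fused_attention_cuda(q, k, v, stream=..., block_size=...)
# After (portable, compiler-friendly):
q = k = v = torch.randn(1, 8, 128, 64)
out = attention_portable(q, k, v)
print(out.shape)
```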
Groq gives a good example of this efficiency on its website in the description of its first go-round with Llama 1. A job that would normally have required months of work from dozens of engineers took a small team of 10 people only a week: getting Llama up and running on a GroqNode server. Even though Llama was not explicitly built for Groq’s architecture, the compiler automatically uncovered parallelism and optimized data layouts for the model. This demonstrates how the compiler can map models to Groq’s hardware even without hardware-aware model development.
Groq also has an easy-to-use software suite and a low-latency purpose-built AI hardware architecture that synchronously scales to obtain more value from trained models. As the company continues to expand the scale of systems that the compiler can support, training the models will likely also become easier using the Groq approach.
Wrap up
In the future, Groq’s ultra-low-latency, ultra-fast language processor could have a major impact on how LLMs are run and used. Groq’s ability to map models to hardware automatically, without manual intervention, is not only a technical advantage but also a way to increase ROI by reducing the time needed to move models through development and into operation.
Beyond that, Groq’s focus on sequential language processing delivers better performance than general-purpose AI chips. The results speak for themselves: when dealing with massive LLMs, speed is a major factor in performance, and nothing yet compares to 100 tokens per second.
Moor Insights & Strategy provides or has provided paid services to technology companies, like all research and tech industry analyst firms. These services include research, analysis, advising, consulting, benchmarking, acquisition matchmaking, and video and speaking sponsorships. The company has had or currently has paid business relationships with 8×8, Accenture, A10 Networks, Advanced Micro Devices, Amazon, Amazon Web Services, Ambient Scientific, Ampere Computing, Anuta Networks, Applied Brain Research, Applied Micro, Apstra, Arm, Aruba Networks (now HPE), Atom Computing, AT&T, Aura, Automation Anywhere, AWS, A-10 Strategies, Bitfusion, Blaize, Box, Broadcom, C3.AI, Calix, Cadence Systems, Campfire, Cisco Systems, Clear Software, Cloudera, Clumio, Cohesity, Cognitive Systems, CompuCom, Cradlepoint, CyberArk, Dell, Dell EMC, Dell Technologies, Diablo Technologies, Dialogue Group, Digital Optics, Dreamium Labs, D-Wave, Echelon, Ericsson, Extreme Networks, Five9, Flex, Foundries.io, Foxconn, Frame (now VMware), Fujitsu, Gen Z Consortium, Glue Networks, GlobalFoundries, Revolve (now Google), Google Cloud, Graphcore, Groq, Hiregenics, Hotwire Global, HP Inc., Hewlett Packard Enterprise, Honeywell, Huawei Technologies, HYCU, IBM, Infinidat, Infoblox, Infosys, Inseego, IonQ, IonVR, Infiot, Intel, Interdigital, Jabil Circuit, Juniper Networks, Keysight, Konica Minolta, Lattice Semiconductor, Lenovo, Linux Foundation, Lightbits Labs, LogicMonitor, LoRa Alliance, Luminar, MapBox, Marvell Technology, Mavenir, Marseille Inc, Mayfair Equity, Meraki (Cisco), Merck KGaA, Mesosphere, Micron Technology, Microsoft, MiTEL, Mojo Networks, MongoDB, Multefire Alliance, National Instruments, Neat, NetApp, Nightwatch, NOKIA, Nortek, Novumind, NVIDIA, Nutanix, Nuvia (now Qualcomm), NXP, onsemi, ONUG, OpenStack Foundation, Oracle, Palo Alto Networks, Panasas, Peraso, Pexip, Pixelworks, Plume Design, PlusAI, Poly (formerly Plantronics), Portworx, Pure Storage, Qualcomm, Quantinuum, Rackspace, Rambus, Rayvolt E-Bikes, Red Hat, Renesas, Resideo, Samsung Electronics, Samsung Semi, SAP, SAS, Scale Computing, Schneider Electric, SiFive, Silver Peak (now Aruba-HPE), SkyWorks, SONY Optical Storage, Splunk, Springpath (now Cisco), Spirent, Sprint (now T-Mobile), Stratus Technologies, Symantec, Synaptics, Syniverse, Synopsys, Tanium, Telesign, TE Connectivity, TensTorrent, Tobii Technology, Teradata, T-Mobile, Treasure Data, Twitter, Unity Technologies, UiPath, Verizon Communications, VAST Data, Ventana Micro Systems, Vidyo, VMware, Wave Computing, Wellsmith, Xilinx, Zayo, Zebra, Zededa, Zendesk, Zoho, Zoom, and Zscaler. Moor Insights & Strategy founder, CEO, and Chief Analyst Patrick Moorhead is an investor in dMY Technology Group Inc. VI, Fivestone Partners, Frore Systems, Groq, MemryX, Movandi, and Ventana Micro.