New technique can accelerate language models by 300x

Researchers at ETH Zurich have developed a new technique that can significantly boost the speed of neural networks. They’ve demonstrated that altering the inference process can drastically cut down the computational requirements of these networks.

In experiments on BERT, a transformer model used in various language tasks, they cut computations by more than 99%. The technique can also be applied to the transformer models used in large language models (LLMs) like GPT-3, opening up new possibilities for faster, more efficient language processing.

Fast feedforward networks

Transformers, the neural networks underpinning LLMs, are composed of various layers, including attention layers and feedforward layers. The latter account for a substantial portion of the model’s parameters and are computationally demanding, because every neuron’s weights must be multiplied with every input dimension.

However, the researchers’ paper shows that not all neurons within the feedforward layers need to be active during the inference process for every input. They propose the introduction of “fast feedforward” layers (FFF) as a replacement for traditional feedforward layers.

FFF uses a mathematical operation known as conditional matrix multiplication (CMM), which replaces the dense matrix multiplications (DMM) used by conventional feedforward networks.

In DMM, every input is multiplied by the weights of every neuron in the layer, a process that is both computationally intensive and inefficient. CMM, on the other hand, handles inference in such a way that no input requires more than a handful of neurons for processing by the network.

By identifying the right neurons for each computation, FFF can significantly reduce the computational load, leading to faster and more efficient language models.

Fast feedforward networks in action

To validate their technique, the researchers developed FastBERT, a modification of Google’s BERT transformer model in which the intermediate feedforward layers are replaced with fast feedforward layers. FFFs arrange their neurons into a balanced binary tree and execute only one branch, chosen conditionally based on the input.
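The tree-structured evaluation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the researchers’ implementation: the function names (make_tree, fff_forward), the GELU activation, the random weights, and the rule of branching right when a neuron’s pre-activation is positive are all assumptions made for the example.

```python
import math
import random


def gelu(z):
    # GELU activation (an assumption; any nonlinearity works for the sketch).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def make_tree(depth, dim_in, dim_out, seed=0):
    # Hypothetical helper: random weights for a balanced binary tree.
    # A tree whose root is at depth 0 holds 2**(depth + 1) - 1 neurons.
    rng = random.Random(seed)
    n_neurons = 2 ** (depth + 1) - 1
    w_in = [[rng.gauss(0.0, 1.0) for _ in range(dim_in)] for _ in range(n_neurons)]
    w_out = [[rng.gauss(0.0, 1.0) for _ in range(dim_out)] for _ in range(n_neurons)]
    return w_in, w_out


def fff_forward(x, w_in, w_out, depth):
    # Evaluate a single root-to-leaf path: depth + 1 neurons in total.
    out = [0.0] * len(w_out[0])
    node = 0
    path = []
    for _ in range(depth + 1):
        path.append(node)
        # Only this one neuron's input weights are touched at this level.
        logit = sum(w * xi for w, xi in zip(w_in[node], x))
        act = gelu(logit)
        out = [o + act * w for o, w in zip(out, w_out[node])]
        # Assumed branching rule: go right if the pre-activation is positive.
        node = 2 * node + 1 + (1 if logit > 0 else 0)
    return out, path


w_in, w_out = make_tree(depth=3, dim_in=8, dim_out=4)
y, path = fff_forward([0.5] * 8, w_in, w_out, depth=3)
print(len(w_in), len(path))  # 15 neurons in the tree, 4 evaluated
```

A dense layer of the same size would compute all 15 dot products for every input; here the input itself selects which 4 are computed.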

To evaluate FastBERT’s performance, the researchers fine-tuned different variants on several tasks from the General Language Understanding Evaluation (GLUE) benchmark. GLUE is a comprehensive collection of datasets designed for training, evaluating and analyzing natural language understanding systems.

The results were impressive, with FastBERT performing comparably to base BERT models of similar size and training procedures. Variants of FastBERT, trained for just one day on a single A6000 GPU, retained at least 96.0% of the original BERT model’s performance. Remarkably, their best FastBERT model matched the original BERT model’s performance while using only 0.3% of its own feedforward neurons.

The researchers believe that incorporating fast feedforward networks into LLMs has immense potential for acceleration. For instance, in GPT-3, the feedforward networks in each transformer layer consist of 49,152 neurons.

The researchers note, “If trainable, this network could be replaced with a fast feedforward network of maximum depth 15, which would contain 65536 neurons but use only 16 for inference. This amounts to about 0.03% of GPT-3’s neurons.”
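The arithmetic behind that quote can be checked with a short calculation. The sketch below assumes the root of the tree sits at depth 0, so a tree of depth d holds 2^(d+1) - 1 neurons (65,535 for depth 15, which the quote rounds to 65536) and a single inference visits d + 1 of them. The function and constant names are hypothetical.

```python
def tree_neurons(depth):
    # Total neurons in a balanced binary tree with the root at depth 0.
    return 2 ** (depth + 1) - 1


def neurons_per_inference(depth):
    # One neuron is evaluated per level along a single root-to-leaf path.
    return depth + 1


GPT3_FF_NEURONS = 49_152  # feedforward neurons per layer, as cited in the article

depth = 15
total = tree_neurons(depth)
used = neurons_per_inference(depth)
fraction = used / GPT3_FF_NEURONS

print(total, used, f"{100 * fraction:.3f}%")  # 65535 16 0.033%
```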

Room for improvement

There has been significant hardware and software optimization for dense matrix multiplication, the mathematical operation used in traditional feedforward neural networks.

“Dense matrix multiplication is the most optimized mathematical operation in the history of computing,” the researchers write. “A tremendous effort has been put into designing memories, chips, instruction sets, and software routines that execute it as fast as possible. Many of these advancements have been – be it for their complexity or for competitive advantage – kept confidential and exposed to the end user only through powerful but restrictive programming interfaces.”

In contrast, there is currently no efficient, native implementation of conditional matrix multiplication, the operation used in fast feedforward networks. No popular deep learning framework offers an interface that could be used to implement CMM beyond a high-level simulation.

The researchers developed their own implementation of CMM operations based on CPU and GPU instructions. This led to a remarkable 78x speed improvement during inference.

However, the researchers believe that with better hardware and a low-level implementation of the algorithm, inference could be sped up by more than 300x. That would go a long way toward addressing one of the major challenges of language models: the number of tokens they generate per second.

“With a theoretical speedup promise of 341x at the scale of BERT-base models, we hope that our work will inspire an effort to implement primitives for conditional neural execution as a part of device programming interfaces,” the researchers write.
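One way to arrive at a figure like 341x is to compare the number of neurons a fast feedforward layer holds with the number it evaluates per inference. The sketch below assumes a tree of depth 11 with the root at depth 0 (4,095 neurons, 12 used per inference); that depth is not stated in this article, but it is consistent with the 0.3% figure reported above. The function name is hypothetical.

```python
def theoretical_speedup(depth):
    # Ratio of neurons stored in the tree to neurons evaluated per inference.
    total = 2 ** (depth + 1) - 1
    used = depth + 1
    return total / used


depth = 11  # assumed tree depth, not given in the article
speedup = theoretical_speedup(depth)
share = 1.0 / speedup

print(f"{speedup:.2f}x, {100 * share:.2f}% of neurons used")  # 341.25x, 0.29% of neurons used
```

This is an upper bound on paper only: it counts neurons, not wall-clock time, which is why the measured speedup with the researchers’ own CPU and GPU code is lower.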

This research is part of a broader effort to tackle the memory and compute bottlenecks of large language models, paving the way for more efficient and powerful AI systems.
