Google's 'Performer' Aims To Solve The Resource-Hungry 'Transformer' Architecture

The Transformer is a deep learning model introduced in 2017 and used primarily in the field of natural language processing (NLP).

Since their introduction, Transformers have become the model of choice for tackling many problems in NLP, replacing older recurrent neural network (RNN) models. And because the Transformer allows more parallelization during training, it has made training on larger data sets possible.

This allowed the development of pre-trained systems, like BERT (Bidirectional Encoder Representations from Transformers) from Google.

These systems have been trained with huge general-language data sets, and can be further fine-tuned by researchers and developers for more specific language tasks.

Because of this, the Transformer model has achieved state-of-the-art results across diverse domains, going beyond NLP to include conversation, images, and even music.

But there is one big issue: the Transformer model requires a huge amount of computing power.

In other words, the Transformer is hungry for resources.

At its core, the Transformer architecture has what is called the attention module.

The module computes similarity scores for all pairs of positions in an input sequence. Doing this requires computation time that grows quadratically with the length of the sequence, along with a huge amount of memory.
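To make the all-pairs computation concrete, here is a minimal NumPy sketch of standard softmax attention. It is an illustration only, not Google's code: the function name, the toy sizes, and the random inputs are assumptions chosen for the example.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: builds the full L x L matrix of similarity scores."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (L, L): one score per pair of positions
    scores -= scores.max(axis=-1, keepdims=True)     # stabilise the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row now sums to 1
    return weights @ V, weights

# Toy inputs (assumed sizes): 8 positions, 4 features per position.
rng = np.random.default_rng(0)
L, d = 8, 4
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

out, weights = softmax_attention(Q, K, V)
print(weights.shape)  # (8, 8)
```

The `weights` array has one entry for every pair of positions, which is where the quadratic cost comes from: doubling the sequence length quadruples its size.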

The Transformer needs these resources to store the matrix of similarity scores. And when a model has to process longer input sequences, as some tasks demand, it requires even more resources.

So the longer the sequences a Transformer model has to handle, the less efficient it becomes.
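A rough back-of-the-envelope calculation shows how quickly that storage grows. The sketch below assumes 32-bit floats and counts a single attention matrix for a single sequence; the helper name and the sequence lengths are illustrative assumptions, and real models multiply this figure across heads, layers, and batch size.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Bytes needed to store one seq_len x seq_len attention matrix."""
    return seq_len * seq_len * bytes_per_value

# Assumed example lengths; each 4x increase in length costs 16x the memory.
for seq_len in (1024, 4096, 16384):
    mib = attention_matrix_bytes(seq_len) / 2**20
    print(f"{seq_len:>6} positions -> {mib:,.0f} MiB")
```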

Researchers can address this with sparse attention methods. These reduce the amount of computation by making the model compute only a selected subset of the similarity scores from the sequence, chosen by various methods.

But sparse attention has limitations of its own, such as the unavailability of efficient sparse-matrix multiplication operations on all accelerators, a lack of theoretical guarantees, an inability to address the full range of problems, and so forth.

This is why Google, which wants to solve the issue, has introduced what it calls the 'Performer'.
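As one illustration of what "selective" can mean, the sketch below lets each position attend only to its nearby neighbours. A local window is just one of many possible sparsity patterns and is an assumption made for this example; the function name and the `window` parameter are hypothetical.

```python
import numpy as np

def local_window_attention(Q, K, V, window=2):
    """Each position attends only to neighbours within `window` steps."""
    L, d = Q.shape
    out = np.zeros_like(V)
    for i in range(L):
        lo, hi = max(0, i - window), min(L, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)   # at most 2 * window + 1 scores
        weights = np.exp(scores - scores.max())   # softmax over the window only
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]
    return out
```

With a fixed window, the number of scores grows linearly with the sequence length, but every position outside the window is ignored entirely. The Python loop is also a reminder that patterns like this do not always map onto the dense matrix operations that accelerators run fastest.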

Google Performer
The standard approach requires masking the attention matrix. With a prefix-sum mechanism, however, an unbiased approximation can be built instead. (Credit: Google)

In a blog post, Google wrote that:

"To resolve these issues, we introduce the Performer, a Transformer architecture with attention mechanisms that scale linearly, thus enabling faster training while allowing the model to process longer lengths, as required for certain image datasets such as ImageNet64 and text datasets such as PG-19."

"The Performer uses an efficient (linear) generalized attention framework, which allows a broad class of attention mechanisms based on different similarity measures (kernels)."

The framework is implemented by Google's novel Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, which provides scalable, low-variance and unbiased estimation of attention mechanisms that can be expressed by random feature map decompositions (in particular, regular softmax attention).

It is the use of this decomposition that lets the Performer preserve linear space and time complexity.

In other words, the decomposition allows Google to store the implicit attention matrix with linear, rather than quadratic, memory complexity.

Benchmarking the space- and time-complexity of the Performer, showing that the attention speedups and memory reductions are empirically close to simply not using an attention mechanism at all. (Credit: Google)

The original attention mechanism multiplies the stored attention matrix with the value input to obtain the final result. Once the query-key product has been passed through the nonlinear softmax operation, one cannot decompose it back into its original query and key components.

However, Google said, it is possible to decompose the attention matrix back into a product of random nonlinear functions of the original queries and keys.

This means that, once the attention matrix is decomposed, the matrix multiplications can be rearranged to approximate the result of the regular attention mechanism without explicitly constructing the quadratic-sized attention matrix.
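The rearrangement can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not Google's FAVOR+ implementation: it uses independent Gaussian random projections, whereas FAVOR+ additionally makes them orthogonal to reduce variance, and every function name, size, and input here is made up for the example.

```python
import numpy as np

def positive_random_features(X, W):
    """phi(x) = exp(W x - |x|^2 / 2) / sqrt(m), so E[phi(q) . phi(k)] = exp(q . k)."""
    m = W.shape[0]
    sq_norm = 0.5 * np.sum(X**2, axis=-1, keepdims=True)
    return np.exp(X @ W.T - sq_norm) / np.sqrt(m)

def linear_attention(Q, K, V, W):
    """Approximate softmax attention without ever forming an L x L matrix."""
    d = Q.shape[-1]
    Qp = positive_random_features(Q / d**0.25, W)   # (L, m)
    Kp = positive_random_features(K / d**0.25, W)   # (L, m)
    KV = Kp.T @ V                                   # (m, d_v): keys and values first
    normaliser = Qp @ Kp.sum(axis=0)                # (L,): approximate softmax row sums
    return (Qp @ KV) / normaliser[:, None]

def exact_attention(Q, K, V):
    """Reference: regular softmax attention with the explicit L x L matrix."""
    d = Q.shape[-1]
    A = np.exp(Q @ K.T / np.sqrt(d))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

# Assumed toy sizes: 64 positions, 8 features, 2048 random features.
rng = np.random.default_rng(0)
L, d, m = 64, 8, 2048
Q, K, V = (0.5 * rng.standard_normal((L, d)) for _ in range(3))
W = rng.standard_normal((m, d))   # i.i.d. Gaussian here; FAVOR+ orthogonalises these

approx = linear_attention(Q, K, V, W)
exact = exact_attention(Q, K, V)
print(np.max(np.abs(approx - exact)))   # small, and it tends to shrink as m grows
```

The key step is computing `Kp.T @ V` before multiplying by `Qp`. Every intermediate array then has a size proportional to the sequence length rather than its square, which is the source of the linear time and memory cost described above.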

This ultimately leads to FAVOR+.

By using the algorithm, the Performer can process data faster while using less energy.

"To the best of our knowledge, we are the first to show that any attention matrix can be effectively approximated in downstream Transformer-applications using random features," said the researchers at Google.