Self-Attention and the Key-Value Cache (In A Nutshell!)
The Transformer architecture underpins most modern large language models (LLMs). In the seminal paper "Attention Is All You Need", Vaswani et al. propose a Transformer architecture that relies solely on the multi-head self-attention mechanism to learn global dependencies (i.e. relationships) between the words in a sentence. In this post, I will first explain the multi-head self-attention mechanism used in LLMs such as the original ChatGPT model (which was derived from GPT-3.5), and then explain why a key-value cache is needed for efficient inference.

For illustrative purposes, I use words to represent tokens. ChatGPT uses a decoder-only Transformer architecture, and is trained to predict the next token (e.g. a sub-word or punctuation mark) given a context (i.e. a message).

I start by illustrating the multi-head self-attention mechanism using a single sentence as an example. Assume that we are training an LLM with a context window of 10 words; a 10-word sentence in the trai...
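To make the mechanism concrete before the worked example, here is a minimal sketch of single-head scaled dot-product self-attention with a causal mask, the building block that multi-head attention runs several copies of in parallel. This is my own illustrative NumPy code with made-up dimensions, random (untrained) weights, and placeholder names; it is not the paper's or ChatGPT's actual implementation.

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention with a causal mask.

    X: (seq_len, d_model) token embeddings for one sentence.
    W_q, W_k, W_v: (d_model, d_head) projection matrices (random here;
    learned during training in a real model).
    """
    Q = X @ W_q                      # queries, (seq_len, d_head)
    K = X @ W_k                      # keys,    (seq_len, d_head)
    V = X @ W_v                      # values,  (seq_len, d_head)

    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)           # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions <= i, because a
    # decoder-only model predicts the next token left-to-right.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)

    # Softmax over the key dimension gives the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V               # (seq_len, d_head) attended outputs

# Toy example: a 10-word "sentence" with made-up dimensions.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 10, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = causal_self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (10, 8)
```

Multi-head attention simply applies several such heads, each with its own projection matrices, and concatenates their outputs. The key-value cache discussed later in the post stores the K and V rows computed for earlier positions so they do not have to be recomputed every time a new token is generated.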