
Multi-head Latent Attention (In A Nutshell!)

In this post, I will dive into the Multi-head Latent Attention (MLA) mechanism, one of the innovations presented by the DeepSeek team! This post assumes prior knowledge of the attention mechanism and the key-value cache. For a quick refresher on these topics, refer to my previous post on self-attention!

One of the main problems with multi-head self-attention is the memory cost of the key-value cache. MLA shrinks the key-value cache and speeds up LLM inference. The core idea is to cache latent embeddings that are shared across all heads (and for both keys and values), instead of caching separate key and value embeddings for each head as in standard multi-head self-attention (Figure 1). At attention time, the latent embeddings are multiplied by a different key and value up-projection matrix for each head, producing key and value embeddings unique to that head. Having unique key and value embeddings per head maintains the expressivity of the attention mechanism.
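To make the idea concrete, here is a minimal sketch of the caching step. The shapes, names, and the einsum layout are my own illustrative assumptions, not DeepSeek's actual implementation; it only shows how a single small latent per token can be cached and later up-projected into per-head keys and values.

```python
# Minimal sketch of the MLA caching idea (illustrative, not DeepSeek's code).
import torch

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64
seq_len = 10

W_down = torch.randn(d_model, d_latent)        # shared down-projection to the latent
W_uk = torch.randn(n_heads, d_latent, d_head)  # per-head key up-projections
W_uv = torch.randn(n_heads, d_latent, d_head)  # per-head value up-projections

x = torch.randn(seq_len, d_model)              # token hidden states

# Only this small latent is cached: (seq_len, d_latent), instead of
# (seq_len, n_heads, d_head) for keys *and* values in standard MHA.
latent_cache = x @ W_down

# At attention time, recover per-head keys and values from the shared latent.
keys = torch.einsum('sl,hld->hsd', latent_cache, W_uk)    # (n_heads, seq_len, d_head)
values = torch.einsum('sl,hld->hsd', latent_cache, W_uv)  # (n_heads, seq_len, d_head)

print(latent_cache.shape, keys.shape, values.shape)
```

With the assumed sizes above, the cache holds 64 numbers per token rather than 8 heads x 64 dims x 2 (keys and values) = 1024, which is where the memory saving comes from.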
