
Multi-head Latent Attention (In A Nutshell!)

In this post, I will dive into the Multi-head Latent Attention (MLA) mechanism, one of the innovations presented by the DeepSeek team! This post assumes prior knowledge of the attention mechanism and the key-value cache. For a quick refresher on these topics, refer to my previous post on self-attention!

One of the main problems with multi-head self-attention is the memory cost of the key-value cache. MLA shrinks the key-value cache and speeds up LLM inference. The core idea is to cache latent embeddings that are shared across all heads (and for both keys and values), instead of caching separate key and value embeddings for each head as in standard multi-head self-attention (Figure 1). At attention time, the latent embeddings are multiplied by a different key and value up-projection matrix for each head, producing key and value embeddings unique to that head. Having unique key and value embeddings per head maintains the expressivity of the attention mechanism.
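To make the idea concrete, here is a minimal sketch of the caching step. The shapes, names, and the einsum layout are my own illustrative assumptions, not DeepSeek's actual implementation; it only shows how a single small latent per token can be cached and later up-projected into per-head keys and values.

```python
# Minimal sketch of the MLA caching idea (illustrative, not DeepSeek's code).
import torch

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64
seq_len = 10

W_down = torch.randn(d_model, d_latent)        # shared down-projection to the latent
W_uk = torch.randn(n_heads, d_latent, d_head)  # per-head key up-projections
W_uv = torch.randn(n_heads, d_latent, d_head)  # per-head value up-projections

x = torch.randn(seq_len, d_model)              # token hidden states

# Only this small latent is cached: (seq_len, d_latent), instead of
# (seq_len, n_heads, d_head) for keys *and* values in standard MHA.
latent_cache = x @ W_down

# At attention time, recover per-head keys and values from the shared latent.
keys = torch.einsum('sl,hld->hsd', latent_cache, W_uk)    # (n_heads, seq_len, d_head)
values = torch.einsum('sl,hld->hsd', latent_cache, W_uv)  # (n_heads, seq_len, d_head)

print(latent_cache.shape, keys.shape, values.shape)
```

With the assumed sizes above, the cache holds 64 numbers per token rather than 8 heads x 64 dims x 2 (keys and values) = 1024, which is where the memory saving comes from.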
