Transformer Architecture Internals and Variants

The transformer architecture replaced recurrence with direct attention, making every token reachable from every other token in a single layer. At its core sit three operations: linear projections that split input into queries, keys, and values; a softmax-weighted average that lets each position attend to all positions; and feed-forward networks that apply the same MLP independently to each token. Layer normalization and residual connections keep gradients stable through dozens of layers. Variants such as encoder-only BERT, decoder-only GPT, and encoder-decoder T5 are not different architectures but different masking rules and training objectives applied to the same skeleton.

Multi-head attention splits the model dimension into parallel subspaces rather than adding new parameters. Each head receives a slice of the projected query, key, and value vectors (dimension $d_k = d_{\text{model}} / h$ ), computes scaled dot-product attention independently, and the results concatenate back to the original width before a final linear projection. The Q, K, V matrices remain $d_{\text{model}} \times d_{\text{model}}$ in aggregate; they are simply reshaped into h parallel blocks. This is a computational rearrangement, not a parameter expansion: the same weight budget learns multiple attention patterns simultaneously.

Scaled dot-product attention computes a similarity matrix between every query and every key, normalizes those scores into a probability distribution via softmax, and uses that distribution to take a weighted average of the values. The scaling factor $1/\sqrt{d_k}$ matters because the dot product of two random $d_k$ -dimensional vectors has variance $d_k$ ; without division, the softmax inputs grow large with dimension, pushing the output toward a one-hot vector and erasing gradient flow. Division by $\sqrt{d_k}$ keeps the pre-softmax magnitudes stable regardless of head size, preserving both sharp and diffuse attention patterns.

Encoder-decoder attention differs from self-attention in where the keys and values originate. In self-attention, queries, keys, and values all come from the same sequence, so every token attends to every other token in that single sequence. In encoder-decoder attention, queries come from the decoder’s previous layer while keys and values come from the encoder’s final output. This creates a directed bridge: each decoder token attends to every encoder token, allowing the model to align source and target representations. Causal masking still prevents decoder tokens from attending to future decoder positions, but the encoder side remains fully visible.

Mixture-of-experts layers replace a single dense feed-forward network with a bank of smaller expert networks and a trainable router that assigns each token to a small subset of them. While the total parameter count grows linearly with the number of experts, only one or two experts are activated per token, so the floating-point operations per forward pass remain nearly constant. The router learns to specialize experts by token type or linguistic function, creating a sparse activation pattern that expands model capacity without expanding inference cost proportionally. Training stability and load balancing across experts remain active research areas.