How Overfitting Warps LLM Vectors
How overfitting reshapes high dimensional spaces, collapsing variance, creating hubs, and boosting brittle features that memorize rather than generalize.
You can learn a lot about a language model by peeking at its vectors. When training goes past the sweet spot, the geometry inside those vectors starts acting weird. Think of a dance floor where everyone slowly drifts into the same corner. The music did not change. The space did.
This post stays inside the representation space: token embeddings, the LM head, and the contextual hidden states that a transformer produces for each position.
A 60 second refresher
- Token embedding matrix $E \in \mathbb{R}^{V \times d}$ maps tokens to $d$ dimensional vectors.
- Contextual embeddings $h \in \mathbb{R}^{d}$ are the hidden states after the stack.
- LM head $W \in \mathbb{R}^{V \times d}$ turns hidden states into logits. Often the weights are tied so that $W = E$.
Logits come from $z_t = W_t \cdot h$, where $W_t$ is the row of $W$ for token $t$, and probabilities from $p(t \mid h) = \operatorname{softmax}(z)_t$. Overfitting bends the geometry of $E$, $W$, and the distribution of $h$.
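For concreteness, here is a tiny NumPy sketch of that readout, with made-up sizes standing in for a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1_000, 64                           # toy vocabulary size and hidden width

E = rng.normal(size=(V, d)) / np.sqrt(d)   # token embedding matrix
W = E                                      # tied LM head
h = rng.normal(size=d)                     # one contextual hidden state

z = W @ h                                  # logits: one dot product per token
z -= z.max()                               # numerical stability for the softmax
p = np.exp(z) / np.exp(z).sum()            # p(t | h) via softmax

print(p.shape, round(p.sum(), 6))          # (1000,) 1.0
```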
How overfitting warps high dimensional space
- Logit scale inflation that yields peaky posteriors. As the model memorizes training patterns, it learns to produce larger dot products for familiar contexts. The LM head and the hidden states align more strongly with the target token directions. On train-like inputs the softmax turns sharp and confident.
- Anisotropy and effective dimension collapse. Healthy representations spread power across many directions. With overfitting the covariance spectrum grows a few large eigenvalues while the tail thins. Effective rank falls. Intuition: the vector cloud funnels into narrow cones that fit the training domain but leave little capacity for novel inputs.
- Prototype locking. Each output token direction acts like a prototype. Overfitting pulls contextual vectors toward the prototypes favored by the training distribution. That looks great on seen text and can push novel inputs toward the wrong prototype at inference time.
- Hubness in neighbor graphs. High dimensions already produce hubs. Overfitting boosts this effect. A handful of directions become nearest neighbors for many points. Retrieval and clustering suffer because unrelated items start to look close.
- Outlier features and peaky attention. Some attention heads and MLP channels grow into high gain detectors for very specific signatures such as rare phrases or formats. These features fire brightly on memorized cases and barely help elsewhere.
- Frequency-to-norm coupling that gets exaggerated. Even in normal training, frequent tokens tend to have higher base influence. Overfitting can exaggerate that link or skew it toward idiosyncratic tokens. Local neighborhoods around those tokens distort, and the map of meaning gets bumpy.
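That last coupling is easy to probe directly. A minimal sketch, assuming the embedding matrix and a per-token training-corpus count are available as NumPy arrays (all names below are placeholders):

```python
import numpy as np

def rank(x: np.ndarray) -> np.ndarray:
    """Convert values to ranks (0..n-1) for a Spearman-style correlation."""
    r = np.empty(len(x), dtype=float)
    r[np.argsort(x)] = np.arange(len(x))
    return r

def freq_norm_coupling(embeddings: np.ndarray, token_counts: np.ndarray) -> float:
    """Rank correlation between log token frequency and embedding norm."""
    norms = np.linalg.norm(embeddings, axis=1)
    log_freq = np.log1p(token_counts.astype(float))
    return float(np.corrcoef(rank(log_freq), rank(norms))[0, 1])

# Hypothetical usage: compare two checkpoints of the same model.
# coupling_early = freq_norm_coupling(E_early, counts)
# coupling_late  = freq_norm_coupling(E_late, counts)
# A large jump between checkpoints, or a correlation dominated by a few
# idiosyncratic tokens, is the kind of distortion described above.
```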
How to see the geometry move
You can spot these shifts without a single leaderboard score. Build a small internal dashboard and compare a shard of training data with a fresh, time-boxed shard; a minimal sketch of the core monitors follows the list below.
- Covariance spectrum of hidden states: compute eigenvalues of $\operatorname{Cov}(h)$. Track effective rank or participation ratio. Overfitting shows up as swelling top eigenvalues and a shrinking effective rank.
- Isotropy checks: mean cosine of random hidden state pairs after unit normalization. Rising averages signal crowding into cones. Track the norm of the mean hidden state relative to per-token norms.
- Embedding and LM head norm histograms: watch for heavy tails and identify tokens whose columns change rapidly in scale or angle.
- Alignment margins: on held-out data measure $\max_t \cos(h, E_t)$ and the margin between the top two logits. Train-only growth is a red flag.
- Hubness metrics: build k-NN graphs. Count how often a point appears in someone else's neighbor list (a small sketch follows below). Long-tail growth means hubness is increasing.
- Attention and MLP outliers: track attention entropy per head and activation RMS per channel. Persistent outliers usually point to memorization hooks.
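Here is a minimal sketch of the first few monitors, assuming `hidden_states` is an (N, d) NumPy array of contextual vectors sampled from one shard and `E` is the output embedding (or tied LM head) matrix; the estimator choices here are one reasonable option, not the only one.

```python
import numpy as np

def effective_rank(hidden_states: np.ndarray) -> float:
    """Exponential of the spectral entropy of Cov(h): a soft count of active directions."""
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(hidden_states) - 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def mean_pair_cosine(hidden_states: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Average cosine of random hidden-state pairs; a rising value signals crowding into cones."""
    rng = np.random.default_rng(seed)
    unit = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
    i = rng.integers(0, len(unit), n_pairs)
    j = rng.integers(0, len(unit), n_pairs)
    keep = i != j                                  # drop accidental self-pairs
    return float((unit[i[keep]] * unit[j[keep]]).sum(axis=1).mean())

def top_margin(hidden_states: np.ndarray, E: np.ndarray) -> float:
    """Mean gap between the top two logits per position under the (tied) LM head."""
    logits = hidden_states @ E.T                   # (N, V); sample positions if V is large
    top2 = np.sort(logits, axis=1)[:, -2:]
    return float((top2[:, 1] - top2[:, 0]).mean())

# Hypothetical usage: run each metric on a train shard and a fresh shard at the
# same checkpoint and track both curves over training. Swelling margins and mean
# cosine on train only, with a falling effective rank, is the pattern described above.
```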
Tip: always compare train versus fresh at the same time. Geometry that seems fine on train can look obviously distorted on new text.
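To make the hubness check concrete, here is a minimal brute-force sketch that counts, for each point, how often it appears in other points' k-nearest-neighbor lists; it assumes a modest sample of a few thousand vectors, since it builds the full similarity matrix.

```python
import numpy as np

def k_occurrence(vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Count how many other points list each point among their k nearest neighbors (cosine)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)                # never count a point as its own neighbor
    neighbors = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar points
    counts = np.zeros(len(vectors), dtype=int)
    np.add.at(counts, neighbors.ravel(), 1)
    return counts

# Hypothetical usage: in a healthy shard the counts sit near k for most points;
# a heavy tail (a few points counted hundreds of times) means hubs are forming.
# counts_fresh = k_occurrence(fresh_hidden_sample)
```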
If you train embedding encoders for RAG or search
Contrastive learning tries to balance two forces:
- Alignment brings matched pairs together.
- Uniformity spreads points roughly uniformly on the unit sphere.
Overfitting tips the balance toward alignment without uniformity.
- Positives become extremely close while the rest of the space collapses onto a submanifold.
- Cross-domain recall drops and hubness rises.
- With dot product scoring, norm inflation can fake improvement. With cosine scoring, angle collapse is the main issue.
Simple monitors:
- Track average positive cosine versus an energy measure for uniformity on the sphere (see the sketch after this list).
- Measure recall@k in-domain and out-of-domain on rotating, unseen sets.
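A minimal sketch of the first monitor, assuming `anchors` and `positives` are row-aligned (N, d) NumPy arrays of unit-normalized encoder outputs; the uniformity term is the common log-mean-exp energy on the sphere with temperature t = 2, which is one conventional choice, not a requirement.

```python
import numpy as np

def alignment(anchors: np.ndarray, positives: np.ndarray) -> float:
    """Mean cosine between matched pairs; drifting toward 1.0 on its own is not healthy."""
    return float((anchors * positives).sum(axis=1).mean())

def uniformity(embeddings: np.ndarray, t: float = 2.0) -> float:
    """Log-mean-exp energy over pairwise squared distances; lower means better spread."""
    # For unit vectors, ||x - y||^2 = 2 - 2 * <x, y>.
    sq_dists = np.clip(2.0 - 2.0 * embeddings @ embeddings.T, 0.0, None)
    iu = np.triu_indices(len(embeddings), k=1)     # unique pairs only
    return float(np.log(np.exp(-t * sq_dists[iu]).mean()))

# Hypothetical usage: alignment rising while the uniformity energy also rises
# toward 0 is the overfitting signature described above — matched pairs collapse
# together while the rest of the sphere empties out.
```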
How to shape the space back into health
- Increase data diversity and deduplicate to reduce pressure to memorize narrow cones.
- Weight decay during SFT, and dropout on LoRA or other PEFT adapters, to temper outlier channels and logit scale inflation.
- Uniformity-friendly losses for encoders: temperature tuning, hyperspherical regularizers, broader negative sampling, or small feature noise.
- Early stopping on geometry signals such as effective rank stalling or hubness spikes.
- Entropy or KL guardrails in RLHF or DPO to prevent collapse relative to a reference policy.
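As one concrete shape for that last guardrail, here is a minimal PyTorch-style sketch; it assumes you already have per-position vocabulary logits from the policy and from a frozen reference model, and `beta` and `task_loss` are placeholders for your actual objective.

```python
import torch
import torch.nn.functional as F

def kl_to_reference(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Per-position KL(policy || reference) over the vocabulary; shape (batch, seq)."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    return (logp.exp() * (logp - logq)).sum(dim=-1)

# Hypothetical training step: penalize drift from the reference distribution so
# the policy cannot collapse onto a handful of peaky, memorized directions.
# kl = kl_to_reference(policy_logits, ref_logits.detach())
# loss = task_loss + beta * kl.mean()
```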
Closing thought
Overfitting is not just a training loss that dipped a bit too low. It is a quiet rearrangement of angles, spectra, and neighborhoods inside a very large space. Watch the geometry. When the cloud collapses, the outputs might still look confident, but the map they come from has lost its depth.