Every LLM Throws Away Token Identity After Layer One. TIDE Doesn't.

TIDE re-injects token identity at every transformer layer, directly fixing rare-token undertraining and contextual collapse in small models.

Standard transformer language models share one quiet design choice: look up the token embedding once, add it at the input layer, and never consult the token index again. The choice is rarely questioned, and scaling alone does not correct for it. Yet it silently degrades two classes of tokens that matter most when models are small or vocabularies are large.

The failure is structural, not incidental. Token frequency follows a roughly Zipfian distribution, so rare tokens receive only a small fraction of the cumulative gradient signal that common tokens accumulate over training, and their embeddings stay undertrained. Separately, in parameter-constrained models, distributionally similar tokens get mapped to nearly identical hidden states by the time they reach mid-network layers, a condition the TIDE authors call contextual collapse. Both problems share the same root: the model has no way to re-anchor a token's identity once the embedding layer is done.
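To get a feel for how lopsided the gradient budget is, here is a back-of-the-envelope sketch. It assumes an idealized Zipf law with exponent 1 and a hypothetical 50,000-token vocabulary; neither number comes from the TIDE paper. Because an input embedding row is updated mainly in steps where its token actually appears, a token's share of occurrences is a rough proxy for its share of gradient updates.

```python
def zipf_shares(vocab_size: int, exponent: float = 1.0) -> list[float]:
    """Expected fraction of all token occurrences per frequency rank (rank 1 = most common)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical vocabulary size, chosen only for illustration.
shares = zipf_shares(vocab_size=50_000)

head_share = sum(shares[:1_000])    # the 1,000 most frequent tokens
tail_share = sum(shares[25_000:])   # the rarer half of the vocabulary

print(f"top 1,000 tokens:     {head_share:.1%} of occurrences")
print(f"bottom 25,000 tokens: {tail_share:.1%} of occurrences")
```

Under these assumptions, the most frequent 2% of the vocabulary accounts for well over half of all occurrences, while the rarer half of the vocabulary shares less than a tenth. Real corpora deviate from the ideal curve, but the asymmetry is the point.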

TIDE fixes this with a component called EmbeddingMemory: an ensemble of K independent MemoryBlocks, each mapping token indices to context-free semantic vectors. These vectors are computed once per forward pass, then injected into every transformer layer through a depth-conditioned softmax router with a learnable null bank. Think of it as a persistent identity signal that rides alongside the residual stream rather than dissolving into it. The null bank matters here: it lets the router suppress the identity injection at layers where context should dominate, so the mechanism does not fight the model's own contextual representations. Unlike standard positional encodings, which encode sequence position rather than token identity, EmbeddingMemory encodes what the token is, at every depth.
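The mechanism described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: each MemoryBlock is modelled as a simple lookup table, the null bank as a single vector initialized to zeros, the router as one set of K+1 logits per layer, and the injection as an addition to the hidden state. The class layout, method names, and sizes are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class EmbeddingMemory:
    """Toy sketch: K lookup tables plus a per-layer router with a null slot."""

    def __init__(self, vocab_size, dim, num_blocks, num_layers, seed=0):
        rng = random.Random(seed)
        # K independent MemoryBlocks, each assumed to be a vocab_size x dim table.
        self.blocks = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
            for _ in range(num_blocks)
        ]
        # Null bank, assumed to be one vector (zeros here; learnable in a real model).
        self.null = [0.0] * dim
        # Depth conditioning, assumed to be one row of K+1 logits per layer.
        self.router_logits = [[0.0] * (num_blocks + 1) for _ in range(num_layers)]

    def lookup(self, token_ids):
        """Run once per forward pass: memory[k][pos] is block k's vector for that token."""
        return [[block[t] for t in token_ids] for block in self.blocks]

    def inject(self, hidden, memory, layer):
        """Add the routed mixture of block vectors and null bank to each position."""
        weights = softmax(self.router_logits[layer])
        out = []
        for pos, h in enumerate(hidden):
            mixed = [weights[-1] * n for n in self.null]
            for k, block_vecs in enumerate(memory):
                for d, v in enumerate(block_vecs[pos]):
                    mixed[d] += weights[k] * v
            out.append([hi + mi for hi, mi in zip(h, mixed)])
        return out


mem = EmbeddingMemory(vocab_size=10, dim=4, num_blocks=3, num_layers=2)
token_ids = [1, 7, 7]
memory = mem.lookup(token_ids)             # context-free vectors, computed once
hidden = [[0.0] * 4 for _ in token_ids]
for layer in range(2):
    # A real model would run attention and the MLP here before or after injection.
    hidden = mem.inject(hidden, memory, layer)
```

Because the looked-up vectors depend only on the token index, the two positions holding token 7 receive the same injection at every layer, whatever their context. Pushing a layer's null logit far above the others drives that layer's injection toward the null vector, which is the suppression behavior the null bank is meant to provide.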

The depth-conditioned router is the key design decision. Earlier layers tend to weight the identity injection more heavily; later layers, where contextual representations are richest, can route toward the null bank. This means rare tokens get repeated gradient signal through the EmbeddingMemory parameters across every layer, not just the input embedding, directly addressing the undertraining asymmetry. The ensemble of K MemoryBlocks adds capacity without coupling: each block learns an independent semantic projection, and the router selects among them based on layer depth.
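A small numeric illustration of that depth pattern follows. The logit schedule is entirely made up for the example: block logits are held at zero and the null logit rises linearly with depth. In a trained model these logits would be learned, and nothing here reflects values reported by the authors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_weights(layer, num_layers, num_blocks, null_slope=4.0):
    """Hypothetical depth schedule: K block logits fixed at 0, null logit grows with depth."""
    depth = layer / (num_layers - 1)          # 0.0 at the first layer, 1.0 at the last
    null_logit = null_slope * (depth - 0.5)   # negative early, positive late
    return softmax([0.0] * num_blocks + [null_logit])


num_layers, num_blocks = 6, 3
identity_mass = []
for layer in range(num_layers):
    weights = router_weights(layer, num_layers, num_blocks)
    identity_mass.append(1.0 - weights[-1])   # total weight on the K MemoryBlocks
    print(f"layer {layer}: identity weight {identity_mass[-1]:.2f}, null weight {weights[-1]:.2f}")
```

With this schedule the weight on the MemoryBlocks falls steadily from the first layer to the last. If the injection is a simple weighted addition, the direct gradient reaching the memory vectors at each layer scales with that layer's identity weight, so even a rare token's memory entry is updated through several layers per occurrence rather than through the input embedding alone.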

The authors report that TIDE improves language modeling perplexity and performance on multiple downstream tasks, with gains most pronounced on rare-token prediction and in smaller-parameter regimes where contextual collapse is most severe. For teams training or fine-tuning models on domain-specific vocabularies with heavy long-tail distributions, the takeaway is direct: re-injecting token identity at depth is a low-overhead architectural change that restores gradient coverage a single input-layer embedding does not provide.

We're thinking: We find the rare-token undertraining angle more consequential than the contextual collapse result. Every team fine-tuning a general-purpose LLM on a specialized corpus (medical, legal, code in a niche language) is working with a vocabulary where the long tail is exactly the domain-relevant part. Those tokens are chronically underrepresented in pretraining gradient signal, and no amount of fine-tuning data fixes a structural undertraining problem at the embedding level. TIDE's EmbeddingMemory is effectively a targeted remedy: it gives rare tokens more gradient surface area without requiring vocabulary expansion or a larger model. The open question is whether the depth-conditioned router learns stable injection patterns across very different domain shifts, or whether it needs domain-specific tuning to avoid suppressing identity signals at the wrong layers.

Key takeaways:

EmbeddingMemory injects context-free token identity vectors at every transformer layer through a depth-conditioned softmax router with a learnable null bank, replacing the single-injection assumption baked into every standard transformer.
TIDE improves language modeling perplexity and downstream task performance, with the largest gains on rare-token prediction and small-parameter models; the primary caveat is that benefits scale with vocabulary long-tail severity, so gains on balanced corpora may be smaller.
Teams training or fine-tuning LLMs on domain-specific corpora with heavy rare-token distributions should treat TIDE's EmbeddingMemory as a drop-in architectural addition before reaching for vocabulary expansion or larger model sizes.

Source: TIDE: Every Layer Knows the Token Beneath the Context