NTP's One-Hot Supervision Leaves Representation Space Broken by Design

NITP adds a dense, continuous supervision signal in latent space during pre-training, lifting MMLU-Pro by 5.7% absolute on a 9B MoE model for roughly 2% extra training FLOPs.

Standard next-token prediction has one job: predict the next discrete token. It does that job well enough to produce capable models at scale. But the supervision signal it provides, a one-hot label at the output logit layer, says nothing about what the hidden states should look like. That silence is not neutral. It leaves the representation space under-constrained, allowing internal geometry to drift into degenerate, anisotropic configurations that quietly limit what the model can generalize to.
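
To make "anisotropic" concrete, here is a minimal sketch of one common diagnostic: the mean pairwise cosine similarity of a set of hidden states. This is an illustration, not the measurement used in the NITP paper; the function name `mean_pairwise_cosine` and the toy vectors are assumptions made for the example. Values near 1 mean the vectors crowd into a narrow cone; values near or below 0 mean they spread out across directions.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors.

    Hypothetical helper for illustration: a high value suggests the
    representations occupy a narrow cone (anisotropy).
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy "hidden states": one set spread evenly, one collapsed into a cone.
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
narrow_cone = [[1.0, 0.1], [1.0, 0.2], [1.0, -0.1], [1.0, 0.0]]

print("spread out :", mean_pairwise_cosine(spread_out))
print("narrow cone:", mean_pairwise_cosine(narrow_cone))
```

In a real model the vectors would be hidden states sampled from a given layer over many tokens, and the statistic is typically tracked per layer.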

The fix is not a new architecture. NITP, Next Implicit Token Prediction, adds a second supervision target that operates directly in the representation space. Instead of only predicting which token comes next, the model also predicts the implicit semantic content of that token: a continuous, dense vector drawn from shallow-layer representations of the same model. Those shallow representations are stable enough to serve as self-supervised targets without requiring an external teacher or a separate training stage. The result is a dual-signal pre-training objective: discrete prediction at the output, continuous prediction in the latent space.
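
The dual-signal objective can be sketched for a single token position as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the cosine distance, the `latent_weight` value, and all function names are hypothetical choices, and in a real training loop the shallow-layer target would presumably be treated as a fixed target rather than something gradients flow into.

```python
import math


def cross_entropy(logits, target_id):
    """Standard NTP loss: negative log-probability of the true next token."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_z - logits[target_id]


def cosine_distance(pred, target, eps=1e-12):
    """1 - cosine similarity between two dense vectors (assumed distance)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm + eps)


def dual_signal_loss(logits, target_id, predicted_latent, shallow_target,
                     latent_weight=0.1):
    """Discrete loss at the output plus a continuous loss in latent space.

    predicted_latent: output of the auxiliary head (dropped at inference).
    shallow_target:   shallow-layer representation of the next token.
    latent_weight:    hypothetical mixing coefficient, not from the paper.
    """
    ntp = cross_entropy(logits, target_id)
    latent = cosine_distance(predicted_latent, shallow_target)
    return ntp + latent_weight * latent, ntp, latent


# Toy example: 3-token vocabulary, 2-dimensional latent space.
total, ntp, latent = dual_signal_loss(
    logits=[2.0, 0.5, -1.0],
    target_id=0,
    predicted_latent=[1.0, 0.0],
    shallow_target=[0.0, 1.0],  # orthogonal to the prediction
)
print(total, ntp, latent)
```

Because the second term only adds to the training loss, removing the auxiliary head afterward leaves the standard forward pass untouched, which matches the zero-inference-cost property described in this article.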

An analogy helps. Standard NTP is like grading a student only on the final answer. NITP also grades the quality of the working notes. The answer can be correct while the notes are incoherent; over thousands of examples, incoherent notes compound into brittle internal structure. Adding a constraint on the notes does not change what the student is asked to produce. It changes how tightly organized the reasoning has to become to produce it consistently. In theoretical terms, NITP regularizes the optimization landscape by reducing under-constrained degrees of freedom and encouraging a compact, structured representation geometry.

Across dense and MoE models from 0.5B to 9B parameters, NITP consistently lifts downstream performance. On a 9B MoE model, MMLU-Pro improves by 5.7% absolute, C3 by 6.4%, and CommonsenseQA by 4.3%. The cost is approximately 2% additional training FLOPs and zero additional inference cost, since the auxiliary prediction head is discarded after training. For teams running pre-training or continued pre-training runs, the takeaway is direct: the loss function is a tunable variable with measurable downstream impact, and NITP is a drop-in change to the training objective that does not touch the model architecture or inference stack.

We're thinking: The framing here is more consequential than the numbers alone suggest. The scaling literature has treated architecture, data, and compute as the three primary levers. Loss function design has largely been fixed since GPT-2: predict the next token, cross-entropy, done. NITP is evidence that this fixity was a choice, not a necessity, and that the one-hot supervision signal actively permits a class of representational failure that scale does not automatically correct. The specific implication for practitioners is uncomfortable: if your pre-training runs have been sweeping learning rate, batch size, and data mixture while holding the loss function constant, you may have been optimizing inside a constrained space without knowing it. The 2% FLOP overhead makes this easy to test.

Key takeaways:

NITP augments standard next-token prediction with a continuous self-supervised target in the representation space, drawn from shallow layers of the same model, constraining hidden state geometry without changing architecture or inference cost.
On a 9B MoE model, NITP delivers absolute improvements of 5.7% on MMLU-Pro, 6.4% on C3, and 4.3% on CommonsenseQA, at approximately 2% additional training FLOPs; gains are consistent across dense and MoE models from 0.5B to 9B parameters. The results are from pre-training, so gains on instruction-tuned or RLHF-trained variants remain to be confirmed.
Teams running pre-training or continued pre-training should treat NITP as a low-cost ablation to add to their next run: the auxiliary head is dropped at inference, the overhead is minimal, and the gains on reasoning and language benchmarks are consistent across model scales tested.

Source: NITP: Next Implicit Token Prediction for LLM Pre-training