Cross-Architecture dLLM Distillation: 0.6B Student, 48.78 HumanEval

TIDE is the first framework to distill diffusion LLMs across incompatible architectures, lifting a 0.6B student to 48.78 on HumanEval against 32.3 for an autoregressive (AR) baseline.

Every existing distillation method for diffusion large language models (dLLMs) assumes that teacher and student share the same architecture. That assumption quietly restricts the efficiency gains of parallel decoding to teams that can afford to run billion-parameter models at inference time. TIDE drops that assumption.

The core problem is structural incompatibility. A diffusion LLM teacher may use bidirectional attention, a masked-token training objective, and a proprietary tokenizer. A smaller student built for deployment often uses none of those. Standard distillation objectives transfer knowledge through logit matching or feature alignment, both of which assume the teacher and student produce comparable probability distributions over the same token vocabulary. When the tokenizer differs, the vocabulary differs. When the attention mechanism differs, the internal representations are not comparable. Prior work simply did not address this.

TIDE decomposes the cross-architecture transfer problem into three modular components, each targeting one failure mode. The first, TIDAL, treats the teacher's reliability as a variable rather than a constant. At high noise levels during the diffusion process, the teacher's predictions are less trustworthy. TIDAL modulates distillation strength jointly across training progress and diffusion timestep, reducing the student's dependence on teacher outputs precisely when those outputs are most likely to mislead. Think of it as a confidence-weighted curriculum: the student leans on the teacher when the teacher is coherent, and learns more independently when the teacher is guessing through noise.
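To make the idea of a jointly modulated distillation weight concrete, here is a minimal sketch. The summary above does not give TIDAL's actual schedule, so everything in it is an assumption made for illustration: the function names tidal_weight and combined_loss, the parameters w_max and w_floor, the linear form of both factors, and the choice to lower the weight as training progresses.

```python
def tidal_weight(progress, t, w_max=1.0, w_floor=0.1):
    """Hypothetical distillation weight in [0, w_max].

    progress: fraction of training completed, in [0, 1].
    t: diffusion timestep as a noise level, in [0, 1] (1 = fully noised).
    The product form below is an illustrative assumption, not TIDAL's formula.
    """
    if not (0.0 <= progress <= 1.0 and 0.0 <= t <= 1.0):
        raise ValueError("progress and t must lie in [0, 1]")
    # Trust the teacher less at high noise, where its predictions are shakier.
    noise_factor = 1.0 - t
    # Assumed direction: lean on the teacher less as training progresses.
    progress_factor = 1.0 - (1.0 - w_floor) * progress
    return w_max * noise_factor * progress_factor


def combined_loss(task_loss, distill_loss, progress, t):
    """Student objective: its own task loss plus a weighted distillation term."""
    return task_loss + tidal_weight(progress, t) * distill_loss


if __name__ == "__main__":
    for t in (0.1, 0.5, 0.9):
        print(f"t={t:.1f}  weight={tidal_weight(progress=0.2, t=t):.3f}")
```

The only property the sketch is meant to show is the one described above: at a fixed point in training, the weight on the teacher shrinks as the noise level rises, and it reaches zero when the input is fully noised.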

The second component, CompDemo, addresses a different failure mode. Under heavy masking, the teacher lacks sufficient context to generate reliable predictions. CompDemo enriches that context through complementary mask splitting: the masked sequence is split into two complementary views, each providing the other with the context it was missing. The teacher now sees more signal before generating the predictions the student will learn from.
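Complementary mask splitting can be sketched in a few lines. This is one reading of the description above, not the authors' implementation: the function name complementary_views, the even random split, and the choice to reveal ground-truth tokens from the training sequence in each view are all assumptions.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, chosen for this sketch


def complementary_views(tokens, masked_positions, seed=0):
    """Split the masked positions into two disjoint halves.

    Returns two (view, still_masked) pairs. In each view only one half stays
    masked; the other half is revealed, so the teacher sees extra context.
    """
    rng = random.Random(seed)
    positions = sorted(masked_positions)
    rng.shuffle(positions)
    half = len(positions) // 2
    keep_a, keep_b = set(positions[:half]), set(positions[half:])

    def build(still_masked):
        return [MASK if i in still_masked else tok for i, tok in enumerate(tokens)]

    return (build(keep_a), keep_a), (build(keep_b), keep_b)


if __name__ == "__main__":
    tokens = ["def", "add", "(", "a", ",", "b", ")"]
    for view, still_masked in complementary_views(tokens, {1, 3, 4, 5}):
        print(sorted(still_masked), view)
```

Each originally masked position stays masked in exactly one view, so the teacher's predictions from the two views together cover the full set of positions the student is trained on.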

The third component, Reverse CALM, handles the tokenizer mismatch directly. Standard chunk-level likelihood matching aligns probabilities across token boundaries, but when teacher and student tokenize differently, those boundaries do not align. Reverse CALM inverts the matching direction, yielding bounded gradients and filtering noise from both ends of the chunk boundary. The gradient stability matters: without it, misaligned token distributions produce training instability that compounds across the diffusion timestep schedule.
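The boundary problem is easier to see with a toy example. The sketch below covers only the alignment step that chunk-level matching depends on: finding the spans where two tokenizations of the same text end at the same character offset, then summing token log-probabilities inside each span. It does not implement Reverse CALM's loss or its matching direction, which the summary above does not specify. The function names and the example tokenizations are hypothetical, and tokens are assumed to be non-empty strings.

```python
def aligned_chunks(teacher_tokens, student_tokens):
    """Return ((t_start, t_end), (s_start, s_end)) index ranges for each chunk.

    A chunk ends wherever both tokenizations reach the same character offset.
    """
    if "".join(teacher_tokens) != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")
    chunks = []
    i = j = 0              # next unconsumed teacher / student token
    t_start = s_start = 0  # first token of the current chunk
    t_off = s_off = 0      # characters consumed so far
    while i < len(teacher_tokens) or j < len(student_tokens):
        # Advance whichever side is behind (teacher first on ties).
        if t_off <= s_off and i < len(teacher_tokens):
            t_off += len(teacher_tokens[i])
            i += 1
        else:
            s_off += len(student_tokens[j])
            j += 1
        if t_off == s_off:  # shared boundary: close the chunk
            chunks.append(((t_start, i), (s_start, j)))
            t_start, s_start = i, j
    return chunks


def chunk_logprobs(token_logprobs, ranges):
    """Chunk log-likelihood = sum of the token log-probs inside each range."""
    return [sum(token_logprobs[a:b]) for a, b in ranges]


if __name__ == "__main__":
    teacher = ["def", " add", "(", "a", ",", " b", ")"]
    student = ["de", "f", " ad", "d(", "a,", " b)"]
    for t_rng, s_rng in aligned_chunks(teacher, student):
        print(teacher[slice(*t_rng)], "<->", student[slice(*s_rng)])
```

Once both models' likelihoods are expressed over the same chunks, they become comparable even though the underlying vocabularies differ; how those chunk likelihoods are then matched is where Reverse CALM's inverted direction comes in.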

Distilling an 8B dense teacher and a 16B MoE teacher into a 0.6B student across two heterogeneous pipelines, TIDE lifts average performance by 1.53 points across eight benchmarks. The code generation result is the headline: HumanEval reaches 48.78, compared to 32.3 for the autoregressive baseline at comparable scale. That is a 16.48-point gap, not a marginal improvement. For teams evaluating whether diffusion LLMs are worth the architectural investment, the takeaway is direct: the efficiency argument for dLLMs no longer requires you to run the full teacher at inference time.

We're thinking: The most underappreciated implication here is what TIDE does to the deployment calculus for diffusion LLMs. Until now, the practical argument against dLLMs was circular: yes, parallel decoding is faster, but you need a massive model to get competitive quality, which erases the latency advantage. TIDE breaks that loop. We think the more consequential result is not the 48.78 HumanEval number itself, but the proof that architectural heterogeneity is a solvable engineering problem, not a fundamental barrier. Teams who dismissed dLLMs because they could not afford to run 8B-plus models at inference should revisit that decision. The 0.6B student here is not a toy.

Key takeaways:

TIDE introduces three modular components (TIDAL for noise-dependent distillation weighting, CompDemo for context enrichment under heavy masking, and Reverse CALM for cross-tokenizer gradient stability) that together enable knowledge transfer across incompatible architectures for the first time.
Distilling 8B and 16B teachers into a 0.6B student yields a 1.53-point average gain across eight benchmarks and a HumanEval score of 48.78 versus 32.3 for the AR baseline. The caveat is that results are reported on a single student architecture, so generalization to other small-model targets remains to be confirmed.
Teams building inference-cost-sensitive applications on top of diffusion LLMs should treat TIDE as the baseline framework for compression, particularly for code generation tasks where the quality gap versus autoregressive models has historically been the strongest objection.

Source: Turning the TIDE: Cross-Architecture Distillation for Diffusion LLMs