MoE-to-Dense Conversion Beats Dense Pruning by 6.3 Points
A new framework converts trained Mixture-of-Experts models into standard dense networks, outperforming dense-to-dense pruning by 6.3 pp at matched parameter count.
The dominant assumption about Mixture-of-Experts models is that their efficiency is a deployment win. Fewer parameters activate per token, so inference is cheaper. That assumption breaks the moment memory enters the picture: every expert weight still has to sit in memory, whether activated or not, and compression methods that keep the MoE structure do not remove that constraint.
Compressing a MoE by reducing expert count still leaves you with a MoE: every remaining expert must stay resident in memory, and the routing machinery stays in the serving path. This framework does something structurally different. It converts the MoE into a standard dense feed-forward network, eliminating the expert routing machinery entirely and producing a model that fits the deployment stack practitioners already have.
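To make the gap between compute and memory concrete, here is a back-of-the-envelope sketch. The figures are assumptions for illustration: round parameter counts of 30 billion total and 3 billion active (read off the "30B-A3B" model name), 16-bit weights, and no accounting for KV cache or activations.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumed round figures, not measurements from the paper.
total_params = 30e9   # every expert must be resident
active_params = 3e9   # parameters actually used for one token

resident_gb = weight_memory_gb(total_params)
active_gb = weight_memory_gb(active_params)

print(f"resident weights: {resident_gb:.0f} GB")
print(f"weights touched per token: {active_gb:.0f} GB")
```

Under these assumptions the model computes like a 6 GB network but occupies 60 GB, which is the mismatch that matters for single-GPU deployment.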
The conversion pipeline works in three stages. First, experts are scored for importance using one of seven candidate scoring methods; the best of them is a new diversity-aware scorer that penalizes redundant expert selection. Second, the selected experts are grouped and concatenated into a single dense FFN block, preserving the parameter budget without the routing overhead. Third, knowledge distillation from the original MoE teacher refines the dense student over roughly 4 billion tokens. Think of it as collapsing a committee of specialists into one generalist who has studied every specialist's notes: the generalist is slower per decision than any single specialist, but you only need to keep one person in the room.
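The three stages above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: `select_diverse` is a hypothetical greedy scorer (importance minus a cosine-similarity penalty), the experts are assumed to be SwiGLU-style FFNs combined with uniform weights (the magnitude scaling the paper studies is omitted), and `distill_loss` assumes plain logit-level KL distillation.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def select_diverse(importance, signatures, k, penalty=0.5):
    """Hypothetical scorer: greedily pick k experts, discounting each
    candidate by its max cosine similarity to experts already chosen."""
    unit = signatures / np.linalg.norm(signatures, axis=1, keepdims=True)
    chosen = []
    for _ in range(k):
        best, best_score = -1, -np.inf
        for e in range(len(importance)):
            if e in chosen:
                continue
            redundancy = max((float(unit[e] @ unit[c]) for c in chosen), default=0.0)
            score = float(importance[e]) - penalty * redundancy
            if score > best_score:
                best, best_score = e, score
        chosen.append(best)
    return chosen

def expert_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: (silu(x W_gate) * (x W_up)) W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def concat_experts(experts):
    """Stack experts along the hidden dimension into one wider dense FFN."""
    w_gate = np.concatenate([e[0] for e in experts], axis=1)
    w_up = np.concatenate([e[1] for e in experts], axis=1)
    w_down = np.concatenate([e[2] for e in experts], axis=0)
    return w_gate, w_up, w_down

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over rows of logits (assumed KD objective)."""
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    return float((np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1).mean())

# Toy setup: 6 random experts, keep 3.
rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 16, 6
experts = [(rng.standard_normal((d_model, d_hidden)),
            rng.standard_normal((d_model, d_hidden)),
            rng.standard_normal((d_hidden, d_model))) for _ in range(n_experts)]
importance = rng.random(n_experts)  # stand-in for a real importance score
signatures = np.stack([w_gate.ravel() for w_gate, _, _ in experts])

chosen = select_diverse(importance, signatures, k=3)
dense = concat_experts([experts[i] for i in chosen])

# The widened dense FFN reproduces the unweighted sum of the chosen experts.
x = rng.standard_normal((4, d_model))
dense_out = expert_ffn(x, *dense)
summed_out = sum(expert_ffn(x, *experts[i]) for i in chosen)
assert np.allclose(dense_out, summed_out)
```

The concatenation step works because the gate and up projections act elementwise on the hidden dimension, so stacking experts side by side and stacking their down projections yields exactly the sum of their outputs, with no router involved. The student starts from that sum, and distillation then has to close the gap to the routed teacher.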
The framework was evaluated across 350 configurations on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B, testing seven scoring methods, five grouping strategies, and two magnitude scaling approaches. Diversity-aware scoring consistently led across all three model families. Under a controlled comparison at matched parameter count, MoE-to-dense conversion beats dense-to-dense pruning by 6.3 percentage points in average downstream accuracy, and training runs 1.6 times faster in wall-clock time. For ML infrastructure teams targeting single-GPU or memory-constrained deployment, the takeaway is direct: if your target model is a MoE, pruning it as a MoE is the wrong starting point.
We're thinking: The 6.3-point gap over dense-to-dense pruning is the number worth sitting with, because it inverts a reasonable prior. We would expect a dense model distilled from a MoE to pay a quality penalty for losing the routing structure. Instead, the MoE teacher provides richer training signal than a dense teacher does, and the diversity-aware scoring ensures the selected experts span the function space rather than clustering around the most activated ones. The practical implication is sharper than it first appears: teams that have been treating large MoE weights as inaccessible for edge or single-GPU deployment now have a concrete compression path that does not require them to accept a dense-to-dense quality floor. The caveat is distillation cost: 4 billion tokens is not trivial, and the framework has not yet been validated on the very largest MoE scales.
Key takeaways:
- Experts are scored with a diversity-aware importance metric, grouped, and concatenated into a dense FFN, then refined via knowledge distillation from the MoE teacher, which removes expert routing and the need to keep unselected experts resident in memory.
- MoE-to-dense conversion beats dense-to-dense pruning by 6.3 pp at matched parameter count across 350 tested configurations on three model families, though distillation requires approximately 4 billion tokens and largest-scale validation remains open.
- Teams deploying MoE models on memory-constrained hardware, at the scales tested so far, should treat this conversion pipeline as the default compression strategy rather than expert-count reduction, which leaves the routing machinery in place and every remaining expert resident in memory.
Source: Pruning and Distilling Mixture-of-Experts into Dense Language Models