Post-Trained MoE Models Can Skip Half Their Experts Without Retraining
ZEDA converts static MoE models into dynamic ones via self-distillation, cutting over 50% of expert FLOPs for an approximately 1.20x end-to-end speedup with minimal accuracy loss.
The standard assumption in MoE optimization is that expert-skipping must be baked in at pretraining time. Dynamic MoE methods, the ones that route easy tokens past unnecessary experts, have required either training from scratch or task-specific fine-tuning on the target architecture. That assumption leaves every already-deployed MoE model stranded: you either accept full inference costs or rebuild.
ZEDA (Zero-Expert Self-Distillation Adaptation) bypasses that constraint entirely. The core insight is architectural: inject parameter-free zero-output experts into each MoE layer, then train the augmented model to route easy tokens toward those zero experts instead of activating real ones. The zero experts contribute nothing computationally. They are placeholders that give the router somewhere to send tokens it doesn't need to process. The original frozen model acts as a teacher throughout, and a group-level balancing loss prevents the router from collapsing to trivially skipping everything. Two stages of self-distillation stabilize the transition without touching pretraining data or budgets.
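The routing mechanics described above can be illustrated with a minimal sketch. It is a toy, not ZEDA's implementation: the function names (`route_token`, `moe_forward`), the softmax-then-top-k router, and the choice to use un-renormalized gate weights are all assumptions made for illustration. The only idea taken from the source is that extra router slots map to parameter-free experts whose output is zero, so any top-k slot that lands on one costs no expert compute.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, num_real, top_k):
    """Pick the top_k experts for one token.

    Router slots with index >= num_real are zero-output experts
    (hypothetical layout). Returns (real, skipped): `real` lists the
    (expert_index, gate_weight) pairs that need a forward pass, and
    `skipped` counts the top_k slots that landed on zero experts.
    """
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    real = [(i, probs[i]) for i in chosen if i < num_real]
    return real, top_k - len(real)

def moe_forward(x, experts, router_logits, top_k):
    """Gate-weighted sum over the selected real experts.

    Zero experts are never called: their contribution is zero by
    definition, so the gate mass routed to them simply drops out
    (an assumption of this sketch, not a claim about ZEDA).
    """
    real, skipped = route_token(router_logits, len(experts), top_k)
    out = [0.0] * len(x)
    for idx, gate in real:
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out, skipped

# Toy setup: 4 real experts (simple scalings) plus 2 zero-expert slots.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
easy_logits = [0.1, 0.2, 0.0, 0.1, 3.0, 2.5]    # router prefers zero experts
hard_logits = [3.0, 0.1, 2.0, 0.0, -1.0, -1.0]  # router prefers real experts

for name, logits in (("easy", easy_logits), ("hard", hard_logits)):
    out, skipped = moe_forward([1.0, 2.0], experts, logits, top_k=2)
    print(name, "skipped slots:", skipped, "output:", out)
```

In this toy, the "easy" token runs no real expert at all, while the "hard" token runs two. A balancing term like the one the article mentions would be needed in training to keep a router from sending every token down the first path.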
Think of it as teaching a model to recognize which questions don't need a specialist. A capable routing system already exists inside every trained MoE. ZEDA retrains only the routing behavior, not the expert weights, so the experts that matter stay intact while the router learns to idle on inputs that don't require them.
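The article does not give ZEDA's exact training objective, so the following is only a sketch of what a self-distillation signal against a frozen teacher commonly looks like: a KL divergence between the teacher's and the student's next-token distributions at one position. The function names and the use of plain KL are assumptions. The group-level balancing loss is left out because its form is not described here.

```python
import math

def log_softmax(logits):
    # Stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary for one token position.

    The teacher is the original, frozen model; the student is the same
    model with zero experts added (hypothetical setup for illustration).
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

teacher = [2.0, 0.5, -1.0]
print(distill_kl(teacher, teacher))           # identical outputs
print(distill_kl(teacher, [1.0, 1.0, -1.0]))  # student has drifted
```

The loss is zero when the student reproduces the teacher's distribution and grows as skipping experts changes the output, which is the pressure that keeps accuracy close to the original model.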
On Qwen3-30B-A3B and GLM-4.7-Flash, tested across 11 benchmarks covering math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It beats the strongest dynamic MoE baseline by 6.1 points on Qwen3 and 4.0 points on GLM-4.7-Flash, and delivers approximately 1.20x end-to-end inference speedup. For ML infrastructure teams running MoE models in production, the takeaway is direct: cutting expert compute by half no longer requires a pretraining budget or a new model checkpoint.
We're thinking: The zero-expert injection approach is more significant than the speedup number alone suggests. The 1.20x end-to-end figure is real but likely conservative, because it reflects current batching and serving infrastructure that wasn't built to exploit dynamic expert skipping. As serving frameworks add native support for conditional expert execution, the practical ceiling should rise. The more pointed question is whether the 50% FLOP reduction holds across the token difficulty distribution of real production traffic, which may skew harder than benchmark suites. Teams running Mixtral-class models on coding or math workloads should treat the benchmark results as an upper bound on gains and run their own routing analysis before committing to deployment changes.
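A back-of-the-envelope check shows why a 50% cut in expert FLOPs yields only about a 1.20x speedup. The sketch below applies Amdahl's law under an idealized assumption that is not from the source: runtime is proportional to FLOPs, and only the expert portion gets faster. The function names are hypothetical.

```python
def speedup_from_share(expert_share, flop_cut):
    """End-to-end speedup if experts take `expert_share` of baseline
    runtime and `flop_cut` of that work is removed (Amdahl's law)."""
    return 1.0 / (1.0 - expert_share * flop_cut)

def implied_expert_share(speedup, flop_cut):
    """Invert the formula above: which expert share of runtime would
    explain an observed speedup at a given FLOP cut?"""
    return (1.0 - 1.0 / speedup) / flop_cut

# Reported figures: roughly 50% of expert FLOPs removed, 1.20x speedup.
share = implied_expert_share(1.20, 0.5)
print(f"implied expert share of runtime: {share:.3f}")
print(f"speedup if experts were 60% of runtime: {speedup_from_share(0.6, 0.5):.2f}x")
```

Under these assumptions, the reported numbers are consistent with experts accounting for about one third of end-to-end time. Equivalently, if experts take a larger share, not all of the skipped FLOPs are turning into saved wall-clock time. That leaves the room for improvement from better serving support discussed above.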
Key takeaways:
- ZEDA converts static MoE models into dynamic ones by injecting parameter-free zero-output experts and retraining only routing behavior through two-stage self-distillation against a frozen teacher, leaving expert weights unchanged.
- On two models across 11 benchmarks, ZEDA cuts expert FLOPs by over 50% and delivers 1.20x end-to-end speedup, outperforming the strongest dynamic MoE baseline by 4.0 to 6.1 points; the caveat is that real-world gains depend on traffic difficulty distribution and serving framework support for dynamic expert skipping.
- Teams serving Mixtral-class or other post-trained MoE models should evaluate ZEDA as a post-training inference optimization, one that needs a self-distillation run but no pretraining budget, alongside quantization and before committing to costlier alternatives such as distillation into a smaller dense model.
Source: Post-Trained MoE Can Skip Half Experts via Self-Distillation