Post-Trained MoE Can Skip Half Its Experts Without Retraining

ZEDA converts static MoE models into dynamic ones via self-distillation, cutting over 50% of expert FLOPs with minimal accuracy loss and no pretraining.

The standard assumption in dynamic MoE research is that expert-skipping behavior must be baked in from the start: either train the model with dynamic routing from scratch, or accept that post-training adaptation will break something. That assumption has kept a large class of already-deployed, fully trained MoE models locked into their original compute costs, even when most of those expert activations are wasted on easy tokens.

ZEDA (Zero-Expert Self-Distillation Adaptation) bypasses that constraint entirely. The core idea is architectural injection followed by knowledge recovery. Into each MoE layer, ZEDA inserts parameter-free "zero-output experts": experts that produce no output and consume no compute, functioning as explicit skip targets for the router. The model now has somewhere to route easy tokens without discarding any original expert weights. The problem is that the router, trained on a static architecture, has no learned behavior for these new slots. ZEDA solves this through two-stage self-distillation: the original frozen MoE acts as its own teacher, and the augmented model is trained to match its outputs while a group-level balancing loss prevents routing collapse, where all tokens pile into a single path.
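To make the skip mechanism concrete, here is a minimal sketch in plain Python of top-k routing over a layer augmented with zero-output experts. It is an illustration under stated assumptions, not the authors' implementation: the function name `moe_forward`, the slot layout (real experts first, zero-output slots last), and the use of unnormalized softmax gate weights are all choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_logits, experts, num_zero_experts, top_k):
    """Route one token through an MoE layer with extra zero-output slots.

    Assumed layout (hypothetical): slots 0..len(experts)-1 are real experts,
    the last num_zero_experts slots are parameter-free zero-output experts.
    Returns (layer_output, number_of_real_experts_executed).
    """
    assert len(router_logits) == len(experts) + num_zero_experts
    probs = softmax(router_logits)
    # Pick the top_k slots with the highest router probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    output = [0.0] * len(token)
    executed = 0
    for slot in chosen:
        if slot >= len(experts):
            # Zero-output expert: contributes nothing and runs no computation.
            continue
        expert_out = experts[slot](token)
        executed += 1
        for d in range(len(token)):
            output[d] += probs[slot] * expert_out[d]
    return output, executed


# Two toy "real" experts plus two zero-output slots, top-2 routing.
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
token = [1.0, 2.0]

hard = moe_forward(token, [5.0, 4.0, -5.0, -5.0], experts, 2, 2)   # both real experts run
mixed = moe_forward(token, [3.0, -5.0, 2.0, -5.0], experts, 2, 2)  # one real, one skipped
easy = moe_forward(token, [-5.0, 0.0, 5.0, 4.0], experts, 2, 2)    # fully skipped

print(hard[1], mixed[1], easy[1])  # prints: 2 1 0
```

A token whose top-k slots are all zero-output experts executes no expert at all, while a token that picks a mix pays only for the real experts it selects. The original expert weights are untouched in every case.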

The analogy is approximate but useful: think of it as adding emergency exits to a building that was designed without them, then running fire drills until occupants learn to use them without prompting. The structure changes; the behavior is recovered through practice against the original blueprint.
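In training terms, the "fire drill" is a distillation objective: the augmented model is pushed toward the frozen original's outputs while a balancing term keeps routing from collapsing. The summary above does not give the loss formula or say what distinguishes the two stages, so the sketch below is a generic stand-in rather than ZEDA's actual objective. It combines a standard KL term between teacher and student token distributions with a hypothetical penalty on how much routing mass lands on the zero-expert group. The names `group_balance_penalty`, `target_zero_share`, and `balance_weight` are invented for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) for one token's output distribution."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def group_balance_penalty(mean_router_probs, num_real, target_zero_share):
    """Hypothetical balancing term, NOT taken from the source.

    Treats real experts (first num_real slots) and zero-output experts
    (remaining slots) as two groups, and penalizes the squared gap between
    the zero group's share of routing mass and a chosen target share.
    """
    zero_share = sum(mean_router_probs[num_real:])
    return (zero_share - target_zero_share) ** 2


def adaptation_loss(teacher_logits, student_logits, mean_router_probs,
                    num_real, target_zero_share=0.5, balance_weight=0.1):
    """Distillation term plus weighted balancing term (illustrative weights)."""
    distill = kl_teacher_student(teacher_logits, student_logits)
    balance = group_balance_penalty(mean_router_probs, num_real, target_zero_share)
    return distill + balance_weight * balance


# Student matches the teacher and routes half its mass to zero experts.
matched = adaptation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0],
                          [0.25, 0.25, 0.25, 0.25], num_real=2)
# Student drifts from the teacher and never uses the zero experts.
drifted = adaptation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0],
                          [0.5, 0.5, 0.0, 0.0], num_real=2)
print(matched < drifted)  # prints: True
```

The design point this illustrates is only the shape of the trade-off: the distillation term rewards fidelity to the original model, and the balancing term stops the router from ignoring the new skip slots or sending everything to them.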

On Qwen3-30B-A3B and GLM-4.7-Flash, tested across 11 benchmarks covering math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 points on Qwen3-30B-A3B and 4.0 points on GLM-4.7-Flash, and delivers approximately 1.20x end-to-end inference speedup. For ML infrastructure teams running Mixtral-class or similar post-trained MoE deployments, the takeaway is direct: halving active expert compute is now an adaptation problem, not a retraining problem.
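The gap between "over 50% of expert FLOPs" and "approximately 1.20x end-to-end" is what Amdahl's law would predict when experts account for only part of total inference time. The back-of-the-envelope sketch below assumes that expert runtime scales linearly with expert FLOPs, which real serving stacks need not obey (memory-bound decoding, routing overhead, and batching effects all interfere). The helper names are hypothetical, and the implied runtime share is a derived illustration, not a figure reported in the source.

```python
def end_to_end_speedup(expert_time_share, expert_flops_removed):
    """Amdahl-style speedup when only the expert portion gets cheaper.

    expert_time_share: fraction of baseline runtime spent in experts (assumed).
    expert_flops_removed: fraction of expert FLOPs eliminated.
    """
    remaining = (1.0 - expert_time_share) + expert_time_share * (1.0 - expert_flops_removed)
    return 1.0 / remaining


def implied_expert_time_share(speedup, expert_flops_removed):
    """Invert the formula: what expert runtime share would yield this speedup?"""
    return (1.0 - 1.0 / speedup) / expert_flops_removed


# Under the linear-scaling assumption, a 1.20x speedup from removing 50% of
# expert FLOPs corresponds to experts taking about a third of baseline runtime.
print(round(implied_expert_time_share(1.20, 0.5), 3))  # prints: 0.333
print(round(end_to_end_speedup(1.0 / 3.0, 0.5), 2))    # prints: 1.2
```

Read this way, the end-to-end number is bounded by everything outside the experts (attention, routing, memory traffic), which is why a large cut in expert FLOPs shows up as a smaller wall-clock gain.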

We're thinking: The framing here matters more than the speedup number alone. A 1.20x end-to-end speedup sounds modest, but it arrives without touching the pretraining pipeline, without task-specific fine-tuning, and with only marginal accuracy loss on a model that teams have already validated. The real unlock is that ZEDA decouples inference cost reduction from model development cycles. The caveat worth watching: the method is tested on two models of similar scale, roughly 30B total parameters with about 3B active per token. Whether the zero-expert routing behavior generalizes cleanly to larger MoE architectures, or to models with different expert granularity, is an open question that practitioners should probe before treating this as a universal recipe.

Key takeaways:

ZEDA converts static post-trained MoE models into dynamic ones by injecting parameter-free zero-output experts and recovering routing behavior through two-stage self-distillation against the original frozen model, with no pretraining required.
ZEDA eliminates over 50% of expert FLOPs on Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, with an approximately 1.20x end-to-end speedup and margins of 6.1 and 4.0 points, respectively, over the strongest dynamic MoE baseline. It was tested on two models only, so generalization to larger or differently structured MoE architectures is unconfirmed.
Teams serving post-trained MoE models at scale should evaluate ZEDA as a post-training adaptation step before any infrastructure-level optimization: the compute reduction requires a self-distillation run against the frozen original model, but no changes to the upstream pretraining or fine-tuning pipeline.

Source: Post-Trained MoE Can Skip Half Experts via Self-Distillation