MoE Experts Are Entangled by Default. EMO Fixes That at 1B Scale.

EMO pretrains MoE models so that expert subsets specialize by domain, making it possible to drop 75% of experts at inference for roughly a 1% absolute accuracy loss.

Mixture-of-Experts models promise cheap inference: activate a fraction of the parameters per token, pay a fraction of the compute. Memory is another story, because every expert still has to be loaded. The assumption was that restricting inference to a domain-relevant expert subset would work naturally and close that gap. It does not. Standard MoEs route tokens to experts based on learned affinities that have no semantic coherence at the expert-group level, so isolating any subset collapses performance. The architecture looks modular. It is not.

EMO (Emergent Modularity via pretraining) attacks this at the training objective, not at inference time. The core insight is that tokens within a single document tend to share a domain, so EMO forces all tokens in a document to draw from a shared expert pool rather than routing independently across the full expert set. Different documents can use different pools. No human-defined domain labels, no explicit routing supervision: just document boundaries as the structural prior. The result is that coherent expert groupings emerge on their own during pretraining.
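A minimal sketch of the document-level constraint described above, in plain Python. It is not the paper's implementation: the pool-selection rule (highest mean router score over the document), the function names, and the parameters `pool_size` and `top_k` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def top_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores; ties go to the lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def route_document(
    token_logits: List[List[float]], pool_size: int, top_k: int
) -> Tuple[List[int], List[List[int]]]:
    """Route every token of ONE document inside a shared expert pool.

    token_logits[t][e] is the router score of token t for expert e.
    Assumption (not from the paper): the pool is the `pool_size` experts
    with the highest mean router score across the document.
    """
    num_experts = len(token_logits[0])
    if not (1 <= top_k <= pool_size <= num_experts):
        raise ValueError("need 1 <= top_k <= pool_size <= num_experts")

    # Document-level score per expert: mean of the token-level logits.
    num_tokens = len(token_logits)
    mean_scores = [
        sum(row[e] for row in token_logits) / num_tokens
        for e in range(num_experts)
    ]
    pool = sorted(top_indices(mean_scores, pool_size))
    in_pool = set(pool)

    # Each token still picks its own top-k, but only among pool members.
    routes = []
    for row in token_logits:
        masked = [s if e in in_pool else float("-inf") for e, s in enumerate(row)]
        routes.append(top_indices(masked, top_k))
    return pool, routes


if __name__ == "__main__":
    doc = [
        [2.0, 1.0, 0.1, 3.0],  # alone, this token would prefer expert 3
        [1.5, 2.5, 0.2, 0.0],
        [2.2, 1.8, 0.3, 0.1],
    ]
    print(route_document(doc, pool_size=2, top_k=1))
```

In this toy document the first token would pick expert 3 if routed independently, but the document as a whole favors experts 0 and 1, so the token is routed inside that pool. A different document would compute its own pool from its own tokens.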

Think of it as the difference between a shared office where anyone sits anywhere versus assigned team bays. Standard MoEs produce the former: routing is efficient globally but experts develop mixed, overlapping specializations that fall apart when you try to isolate any one team. EMO produces the latter, where the constraint is imposed during pretraining and the specialization is a byproduct.

The specialization that emerges is semantic, not syntactic. Standard MoEs show low-level syntactic specialization across experts. EMO's expert subsets align with domains: math tokens cluster in math-relevant pools, code tokens in code-relevant pools. That distinction matters because deployments are scoped by domain, not by token type, so syntactic specialization does not support modular deployment while semantic specialization does.
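One hypothetical way to check for domain-level specialization is to compare which experts each domain uses most. The sketch below is not the paper's metric: the helper names, the `keep` parameter, and the toy routing traces are assumptions for illustration. Low overlap between domains is what domain-aligned pools would look like; high overlap suggests entangled experts.

```python
from collections import Counter
from typing import Iterable, Set


def top_experts_by_usage(routes: Iterable[Iterable[int]], keep: int) -> Set[int]:
    """Return the `keep` most frequently selected experts.

    `routes` holds, for each token, the expert indices that token was
    routed to. Ties are broken in favor of the lower expert index.
    """
    counts = Counter(e for token_route in routes for e in token_route)
    ranked = sorted(counts, key=lambda e: (-counts[e], e))
    return set(ranked[:keep])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap of two expert sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Toy routing traces (hypothetical), two experts per token.
    math_routes = [[0, 1], [0, 1], [1, 2]]
    code_routes = [[5, 6], [6, 7], [5, 6]]
    math_top = top_experts_by_usage(math_routes, keep=2)
    code_top = top_experts_by_usage(code_routes, keep=2)
    print(math_top, code_top, jaccard(math_top, code_top))
```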

An EMO model with 1B active and 14B total parameters, pretrained on 1T tokens, matches a standard MoE when all experts are used. Retaining only 25% of experts incurs a 1% absolute accuracy drop. Retaining 12.5% of experts incurs a 3% absolute drop. Under the same conditions, standard MoEs break. For teams running inference on memory-constrained hardware, the takeaway is direct: modularity-aware pretraining is the missing prerequisite for actually deploying sparse models at a fraction of their parameter footprint.
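To see what those retention levels could mean for memory, here is a back-of-the-envelope sketch. The 14B total comes from the text above; the split between shared (non-expert) and expert parameters is not given, so the 0.5B shared figure below is a placeholder assumption, as is the function name. Only expert parameters shrink when experts are dropped.

```python
def retained_params(
    total_params: float, shared_params: float, keep_fraction: float
) -> float:
    """Parameters left after keeping `keep_fraction` of the experts.

    Shared parameters (attention, embeddings, router, and so on) are
    always kept; only the expert parameters scale with keep_fraction.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not 0.0 <= shared_params <= total_params:
        raise ValueError("shared_params must be between 0 and total_params")
    expert_params = total_params - shared_params
    return shared_params + keep_fraction * expert_params


if __name__ == "__main__":
    TOTAL = 14e9    # from the article: 14B total parameters
    SHARED = 0.5e9  # ASSUMPTION: placeholder, not reported in the article
    for keep in (1.0, 0.25, 0.125):
        params = retained_params(TOTAL, SHARED, keep)
        print(f"keep {keep:>5.1%} of experts -> {params / 1e9:.3f}B parameters")
```

Under that placeholder split, keeping 25% of experts leaves about 3.9B parameters and keeping 12.5% leaves about 2.2B. The real figures depend on the model's actual shared-parameter count.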

We're thinking: The entanglement result is more consequential than the accuracy numbers. EMO shows that standard MoE pretraining produces experts that do not work in isolation, which means every memory-efficiency argument made for MoEs at the architecture level has been contingent on keeping the full model loaded anyway. The practical implication is uncomfortable: teams that adopted MoEs specifically for deployment flexibility may have gotten the compute savings but not the memory savings. EMO reframes the problem: on this evidence, modularity is a pretraining property, not something you can retrofit. The 1% degradation at 25% expert retention is a number worth testing against your actual memory budget before dismissing it.

Key takeaways:

EMO constrains all tokens in a document to share a single expert pool during pretraining, causing semantic domain specialization to emerge from document boundaries alone without any explicit domain labels.
A 1B-active/14B-total model pretrained on 1T tokens matches a standard MoE as a full model and loses only 1% (3%) absolute accuracy when 75% (87.5%) of experts are dropped at inference. Standard MoEs fail under identical conditions. Results are so far reported at the 1B-active scale only.
Teams deploying large sparse models under memory constraints should treat modularity-aware pretraining as a first-class objective, not an inference-time optimization: if the model was not trained with EMO-style document-level pooling, these results suggest expert subsetting will not be reliable.

Source: EMO: Pretraining Mixture of Experts for Emergent Modularity