One Base Model, One Million Policies: MinT's LoRA Adapter Architecture

MinT keeps a single resident base model and hot-swaps LoRA adapters at request time, cutting per-policy GPU cost to near-zero at million-adapter scale.

The standard assumption in multi-tenant LLM serving is that each fine-tuned policy needs its own weights on disk, and ideally its own memory footprint at inference time. At dozens of adapters, that assumption is manageable. At millions, it collapses. The problem is not the price of storage; it is that the checkpoint-per-policy model forces every pipeline stage (training handoff, rollout, evaluation, and serving) to move full sets of model weights around a cluster.

MinT (MindLab Toolkit) treats the base model as permanent infrastructure and treats LoRA adapters as the only artifact that moves. The base model stays resident on GPU across all policies. Adapters, which in rank-1 configurations can be under 1% of base-model size, travel through the entire lifecycle: rollout, update, export, evaluation, serving, and rollback. Nothing else does. The architecture hides distributed training, scheduling, and data movement behind a single service interface, so callers interact with named policy revisions, not with weight tensors.
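The idea of one resident base plus small per-policy adapters can be illustrated with a toy linear layer. This is a minimal sketch of the general LoRA formulation (output = W x + scale * B (A x)), not MinT's implementation: the names SharedBaseLayer, LoRAAdapter, register, and the "policy-a@v1" revision strings are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain-Python matrix-vector product, enough for a toy layer."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass(frozen=True)
class LoRAAdapter:
    """Hypothetical adapter record: A is (rank x d_in), B is (d_out x rank)."""
    A: Matrix
    B: Matrix
    scale: float = 1.0


class SharedBaseLayer:
    """One resident weight matrix shared by every registered policy."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight  # loaded once, never copied per policy
        self.adapters: Dict[str, LoRAAdapter] = {}

    def register(self, revision: str, adapter: LoRAAdapter) -> None:
        self.adapters[revision] = adapter

    def forward(self, x: Vector, revision: Optional[str] = None) -> Vector:
        y = matvec(self.weight, x)  # shared base path: W x
        if revision is None:
            return y
        ad = self.adapters[revision]
        delta = matvec(ad.B, matvec(ad.A, x))  # low-rank path: B (A x)
        return [yi + ad.scale * di for yi, di in zip(y, delta)]


layer = SharedBaseLayer([[1.0, 0.0], [0.0, 1.0]])
layer.register("policy-a@v1", LoRAAdapter(A=[[1.0, 1.0]], B=[[0.5], [0.0]]))
layer.register("policy-b@v1", LoRAAdapter(A=[[1.0, -1.0]], B=[[0.0], [2.0]]))

x = [2.0, 3.0]
print(layer.forward(x))                 # base only: [2.0, 3.0]
print(layer.forward(x, "policy-a@v1"))  # base + adapter a: [4.5, 3.0]
print(layer.forward(x, "policy-b@v1"))  # base + adapter b: [2.0, 1.0]
```

Both policies read the same weight matrix; only the small A and B factors differ per revision, which is why the adapter is the only artifact that needs to move.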

The mechanism works across three scaling axes. Scale Up validates LoRA RL training and serving on frontier-scale dense and Mixture-of-Experts architectures, including MLA and DSA attention paths, with end-to-end validation beyond 1 trillion total parameters. Scale Down exploits the size asymmetry directly: moving only the exported LoRA adapter rather than a merged checkpoint cuts the measured handoff step by 18.3x on a 4B dense model and 2.85x on a 30B MoE. Running concurrent multi-policy GRPO training on the same base further shortens wall time by 1.77x and 1.45x on those respective architectures, without increasing peak memory. Scale Out addresses the catalog problem: a tensor-parallel deployment supports addressable catalogs at 10^6 scale, with single-engine sweeps measured through 100,000 adapters and thousand-adapter active waves at cluster scale. Packed MoE LoRA tensors improve live engine loading by 8.5 to 8.7x, and cold loading is scheduled as routine service work rather than treated as a blocking operation.
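The Scale Out idea, a large addressable catalog with a much smaller resident set and cold loads treated as ordinary work, can be sketched with a bounded cache. The source does not say which eviction or scheduling policy MinT uses; least-recently-used eviction and the synchronous loader below are assumptions made for illustration, and AdapterCatalog and cold_load are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class AdapterCatalog:
    """Addressable catalog of adapter revisions with a bounded resident set."""

    def __init__(self, capacity: int, cold_load: Callable[[str], bytes]) -> None:
        self.capacity = capacity    # max adapters resident at once
        self.cold_load = cold_load  # hypothetical loader (disk, object store, ...)
        self.resident: "OrderedDict[str, bytes]" = OrderedDict()
        self.cold_loads = 0
        self.evicted: List[str] = []

    def get(self, revision: str) -> bytes:
        if revision in self.resident:
            self.resident.move_to_end(revision)  # mark as most recently used
            return self.resident[revision]
        # Miss: cold-load the adapter, then evict the least recently used if over capacity.
        self.cold_loads += 1
        weights = self.cold_load(revision)
        self.resident[revision] = weights
        if len(self.resident) > self.capacity:
            old, _ = self.resident.popitem(last=False)
            self.evicted.append(old)
        return weights


# Five addressable revisions, but only two may be resident at a time.
store: Dict[str, bytes] = {f"policy-{i}@v1": bytes([i]) for i in range(5)}
catalog = AdapterCatalog(capacity=2, cold_load=store.__getitem__)

for name in ["policy-0@v1", "policy-1@v1", "policy-0@v1", "policy-2@v1"]:
    catalog.get(name)

print(list(catalog.resident))               # ['policy-0@v1', 'policy-2@v1']
print(catalog.cold_loads, catalog.evicted)  # 3 ['policy-1@v1']
```

A production service would presumably load adapters asynchronously and batch requests by active adapter; the sketch only shows why catalog size and resident-set size can be decoupled.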

The headline result is that MinT manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models, with no per-policy full checkpoint materialization anywhere in the pipeline. For ML infrastructure teams running LoRA fine-tuning at scale, the takeaway is direct: the bottleneck is not adapter count but whether the serving layer is built around the assumption that each policy owns its own copy of the base model.

We're thinking: The Scale Out numbers are more significant than the handoff and training speedups. An 18x handoff improvement is useful engineering. But demonstrating a single-engine sweep through 100,000 adapters and a path to 10^6 addressable policies is a different category of claim: it means the adapter-as-artifact model is not a clever trick for small fleets but a viable serving primitive for product-scale personalization and multi-tenant fine-tuning. The real question MinT raises for teams is organizational, not technical: if per-policy GPU cost approaches zero and catalog management becomes a scheduling problem rather than a memory problem, the constraint on how many fine-tuned policies a product can maintain shifts from infrastructure to evaluation and governance. That is a harder problem, and MinT does not solve it.

Key takeaways:

MinT keeps one base model resident on GPU and routes only LoRA adapter revisions through the full training-to-serving lifecycle, eliminating full checkpoint materialization per policy.
Adapter-only handoff cuts the measured handoff step by 18.3x on a 4B dense model and 2.85x on a 30B MoE; concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x on those same architectures; packed MoE LoRA tensors improve live engine loading by 8.5 to 8.7x. Separately, training and serving are validated end to end beyond 1T total parameters. Caveat: results are from MindLab's own cluster; external reproducibility on heterogeneous hardware is unconfirmed.
Teams running LoRA fine-tuning pipelines at scale should audit whether their current serving stack materializes merged checkpoints per policy: if it does, MinT's adapter-only handoff design is the architectural reference to evaluate against before the next capacity planning cycle.

Source: MinT: Managed Infrastructure for Training and Serving Millions of LLMs