Perplexity's GB200 Serving Stack Benchmarks Prefill-Decode Gains Over Hopper for Large MoEs

Perplexity has published details of its GB200 inference stack for large MoEs like Qwen, giving infra teams concrete throughput numbers to compare against Hopper-generation H100 hardware.

Perplexity CEO Aravind Srinivas announced on May 10, 2026, that the company has published technical details of its GB200-based inference stack, aimed at large mixture-of-experts (MoE) models like Qwen. According to the post, NVIDIA's GB200 NVL72 architecture changes how prefill-decode disaggregation is structured when serving large MoEs, and the writeup quantifies throughput improvements over the same workloads on Hopper-generation H100 hardware.

The competitive pressure falls on inference providers and cloud operators currently standardized on H100 clusters. Prefill-decode disaggregation has been one of the more contested engineering problems in production LLM serving: prefill is typically compute-bound, decode is typically memory-bandwidth-bound, and routing them to separate hardware pools is expensive to tune. GB200's NVLink-connected CPU-GPU architecture, which the NVL72 scales to 72 GPUs in a single NVLink domain, shifts those tradeoffs materially. By publishing concrete throughput numbers, Perplexity is doing something most inference-focused companies avoid: making the performance delta against existing Hopper deployments legible to outside engineers. That puts pressure on competitors like Together AI, Fireworks AI, and cloud inference layers at AWS and Azure to either match the benchmark or explain why their Hopper-based stacks remain competitive on cost-per-token.
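To make the compute-bound versus memory-bandwidth-bound distinction concrete, here is a minimal roofline-style sketch. It is not taken from Perplexity's writeup. It assumes a simplified model in which each forward pass reads every weight once and performs about two FLOPs per parameter per token, and it ignores KV-cache traffic, activations, and MoE expert routing. The hardware figures are hypothetical placeholders, not H100 or GB200 specifications.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Simplified model: FLOPs ~= 2 * params * tokens and bytes ~= params * bytes_per_param,
    so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def classify(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
             bytes_per_param: float = 2.0) -> str:
    """Compare a workload's arithmetic intensity with the machine balance point."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte it can read
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= machine_balance else "memory-bandwidth-bound"


# Hypothetical accelerator, chosen only for illustration (not a real spec sheet).
HYPOTHETICAL_PEAK_FLOPS = 1.0e15   # FLOP/s
HYPOTHETICAL_MEM_BW = 3.0e12       # bytes/s

# Prefill processes a whole prompt in one pass; decode processes one new token per sequence.
prefill = classify(2048, HYPOTHETICAL_PEAK_FLOPS, HYPOTHETICAL_MEM_BW)  # 2048-token prompt
decode = classify(8, HYPOTHETICAL_PEAK_FLOPS, HYPOTHETICAL_MEM_BW)      # batch of 8 sequences

print(f"prefill: {prefill}")
print(f"decode:  {decode}")
```

Under these assumptions the prefill pass lands well above the balance point and the decode step well below it, which is the asymmetry that motivates sending the two phases to separately tuned hardware pools. Real deployments would need measured numbers rather than this back-of-envelope ratio.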

The broader pattern is worth tracking. Inference optimization is moving from a proprietary moat into a published-benchmark competition. Groq did this with latency numbers on custom silicon; Perplexity is doing it on commodity-adjacent GPU infrastructure. For teams running large MoE serving at scale, this writeup is a direct reference point before any GB200 procurement decision. The next move to watch: whether other inference providers publish comparable disaggregation benchmarks on GB200, or whether Perplexity's numbers become the default baseline simply by virtue of being first.

Source: @AravSrinivas on X