Google DeepMind Shows Large Models Can Train Across Distant Regions on Low-Bandwidth Networks, Challenging the Case for Centralized Supercomputers
Google DeepMind successfully trained a 12-billion-parameter Gemma model distributed across four U.S. regions using low-bandwidth networks, according to a post from the official GoogleDeepMind account. The team also demonstrated that heterogeneous hardware, specifically TPU v6e and TPU v5p chips, can be mixed within a single training run without degrading performance. Together, the two results amount to a proof of concept for what DeepMind calls a rethink of "global compute," suggesting that the infrastructure assumptions that have governed large-scale AI training since GPT-3 are now being stress-tested at Google scale.
The competitive implications are significant. The prevailing model for frontier AI training requires tightly coupled, co-located clusters with high-speed interconnects, a constraint that concentrates capability with whoever can build or afford the largest single facility. If DeepMind's approach generalizes, it would ease that bottleneck in two ways at once: geographically distributed training reduces the need to aggregate thousands of chips in one physical location, and hardware-generation mixing lets operators draw on existing installed capacity rather than wait for uniform next-generation deployments. That would threaten the moat that Nvidia and rival hyperscalers such as Microsoft and Amazon have built around high-bandwidth networking infrastructure, while potentially giving Google a way to use its globally distributed TPU footprint more aggressively. Startups and sovereign AI programs that previously lacked the capital to build monolithic clusters would become more viable actors under this paradigm.
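To make the bandwidth argument concrete, here is a back-of-envelope sketch of why synchronization frequency matters on slow links. Only the 12-billion-parameter figure comes from the post; the bf16 payload size, the link speeds, the one-second step time, and the sync intervals are illustrative assumptions, and the post does not say which synchronization scheme DeepMind used. The model is deliberately crude: it ignores collective-communication overheads, compression, and any overlap of communication with compute.

```python
# Back-of-envelope estimate of communication cost for cross-region training.
# Everything except the 12B parameter count is an assumption for illustration.

PARAMS = 12e9          # from the post: 12-billion-parameter model
BYTES_PER_PARAM = 2    # assumption: values are exchanged in bf16


def sync_seconds(link_gbps: float) -> float:
    """Seconds to move one full copy of the parameters over a link.

    link_gbps is the assumed usable bandwidth in gigabits per second.
    """
    payload_bits = PARAMS * BYTES_PER_PARAM * 8
    return payload_bits / (link_gbps * 1e9)


def comm_fraction(link_gbps: float, step_seconds: float,
                  steps_between_syncs: int) -> float:
    """Share of wall-clock time spent communicating.

    Assumes communication is not overlapped with compute, so each sync
    adds its full transfer time on top of the compute for the interval.
    """
    sync = sync_seconds(link_gbps)
    compute = step_seconds * steps_between_syncs
    return sync / (sync + compute)


if __name__ == "__main__":
    # Hypothetical links: 400 Gbps inside a datacenter, 10 Gbps between regions.
    for gbps in (400, 10):
        for interval in (1, 500):
            frac = comm_fraction(gbps, step_seconds=1.0,
                                 steps_between_syncs=interval)
            print(f"{gbps:>4} Gbps, sync every {interval:>3} steps: "
                  f"{frac:.1%} of time communicating")
```

Under these assumptions, one full parameter exchange takes about 19.2 seconds at 10 Gbps versus 0.48 seconds at 400 Gbps. Syncing every step over the slow link would leave roughly 95% of wall-clock time spent on communication, while syncing every 500 steps brings that below 4%. That is the general reason approaches that synchronize infrequently can tolerate cheap networks, whatever the specific method behind this result.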
This result connects to a broader pattern of research aimed at decoupling model scale from infrastructure concentration. Work on pipeline parallelism, federated training, and parameter-server architectures has been accumulating for years, but a credible demonstration at 12B parameters across a multi-region, mixed-hardware configuration marks a threshold moment. The trajectory points toward a world where compute abundance is defined not by peak cluster density but by coordination efficiency across distributed, heterogeneous resources, a shift that would redraw competitive maps across cloud providers, chip designers, and AI labs alike.
Source: https://twitter.com/GoogleDeepMind/status/2047330992713589009