Re-Thinking DiLoCo through Generalized Primal Averaging for Distributed Training

Date and time: Friday 6 March 2026, 10:00-11:00 CET
Speaker: Hao-Jun Michael Shi, Meta Platforms
Title: Re-Thinking DiLoCo through Generalized Primal Averaging for Distributed Training

Where: Digital Futures hub, Osquars Backe 5, floor 2 at KTH main campus OR Zoom
Directions: https://www.digitalfutures.kth.se/contact/how-to-get-here/
OR
Zoom: https://kth-se.zoom.us/j/69560887455

Host: Mikael Johansson <mikaelj@kth.se>

[Speaker photo]

Bio: Michael Shi is a Research Scientist in the Kernels and Optimizations team within Meta Superintelligence Labs. He obtained his B.S. degree in Applied Mathematics from the University of California, Los Angeles, and his Ph.D. from Northwestern University in Industrial Engineering and Management Sciences under the supervision of Prof. Jorge Nocedal. His team recently won the MLCommons’ AlgoPerf Training Algorithms competition (external tuning track).

He previously received the Walter P. Murphy Fellowship at Northwestern and the NSF Graduate Research Fellowship Honorable Mention in 2016 and 2017, and was recognized as a top ICML reviewer in 2019. His current research interests are in the design and implementation of scalable and distributed training algorithms and systems for deep learning. He has also contributed to stochastic optimization, noisy optimization, and derivative-free optimization, as well as to recommender systems and embedding compression.

Abstract: The Distributed Low-Communication (DiLoCo) algorithm has recently emerged as a promising solution for training Large Language Models across geographically dispersed compute clusters. The method modifies Local SGD by applying Nesterov momentum to the averaged local trajectories (i.e., the pseudo-gradients), matching the performance of synchronous training with AdamW despite reduced communication costs.
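
To make the two-loop structure concrete, here is a minimal single-process simulation of DiLoCo-style rounds on a toy least-squares problem. Plain SGD stands in for the inner optimizer, the pseudo-gradient is the displacement from the shared parameters to the average of the workers' local iterates, and the outer step is a Nesterov-style momentum update. The objective, step sizes, step counts, and variable names are illustrative choices for this sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: M workers, each holding its own least-squares shard.
M, H, T, d = 4, 20, 30, 10      # workers, inner steps, outer rounds, dimension
A = [rng.normal(size=(50, d)) for _ in range(M)]
b = [rng.normal(size=50) for _ in range(M)]

def grad(m, x):
    """Gradient of the m-th worker's loss 0.5 * ||A_m x - b_m||^2."""
    return A[m].T @ (A[m] @ x - b[m])

def total_loss(x):
    return sum(0.5 * np.linalg.norm(A[m] @ x - b[m]) ** 2 for m in range(M))

x_global = np.zeros(d)           # parameters shared across workers, synced once per round
buf = np.zeros(d)                # outer (Nesterov) momentum buffer
inner_lr, outer_lr, beta = 1e-3, 0.7, 0.9

for t in range(T):
    # Inner phase: each worker takes H local SGD steps from the shared point.
    local_finals = []
    for m in range(M):
        x = x_global.copy()
        for _ in range(H):
            x -= inner_lr * grad(m, x)
        local_finals.append(x)

    # Pseudo-gradient: displacement from the shared point to the worker average.
    pseudo_grad = x_global - np.mean(local_finals, axis=0)

    # Outer phase: Nesterov-style momentum step on the pseudo-gradient.
    buf = beta * buf + pseudo_grad
    x_global -= outer_lr * (pseudo_grad + beta * buf)

    if t % 10 == 0 or t == T - 1:
        print(f"round {t:2d}  total loss {total_loss(x_global):10.3f}")
```

In an actual DiLoCo run the inner optimizer is AdamW and communication happens only once per outer round, which is the source of the reduced communication costs mentioned above.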

In this talk, we argue that DiLoCo's effectiveness stems primarily from this multi-step Nesterov mechanism, a benefit that persists even in non-distributed settings. As a consequence, the number of inner steps acts as a single hyperparameter coupling the local SGD component and the multi-step Nesterov method. To address this, we propose Generalized Primal Averaging (GPA), a principled extension of Nesterov momentum that, like DiLoCo, wraps a base optimizer (i.e., AdamW). This formulation removes DiLoCo's rigid two-loop structure in the non-distributed setting by decoupling the interpolation constants in Nesterov's primal averaging formulation.
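
For readers unfamiliar with the primal-averaging view of momentum, the sketch below writes a Nesterov-style method as an interpolation between a fast iterate and an averaged iterate. The symbols (z_t for the fast iterate, x_t for the averaged point, d_t for the base optimizer's step direction, and gamma_t, c_t for the constants) and the closing remark about decoupling are illustrative assumptions, not the paper's notation or the exact GPA update.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Primal-averaging form of momentum: a fast iterate z_t is moved by the base
% optimizer's step direction d_t (evaluated at the averaged point x_t), and
% x_t is an interpolation of its previous value with z_{t+1}.
\begin{align*}
  z_{t+1} &= z_t - \gamma_t \, d_t(x_t), \\
  x_{t+1} &= (1 - c_{t+1})\, x_t + c_{t+1}\, z_{t+1}.
\end{align*}
% Classical Nesterov momentum corresponds to a specific coupled schedule for
% \gamma_t and c_t; a generalized scheme in the spirit of the abstract treats
% the interpolation constants as free hyperparameters, decoupled from the
% base-optimizer step size. The exact GPA parameterization is not given here.
\end{document}
```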

We demonstrate that GPA achieves significant speedups compared to AdamW on Llama models, motivating new algorithmic directions for re-thinking DiLoCo for distributed training from first principles.
