Gradient Multi-Normalization for Stateless and Scalable LLM Training

arXiv:2502.06742v1 Announce Type: new
Abstract: Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015) which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et a…
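The memory overhead the abstract refers to comes from Adam maintaining two extra buffers, a first- and a second-moment estimate, each the same size as the parameters themselves. A minimal sketch of one Adam step (Kingma & Ba, 2015) in plain Python, to make that state explicit (this illustrates the baseline, not the paper's stateless method):

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. `m` and `v` are the per-parameter state buffers:
    together they roughly triple optimizer memory versus a stateless method."""
    new_p, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = b1 * mi + (1 - b1) * g          # first-moment (mean) estimate
        vi = b2 * vi + (1 - b2) * g * g      # second-moment estimate
        m_hat = mi / (1 - b1 ** t)           # bias correction for step t
        v_hat = vi / (1 - b2 ** t)
        new_p.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_p, new_m, new_v

# First step on a toy 4-parameter model: m and v must persist across steps.
params = [1.0, 1.0, 1.0, 1.0]
grads = [0.5, 0.5, 0.5, 0.5]
m = [0.0] * 4
v = [0.0] * 4
params, m, v = adam_step(params, grads, m, v, t=1)
```

Stateless approaches like the one proposed here aim to remove the `m` and `v` buffers entirely while retaining comparable convergence.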
