How to Scale Your EMA
Apple Machine Learning Research
*=Equal Contributors

Preserving training dynamics across batch sizes is an important tool for practical machine learning, as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale…
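To illustrate what a scaling rule looks like in practice, here is a minimal sketch of the widely used linear scaling rule for SGD, in which the learning rate grows in proportion to the batch size. The excerpt is cut off before it states its rule, so treating this as the rule the authors have in mind is an assumption. The function name `linear_scaled_lr` and its parameters are hypothetical and are not taken from the post.

```python
def linear_scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Return a learning rate scaled in proportion to the batch size.

    Assumption: this is the common linear scaling rule for SGD,
    not necessarily the exact rule stated in the truncated excerpt.
    """
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    # kappa is the factor by which the batch size changes.
    kappa = new_batch_size / base_batch_size
    return base_lr * kappa


# Quadrupling the batch size quadruples the learning rate under this rule.
print(linear_scaled_lr(0.1, 256, 1024))
```

Under this sketch, a run tuned at one batch size can be moved to a larger batch (fewer, larger steps) while aiming to keep the training dynamics roughly comparable; how well that holds depends on the optimizer and the regime.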