Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers Apple Machine Learning Research
Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representations by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction…
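To make the EMA-updated teacher mentioned above concrete, here is a minimal sketch of the generic EMA weight update, assuming weights are plain lists of floats. The function name `ema_update` and the `momentum` parameter are hypothetical names chosen for illustration; this is not the paper's training code.

```python
def ema_update(teacher, student, momentum=0.999):
    """Blend student weights into the teacher.

    Each new teacher weight is
        momentum * teacher_weight + (1 - momentum) * student_weight.
    The names and the default momentum are illustrative assumptions.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


# Toy weights: the teacher drifts slowly toward the student.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, momentum=0.9)

# With momentum = 1.0 the student term vanishes and the teacher stays fixed.
frozen = ema_update([0.5, -0.5], [9.0, 9.0], momentum=1.0)
```

Because the teacher is a running average of the student, the two must share the same weight shapes, which is one way to read the coupling between teacher and student architectures noted above. A teacher that is never updated (the `momentum=1.0` case here) has no such constraint.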