V-JEPA 2 ViT-g/16 384 (action-conditioned latent video predictor)

Architecture diagram