Loading paper
ViSTa Dataset: Do vision-language models understand sequential tasks? | Tomesphere