SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models