Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts