Benchmarking PhD-Level Coding in 3D Geometric Computer Vision
Wenyi Li, Renkai Luo, Yue Yu, Huan-ang Gao, Mingju Gao, Li Yuan, Chaoyou Fu, Hao Zhao

TL;DR
GeoCodeBench is a new benchmark for evaluating AI models' ability to generate correct 3D geometric vision code, revealing current models' limited performance and highlighting challenges in scientific coding comprehension.
Contribution
The paper introduces GeoCodeBench, a comprehensive benchmark for PhD-level 3D vision coding tasks, with a novel evaluation framework and insights into model capabilities.
Findings
The best model, GPT-5, achieves a pass rate of only 36.6% on the benchmark.
Research tasks are significantly harder than general 3D capability tasks.
Shorter context inputs (the paper truncated at the end of the Method section) outperform full-paper inputs.
Abstract
AI-assisted coding has rapidly reshaped software practice and research workflows, yet today's models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains…
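To make the fill-in-the-function format concrete, here is a minimal sketch of how one such task and its automatic scoring might look. Everything in it is an assumption for illustration: the target function (`axis_angle_to_matrix`), the test cases, the convention that a zero-length axis maps to the identity, and the `pass_rate` helper are hypothetical and are not taken from GeoCodeBench. This excerpt also does not say whether the benchmark's pass rate is counted per test or per problem; the sketch counts per test.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def _rodrigues(x, y, z, angle):
    """Rodrigues' rotation formula; correct only if (x, y, z) is a unit axis."""
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def axis_angle_to_matrix(axis, angle):
    """Hypothetical target function: the body a model would be asked to fill in."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    # Assumed convention: a degenerate axis or a zero angle yields the identity.
    if norm < 1e-12 or abs(angle) < 1e-12:
        return [row[:] for row in IDENTITY]
    return _rodrigues(x / norm, y / norm, z / norm, angle)


def unnormalized_candidate(axis, angle):
    """A plausible wrong submission: skips normalization and the degenerate case."""
    return _rodrigues(axis[0], axis[1], axis[2], angle)


def matrices_close(a, b, tol=1e-9):
    """Element-wise comparison of two 3x3 matrices within an absolute tolerance."""
    return all(abs(a[i][j] - b[i][j]) <= tol for i in range(3) for j in range(3))


# Hypothetical unit tests: one nominal case plus three edge cases.
TEST_CASES = [
    (((0.0, 0.0, 1.0), math.pi / 2), [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
    (((0.0, 0.0, 0.0), 1.0), IDENTITY),   # zero-length axis
    (((1.0, 0.0, 0.0), 0.0), IDENTITY),   # zero rotation angle
    (((0.0, 0.0, 2.0), math.pi), [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]),  # non-unit axis
]


def pass_rate(candidate, cases):
    """Fraction of test cases passed; an exception counts as a failure."""
    passed = 0
    for args, expected in cases:
        try:
            if matrices_close(candidate(*args), expected):
                passed += 1
        except Exception:
            pass
    return passed / len(cases)


if __name__ == "__main__":
    print("reference:", pass_rate(axis_angle_to_matrix, TEST_CASES))
    print("unnormalized:", pass_rate(unnormalized_candidate, TEST_CASES))
```

The unnormalized variant passes the nominal case and the zero-angle case but fails the two axis edge cases. That is the kind of gap edge-case tests are meant to expose: code that looks right on typical inputs but breaks on degenerate geometry.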
