React-ing to Grace Hopper 200: Five Open-Weights Coding Models, One React Native App, One GH200, One Weekend
Alex Potanin

TL;DR
This paper evaluates five open-weights coding models on a single React Native app generation task running on NVIDIA GH200 hardware. It finds that models ranked lower on SWE-Bench can outperform higher-ranked ones, and it reports deployment findings and hardware efficiency trends.
Contribution
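It evaluates open-weights coding models on a practical multi-file app generation task, judging whether each generated project runs out-of-the-box and whether its features are correct, and it documents deployment and hardware efficiency findings observed along the way.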
Findings
Kimi-K2.5 at aggressive 3-bit quantization (UD-Q3_K_XL) outranks models with substantially higher SWE-Bench Pro scores.
The default temperature=0 in coding tools causes sampling hangs with reasoning-model architectures.
Web-platform adaptation of mobile APIs is a universal training-data gap.
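To make the web-adaptation finding concrete, here is a minimal sketch of the kind of platform switch such a task requires, combined with the per-user per-day counting the task specifies. Every name in it (KeyValueStore, MemoryStore, pickStore, counterKey, increment) is hypothetical and not taken from the paper or from any model's output. It also assumes that a browser-style localStorage global marks the web target, whereas a real React Native app would more likely branch on the framework's own platform check.

```typescript
// Hypothetical names throughout; not taken from the paper or any generated app.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// In-memory fallback for environments without a browser-style storage API.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.has(key) ? (this.data.get(key) as string) : null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}

// Assumption: a global localStorage means we are on the web target.
function pickStore(): KeyValueStore {
  const g = globalThis as unknown as { localStorage?: KeyValueStore };
  return g.localStorage ?? new MemoryStore();
}

// One counter per user per calendar day (UTC), e.g. "count:alice:2025-01-31".
function counterKey(userId: string, day: Date): string {
  return `count:${userId}:${day.toISOString().slice(0, 10)}`;
}

// Read the current count for (user, day), add one, store it, and return it.
function increment(store: KeyValueStore, userId: string, day: Date): number {
  const key = counterKey(userId, day);
  const next = Number(store.getItem(key) ?? "0") + 1;
  store.setItem(key, String(next));
  return next;
}
```

Passing the store in as a parameter keeps the counting logic identical on every platform, so only pickStore needs to know which environment it is running in.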
Abstract
We evaluate five state-of-the-art open-weights coding language models -- Kimi-K2.5 (at Q3 and Q4 quantizations), GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2 -- on a single multi-file React Native application generation task on NVIDIA GH200 576 GB hardware. The task specifies authentication, per-user per-day counting, and web compatibility, and is evaluated on whether the generated project runs out-of-the-box and on feature-level correctness. We find that SWE-Bench rankings do not predict task performance: Kimi-K2.5 at aggressive 3-bit quantization (UD-Q3_K_XL, 480 GB) produces the most complete and specification-compliant output, outranking models with substantially higher SWE-Bench Pro scores. We document three novel deployment findings: (1) default temperature=0 in coding tools causes sampling hangs with reasoning-model architectures, (2) reasoning-model thinking traces can leak…
