Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Éric Jacopin

TL;DR
This study shows that co-locating tests with implementation code significantly improves AI-generated code quality, with inline tests leading to higher correctness and preservation across multiple models.
Contribution
It provides the first large-scale empirical evidence that test syntax structure influences AI code generation quality, highlighting the importance of software design choices.
Findings
Inline tests achieve near-perfect preservation (100%) and correctness (92-100%) across all models.
Separated tests expose stark model-tier gaps in correctness (0-100%).
Inline test markers receive stronger attention in transformer models.
Abstract
AI coding assistants increasingly generate code alongside tests. How developers structure test code, whether inline with the implementation or in separate blocks, has traditionally been a matter of testing philosophy. We investigate whether this choice affects AI code generation quality. We conduct a large-scale empirical study (830+ generated files, 12 models, 3 providers) using SEGA, a three-dimensional evaluation framework measuring Determinism, Preservation, and Correctness. Comparing inline test syntax (Python doctests) against separated test syntax (Rust #[test] blocks) on a d-ary heap implementation, we find that: (1) inline tests yield near-perfect preservation (100%) and correctness (92-100%) across all models; (2) separated tests expose stark model-tier gaps (0-100% correctness) and independence between preservation and correctness; (3) model behavior evolves across…
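To make "inline test syntax" concrete, here is a minimal sketch of a d-ary min-heap whose tests live in doctests next to the code they exercise. It is an illustration only, not the implementation or prompt material used in the study: the class name `DaryHeap`, its methods, and the specific test cases are all assumptions chosen for brevity.

```python
class DaryHeap:
    """Minimal d-ary min-heap (hypothetical sketch, not the study's code).

    The doctests below are the "inline" tests: they sit in the same
    block as the implementation they check.

    >>> h = DaryHeap(d=3)
    >>> for x in [5, 1, 4, 2]:
    ...     h.push(x)
    >>> h.pop()
    1
    >>> h.pop()
    2
    >>> len(h)
    2
    """

    def __init__(self, d=2):
        if d < 2:
            raise ValueError("d must be at least 2")
        self.d = d          # number of children per node
        self.items = []     # heap stored as a flat list

    def __len__(self):
        return len(self.items)

    def push(self, value):
        """Insert a value and sift it up toward the root.

        >>> h = DaryHeap(d=4)
        >>> h.push(3)
        >>> h.push(1)
        >>> h.items[0]
        1
        """
        self.items.append(value)
        i = len(self.items) - 1
        while i > 0:
            parent = (i - 1) // self.d
            if self.items[i] >= self.items[parent]:
                break
            self.items[i], self.items[parent] = self.items[parent], self.items[i]
            i = parent

    def pop(self):
        """Remove and return the smallest value.

        >>> h = DaryHeap(d=2)
        >>> h.push(7)
        >>> h.pop()
        7
        """
        if not self.items:
            raise IndexError("pop from empty heap")
        top = self.items[0]
        last = self.items.pop()
        if self.items:
            # Move the last element to the root and sift it down.
            self.items[0] = last
            i, n = 0, len(self.items)
            while True:
                first = self.d * i + 1
                if first >= n:
                    break
                children = range(first, min(first + self.d, n))
                smallest = min(children, key=self.items.__getitem__)
                if self.items[smallest] >= self.items[i]:
                    break
                self.items[i], self.items[smallest] = self.items[smallest], self.items[i]
                i = smallest
        return top
```

Saved to a file, these examples can be checked with `python -m doctest <file>`. In the separated style the abstract contrasts this with, the same checks would instead sit in a dedicated test block away from the methods, as with Rust `#[test]` functions.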
