Semantic Voting: Execution-Grounded Consensus for LLM Code Generation
Shan Jiang, Zijian Yi, Chenguang Zhu

TL;DR
This paper compares execution-grounded consensus methods for selecting among LLM-generated code candidates. Execution-based selectors outperform output-pattern majority voting, and both input quality and thinking level affect how well selection works.
Contribution
It introduces SemanticVote, a method that clusters candidates by their execution fingerprints on LLM-generated inputs, and analyzes 18 configurations across models, thinking levels, and benchmarks.
Findings
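The best execution-based selector exceeds output-pattern majority voting by 19-52 percentage points on every configuration; every execution-based selector exceeds it by at least 18 points.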
Input quality significantly impacts selection effectiveness, with sketch-based inputs outperforming direct LLM generation.
Deeper thinking improves majority voting but not execution-based methods, which are more sensitive to candidate diversity.
Abstract
LLM code-generation pipelines often sample multiple candidates and select one final answer without access to a complete oracle. Existing pipelines mix textual voting, ranking, and execution-based agreement, but the relative contribution of each component remains unclear. We study 18 configurations across different models, thinking levels, and benchmarks, comparing output-pattern majority voting, weighted voting, MBR-Exec, and SemanticVote, a method that clusters candidates by execution fingerprints on LLM-generated inputs. Three findings emerge. (1) The best execution-based selector exceeds output-pattern majority voting by 19-52 percentage points on every configuration, with every execution-based selector exceeding it by at least 18 points. (2) Once candidates are executed on diverse inputs, the aggregation rule has limited effect: SemanticVote, weighted voting, and MBR-Exec are…
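The idea of clustering candidates by execution fingerprints can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes candidates are already available as Python callables and that the largest cluster wins, with ties broken by the lowest candidate index. The names fingerprint and select_by_fingerprint are hypothetical, and a real pipeline would run untrusted generated code in a sandbox with timeouts.

```python
from collections import defaultdict
from typing import Any, Callable, List, Sequence, Tuple

Fingerprint = Tuple[Tuple[str, str], ...]


def fingerprint(fn: Callable[[Any], Any], inputs: Sequence[Any]) -> Fingerprint:
    """Run one candidate on every input and record what it did.

    Each entry is ("ok", repr(output)) or ("err", exception type name),
    so two candidates share a fingerprint only if they behave the same
    way on all inputs.
    """
    results = []
    for x in inputs:
        try:
            results.append(("ok", repr(fn(x))))
        except Exception as exc:  # assumption: any exception counts as behavior
            results.append(("err", type(exc).__name__))
    return tuple(results)


def select_by_fingerprint(
    candidates: Sequence[Callable[[Any], Any]], inputs: Sequence[Any]
) -> Tuple[int, List[int]]:
    """Group candidates by fingerprint and pick one from the largest group.

    Returns (index of the chosen candidate, indices in the winning cluster).
    Ties between equally large clusters go to the one containing the
    lowest candidate index (an assumption made for determinism).
    """
    clusters = defaultdict(list)
    for idx, fn in enumerate(candidates):
        clusters[fingerprint(fn, inputs)].append(idx)
    best = max(clusters.values(), key=lambda members: (len(members), -members[0]))
    return best[0], best


# Hypothetical candidates for "return the absolute value of x".
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,                    # wrong on negative inputs
    lambda x: 1 // x and abs(x),    # raises ZeroDivisionError on 0
]
inputs = [-2, 0, 3]  # stand-in for LLM-generated test inputs

chosen, cluster = select_by_fingerprint(candidates, inputs)
print(chosen, cluster)  # 0 [0, 1]
```

The first two candidates differ textually but agree on every input, so they form the largest cluster. A vote over program text would treat them as distinct, which is the gap execution grounding is meant to close.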
