QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang

TL;DR
This paper introduces QAQ, a novel data selection framework that assesses synthetic code data quality through bidirectional semantic coherence using Reverse Mutual Information, leading to more effective data curation for code generation models.
Contribution
The paper proposes a new data selection method based on reverse mutual information and model disagreement, improving synthetic data quality assessment for code generation.
Findings
Selecting 25% of data with RMI matches full-data performance.
Stratified RMI outperforms existing data selection methods.
Bidirectional semantic coherence enhances synthetic data quality.
Abstract
Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A \mid Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q \mid A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while…
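The abstract does not give the exact formula for RMI, so the following is only a minimal sketch under stated assumptions. It treats RMI as a length-normalized pointwise mutual information, i.e. the average per-token difference between log P(query token | answer, previous query tokens) and log P(query token | previous query tokens). The function names `rmi` and `keep_middle_band`, the per-token normalization, and the tail fractions are all hypothetical. The token log-probabilities are assumed to come from some language model that is not shown here. Dropping both tails is a simple stand-in for the observation that both extremes of RMI signal quality issues; it is not the paper's stratified selection procedure.

```python
def rmi(logp_q_given_a, logp_q):
    """Average per-token information gain about the query from the answer.

    logp_q_given_a: log-probabilities of each query token with the answer in context.
    logp_q:         log-probabilities of the same query tokens without the answer.
    Both lists are assumed to be precomputed by a language model (not shown).
    """
    if len(logp_q_given_a) != len(logp_q) or not logp_q:
        raise ValueError("need two non-empty lists of equal length")
    gains = [cond - uncond for cond, uncond in zip(logp_q_given_a, logp_q)]
    return sum(gains) / len(gains)


def keep_middle_band(scores, low_frac=0.1, high_frac=0.1):
    """Return indices of samples whose score falls outside both tails.

    Assumption: the tail fractions are illustrative defaults, not values
    taken from the paper.
    """
    order = sorted(range(len(scores)), key=scores.__getitem__)
    lo = int(len(order) * low_frac)
    hi = len(order) - int(len(order) * high_frac)
    return sorted(order[lo:hi])


if __name__ == "__main__":
    # Toy log-probabilities for one two-token query (made-up numbers).
    print(rmi([-1.0, -2.0], [-3.0, -2.0]))
    # Toy RMI scores for five samples; drop the lowest and highest 20%.
    print(keep_middle_band([0.1, 5.0, 1.0, 2.0, -3.0], 0.2, 0.2))
```

In this sketch a score near zero means the answer tells the model almost nothing about the query, which corresponds to the semantic misalignment case described above.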
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Machine Learning and Algorithms
