Bootstrapping Lexical Choice via Multiple-Sequence Alignment
Regina Barzilay, Lillian Lee

TL;DR
This paper introduces an automatic method for building the mapping lexicon of a natural language generation system: it applies multiple-sequence alignment to multi-parallel corpora instead of relying on hand-built dictionaries.
Contribution
It presents a new multiple-pass alignment algorithm that leverages multi-parallel datasets to automatically acquire lexicons, reducing reliance on labor-intensive knowledge-based methods.
Findings
Generated natural language versions of computer-generated mathematical proofs with high readability.
Achieved faithfulness to the semantic input comparable to that of traditional systems.
Evaluated with a dozen human judges.
Abstract
An important component of any generation system is the mapping dictionary, a lexicon of elementary semantic expressions and corresponding natural language realizations. Typically, labor-intensive knowledge-based methods are used to construct the dictionary. We instead propose to acquire it automatically via a novel multiple-pass algorithm employing multiple-sequence alignment, a technique commonly used in bioinformatics. Crucially, our method leverages latent information contained in multi-parallel corpora -- datasets that supply several verbalizations of the corresponding semantics rather than just one. We used our techniques to generate natural language versions of computer-generated mathematical proofs, with good results on both a per-component and overall-output basis. For example, in evaluations involving a dozen human judges, our system produced output whose readability and…
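To make the alignment idea concrete, here is a minimal sketch of aligning two verbalizations of the same semantics at the word level. It is not the paper's multiple-pass, multiple-sequence algorithm: it is a plain pairwise global alignment (Needleman-Wunsch), shown only as the basic building block. The function names `align` and `paraphrase_candidates`, the scoring values, and the example sentences are all assumptions made for illustration.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists (Needleman-Wunsch).

    Returns a list of (token_a, token_b) columns; None marks a gap.
    The scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] aligned to a gap
                score[i][j - 1] + gap,      # b[j-1] aligned to a gap
            )

    # Trace back from the bottom-right corner, preferring the diagonal move.
    cols = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                cols.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            cols.append((a[i - 1], None))
            i -= 1
        else:
            cols.append((None, b[j - 1]))
            j -= 1
    cols.reverse()
    return cols


def paraphrase_candidates(cols):
    """Columns where both sides hold a word and the words differ."""
    return [(x, y) for x, y in cols if x is not None and y is not None and x != y]


# Two hypothetical verbalizations of the same semantic input.
first = ["assume", "a", "equals", "zero"]
second = ["suppose", "a", "is", "zero"]
print(paraphrase_candidates(align(first, second)))
# [('assume', 'suppose'), ('equals', 'is')]
```

Columns where the two verbalizations agree anchor the alignment, and columns where they differ suggest alternative realizations of the same content, which is the kind of latent information a multi-parallel corpus supplies. A full multiple-sequence alignment would extend this from two sequences to many.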
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
