REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction
Omar Sharif, Joseph Gatto, Madhusudan Basak, Sarah M. Preum

TL;DR
REGen is a new evaluation framework for generative event argument extraction that combines multiple matching strategies to better reflect true model performance, especially for large language models, and aligns well with human judgment.
Contribution
The paper introduces REGen, an evaluation framework that extends exact match (EM) with relaxed and LLM-based matching, so that semantically correct predictions from generative models are credited rather than counted as errors.
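To make the layered idea concrete, here is a minimal sketch of how exact, relaxed, and judge-based matching could be chained when scoring predicted arguments. It is an illustration under stated assumptions, not the paper's implementation: the normalization rules, the greedy one-to-one pairing, and the names `normalize`, `match_level`, `f1`, and `substring_judge` are all hypothetical, and `substring_judge` is only a stand-in for a real LLM call.

```python
import re
from typing import Callable, Optional, Sequence

Judge = Callable[[str, str], bool]

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, drop articles (assumed rules)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)

def match_level(pred: str, gold: str, judge: Optional[Judge] = None) -> Optional[str]:
    """Return the first level at which pred matches gold, or None if none does."""
    if pred == gold:
        return "exact"
    if normalize(pred) == normalize(gold):
        return "relaxed"
    if judge is not None and judge(pred, gold):
        return "judge"
    return None

def f1(preds: Sequence[str], golds: Sequence[str],
       allowed: Sequence[str] = ("exact", "relaxed", "judge"),
       judge: Optional[Judge] = None) -> float:
    """F1 over arguments, pairing each prediction greedily with one unused gold."""
    unmatched = list(golds)
    tp = 0
    for p in preds:
        for g in unmatched:
            if match_level(p, g, judge) in allowed:
                unmatched.remove(g)  # each gold argument can be credited once
                tp += 1
                break
    if tp == 0:
        return 0.0
    precision = tp / len(preds)
    recall = tp / len(golds)
    return 2 * precision * recall / (precision + recall)

def substring_judge(pred: str, gold: str) -> bool:
    """Hypothetical stand-in for an LLM judge: accept if gold appears inside pred."""
    return gold.lower() in pred.lower()

golds = ["the city council", "Tuesday"]
preds = ["City Council", "on Tuesday morning"]

em_score = f1(preds, golds, allowed=("exact",))
relaxed_score = f1(preds, golds, allowed=("exact", "relaxed"))
full_score = f1(preds, golds, judge=substring_judge)
```

On this toy input the score rises as each level is enabled, which mirrors the motivation for moving beyond EM. The cheaper checks run first so that the judge is consulted only for pairs that exact and relaxed matching cannot resolve; a real judge would need its own validation against human labels.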
Findings
Scoring with REGen instead of EM reveals an average gain of 23.93 F1 points.
REGen achieves 87.67% alignment with human judgment.
Experiments on six datasets demonstrate REGen's effectiveness.
Abstract
Event argument extraction identifies arguments for predefined event roles in text. Existing work evaluates this task with exact match (EM), where predicted arguments must align exactly with annotated spans. While suitable for span-based models, this approach falls short for large language models (LLMs), which often generate diverse yet semantically accurate arguments. EM severely underestimates performance by disregarding valid variations. Furthermore, EM evaluation fails to capture implicit arguments (unstated but inferable) and scattered arguments (distributed across a document). These limitations underscore the need for an evaluation framework that better captures models' actual performance. To bridge this gap, we introduce REGen, a Reliable Evaluation framework for Generative event argument extraction. REGen combines the strengths of exact, relaxed, and LLM-based matching to better…
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
