Resolving the Imbalance Issue in Hierarchical Disciplinary Topic Inference via LLM-based Data Augmentation

Xunxin Cai; Meng Xiao; Zhiyuan Ning; Yuanchun Zhou

arXiv:2310.05318 · cs.CL · October 17, 2023

Open Access

TL;DR

This paper proposes using Llama V1-based data augmentation with keyword prompts to address data imbalance in disciplinary research proposals, improving the fairness and accuracy of topic models and reviewer assignment systems.

Contribution

It introduces a novel LLM-based data augmentation method tailored for complex scientific texts within hierarchical disciplinary structures.

Findings

01. Generated proposals effectively balance data distribution
02. Augmentation improves downstream topic model accuracy
03. Enhanced fairness in reviewer assignment systems

Abstract

Text data augmentation methods have emerged as pivotal solutions to the problem of data imbalance in Natural Language Processing. Such imbalance is prevalent in the research proposals submitted during the funding application process. It results from the varying popularity of disciplines or the emergence of interdisciplinary studies, and it significantly impedes the precision of downstream topic models that infer the affiliated disciplines of these proposals. At the data level, proposals written by experts and scientists are inherently complex technological texts, replete with intricate terminology, so augmenting such specialized text data poses unique challenges. At the system level, the imbalance in turn compromises the fairness of AI-assisted reviewer assignment systems, which makes resolving it a priority. This study leverages large…

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Software Engineering Research