VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models

Ming Cheng; Jiaying Gong; Chenhan Yuan; William A. Ingram; Edward Fox; Hoda Eldardiry

arXiv:2411.04825·cs.CL·February 25, 2025

TL;DR

This paper introduces VTechAGP, a novel dataset for academic-to-general text paraphrasing at the document level, and proposes DSPT5, a dynamic prompt-based generative model that outperforms large language models on this task.

Contribution

The paper provides the first academic-to-general paraphrase dataset and develops DSPT5, a novel dynamic soft prompt model with a contrastive-generative training approach.

Findings

1. DSPT5 achieves competitive results compared to larger models.

2. State-of-the-art LLMs underperform on this specific paraphrasing task.

3. The dataset enables benchmarking for academic-to-general-audience text paraphrasing.

Abstract

Existing text simplification or paraphrase datasets mainly focus on sentence-level text generation in a general domain. These datasets are typically developed without using domain knowledge. In this paper, we release a novel dataset, VTechAGP, which is the first academic-to-general-audience text paraphrase dataset consisting of document-level thesis and dissertation academic and general-audience abstract pairs from 8 colleges authored over 25 years. We also propose a novel dynamic soft prompt generative language model, DSPT5. For training, we leverage a contrastive-generative loss function to learn the keyword vectors in the dynamic prompt. For inference, we adopt a crowd-sampling decoding strategy at both semantic and structural levels to further select the best output candidate. We evaluate DSPT5 and various state-of-the-art large language models (LLMs) from multiple perspectives.…
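
The abstract does not spell out how the crowd-sampling decoding step selects among candidates. As a rough illustration only, the sketch below assumes a consensus-style selection: generate several candidate paraphrases, score each by its average agreement with the others, and keep the one that agrees most. The function names `token_f1` and `select_by_consensus` are hypothetical, and the token-overlap F1 is a simple stand-in for the semantic and structural measures the abstract mentions; none of this is taken from the paper's implementation.

```python
from collections import Counter


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (a stand-in similarity metric)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    # Multiset intersection counts each shared token at most min(count_a, count_b) times.
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def select_by_consensus(candidates: list[str]) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(token_f1(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]


# Hypothetical candidate outputs, e.g. sampled from a generative model.
candidates = [
    "the model rewrites research abstracts for general readers",
    "the model rewrites research abstracts for a general audience",
    "bananas are a good source of potassium",
]
chosen = select_by_consensus(candidates)
print(chosen)
```

In this toy input the off-topic third candidate shares almost nothing with the other two, so it scores lowest and one of the two mutually similar paraphrases is kept. A closer approximation of the described strategy would swap `token_f1` for an embedding-based semantic score combined with a structural one.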

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Natural Language Processing Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate · Focus