Evaluating Differentially Private Generation of Domain-Specific Text

Yidan Sun; Viktor Schlegel; Srinivasan Nandakumar; Iqra Zahid; Yuping Wu; Warren Del-Pinto; Goran Nenadic; Siew-Kei Lam; Jie Zhang; Anil A Bharath

arXiv:2508.20452·cs.LG·September 1, 2025


TL;DR

This paper introduces a benchmark for evaluating the quality of domain-specific text generated under differential privacy, revealing current methods' limitations in utility and fidelity, especially under strict privacy constraints.

Contribution

It provides a unified benchmark for systematic evaluation of differentially private text generation across domains, addressing key challenges and setting standards for future research.

Findings

01. Significant utility and fidelity degradation under strict privacy constraints

02. Current privacy-preserving methods have notable limitations in real-world scenarios

03. Benchmark facilitates realistic evaluation of differentially private text generation


Tables

Table 1. Benchmark characteristics: number of labels ($|C|$), dataset size ($|D|$), average/90th percentile token length ($|\overline{L}|$) and data access.

Dataset	$|C|$	$|D|$	$|\overline{L}|$	Access mechanism
HoC	10	10301	37/58	free on HuggingFace
N2C2’08	16	620	1985/3070	DUA & manual approval
PsyTAR	7	5102	19/34	DUA
DMSAFN	3	3876	30/50	free on HuggingFace
AsyLax	3	7999	2326/3442	free on HuggingFace

Table 2. Random/majority guess baselines, and average/best downstream classification F1 scores for three models per dataset across different privacy budgets ($\epsilon$).

Method		$\epsilon = \infty$	$\epsilon = 4$	$\epsilon = 2$	$\epsilon = 1$	$\epsilon = 0.5$
HoC	(Random: 3.7, Majority: 9.1, Original: 71.6/74.3)
DP-Gen (avg / best)		52.0 / 58.6	15.5 / 17.9	14.4 / 15.7	11.0 / 11.5	13.6 / 14.9
AUG-PE (avg / best)		15.8 / 19.7	10.3 / 12.1	7.5 / 8.6	8.1 / 9.4	6.7 / 10.0
N2C2’08	(Random: 46.1, Majority: 53.2, Original: 73.1/87.7)
DP-Gen (avg / best)		54.9 / 58.2	53.2 / 53.2	47.8 / 53.2	53.7 / 54.7	57.1 / 61.3
AUG-PE (avg / best)		55.9 / 57.6	55.8 / 60.9	54.7 / 57.7	56.8 / 60.0	55.7 / 56.9
PsyTAR	(Random: 25.4, Majority: 41.8, Original: 80.7/82.1)
DP-Gen (avg / best)		69.5 / 70.1	41.6 / 44.0	40.1 / 42.5	40.8 / 42.3	39.2 / 39.1
AUG-PE (avg / best)		65.9 / 67.3	62.1 / 62.6	58.4 / 60.4	56.0 / 57.4	45.7 / 48.1
DMSAFN	(Random: 30.5, Majority: 41.1, Original: 76.8/95.2)
DP-Gen (avg / best)		62.7 / 91.6	47.5 / 70.1	46.5 / 65.8	49.4 / 69.6	48.0 / 65.0
AUG-PE (avg / best)		51.0 / 65.6	51.0 / 65.9	51.7 / 70.9	47.9 / 61.3	50.0 / 65.9
AsyLax	(Random: 38.4, Majority: 51.4, Original: 64.9/69.0)
DP-Gen (avg / best)		60.2 / 61.5	34.8 / 36.2	34.9 / 39.3	47.6 / 48.3	37.0 / 48.0
AUG-PE (avg / best)		50.5 / 51.6	49.3 / 52.1	51.5 / 51.5	51.9 / 52.9	51.4 / 51.5

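Table 3. MAUVE ($\mathcal{M}$) as well as entity ($\mathcal{N}$) and text length ($\mathcal{L}$) distribution divergences (fidelity) for different approaches and privacy budgets ($\epsilon$) across datasets.

Method	$\epsilon = \infty$	$\epsilon = 4$	$\epsilon = 2$	$\epsilon = 1$	$\epsilon = 0.5$
	$\mathcal{M} \uparrow / \mathcal{N} \downarrow / \mathcal{L} \downarrow$	$\mathcal{M} \uparrow / \mathcal{N} \downarrow / \mathcal{L} \downarrow$	$\mathcal{M} \uparrow / \mathcal{N} \downarrow / \mathcal{L} \downarrow$	$\mathcal{M} \uparrow / \mathcal{N} \downarrow / \mathcal{L} \downarrow$	$\mathcal{M} \uparrow / \mathcal{N} \downarrow / \mathcal{L} \downarrow$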
HoC	(Original: 0.99/1.04/0.004)
DP-Gen	0.65/2.10/0.06	0.18/4.33/0.40	0.16/4.35/0.417	0.14/4.37/0.44	0.12/4.55/0.47
AUG-PE	0.01/2.87/1.49	0.01/3.19/1.40	0.01/3.29/1.35	0.01/4.04/1.25	0.01/4.85/1.25
N2C2’08	(Original: 0.99/0.70/0.17)
DP-Gen	0.42/1.57/0.75	0.02/9.98/1.37	0.02/9.91/1.40	0.02/9.58/1.42	0.02/9.86/1.54
AUG-PE	0.02/8.55/3.72	0.03/8.50/3.80	0.02/8.31/3.87	0.02/8.53/3.63	0.02/8.78/3.42
PsyTAR	(Original: 0.99/0.86/0.007)
DP-Gen	0.61/2.00/0.03	0.42/2.71/0.03	0.34/2.92/0.05	0.37/3.55/0.03	0.35/4.01/0.03
AUG-PE	0.02/4.31/3.21	0.02/5.02/3.27	0.02/5.06/3.23	0.02/5.35/3.34	0.02/6.29/3.38
DMSAFN	(Original: 0.86/0.71/0.012)
DP-Gen	0.21/1.72/0.06	0.18/1.94/0.02	0.16/2.03/0.04	0.15/2.22/0.03	0.12/2.36/0.06
AUG-PE	0.01/3.25/2.36	0.01/3.87/2.35	0.01/4.19/2.34	0.01/5.77/2.24	0.01/3.98/2.30
AsyLax	(Original: 0.98/0.11/0.004)
DP-Gen	0.03/1.42/0.43	0.01/3.82/6.41	0.01/4.03/5.95	0.01/8.06/3.58	0.01/7.66/1.23
AUG-PE	0.01/6.15/1.41	0.01/5.76/1.41	0.01/6.42/1.92	0.01/7.43/6.12	0.01/7.36/1.41



Full text

Evaluating Differentially Private Generation of Domain-Specific Text

Yidan Sun (ORCID 0000-0002-3607-4963), Imperial College London, Imperial Global Singapore, Singapore

Viktor Schlegel, Imperial College London, Imperial Global Singapore, Singapore

Srinivasan Nandakumar, Imperial College London, Imperial Global Singapore, Singapore

Iqra Zahid, Imperial College London, Imperial Global Singapore, Singapore

Yuping Wu, University of Manchester, United Kingdom

Warren Del-Pinto, University of Manchester, United Kingdom

Goran Nenadic, University of Manchester, United Kingdom

Siew-Kei Lam, Nanyang Technological University, Singapore

Jie Zhang, A*STAR, Singapore

Anil A Bharath, Imperial College London, Imperial Global Singapore, Singapore

Abstract.

Generative AI offers transformative potential for high-stakes domains such as healthcare and finance, yet privacy and regulatory barriers hinder the use of real-world data. To address this, differentially private synthetic data generation has emerged as a promising alternative. In this work, we introduce a unified benchmark to systematically evaluate the utility and fidelity of text datasets generated under formal Differential Privacy (DP) guarantees. Our benchmark addresses key challenges in domain-specific benchmarking, including choice of representative data and realistic privacy budgets, accounting for pre-training and a variety of evaluation metrics. We assess state-of-the-art privacy-preserving generation methods across five domain-specific datasets, revealing significant utility and fidelity degradation compared to real data, especially under strict privacy constraints. These findings underscore the limitations of current approaches, outline the need for advanced privacy-preserving data sharing methods and set a precedent regarding their evaluation in realistic scenarios. (Footnote 1: Corresponding Authors: [email protected] and [email protected].)

Keywords: Synthetic Data, Differential Privacy, Generative AI, Benchmark

Conference: CIKM; November 10-14, 2025; Seoul, Korea. CCS concepts: Computing methodologies, Natural language generation; Security and privacy, Privacy-preserving protocols; Information systems, Data mining.

1. Introduction

The rapid advancement of Natural Language Processing (NLP) methods has seen success in an increasing number of tasks, including specialist domains, e.g., generating medical documents (Li et al., 2023; Nagar et al., 2024; Binici et al., 2025), solving math problems (Cobbe et al., 2021) or penetration-testing (Deng et al., 2024). However, their success hinges on the wide availability of training and benchmark data, a requirement that is harder to meet for potentially sensitive domain-specific data because of privacy-related regulation. Without access to such domain-specific data, NLP systems’ performance often deteriorates (Nagar et al., 2025) or becomes too unpredictable for domain experts to use (Dell’Acqua et al., 2023).

Generative AI models offer a promising solution, providing a view on domain-specific data and its complexities by generating realistic synthetic datasets representing otherwise inaccessible sensitive data. Combining them with Differential Privacy (DP) (Dwork, 2006), the de-facto gold standard measure for formally quantifying the maximum disclosure risk associated with a data release, allows data holders to generate high-quality representative synthetic data to share with external AI researchers, while maintaining formal privacy guarantees (Schlegel et al., 2025). However, the effectiveness of such methods has been primarily validated on either toy problems (Mattern et al., 2022a; Ochs and Habernal, 2025) or open-domain datasets (Yue et al., 2023; Xie et al., 2024). This introduces two problems with respect to the truthful estimation of their efficacy:

Firstly—what we call prior exposure—generating synthetic text from public-domain datasets is comparatively simple, as publicly available data is likely to be found in the pre-training corpora of foundation models. This simplifies generation, as models tend to memorise their training data (Carlini et al., 2021) thus leading to performance overestimates on various benchmarks (Ni et al., 2025), including synthetic text generation. Importantly, privacy leakage would be underestimated, as the generative model has access to the data not only during the synthesis process, where disclosure risk is controlled, but also during (un-controlled) pre-training (Tramèr et al., 2024). Secondly—we coin this problem representativeness—the utilisation of general domain datasets, such as product reviews, as opposed to domain-specific datasets (Johnson et al., 2023) excludes challenges associated with domain-specific data, such as domain-specific jargon (Hudson, 1978) or organisation- or country-specific work practices reflected in data. For instance, a real-world benchmark for clinical coding will incorporate hospital- and country-specific coding practices (Nguyen et al., 2023); exposing them through data sharing, however, is challenging due to privacy concerns.

Generating domain-specific text data requires rigorous benchmarking and evaluation, an area that is currently underdeveloped, as existing methods do not share common benchmarks or evaluation protocols. To address this research gap, we propose a first effort to benchmark domain-specific textual dataset generation under formal DP guarantees. Specifically, we (a) design our benchmark to include gated-access, domain-specific datasets to address the issues outlined above; (b) introduce a rigorous and reproducible evaluation protocol aimed at precisely quantifying the performance of DP text generators; and (c) evaluate state-of-the-art approaches and showcase their struggles in realistic privacy-preserving data sharing scenarios. Lastly, we (d) suggest future work and possible research avenues.

2. Background & Related Work

Differential privacy is a formal method that provides an upper bound on the amount of information that can be inferred about a private dataset from a derived data release. Formally, a randomised mechanism $\mathcal{M}$ is $(\epsilon,\delta)$-differentially private if, for any two datasets $x,x^{\prime}$ that differ in a single row and for each $S\subseteq Range(\mathcal{M})$, the following inequality holds (Dwork, 2006): $\Pr[\mathcal{M}(x)\in S]\leq \exp(\epsilon)\Pr[\mathcal{M}(x^{\prime})\in S]+\delta$. In other words: for any possible subset of the output space, the probabilities that $\mathcal{M}$ assigns to it under $x$ and under $x^{\prime}$ differ by at most a factor of $\exp(\epsilon)$, up to an additive slack of $\delta$ (informally, the multiplicative bound holds with probability $1-\delta$, with no guarantee for the rest). This gives an upper bound on the performance of a best-possible Membership Inference Attack (MIA). In the context of data synthesis, $x$ is a private dataset, $\mathcal{M}$ is a data generator and $\mathcal{M}(x)$ is a generated synthetic dataset. We are concerned with the setting of a “trusted curator” (e.g., a hospital) releasing a synthetic version of a private dataset, and looking to limit the ability of an adversary to infer, by looking at the synthetic data only, whether any individual (represented by a single row in the private dataset) was part of the private dataset.
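To make the link between the $(\epsilon,\delta)$ inequality and membership inference concrete, the following is a minimal sketch, not part of the paper's code, of the standard hypothesis-testing reading of DP: applying the inequality in both directions bounds the true-positive rate an attacker can reach at a given false-positive rate. The function name `max_mia_tpr` and its parameters are illustrative assumptions.

```python
import math


def max_mia_tpr(epsilon: float, delta: float, fpr: float) -> float:
    """Upper bound on an attacker's true-positive rate at false-positive rate `fpr`.

    Follows from applying the (epsilon, delta)-DP inequality to the attacker's
    decision region and to its complement (hypothetical helper, for illustration).
    """
    # Direct application: Pr[attack says "member" | member] <= e^eps * FPR + delta
    bound_direct = math.exp(epsilon) * fpr + delta
    # Application to the complement event, rearranged for the TPR
    bound_complement = 1.0 - math.exp(-epsilon) * (1.0 - delta - fpr)
    return min(1.0, bound_direct, bound_complement)


if __name__ == "__main__":
    # Privacy budgets used in the benchmark; the FPR and delta values are arbitrary examples.
    for eps in (0.5, 1, 2, 4):
        print(eps, round(max_mia_tpr(eps, delta=1e-5, fpr=0.01), 4))
```

At $\epsilon=0$ and $\delta=0$ the bound collapses to the false-positive rate itself, i.e. the attacker cannot do better than guessing; as $\epsilon$ grows, the bound loosens towards 1.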

Direct data manipulation (“rewriting”) is infeasible due to the high-dimensional, discrete nature of textual data, and because such anonymization efforts generally do not prevent re-identification (Sweeney, 2002). Thus, existing text generation methods fall into two paradigms. First, DP training approaches (Yue et al., 2023; Mattern et al., 2022a; Meeus et al., 2025a) fine-tune generative models on “control codes” (e.g. class labels) using variants of DP-SGD (Abadi et al., 2016), which privatizes training by clipping gradients and adding noise to bound each example’s influence. By post-processing, prompting these models with control codes yields DP synthetic data. Second, DP inference methods privatize generation itself (Koga et al., 2024; Flemings et al., 2024), e.g., PATE (Papernot et al., 2018), which aggregates teacher predictions from disjoint private subsets. Because private data may be accessed at each generation step, privacy loss accrues per token, limiting scalability for large corpora (Amin et al., 2024). A notable exception is AUG-PE (Xie et al., 2024), which evolves random datasets toward the private distribution while privatizing the fitness function at relatively low privacy cost. In this paper, we evaluate one representative state-of-the-art method from each direction.
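The clip-and-noise step at the core of DP-SGD can be sketched as follows. This is a simplified, standard-library illustration of the general technique from Abadi et al. (2016), not the training code of any evaluated method; the function name `dp_sgd_step`, the list-based gradient representation and all parameter names are assumptions made for the example, and privacy accounting is omitted.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, average."""
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        # Rescale the gradient so that its L2 norm is at most clip_norm
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Noise standard deviation is proportional to the clipping bound
    sigma = noise_multiplier * clip_norm
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.1, -0.2], [10.0, 0.0]]
    print(dp_sgd_step([0.0, 0.0], grads, clip_norm=1.0, noise_multiplier=1.0, lr=0.1, rng=rng))
```

Because the noise scale is tied to the clipping bound rather than to the batch, smaller batches leave a lower signal-to-noise ratio in the averaged gradient.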

The evaluation of synthetic data generation varies significantly across modalities. For tabular data, comprehensive frameworks assess fidelity through structural similarity, distribution preservation, and global structure metrics (Commission, 2023; Chundawat et al., 2022; Bellinger et al., 2016). Text evaluation is fundamentally more challenging due to its high-dimensional, context-dependent nature. Most approaches measure distributions in embedding space using metrics like KL-divergence or MAUVE (Pillutla et al., 2021), yet prior work has predominantly focused on “easy-to-measure characteristics” such as text length distributions (Xie et al., 2024; Yue et al., 2023). Additionally, existing benchmarks typically use general-purpose datasets like sentiment analysis corpora or PubMed abstracts (Blitzer et al., 2007; Canese and Weis, 2013). Our work addresses these limitations by introducing comprehensive evaluation metrics that go beyond surface-level properties to assess domain-specific utility and entity-level fidelity, while evaluating on specialist domains that provide more realistic assessments of privacy-preserving text generation.

3. Benchmark Design

We design the benchmark to address key challenges of evaluating differentially private synthetic text generators:

Addressing the challenges: For prior exposure, we include gated-access datasets (requiring, e.g., Data Usage Agreements) to reduce the likelihood of the data appearing in the pre-training corpora of the LLMs used for data synthesis. Further, we utilise a mostly open language model (Grattafiori et al., 2024) as the backbone of the evaluated methods, verifying that the benchmark’s “private” data has not been exposed to the model during pre-training (footnote 2: it might still have encountered the data during closed-source post-training). For representativeness, we use challenging domain-specific datasets from the (bio)medical, clinical and legal domains.

Choice of $\epsilon$: While there is no consensus on how much $\epsilon$ is “enough” (Lee and Clifton, 2011), higher values provide progressively weaker protection (Dwork et al., 2019). In practice (footnote 3: e.g., as recommended by NIST: https://www.nist.gov/blogs/cybersecurity-insights/differential-privacy-future-work-open-challenges), $\epsilon\leq 5$ or even $\epsilon\leq 1$ is considered a strong privacy guarantee, which guides our choice: $\epsilon\in\{0.5,1,2,4\}$.

Evaluation Protocol: Our benchmark evaluates both utility and fidelity at different $\epsilon$-levels. Utility quantifies how useful the synthetic data is for the real downstream application task. We measure it by training multiple downstream classification models on the synthetic data and evaluating their performance on a held-out test set of original data. Fidelity, in turn, assesses how similar the synthetic data is to the original dataset. Our implementation supports the quality evaluation of synthetic text across various metrics. For surface-level similarity, we report BLEU (Papineni et al., 2001) and METEOR (Banerjee and Lavie, 2005), which capture n-gram and sequence overlap between real and synthetic sentences. BERTScore (Zhang et al., 2019) and Universal Sentence Encoder (USE) (Cer et al., 2018) cosine similarity are used to evaluate semantic alignment. For these reference-based metrics, we use a “many reference” evaluation approach, where each synthetic sentence is compared against the entire pool of original sentences. Corpus-level fidelity evaluation measures differences in distributions between original and synthetic data: MAUVE (Pillutla et al., 2021), recognised Named Entities and text lengths evaluate semantic, content and structural similarity, respectively. To quantify the effect of varying privacy noise levels, we compute these metrics for each synthetic dataset at varying $\epsilon$.
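The “many reference” comparison can be illustrated with a small sketch. It rests on two assumptions that the text does not specify: a simple unigram-overlap F1 stands in for BLEU, METEOR or BERTScore, and the per-sentence score is taken as the maximum over the reference pool before averaging. The names `unigram_f1` and `many_reference_score` are hypothetical and do not come from the released repository.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy overlap metric (stand-in for BLEU/METEOR): F1 over lower-cased word counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def many_reference_score(synthetic, originals, metric=unigram_f1) -> float:
    """Score each synthetic sentence against the whole pool of originals.

    Assumption: the best-matching original defines the sentence score,
    and sentence scores are averaged over the synthetic corpus.
    """
    per_sentence = [max(metric(s, o) for o in originals) for s in synthetic]
    return sum(per_sentence) / len(per_sentence)


if __name__ == "__main__":
    originals = ["the drug caused severe nausea", "no side effects so far"]
    synthetic = ["this drug caused nausea", "i feel great"]
    print(round(many_reference_score(synthetic, originals), 3))
```

Any sentence-level metric with the same `(candidate, reference)` signature could be passed as `metric` in place of the toy overlap score.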

Methods & Datasets: We compare two state-of-the-art differentially private text generators: DP-Gen (DP-SGD generation from Yue et al. (2023)), representing DP training, and AUG-PE (Xie et al., 2024), representing distribution alignment. We evaluate on three healthcare datasets: HoC (Baker et al., 2016) (cancer hallmark identification in scientific literature), N2C2’08 (Uzuner, 2009) (obesity and co-morbidity recognition in clinical discharge summaries), and PsyTAR (Zolnoori et al., 2019) (adverse drug effect detection in social media posts). We further include a financial dataset, DMSAFN (Daniel-ML, 2023), and AsyLax (Barale et al., 2023) to extend coverage to legal reasoning. As summarized in Table 1, the benchmark includes long documents (AsyLax, N2C2’08), gated-access corpora (PsyTAR, N2C2’08), small datasets that are challenging to train on (N2C2’08), and multi-label settings with many labels (HoC, N2C2’08, PsyTAR).

4. Results & Analysis

Table 2 reveals substantial utility degradation in text generation. Importantly, even without privacy constraints ($\epsilon=\infty$), synthetic data fails to fully match real-data baselines, suggesting that the evaluated methods cannot fully capture domain-specific complexity. Under strong privacy constraints ($\epsilon\leq 4$) and averaged across all datasets, the models’ average performance stays at around 50% of real-data performance, strikingly independent of $\epsilon$. Discounting random/majority baseline performance and calculating the overall improvement over baselines (capped by performance on real data, similar to the relative gain of Mattern et al. (2022b)) lowers this score to 55%/21% without privacy guarantees for DP-Gen/AUG-PE, and to an improvement of at most 28%/52% with $\epsilon\leq 4$. Only when looking at the best-performing model on each dataset does the expected utility-privacy trade-off become visible (e.g., 28%, 26%, 23%, 21%, 15% retention for AUG-PE at $\epsilon\in\{\infty,4,2,1,0.5\}$), with DP-Gen dominating AUG-PE without privacy constraints (62% vs 28% improvement over the baseline). This finding validates our benchmark design, specifically the use of multiple classifier baselines, to obtain more precise data utility estimates. Looking at per-dataset performance, gated-access datasets present significant challenges: for example, N2C2’08 has the worst baseline-adjusted utility. This yields evidence for our hypothesis that evaluating on non-publicly available data results in a more realistic performance estimate.
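A baseline-adjusted score of this kind can be sketched as below. The exact formula behind the reported percentages is not given in the text, so this is only one plausible reading: the stronger of the random and majority baselines is subtracted, the result is normalised by the real-data headroom, and the value is clamped to [0, 1]. The function name and the choice to floor at zero are assumptions, and the sketch is not expected to reproduce the aggregate figures quoted above.

```python
def baseline_adjusted_gain(synthetic_f1, real_f1, random_f1, majority_f1):
    """Fraction of the real-data improvement over trivial baselines retained by synthetic data.

    Assumptions (not from the paper): the stronger trivial baseline is used,
    and the result is clamped to the interval [0, 1].
    """
    baseline = max(random_f1, majority_f1)
    if real_f1 <= baseline:
        raise ValueError("real-data score must exceed the trivial baseline")
    gain = (synthetic_f1 - baseline) / (real_f1 - baseline)
    return max(0.0, min(1.0, gain))


if __name__ == "__main__":
    # PsyTAR, DP-Gen average F1 from Table 2 (Random 25.4, Majority 41.8, Original avg 80.7)
    for eps, score in [("inf", 69.5), ("4", 41.6), ("0.5", 39.2)]:
        print(eps, round(baseline_adjusted_gain(score, 80.7, 25.4, 41.8), 3))
```

Under this reading, a synthetic dataset whose downstream F1 falls at or below the majority baseline contributes a gain of zero, however close its raw score is to the real-data score.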

Table 3 shows that fidelity also deteriorates, following trends similar to those of utility. MAUVE scores between real and synthetic data are close to zero across all domains, with entity overlap divergence particularly pronounced in HoC and N2C2’08, possibly due to the domain specificity and closed-access nature of the datasets. Interestingly, AsyLax again mirrors this pattern: low MAUVE and high NER divergence indicate failure to preserve legal argumentation structure, even when surface vocabulary is retained. Here, similarly, N2C2’08 and AsyLax prove to be the most challenging datasets, which we attribute to the combination of long context, domain specificity, and, for N2C2’08, gated access.

Our comparison of lexical and semantic measures of AUG-PE and DP-SGD finetuned generators, as seen in Figure 2, shows that data quality (compared to non-privately synthesised data) progressively decreases, with similar declines for both methods. This privacy-fidelity trade-off is in line with the literature, with stronger privacy leading to lower data quality (Yue et al., 2023; Xie et al., 2024; Mattern et al., 2022a).

Overall, text generated by fine-tuned models exhibits good utility and fidelity (as evidenced by high utility at $\epsilon=\infty$, high MAUVE scores and low text length divergences), but the addition of DP noise deteriorates the quality of the generated data significantly, as noted by Xie et al. (2024), especially on “unfamiliar” domain-specific data (footnote 4: such data demands larger parameter updates, which are clipped and noised; long contexts force smaller batch sizes, further raising the noise-to-signal ratio). Conversely, data generated by AUG-PE is of initially low quality (low utility/MAUVE scores at $\epsilon=\infty$), but the elegant noising of histogram counts introduces much less noise in the process, thus deteriorating the quality less thereafter. This phenomenon is visualised in Figure 3: while DP-Gen text maintains the overall structure (e.g., text length, general sense of having adverse drug effects), the AUG-PE example misses the social-media tone. With enough good candidates, the example would have been filtered out by the private evolution mechanism, suggesting that the embedding model underlying the histogram mechanism cannot faithfully capture the nuances of domain-specific data. Notably, we observe markedly lower performance than reported previously (Yue et al., 2023; Xie et al., 2024; Mattern et al., 2022a), suggesting that evaluating DP text generators on open-domain, simple datasets overestimates their performance for real use-cases.
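The noised histogram referred to above can be sketched in a few lines. This is a simplified illustration of the general private-evolution idea (each private record votes for its nearest synthetic candidate in embedding space, and Gaussian noise is added to the vote counts), not AUG-PE's actual implementation; the function name, the Euclidean distance, and the clamping of negative counts to zero are assumptions for the example.

```python
import math
import random


def dp_nearest_neighbour_histogram(private_embs, candidate_embs, sigma, rng):
    """Noisy vote counts: how many private embeddings are closest to each candidate."""
    counts = [0.0] * len(candidate_embs)
    for emb in private_embs:
        # Each private record casts exactly one vote, so one record changes one count by 1
        nearest = min(
            range(len(candidate_embs)),
            key=lambda j: math.dist(emb, candidate_embs[j]),
        )
        counts[nearest] += 1.0
    # Gaussian noise on every count; negative values are clamped (assumption)
    return [max(0.0, c + rng.gauss(0.0, sigma)) for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    private = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
    candidates = [[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]]
    print(dp_nearest_neighbour_histogram(private, candidates, sigma=1.0, rng=rng))
```

Candidates with high noisy counts would be kept and varied in the next round, which is why the quality of the embedding space directly limits what the selection step can distinguish.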

5. Conclusion and Future Work

We present a benchmark design for synthetic text generation under formal privacy guarantees, alongside preliminary empirical results on healthcare text data. The results reveal significant challenges in generating high-quality synthetic data: state-of-the-art models face substantial performance deterioration when they are applied to domain-specific datasets, increasingly so under strict privacy constraints. This highlights the limitations of approaches that rely on foundation models pre-trained on general-domain data. Our findings emphasise the need for domain-specific approaches that faithfully represent domain-specific data while preserving privacy. Our work is a first move towards creating a standardised benchmark that can catalyse progress in privacy-preserving synthetic data generation for high-stakes applications. To accelerate this, we release our code and report additional metrics and scores, omitted in this paper due to space constraints, in a public git repository: https://github.com/ImperialGlobalSingapore/synth-data.

Future work will focus on expanding this benchmark: we aim to support multimodal data generation to evaluate the preservation of complex relationships between different data types, e.g., clinical text reports and medical images (Johnson et al., 2019), and to add more advanced corpus evaluation metrics encompassing discourse structure and human assessments. Finally, while in this work we have focussed primarily on data quality, we will also develop stronger membership inference attacks that leverage changes in these metrics to provide a more realistic assessment of privacy risks. To better validate privacy guarantees, our future work will introduce diagnostic datasets specifically designed for auditing membership inference attacks (MIA) on synthetic data (Meeus et al., 2025b). This targeted approach will allow researchers who propose novel DP text generation methods to audit implementation correctness without generating large volumes of synthetic data to train shadow classifiers (Guépin et al., 2023).

Acknowledgement

This research is part of the IN-CYPHER programme and is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. We are grateful for the support provided by Research IT in the form of access to the Computational Shared Facility at The University of Manchester and the computational facilities at the Imperial College Research Computing Service (DOI: https://doi.org/10.14469/hpc/2232). We also thank the anonymous CIKM reviewers for their feedback that helped us improve the paper further.

GenAI Usage Disclosure

Some authors have used generative AI tools (GenAI) to polish grammar and suggest wordings. One author has relied on GenAI to propose an (empty) table format, which was later manually populated by corresponding results. No part of this paper has been entirely generated by GenAI.

Bibliography

Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vol. 24-28-October-2016. ACM, New York, NY, USA, 308–318. doi: 10.1145/2976749.2978318
Amin et al. (2024) Kareem Amin, Alex Bie, Weiwei Kong, Alexey Kurakin, Natalia Ponomareva, Umar Syed, Andreas Terzis, and Sergei Vassilvitskii. 2024. Private prediction for large-scale synthetic text generation. arXiv preprint arXiv:2407.12108 (2024).
Baker et al. (2016) Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen. 2016. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics 32, 3 (2016), 432–440. doi: 10.1093/bioinformatics/btv585
Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. 65–72 pages. https://aclanthology.org/W05-0909/
Barale et al. (2023) Claire Barale, Michael Rovatsos, and Nehal Bhuta. 2023. Automated Refugee Case Analysis: A NLP Pipeline for Supporting Legal Practitioners. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 2992–3005. doi: 10.18653/v1/2023.findings-acl.187
Bellinger et al. (2016) Colin Bellinger, Christopher Drummond, and Nathalie Japkowicz. 2016. Beyond the Boundaries of SMOTE: A Framework for Manifold-Based Synthetically Oversampling. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2016) (Lecture Notes in Computer Science, vol. 9851). Springer, Cham, 248–263. doi: 10.1007/978-3-319-46128-1_16
Binici et al. (2025) Kuluhan Binici, Abhinav Ramesh Kashyap, Viktor Schlegel, Andy T. Liu, Vijay Prakash Dwivedi, Thanh-Tung Nguyen, Xiaoxue Gao, Nancy F. Chen, and Stefan Winkler. 2025. MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues. Proceedings of the AAAI Conference on Artificial Intelligence 39, 22 (4 2025), 23496–23504. doi: 10.1609/aaai.v39i22.34518