HFL at SemEval-2022 Task 8: A Linguistics-inspired Regression Model with Data Augmentation for Multilingual News Similarity

Zihang Xu; Ziqing Yang; Yiming Cui; Zhigang Chen

arXiv:2204.04844 · cs.CL · April 12, 2022

TL;DR

This paper presents a multilingual news similarity system built on a linguistics-inspired regression model with data augmentation. The system ranked 1st on the SemEval-2022 Task 8 leaderboard with a Pearson's correlation coefficient of 0.818.

Contribution

It introduces a novel combination of data augmentation, multi-label loss, adapted R-Drop, and sample reconstruction techniques for multilingual news similarity.

Findings

01 Ranked 1st on the leaderboard

02 Achieved a Pearson's correlation coefficient of 0.818 on the official evaluation set

03 Demonstrated the effectiveness of linguistics-inspired methods
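
The headline metric above is Pearson's correlation coefficient between predicted and gold similarity scores. As a minimal sketch of how that number is computed (the function name `pearson_r` and the sample scores are hypothetical, not taken from the paper or the task data):

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted vs. gold similarity scores, for illustration only.
predicted = [1.2, 2.9, 3.8, 2.1]
gold = [1.0, 3.0, 4.0, 2.5]
score = pearson_r(predicted, gold)
```

Because the coefficient is invariant to shifting and positive rescaling of either sequence, a regression model is rewarded for ordering and spacing article pairs consistently with the gold scores rather than for matching their absolute scale.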

Abstract

This paper describes our system designed for SemEval-2022 Task 8: Multilingual News Article Similarity. We propose a linguistics-inspired model trained with a few task-specific strategies. The main techniques of our system are: 1) data augmentation, 2) multi-label loss, 3) adapted R-Drop, and 4) sample reconstruction with the head-tail combination. We also present a brief analysis of some methods that did not help, such as a two-tower architecture. Our system ranked 1st on the leaderboard while achieving a Pearson's Correlation Coefficient of 0.818 on the official evaluation set.
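
The abstract does not spell out how the head-tail combination is built. One common reading, assumed here and not confirmed by this page, is that an over-long article is shortened by keeping its opening tokens and its closing tokens and dropping the middle. A minimal sketch under that assumption (the function name `head_tail_truncate` and the parameters `max_len` and `head_len` are hypothetical):

```python
def head_tail_truncate(tokens, max_len, head_len):
    """Keep the first `head_len` tokens and fill the rest of the budget from the end.

    Assumption: this is an illustrative reading of "head-tail combination",
    not the authors' confirmed procedure.
    """
    if max_len <= 0 or not 0 <= head_len <= max_len:
        raise ValueError("require max_len > 0 and 0 <= head_len <= max_len")
    tokens = list(tokens)
    if len(tokens) <= max_len:
        # Short inputs already fit the budget and are returned unchanged.
        return tokens
    tail_len = max_len - head_len
    head = tokens[:head_len]
    # Slice from an explicit index so that tail_len == 0 yields an empty tail.
    tail = tokens[len(tokens) - tail_len:]
    return head + tail


# Hypothetical token IDs, for illustration only.
article = list(range(10))
shortened = head_tail_truncate(article, max_len=6, head_len=4)
```

Varying `head_len` for the same article would yield several distinct shortened views of it, which is one way such a step could double as sample reconstruction; whether the authors do this is not stated here.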

Code & Models

Repositories

geekdream-x/semeval2022-task8-tonyx
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies