Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens

Ronald Seoh; Haw-Shiuan Chang; Andrew McCallum

arXiv:2309.04333·cs.CL·September 11, 2023·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces Multi2SPE, a method that uses multiple CLS tokens in Transformers to better capture multi-domain scientific document features, improving tasks like citation prediction.

Contribution

It proposes Multi2SPE, an approach that employs multiple CLS tokens to encode multi-domain scientific documents, along with a new multi-domain benchmark, Multi-SciDocs.

Findings

01. Multi2SPE reduces error by up to 25% in multi-domain citation prediction.

02. It requires minimal additional computation over standard BERT.

03. The approach improves multi-domain document representations.

Abstract

Many useful tasks on scientific documents, such as topic classification and citation prediction, involve corpora that span multiple scientific domains. Typically, such tasks are accomplished by representing the text with a vector embedding obtained from a Transformer's single CLS token. In this paper, we argue that using multiple CLS tokens could make a Transformer better specialize to multiple scientific domains. We present Multi2SPE: it encourages each of multiple CLS tokens to learn diverse ways of aggregating token embeddings, then sums them up together to create a single vector representation. We also propose our new multi-domain benchmark, Multi-SciDocs, to test scientific paper vector encoders under multi-domain settings. We show that Multi2SPE reduces error by up to 25 percent in multi-domain citation prediction, while requiring only a negligible amount of computation in…
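The aggregation idea in the abstract (several CLS tokens, each pooling the token embeddings in its own way, with the results summed into one vector) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single softmax-attention pooling step instead of a full Transformer, and the names `attend`, `encode`, and `cls_queries` are hypothetical, introduced only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Pool token embeddings using attention weights from one CLS-like query.

    Assumption: scaled dot-product attention stands in for whatever
    aggregation a real Transformer layer would learn.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def encode(cls_queries, tokens):
    """Sum the per-CLS pooled vectors into a single document vector."""
    pooled = [attend(query, tokens) for query in cls_queries]
    dim = len(tokens[0])
    return [sum(vec[d] for vec in pooled) for d in range(dim)]


# Toy data: three 2-dimensional token embeddings and two hypothetical CLS queries,
# each emphasizing a different dimension of the token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cls_queries = [[4.0, 0.0], [0.0, 4.0]]
doc_vector = encode(cls_queries, tokens)
```

Each query weights the tokens differently, so the pooled vectors differ before they are summed; in the paper, that diversity is encouraged during training rather than fixed by hand as it is here.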

Code & Models

Repositories

ronaldseoh/multi2spe (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Adam · Weight Decay · Byte Pair Encoding · Linear Warmup With Linear Decay