TL;DR
This paper combines CharacterBERT with a Self-Teaching training method to make dense retrievers more robust to queries with typos. It traces the problem to BERT's WordPiece tokenization and reports improved effectiveness on real-world typo queries.
Contribution
The paper proposes a novel combination of CharacterBERT and Self-Teaching to improve dense retriever robustness to typos, along with a new dataset for evaluation.
Findings
CharacterBERT with Self-Teaching outperforms previous methods on typo queries.
The tokenization strategy in BERT significantly affects robustness to typos.
A new dataset with real-world typo queries is introduced.
Abstract
Current dense retrievers are not robust to out-of-domain and outlier queries, i.e., their effectiveness on these queries is much poorer than what one would expect. In this paper, we consider a specific instance of such queries: queries that contain typos. We show that a small character-level perturbation in queries (as caused by typos) greatly impacts the effectiveness of dense retrievers. We then demonstrate that the root cause of this resides in the input tokenization strategy employed by BERT. In BERT, tokenization is performed using BERT's WordPiece tokenizer, and we show that a token with a typo will significantly change the token distributions obtained after tokenization. This distribution change translates to changes in the input embeddings passed to the BERT-based query encoder of dense retrievers. We then turn our attention to devising dense retriever methods that are robust…
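The tokenization effect described in the abstract can be illustrated with a small sketch. The code below is a simplified greedy longest-match WordPiece splitter written for illustration only; it is not BERT's actual tokenizer, and the vocabulary `TOY_VOCAB` and the function name `wordpiece_tokenize` are assumptions invented for this example. It shows how a single transposition typo can turn one whole-word token into several unrelated subword pieces.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style tokens.

    Simplified illustration: non-initial pieces carry a '##' prefix, and the
    whole word maps to `unk` if any position cannot be matched.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (an assumption for this sketch, not BERT's vocab).
TOY_VOCAB = {"information", "in", "##fr", "##oma", "##tion"}

clean = wordpiece_tokenize("information", TOY_VOCAB)
typo = wordpiece_tokenize("infromation", TOY_VOCAB)

print(clean)  # ['information']
print(typo)   # ['in', '##fr', '##oma', '##tion']
```

In this toy setting the clean query word is a single token, while the typo variant yields four pieces that share no token with the original, so the embeddings fed to a query encoder would differ entirely. A character-level input model avoids depending on such a subword vocabulary.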
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Attention Dropout · Adam · Residual Connection · WordPiece · CharacterBERT
