Tweet2Vec: Character-Based Distributed Representations for Social Media

Bhuwan Dhingra; Zhong Zhou; Dylan Fitzpatrick; Michael Muehl; William W. Cohen

arXiv:1605.03481·cs.LG·May 18, 2016

TL;DR

Tweet2Vec is a character-based model for social media text that learns complex, non-local dependencies in character sequences. It outperforms a word-level baseline at hashtag prediction, especially on inputs with many out-of-vocabulary words or unusual character sequences.

Contribution

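The paper presents a character composition model, tweet2vec, that learns vector-space representations of whole tweets directly from character sequences. This avoids the prohibitively large vocabulary that informal language, spelling errors, abbreviations, and special characters impose on word-level approaches.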

Findings

01 Outperforms a word-level baseline at predicting user-annotated hashtags

02 Does significantly better than the baseline on inputs with many out-of-vocabulary words or unusual character sequences

03 Improves representation of informal social media text

Abstract

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
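The vocabulary argument in the abstract can be illustrated with a small sketch. This is not the tweet2vec model itself: the tweets are invented, and the helper names (`word_vocab`, `char_vocab`, `encode_chars`) are hypothetical. It only shows why spelling variants inflate a word-level vocabulary while a character-level one stays small, and how a tweet becomes the kind of character-id sequence a character composition model would consume.

```python
# Toy illustration only: the tweets are made up, not taken from the paper's data,
# and the helper names below are hypothetical.
train = [
    "so happy today",
    "sooo happy 2day!!",
    "soooo hapy todayyy :)",
]
unseen = "happpy todaay"  # new spelling variants, no new characters


def word_vocab(texts):
    """Whitespace-separated tokens: every misspelling becomes a new entry."""
    return {tok for t in texts for tok in t.split()}


def char_vocab(texts):
    """Distinct characters: bounded by the alphabet actually in use."""
    return {ch for t in texts for ch in t}


def encode_chars(text, index):
    """Map a tweet to a list of integer character ids (0 = unknown character)."""
    return [index.get(ch, 0) for ch in text]


words = word_vocab(train)
chars = char_vocab(train)
char_index = {ch: i + 1 for i, ch in enumerate(sorted(chars))}

# Tokens and characters of the unseen tweet that the training data never covered.
oov_words = [tok for tok in unseen.split() if tok not in words]
oov_chars = [ch for ch in unseen if ch not in chars]

# Character-id sequence that a character-level encoder would take as input.
ids = encode_chars(unseen, char_index)

print("word vocabulary size:", len(words))
print("character vocabulary size:", len(chars))
print("out-of-vocabulary words:", oov_words)
print("out-of-vocabulary characters:", oov_chars)
```

In this toy sample both tokens of the unseen tweet are out of vocabulary at the word level, while every one of its characters has already been seen, so the character-id sequence contains no unknown ids.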

Code & Models

Repositories

bdhingra/tweet2vec (official)
