A Comprehensive Understanding of Code-mixed Language Semantics using Hierarchical Transformer

Ayan Sengupta, Tharun Suresh, Md Shad Akhtar, and Tanmoy Chakraborty

arXiv:2204.12753·cs.CL·April 28, 2022·5 cites

Open Access · 1 Repo

TL;DR

This paper introduces a hierarchical transformer model that effectively learns the semantics of code-mixed languages, outperforming existing models across multiple NLP tasks and languages, and demonstrates strong generalizability through pre-training and transfer learning.

Contribution

The paper proposes a novel hierarchical transformer architecture for code-mixed language understanding, incorporating multi-headed self-attention and outer product attention, with extensive evaluation on diverse datasets.
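
The two attention components named above can be illustrated with a minimal pure-Python sketch. This page does not give the paper's equations, so the per-dimension formulation below is an assumed stand-in for outer product attention, and averaging the two outputs is likewise an assumption. All function names are hypothetical and are not taken from the official repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(queries, keys, values):
    """Scaled dot-product attention: one scalar score per (query, key) pair."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def outer_product_attention(queries, keys, values):
    """ASSUMED stand-in: scores come from the element-wise product q * k,
    so every feature dimension gets its own attention distribution over keys.
    Requires queries, keys, and values to share the same feature dimension."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        row = []
        for j in range(dim):
            scores = [q[j] * k[j] / math.sqrt(dim) for k in keys]
            weights = softmax(scores)
            row.append(sum(w * v[j] for w, v in zip(weights, values)))
        outputs.append(row)
    return outputs


def fused_attention(tokens):
    """ASSUMED fusion: average the outputs of the two attention variants."""
    a = dot_product_attention(tokens, tokens, tokens)
    b = outer_product_attention(tokens, tokens, tokens)
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
```

Calling `fused_attention` on a list of equal-length token vectors returns one vector per token with the same dimensionality. A real implementation would add learned query, key, and value projections and multiple heads, which are omitted here for brevity.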

Findings

01. HIT outperforms state-of-the-art models on 9 NLP tasks across 17 datasets.

02. Pre-training significantly enhances downstream task performance.

03. Model generalizes well with masked language modeling, zero-shot, and transfer learning.
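
The masked language modeling objective mentioned in the findings can be sketched as a simple token-masking step. This is a generic illustration, not the paper's pre-training pipeline: the 15% masking rate follows the common BERT convention and is an assumption here, and `mask_tokens` and the `[MASK]` symbol are hypothetical names.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; the paper's vocabulary may differ


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere, so a model is trained to predict
    only the hidden tokens.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Usage with an illustrative Hindi-English code-mixed sentence.
sentence = "mujhe yeh movie bahut pasand hai".split()
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

Full BERT-style masking also replaces some selected tokens with random tokens or leaves them unchanged; that refinement is left out to keep the sketch short.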

Abstract

Being a popular mode of text-based communication in multilingual communities, code-mixing in online social media has become an important subject to study. Learning the semantics and morphology of code-mixed language remains a key challenge, due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character, subword, and word-level embeddings, which aid in learning meaningful correlations. In this paper, we explore a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages. HIT consists of multi-headed self-attention and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across 6 Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu and…

Code & Models

Repositories

lcs2-iiitd/code-mixed-classification
tf · Official

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection