LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics
Weiye Shi, Zhaowei Zhang, Shaoheng Yan, Yaodong Yang

TL;DR
This study investigates whether large language models can learn and utilize deeper linguistic features like syntax, metaphor, and phonetics across multiple languages, revealing their potential and limitations in capturing complex language properties.
Contribution
Introduces a multilingual genre classification dataset with explicit linguistic features to evaluate LLMs' ability to learn complex language properties from raw text and features.
Findings
LLMs can learn latent linguistic structures from raw text and explicit features.
Different linguistic features contribute variably across classification tasks.
Incorporating complex linguistic signals enhances model performance.
Abstract
Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact…
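The task setup described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the `Example` record, its field names, the `binary_tasks` and `build_input` helpers, and the way features are serialized into text are all hypothetical, since the excerpt does not specify the dataset format. Only the three genres, the six languages, and the three feature sets come from the abstract.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Tuple

# Genres and languages as listed in the abstract.
GENRES = ("poetry", "novel", "drama")
LANGUAGES = ("English", "French", "German", "Italian", "Spanish", "Portuguese")


@dataclass
class Example:
    """One sentence plus optional explicit features (field names are hypothetical)."""
    text: str
    language: str
    genre: str
    syntax_tree: Optional[str] = None            # e.g. a bracketed parse string
    metaphor_count: Optional[int] = None         # number of detected metaphors
    phonetic_metrics: Optional[Dict[str, float]] = None  # e.g. rhyme or stress scores


def binary_tasks() -> List[Tuple[str, str]]:
    """Enumerate the pairwise genre tasks (three pairs for three genres)."""
    return list(combinations(GENRES, 2))


def build_input(ex: Example,
                use: Sequence[str] = ("syntax", "metaphor", "phonetic")) -> str:
    """Concatenate raw text with the selected feature sets (assumed serialization)."""
    parts = [f"Text: {ex.text}"]
    if "syntax" in use and ex.syntax_tree is not None:
        parts.append(f"Syntax: {ex.syntax_tree}")
    if "metaphor" in use and ex.metaphor_count is not None:
        parts.append(f"Metaphor count: {ex.metaphor_count}")
    if "phonetic" in use and ex.phonetic_metrics is not None:
        metrics = ", ".join(f"{k}={v}" for k, v in sorted(ex.phonetic_metrics.items()))
        parts.append(f"Phonetic metrics: {metrics}")
    return "\n".join(parts)


if __name__ == "__main__":
    # Illustrative sentence only; it is not taken from the paper's dataset.
    ex = Example(
        text="Shall I compare thee to a summer's day?",
        language="English",
        genre="poetry",
        metaphor_count=1,
    )
    print(binary_tasks())
    print(build_input(ex))              # raw text plus available features
    print(build_input(ex, use=()))      # raw-text-only baseline input
```

Passing a different `use` tuple gives a simple way to compare a raw-text baseline against inputs augmented with one feature set at a time, which is the kind of comparison the abstract describes.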
Taxonomy
Topics: Authorship Attribution and Profiling · Sentiment Analysis and Opinion Mining · Topic Modeling
