LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics

Weiye Shi; Zhaowei Zhang; Shaoheng Yan; Yaodong Yang

arXiv:2512.04957 · cs.CL · December 5, 2025

TL;DR

This study investigates whether large language models can learn and utilize deeper linguistic features like syntax, metaphor, and phonetics across multiple languages, revealing their potential and limitations in capturing complex language properties.

Contribution

Introduces a multilingual genre classification dataset with explicit linguistic features to evaluate LLMs' ability to learn complex language properties from raw text and features.

Findings

01. LLMs can learn latent linguistic structures from raw text and explicit features.

02. Different linguistic features contribute variably across classification tasks.

03. Incorporating complex linguistic signals enhances model performance.

Abstract

Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public-domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact…
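
As a rough illustration of the setup the abstract describes, the sketch below enumerates the three binary tasks across the six languages and shows one way a sentence might be paired with the three feature sets. It is not the authors' data format: the class name `Sentence`, its field names, the bracketed tags, and the helper functions are all hypothetical, chosen only to make the structure concrete.

```python
from dataclasses import dataclass, field
from itertools import product

# Binary tasks and languages as listed in the abstract.
TASKS = (("poetry", "novel"), ("drama", "poetry"), ("drama", "novel"))
LANGUAGES = ("English", "French", "German", "Italian", "Spanish", "Portuguese")


@dataclass
class Sentence:
    """One labeled sentence plus the three explicit feature sets.

    All field names are hypothetical; the paper's schema may differ.
    """
    text: str
    language: str
    genre: str
    syntax_tree: str = ""            # e.g. a bracketed parse string
    metaphor_count: int = 0          # number of detected metaphors
    phonetic_metrics: dict = field(default_factory=dict)


def experiment_grid():
    """Return every (language, task) pair: 6 languages x 3 binary tasks."""
    return list(product(LANGUAGES, TASKS))


def render_input(s, use_features=True):
    """Build a classifier input from raw text, optionally with features.

    The tag layout is an assumption made for illustration only.
    """
    if not use_features:
        return s.text
    phon = ", ".join(f"{k}={v}" for k, v in sorted(s.phonetic_metrics.items()))
    return (
        f"{s.text}\n"
        f"[syntax] {s.syntax_tree}\n"
        f"[metaphors] {s.metaphor_count}\n"
        f"[phonetics] {phon}"
    )
```

Calling `render_input(sentence, use_features=False)` gives the raw-text condition, while the default call appends the explicit features, which mirrors the comparison between learning from raw text and learning from text plus features.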

Taxonomy

Topics: Authorship Attribution and Profiling · Sentiment Analysis and Opinion Mining · Topic Modeling