# A corpus approach to orthographic chunking: near-naive word separation in Swiss German text messages

**Authors:** Erika Just, Paul Widmer

PMC · DOI: 10.1515/cllt-2024-0049 · Corpus Linguistics and Linguistic Theory · 2025-03-14

## TL;DR

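Using a corpus of Swiss German text messages, this paper shows that writers' word separation (where they insert spaces) follows phonological processes such as assimilation and epenthesis, yielding fewer orthographic words than in Standard German.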

## Contribution

The study introduces a corpus-based analysis of orthographic chunking in Swiss German text messages, showing that near-naive word separation is driven by phonology rather than syntax.

## Key findings

- Swiss German text messages contain fewer orthographic words than Standard German.
- Writers prioritize phonology over syntax when they deviate from Standard German space insertion conventions.
- The findings cast further doubt on the meaningfulness of orthographic representation for word-based comparative linguistic research.

## Abstract

A lot of importance is indirectly attributed to the orthographic word: it constitutes the basis of any task that is preceded by tokenization and presents material for stimuli in psycholinguistic experiments. But in many writing traditions, the orthographic word is representative of isolated entries in the lexicon and largely ignores phonological processes of production. This study examines near-naive word separation in Swiss German using a corpus of text messages, revealing distinct patterns of orthographic segmentation driven by phonological processes such as assimilation and epenthesis. Compared to Standard German, Swiss German exhibits fewer orthographic words, suggesting heightened representation of prosodic dependencies in writing. Writers prioritize phonology over syntax when deviating from standard German space insertion conventions. These findings increase doubts about the meaningfulness of orthographic representation for word-based comparative linguistic research and highlight the importance of integrating phonological information into natural language processing models.
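As a minimal sketch of what counting orthographic words means once text is tokenized on whitespace, the snippet below compares one fused Swiss German spelling with its Standard German counterpart. The example pair (`hani gseit` / `habe ich gesagt`) and the helper name `orthographic_words` are illustrative assumptions; they are not taken from the paper's corpus or method.

```python
def orthographic_words(text: str) -> list[str]:
    """Split on whitespace; each resulting chunk counts as one orthographic word."""
    return text.split()


# Constructed illustration, not corpus data: the verb and the following
# pronoun are written as one chunk in the Swiss German spelling.
swiss_german = "hani gseit"
standard_german = "habe ich gesagt"

sg_count = len(orthographic_words(swiss_german))
std_count = len(orthographic_words(standard_german))

print(f"Swiss German: {sg_count} orthographic words")
print(f"Standard German: {std_count} orthographic words")
```

A whitespace tokenizer treats the fused form as a single token, which is why word-based comparisons across the two varieties depend on where writers choose to put spaces.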

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12919633/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12919633/full.md

## References

61 references — full list in the complete paper: https://tomesphere.com/paper/PMC12919633/full.md

---
Source: https://tomesphere.com/paper/PMC12919633