Text-as-Signal: Quantitative Semantic Scoring with Embeddings, Logprobs, and Noise Reduction

Hugo Moreira

arXiv:2604.13056·cs.CL·April 16, 2026


TL;DR

This paper introduces a flexible pipeline that converts text corpora into quantitative semantic signals using embeddings, logprob scoring, and noise reduction, demonstrated on Portuguese AI news articles.

Contribution

The paper presents a configurable framework that combines full-document embeddings, logprob-based scoring over a positional dictionary, and noise-reduced low-dimensional projection for the semantic analysis of large text corpora.

Findings

01 The resulting identity space supports both document-level semantic positioning and corpus-level characterization.

02 Qwen embeddings, UMAP, and a three-stage anomaly-detection procedure are integrated into a single operational workflow.

03 The workflow applies to AI engineering tasks such as corpus inspection and monitoring.

Abstract

This paper presents a practical pipeline for turning text corpora into quantitative semantic signals. Each news item is represented as a full-document embedding, scored through logprob-based evaluation over a configurable positional dictionary, and projected onto a noise-reduced low-dimensional manifold for structural interpretation. In the present case study, the dictionary is instantiated as six semantic dimensions and applied to a corpus of 11,922 Portuguese news articles about Artificial Intelligence. The resulting identity space supports both document-level semantic positioning and corpus-level characterization through aggregated profiles. We show how Qwen embeddings, UMAP, semantic indicators derived directly from the model output space, and a three-stage anomaly-detection procedure combine into an operational text-as-signal workflow for AI engineering tasks such as corpus…
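
The logprob-based scoring step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the dimension names, pole labels, and the two-pole normalization are assumptions made here for illustration (the abstract says only that the dictionary has six semantic dimensions), and the log-probabilities are assumed to have already been read from a model's output for each candidate label. The sketch turns each pair of label log-probabilities into a signed score and averages document scores into a corpus-level profile.

```python
import math
from statistics import fmean

# Hypothetical positional dictionary: dimension -> (positive pole, negative pole).
# These names are invented for illustration; the paper's six dimensions
# are not listed in the abstract.
DIMENSIONS = {
    "tone": ("optimistic", "pessimistic"),
    "scope": ("technical", "societal"),
}


def pole_score(logp_pos: float, logp_neg: float) -> float:
    """Map two label log-probabilities to a signed score in [-1, 1].

    Assumption: probabilities are renormalized over the two poles only
    (a two-way softmax), giving p for the positive pole, and the score is
    2p - 1. Algebraically this equals tanh((logp_pos - logp_neg) / 2).
    """
    return math.tanh((logp_pos - logp_neg) / 2.0)


def score_document(label_logprobs: dict) -> dict:
    """Score one document on every dimension.

    label_logprobs has the shape {dimension: {label: logprob}}.
    """
    return {
        dim: pole_score(label_logprobs[dim][pos], label_logprobs[dim][neg])
        for dim, (pos, neg) in DIMENSIONS.items()
    }


def corpus_profile(doc_scores: list) -> dict:
    """Aggregate per-document scores into a mean profile per dimension."""
    return {dim: fmean(d[dim] for d in doc_scores) for dim in DIMENSIONS}


# Made-up log-probabilities for two documents (not real model output).
EXAMPLE_LOGPROBS = [
    {
        "tone": {"optimistic": math.log(0.75), "pessimistic": math.log(0.25)},
        "scope": {"technical": math.log(0.5), "societal": math.log(0.5)},
    },
    {
        "tone": {"optimistic": math.log(0.25), "pessimistic": math.log(0.75)},
        "scope": {"technical": math.log(0.9), "societal": math.log(0.1)},
    },
]

if __name__ == "__main__":
    scores = [score_document(d) for d in EXAMPLE_LOGPROBS]
    print(scores)
    print(corpus_profile(scores))
```

Using the difference of log-probabilities keeps the score independent of any probability mass the model assigns to labels outside the two poles. Whether the paper normalizes this way is not stated in the abstract.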
