Text-as-Signal: Quantitative Semantic Scoring with Embeddings, Logprobs, and Noise Reduction
Hugo Moreira

TL;DR
This paper introduces a flexible pipeline that converts text corpora into quantitative semantic signals using embeddings, logprob scoring, and noise reduction, demonstrated on Portuguese AI news articles.
Contribution
The paper presents an adaptable framework that combines full-document embeddings, logprob-based scoring, and noise reduction for semantic analysis of large text datasets.
Findings
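The resulting identity space supports document-level semantic positioning and corpus-level characterization through aggregated profiles.
Qwen embeddings, UMAP, and a three-stage anomaly-detection procedure are integrated into a single workflow.
The workflow applies to AI engineering tasks such as corpus inspection and monitoring.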
Abstract
This paper presents a practical pipeline for turning text corpora into quantitative semantic signals. Each news item is represented as a full-document embedding, scored through logprob-based evaluation over a configurable positional dictionary, and projected onto a noise-reduced low-dimensional manifold for structural interpretation. In the present case study, the dictionary is instantiated as six semantic dimensions and applied to a corpus of 11,922 Portuguese news articles about Artificial Intelligence. The resulting identity space supports both document-level semantic positioning and corpus-level characterization through aggregated profiles. We show how Qwen embeddings, UMAP, semantic indicators derived directly from the model output space, and a three-stage anomaly-detection procedure combine into an operational text-as-signal workflow for AI engineering tasks such as corpus…
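The logprob-based scoring step in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each dimension of the positional dictionary is defined by two pole labels, and that a model call (not shown) returns a logprob for each pole. The dimension names, the pole labels, and the functions `pole_score`, `score_document`, and `corpus_profile` are all hypothetical.

```python
import math

# Hypothetical positional dictionary: dimension -> (low pole, high pole).
# These names are illustrative; the paper's six dimensions are not listed here.
POSITIONAL_DICTIONARY = {
    "tone": ("negative", "positive"),
    "scope": ("local", "global"),
}


def pole_score(logprob_low: float, logprob_high: float) -> float:
    """Map two pole logprobs to a score in [-1, 1].

    Renormalizing over the two poles gives p_high and p_low with
    p_high + p_low = 1; their difference p_high - p_low equals
    tanh((logprob_high - logprob_low) / 2).
    """
    return math.tanh((logprob_high - logprob_low) / 2.0)


def score_document(pole_logprobs: dict) -> dict:
    """Score one document.

    pole_logprobs has the assumed shape {dimension: {pole_label: logprob}}.
    """
    scores = {}
    for dim, (low, high) in POSITIONAL_DICTIONARY.items():
        lp = pole_logprobs[dim]
        scores[dim] = pole_score(lp[low], lp[high])
    return scores


def corpus_profile(doc_scores: list) -> dict:
    """Aggregate document scores into a per-dimension mean profile."""
    n = len(doc_scores)
    return {
        dim: sum(s[dim] for s in doc_scores) / n
        for dim in POSITIONAL_DICTIONARY
    }


# Usage with made-up logprobs for two documents.
docs = [
    {"tone": {"negative": -2.0, "positive": -0.5},
     "scope": {"local": -1.0, "global": -1.0}},
    {"tone": {"negative": -0.5, "positive": -2.0},
     "scope": {"local": -3.0, "global": -0.2}},
]
scored = [score_document(d) for d in docs]
profile = corpus_profile(scored)
```

Restricting the normalization to the two pole labels keeps the score independent of probability mass the model assigns to unrelated tokens. This is a design assumption of the sketch, not a claim about the paper.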
