Trace Reconstruction with Language Models

Franziska Weindel; Michael Girsch; Reinhard Heckel

arXiv:2507.12927·cs.LG·July 18, 2025

TL;DR

This paper introduces TReconLM, a novel approach using language models trained on synthetic and real data to improve trace reconstruction accuracy, especially in DNA data storage applications.

Contribution

It presents a new language model-based method for trace reconstruction that outperforms existing algorithms, including prior deep learning approaches.

Findings

1. TReconLM achieves higher sequence recovery rates than state-of-the-art methods.
2. Pretraining on synthetic data and fine-tuning on real data enhances performance.
3. The approach is effective in correcting errors in DNA data storage applications.

Abstract

The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by deletions, insertions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of the data retrieval process. In this work, we propose TReconLM, which leverages language models trained on next-token prediction for trace reconstruction. We pretrain language models on synthetic data and fine-tune on real-world data to adapt to technology-specific error patterns. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep learning approaches, recovering a…
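To make the problem setup concrete, the sketch below generates noisy traces of a DNA sequence under a simple insertion-deletion-substitution channel. It is an illustration only, not the paper's data pipeline: the names `corrupt` and `make_traces`, the default error rates, and the rule that each position suffers at most one error event are all assumptions made for this example.

```python
import random

ALPHABET = "ACGT"


def corrupt(seq, p_del, p_ins, p_sub, rng):
    """Return one noisy trace of seq (hypothetical channel: at most one event per base)."""
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            # Deletion: the base is dropped from the trace.
            continue
        elif r < p_del + p_ins:
            # Insertion: a random base is emitted before the original one.
            out.append(rng.choice(ALPHABET))
            out.append(base)
        elif r < p_del + p_ins + p_sub:
            # Substitution: the base is replaced by a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            # No error at this position.
            out.append(base)
    return "".join(out)


def make_traces(seq, n_traces, p_del=0.05, p_ins=0.05, p_sub=0.05, seed=0):
    """Generate n_traces independently corrupted copies of seq (assumed default rates)."""
    rng = random.Random(seed)
    return [corrupt(seq, p_del, p_ins, p_sub, rng) for _ in range(n_traces)]


if __name__ == "__main__":
    for trace in make_traces("ACGTTGCAACGT", n_traces=4):
        print(trace)
```

A trace reconstruction algorithm receives only the list returned by `make_traces` and must estimate the original sequence. Because deletions and insertions shift positions, the traces generally differ in length and cannot be compared index by index.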

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. This is a very important problem for improving DNA-based data storage retrieval speeds.
2. The results show clear improvement, and the paper is clearly written and easy to read.

Weaknesses

1. The paper's message and contribution are unclear: is it a new architecture, an insight into training strategies, or a way to fine-tune on real datasets?
2. Are the improvements over DNAFormer obtained by fine-tuning on the real data? Is DNAFormer also pretrained on synthetic data and fine-tuned on real data?

Reviewer 02·Rating 8·Confidence 4

Strengths

* The paper is well written and motivated. The results are clear: the authors' LM outperforms several algorithmic and deep learning competitors. They evaluate their method on both synthetic and real data, and they perform several experiments in different settings (e.g., noise rate, cluster size).
* The results also include a theoretical analysis of how transformers can solve the TR problem. The basic idea is to first look at logistic regression for this problem. Then extend this analysis

Weaknesses

* The theoretical analysis only considers substitution errors, which are often much easier to analyze than full edit distance errors. New theoretical results for insertions and deletions would strengthen the paper considerably, since this is a challenging theoretical problem.
* Similarly, the paper does not quite explain why the transformer they train works well for the full TR problem with insertions and deletions. I think this is kind of mysterious, given that there may be ma

Reviewer 03·Rating 6·Confidence 3

Strengths

- Simple formulation that beats baselines on synthetic and real data.
- Useful scaling-law study to pick model size at fixed FLOPs.
- Empirical interpretability via attention visualizations.

Weaknesses

- Results on real data use N≤10 and short L; quadratic attention over concatenated reads may become a bottleneck for large N or L. Inputs are an unordered set of traces, but the method imposes an order via concatenation; there is no explicit permutation invariance or ablation on trace order. Attention maps suggest same-index focus but don't guarantee invariance. The method is really data hungry, perhaps because of this. More on this in questions!
- The theory explains only substitution; the main challenging cases,

Code & Models

Models

mli-lab/TReconLM

Datasets

mli-lab/TReconLM_datasets
dataset·21 downloads

Taxonomy

Topics: Natural Language Processing Techniques