TeNet: Text-to-Network for Compact Policy Synthesis
Ariyan Bighashdel, Kevin Sebastian Luck

TL;DR
TeNet is a framework that creates compact, task-specific robot policies from natural language descriptions by leveraging large language models and hypernetworks, enabling efficient real-time control with strong multi-task performance.
Contribution
Introduces TeNet, a novel text-to-network approach that generates lightweight, executable robot policies from language, combining pretrained LLMs with hypernetworks for efficient, generalizable control.
Findings
Policies are significantly smaller than sequence-based baselines.
Achieves strong performance in multi-task and meta-learning settings.
Supports high-frequency, real-time control.
Abstract
Robots that follow natural-language instructions often either plan at a high level using hand-designed interfaces or rely on large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework for instantiating compact, task-specific robot policies directly from natural language descriptions. TeNet conditions a hypernetwork on text embeddings produced by a pretrained large language model (LLM) to generate a fully executable policy, which then operates solely on low-dimensional state inputs at high control frequencies. By using the language only once at the policy instantiation time, TeNet inherits the general knowledge and paraphrasing robustness of pretrained LLMs while remaining lightweight and efficient at execution time. To improve generalization, we optionally ground language in behavior during training by aligning text…
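The text-to-network idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the random vector stands in for a frozen LLM embedding of the task description, all dimensions and class/function names (`HyperNetwork`, `policy_act`) are hypothetical, and only the two-layer hypernetwork with hidden size 128 follows a reviewer's description of the architecture. The sketch shows the text embedding being consumed once to generate policy weights, after which the policy acts on low-dimensional state alone.

```python
import math
import random


def linear(x, W, b):
    """Compute W @ x + b for a row-major weight matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_linear(rng, n_in, n_out, scale=0.1):
    """Randomly initialised (untrained) layer, for illustration only."""
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class HyperNetwork:
    """Two-layer MLP mapping a text embedding to the weights of a small policy."""

    def __init__(self, embed_dim, hidden_dim, state_dim, policy_hidden, action_dim, seed=0):
        rng = random.Random(seed)
        # Shapes (n_out, n_in) of the generated policy: state -> hidden -> action.
        self.shapes = [(policy_hidden, state_dim), (action_dim, policy_hidden)]
        self.n_params = sum(n_out * n_in + n_out for n_out, n_in in self.shapes)
        self.l1 = init_linear(rng, embed_dim, hidden_dim)
        self.l2 = init_linear(rng, hidden_dim, self.n_params)

    def generate(self, text_embedding):
        """Produce policy layers [(W, b), ...] from a single text embedding."""
        h = [max(0.0, v) for v in linear(text_embedding, *self.l1)]  # ReLU
        flat = linear(h, *self.l2)
        layers, k = [], 0
        for n_out, n_in in self.shapes:
            W = [flat[k + r * n_in: k + (r + 1) * n_in] for r in range(n_out)]
            k += n_out * n_in
            b = flat[k: k + n_out]
            k += n_out
            layers.append((W, b))
        return layers


def policy_act(layers, state):
    """Run the generated policy on a low-dimensional state; no text is involved."""
    (W1, b1), (W2, b2) = layers
    h = [math.tanh(v) for v in linear(state, W1, b1)]
    return [math.tanh(v) for v in linear(h, W2, b2)]


# Assumption: a random vector stands in for the pretrained LLM's text embedding.
emb_rng = random.Random(1)
text_embedding = [emb_rng.gauss(0.0, 1.0) for _ in range(16)]

hyper = HyperNetwork(embed_dim=16, hidden_dim=128, state_dim=4, policy_hidden=8, action_dim=2)
policy = hyper.generate(text_embedding)              # language is used once, here
action = policy_act(policy, [0.1, -0.2, 0.3, 0.0])   # control loop sees state only
```

In this sketch the per-step cost of `policy_act` depends only on the size of the generated policy, not on the text encoder, which is the property the abstract relies on for high-frequency control.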
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Interesting idea
- Good presentation
- Strong results: high-frequency, efficient policies.
The main weaknesses are misleading scope and overstated claims:
- The paper's core motivation, which frames TeNet as complementary to large VLAs, is misleading, since VLAs operate in complex, high-diversity, vision-based, real-world scenarios, whereas TeNet is evaluated only in state-based simulation environments.
- The claim that the framework makes “TeNet more scalable and practical for diverse task sets” is not supported by experiments.
- The language instruction and task diversity are very l
1. Clear Novelty: The paper's core idea, using a language-conditioned hypernetwork to synthesize a compact policy network, is novel and well-motivated. The related work section is thorough, successfully positioning this as a new paradigm distinct from large VLAs, trajectory-prompted models, and other existing hypernetworks. The successful experience of implementing hypernetworks is also valuable to the community.
2. Efficiency: The results in Table 1 are highly compelling: the framework generates
1. The current framework operates on state-based inputs. This is a reasonable first step, but it limits the immediate applicability of the work, as most modern embodied agents are expected to operate from vision.
2. No trajectory-based hypernetwork baseline is provided. Although TeNet is the first attempt to directly use language descriptions to generate policies, a comparison with trajectory-based hypernetworks would be appreciated.
3. A related work for trajectory-based hypernetworks is missing
1. The paper introduces an interesting paradigm in which a hypernetwork conditioned on natural-language embeddings generates the weights of compact, task-specific policy networks.
2. The paper is clearly written, well structured, and easy to follow.
3. Hyperparameter and experimental details are provided thoroughly, supporting reproducibility.
4. Empirical results show consistent improvements over Decision Transformer baselines across the evaluated benchmarks.
1. Architecture justification: The authors use a large-scale language model (LLaMA-8B) as the text encoder, followed by a two-layer MLP hypernetwork (hidden size = 128) to produce compact policy weights. Given the limited number of unique task descriptions (roughly 50 × 10 = 500 text embeddings), this setup appears unbalanced: heavy on preprocessing and light on the core policy backbone. The authors should justify why such a large encoder is necessary and whether smaller models (e.g., T5-sma
This paper is clearly presented, and the results seem reasonable. I appreciate the detailed training iteration in Figure 2 and ablations in Table 1. The implementation details in the appendix are also very helpful.
There are two primary concerns, related to the clarity of the method and the experiments:
1. The paper’s main contribution appears to be a novel way of integrating language embeddings with robot state inputs and action outputs through a hypernetwork-generated policy. However, several prior works (e.g., π₀ [1], RoboVLMs [2]) have already explored similar design choices. It would strengthen the paper to include direct comparisons against these baselines (not necessarily the ones I mentioned ear
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning
