TL;DR
NeuralOS is a neural framework that simulates operating system GUIs by predicting screen images from user inputs, enabling realistic GUI sequence generation and application simulation.
Contribution
The paper introduces NeuralOS, which combines an RNN with a diffusion-based renderer to simulate GUIs, and demonstrates that training on synthetic data lets the model simulate applications it has never seen.
Findings
Successfully generates realistic GUI sequences
Accurately predicts mouse interactions and state transitions
Can simulate applications not present in training data
Abstract
We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Beyond reproducing existing systems, NeuralOS shows that synthesized training data can teach the model to simulate applications that were never installed, as illustrated by a…
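The split described in the abstract, a recurrent model that tracks state and a renderer that turns that state into a frame, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: `InputEvent`, `ToyStateTracker`, `toy_render`, and `simulate` are hypothetical names, the recurrent cell is a plain Elman-style update with random weights, and the "renderer" only mimics the shape of iterative refinement from noise rather than a learned diffusion model.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class InputEvent:
    """One user input step (hypothetical encoding, not from the paper)."""
    x: float      # cursor x, normalised to [0, 1]
    y: float      # cursor y, normalised to [0, 1]
    click: bool   # whether a mouse button was pressed
    key: int      # key code in [0, 255]; 0 means no key


class ToyStateTracker:
    """Elman-style cell: h_t = tanh(W_h h_{t-1} + W_x x_t), untrained."""

    def __init__(self, hidden_size=8, input_size=4, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                    for _ in range(hidden_size)]
        self.w_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                    for _ in range(hidden_size)]

    def initial_state(self):
        return [0.0] * self.hidden_size

    def step(self, h, event):
        x = [event.x, event.y, float(event.click), event.key / 255.0]
        return [
            math.tanh(
                sum(w * hj for w, hj in zip(self.w_h[i], h))
                + sum(w * xj for w, xj in zip(self.w_x[i], x))
            )
            for i in range(self.hidden_size)
        ]


def toy_render(h, width=4, height=3, steps=5, seed=0):
    """Placeholder renderer: start from noise, refine toward a
    state-conditioned target. A real diffusion model would use a
    learned denoiser instead of this fixed interpolation."""
    rng = random.Random(seed)
    target = [[0.5 + 0.5 * h[(r * width + c) % len(h)] for c in range(width)]
              for r in range(height)]
    frame = [[rng.random() for _ in range(width)] for _ in range(height)]
    for _ in range(steps):
        frame = [[p + 0.5 * (t - p) for p, t in zip(frow, trow)]
                 for frow, trow in zip(frame, target)]
    return frame


def simulate(events):
    """Run the loop: update the state per event, then render one frame."""
    tracker = ToyStateTracker()
    h = tracker.initial_state()
    frames = []
    for event in events:
        h = tracker.step(h, event)
        frames.append(toy_render(h))
    return frames
```

Calling `simulate` with a list of `InputEvent` objects returns one small grayscale frame per event. The point of the sketch is only the control flow: the state carries history across steps, while rendering depends on the current state alone.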
Peer Reviews
Decision: ICLR 2026 Poster
This work provides a clear and detailed description of the model architecture, input feature encoding, and multi-stage training strategy to ensure rendering quality. The use of LLMs to autonomously collect interaction data, thereby removing human involvement, is particularly interesting. Overall, the paper demonstrates a promising approach to constructing a world model for operating systems, capable of generating real-time screens based on user interactions.
#1 The paper should better position its work within the context of existing work. For instance, although video generation is briefly mentioned (line 146), no specific papers are cited. Similarly, despite the discussion of world models in Section L, the main text does not situate NeuralOS among world models. The authors should clarify how their approach to modeling long-term trajectories and user actions differs from prior methods used in video generation and world modeling.
#2 The discussion of
* The paper is clearly written and well organized; the technical components are formally defined and adequately justified, which makes the work easy to follow.
* The proposed architectures are coherently structured and integrated.
* The paper represents an original contribution, as generative simulation of operating systems remains novel and unexplored, though it draws some parallels with prior work on generative simulation in other fields (e.g., gaming).
* Code is given to reviewers, allowing
* The motivation remains vague throughout the paper. The key questions left to be answered are: “How can this contribution advance the field?” and “Why do we need to simulate OSs?”. A more explicit problem statement, together with expected applications (e.g., human-computer interaction research, AI agents), would strengthen the paper’s significance.
* No closely related previous work is discussed; hence, no competitors are referenced. The authors only cite examples from game or real-time simulati
1. Cleanly splits state tracking (hierarchical RNN with attention over the previous frame) from image synthesis, which makes the problem well-posed for long, interactive sequences.
1. This paper is very instructive in providing a recipe for training video-based world models in general. There are several important tricks here that are more broadly applicable than training a neural OS world model. For instance, the model architecture for long memory, multi-stage training, combination of tricks in
1. You can't really use the final artifact for much. If you want to train a model to use an operating system, you'd rather just use the Docker container. I'm finding it a bit hard to justify the contribution of this paper beyond that it is a very interesting demo and some of the tricks used to collect the data and make the world-model work.
1. The demonstration paths through the Docker OS to collect training data were generated by a computer-use agent (Claude-3.5-Sonnet). This is obviously very
