RSSM: deterministic backbone + stochastic "belief" (with code)
We need stable rollouts and uncertainty
The last blog concluded with the limitations of the VRNN paper. To address them, the field has converged on a specific architecture known as the Recurrent State Space Model (RSSM), popularized mainly by the PlaNet and Dreamer papers. Why? It bridges the gap between deterministic RNNs and stochastic VAEs.
I’m not going to provide a breakdown of the RSSM paper; instead, I will simply state how it addresses the above research gap.
The RSSM explicitly splits the state S_t into two parts:
Deterministic State (h_t): The memory → captures the historical context (implemented as a GRU/RNN)
Stochastic State (z_t): The uncertainty → captures the “randomness” or information we can’t predict perfectly (implemented as a VAE latent variable)
Why split? Because we want stable rollouts (deterministic backbone), and uncertainty when necessary (stochastic part).
Therefore, there are three essential components:
Transition Model (The Prior / “The Dreamer”): Predicts the next state without seeing the future.
\(h_t = \text{GRU}(h_{t-1}, z_{t-1}, a_{t-1})\)
\(\hat{z}_t \sim p(z_t \mid h_t)\), where the network predicts a Gaussian \((\mu, \sigma)\)
Representation Model (The Posterior / “The Observer”): Corrects the state after seeing the image x_t.
\(z_t \sim q(z_t \mid h_t, x_t)\)
NOTE: This is the “Kalman update” equivalent. It uses the ground truth pixel data to pinpoint the true state.
Observation Model (The Decoder): Reconstructs the world to prove we understood it.
\(\hat{x}_t \sim p(x_t \mid h_t, z_t)\)
The key loop here:
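During imagination rollouts, we can repeat the two Transition Model steps (the GRU update and the prior sample) without any observations. During training, we use the Representation Model (the posterior) to correct the state with the observed image.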
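To make the loop concrete, here is a minimal NumPy sketch of one rollout under stated assumptions. It is not the linked implementation: the weights are random and untrained, the layer sizes (`H`, `Z`, `A`, `X`) are made up for illustration, the recurrent cell is a simplified GRU-style update, and every function name is hypothetical. It only shows how the prior, posterior, and decoder plug into the same recurrent step.

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z, A, X = 16, 4, 2, 32 * 32  # sizes of h_t, z_t, a_t, flattened image (illustrative)

def linear(n_in, n_out):
    """Random, untrained weights standing in for learned parameters."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

W_gru = {k: linear(H + Z + A, H) for k in ("update", "reset", "cand")}
W_prior = linear(H, 2 * Z)       # h_t      -> (mu, raw sigma)
W_post = linear(H + X, 2 * Z)    # h_t, x_t -> (mu, raw sigma)
W_dec = linear(H + Z, X)         # h_t, z_t -> mean of x_t

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru(h, z, a):
    """Deterministic update: h_t = GRU(h_{t-1}, z_{t-1}, a_{t-1})."""
    inp = np.concatenate([h, z, a])
    u = sigmoid(inp @ W_gru["update"][0] + W_gru["update"][1])
    r = sigmoid(inp @ W_gru["reset"][0] + W_gru["reset"][1])
    cand_in = np.concatenate([r * h, z, a])
    c = np.tanh(cand_in @ W_gru["cand"][0] + W_gru["cand"][1])
    return (1.0 - u) * h + u * c

def gaussian(params):
    """Split a head's output into a mean and a strictly positive std (softplus)."""
    mu, raw = np.split(params, 2)
    return mu, np.log1p(np.exp(raw)) + 1e-3

def sample(mu, sigma):
    return mu + sigma * rng.standard_normal(mu.shape)

def prior(h):
    """Transition model: p(z_t | h_t), no observation needed."""
    return gaussian(h @ W_prior[0] + W_prior[1])

def posterior(h, x):
    """Representation model: q(z_t | h_t, x_t), corrected by the image."""
    return gaussian(np.concatenate([h, x]) @ W_post[0] + W_post[1])

def decode(h, z):
    """Observation model: mean of p(x_t | h_t, z_t)."""
    return np.concatenate([h, z]) @ W_dec[0] + W_dec[1]

def rollout(actions, observations=None):
    """actions[t] plays the role of a_{t-1} for step t."""
    h, z = np.zeros(H), np.zeros(Z)
    recons = []
    for t, a in enumerate(actions):
        h = gru(h, z, a)
        if observations is None:
            z = sample(*prior(h))                       # imagination: prior only
        else:
            z = sample(*posterior(h, observations[t]))  # correction from x_t
        recons.append(decode(h, z))
    return np.stack(recons)
```

Calling `rollout(actions)` imagines forward from the prior alone, while `rollout(actions, observations)` runs the same recurrence but samples the stochastic state from the posterior at every step.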
Now, let’s move beyond the theory and train an RSSM on a simple moving-dot agent. We will first implement a data generator for the moving dot, then introduce occlusion, and subsequently incorporate control. Finally, we will train the agent using the RSSM architecture.
Environment: Synthetic-Dot
State: (x, y, vx, vy) → position and velocity in a 2D grid.
Dynamics
Boundary condition: at the edge of the grid, the dot either reflects (bounces) or is clipped.
The observation is a 32x32 (or 64x64) grayscale image with a Gaussian-blurred dot.
For occlusion (partial observability), we have optional rectangular masks that persist for multiple frames. They only affect the observations, not the ground-truth state.
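The description above can be sketched as a small NumPy data generator. This is an illustrative sketch, not the linked code: the constant-velocity update, the treatment of the action as an acceleration, the blur width, the occluder format `(t0, t1, r0, r1, c0, c1)`, and all function names are assumptions made for the example.

```python
import numpy as np

def step_state(state, size, mode="reflect", action=(0.0, 0.0)):
    """Advance (x, y, vx, vy) by one frame, then apply the boundary rule.

    Assumes per-frame speeds smaller than the grid size, so a single
    reflection is enough to get back inside.
    """
    x, y, vx, vy = state
    vx, vy = vx + action[0], vy + action[1]  # assumed control: action = acceleration
    x, y = x + vx, y + vy
    hi = size - 1.0
    if mode == "reflect":
        # Mirror the position back inside the grid and flip the velocity.
        if x < 0.0 or x > hi:
            x = -x if x < 0.0 else 2.0 * hi - x
            vx = -vx
        if y < 0.0 or y > hi:
            y = -y if y < 0.0 else 2.0 * hi - y
            vy = -vy
    else:
        # "clip": pin the dot to the wall and keep its velocity.
        x, y = min(max(x, 0.0), hi), min(max(y, 0.0), hi)
    return np.array([x, y, vx, vy])

def render(state, size=32, blur=1.5):
    """Grayscale image with a Gaussian-blurred dot centered at (x, y)."""
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (xs - state[0]) ** 2 + (ys - state[1]) ** 2
    return np.exp(-d2 / (2.0 * blur ** 2))

def generate_episode(T=20, size=32, mode="reflect", occluder=None, seed=0):
    """Return ground-truth states (T, 4) and observations (T, size, size)."""
    rng = np.random.default_rng(seed)
    state = np.concatenate([rng.uniform(4, size - 5, 2), rng.uniform(-2, 2, 2)])
    states, frames = [], []
    for t in range(T):
        frame = render(state, size)
        if occluder is not None:
            t0, t1, r0, r1, c0, c1 = occluder
            if t0 <= t < t1:
                frame[r0:r1, c0:c1] = 0.0  # mask the image only, never the state
        states.append(state)
        frames.append(frame)
        state = step_state(state, size, mode)
    return np.stack(states), np.stack(frames)
```

For example, `generate_episode(occluder=(5, 10, 8, 24, 8, 24))` blanks a 16x16 rectangle for frames 5 to 9 while the returned states stay identical to the unoccluded episode with the same seed.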
RSSM Architecture
┌─────────────────────────────────────────────────────────┐
│ At each timestep t │
└─────────────────────────────────────────────────────────┘
obs_t ──► Encoder ──► embed_t
│
▼
(h, a_t) ──────────► Prior ─────► z_prior ~ N(μ_prior, σ_prior) [predict without obs]
│
(h, embed_t) ─────► Posterior ─► z ~ N(μ_post, σ_post) [infer from obs]
│
▼
(h, z) ───────────► Decoder ───► recon_t
│
concat(z, a_t), h ─► GRU ──────► h'
You can find a minimal implementation of RSSM here: RSSM CODE. You can train the RSSM on a moving-dot environment or adapt it to your environment of choice by defining the observations, occlusions, and control.
Additionally, you can find the precursors for the classical/neural world models here: PRECURSORS CODE
And here’s a sample world model: MINIVERSE CODE
In the next blog, I’ll discuss posterior collapse and representation leakage in this family of models.
~Ashutosh

