# 4.4.2 JEPA World Models

***

The **JEPA World Model** (JEPA-WM) paradigm, developed by Yann LeCun's group at Meta AI, separates representation learning from world-model learning and operates entirely in **abstract feature space** rather than pixel space. Unlike DreamerV3 (§4.3.8), which learns its world model in a pixel-derived latent space and requires reward signals, JEPA-WMs are trained via self-supervised feature prediction and plan against a cost function without any reward annotation (LeCun, 2022; Terver et al., 2026).

## The Three-Stage Pipeline

```
Stage 1 — JEPA Representation Learning (V-JEPA):
  Self-supervised feature prediction from unlabeled video
  → Rich visual encoder with no pixel reconstruction
  → Motion and appearance representations

Stage 2 — Action-Conditioned World Model (V-JEPA-2-AC):
  Given: current latent state z_t + action a_t
  Predict: next latent state ẑ_{t+1}
  Training: minimize latent prediction error over action-labeled sequences

Stage 3 — MPC Planning (CEM / MPPI):
  Optimize action sequence a_{0:T} such that
  predicted trajectory minimizes cost c(ẑ_0, ..., ẑ_T)
  No reward annotation required — cost function encodes task objective
```
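Stage 3 can be illustrated with a minimal sketch of the cross-entropy method (CEM) planning over a latent predictor. Everything here is an assumption made for illustration: `predict_next` is a hypothetical toy stand-in for a learned action-conditioned predictor, the latent is a 2-D point, and the cost is the squared distance between the final predicted latent and a goal latent. It is not the V-JEPA-2-AC implementation.

```python
import numpy as np

def predict_next(z, a):
    """Hypothetical stand-in for a learned predictor: z_{t+1} = f(z_t, a_t).

    Toy assumption: the latent is a 2-D point and the action nudges it.
    """
    return z + 0.1 * a

def rollout_cost(z0, actions, z_goal):
    """Roll the predictor forward and score the final predicted latent.

    Assumed cost: squared distance to a goal latent (no reward involved).
    """
    z = z0
    for a in actions:
        z = predict_next(z, a)
    return float(np.sum((z - z_goal) ** 2))

def cem_plan(z0, z_goal, horizon=5, action_dim=2, pop=64, n_elite=8, iters=10, seed=0):
    """Cross-entropy method: sample action sequences, refit a Gaussian to the best ones."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((pop, horizon, action_dim))
        samples = np.clip(samples, -1.0, 1.0)  # assumed action bounds
        costs = np.array([rollout_cost(z0, s, z_goal) for s in samples])
        elite = samples[np.argsort(costs)[:n_elite]]  # lowest-cost sequences
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6  # keep a little exploration noise
    return mean

z0 = np.zeros(2)
z_goal = np.array([0.3, -0.2])
plan = cem_plan(z0, z_goal)
baseline_cost = rollout_cost(z0, np.zeros((5, 2)), z_goal)  # cost of doing nothing
planned_cost = rollout_cost(z0, plan, z_goal)
print(f"do-nothing cost: {baseline_cost:.4f}, planned cost: {planned_cost:.4f}")
```

In a model-predictive control loop, only the first action of `plan` would typically be executed before re-encoding the new observation and replanning.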

**The key design principle**, from LeCun's 2022 position paper on autonomous machine intelligence (AMI), is to predict in *representation* space, not pixel space. A world model that predicts abstract features (motion trajectories, object identity, spatial relationships) forces the system to reason about what matters semantically rather than about irrelevant rendering details.

## Engineering Design Guide

The paper "What Drives Success in Physical Planning with Joint-Embedding Predictive World Models" (Terver, Yang, Ponce, Bardes, and LeCun, 2026) provides the first systematic empirical design guide for JEPA-WMs, evaluating each of the design dimensions summarized below (Terver et al., 2026):

| Design Dimension       | Finding                                                                                                                                        |
| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| Predictor architecture | AdaLN + RoPE consistently best; AdaLN-zero variant also competitive                                                                            |
| Training objective     | Multistep rollout >> single-step                                                                                                               |
| Visual encoder         | Best encoder varies by task: DINOv2-S for simulated navigation; DINOv2-L for real-world manipulation. V-JEPA variants competitive across tasks |
| Proprioception         | Combined visual + joint state >> either alone                                                                                                  |
| Planning optimizer     | CEM best for both manipulation and navigation (MPPI is evaluated as a CEM variant)                                                             |
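The difference between the single-step and multistep rollout objectives in the table can be sketched as follows. This is a hedged illustration, not the paper's training code: the linear `predictor`, the mean-squared-error loss, and the random data are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def predictor(z, a, W):
    """Hypothetical linear predictor mapping (z_t, a_t) to a next-latent estimate."""
    return np.concatenate([z, a]) @ W

def single_step_loss(latents, actions, W):
    """Teacher-forced: every prediction starts from the encoded target latent."""
    errs = [np.mean((predictor(latents[t], actions[t], W) - latents[t + 1]) ** 2)
            for t in range(len(actions))]
    return float(np.mean(errs))

def multistep_rollout_loss(latents, actions, W):
    """Autoregressive: the predictor is fed its own output, so errors compound
    during training the same way they do at planning time."""
    z_hat = latents[0]
    errs = []
    for t in range(len(actions)):
        z_hat = predictor(z_hat, actions[t], W)
        errs.append(np.mean((z_hat - latents[t + 1]) ** 2))
    return float(np.mean(errs))

# Toy data (assumed shapes): T transitions, latent dim d_z, action dim d_a.
rng = np.random.default_rng(0)
T, d_z, d_a = 4, 3, 2
latents = rng.standard_normal((T + 1, d_z))
actions = rng.standard_normal((T, d_a))
W = 0.1 * rng.standard_normal((d_z + d_a, d_z))

loss_single = single_step_loss(latents, actions, W)
loss_multi = multistep_rollout_loss(latents, actions, W)
print(f"single-step: {loss_single:.4f}, multistep rollout: {loss_multi:.4f}")
```

The two losses coincide for a one-step sequence and diverge as the horizon grows. One plausible reading of the table's finding is that the rollout objective matches training to how the planner uses the model, since the planner unrolls it over many steps.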

## V-JEPA 2.1 — Dense Features (2026)

**V-JEPA 2.1** (Mur-Labadia et al., 2026) extended the JEPA-WM architecture to produce **dense spatial features**, a prerequisite for fine-grained manipulation. Its main innovations are a dense predictive loss (both visible and masked tokens contribute), hierarchical self-supervision across encoder layers, and multi-modal image+video tokenizers.

Results: **+20 percentage points in real-robot grasping success** over V-JEPA-2-AC baselines; state-of-the-art results on Ego4D action anticipation and TartanDrive navigation (Mur-Labadia et al., 2026).
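The idea behind a dense predictive loss, where visible tokens contribute alongside masked ones, can be sketched as below. This page does not give the exact formulation used in V-JEPA 2.1, so the L1 distance, the `visible_weight` parameter, and the function names are assumptions made purely for illustration.

```python
import numpy as np

def per_token_error(pred, target):
    """Mean absolute error per token (L1 distance is an assumption)."""
    return np.abs(pred - target).mean(axis=1)

def masked_only_loss(pred, target, mask):
    """Masked-prediction baseline: only masked tokens are supervised."""
    return float(per_token_error(pred, target)[mask].mean())

def dense_predictive_loss(pred, target, mask, visible_weight=1.0):
    """Dense variant: visible tokens also contribute to the loss.

    pred, target: arrays of shape (num_tokens, dim)
    mask: boolean array of shape (num_tokens,), True where the token was masked
    visible_weight: hypothetical weighting term, not taken from the paper
    """
    err = per_token_error(pred, target)
    return float(err[mask].mean() + visible_weight * err[~mask].mean())

# Toy example: 8 tokens of dimension 4, half of them masked.
rng = np.random.default_rng(0)
pred = rng.standard_normal((8, 4))
target = rng.standard_normal((8, 4))
mask = np.array([True, False, True, False, True, False, True, False])

loss_masked = masked_only_loss(pred, target, mask)
loss_dense = dense_predictive_loss(pred, target, mask)
print(f"masked-only: {loss_masked:.4f}, dense: {loss_dense:.4f}")
```

Setting `visible_weight=0.0` recovers the masked-only loss, which makes the extra supervision on visible tokens easy to isolate in an ablation.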

## Comparison: DreamerV3 (§4.3.8) vs. JEPA-WM

| Property                | DreamerV3                      | JEPA-WM                                                       |
| ----------------------- | ------------------------------ | ------------------------------------------------------------- |
| Representation          | Reconstructive (pixel decoder) | Predictive (feature space only)                               |
| World model training    | Reward + reconstruction loss   | Feature prediction (no reward; action labels in Stage 2 only) |
| Planning                | Actor-critic in imagination    | CEM/MPPI in latent space                                      |
| Supervision             | Reward signal required         | Cost function only (reward-free)                              |
| Physical robot grasping | Not primary benchmark          | +20 pp over V-JEPA-2-AC baselines (V-JEPA 2.1, 2026)          |
| Symbolic integration    | Research pairing (PDDL)        | Research pairing (PDDL)                                       |

**Architectural classification:** JEPA-WMs reside in §4.4 (Pure Neural) rather than §4.3 (Hybrid) because they contain no explicit formal symbolic representations — planning operates entirely in continuous learned space. DreamerV3 remains in §4.3 because it is frequently combined with symbolic task planners in research systems, with the latent model handling continuous dynamics and a PDDL planner selecting sub-goal sequences.

## The Neuro-Symbolic Perspective

JEPA-WMs are not neuro-symbolic in the book's sense. Their planning component (CEM/MPPI over learned latent dynamics) is a continuous optimization, not a symbolic search. The critical unresolved question is whether learned world models can ever satisfy the formal correctness guarantees that symbolic planners provide — or whether formal verification (§5.2) is ultimately necessary for safety-critical applications. This tension is examined in §5.1.

***

> **When to Use §4.4.2 (JEPA World Models)**
>
> Use JEPA-WMs when all three of the following hold:
>
> 1. **Physical planning from perception is required** — the task involves reasoning about continuous physical dynamics (manipulation, navigation, locomotion) from visual or proprioceptive inputs.
> 2. **Labeled rewards are unavailable or expensive to obtain** — you cannot instrument the environment with a dense reward signal, but you have access to unlabeled video observations of the relevant dynamics.
> 3. **A cost function can encode the task objective** — you can specify what "success" looks like geometrically (reach target pose, minimize contact force, stay within a region) without defining a reward schedule.
>
> **Do not use JEPA-WMs when:** provably correct outputs are required and the cost function cannot be formally verified; the domain is discrete and symbolic (planning with PDDL, theorem proving, constraint satisfaction); or you need formal safety guarantees rather than empirically good performance. In those cases, §4.1–4.3 are more appropriate, potentially combined with a JEPA-style perceptual front-end.

***

## References

1. LeCun, Yann. "A Path Towards Autonomous Machine Intelligence." *OpenReview*, Version 0.9.2, June 2022. <https://openreview.net/forum?id=BZ5a1r-kVsf>
2. Terver, Basile, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, and Yann LeCun. "What Drives Success in Physical Planning with Joint-Embedding Predictive World Models." *Transactions on Machine Learning Research (TMLR)*, 2026. arXiv:2512.24497. <https://arxiv.org/abs/2512.24497> | Code: <https://github.com/facebookresearch/jepa-wms>
3. Mur-Labadia, Lorenzo, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. "V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning." *arXiv:2603.14482*, March 2026. <https://arxiv.org/abs/2603.14482>
4. Bardes, Adrien, et al. "Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)." *arXiv:2404.08471*, February 2024. <https://arxiv.org/abs/2404.08471>

