
# 2.2 Modern Neuro-Symbolic Systems

The historical systems above are purely classical. They demonstrate that formal planning works at scale. The systems we turn to now are different: they combine neural components with classical planning, and in doing so gain capabilities that neither approach achieves alone.

## LLM+P — Bridging Natural Language and Planning (2023)

**LLM+P** (*LLM + Planner*) is the canonical example of the "LLM as translator" pattern (Liu et al., 2023).

The core insight is elegant: LLMs are remarkably good at translating between natural language and structured representations, even for formal languages such as PDDL. Classical planners are remarkably good at solving PDDL problems once they are correctly specified. Why not combine them?

LLM+P's pipeline:

1. The user describes a planning problem in natural language.
2. An LLM (the original study evaluated both GPT-3.5 and GPT-4) translates the natural language description into a PDDL problem file, using the hand-written PDDL domain file and an example problem as context.
3. A classical planner (Fast Downward) solves the PDDL problem.
4. The planner's solution is translated back into natural language for the user.

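The result is a system that accepts human-readable problem descriptions and returns plans that are provably valid for the generated PDDL problem. The neural component handles the messy, ambiguous, semantically rich natural language. The symbolic component provides the guarantee of correctness, with the caveat that the guarantee is only as good as the LLM's translation, which is not itself verified.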
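The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the LLM+P codebase: `translate_to_problem` is a hypothetical stand-in for the LLM call and returns a fixed problem, the two-room domain is invented, and a small breadth-first search stands in for Fast Downward.

```python
from collections import deque

# Hand-written domain (assumption: a toy two-room "move a box" domain).
# Each action maps to (preconditions, add effects, delete effects).
DOMAIN = {
    "go_a_b": ({"robot_in_a"}, {"robot_in_b"}, {"robot_in_a"}),
    "go_b_a": ({"robot_in_b"}, {"robot_in_a"}, {"robot_in_b"}),
    "pick_in_a": ({"robot_in_a", "box_in_a"}, {"holding"}, {"box_in_a"}),
    "drop_in_b": ({"robot_in_b", "holding"}, {"box_in_b"}, {"holding"}),
}

def translate_to_problem(description: str) -> dict:
    """Hypothetical stand-in for the LLM: returns a fixed init/goal pair."""
    return {"init": frozenset({"robot_in_a", "box_in_a"}),
            "goal": frozenset({"box_in_b"})}

def plan(domain, problem):
    """Breadth-first search over states; a stand-in for a real planner."""
    start = problem["init"]
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, steps = frontier.popleft()
        if problem["goal"] <= state:
            return steps
        for name, (pre, add, delete) in domain.items():
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None  # no plan exists in this domain

def solve(description: str) -> str:
    """Natural language in, natural language out (steps 1 to 4)."""
    steps = plan(DOMAIN, translate_to_problem(description))
    return "unsolvable" if steps is None else ", then ".join(steps)
```

Note that the search is only sound relative to whatever `translate_to_problem` returns; a mistranslated goal yields a valid plan for the wrong problem.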

**Limitation:** LLM+P requires that the domain be known and pre-specified in advance. It is not designed for problems where the domain itself must be learned or discovered at runtime.

*Reference:* Liu, Bo, et al. "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency." *arXiv preprint* arXiv:2304.11477 (2023). <https://arxiv.org/abs/2304.11477> | Code: <https://github.com/Cranial-XIX/llm-pddl>

***

## SayCan — Grounding Language in Robotic Affordances (2022)

**SayCan** (*"Do As I Can, Not As I Say"*) from Google Robotics addresses a fundamental challenge in robot planning: an LLM might propose a sequence of actions that sounds reasonable in English but is physically infeasible for the specific robot in its current environment (Ahn et al., 2022).

SayCan combines two sources of knowledge: semantic knowledge from an LLM (given a high-level goal like "bring me a coke from the kitchen," the LLM scores candidate low-level actions by their semantic relevance) and affordance functions from robot experience (a value function learned from interaction data scores candidate actions by their physical feasibility in the current environment). The combined score (LLM semantic relevance × affordance feasibility) guides a greedy search over the action space. The resulting plans are both linguistically coherent *and* physically executable.
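A minimal sketch of one step of this scoring rule is shown below. The skill names and all numbers are invented for illustration; in the real system the first score comes from an LLM and the second from a learned value function.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill maximizing (LLM relevance) x (affordance value)."""
    combined = {s: llm_scores[s] * affordances.get(s, 0.0) for s in llm_scores}
    return max(combined, key=combined.get), combined

# Illustrative values only (assumptions): the instruction is "wipe up the
# spill", but no sponge is currently in view, so grasping it is infeasible.
llm_scores = {"pick up the sponge": 0.5, "find a sponge": 0.3, "go to the table": 0.2}
affordances = {"pick up the sponge": 0.1, "find a sponge": 0.9, "go to the table": 0.9}

best, combined = select_skill(llm_scores, affordances)
print(best)  # "find a sponge"
```

The LLM alone would pick the semantically best skill even though the robot cannot execute it yet; the product demotes it in favor of a feasible step. Repeating this selection after each executed skill gives the greedy plan.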

SayCan demonstrated on a real mobile manipulation robot in a kitchen environment that this combination outperformed either component alone, successfully completing complex multi-step tasks such as "bring me something from the fridge that I can use to wipe up a spill."

> **Key insight:** Symbol grounding — mapping abstract language concepts to physical actions — requires both semantic knowledge (neural) and physical feasibility estimation (symbolic or learned).

*Reference:* Ahn, Michael, et al. "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." *Conference on Robot Learning (CoRL)*, 2022. <https://arxiv.org/abs/2204.01691> | Website: <https://say-can.github.io>

***

## Code as Policies — Composable Robot Behaviors (2023)

**Code as Policies** exploits a different LLM strength: their ability to write functional Python code (Liang et al., 2023).

Robot behaviors are often compositional: "pick up all red blocks" is trivially expressed as a loop over detected objects with a color filter, where `pick_up(block)` calls a low-level robot API. An LLM given access to a library of robot primitives can generate hierarchical, compositional behaviors as Python programs — programs that can be *statically verified* for syntactic correctness and *dynamically executed* with real-time constraint checking.

Code as Policies enables robot behaviors that are hierarchical (high-level programs call mid-level subroutines call low-level primitives), compositional (behaviors from different tasks can be combined), verifiable (generated code can be checked by static analysis before execution), and interpretable (a Python program is readable by any programmer).
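The sketch below illustrates the "check statically, then execute" idea under stated assumptions: `GENERATED` is a hand-written stand-in for LLM output, `detect_objects` and `pick_up` are hypothetical primitives, and the allowlist check is one simple form of static analysis rather than the method used in the paper.

```python
import ast

# Stand-in for a policy an LLM might generate for "pick up all red blocks".
GENERATED = '''
for obj in detect_objects():
    if obj["color"] == "red":
        pick_up(obj["name"])
'''

ALLOWED_CALLS = {"detect_objects", "pick_up"}  # hypothetical robot API

def check_calls(source, allowed):
    """Parse the code and reject any call outside the allowed primitives."""
    tree = ast.parse(source)  # raises SyntaxError on malformed code
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if not (isinstance(node.func, ast.Name) and node.func.id in allowed):
                return False
    return True

def run_policy(source, scene):
    """Statically check the policy, then execute it against stub primitives."""
    if not check_calls(source, ALLOWED_CALLS):
        raise ValueError("policy calls an unknown primitive")
    picked = []
    env = {
        "__builtins__": {},  # reduces the namespace; not a security sandbox
        "detect_objects": lambda: scene,
        "pick_up": picked.append,
    }
    exec(source, env)
    return picked
```

For example, `run_policy(GENERATED, scene)` on a scene with two red blocks and one blue block returns the names of the two red blocks, while a generated program that calls `os.system(...)` is rejected before execution.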

*Reference:* Liang, Jacky, et al. "Code as Policies: Language Model Programs for Embodied Control." *IEEE International Conference on Robotics and Automation (ICRA)*, 2023. <https://arxiv.org/abs/2209.07753> | Website: <https://code-as-policies.github.io>

***

## RT-2 — Vision-Language-Action Models (2023)

**RT-2** (Brohan et al., 2023) represents a fundamentally different approach to grounding language in robotic action: rather than combining a frozen LLM with learned affordances (SayCan) or generating code programs (Code as Policies), RT-2 **co-trains language, vision, and robotic actions in a single unified model**.

The key insight: robotic actions — expressed as discrete tokenized motor commands — can be treated as just another token type in a vision-language model's vocabulary. RT-2 fine-tunes a large VLM (PaLM-E or PaLI-X scale) on a mixture of web-scraped vision-language data and robot trajectory data. The model learns to predict action token sequences directly, grounded in visual observations and natural language instructions.
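To make "actions as tokens" concrete, here is a minimal sketch of uniform discretization. The bin count of 256 and the normalized range of [-1, 1] are illustrative assumptions, and the functions are not taken from the RT-2 implementation.

```python
def tokenize(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for x in action:
        x = min(max(x, low), high)            # clip to the valid range
        frac = (x - low) / (high - low)       # position within the range, 0..1
        tokens.append(min(int(frac * bins), bins - 1))
    return tokens

def detokenize(tokens, low=-1.0, high=1.0, bins=256):
    """Map bin indices back to the centers of their bins."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]

print(tokenize([-1.0, 0.0, 1.0]))  # [0, 128, 255]
```

Once actions are integers in a fixed range, they can share an output vocabulary with text tokens, and decoding a motor command reduces to ordinary next-token prediction followed by `detokenize`. The round trip loses at most half a bin width per dimension.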

**Results:** RT-2 demonstrates emergent generalization abilities not present in any model trained on robot data alone: chain-of-thought robot planning (multi-step decomposition and sequential execution), zero-shot semantic generalization to instructions referencing web-only concepts ("put the object associated with Nikola Tesla next to the orange"), and 55% absolute success on novel emergent-skill tasks — compared to approximately 9% for RT-1 on equivalent tasks.

**Neuro-symbolic reading:** RT-2 occupies the neural-primary end of the spectrum. Its failure modes — hallucinated actions, semantic drift on constraint-heavy tasks — are precisely those that formal task planning layers prevent. RT-2 thus motivates the hybrid architectures of Chapter 4: the VLM handles perceptual grounding and semantic understanding; a formal task planner provides constraint satisfaction and goal verification.

*Reference:*\
Brohan, Anthony, et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." *arXiv preprint* arXiv:2307.15818 (2023). <https://arxiv.org/abs/2307.15818> | Website: <https://robotics-transformer.github.io>

***

## JEPA Physical Planning — From Representation Learning to Robot Control (2024–2026)

The JEPA (Joint Embedding Predictive Architecture) program at Meta AI, led by Yann LeCun, represents the most systematic recent attempt to build physical world models for robot planning entirely from self-supervised learning, with no pixel reconstruction, no text supervision, and no reward labels (LeCun, 2022).

**The core idea** (from LeCun's 2022 AMI manifesto) is that a world model should predict in *abstract representation space*, not pixel space. A model that predicts semantic features (object positions, motion trajectories, physical relations) learns what matters for planning; one forced to predict pixels wastes capacity on irrelevant rendering details. This principle separates the JEPA program from earlier reconstruction-based approaches (MAE, pixel-space video models).

**V-JEPA** (Bardes et al., 2024) demonstrated that feature prediction from unlabeled video produces representations competitive with supervised methods on both motion and appearance tasks (81.9% on Kinetics-400, 72.2% on Something-Something-v2), with a frozen backbone and zero task-specific supervision. This validated the first stage of the world-model pipeline.

**JEPA-WMs** (Terver, Yang, Ponce, Bardes, LeCun, 2026) close the loop to physical planning: an action-conditioned predictor trained on top of V-JEPA representations learns to predict the next latent state given an action, enabling model-predictive control (CEM/MPPI) in latent space for robot manipulation and navigation without any reward annotation.
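The planning step can be sketched with the cross-entropy method (CEM) on a toy latent space. Everything here is an assumption made for illustration: `predict` is a trivial additive stand-in for a learned action-conditioned predictor, latents are 2-D tuples, and the cost is squared distance between the final predicted latent and a goal latent.

```python
import random

def predict(z, a):
    """Toy stand-in for a learned predictor: next latent = latent + action."""
    return (z[0] + a[0], z[1] + a[1])

def rollout_cost(z, actions, z_goal):
    """Roll the predictor forward and measure distance to the goal latent."""
    for a in actions:
        z = predict(z, a)
    return (z[0] - z_goal[0]) ** 2 + (z[1] - z_goal[1]) ** 2

def cem_plan(z0, z_goal, horizon=3, samples=64, elites=8, iters=10, seed=0):
    """Iteratively refit a Gaussian over action sequences to the best samples."""
    rng = random.Random(seed)
    mean = [[0.0, 0.0] for _ in range(horizon)]
    std = [[1.0, 1.0] for _ in range(horizon)]
    for _ in range(iters):
        population = [
            [(rng.gauss(mean[t][0], std[t][0]), rng.gauss(mean[t][1], std[t][1]))
             for t in range(horizon)]
            for _ in range(samples)
        ]
        population.sort(key=lambda seq: rollout_cost(z0, seq, z_goal))
        best = population[:elites]
        for t in range(horizon):
            for d in range(2):
                vals = [seq[t][d] for seq in best]
                m = sum(vals) / elites
                mean[t][d] = m
                std[t][d] = (sum((v - m) ** 2 for v in vals) / elites) ** 0.5 + 1e-6
    return [tuple(m) for m in mean]
```

No reward labels appear anywhere: the only training signal the planner needs is the goal latent, which in practice would be the encoder's embedding of a goal image. In model-predictive control only the first action of the returned sequence is executed before replanning.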

**V-JEPA 2.1** (Mur-Labadia et al., 2026) achieved +20 percentage points in real-robot grasping success over prior baselines by introducing dense spatial features through hierarchical self-supervision across encoder layers.

> **Neuro-symbolic reading:** The JEPA-WM pipeline is architecturally classified as [§4.4 Pure Neural](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-4-pure-neural.md) — it contains no explicit formal symbolic representations, and planning operates entirely in continuous learned space. This distinguishes it from the [§4.3 Hybrid Co-Processing](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-3-hybrid-architectures.md) systems (which use formal symbolic oracles or knowledge graphs as co-equal components). The critical open question is whether this continuous approach can eventually provide the correctness guarantees that symbolic planning provides. This is examined in [§4.4.2](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-4-pure-neural/4-4-jepa-world-models.md) and [§5.1](/neuro-symbolic-ai-in-practice/part-iv-synthesis/chapter-5/5-1-synthesis.md).

*References:* LeCun, Yann. "A Path Towards Autonomous Machine Intelligence." *OpenReview*, Version 0.9.2, June 2022. <https://openreview.net/forum?id=BZ5a1r-kVsf>

Terver, Basile, et al. "What Drives Success in Physical Planning with Joint-Embedding Predictive World Models." *TMLR*, 2026. arXiv:2512.24497. <https://arxiv.org/abs/2512.24497> | Code: <https://github.com/facebookresearch/jepa-wms>

Mur-Labadia, Lorenzo, et al. "V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning." *arXiv:2603.14482*, March 2026. <https://arxiv.org/abs/2603.14482>

***

## HAIMEDA — Neuro-Symbolic Verification in Production Medical AI (2026)

**HAIMEDA** (Sigloch & Benzmüller, 2026) is among the first fully deployed neuro-symbolic LLM verification systems in a regulated, data-sensitive domain: medical device damage assessment reporting. It demonstrates that the LLM-Modulo framework ([§1.3](/neuro-symbolic-ai-in-practice/part-i-motivation/chapter-1/1-3-llm-modulo.md)) translates directly to production deployments where both hallucination and privacy violations carry legal and patient-safety consequences.

The core architectural insight is **asymmetric verification**: different methods are appropriate for different parts of the pipeline.

```
Input Prompt
     │ Type-aware formal logic
     │ (decidable completeness check
     │  on structured constraints)
     ▼
   LLM Generation
     │
     │ Embedding-based semantic validation
     │ (neural similarity against
     │  ground-truth context)
     ▼
   Verified Output Report
```

1. **Pre-generation (symbolic):** A formal logic layer with decidable completeness verifies that the prompt and domain constraints are internally consistent *before* the LLM generates. This exploits the property that structured medical inputs — device codes, damage categories, regulatory classifications — are formally typed and can be checked against a schema with guaranteed completeness. The check catches structurally inconsistent prompts before hallucination opportunities arise.
2. **Post-generation (neural):** An embedding-based semantic similarity layer validates that the LLM's generated report is semantically faithful to the source documents. This handles the continuous, unstructured space of natural language where formal methods lack expressiveness.
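The two stages can be sketched as follows. This is an illustration of the asymmetric pattern only: the schema, the field names, the threshold, and the bag-of-words `embed` function (standing in for a neural embedding model) are all assumptions, not details of the HAIMEDA system.

```python
import math
from collections import Counter

# Hypothetical schema: each structured field has a finite set of legal values.
SCHEMA = {"device_code": {"D-100", "D-200"}, "damage_class": {"minor", "major"}}

def check_input(record):
    """Symbolic stage: list fields that are missing or hold illegal values."""
    return [f for f, allowed in SCHEMA.items() if record.get(f) not in allowed]

def embed(text):
    """Stand-in for a neural embedding: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def verify(record, source_text, report, threshold=0.5):
    """Exact check before generation, similarity check after it."""
    errors = check_input(record)
    if errors:
        return {"ok": False, "stage": "pre-generation", "errors": errors}
    score = cosine(embed(source_text), embed(report))
    return {"ok": score >= threshold, "stage": "post-generation", "score": score}
```

The first stage gives a yes/no answer with no false negatives relative to the schema; the second gives only a graded score, so its threshold has to be tuned empirically.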

**Results on the HAIMEDA deployment:** 83% detection rate for structured-entity hallucinations (wrong device IDs, incorrect damage classifications), 72% for semantic fabrications (plausible-sounding but ungrounded technical claims), and 30% reduction in report creation time compared to the prior manual workflow.

**The neuro-symbolic insight:** The system's architecture exactly mirrors the book's [§1.4 comparison table](/neuro-symbolic-ai-in-practice/part-i-motivation/chapter-1/1-4-neural-vs-symbolic.md) — formal methods where decidability holds (structured type constraints), neural methods where it does not (semantic faithfulness in natural language). The LLM handles the perceptual and linguistic complexity; the symbolic components enforce the domain's legal and regulatory constraints.

*Reference:* Sigloch, Paul, and Christoph Benzmüller. "Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains." *Proceedings of KI 2026 (German Conference on Artificial Intelligence)*, 2026. <https://arxiv.org/abs/2605.26942>

***

## AlphaCode and AlphaProof — Planning Over Formal Spaces (2022–2024)

Two DeepMind systems represent the most striking demonstrations of neuro-symbolic AI achieving superhuman performance on formally defined tasks.

**AlphaCode** (2022) applied neural generation and symbolic evaluation to competitive programming (Li et al., 2022). The system generated large numbers of candidate programs, executed them against test cases (a symbolic evaluation oracle), and filtered to retain only passing programs. On competitive programming benchmarks, AlphaCode achieved approximately 50th percentile performance among human competitors, a remarkable result for a problem class requiring multi-step algorithmic reasoning.
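The generate-and-filter step can be illustrated with a few hand-written candidates standing in for model samples. The candidate strings, the `solve` entry point, and the tests are all invented for this sketch.

```python
# Stand-ins for sampled programs; a real system would sample these from a model.
CANDIDATES = [
    "def solve(a, b):\n    return a - b",   # wrong logic
    "def solve(a, b):\n    return a + b",   # correct
    "def solve(a, b)\n    return a + b",    # syntax error
    "def solve(a, b):\n    return b + a",   # correct
]

# Example tests: (arguments, expected output).
TESTS = [((1, 2), 3), ((5, 5), 10), ((-1, 1), 0)]

def passes(source, tests):
    """Execute a candidate and check it against every test case."""
    namespace = {}
    try:
        exec(source, namespace)
        return all(namespace["solve"](*args) == expected for args, expected in tests)
    except Exception:
        return False  # syntax errors and runtime errors both count as failures

def filter_candidates(candidates, tests):
    """Keep only the candidates that pass the symbolic oracle."""
    return [c for c in candidates if passes(c, tests)]
```

Here `filter_candidates(CANDIDATES, TESTS)` keeps the second and fourth candidates. The oracle never needs to understand why a program is wrong; it only needs to run it.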

**AlphaProof** and **AlphaGeometry 2** (2024) represent the most striking achievement yet: **silver-medal-equivalent performance on the 2024 International Mathematical Olympiad (IMO)**, solving 4 out of 6 problems (AlphaProof Team et al., 2024). Of the four problems solved, **AlphaProof solved 3** (two algebra problems and one number theory problem, specifically Problems 1, 2, and 6) and **AlphaGeometry 2 solved 1** (Problem 4, a Euclidean geometry problem, using a separate neuro-symbolic architecture described in [§4.3.1](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-3-hybrid-architectures/4-3-system1-system2.md)).

AlphaProof's architecture is a direct instantiation of classical planning ideas applied to formal mathematical proof. The state space consists of proof states in the Lean 4 formal proof language, where each state is a proof context with a set of hypotheses and a current goal. Actions are Lean tactics: formal proof steps that transform a proof state into (possibly multiple) sub-goals. AlphaProof couples a fine-tuned **Gemini** language model with an **AlphaZero-inspired MCTS-based RL approach** adapted to formal proof search. The language model generates Lean 4 proof tactic sequences for a given goal, the MCTS explores the proof state space, and the Lean 4 type checker provides binary, verified rewards (complete proof = reward 1; failed proof = reward 0). The exact selection algorithm used internally by AlphaProof is not publicly documented; PUCT selection guided by the policy network prior $P(s, a)$, as used in AlphaZero (Silver et al., 2018), is inferred here from its AlphaZero-based training. Each verified proof is added to the training corpus and the model is fine-tuned on the augmented corpus, a self-improving loop in which the system solves progressively harder problems with each iteration.
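For reference, AlphaZero-style PUCT selection can be sketched as below. This shows the published AlphaZero rule, not AlphaProof's internal algorithm (which is not public); the tactic names, priors, visit counts, and the exploration constant are illustrative assumptions.

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the child maximizing Q + c_puct * P * sqrt(total visits) / (1 + N).

    children maps an action (here, a tactic string) to a dict with
    "P" (policy prior), "N" (visit count), and "W" (summed value).
    """
    total_visits = sum(ch["N"] for ch in children.values())

    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0             # mean value so far
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])
        return q + u                                          # exploit + explore

    return max(children, key=lambda a: score(children[a]))

# Illustrative statistics for three candidate tactics at one proof state.
children = {
    "simp": {"P": 0.6, "N": 10, "W": 2.0},
    "induction n": {"P": 0.3, "N": 1, "W": 1.0},
    "ring": {"P": 0.1, "N": 0, "W": 0.0},
}
print(puct_select(children))  # "induction n"
```

The prior steers search toward tactics the language model finds plausible, while the visit-count term ensures that a rarely tried tactic with a good record (here `induction n`) is explored before a heavily tried one with a poor record.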

The critical element is the **Lean 4 type checker**: a sound, decidable formal verifier that provides a binary, provably correct verdict on whether any proof step is valid. The neural policy network would be useless without this oracle; the oracle alone could not explore the astronomical proof search space efficiently. Together, they solve problems that stump most PhD mathematicians.

> **This was the first time any AI system achieved medal-level performance on IMO problems**, which require multi-step mathematical reasoning at a level most working mathematicians cannot match.

For a full architectural treatment of the AlphaProof pattern, see [§4.3 Hybrid / Co-Processing Architectures](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-3-hybrid-architectures.md).

*References:*\
Li, Yujia, et al. "Competition-Level Code Generation with AlphaCode." *Science* 378.6624 (2022): 1092–1097. <https://doi.org/10.1126/science.abq1158> | Dataset: <https://github.com/google-deepmind/code_contests>\
AlphaProof Team and AlphaGeometry Team, DeepMind. "AI Achieves Silver-Medal Standard Solving International Mathematical Olympiad Problems." *DeepMind Technical Report*, 2024. <https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/>

***

## SWE-bench and SWE-agent — Test Suites as Correctness Oracles (2024)

**SWE-bench** (Jimenez et al., 2024) established software engineering as one of the most compelling real-world neuro-symbolic application domains. The benchmark consists of 2,294 real GitHub issues from 12 popular Python repositories (Django, Flask, scikit-learn, SymPy, etc.), each paired with a repository-level test suite that formally specifies the correct resolution. An AI system must produce a code patch; the test suite is the **symbolic oracle** that determines correctness.

The performance differential directly demonstrates the book's central thesis:

| System                          | SWE-bench (% resolved)  | Symbolic Oracle Used?                   |
| ------------------------------- | ----------------------- | --------------------------------------- |
| GPT-4 (zero-shot, no execution) | \~2%                    | No                                      |
| SWE-agent + GPT-4 Turbo (2024)  | 12.5%                   | Yes: test execution                     |
| State-of-the-art (2025)         | 50%+ (Verified subset)  | Yes: test execution + iterative repair  |

The roughly sixfold improvement (from \~2% to 12.5%) that accompanies adding test execution as a symbolic oracle is the NeSy argument in empirical form: the neural LLM generates; the formal test suite verifies; their combination far exceeds either alone. This mirrors the AlphaCode ([§4.3](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-3-hybrid-architectures/4-3-applications.md)) pattern, extended from competitive programming to production software maintenance.

**SWE-agent** (Yang et al., 2024) provides the agent scaffolding: a structured **Agent-Computer Interface (ACI)** that gives the LLM access to file editing tools, a bash terminal, and a test runner with structured symbolic feedback. When a test fails, the failure message (file, line number, assertion error) is returned as structured symbolic information rather than raw text, enabling the LLM to precisely locate and correct its mistake. This symbolic feedback loop is architecturally analogous to AlphaProof's Lean 4 type checker: binary correctness signal + structured error message + iterative repair.
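The feedback loop can be sketched in a few lines. This is not the SWE-agent interface: `fake_model` is a hypothetical stand-in for the LLM that reacts to one specific error type, and the "repository" is a single function with two tests.

```python
def run_tests(func, tests):
    """Return structured feedback for the first failing test, or None."""
    for args, expected in tests:
        try:
            actual = func(*args)
        except Exception as exc:
            return {"args": args, "error": type(exc).__name__}
        if actual != expected:
            return {"args": args, "expected": expected, "actual": actual}
    return None

def repair_loop(propose_patch, tests, max_iters=5):
    """Alternate between proposing a patch and running the oracle."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        candidate = propose_patch(feedback)
        feedback = run_tests(candidate, tests)
        if feedback is None:
            return candidate, attempt
    return None, max_iters

def fake_model(feedback):
    """Hypothetical stand-in for the LLM: uses the feedback to pick a patch."""
    if feedback is None:
        return lambda xs: sum(xs) / len(xs)               # crashes on []
    if feedback.get("error") == "ZeroDivisionError":
        return lambda xs: sum(xs) / len(xs) if xs else 0.0
    return lambda xs: 0.0

# Tests for a "mean of a list" function: (arguments, expected output).
TESTS = [(([1, 2, 3],), 2.0), (([],), 0.0)]
```

Calling `repair_loop(fake_model, TESTS)` succeeds on the second attempt: the first patch fails with a structured `ZeroDivisionError` record, and that record is what lets the proposer choose the right fix.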

> **Broader pattern:** The test-suite-as-oracle pattern is domain-agnostic. Any task where correctness is formally checkable — formal proof verification (Lean 4), optimization feasibility (OR-Tools), planning validity (PDDL validators), schema conformance (JSON Schema) — instantiates the same architecture. SWE-bench demonstrates it at the scale of real-world production software, with real GitHub history, real tests, and real failure modes.

*References:* Jimenez, Carlos E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" *Proceedings of ICLR*, 2024. <https://arxiv.org/abs/2310.06770> | Benchmark: <https://www.swebench.com>

Yang, John, et al. "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." *arXiv preprint* arXiv:2405.15793 (2024). <https://arxiv.org/abs/2405.15793> | Code: <https://github.com/SWE-agent/SWE-agent>

***

## LLM World Models for Model-Based Task Planning (2023)

**Guan et al. (2023)** address a fundamental gap in the LLM+P approach: the requirement that a domain model be specified in advance.

Their system uses an LLM to **construct a world model** — a structured representation of how actions change the state of the world — directly from natural language task descriptions. The key distinction from LLM+P is that this world model is not a hand-authored PDDL domain file but an LLM-generated transition model capturing environmental dynamics. The LLM plays the role of a domain modeler: given natural language descriptions of the task environment, it generates state transition rules that serve as an approximate transition function $T(s, a) \rightarrow s'$. A model-based planner then searches over plans using the LLM-constructed world model, producing action sequences grounded in the constructed dynamics.

This enables planning in domains where explicit domain models are unavailable — as long as natural language descriptions are accessible. The system demonstrates performance on embodied task planning benchmarks where domain formalization has traditionally been a bottleneck, showing that LLM-based world model construction can substitute for labor-intensive manual modeling.

**Limitation:** The LLM-constructed world model is approximate and unverified. Errors in the world model propagate to the planner, producing plans that may be invalid in the actual environment. External execution feedback and iterative world model refinement are active research areas.
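A small sketch makes the failure mode concrete. The door domain and both models are invented for illustration: `GENERATED_MODEL` plays the role of an LLM-constructed transition model that omits one precondition, and `TRUE_MODEL` plays the role of the actual environment.

```python
def step(model, state, action):
    """Apply T(s, a) -> s'; return None if a precondition is not met."""
    pre, add, delete = model[action]
    if not pre <= state:
        return None
    return (state - delete) | add

def validate(model, state, plan):
    """Simulate a plan; return the index of the first failing action, or None."""
    for i, action in enumerate(plan):
        state = step(model, state, action)
        if state is None:
            return i
    return None

# "True" dynamics: the door must be open before walking through it.
TRUE_MODEL = {
    "open_door": (frozenset({"at_door"}), frozenset({"door_open"}), frozenset()),
    "walk_through": (frozenset({"at_door", "door_open"}),
                     frozenset({"in_room"}), frozenset({"at_door"})),
}

# Hypothetical LLM-generated model that forgets the door_open precondition.
GENERATED_MODEL = dict(TRUE_MODEL)
GENERATED_MODEL["walk_through"] = (frozenset({"at_door"}),
                                   frozenset({"in_room"}), frozenset({"at_door"}))

START = frozenset({"at_door"})
PLAN = ["walk_through"]  # valid under GENERATED_MODEL, invalid in reality
```

`validate(GENERATED_MODEL, START, PLAN)` reports no failure, while `validate(TRUE_MODEL, START, PLAN)` fails at the first action. A planner searching over the generated model has no way to notice the gap, which is why execution feedback is needed to refine the model.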

*Reference:* Guan, Lin, et al. "Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning." *Advances in Neural Information Processing Systems (NeurIPS)* 36 (2023). <https://arxiv.org/abs/2305.14909>

***

## Tree of Thoughts and RAP — Structured Search Over LLM Reasoning (2023)

**Tree of Thoughts** (Yao et al., 2023a) and **Reasoning via Planning (RAP)** (Hao et al., 2023) exploit a different architecture: using the LLM itself as an approximate world model within a structured search process.

**Tree of Thoughts** structures LLM reasoning as a tree, where each node is a partial solution (a "thought") and branches correspond to different continuations. The LLM evaluates the quality of each node, guiding a breadth-first or depth-first search that prunes unpromising branches. This imposes systematic search structure on what would otherwise be a linear chain-of-thought process, and has been shown to dramatically improve performance on multi-step reasoning tasks including the Game of 24, creative writing planning, and mini-crosswords.
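The search structure can be sketched on a toy arithmetic puzzle (reach a target number from a start value). This is only an illustration: `propose` and `evaluate` are deterministic stand-ins for the two LLM calls, and the puzzle, the operations, and the beam width are assumptions.

```python
OPS = {"+1": lambda x: x + 1, "*2": lambda x: x * 2, "*3": lambda x: x * 3}

def propose(value):
    """Stand-in for the LLM proposing next thoughts from a partial solution."""
    return [(name, op(value)) for name, op in OPS.items()]

def evaluate(value, target):
    """Stand-in for the LLM scoring a thought; lower means more promising."""
    return abs(target - value)

def tree_of_thoughts(start, target, beam_width=2, max_depth=5):
    """Level-by-level search that keeps only the best thoughts at each depth."""
    frontier = [(start, [])]                   # (value, path of operation names)
    for _ in range(max_depth):
        candidates = []
        for value, path in frontier:
            for name, nxt in propose(value):
                if nxt == target:
                    return path + [name]
                candidates.append((nxt, path + [name]))
        # Prune: keep the beam_width most promising partial solutions.
        candidates.sort(key=lambda c: evaluate(c[0], target))
        frontier = candidates[:beam_width]
    return None

print(tree_of_thoughts(1, 10))  # ['*3', '*3', '+1']
```

A single chain of thought would commit to one operation per step; keeping several partial solutions alive lets the search recover when the locally best-looking thought leads nowhere. Because the evaluator is heuristic, pruning can also discard the only path to a solution.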

**RAP** goes further: it uses an LLM *both* as an action proposer *and* as a world model for MCTS. At each step, the LLM proposes candidate actions (rollout policy) and simulates the next state given an action (world model / transition function), while a value estimator (also the LLM) scores the resulting state. The result is a planning system that does not require an external domain model — the LLM's internal world knowledge serves as an implicit transition function. RAP demonstrated significant improvements over chain-of-thought and least-to-most prompting on blocksworld planning and logical reasoning tasks.

**Two convergent design trajectories share the same abstract structure** — a proposal model, an evaluation model, and a search algorithm — but arrive from independent origins. The **LLM-agent search lineage** runs from ToT (LLM as evaluator, no external feedback) → LATS (§4.2.2, external execution feedback as reward signal). In parallel, the **formal RL lineage** runs from AlphaGo (Silver et al., 2016) → AlphaZero (Silver et al., 2018) → AlphaProof (2024), independently developing neural-guided MCTS over formal state spaces with verified correctness oracles. AlphaProof did not emerge from the ToT/LATS line; these are convergent architectures, not a single progression. The critical distinction: formal RL systems use provably sound verification (Lean 4 type checker) while LLM-agent systems use approximate self-evaluation — a difference that matters enormously for safety-critical applications.

The crucial limitation remains: the LLM world model is *approximate* and *unverified*, making this approach unsuitable for safety-critical domains without external verification.

*References:*\
Yao, Shunyu, et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." *Advances in Neural Information Processing Systems (NeurIPS)* 36 (2023). <https://arxiv.org/abs/2305.10601> | Code: <https://github.com/princeton-nlp/tree-of-thought-llm>\
Hao, Shibo, et al. "Reasoning with Language Model is Planning with World Model." *Proceedings of EMNLP*, 2023. <https://arxiv.org/abs/2305.14992> | Code: <https://github.com/Ber666/llm-reasoners>

***

## References

1. Sigloch, Paul, and Christoph Benzmüller. "Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains." *Proceedings of KI 2026 (German Conference on Artificial Intelligence)*, 2026. <https://arxiv.org/abs/2605.26942>
2. Silver, David, et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search." *Nature* 529 (2016): 484–489. <https://doi.org/10.1038/nature16961>
3. Silver, David, et al. "A General Reinforcement Learning Algorithm That Masters Chess, Shogi, and Go Through Self-Play." *Science* 362.6419 (2018): 1140–1144. <https://doi.org/10.1126/science.aar6404>
4. Jimenez, Carlos E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" *Proceedings of ICLR*, 2024. <https://arxiv.org/abs/2310.06770> | Benchmark: <https://www.swebench.com>
5. Yang, John, et al. "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." *arXiv preprint* arXiv:2405.15793 (2024). <https://arxiv.org/abs/2405.15793> | Code: <https://github.com/SWE-agent/SWE-agent>

> **Next:** [§2.3 — LLM Agent Patterns](/neuro-symbolic-ai-in-practice/part-ii-background/chapter-2/2-3-agent-patterns.md) examines the composable runtime patterns — ReAct, Reflexion, Toolformer, Voyager, and LATS — that practitioners use to integrate LLM reasoning with external symbolic tools.

