
# 1.1 LLM Planning Limitations

For several years, a growing body of researchers has accumulated compelling evidence that large language models (LLMs), despite their extraordinary fluency, are *not* autonomous reasoners. They are, at best, powerful statistical approximators of language about reasoning — a subtle but critical distinction.(Stechly et al., 2023)(Marcus & Davis, 2019)

The indictment began in earnest at the 2022 NeurIPS Workshop on Foundation Models for Decision Making, where Valmeekam et al. published a systematic evaluation titled *"Large Language Models Still Can't Plan."*(Valmeekam et al., 2022) The study ran GPT-3 and subsequent models on classical planning benchmarks — Blocksworld, Logistics, Gripper — problems that any competent undergraduate student of AI could specify and solve with a classical planner in seconds. The results were sobering: **success rates below 10%** on non-trivial instances, with models confidently generating action sequences that violated elementary preconditions. Subsequent work by the same group, published as **PlanBench** at NeurIPS 2023,(Valmeekam et al., 2023) showed that even GPT-4 — then the state of the art — failed systematically on planning tasks, and that few-shot prompting, chain-of-thought, and self-refinement provided only marginal improvements. Stechly et al. (2023) confirmed this last point in rigorous controlled experiments on graph coloring — a canonical NP-complete reasoning problem — showing that iterative self-prompting and self-critique do not reliably correct reasoning errors: the model's performance is largely independent of whether the critique is correct or even present.(Stechly et al., 2023)

A more recent and perhaps more compelling benchmark is **TravelPlanner** (Xie et al., 2024), which moves from toy planning domains to real-world complexity. TravelPlanner requires agents to construct multi-day travel itineraries satisfying dozens of simultaneous constraints: budget limits, transportation availability, accommodation preferences, dining requirements, and local attraction scheduling. Critically, all data is drawn from real databases (Flights, Hotels, Restaurants) rather than synthetic domain descriptions.

The results are stark: GPT-4-Turbo with ReAct and tool access achieves a **success rate of 0.6%** on the full benchmark under the strictest evaluation, which requires a plan to satisfy all commonsense and hard constraints simultaneously. Even with chain-of-thought prompting and multiple retrieval tools, the best LLM agents stay below 10% on hard instances. Nor is this a toy problem of the kind classical planners trivially solve; it is a task that human travel agents complete routinely. The failure lies in systematic constraint satisfaction across multi-step, multi-resource plans.

TravelPlanner's significance is that it closes the obvious objection to PlanBench: "toy Blocksworld problems are artificial." TravelPlanner is the kind of planning task that a corporate travel booking assistant would face daily, and LLMs fail it at the most basic level.
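To make "satisfy all constraints simultaneously" concrete, here is a minimal sketch of an all-or-nothing itinerary check. It is not TravelPlanner's evaluation code: the `Day` record, the three constraints, and all cities and costs are hypothetical, chosen only to show why a plan that passes most checks still counts as a failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Day:
    """One day of a hypothetical itinerary (illustrative fields only)."""
    city: str
    lodging_cost: float
    transport_cost: float
    meals: int


Plan = List[Day]
Constraint = Callable[[Plan], bool]


def within_budget(limit: float) -> Constraint:
    # Hard constraint: total spend must not exceed the budget.
    return lambda plan: sum(d.lodging_cost + d.transport_cost for d in plan) <= limit


def ends_in(city: str) -> Constraint:
    # Commonsense constraint: the trip must finish in the home city.
    return lambda plan: bool(plan) and plan[-1].city == city


def meals_per_day(minimum: int) -> Constraint:
    # Commonsense constraint: every day needs enough scheduled meals.
    return lambda plan: all(d.meals >= minimum for d in plan)


def check(plan: Plan, constraints: Dict[str, Constraint]) -> Dict[str, bool]:
    """Evaluate every constraint independently and report each verdict."""
    return {name: rule(plan) for name, rule in constraints.items()}


# Hypothetical three-day trip that looks plausible but overspends.
plan = [
    Day("Chicago", lodging_cost=150, transport_cost=120, meals=3),
    Day("Chicago", lodging_cost=150, transport_cost=0, meals=3),
    Day("Columbus", lodging_cost=0, transport_cost=120, meals=3),
]
constraints = {
    "budget": within_budget(500),
    "returns_home": ends_in("Columbus"),
    "meals": meals_per_day(3),
}

report = check(plan, constraints)
passes = all(report.values())  # strict metric: one violation fails the plan
print(report, passes)
```

Two of the three checks succeed here, yet the plan fails because the total cost is 540 against a budget of 500. A strict pass rate counts only plans for which every entry in the report is true.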

*Reference:* Xie, Jian, et al. "TravelPlanner: A Benchmark for Real-World Planning with Language Agents." *Proceedings of ICML*, 2024. <https://arxiv.org/abs/2402.01622> | Benchmark: <https://github.com/OSU-NLP-Group/TravelPlanner>

Why do LLMs struggle? The failure is architectural, not cosmetic:

1. **No verified world model.** An LLM predicts the next token given a context. It has no internal mechanism to simulate the forward application of actions in a state space, verify preconditions, or detect that an action sequence is physically impossible. What *looks* like planning in a fluent GPT-4 output is, mechanistically, pattern matching over training data that included plans.(Kambhampati et al., 2024)
2. **No soundness guarantee.** Classical planning systems are *sound*: every plan they output is correct by construction — every precondition is met, every effect is applied, and the goal is provably achieved. LLMs offer no such guarantee. A plan from GPT-4 may violate any constraint and the model will not know.(Kambhampati et al., 2024)
3. **Performance degrades with complexity.** LLM planning performance falls sharply as plan length increases, as the number of objects grows, or as the problem requires maintaining long-range dependencies. This is the exact scenario where reliable planning is most needed.(Valmeekam et al., 2022)(Valmeekam et al., 2023)
4. **Sensitivity to prompt wording.** Rephrasing the same planning problem in different natural language produces dramatically different LLM outputs — sometimes correct, sometimes absurd.(Kambhampati et al., 2024) Classical planners are insensitive to surface form: they operate on formal problem specifications.
5. **Token-by-token generation ≠ systematic search.** A planner systematically explores a state space, guided by heuristics, with backtracking and completeness guarantees. LLM generation is left-to-right, autoregressive, and commits to each token before seeing downstream consequences.
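The contrast drawn in points 1, 2, and 5 can be sketched in a few lines. The example below is a toy STRIPS-style Blocksworld, not the PlanBench harness: the `move` action schema, the fact encoding, and the "LLM-style" plan are illustrative assumptions. It shows a validator that simulates actions and checks preconditions, and a breadth-first planner that returns only plans passing that same check.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional, Tuple

Fact = Tuple[str, ...]
State = FrozenSet[Fact]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[Fact]     # facts that must hold before the action
    add: FrozenSet[Fact]     # facts made true by the action
    delete: FrozenSet[Fact]  # facts made false by the action


def move(block: str, src: str, dst: str) -> Action:
    """Move `block` from `src` to `dst`; either may be a block or 'table'."""
    pre = {("on", block, src), ("clear", block)}
    add = {("on", block, dst)}
    delete = {("on", block, src)}
    if dst != "table":            # a destination block must be clear
        pre.add(("clear", dst))
        delete.add(("clear", dst))
    if src != "table":            # the block underneath becomes clear
        add.add(("clear", src))
    return Action(f"move({block},{src},{dst})",
                  frozenset(pre), frozenset(add), frozenset(delete))


def apply(state: State, action: Action) -> Optional[State]:
    """Return the successor state, or None if a precondition is unmet."""
    if not action.pre <= state:
        return None
    return (state - action.delete) | action.add


def validate(state: State, plan: List[Action], goal: State) -> Tuple[bool, str]:
    """Simulate the plan step by step; reject at the first violation."""
    for i, action in enumerate(plan):
        nxt = apply(state, action)
        if nxt is None:
            return False, f"step {i}: {action.name} has unmet preconditions"
        state = nxt
    return (True, "ok") if goal <= state else (False, "goal not reached")


def bfs_plan(init: State, goal: State, actions: List[Action]) -> Optional[List[Action]]:
    """Breadth-first search: systematic, and complete on this finite space."""
    frontier, seen = deque([(init, [])]), {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:
            return plan
        for action in actions:
            nxt = apply(state, action)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None


BLOCKS = ["A", "B", "C"]
PLACES = BLOCKS + ["table"]
ACTIONS = [move(b, s, d) for b in BLOCKS for s in PLACES for d in PLACES
           if s != d and b not in (s, d)]
INIT = frozenset({("on", "C", "A"), ("on", "A", "table"), ("on", "B", "table"),
                  ("clear", "C"), ("clear", "B")})
GOAL = frozenset({("on", "A", "B"), ("on", "B", "C")})

# A fluent-looking but invalid plan: A cannot move while C sits on it.
llm_style_plan = [move("A", "table", "B"), move("B", "table", "C")]
print(validate(INIT, llm_style_plan, GOAL))

found = bfs_plan(INIT, GOAL, ACTIONS)
print([a.name for a in found])
print(validate(INIT, found, GOAL))
```

The validator rejects the first plan at step 0, while the searched plan is accepted by construction because every step it contains was produced by `apply`. Breadth-first search is used only for brevity; practical planners rely on heuristic search.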

**The Neural Perspective Arrives at the Same Diagnosis**

In 2022, Yann LeCun (then VP and Chief AI Scientist at Meta AI, Turing Award laureate, and one of the architects of the deep learning revolution) independently reached *the same conclusion as the planning community* from an entirely different starting point. In his position paper *"A Path Towards Autonomous Machine Intelligence"* (LeCun, 2022), he argues that autoregressive LLMs are the wrong architecture for intelligence for structurally identical reasons: they (1) predict at the wrong level of abstraction (tokens, not world states); (2) lack a world model for simulating the consequences of actions; and (3) cannot perform systematic search over action sequences. LeCun is not a neuro-symbolic researcher: he explicitly rejects formal symbolic representations as necessary components. Yet he arrives at the same first-principles diagnosis as this book.

His proposed solution — the Joint Embedding Predictive Architecture (JEPA) with hierarchical planning over learned world models — represents the neural-only alternative: instead of formal symbolic structure, learn action-conditioned world models entirely from unlabeled video and plan via model-predictive control in latent space. By 2026, this program had produced JEPA World Models (JEPA-WMs) achieving state-of-the-art performance on real robot manipulation (Terver et al., 2026). The critical open question — whether learned world models alone can provide the correctness guarantees that formal verification offers, or whether formal structure is necessary — is discussed in [§5.1](/neuro-symbolic-ai-in-practice/part-iv-synthesis/chapter-5/5-1-synthesis.md) and [§5.3](/neuro-symbolic-ai-in-practice/part-iv-synthesis/chapter-5/5-3-outlook.md). For practitioners, the architectural tradeoff is covered in [§4.4.2](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-4-pure-neural/4-4-jepa-world-models.md).(LeCun, 2022)(Terver et al., 2026)

**Smooth Falsehoods — A Formal Blind Spot for Probability-Based Verification**

A recent formal result (Miya, 2025) identifies a specific sub-class of LLM hallucination that exposes the fundamental limit of any confidence-based defense. **Smooth falsehoods** are statements assigned *high generation probability* by the LLM yet *structurally disconnected* from the context they describe. They are maximally confident fabrications — not uncertain guesses.

This matters enormously for system design. Any verification approach based on confidence thresholds, self-consistency sampling, or RLHF calibration is *incapable by design* of rejecting this class: the LLM's probability model has no internal signal distinguishing a smooth falsehood from a true statement, as Miya (2025) demonstrates experimentally on a controlled diagnostic dataset. The only principled defense is **external structural verification**, that is, checking the candidate statement against the actual knowledge graph, contextual graph structure, or domain model rather than against the model's own probability distribution. This is the strongest formal argument for why neuro-symbolic verification (the subject of Chapter 4) is architecturally necessary for high-stakes deployments, not merely a performance optimization.(Miya, 2025)
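The difference between the two kinds of gate can be illustrated with a deliberately small sketch. This is not the method of Miya (2025). The knowledge graph, the `Claim` record, the confidence values, and the 0.9 threshold are all hypothetical, and set membership stands in for a real structural check.

```python
from typing import NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]


class Claim(NamedTuple):
    triple: Triple
    confidence: float  # hypothetical probability reported by the generator


# Hypothetical domain model: the facts the system can actually verify.
KNOWLEDGE_GRAPH: Set[Triple] = {
    ("flight_F101", "departs_from", "Columbus"),
    ("flight_F101", "arrives_at", "Chicago"),
    ("hotel_H7", "located_in", "Chicago"),
}


def confidence_gate(claim: Claim, threshold: float = 0.9) -> bool:
    """Accept whenever the generator is confident enough."""
    return claim.confidence >= threshold


def structural_gate(claim: Claim, graph: Set[Triple]) -> bool:
    """Accept only claims supported by the external graph."""
    return claim.triple in graph


claims = [
    # True and confident.
    Claim(("flight_F101", "arrives_at", "Chicago"), 0.97),
    # Smooth falsehood: fluent, confident, and unsupported by the graph.
    Claim(("hotel_H7", "located_in", "Columbus"), 0.96),
    # True but low confidence: a threshold would wrongly discard it.
    Claim(("hotel_H7", "located_in", "Chicago"), 0.55),
]

for claim in claims:
    print(claim.triple,
          "confidence:", confidence_gate(claim),
          "structural:", structural_gate(claim, KNOWLEDGE_GRAPH))
```

The second claim passes the confidence gate and fails the structural one, which is the case a probability-based defense cannot catch. The structural gate is only as good as the coverage of the graph behind it.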

**Benchmark Spotlight: The Abstraction and Reasoning Corpus (ARC)**

The **Abstraction and Reasoning Corpus** (ARC) (Chollet, 2019) is one of the most consequential reasoning benchmarks in modern AI. Each task presents a small number of input-output demonstration pairs on colored grids (typically 3, occasionally up to 10); the agent must identify the transformation rule and apply it to a test grid. The rules require abstract concepts — symmetry, counting, object tracking, spatial relations — that humans learn intuitively but AI systems struggle to acquire from so few examples.

State-of-the-art LLMs, including GPT-4, score below 10% on the hardest ARC tasks without fine-tuning. The best-performing systems are explicitly neuro-symbolic: **DreamCoder** (Ellis et al., 2021, [§4.2.1](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-2-neural-helps-symbolic/4-2-differentiable-reasoning.md)) uses library learning to infer compositional transformation programs; neural recognition networks propose programs while a symbolic interpreter evaluates correctness.
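The propose-and-verify loop described above can be sketched with a brute-force enumerator standing in for the neural proposer. This is not DreamCoder: there is no learned library and no recognition network, and the three grid primitives and the demonstration pairs are assumptions chosen for illustration. What it does show is the symbolic half of the loop, where a candidate program is kept only if it reproduces every demonstration exactly.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = Tuple[Tuple[int, ...], ...]


def flip_h(g: Grid) -> Grid:
    return tuple(row[::-1] for row in g)   # mirror left to right


def flip_v(g: Grid) -> Grid:
    return g[::-1]                          # mirror top to bottom


def transpose(g: Grid) -> Grid:
    return tuple(zip(*g))                   # swap rows and columns


# Hypothetical primitive library; a real system would learn a richer one.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": flip_h, "flip_v": flip_v, "transpose": transpose,
}


def run(program: Tuple[str, ...], grid: Grid) -> Grid:
    """Interpret a program as a left-to-right composition of primitives."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def synthesize(pairs: List[Tuple[Grid, Grid]], max_len: int = 3) -> Optional[Tuple[str, ...]]:
    """Return the shortest program consistent with every demonstration."""
    for length in range(max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in pairs):
                return program
    return None


# Two demonstrations of an unnamed rule (a 90-degree clockwise rotation).
demos = [
    (((1, 2), (3, 4)), ((3, 1), (4, 2))),
    (((0, 5, 0), (0, 0, 7)), ((0, 0), (0, 5), (7, 0))),
]
program = synthesize(demos)
test_grid = ((1, 2, 3), (4, 5, 6))
print(program, run(program, test_grid))
```

No single primitive explains the demonstrations, so the search returns a two-step composition and applies it to the unseen grid. Exhaustive enumeration grows exponentially with program length, which is the cost that learned proposal distributions are meant to reduce.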

In late 2024, OpenAI's o3 model reached 87.5% on the ARC-AGI benchmark at high compute (1024 samples per task, approximately 171× the standard evaluation budget); at the standard budget of 6 samples per task, the score was 75.7%. This striking advance has been attributed to extensive internal chain-of-thought search, and it has renewed debate about whether abstract reasoning can emerge from scale alone. The key open question is one of cost: o3's approach requires enormous compute per task (orders of magnitude more than human solvers), whereas a neuro-symbolic system that learns reusable transformation primitives (like DreamCoder) solves new tasks in milliseconds after the library is learned.

*Reference:* Chollet, François. "On the Measure of Intelligence." *arXiv preprint* arXiv:1911.01547 (2019). <https://arxiv.org/abs/1911.01547> | Benchmark: <https://github.com/fchollet/ARC>

**The Scale Inversion Result — Architecture Dominates Scale**

In May 2026, a **Lattice Deduction Transformer** (LDT) with **800,000 parameters** achieved **100% accuracy on Sudoku-Extreme** — a benchmark of hard combinatorial puzzles requiring systematic constraint propagation and backtracking search. Frontier LLMs (GPT-4, Gemini, and contemporaries) scored **0%** on the same benchmark.(Davis et al., 2026)

This gap — 800K parameters vs. hundreds of billions — is the clearest empirical demonstration that *architectural choices dominate model scale* for structured combinatorial reasoning. The LDT does not scale pattern-matching; it makes systematic deduction structurally guaranteed by projecting its latent state through a formal abstract lattice at every forward pass ([§4.4](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-4-pure-neural.md)). LLMs cannot close this gap by scaling because the capability they lack — sound, backtracking-based deduction — is *absent from their architecture*, not merely underdeveloped.
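For readers unfamiliar with the capability at issue, the following is a minimal classical sketch of constraint propagation plus backtracking on a 4×4 Sudoku. It is not the LDT and makes no claim about how that model is implemented; the puzzle, the grid size, and all function names are illustrative assumptions. Candidate sets only ever shrink during propagation, which is the kind of monotone deduction the paragraph above refers to.

```python
from typing import Dict, List, Optional, Set, Tuple

N, BOX = 4, 2  # 4x4 grid with 2x2 boxes, kept small for illustration
Cell = Tuple[int, int]
Candidates = Dict[Cell, Set[int]]


def peers(r: int, c: int) -> Set[Cell]:
    """Cells sharing a row, column, or box with (r, c)."""
    br, bc = BOX * (r // BOX), BOX * (c // BOX)
    ps = {(r, j) for j in range(N)} | {(i, c) for i in range(N)}
    ps |= {(br + i, bc + j) for i in range(BOX) for j in range(BOX)}
    ps.discard((r, c))
    return ps


def propagate(cands: Candidates) -> bool:
    """Eliminate decided values from peers until nothing changes.

    Returns False as soon as some cell has no candidates left.
    """
    changed = True
    while changed:
        changed = False
        for cell, vals in cands.items():
            if not vals:
                return False
            if len(vals) == 1:
                value = next(iter(vals))
                for p in peers(*cell):
                    if value in cands[p]:
                        cands[p] = cands[p] - {value}
                        if not cands[p]:
                            return False
                        changed = True
    return True


def solve(cands: Candidates) -> Optional[Candidates]:
    """Propagate, then branch on the most constrained cell and backtrack."""
    if not propagate(cands):
        return None
    open_cells = [c for c in cands if len(cands[c]) > 1]
    if not open_cells:
        return cands
    cell = min(open_cells, key=lambda c: len(cands[c]))
    for value in sorted(cands[cell]):
        trial = {c: set(vs) for c, vs in cands.items()}  # copy, so failure is undone
        trial[cell] = {value}
        result = solve(trial)
        if result is not None:
            return result
    return None


def solve_grid(grid: List[List[int]]) -> Optional[List[List[int]]]:
    """Solve a grid in which 0 marks an empty cell."""
    cands = {(r, c): ({grid[r][c]} if grid[r][c] else set(range(1, N + 1)))
             for r in range(N) for c in range(N)}
    solved = solve(cands)
    if solved is None:
        return None
    return [[next(iter(solved[(r, c)])) for c in range(N)] for r in range(N)]


PUZZLE = [[1, 0, 0, 4],
          [0, 4, 1, 0],
          [2, 0, 0, 3],
          [0, 3, 2, 0]]
for row in solve_grid(PUZZLE):
    print(row)
```

Any grid this solver returns satisfies every row, column, and box constraint, and an unsatisfiable grid yields `None` rather than a plausible-looking guess. That guarantee comes from the procedure itself, not from the size of the solver.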

*Reference:* Davis, Liam, et al. "Lattice Deduction Transformers." *arXiv preprint* arXiv:2605.08605 (2026). <https://arxiv.org/abs/2605.08605>

Compositional generalization — the ability to understand novel combinations of familiar components — is another systematic LLM failure mode with direct implications for planning. The **SCAN** benchmark (Lake & Baroni, 2018) tests whether models trained on simple navigation commands generalize to compositional instructions: a model that has learned "jump" and "walk twice" must generalize to "jump twice." Standard sequence-to-sequence neural models fail catastrophically on systematic compositional splits (near 0% accuracy) while achieving near-perfect in-distribution performance. The **COGS** benchmark (Kim & Linzen, 2020) extends this to semantic parsing, showing that even large pre-trained LLMs exhibit sharp accuracy drops on compositionally novel inputs.

For planning, compositional generalization is not an academic nicety — it is the difference between an agent that has memorized task-specific plans and one that can reason about novel task combinations from first principles.
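What systematic composition means here can be shown with a hand-written interpreter for a small SCAN-like fragment. The grammar covers only primitives, `twice`/`thrice`, and `and`, and the output token names are simplified assumptions, not the benchmark's exact vocabulary. Because the meaning of `twice` is defined independently of the verb it modifies, the interpreter handles "jump twice" without ever having seen that combination.

```python
from typing import Dict, List

# Simplified, hypothetical vocabulary for a SCAN-like fragment.
PRIMITIVES: Dict[str, str] = {"jump": "JUMP", "walk": "WALK", "run": "RUN", "look": "LOOK"}
REPEATS: Dict[str, int] = {"twice": 2, "thrice": 3}


def interpret(command: str) -> List[str]:
    """Map a command to its action sequence by structural recursion."""
    if " and " in command:
        # "X and Y" means: do X, then do Y.
        left, right = command.split(" and ", 1)
        return interpret(left) + interpret(right)
    words = command.split()
    if not words or len(words) > 2 or words[0] not in PRIMITIVES:
        raise ValueError(f"cannot parse: {command!r}")
    actions = [PRIMITIVES[words[0]]]
    if len(words) == 2:
        if words[1] not in REPEATS:
            raise ValueError(f"unknown modifier: {words[1]!r}")
        actions = actions * REPEATS[words[1]]
    return actions


print(interpret("walk twice"))
print(interpret("jump twice"))
print(interpret("walk twice and jump thrice"))
```

A sequence-to-sequence model has to induce this rule from examples, and the systematic splits test whether it did. The interpreter gets it by construction, at the price of someone having written the grammar by hand.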

*References:* Lake, Brenden M., and Marco Baroni. "Generalization Without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks." *Proceedings of ICML*, 2018. <https://arxiv.org/abs/1711.00350> | Benchmark: <https://github.com/brendenlake/SCAN>

Kim, Najoung, and Tal Linzen. "COGS: A Compositional Generalization Challenge Based on Semantic Interpretation." *Proceedings of EMNLP*, 2020. <https://arxiv.org/abs/2010.05465>

Miya, Shinobu. "Eidoku: A Neuro-Symbolic Verification Gate for LLM Reasoning via Structural Constraint Satisfaction." *arXiv preprint* arXiv:2512.20664 (2025). <https://arxiv.org/abs/2512.20664>

This does not mean LLMs are useless for planning-related tasks — quite the opposite, as we will see in Chapters 2 and 4. It means that LLMs *alone* cannot be trusted as autonomous planning engines. The solution is combination, not replacement.

> **Next:** [§1.2 — The Neurosymbolic Agenda](/neuro-symbolic-ai-in-practice/part-i-motivation/chapter-1/1-2-neurosymbolic-agenda.md) surveys the field-level diagnosis and the KIL framework that maps directly to the architectural taxonomy of Chapter 4.

***

