# 1.5 The Extended Thinking Complication

No honest treatment of why neuro-symbolic AI is necessary can ignore the most significant recent development in the field: the introduction of **inference-time compute scaling**, exemplified by OpenAI's o1 (2024) and o3 (announced December 2024, publicly released April 2025) model series (OpenAI, 2025).

These models devote extended compute at inference time — often tens of seconds to minutes — to generating long internal chain-of-thought reasoning traces before producing a final answer. On benchmarks such as MATH, AIME, and ARC-AGI, o3 achieves performance that dramatically surpasses earlier models. This has prompted a reasonable question: *does extended thinking effectively solve the planning problem, making the neuro-symbolic approach unnecessary?*

The answer, examined carefully, is **no** — for three fundamental reasons.

## Reason 1: No Formal Verifier

Extended thinking in o1/o3 is still neural computation. The model generates a long reasoning trace, but no external oracle verifies each step. A model can produce a reasoning chain several hundred tokens long that contains a subtle logical error at step 47, reaffirms that error at step 48, and arrives at a wrong answer with full confidence. The reasoning *looks* systematic but carries no soundness guarantee. In contrast, AlphaProof's MCTS over Lean 4 proofs has a formal verifier at every step: Lean 4's type checker either accepts or rejects each tactic. Extended thinking has no equivalent.
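To make the role of a per-step verifier concrete, here is a minimal Python sketch. It is a toy stand-in, not how Lean 4 or AlphaProof work: the `Step` format, the `OPS` table, and `first_invalid_step` are hypothetical names invented for this illustration, and the "reasoning trace" is just a chain of arithmetic claims.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical step format: each reasoning step claims `left op right == claimed`.
@dataclass(frozen=True)
class Step:
    left: int
    op: str
    right: int
    claimed: int


OPS: dict[str, Callable[[int, int], int]] = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
}


def first_invalid_step(trace: list[Step]) -> Optional[int]:
    """Return the index of the first step the checker rejects, or None if all pass."""
    for i, step in enumerate(trace):
        if OPS[step.op](step.left, step.right) != step.claimed:
            return i
    return None


trace = [
    Step(12, "*", 4, 48),
    Step(48, "+", 7, 56),  # subtle error: 48 + 7 is 55
    Step(56, "-", 6, 50),  # locally valid, but built on the bad step
]

print(first_invalid_step(trace))  # index of the first rejected step
```

The last step is internally consistent, which is why the chain as a whole reads as plausible. Only a check applied to every step independently of the generator localizes the error.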

## Reason 2: No True Backtracking

Autoregressive generation, even with extended compute, is fundamentally a left-to-right process. While o1/o3 models exhibit apparent backtracking (they sometimes write "wait, that's wrong, let me reconsider"), this is learned behavior that mimics backtracking without implementing true state-space search. A classical planner running A\* keeps explicit earlier states on its search frontier and can genuinely return to one of them to explore a different branch; the model only generates tokens that describe backtracking. For problems requiring systematic branching search over large state spaces, this distinction is decisive.
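To make the contrast concrete, here is a minimal A\* sketch in Python over an explicit state graph. The graph `GRAPH`, the heuristic table `H`, and the function `a_star` are all hypothetical and chosen for illustration: the branch through `B` looks cheapest at first but is a dead end, so the search has to resume from a state it stored earlier.

```python
import heapq

# Hypothetical toy graph: state -> list of (successor, edge cost).
GRAPH = {
    "S": [("B", 1), ("C", 2)],
    "B": [("D", 1)],
    "C": [("G", 2)],
    "D": [],  # dead end
    "G": [],
}

# Assumed admissible heuristic: a lower bound on the remaining cost to "G".
H = {"S": 2, "B": 1, "C": 2, "D": 1, "G": 0}


def a_star(start, goal):
    """Return (path, expansion_order); path is None if the goal is unreachable."""
    frontier = [(H[start], 0, start, [start])]  # entries are (f, g, state, path)
    best_g = {start: 0}
    expanded = []
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found
        expanded.append(state)
        if state == goal:
            return path, expanded
        for nxt, cost in GRAPH[state]:
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + H[nxt], new_g, nxt, path + [nxt]))
    return None, expanded


path, expanded = a_star("S", "G")
print(path)      # the plan that was found
print(expanded)  # the order in which states were expanded
```

The expansion order shows the search committing to `B` and `D`, then resuming from `C`, which had been waiting on the frontier the whole time. That stored alternative is an actual data structure, not a sentence describing a change of mind.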

## Reason 3: Degradation on Formally Specified Problems

Empirically, o3 shows strong performance on mathematical problem-solving but weaker performance on formally specified planning benchmarks, particularly those involving many objects and long-horizon dependencies. Kambhampati et al. (2024) and subsequent analyses show that even o1-class models fail on non-trivial Blocksworld and logistics instances when evaluated on PlanBench (Valmeekam et al., 2023).

## What Extended Thinking Gets Right

Extended thinking *does* approximate some aspects of symbolic reasoning and narrows the gap significantly for mathematical and common-sense tasks. The performance improvement is real and substantial. The appropriate framing is:

> **Extended thinking closes the gap on tasks where approximate reasoning suffices. It does not provide the formal guarantees required by safety-critical systems, long-horizon planning with resource constraints, or any domain where an incorrect plan has irreversible consequences.**

For industrial automation, medical decision support, space mission planning, or autonomous vehicle route planning — the core application domains of this book — a model that is *usually* correct is not sufficient. The neuro-symbolic approach, where neural components generate candidates and symbolic systems verify correctness, provides the missing guarantee. Extended thinking is a valuable addition to the neural component of a neuro-symbolic system; it does not replace the symbolic component.

**Connection to STaR and self-improvement (→ §4.1).** A closely related phenomenon is the **Self-Taught Reasoner (STaR)** paradigm (Zelikman et al., 2022): models trained to generate and filter their own chain-of-thought rationales via iterative self-improvement acquire increasingly structured internal reasoning without any symbolic oracle. STaR demonstrates that the reasoning patterns approximated by o1/o3-style extended compute can — in principle — be instilled through data-driven self-improvement, not just scaling. The critical limitation remains: STaR-trained models improve fluency at reasoning *patterns* but have no mechanism to verify *correctness* of the reasoning steps they generate. The neuro-symbolic curriculum approaches covered in §4.1 (Knowledge-Infused Learning, symbolic constraint training) address precisely this gap — they provide a ground-truth signal that STaR's self-verification cannot. This makes STaR an important *complementary* technique for the neural component of hybrid systems, not a substitute for symbolic verification.

*Reference:* Zelikman, Eric, Yuhuai Wu, Jesse Mu, and Noah Goodman. "STaR: Bootstrapping Reasoning with Reasoning." *Advances in Neural Information Processing Systems (NeurIPS)* 35 (2022). <https://arxiv.org/abs/2203.14465>

> **Practical upshot:** Consider o1/o3-class models as a significantly improved *LLM* component in the LLM-Modulo framework (Section 1.3). Their extended reasoning makes them better plan generators and better NL → PDDL translators. They do not eliminate the need for the symbolic critic.
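The generate-then-verify division of labor can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the LLM-Modulo reference implementation: the task domain `PREREQS`, the `critic`, the `llm_modulo` loop, and the canned `make_stub_generator` (which stands in for a real LLM call) are all hypothetical.

```python
from typing import Callable, Optional

Plan = list[str]

# Hypothetical domain: each task lists the tasks that must be completed before it.
PREREQS: dict[str, set[str]] = {
    "pour_foundation": set(),
    "build_walls": {"pour_foundation"},
    "install_roof": {"build_walls"},
}


def critic(plan: Plan) -> list[str]:
    """Symbolic check: return all violations; an empty list means the plan is valid."""
    done: set[str] = set()
    violations: list[str] = []
    for step, task in enumerate(plan, start=1):
        if task not in PREREQS:
            violations.append(f"step {step}: unknown task {task}")
            continue
        missing = PREREQS[task] - done
        if missing:
            violations.append(f"step {step}: {task} requires {sorted(missing)}")
        done.add(task)
    unfinished = set(PREREQS) - done
    if unfinished:
        violations.append(f"goal not reached: missing {sorted(unfinished)}")
    return violations


def llm_modulo(generate: Callable[[list[str]], Plan], budget: int) -> tuple[Optional[Plan], int]:
    """Call the generator until the critic accepts a plan or the budget runs out."""
    feedback: list[str] = []
    for calls in range(1, budget + 1):
        candidate = generate(feedback)  # the critic's feedback goes back to the generator
        feedback = critic(candidate)
        if not feedback:
            return candidate, calls
    return None, budget


def make_stub_generator(candidates: list[Plan]) -> Callable[[list[str]], Plan]:
    """Stand-in for an LLM: replays canned candidates and ignores the feedback."""
    remaining = iter(candidates)
    return lambda feedback: next(remaining)


stub = make_stub_generator([
    ["build_walls", "pour_foundation", "install_roof"],  # violates a prerequisite
    ["pour_foundation", "build_walls", "install_roof"],  # valid
])
plan, calls = llm_modulo(stub, budget=2)
print(plan, calls)
```

A stronger generator, such as an o1/o3-class model, would reduce the number of iterations the loop needs. The guarantee that an accepted plan is valid still comes entirely from `critic`.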

***

## Open Problems

1. **Soundness under approximate verification.** When the symbolic critic in an LLM-Modulo system is itself imperfect (e.g., a learned verifier rather than a formal one), what formal bounds can be placed on plan quality? Can we characterize the failure modes precisely?
2. **Benchmarking extended reasoning.** PlanBench and similar benchmarks were designed before o1-class models existed. Do existing benchmarks adequately distinguish genuine systematic search from high-quality reasoning mimicry? What new benchmark properties are needed?
3. **Knowledge acquisition bottleneck at scale.** For neurosymbolic systems to match LLMs' breadth of world knowledge, symbolic knowledge bases must scale proportionally. What is the right architecture for knowledge bases that are both formally verifiable and broad enough to support general-purpose reasoning?
4. **Calibration of the neural component.** In LLM-Modulo, how should the LLM's output be calibrated to maximize the efficiency of the verification loop? Can the LLM learn to generate candidates that are *close* to valid plans (minimizing critic iterations) without sacrificing diversity?

***

## Exercises

**1.1** *(Conceptual)* An LLM generates the following plan for moving blocks A, B, C from a configuration where A is on B and B is on C (with C on the table) to a goal where C is on B and B is on A: `(1) pick-up A (2) stack A on C (3) pick-up B (4) stack B on A (5) pick-up C (6) stack C on B`. Identify every precondition violation. What does this reveal about the LLM's world model?

**1.2** *(Research)* Locate three papers from 2024–2026 that evaluate o1 or o3 on planning or multi-step reasoning benchmarks. For each paper: (a) what benchmark was used? (b) how did the model perform relative to classical planners? (c) did the paper provide a formal soundness analysis or only empirical evaluation?

**1.3** *(Design)* For a hospital bed-assignment system (patients have requirements: room type, equipment, proximity to nursing station; beds have properties), specify: (a) which components you would implement with neural networks, (b) which with a symbolic planner, (c) what the neural-to-symbolic interface would look like. Justify each choice using Table 1.4.

**1.4** *(Mathematical)* The LLM-Modulo framework iterates until the plan passes all critics or a budget is exhausted. Assume the LLM generates a correct plan with probability $p$ per attempt and the critics are perfect (no false positives or negatives). What is the expected number of LLM calls to produce a verified plan? How does this change if $p = 0.1$ vs. $p = 0.7$?

**1.5** *(Critical Analysis)* Kambhampati et al. argue that LLMs cannot plan. A counterargument is: "GPT-4 solved this 5-step Blocksworld problem correctly." Construct the strongest possible version of this counterargument, then explain why it does not invalidate the original thesis.

> **Next:** [Chapter 2 — A History of Success](/neuro-symbolic-ai-in-practice/part-ii-background/chapter-2.md) grounds the neuro-symbolic approach in a track record of real deployments — from NASA spacecraft operations through AlphaProof's silver-medal IMO performance.

***

## References

1. Stechly, Kaya, Matthew Marquez, and Subbarao Kambhampati. "GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems." *NeurIPS 2023 Workshop on Mathematical Reasoning and AI*, 2023. <https://arxiv.org/abs/2310.12397>
2. Marcus, Gary, and Ernest Davis. *Rebooting AI: Building Artificial Intelligence We Can Trust*. Pantheon Books, 2019. <https://rebooting.ai>
3. Valmeekam, Karthik, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. "Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change)." *NeurIPS 2022 Workshop on Foundation Models for Decision Making*, 2022. <https://arxiv.org/abs/2206.10498>
4. Valmeekam, Karthik, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. "PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change." *Advances in Neural Information Processing Systems* 36 (NeurIPS 2023). <https://proceedings.neurips.cc/paper_files/paper/2023/hash/7a92bcdede88e0c9e7facd71cd5f6f78-Abstract-Datasets_and_Benchmarks.html> | Benchmark: <https://github.com/karthikv792/LLMs-Planning>
5. Kambhampati, Subbarao, et al. "Position: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks." *Forty-First International Conference on Machine Learning (ICML)*, 2024. <https://arxiv.org/abs/2402.01817>
6. Sheth, Amit, Kaushik Roy, and Manas Gaur. "Neurosymbolic Artificial Intelligence (Why, What, and How)." *IEEE Intelligent Systems* 38.3 (2023): 56–62. <https://doi.org/10.1109/MIS.2023.3234994>
7. Kaplan, Jared, et al. "Scaling Laws for Neural Language Models." *arXiv preprint* arXiv:2001.08361 (2020). <https://arxiv.org/abs/2001.08361>
8. Ghallab, Malik, Dana Nau, and Paolo Traverso. *Automated Planning: Theory and Practice*. Morgan Kaufmann, 2004. <https://www.sciencedirect.com/book/9781558608566/automated-planning>
9. Helmert, Malte. "The Fast Downward Planning System." *Journal of Artificial Intelligence Research* 26 (2006): 191–246. <https://doi.org/10.1613/jair.1879> | Code: <https://github.com/aibasel/downward>
10. OpenAI. "OpenAI o1 System Card." *OpenAI Technical Report*, 2024. <https://openai.com/index/openai-o1-system-card/> | See also: OpenAI. "OpenAI o3 and o4-mini System Card." *OpenAI Technical Report*, 2025.
11. Xie, Jian, et al. "TravelPlanner: A Benchmark for Real-World Planning with Language Agents." *Proceedings of the Forty-First International Conference on Machine Learning (ICML)*, 2024. <https://arxiv.org/abs/2402.01622> | Benchmark: <https://github.com/OSU-NLP-Group/TravelPlanner>
12. Chollet, François. "On the Measure of Intelligence." *arXiv preprint* arXiv:1911.01547 (2019). <https://arxiv.org/abs/1911.01547> | Benchmark: <https://github.com/fchollet/ARC>
13. Lake, Brenden M., and Marco Baroni. "Generalization Without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks." *Proceedings of the Thirty-Fifth International Conference on Machine Learning (ICML)*, 2018. <https://arxiv.org/abs/1711.00350> | Benchmark: <https://github.com/brendenlake/SCAN>
14. Kim, Najoung, and Tal Linzen. "COGS: A Compositional Generalization Challenge Based on Semantic Interpretation." *Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2020. <https://arxiv.org/abs/2010.05465>
15. LeCun, Yann. "A Path Towards Autonomous Machine Intelligence." *OpenReview*, Version 0.9.2, June 2022. <https://openreview.net/forum?id=BZ5a1r-kVsf>
16. Terver, Basile, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, and Yann LeCun. "What Drives Success in Physical Planning with Joint-Embedding Predictive World Models." *Transactions on Machine Learning Research (TMLR)*, 2026. arXiv:2512.24497. <https://arxiv.org/abs/2512.24497> | Code: <https://github.com/facebookresearch/jepa-wms>
17. Davis, Liam, et al. "Lattice Deduction Transformers." *arXiv preprint* arXiv:2605.08605 (2026). <https://arxiv.org/abs/2605.08605>

