# 2.3 LLM Agent Patterns

The systems above represent fixed neuro-symbolic architectures. A second wave of research — motivated by practical deployment needs — focuses on *patterns* for integrating LLM reasoning with external symbolic tools at runtime. These patterns are agnostic to the specific tools involved and compose naturally with the planning architectures described in this book.

## ReAct — Synergizing Reasoning and Acting (2023)

**ReAct** (*Reason + Act*) is the most influential tool-use pattern in modern LLM-based agents (Yao et al., 2023b).

The key observation: chain-of-thought prompting improves LLM reasoning but does not allow the model to retrieve external information or verify its reasoning against the world. Action-only agents, which call tools without reasoning traces, often take inappropriate actions because they reason too little about each step. ReAct combines both into a single interleaved loop:

```
Thought:  I need to find the capital of the country where the 2024 IMO was held.
Action:   Wikipedia["2024 International Mathematical Olympiad"]
Obs:      The 2024 IMO was held in Bath, United Kingdom.
Thought:  The UK's capital is London. The answer is London.
Action:   Finish["London"]
```

Each step in ReAct consists of a natural-language *Thought* (internal reasoning), an *Action* (tool call — Wikipedia lookup, search engine query, calculator, code execution, etc.), and an *Observation* (tool result). The model uses observations to update its reasoning, recover from errors, and decide when to conclude.
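
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `scripted_llm` is a hypothetical stand-in that replays the example trace instead of calling a real model, `wikipedia` is a canned lookup table rather than a real API, and the `Tool["argument"]` action syntax simply mirrors the trace shown earlier.

```python
import re
from typing import Callable, Dict

def wikipedia(query: str) -> str:
    """Hypothetical tool: a canned lookup table standing in for a real search API."""
    pages = {
        "2024 International Mathematical Olympiad":
            "The 2024 IMO was held in Bath, United Kingdom.",
    }
    return pages.get(query, "No page found.")

TOOLS: Dict[str, Callable[[str], str]] = {"Wikipedia": wikipedia}

# Matches lines such as: Action: Wikipedia["some query"]
ACTION_RE = re.compile(r'^Action:\s*(\w+)\["(.*)"\]\s*$', re.MULTILINE)

def scripted_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): replays the example trace."""
    if "Obs:" not in prompt:
        return ('Thought: I need to find where the 2024 IMO was held.\n'
                'Action: Wikipedia["2024 International Mathematical Olympiad"]')
    return ("Thought: The UK's capital is London. The answer is London.\n"
            'Action: Finish["London"]')

def react(question: str, llm: Callable[[str], str], max_steps: int = 5) -> str:
    """Interleave model output (Thought + Action) with tool results (Obs)."""
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(prompt)
        prompt += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            # Feed the parse failure back so the model can correct itself.
            prompt += "Obs: Could not parse an action.\n"
            continue
        tool, arg = match.group(1), match.group(2)
        if tool == "Finish":
            return arg
        handler = TOOLS.get(tool)
        obs = handler(arg) if handler else f"Unknown tool {tool}."
        prompt += f"Obs: {obs}\n"
    return "No answer within step budget."

answer = react("What is the capital of the country that hosted the 2024 IMO?",
               scripted_llm)
print(answer)
```

The design point worth noting is that the only state is the growing prompt: each observation is appended as text, so the next Thought is conditioned on everything the tools have returned so far.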

**Key empirical results:** Yao et al. evaluated ReAct on HotpotQA (multi-hop question answering), FEVER (fact verification), and ALFWorld (text-based environment planning). ReAct outperformed tool use without reasoning traces on all three. Against chain-of-thought alone it was stronger on FEVER but not on HotpotQA, where the best results came from combining ReAct with chain-of-thought self-consistency. On ALFWorld, which requires multi-step planning in a household environment, ReAct achieved 71% success vs. 45% for pure action-taking baselines.

The neuro-symbolic reading is clear: the *Thought* steps are neural (LLM reasoning over implicit knowledge), while the *Action/Observation* cycle is symbolic (calls to external verifiable knowledge sources). The interleaving is precisely the "neural generates, symbolic verifies" pattern of LLM-Modulo ([§1.3](/neuro-symbolic-ai-in-practice/part-i-motivation/chapter-1/1-3-llm-modulo.md)), applied at the level of individual reasoning steps.

*Reference:* Yao, Shunyu, et al. "ReAct: Synergizing Reasoning and Acting in Language Models." *International Conference on Learning Representations (ICLR)*, 2023. <https://arxiv.org/abs/2210.03629> | Code: <https://github.com/ysymyth/ReAct>

***

## Reflexion — Verbal Reinforcement Learning (2023)

**Reflexion** (Shinn et al., NeurIPS 2023) adds a verbal reinforcement learning loop to the ReAct pattern. When an LLM agent fails a task, a *reflective LLM call* analyzes the failure trajectory and produces a concise verbal summary of what went wrong and how to avoid the same mistake. This verbal feedback is stored in an **episodic memory buffer** (a short text block prepended to subsequent task prompts) rather than updating model weights.

The cycle:

```
Task attempt (ReAct) → Failure
        │
        ▼
Reflective LLM call:
"I failed because I searched for X but needed Y first.
 Next attempt: search for Y before X."
        │
        ▼
Verbal feedback stored in memory
        │
        ▼
Next attempt: memory prepended to prompt
→ Agent avoids the same failure mode
```
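
The cycle above can be sketched as a retry loop around any agent. This is a hedged illustration rather than the Reflexion codebase: `run_with_reflexion`, `toy_attempt`, and `toy_reflect` are hypothetical names, the two toy functions stand in for a ReAct agent and a reflective LLM call, and the `max_memory` cap is an assumption about how the buffer is kept short.

```python
from typing import Callable, List, Tuple

def run_with_reflexion(
    attempt: Callable[[str], Tuple[bool, str]],   # prompt -> (success, trajectory)
    reflect: Callable[[str], str],                # trajectory -> verbal lesson
    task: str,
    max_trials: int = 3,
    max_memory: int = 3,
) -> Tuple[bool, int, List[str]]:
    memory: List[str] = []  # episodic buffer of verbal reflections
    for trial in range(1, max_trials + 1):
        # Reflections are prepended to the task prompt; no weights change.
        prompt = "\n".join(memory + [task])
        success, trajectory = attempt(prompt)
        if success:
            return True, trial, memory
        memory.append(reflect(trajectory))
        memory = memory[-max_memory:]  # keep only the most recent lessons
    return False, max_trials, memory

def toy_attempt(prompt: str) -> Tuple[bool, str]:
    """Stand-in agent: succeeds only once the lesson is present in the prompt."""
    if "search for Y before X" in prompt:
        return True, "searched Y, then X, found the answer"
    return False, "searched X, got nothing useful"

def toy_reflect(trajectory: str) -> str:
    """Stand-in for the reflective LLM call."""
    return ("I failed because I searched for X but needed Y first. "
            "Next attempt: search for Y before X.")

ok, trials, lessons = run_with_reflexion(toy_attempt, toy_reflect,
                                         "Task: answer the question.")
print(ok, trials, lessons)
```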

**Results:** Using the same base LLM and no fine-tuning, Reflexion consistently outperforms ReAct on HotpotQA (multi-hop question answering), ALFWorld (embodied task planning), and HumanEval (programming). Improvements range from several to over 20 percentage points depending on task and model; consult Shinn et al. (2023) for benchmark-specific figures.

**Neuro-symbolic interpretation:** Verbal feedback functions as a symbolic summary of the agent's error trajectory. It is neither a gradient update (neural) nor a formal constraint (hard symbolic); it is a *soft symbolic constraint* expressed in natural language, which the LLM can interpret, generalize, and apply to novel but structurally similar situations. This is a lightweight but effective instance of "symbolic helps neural" at the agent level.

*Reference:* Shinn, Noah, et al. "Reflexion: Language Agents with Verbal Reinforcement Learning." *Advances in Neural Information Processing Systems (NeurIPS)* 36 (2023). <https://arxiv.org/abs/2303.11366> | Code: <https://github.com/noahshinn/reflexion>

***

## Toolformer — Teaching LLMs to Use Tools via Self-Supervision (2023)

**Toolformer** (Schick et al., 2023) addresses a different question: rather than prompting a pre-trained LLM to use tools at inference time, can a language model *learn* when and how to call APIs as part of its training?

The Toolformer approach is elegant in its self-supervised framing:

1. A pre-trained LLM generates candidate API call annotations for a text corpus (e.g., inserting `[Calculator(15*7) → 105]` into a sentence that requires arithmetic).
2. These candidate annotations are filtered: only those that demonstrably improve the model's ability to predict the rest of the text (measured by perplexity reduction) are retained.
3. The model is fine-tuned on the filtered, annotated corpus — learning to predict both natural language and API call syntax.

Toolformer demonstrated that a 6.7B-parameter GPT-J model fine-tuned this way outperformed GPT-3 (175B parameters) on tasks requiring tool use — including arithmetic (calculator), factual QA (Wikipedia search), translation, and date arithmetic (calendar API) — while maintaining language generation quality.

The neuro-symbolic interpretation: Toolformer learns a *policy* for when the neural component (language model) should defer to symbolic subsystems (calculators, retrieval systems, APIs). This is the "neural decides, symbolic computes" pattern — a soft hybrid that distributes computation between neural and symbolic modules based on task requirements learned from data.

*Reference:* Schick, Timo, et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." *Advances in Neural Information Processing Systems (NeurIPS)* 36 (2023). <https://arxiv.org/abs/2302.04761>

***

## Voyager — Open-Ended Autonomous Agents with Lifelong Learning (2023)

**Voyager** (Wang et al., 2023) is the most complete demonstration to date of a neuro-symbolic autonomous agent that acquires skills, manages a symbolic knowledge base, and pursues open-ended goals without human intervention.

Voyager is a Minecraft agent powered by GPT-4, with three interlocking components:

**1. Automatic curriculum:** An LLM proposes the next learning task based on the agent's current skill inventory and observed environment state ("You have wood and stone. Try crafting a stone pickaxe next."). This is symbolic goal generation driven by neural world knowledge.

**2. Skill library (symbolic memory):** Successfully completed behaviors are stored as executable JavaScript programs in a persistent code library. When the agent needs to craft a pickaxe, it retrieves the `craftPickaxe()` skill from the library, rather than re-generating it from scratch. The skill library is a growing, queryable symbolic knowledge base — a form of procedural long-term memory.

**3. Iterative prompting with execution feedback:** When GPT-4 generates a new behavior program, it is executed in the Minecraft environment. If it fails (an exception is raised, or the goal is not achieved within a time limit), the error trace and current state are fed back to GPT-4, which revises the program. This iterative refinement loop continues until the program succeeds or a maximum retry count is reached.
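
The refinement loop in the third component can be sketched as follows. This is an illustration under assumptions, not Voyager's code: `refine_until_success`, `toy_generate`, and `toy_execute` are hypothetical names, the generator stands in for a GPT-4 call, and the executor stands in for running a program in the game environment.

```python
from typing import Callable, Optional, Tuple

def refine_until_success(
    generate: Callable[[str, Optional[str]], str],   # (task, feedback) -> program
    execute: Callable[[str], Tuple[bool, str]],      # program -> (goal reached, state)
    task: str,
    max_retries: int = 4,
) -> Tuple[Optional[str], int]:
    feedback: Optional[str] = None
    for attempt in range(1, max_retries + 1):
        program = generate(task, feedback)
        try:
            succeeded, state = execute(program)
        except Exception as exc:
            # Execution errors become feedback for the next generation.
            feedback = f"error: {exc!r}"
            continue
        if succeeded:
            return program, attempt
        feedback = f"goal not reached; state: {state}"
    return None, max_retries

def toy_generate(task: str, feedback: Optional[str]) -> str:
    """Stand-in for the LLM: fixes the program once it sees an error."""
    if feedback is None:
        return "craft(pickaxe)"
    return "gather(sticks); craft(pickaxe)"

def toy_execute(program: str) -> Tuple[bool, str]:
    """Stand-in for the environment: crafting fails without sticks."""
    if "gather(sticks)" not in program:
        raise RuntimeError("missing sticks")
    return True, "inventory: pickaxe"

program, attempts = refine_until_success(toy_generate, toy_execute,
                                         "craft a pickaxe")
print(program, attempts)
```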

**Results:** Wang et al. report that Voyager obtains 3.3× more unique items, travels 2.3× farther, and unlocks key technology-tree milestones up to 15.3× faster than prior LLM-agent baselines. Critically, it can apply previously learned skills to unseen tasks, which demonstrates compositional generalization through the symbolic skill library.

The architectural lesson: the symbolic skill library is what gives Voyager its advantage. Without persistent, reusable, verifiable code artifacts, each task is solved from scratch. The skill library converts transient neural generations into durable symbolic knowledge — exactly the role that formal knowledge representations have played in classical AI.
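
A skill library of this kind can be sketched as a small store with description-based retrieval. This is a simplified illustration under assumptions: the `Skill` and `SkillLibrary` names are hypothetical, word overlap stands in for whatever similarity measure a real system would use, and the stored program text is treated as an opaque string rather than executed.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Skill:
    name: str
    description: str
    source: str   # program text; Voyager stores JavaScript, kept opaque here

class SkillLibrary:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        """Store a verified skill; a later skill with the same name replaces it."""
        self._skills[skill.name] = skill

    def retrieve(self, query: str, k: int = 1) -> List[Skill]:
        """Return up to k skills whose descriptions share words with the query."""
        q = set(query.lower().split())

        def overlap(s: Skill) -> int:
            return len(q & set(s.description.lower().split()))

        ranked = sorted(self._skills.values(), key=overlap, reverse=True)
        return [s for s in ranked[:k] if overlap(s) > 0]

library = SkillLibrary()
library.add(Skill("craftPickaxe",
                  "craft a stone pickaxe from sticks and cobblestone",
                  "async function craftPickaxe(bot) { /* ... */ }"))
library.add(Skill("mineWood",
                  "mine wood logs from a nearby tree",
                  "async function mineWood(bot) { /* ... */ }"))

hits = library.retrieve("craft pickaxe")
print([s.name for s in hits])
```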

*Reference:* Wang, Guanzhi, et al. "Voyager: An Open-Ended Embodied Agent with Large Language Models." *arXiv preprint* arXiv:2305.16291, 2023. <https://arxiv.org/abs/2305.16291> | Project: <https://voyager.minedojo.org>

***

**Language Agent Tree Search (LATS)** (A. Zhou et al., 2024a) synthesizes the patterns of this section (ReAct's tool use, Reflexion's verbal feedback, and Tree of Thoughts' structured exploration) into a principled MCTS framework where the LLM serves simultaneously as policy, value function, and reflective critic. LATS is the most rigorous current instantiation of these agent patterns and is treated in depth in the Neural MCTS section ([§4.2.2](/neuro-symbolic-ai-in-practice/part-iii-core-approaches/chapter-4/4-2-neural-helps-symbolic/4-2-neural-subroutines.md)), where its theoretical grounding in MCTS provides clearer analytical handles than the heuristic patterns of ReAct and Reflexion.

*Reference:* Zhou, Andy, et al. "Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models." *Proceedings of ICML*, 2024a. <https://arxiv.org/abs/2310.04406> | Code: <https://github.com/andyz245/LanguageAgentTreeSearch>

***

> **Evaluating LLM Agents — Benchmarks**
>
> The agent systems described in this section have spawned rigorous evaluation benchmarks that reveal a consistent picture: LLM agents perform well on simple tasks but degrade sharply as task complexity, tool diversity, and constraint count increase.
>
> **AgentBench** (Liu et al., 2024) evaluates LLMs across 8 environments: OS interaction, database queries, knowledge graph QA, digital card games, lateral thinking puzzles, household tasks, web shopping, and web browsing. GPT-4 leads with an average score of 3.05/10; open-source models lag by 3–4×. The hardest categories (OS, DB) require multi-step reasoning with no room for error: exactly the domains where symbolic verification would help.
>
> **WebArena** (S. Zhou et al., 2024b) evaluates web agents on realistic browser tasks across self-hosted sites: shopping in an e-commerce store, posting to a Reddit-style forum, managing GitLab repositories and issues, and administering a content management system. GPT-4 achieves \~14% task success rate. The primary failure modes are constraint propagation errors and multi-step action sequences where early mistakes cascade, the same failure pattern observed in PlanBench.

*References:* Liu, Xiao, et al. "AgentBench: Evaluating LLMs as Agents." *International Conference on Learning Representations (ICLR)*, 2024. <https://arxiv.org/abs/2308.03688> | Benchmark: <https://github.com/THUDM/AgentBench>

Zhou, Shuyan, et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." *International Conference on Learning Representations (ICLR)*, 2024b. <https://arxiv.org/abs/2307.13854> | Benchmark: <https://webarena.dev>

***

## Open Problems

The success stories in this chapter raise several problems that remain open at the time of writing.

**Verifying LLM-generated world models.** Guan et al. and RAP assume the LLM world model is accurate enough to guide planning. But LLMs hallucinate. How do we detect and repair world model errors before they cause planning failures? Formal methods for LLM output verification (model checking against world model axioms, online execution monitoring) are nascent areas.
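
Online execution monitoring, one of the directions mentioned above, can be sketched with a STRIPS-style action model: predict the next state from the model, then compare it with what is actually observed. This is a toy illustration under assumptions, not a method from Guan et al. or RAP; the `Action` fields, the predicate names, and the door scenario are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    del_effects: FrozenSet[str]

def predict(state: Set[str], action: Action) -> Set[str]:
    """Apply the (possibly LLM-generated) action model to a symbolic state."""
    missing = action.preconditions - state
    if missing:
        raise ValueError(f"{action.name}: unmet preconditions {sorted(missing)}")
    return (state - action.del_effects) | action.add_effects

def discrepancy(predicted: Set[str], observed: Set[str]) -> Set[str]:
    """Facts on which model and world disagree; non-empty means the model is suspect."""
    return predicted ^ observed

open_door = Action(
    name="open_door",
    preconditions=frozenset({"door_closed", "robot_at_door"}),
    add_effects=frozenset({"door_open"}),
    del_effects=frozenset({"door_closed"}),
)

state = {"door_closed", "robot_at_door"}
predicted = predict(state, open_door)

# The door was locked, a condition the generated model failed to capture.
observed = {"door_closed", "robot_at_door"}
diff = discrepancy(predicted, observed)
print(sorted(diff))
```

A non-empty discrepancy is a signal to pause and repair the action model (here, by adding an unlocked-door precondition) before planning further with it.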

**Scalable mixed-initiative planning.** MAPGEN required human scientists to review every daily plan. For systems with faster action cycles (autonomous driving, real-time robot manipulation), human review latency is prohibitive. Designing mixed-initiative frameworks where humans intervene only when the system's confidence is low remains an open engineering and human-factors problem.

**Skill library composition and maintenance.** Voyager's skill library grows monotonically. But skills become stale (game version updates, environment changes) or redundant (multiple skills that do the same thing). Principled methods for skill library pruning, versioning, and conflict detection are needed for production deployments.

**Long-horizon planning reliability.** ReAct, Tree of Thoughts, and RAP all tend to degrade as tasks grow beyond roughly 10–15 sequential reasoning steps. The underlying limit is that LLMs do not maintain provably consistent state across many steps. Integrating external symbolic state representations (scene graphs, knowledge graphs, plan execution records) with LLM reasoning to extend reliable horizon length is an active frontier.

***

## Exercises

**Exercise 2.1 — Architecture Classification.** For each system in this chapter (Remote Agent, MAPGEN, LLM+P, SayCan, ReAct, Voyager), classify it according to the taxonomy in Chapter 4: "Symbolic Helps Neural," "Neural Helps Symbolic," or "Hybrid/Co-Processing." Justify your classification.

**Exercise 2.2 — Failure Mode Analysis.** For the Guan et al. world model construction approach, describe three specific ways an LLM-generated world model could fail for a kitchen robotics domain. For each failure mode, propose a mechanism to detect or recover from it.

**Exercise 2.3 — ReAct Extension.** ReAct is described for retrieval and QA tasks. Design a ReAct-style agent for a multi-step scheduling task (e.g., booking a flight + hotel + car for a conference trip). What tools (APIs) would the agent call? Write out a five-step Thought/Action/Observation trace for a concrete instance.

**Exercise 2.4 — Skill Library Design.** For a household robot using the Voyager pattern, design a skill library schema. What metadata should each skill entry contain (beyond the executable code) to enable reliable retrieval, versioning, and failure diagnosis? Consider: preconditions, postconditions, applicability conditions, test cases.

**Exercise 2.5 — Historical Replay.** Remote Agent used a plan-execute-replan cycle with fault detection. Modern autonomous agents (ReAct, Voyager) use similar patterns. Identify three specific architectural elements from Remote Agent (1999) that appear in modern LLM agent systems, and explain what was learned, lost, or rediscovered in the intervening 25 years.

> **Next:** [Chapter 3 — Formal Foundations](/neuro-symbolic-ai-in-practice/part-ii-background/chapter-3.md) provides the mathematical foundations — STRIPS, PDDL, complexity theory, and HTN planning — that underlie every system described in this chapter.

***

## References

1. Muscettola, Nicola, P. Pandurang Nayak, Barney Pell, and Brian C. Williams. "Remote Agent: To Boldly Go Where No AI System Has Gone Before." *Artificial Intelligence* 103.1–2 (1998): 5–47. <https://doi.org/10.1016/S0004-3702(98)00068-X>
2. Muscettola, Nicola. "HSTS: Integrating Planning and Scheduling." *IJCAI Workshop on Knowledge Engineering for Planning*, 1993. <https://ti.arc.nasa.gov/publications/2/download/>
3. Johnston, Mark D., and Glenn Miller. "Spike: Intelligent Scheduling of Hubble Space Telescope Observations." *Innovative Applications of Artificial Intelligence (IAAI)*, 1994, pp. 1–8. <https://dl.acm.org/doi/10.5555/893664>
4. Ai-Chang, Michael, et al. "MAPGEN: Mixed-Initiative Planning and Scheduling for the Mars Exploration Rover Mission." *IEEE Intelligent Systems* 19.1 (2004): 8–12. <https://doi.org/10.1109/MIS.2004.1265878>
5. Bresina, John L., et al. "Activity Planning for the Mars Exploration Rovers." *Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS)*, 2005. <https://ojs.aaai.org/index.php/ICAPS/article/view/13597>
6. Chien, Steve, et al. "Using Iterative Repair to Increase the Responsiveness of Planning and Scheduling for Observation Scheduling." *Journal of Autonomous Agents and Multi-Agent Systems* 4.1–2 (2000): 129–145. <https://doi.org/10.1023/A:1010081516793>
7. Liu, Bo, et al. "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency." *arXiv preprint* arXiv:2304.11477 (2023). <https://arxiv.org/abs/2304.11477> | Code: <https://github.com/Cranial-XIX/llm-pddl>
8. Ahn, Michael, et al. "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." *Conference on Robot Learning (CoRL)*, 2022. <https://arxiv.org/abs/2204.01691> | Website: <https://say-can.github.io>
9. Liang, Jacky, et al. "Code as Policies: Language Model Programs for Embodied Control." *IEEE International Conference on Robotics and Automation (ICRA)*, 2023. <https://arxiv.org/abs/2209.07753> | Website: <https://code-as-policies.github.io>
10. Li, Yujia, et al. "Competition-Level Code Generation with AlphaCode." *Science* 378.6624 (2022): 1092–1097. <https://doi.org/10.1126/science.abq1158> | Dataset: <https://github.com/google-deepmind/code_contests>
11. AlphaProof Team and AlphaGeometry Team, DeepMind. "AI Achieves Silver-Medal Standard Solving International Mathematical Olympiad Problems." *DeepMind Technical Report*, 2024. <https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/>
12. Guan, Lin, et al. "Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning." *Advances in Neural Information Processing Systems (NeurIPS)* 36 (2023). <https://arxiv.org/abs/2305.14909>
13. Yao, Shunyu, et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." *Advances in Neural Information Processing Systems (NeurIPS)* 36 (2023a). <https://arxiv.org/abs/2305.10601> | Code: <https://github.com/princeton-nlp/tree-of-thought-llm>
14. Hao, Shibo, et al. "Reasoning with Language Model is Planning with World Model." *Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2023. <https://arxiv.org/abs/2305.14992> | Code: <https://github.com/Ber666/llm-reasoners>
15. Yao, Shunyu, et al. "ReAct: Synergizing Reasoning and Acting in Language Models." *International Conference on Learning Representations (ICLR)*, 2023b. <https://arxiv.org/abs/2210.03629> | Code: <https://github.com/ysymyth/ReAct>
16. Schick, Timo, et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." *Advances in Neural Information Processing Systems (NeurIPS)* 36 (2023). <https://arxiv.org/abs/2302.04761>
17. Wang, Guanzhi, et al. "Voyager: An Open-Ended Embodied Agent with Large Language Models." *arXiv preprint* arXiv:2305.16291, 2023. <https://arxiv.org/abs/2305.16291> | Project: <https://voyager.minedojo.org>
18. Shinn, Noah, et al. "Reflexion: Language Agents with Verbal Reinforcement Learning." *Advances in Neural Information Processing Systems (NeurIPS)* 36 (2023). <https://arxiv.org/abs/2303.11366> | Code: <https://github.com/noahshinn/reflexion>
19. Zhou, Andy, et al. "Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models." *Proceedings of the Forty-First International Conference on Machine Learning (ICML)*, 2024a. <https://arxiv.org/abs/2310.04406> | Code: <https://github.com/andyz245/LanguageAgentTreeSearch>
20. Liu, Xiao, et al. "AgentBench: Evaluating LLMs as Agents." *International Conference on Learning Representations (ICLR)*, 2024. <https://arxiv.org/abs/2308.03688> | Benchmark: <https://github.com/THUDM/AgentBench>
21. Zhou, Shuyan, et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." *International Conference on Learning Representations (ICLR)*, 2024b. <https://arxiv.org/abs/2307.13854> | Benchmark: <https://webarena.dev>
22. Brohan, Anthony, et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." *arXiv preprint* arXiv:2307.15818 (2023). <https://arxiv.org/abs/2307.15818>

