LLM Powered Autonomous Agents: Interacting With External Applications

RAG allows LLMs to interact with external data sources. Similarly, we can augment an LLM by allowing it to interact with external applications. Connecting LLMs to external applications lets the model act on the broader world, extending its utility beyond language tasks.

In general, we use the LLM as a reasoning engine, the agent's brain, that decides which actions to take. LLMs are only so good at fact retrieval on their own, due to issues like hallucination, but give them APIs to fetch their own facts and we uncover a lot of latent power in these models. Several proof-of-concept demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potential for LLMs to solve a wide range of problems then extends far beyond just generating well-written content; the LLM can be framed as a powerful general problem solver.

Massive thanks to Lilian Weng for the inspiration for this article. 1

Agent System Overview

Motivating Example

Consider a shopping chatbot that can process a return request from start to finish. For instance, a customer might say, "I need to return a pair of jeans I purchased." The bot could respond, "Can you tell me your order number?" and proceed based on the customer's response. The chatbot can look up order details via SQL, confirm return requests, generate a return label from a shipping partner and, once the API request is completed, email the label to the customer. LLMs can be used to trigger actions when interacting with APIs, and this enables them to perform complex tasks.
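The sketch below illustrates this action-triggering pattern; it is a minimal, hypothetical example rather than any particular vendor's API. call_llm is a stub standing in for a real model call, and the dispatch logic is simplified to a few hard-coded actions.

import json

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model API here and ask it
    # to respond with structured JSON describing the next action to take.
    return json.dumps({"action": "lookup_order", "args": {"order_id": 21104}})

def dispatch(completion: str) -> str:
    # Parse the structured completion and trigger the matching application action.
    step = json.loads(completion)
    if step["action"] == "lookup_order":
        return f"run SQL lookup for order {step['args']['order_id']}"
    if step["action"] == "request_label":
        return "call the shipping partner's API for a return label"
    if step["action"] == "email_label":
        return "email the label to the customer"
    return "no action taken"

print(dispatch(call_llm("I need to return a pair of jeans I purchased.")))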

Components

In LLM-powered autonomous agent systems, the LLM functions as the agent’s brain. It is complemented by several key components:

  • Component One: Planning
    • Sub-goal and decomposition: The agent breaks down large tasks into smaller, manageable sub-goals, enabling efficient handling of complex tasks. These sub-goals need to be understandable and correspond to allowed actions. In the above example, the important steps were:
      • Checking order ID
      • Requesting return label from shipper
      • Verifying user email
      • Emailing user label
    • Reflection and refinement: The agent can perform self-criticism and self-reflection over past actions, learn from mistakes, and refine its approach for future steps, thereby improving the quality of final results.
  • Component Two: Memory
    • Short-term memory: In-context learning uses the model's short-term memory to learn (see Prompt Engineering).
    • Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.
  • Component Three: Tool Use
    • The agent learns to call external APIs for extra information that is missing from the model weights and is often hard to add after pre-training. This includes current information, code execution capability, access to proprietary information sources, etc.
    • Validate Actions - The model might need to collect information that allows it to validate an action. In the above example, the bot needed to verify the email address that the user provided. Any information required for validation needs to be obtained from the user and included in the completion so it can be passed through to the application.
    • Format Outputs - The completions need to be formatted in a way that the broader application can understand. This could be as simple as a specific sentence structure or as complex as writing a script in Python or generating an SQL query. For example, an SQL query for the above example might be:
SELECT COUNT(*)
FROM orders
WHERE order_id = 21104;

An overview of an LLM-powered autonomous agent system is shown below.

Overview of an LLM-powered autonomous agent system

Prompts and completions are at the heart of these workflows, since the actions the app takes in response to user requests are determined by the LLM, which serves as the application's reasoning engine. Structuring the prompt correctly is important for all of these considerations and can make a huge difference in the quality of the generated plan or the adherence to a desired output format. This is why prompting techniques like CoT and ReAct are so important for these systems.

Model Size and Reasoning Ability

The ability of a model to reason well and plan actions depends on its scale. Generally speaking, larger models are better for techniques that use advanced prompting like PAL or ReAct. Smaller models may struggle to understand the tasks in highly structured prompts and may require additional fine-tuning to improve their ability to reason and plan.

This would increase development time. Thus, it is better to start with a larger, more capable model and collect plenty of user data in deployment. We can later use this data to fine-tune a smaller model and switch to it.

Component One: Planning

A complicated task usually breaks down into many steps. An agent needs to know what they are and plan ahead.

Task Decomposition

Chain of thought (CoT) 2 has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to "think step by step" to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and sheds light on the model’s thinking process.
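As a toy illustration, a CoT cue can be as simple as appending an instruction to the prompt; call_llm below is a placeholder for any completion API and its response is stubbed.

def call_llm(prompt: str) -> str:
    # Stubbed response; a real model would produce the intermediate steps itself.
    return "Step 1: 23 pairs on Monday. Step 2: 46 on Tuesday. Answer: 69."

question = "A store sold 23 pairs of jeans on Monday and twice as many on Tuesday. How many in total?"
print(call_llm(question + "\nLet's think step by step."))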

Tree of Thoughts (ToT) 3 extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
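The control flow can be sketched as a small beam-style BFS over candidate thoughts. In the sketch below, propose_thoughts and score_thought are stubs standing in for the LLM's generation and evaluation prompts; they are not part of the original implementation.

def propose_thoughts(state: str, k: int = 3) -> list[str]:
    # Stub for an LLM "propose next thoughts" prompt.
    return [f"{state} -> thought {i}" for i in range(k)]

def score_thought(state: str) -> float:
    # Stub for an LLM "evaluate this partial solution" prompt (or a majority vote).
    return float(len(state) % 7)

def tot_bfs(problem: str, steps: int = 3, beam: int = 2) -> str:
    frontier = [problem]
    for _ in range(steps):
        candidates = [t for s in frontier for t in propose_thoughts(s)]
        # Keep only the most promising states at each level (breadth-first with a beam).
        frontier = sorted(candidates, key=score_thought, reverse=True)[:beam]
    return frontier[0]

print(tot_bfs("Solve the puzzle"))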

Tree of thought

Task decomposition, in general, can be done via:

  1. LLM with simple prompting like "Steps for XYZ." or "What are the subgoals for achieving XYZ?"
  2. Using task-specific instructions; e.g. "Write a story outline." for writing a novel
  3. With human inputs.

There are many other techniques, like Self-Ask, self-consistency sampling, IRCoT (Interleaving Retrieval CoT), etc., that can help the model decompose tasks. See more in Lilian Weng's post on prompting.

Another quite distinct approach, LLM+P 4, involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process:

  1. The LLM translates the problem into "Problem PDDL"
  2. It then requests a classical planner to generate a PDDL plan based on an existing "Domain PDDL"
  3. It then translates the PDDL plan back into natural language.

Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner, which is common in certain robotic setups but not in many other domains.
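A rough sketch of that hand-off is shown below; both the LLM and the classical planner are stubbed placeholders, and no particular planner is assumed.

def llm(prompt: str) -> str:
    # Stub: the LLM would write Problem PDDL or translate a plan to natural language.
    return "(define (problem stack-blocks) ...)"

def classical_planner(domain_pddl: str, problem_pddl: str) -> str:
    # Stub: stands in for an external PDDL planner invocation.
    return "(pick-up a) (stack a b)"

def llm_plus_p(task: str, domain_pddl: str) -> str:
    problem_pddl = llm(f"Translate this task into Problem PDDL:\n{task}")   # step 1
    plan = classical_planner(domain_pddl, problem_pddl)                     # step 2
    return llm(f"Translate this PDDL plan into natural language:\n{plan}")  # step 3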

Self-Reflection

Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable. ReAct 5 integrates reasoning and acting within LLMs by extending the action space to be a combination of task-specific discrete actions and the language space.

Task-specific actions enable the LLM to interact with its environment (e.g. use a Wikipedia search API), while the language space prompts the LLM to generate reasoning traces in natural language. The ReAct prompt template incorporates explicit steps for the LLM to think, roughly formatted as:

Thought: ...
Action: ...
Observation: ...
... (Repeated many times)

Here are some examples of reasoning trajectories for knowledge-intensive tasks (e.g. HotpotQA, FEVER) and decision-making tasks (e.g. AlfWorld Env, WebShop) from the ReAct paper:

Examples of reasoning trajectories for knowledge-intensive tasks

In both experiments on knowledge-intensive tasks and decision-making tasks, ReAct works better than the Act-only baseline, where the Thought: ... step is removed.
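A minimal loop implementing the ReAct pattern might look like the sketch below; call_llm and wikipedia_search are stubbed placeholders, and the action parsing is deliberately naive.

def call_llm(prompt: str) -> str:
    # Stub: a real model would continue the Thought/Action trace from the prompt.
    return "Thought: I should look this up.\nAction: search[Colorado orogeny]"

def wikipedia_search(query: str) -> str:
    # Stub: stands in for a task-specific discrete action (e.g. a Wikipedia API call).
    return "The Colorado orogeny was an episode of mountain building..."

def react(question: str, max_turns: int = 3) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_turns):
        output = call_llm(prompt)
        prompt += output + "\n"
        if "Action: search[" in output:
            query = output.split("Action: search[")[1].rstrip("]\n")
            prompt += f"Observation: {wikipedia_search(query)}\n"
        else:
            return output  # the model produced a final answer instead of an action
    return prompt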

Reflexion 6 is another framework to allow agents to learn more effectively by using verbal self-reflection to understand and correct their mistakes. These self-reflections are stored in a dynamic memory to improve reasoning skills. Reflexion has a standard RL setup, in which the reward model provides a simple reward and the action space follows the setup in ReAct where the task-specific action space is augmented with language to enable complex reasoning steps.

Reflexion uses a technique where the model reflects verbally on its performance after each action $a_t$. It talks about what it did, what went right or wrong, and what it can do better next time. The agent then computes a heuristic $h_t$ and optionally may decide to reset the environment to start a new trial depending on the self-reflection results.

Illustration of the Reflexion framework.

The heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped. Inefficient planning refers to trajectories that take too long without success. Hallucination is defined as encountering a sequence of consecutive identical actions that lead to the same observation in the environment.
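A toy version of such a stopping heuristic is sketched below; the thresholds are illustrative and not taken from the paper.

def should_stop(trajectory: list, max_steps: int = 30, repeat_threshold: int = 3) -> bool:
    # trajectory: list of (action, observation) pairs from the current trial.
    if len(trajectory) > max_steps:                # inefficient planning: too long without success
        return True
    repeats = 1
    for prev, curr in zip(trajectory, trajectory[1:]):
        repeats = repeats + 1 if curr == prev else 1
        if repeats >= repeat_threshold:            # hallucination: identical action/observation repeating
            return True
    return False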

Self-reflection is created by showing two-shot examples to the LLM, with each example being a pair of:

  1. A failed trajectory
  2. An ideal reflection for guiding future changes in the plan

The model keeps these reflections in a working memory called an "episodic memory buffer." This memory stores the feedback from past trials, to be used as context for future queries. Below are experiments on AlfWorld Env (decision-making) and HotpotQA (knowledge-intensive). Hallucination is a more common failure than inefficient planning in AlfWorld.

Experiments on AlfWorld Env and HotpotQA.

Chain of Hindsight (CoH) 7 encourages the model to improve on its own outputs by explicitly presenting it with a sequence of past outputs, each annotated with human feedback. This data is a collection of:

$$\mathcal{D}_h = \left\{\left(x, y_i, r_i, z_i\right)\right\}_{i=1}^{n}$$

where:

  • $x$ is the prompt
  • Each $y_i$ is a model completion
  • Each $r_i$ is the human rating of $y_i$
  • $z_i$ is the corresponding human-provided hindsight feedback

The model is supervised fine-tuned on sequences of outputs and feedback, encouraging it to generate better responses over time. The training data is a sequence in the form of:

$$\tau_h = \left(x, z_i, y_i, z_j, y_j, \ldots, z_n, y_n\right)$$

where $i \leq j \leq n$. Assume the feedback tuples are ranked by reward, $r_n \geq r_{n-1} \geq \ldots \geq r_1$. The model is trained to predict the best output $y_n$ based on the sequence of feedback, such that the model can self-reflect to produce better output based on the feedback sequence.

To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because there are many common words in feedback sequences), they randomly mask 0% - 5% of past tokens during training.
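A rough sketch of how such a training sequence could be assembled and lightly masked follows; the field layout and masking rate below are illustrative, not the paper's exact implementation.

import random

def build_coh_sequence(prompt: str, examples: list) -> str:
    # examples: list of (output y_i, reward r_i, hindsight feedback z_i) tuples.
    ranked = sorted(examples, key=lambda e: e[1])      # order so that r_1 <= ... <= r_n
    sequence = prompt
    for y, _, z in ranked:
        sequence += f" {z} {y}"                        # interleave feedback with outputs
    return sequence

def mask_past_tokens(tokens: list, rate: float = 0.05) -> list:
    # Randomly mask a small fraction of past tokens to discourage simple copying.
    return [t if random.random() > rate else "<mask>" for t in tokens]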

The training dataset in their experiments is a combination of WebGPT comparisons, summarization from human feedback, and the human preference dataset. After fine-tuning with CoH, the model can follow instructions to produce outputs with incremental improvement in a sequence:

Chain of Hindsight

The idea of CoH is to present a history of sequentially improved outputs in context and train the model to take on the trend to produce better outputs.

Component Two: Memory

Types of Memory

Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in human brains.

  1. Sensory Memory: This is the earliest stage of memory, providing the ability to retain impressions of sensory information (visual, auditory, etc) after the original stimuli have ended. Sensory memory typically only lasts for up to a few seconds. Subcategories include iconic memory (visual), echoic memory (auditory), and haptic memory (touch).
  2. Short-Term Memory (STM) or Working Memory: It stores the information we are currently aware of and that is needed to carry out complex cognitive tasks such as learning and reasoning. Short-term memory is believed to have a capacity of about 7 items 8 and to last for 20-30 seconds.
  3. Long-Term Memory (LTM): Long-term memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. There are two subtypes of LTM:
    • Explicit / declarative memory: This is memory of facts and events, and refers to those memories that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
    • Implicit / procedural memory: This type of memory is unconscious and involves skills and routines that are performed automatically, like riding a bike or typing on a keyboard.

Categorization of human memory

We can roughly consider the following mappings:

  • Sensory memory as learning embedding representations for raw inputs, including text, image or other modalities;
  • Short-term memory as in-context learning. It is short and finite, as it is restricted by the finite context window length of the Transformer.
  • Long-term memory as the external vector store that the agent can attend to at query time, accessible via fast retrieval (a toy sketch follows below).

Component Three: Tool Use

Tool use is a remarkable and distinguishing characteristic of human beings. We create, modify and utilize external objects to do things that go beyond our physical and cognitive limits. Equipping LLMs with external tools can significantly extend the model capabilities.

Modular Reasoning, Knowledge and Language (MRKL) 9 is a neuro-symbolic architecture for autonomous agents. An MRKL system contains a collection of "expert" modules, and a general-purpose LLM works as a router that routes inquiries to the best-suited expert module. These modules can be neural (e.g. deep learning models) or symbolic (e.g. a math calculator, currency converter, weather API).

The authors ran an experiment on fine-tuning an LLM to call a calculator, using arithmetic as a test case. Their experiments showed that it was harder to solve verbal math problems than explicitly stated math problems because the LLM (a 7B Jurassic1-large model) failed to reliably extract the right arguments for basic arithmetic. The results highlight that even when the external symbolic tools work reliably, knowing when and how to use them is crucial, and that is determined by the LLM's capability.

Both Tool Augmented Language Models (TALM) 10 and Toolformer 11 fine-tune a LM to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation improves the quality of model outputs. See more details in the "External APIs" section of Lilian Weng's Prompt Engineering post.

ChatGPT Plugins and OpenAI API function calling are good examples of LLMs augmented with tool use capability working in practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calls).
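The general pattern, independent of any particular provider, is to describe tools to the model, let it emit a structured call, and have the application execute it. The field names and schema in the sketch below are illustrative only, not a specific provider's API.

import json

TOOLS = {
    "get_weather": {
        "description": "Get the current weather for a city",
        "parameters": {"city": {"type": "string"}},
        "fn": lambda city: f"Sunny in {city}",   # stub implementation of the tool
    },
}

def execute_tool_call(completion: str) -> str:
    # completion is assumed to be structured output from the model,
    # e.g. '{"name": "get_weather", "arguments": {"city": "Paris"}}'.
    call = json.loads(completion)
    tool = TOOLS[call["name"]]
    return tool["fn"](**call["arguments"])

print(execute_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))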

HuggingGPT 12 is a framework that uses ChatGPT as a task planner to select models available on the HuggingFace platform according to their model descriptions and to summarize the response based on the execution results.

HuggingGPT

The system comprises four stages:

  1. Task planning: The LLM works as the brain and parses the user request into multiple tasks. There are four attributes associated with each task: task type, ID, dependencies, and arguments. Few-shot examples are used to guide the LLM in task parsing and planning (see the sketch after this list).
  2. Model selection: They then distribute the tasks to expert models, where the request is framed as a multiple-choice question. The LLM is presented with a list of models to choose from. Due to the limited context length, task type based filtration is needed to narrow down the model selection.
  3. Task execution: Expert models execute on the specific tasks and log results.
  4. Response generation: The LLM receives the execution results and provides summarized results to users.
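For intuition, a parsed task plan for a request like "describe this image and read the description aloud" could look roughly like the structure below; the field names paraphrase the four attributes above and are not the paper's exact schema.

plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "photo.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0], "args": {"text": "<output of task 0>"}},
]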

To put HuggingGPT into real-world usage, a couple of challenges need to be solved:

  1. Efficiency improvement is needed as both LLM inference rounds and interactions with other models slow down the process.
  2. It relies on a long context window to communicate over complicated task content.
  3. The stability of both LLM outputs and external model services needs to improve.

API-Bank 13 is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls. The selection of APIs is quite diverse, including search engines, calculators, calendar queries, smart home control, schedule management, health data management, account authentication workflows and more. Because there are a large number of APIs, the LLM first has access to an API search engine to find the right API to call and then uses the corresponding documentation to make the call.

In the API-Bank workflow, LLMs need to make a couple of decisions, and at each step we can evaluate how accurate that decision is. Decisions include:

  1. Whether an API call is needed.
  2. Identify the right API to call: if not good enough, LLMs need to iteratively modify the API inputs (e.g. deciding search keywords for Search Engine API).
  3. Response based on the API results: the model can choose to refine and call again if the results are not satisfactory.

This benchmark evaluates the agent’s tool use capabilities at three levels:

  • Level-1: Evaluates the ability to call the API. Given an API’s description, the model needs to determine whether to call a given API, call it correctly, and respond properly to API returns.
  • Level-2: Examines the ability to retrieve the API. The model needs to search for possible APIs that may solve the user’s requirement and learn how to use them by reading documentation.
  • Level-3: Assesses the ability to plan API use beyond retrieving and calling. Given unclear user requests (e.g. schedule group meetings, book flight/hotel/restaurant for a trip), the model may have to conduct multiple API calls to solve it.

Case Studies

Scientific Discovery Agent

ChemCrow 14 is a domain-specific example in which an LLM is augmented with 13 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. The workflow, implemented in LangChain, reflects what was previously described in the ReAct and MRKL frameworks and combines CoT reasoning with tools relevant to the tasks:

  • The LLM is provided with a list of tool names, descriptions of their utility, and details about the expected input/output.
  • It is then instructed to answer a user-given prompt using the provided tools when necessary. The instruction tells the model to follow the ReAct format: Thought, Action, Action Input, Observation.

One interesting observation is that while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations by experts, oriented towards the completeness and chemical correctness of the solutions, showed that ChemCrow outperforms GPT-4 by a large margin. This indicates a potential problem with using an LLM to evaluate its own performance in domains that require deep expertise: the lack of expertise may leave the LLM unaware of its flaws and therefore unable to judge the correctness of task results.

Boiko et al. 15 also looked into LLM-empowered agents for scientific discovery, to handle autonomous design, planning, and performance of complex scientific experiments. This agent can use tools to browse the internet, read documentation, execute code, call robotics experimentation APIs and leverage other LLMs. For example, when requested to "develop a novel anticancer drug", the model came up with the following reasoning steps:

  1. It inquired about current trends in anticancer drug discovery
  2. It selected a target
  3. It requested a scaffold targeting these compounds
  4. Once the compound was identified, the model attempted its synthesis.

They also discussed the risks, especially with illicit drugs and bioweapons. They developed a test set containing a list of known chemical weapon agents and asked the agent to synthesize them. 4 out of 11 requests (36%) were accepted to obtain a synthesis solution and the agent attempted to consult documentation to execute the procedure. 7 out of 11 were rejected and among these 7 rejected cases, 5 happened after a web search while 2 were rejected based on prompt only.

Proof-of-Concept Examples

AutoGPT has drawn a lot of attention to the possibility of setting up autonomous agents with an LLM as the main controller. It has quite a lot of reliability issues given the natural language interface, but it is nevertheless a cool proof-of-concept demo. A lot of the code in AutoGPT is about format parsing.

GPT-Engineer is another project that creates a whole repository of code given a task specified in natural language. GPT-Engineer is instructed to think over a list of smaller components to build and to ask for user input to clarify questions as needed. For example, when asked to build a Super Mario game in Python:

[
  {
    "role": "assistant",
    "content": "Summary of areas that need clarification:\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\n2. Details about the MVC components (e.g. which components are in each file)\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\n\nClarifying question:\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?"
  }
]

Then, after these clarifications, the agent moves into code-writing mode with a different system message.

Challenges

After going through key ideas and demos of building LLM-centered agents, there are a couple of common limitations:

  • Finite context length: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention.
  • Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error.
  • Reliability of natural language interface: Current agent systems rely on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refusing to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output.

References

Footnotes

  1. Lilian Weng, Lil’Log: “LLM-powered Autonomous Agents”, Jun 2023

  2. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou: “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, 2022

  3. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan: “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, 2023

  4. Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, Peter Stone: “LLM+P: Empowering Large Language Models with Optimal Planning Proficiency”, 2023

  5. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao: “ReAct: Synergizing Reasoning and Acting in Language Models”, 2022

  6. Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao: “Reflexion: Language Agents with Verbal Reinforcement Learning”, 2023

  7. Hao Liu, Carmelo Sferrazza, Pieter Abbeel: “Chain of Hindsight Aligns Language Models with Feedback”, 2023

  8. Miller, G. A.: “The magical number seven, plus or minus two: Some limits on our capacity for processing information”. Psychological Review, 63(2), 81–97, 1956

  9. Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, Moshe Tenenholtz: “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning”, 2022

  10. Aaron Parisi, Yao Zhao, Noah Fiedel: “TALM: Tool Augmented Language Models”, 2022

  11. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom: “Toolformer: Language Models Can Teach Themselves to Use Tools”, 2023

  12. Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang: “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face”, 2023

  13. Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li: “API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs”, 2023

  14. Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, Philippe Schwaller: “ChemCrow: Augmenting large-language models with chemistry tools”, 2023

  15. Daniil A. Boiko, Robert MacKnight, Gabe Gomes: “Emergent autonomous scientific research capabilities of large language models”, 2023