
Multi-Agent Collaboration

Many papers, including those from MIT and Google Brain 1, show that LLMs produce better results when multiple LLM instances with different roles propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer, i.e. behaving as a "multiagent society".

Overall, their findings suggest that such a "society of minds" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding.

Multiagent Debate Improves Reasoning and Factual Accuracy

This approach requires only black-box access to language model generations – no model-internal information such as likelihoods or gradients is needed. This allows the method to be used with common public model-serving interfaces.

The method is also orthogonal to other model generation improvements such as retrieval or prompt engineering. While the debate process is more costly, requiring multiple model instances and rounds, it arrives at significantly improved answers and may be used to generate additional model training data, effectively creating a model self-improvement loop.
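A minimal sketch of that debate loop, assuming a black-box `generate` call (stubbed here; swap in any chat-completion API) with illustrative personas and prompts:

```python
# Sketch of multiagent debate: each agent proposes an answer, then revises
# it over several rounds after reading the other agents' answers.
# `generate` is a stub for a black-box LLM call; no likelihoods or
# gradients are needed, matching the black-box setting described above.

def generate(prompt: str, persona: str) -> str:
    """Stub LLM call; a real version would hit a chat-completion endpoint."""
    return f"[{persona}] response to: {prompt[:40]}"

def debate(question: str, personas: list[str], rounds: int = 2) -> list[str]:
    # Round 0: each agent proposes an independent answer.
    answers = [generate(question, p) for p in personas]
    for _ in range(rounds):
        revised = []
        for i, persona in enumerate(personas):
            # Each agent reads the others' latest answers and updates its own.
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (f"Question: {question}\n"
                      f"Other agents answered:\n{others}\n"
                      f"Use their reasoning to update your answer.")
            revised.append(generate(prompt, persona))
        answers = revised
    return answers  # a final judge or majority vote would pick one answer

final = debate("What is 17 * 24?", ["mathematician", "skeptic"])
```

In practice each persona would map to a separate model instance, and a final aggregation step (a judge prompt or majority vote) extracts the consensus answer.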

Scenarios

LLMs are being used for various multi-agent collaboration scenarios:

  1. Behavior Simulation: Using generative agents in a sandbox to mimic human behavior or simulate user behaviors in recommendation systems.
  2. Data Construction: Collecting and evaluating multi-party conversations or generating detailed instructions for complex tasks using role-playing agents.
  3. Performance Improvement: Enhancing performance through role adoption, improving factual correctness and reasoning with multi-agent debates, addressing thought degeneration in self-reflection, and improving negotiation strategies in role-playing games.

Researchers have found that having multiple agents, each with unique attributes and roles, can handle complex tasks more effectively, create more realistic simulations, and even align social behaviors in LLMs. Some work involves designing interactive environments where these agents can interact to achieve goals, like creating believable social interactions or improving negotiation outcomes.

This research area is expanding the capabilities of LLMs beyond single-agent tasks to collaborative multi-agent systems, offering innovative ways to tackle complex problems that are otherwise difficult for individual agents or traditional computational methods.

Multiple Models

While many of these papers mainly study multi-agent debate across multiple instances of the same language model, multi-agent debate can also be used to combine different language models together. This enables the strengths of one model to cover the weaknesses of another, leading to more robust and accurate results overall.

Libraries

These are some of the libraries and frameworks being used to create multi-agent systems:

Generative Agents Society Simulation

Generative Agents 2 is a fun experiment in which 25 virtual characters, each controlled by an LLM-powered agent, live and interact in a sandboxed environment inspired by The Sims. Generative agents create believable simulacra of human behavior for interactive applications.

The design of generative agents combines LLMs with memory, planning and reflection mechanisms to enable agents to behave conditioned on past experience, as well as to interact with other agents.

  • Memory stream: a long-term memory module (external database) that records a comprehensive list of the agent's experiences in natural language.
    • Each element is an observation, an event directly provided by the agent.
    • Inter-agent communication can trigger new natural language statements.
  • Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance.
    • Recency: recent events have higher scores
    • Importance: distinguish mundane from core memories. Ask LM directly.
    • Relevance: based on how related it is to the current situation / query.
  • Reflection mechanism: synthesizes memories into higher level inferences over time and guides the agent’s future behavior. They are higher-level summaries of past events (note that this is a bit different from self-reflection)
    • Prompt the LM with the 100 most recent observations and ask it to generate the 3 most salient high-level questions given those observations/statements. Then ask the LM to answer those questions.
  • Planning & Reacting: translate the reflections and the environment information into actions
    • Planning essentially trades off believability at the moment against believability over time.
    • Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1)
    • Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting.
    • Environment information is represented in a tree structure.
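The retrieval scoring above can be sketched as a weighted sum of the three signals. The decay factor, the importance stub, and the word-overlap relevance below are illustrative stand-ins (the paper asks the LM to rate importance and uses embedding similarity for relevance):

```python
# Sketch of memory retrieval: score = recency + importance + relevance.
# All three components are normalized to roughly [0, 1] before summing.

def recency_score(hours_since_access: float, decay: float = 0.995) -> float:
    # Recent events score higher via exponential decay.
    return decay ** hours_since_access

def importance_score(memory: str) -> float:
    # Stand-in for asking the LM to rate importance on a 1-10 scale.
    return 8.0 if "party" in memory else 2.0

def relevance_score(memory: str, query: str) -> float:
    # Stand-in for embedding similarity: fraction of query words shared.
    m, q = set(memory.lower().split()), set(query.lower().split())
    return len(m & q) / max(len(q), 1)

def retrieval_score(memory: str, hours: float, query: str) -> float:
    return (recency_score(hours)
            + importance_score(memory) / 10  # normalize 1-10 to [0, 1]
            + relevance_score(memory, query))

memories = [("Isabella is planning a Valentine's party", 2.0),
            ("ate breakfast at the cafe", 30.0)]
query = "who is planning a party"
best = max(memories, key=lambda m: retrieval_score(m[0], m[1], query))
```

The top-scoring memories are surfaced into the agent's context before it acts.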

The generative agent architecture:


This fun simulation results in emergent social behavior, such as information diffusion, relationship memory (e.g. two agents continuing a conversation topic) and coordination of social events (e.g. hosting a party and inviting many others).

ChatDev

ChatDev 3 presents a natural-language-to-software framework and demonstrates a way to handle the complexity of the entire task flow. It adds another layer of organization, allocating agents to teams you would normally see in companies, such as design, coding, testing, and documentation teams. It also brings together different roles, including a CEO, CTO, professional programmers, test engineers, and art designers, fostering collaborative dialogue and facilitating a seamless workflow. Essentially, the agents simulate the entire software development process. The coders can also use Git for code management, and it is possible to build on existing code bases.

Chat Dev

The company mirrors the established waterfall model, meticulously dividing the development process into four distinct chronological stages: designing, coding, testing, and documenting. The paper explores an end-to-end software development framework driven by LLMs, encompassing:

  1. Requirements analysis
  2. Code development
  3. System testing
  4. Document generation

The aim is to provide a unified, efficient, and cost-effective paradigm for software development. This approach, which includes back-and-forth between agents in a debate process over multiple rounds, helps prevent code hallucinations. Code hallucinations arise primarily for two reasons:

  1. Lack of task specificity confuses LLMs when generating a software system in one step.
    • Granular tasks in software development, such as analyzing user/client requirements and selecting programming languages, provide guided thinking that is absent in the high-level nature of the task handled by LLMs.
  2. The absence of cross-examination in decision-making poses significant risks
    • Individual LLMs propose a diverse range of answers
    • Putting the requirements up for debate, or examining the responses from other LLMs, to converge on a single, more accurate common answer improves the consensus answer
    • Analogous to code peer review and suggestion feedback in real life

ChatDev utilizes a proposed chat chain that divides each phase into atomic subtasks. By guiding the software development process along the chat chain, ChatDev delivers the final software to the user, including source code, dependency environment specifications, and user manuals. Within the chat chain, each node represents a specific subtask, and each subtask requires collaborative interaction and cross-examination between two roles, where they engage in context-aware, multi-turn discussions to propose and validate solutions.
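A minimal sketch of such a chat chain, with illustrative subtasks and role pairings (not the paper's exact list); each node is a two-role chat whose agreed output feeds the next:

```python
# Sketch of a ChatDev-style chat chain: each phase is split into atomic
# subtasks, and each subtask is a chat between exactly two roles.

CHAT_CHAIN = [
    # (phase, subtask, instructor, assistant) -- illustrative entries
    ("designing",   "choose modality", "CEO",      "CPO"),
    ("designing",   "choose language", "CEO",      "CTO"),
    ("coding",      "write code",      "CTO",      "Programmer"),
    ("coding",      "code review",     "Reviewer", "Programmer"),
    ("testing",     "run and fix",     "Tester",   "Programmer"),
    ("documenting", "write manual",    "CEO",      "Programmer"),
]

def run_chat(subtask, instructor, assistant, context):
    """Stub for a context-aware, multi-turn chat between two roles;
    a real version would alternate LLM calls until the pair agrees."""
    # The agreed output of this subtask joins the shared context.
    return context + [f"{subtask}: agreed by {instructor} and {assistant}"]

def run_chain(chain):
    context = []  # outputs of earlier subtasks inform later ones
    for phase, subtask, instructor, assistant in chain:
        context = run_chat(subtask, instructor, assistant, context)
    return context

artifacts = run_chain(CHAT_CHAIN)
```

The final context accumulates everything delivered to the user: source code, dependency specifications, and manuals.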

In the paper, the authors describe that discussions between a reviewer and a programmer led to the identification and modification of nearly twenty types of code vulnerabilities, while discussions between a tester and a programmer resulted in the identification and resolution of more than ten types of potential bugs.

The process (waterfall) is as follows:

  1. Designing
    • Innovative ideas are generated through collaborative brainstorming
    • Technical design requirements are defined
  2. Coding
    • Development and review of source code
  3. Testing
    • Integrates all components into a system and utilizes feedback messages from the interpreter for debugging
  4. Documenting
    • Encompasses the generation of environment specifications and user manuals

Visibility

Each of these phases necessitates effective communication among multiple roles, posing challenges in determining the sequence of interactions and identifying the relevant individuals to engage with.

To get around this, the authors propose a generalized architecture by breaking down each phase into multiple atomic chats, each with a specific focus on task-oriented role-playing involving two distinct roles.

Through the exchange of instructions and collaboration between the participating agents, the desired output for each chat, which forms a vital component of the target software, is achieved.

The chat chain provides a transparent view of the software development process, shedding light on the decision-making path and offering opportunities for debugging when errors arise, which enables users to inspect intermediate outputs, diagnose errors, and intervene in the reasoning process if necessary.

Chat Chain

Chat Mechanisms


Three key mechanisms are utilized in each chat:

  1. Role specialization: ensures that each agent fulfills their designated functions and contributes effectively to the task-oriented dialogue.
  2. Memory stream: maintains a comprehensive record of previous dialogues within the chat, enabling agents to make informed decisions.
  3. Self-reflection: prompts the assistant to reflect on proposed decisions when both parties reach a consensus without triggering predefined termination conditions.
    • To implement this mechanism, the authors introduce a "pseudo self" as a new questioner and initiate a fresh chat. The pseudo questioner shares all the historical records from previous dialogues with the current assistant and requests a summary of the conclusive information from the dialogue
    • This mechanism effectively encourages the assistant to reflect upon the decisions proposed and discussed during the dialogue
    • This self-reflection mechanism was a key feature in streamlining the process, allowing the agents to summarize and conclude discussions efficiently and preventing them from getting sidetracked by irrelevant chatter.
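A sketch of that pseudo-questioner mechanic; the `chat` stub and the `"<DONE>"` termination token are assumptions, not the paper's exact implementation:

```python
# Sketch of self-reflection: if a chat reaches consensus without emitting a
# predefined termination token, a "pseudo self" questioner replays the full
# history to the assistant and asks for the conclusive information.
# `chat` is a stub for a black-box LLM call; "<DONE>" is an assumed token.

def chat(system: str, message: str) -> str:
    """Stub LLM call; swap in a real chat-completion API."""
    return f"Summary: {message[:60]}"

def needs_reflection(dialogue: list[str], terminator: str = "<DONE>") -> bool:
    # Consensus was reached, but no termination condition was triggered.
    return not any(terminator in turn for turn in dialogue)

def self_reflect(dialogue: list[str]) -> str:
    history = "\n".join(dialogue)
    # The pseudo questioner hands the assistant its own history back.
    return chat("You are a pseudo questioner, standing in for the user.",
                f"Here is the full dialogue so far:\n{history}\n"
                "Summarize only the conclusive decisions.")

dialogue = ["CTO: let's use Python", "Programmer: agreed, Python it is"]
conclusion = self_reflect(dialogue) if needs_reflection(dialogue) else dialogue[-1]
```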

Other notable mechanisms are thought instruction and role swapping, which make the coding instructions clearer and more specific.

Coding

When asking an LLM for help with coding, the answers can sometimes be wrong or not quite what's needed, which is particularly tricky with code because you might end up with extra or imaginary pieces that don't belong. Imagine telling a coder to finish all the parts of the code that aren't done yet: without being clear, they might add things that should actually be left out.

To fix this, the authors suggest a method called "thought instruction," which is similar to step-by-step problem-solving. It's like giving instructions in stages: you first find out which parts of the code are incomplete by temporarily taking on a different role, and then you switch back to give exact directions on what to do next.

This role-swapping approach makes the coding instructions clearer and more specific, which helps avoid confusion and ensures the final code really does what it's supposed to do. By using this method, the process of finishing the code is more focused, less likely to introduce errors, and ends up with better and more trustworthy results.
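A sketch of thought instruction with role swapping; the `ask_llm` stub and its canned responses are purely illustrative:

```python
# Sketch of thought instruction: the instructor first swaps into the
# programmer's role to enumerate unfinished parts of the code, then swaps
# back and issues one precise instruction per gap, instead of a single
# vague "finish everything" request.

def ask_llm(role: str, prompt: str) -> str:
    """Stub LLM call with canned answers for illustration."""
    if "unimplemented" in prompt:
        return "parse_args; save_results"  # pretend the LM found two gaps
    return f"[{role}] {prompt}"

def thought_instruction(code: str) -> list[str]:
    # Step 1 (role swap): act as the programmer to spot incomplete parts.
    gaps = ask_llm("programmer",
                   f"List the unimplemented methods in this code:\n{code}")
    # Step 2 (swap back): as instructor, give exact directions per gap.
    return [ask_llm("instructor",
                    f"Implement `{gap.strip()}` and change nothing else")
            for gap in gaps.split(";")]

instructions = thought_instruction(
    "def parse_args(): pass\ndef save_results(): pass")
```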

Testing

In ChatDev, during the testing stage, there are three jobs: the coder writes the code, the reviewer checks the code for possible problems (static debugging), and the tester runs the code to make sure it works properly (dynamic debugging).

However, just having two computer agents talk to each other based on what the code interpreter says doesn't guarantee the code will work without any bugs. The coder might not make changes exactly right, which can lead to more mistakes. To solve this, ChatDev uses the thought instruction method again. Here, the tester runs the software, figures out what's wrong, suggests how to fix it, and gives detailed directions to the coder. They keep doing this until the software works without any problems.

ChatDev also allows a human to give feedback just like a reviewer or tester would, using their own words and different testing methods.

Documenting

After the designing, coding, and testing phases, ChatDev employs four agents (CEO, CPO, CTO, and programmer) to generate software project documentation, leveraging LLMs with few-shot prompting and in-context examples.
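A sketch of few-shot document generation: the prompt stacks in-context (description, manual) example pairs before the new request. The examples and the stubbed `generate_manual` are illustrative:

```python
# Sketch of few-shot prompting for user-manual generation.

EXAMPLES = [  # illustrative (description, manual) pairs
    ("A command-line todo app",
     "# Todo CLI\n## Install\npip install todo\n## Usage\ntodo add 'task'"),
    ("A 2048 puzzle game",
     "# 2048\n## Install\npip install game2048\n## Usage\npython -m game2048"),
]

def build_prompt(description: str) -> str:
    # In-context examples teach the LM the expected manual format.
    shots = "\n\n".join(f"Software: {d}\nManual:\n{m}" for d, m in EXAMPLES)
    return f"{shots}\n\nSoftware: {description}\nManual:\n"

def generate_manual(description: str) -> str:
    prompt = build_prompt(description)
    # A real system would send `prompt` to an LLM; stubbed here.
    return f"# {description}\n## Install\n...\n## Usage\n..."

manual = generate_manual("A Gomoku game")
```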

Thoughts on Interface

OpenAI, for example, used the same underlying tech: GPT models had been around for a while, but OpenAI was the first to put them behind a user-intuitive UI, a chat interface. What would the UI for agents be? What is the UI for understanding all the AI employees, all the data, all the complexities, all the task flow? The answer people have settled on is a video game.

AgentCoder

AgentCoder 4 introduces the idea of agents generating instructions and planning: a coding agent creates code for the task, and a tester agent validates it. The output of the system is not just the results, but also the code to test them.


References

Footnotes

  1. Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch: “Improving Factuality and Reasoning in Language Models through Multiagent Debate”, 2023 ↩

  2. Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein: “Generative Agents: Interactive Simulacra of Human Behavior”, 2023 ↩

  3. Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, Maosong Sun: “ChatDev: Communicative Agents for Software Development”, 2023 ↩

  4. Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, Heming Cui: “AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation”, 2023 ↩