Research
Hi! I’m Grant, a 3rd-year computer science student at the University of California, San Diego with interests in natural language processing and reinforcement learning. This blog discusses using multiple large language models (LLMs) together to solve problems that a single LLM cannot solve on its own.
While we’ve marveled at ChatGPT’s capabilities, LLMs run into several issues when responding individually. Hallucination, for instance, is when an LLM generates a response containing inaccurate information. Additionally, LLMs have a limited context length (their working memory), which leads them to misremember facts from earlier in the conversation. Moreover, the more powerful an LLM is, the more computational power and time it needs to respond. These issues motivate exploring how multiple interacting LLMs can work together.
Naturally, the first thing I did to tackle the problem was to get a better idea of the current research on the topic. A few important papers have explored concepts similar to interacting LLMs, as well as the issues that LLMs can have. For example, Stanford’s Generative Agents paper explored a different approach to memory, in which it assigns importance to different pieces of memory and has the LLMs ask themselves questions to generate more observations. Large Language Models as Toolmakers explores using GPT-4’s advanced capabilities to generate tools that the faster, more lightweight GPT-3.5 can then use and reuse. Princeton’s SocraticAI article discusses having LLMs talk problems out among themselves in order to come up with ways to solve them.
From there, I explored different ways that LLMs could interact with each other. My first course of action was to replicate common group dynamics that people tend towards. These included interactions such as a doer and a thinker dynamic, where one LLM would do most of the analysis and summary while another would handle the chatting and actions. Another idea was some form of hierarchy where one LLM would distribute tasks to other LLMs to break down the task accurately and efficiently.
To evaluate how well the interacting LLMs do compared to a single LLM, I settled on Wikirace, a game where the player (here, the LLM) navigates from one seemingly random Wikipedia page to another using only the links on each page. Wikipedia itself is too large, so I am currently testing with a small subset of around 600 pages. The LLM's performance is then compared to a traditional graph search algorithm, using two metrics: the number of links clicked and the time it takes to reach the target page. Since this task is relatively simple, it makes sense to take a smaller LLM like Flan-T5 and see whether its results improve when several copies of it interact with each other.
For the graph search algorithm, I am using breadth-first search to find the shortest path. For the LLMs, I first test a single LLM as a baseline. This baseline runs into three main problems: the model 1) loops between pages, 2) responds with a link that doesn’t exist, and 3) breaks down when given too many links because of its limited context length. Luckily, each problem can be addressed by using multiple LLMs.
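As a rough illustration of the breadth-first search baseline, here is a minimal sketch. It assumes the Wikipedia subset is stored as a dictionary mapping each page title to the list of pages it links to; the names `bfs_shortest_path` and `toy_graph` and the tiny example graph are made up for illustration and are not taken from the project's actual code.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return the shortest list of pages from start to goal, or None if unreachable.

    `graph` is assumed to map each page title to the titles it links to.
    """
    if start == goal:
        return [start]
    parents = {start: None}  # remembers how each page was first reached
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in graph.get(page, []):
            if link in parents:
                continue  # already reached by a path that is at least as short
            parents[link] = page
            if link == goal:
                # Walk back through the parents to rebuild the path.
                path = [link]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(link)
    return None


# Hypothetical toy graph standing in for the ~600-page subset.
toy_graph = {
    "Cat": ["Mammal", "Pet"],
    "Mammal": ["Animal", "Cat"],
    "Pet": ["Dog", "Cat"],
    "Dog": ["Mammal", "Wolf"],
    "Animal": ["Biology"],
    "Wolf": [],
    "Biology": [],
}

path = bfs_shortest_path(toy_graph, "Cat", "Wolf")
print(path, "links clicked:", len(path) - 1)
```

The number of links clicked is the path length minus one, which gives a lower bound to compare the LLM's click count against.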
While not completely solved, the first and second problems are mitigated by having the LLM consult other LLMs about its choice. In this method, the LLM first chooses a link. The chosen link then goes through three consultations with LLMs, which check whether the link is valid, whether it would create a loop (that is, whether the page has already been visited), and whether it is related to the target page. The LLM then makes its final decision based on the results of the consultations. The third problem was solved with a divide-and-conquer method. A list of around 200 links is divided into chunks of 10 links or fewer, and each query to an LLM analyzes one chunk, picks the best link, and sends it up to the next round. This process is repeated recursively until only one link remains. Both methods already produce far better results than a single LLM.
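The two methods above can be sketched roughly as follows. This is only an illustration under several assumptions: the LLM is replaced by plain Python callables (`toy_choose` and `toy_related`, which simply count shared words), the validity and loop checks are done with ordinary membership tests rather than LLM queries, and every function name here is hypothetical rather than the project's real code.

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def pick_best_link(links, goal, choose, chunk_size=10):
    """Divide and conquer: keep one winner per chunk until one link remains.

    `choose(chunk, goal)` stands in for a single LLM query over a short list.
    """
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2")
    if not links:
        raise ValueError("links must not be empty")
    candidates = list(links)
    while len(candidates) > 1:
        winners = []
        for chunk in chunked(candidates, chunk_size):
            choice = choose(chunk, goal)
            # If the chooser returns a link that was not offered,
            # fall back to the first option in the chunk.
            winners.append(choice if choice in chunk else chunk[0])
        candidates = winners
    return candidates[0]


def consult(link, page_links, visited, goal, is_related):
    """Run the three checks on a chosen link and report each result."""
    return {
        "valid": link in page_links,        # does the link exist on the page?
        "no_loop": link not in visited,     # has the page not been visited yet?
        "related": is_related(link, goal),  # stand-in for an LLM relevance check
    }


# Toy stand-ins for the LLM: score a link by the words it shares with the goal.
def shared_words(link, goal):
    return len(set(link.lower().split()) & set(goal.lower().split()))


def toy_choose(chunk, goal):
    return max(chunk, key=lambda link: shared_words(link, goal))


def toy_related(link, goal):
    return shared_words(link, goal) > 0


links = [f"Page {i}" for i in range(200)] + ["History of Rome"]
best = pick_best_link(links, "Rome", toy_choose)
report = consult(best, links, visited={"Page 3"}, goal="Rome", is_related=toy_related)
print(best, report)
```

With 201 links and chunks of 10, the reduction takes three rounds (21 chunks, then 3, then 1), so no single query ever sees more than 10 links.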
Aside from that, I am working on an ensemble method in which multiple LLMs vote on a single choice, each weighting its vote with a self-assigned confidence score that indicates how confident it is in its decision. I’m also thinking of stacking the different methods to address all three problems at the same time and see how the combined system performs.
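One simple way the confidence-weighted vote could work is to sum each link's confidence scores and pick the link with the highest total. The sketch below assumes each LLM returns a `(link, confidence)` pair with confidence between 0 and 1; that format, the function name `weighted_vote`, and the example votes are all assumptions for illustration, since the ensemble method is still in progress.

```python
from collections import defaultdict


def weighted_vote(votes):
    """Return the link with the highest total confidence.

    `votes` is an iterable of (link, confidence) pairs, one per LLM.
    Ties go to the link that received a vote first.
    """
    totals = defaultdict(float)
    for link, confidence in votes:
        totals[link] += confidence
    if not totals:
        raise ValueError("at least one vote is required")
    return max(totals, key=totals.get)


# Hypothetical votes from three models.
votes = [("Pet", 0.9), ("Mammal", 0.5), ("Mammal", 0.6)]
print(weighted_vote(votes))
```

Here two moderately confident votes for "Mammal" outweigh one highly confident vote for "Pet", which is the behavior that distinguishes weighted voting from simply trusting the most confident model.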
If I had a whole year to work on this, I would come up with a larger and more varied set of tasks for multiple LLMs to tackle, to see which tasks benefit from having more LLMs assigned to them. I would also try to expand my research question to see whether the results of these LLMs working together imply something about how human teams can work together.
Overall, I think that my project is heading in an exciting direction, and I am looking forward to seeing, interpreting, and reporting my results!
Large Language Models as Toolmakers: https://arxiv.org/abs/2305.17126
SocraticAI: https://princeton-nlp.github.io/SocraticAI/
Stanford Generative Agents: https://arxiv.org/abs/2304.03442