Understanding LLM behaviour

LLMs are radically different from traditional machine learning models: their outputs are open-ended, their behaviour is highly sensitive to prompting, and their failure modes are difficult to enumerate in advance.

However, these differences are not well understood, and many of the traditional processes we use to manage machine learning and AI projects do not account for them. Software engineers tend to treat an LLM as just another API; ML practitioners tend to apply the traditional evaluation processes they were trained on. This project is an attempt to develop a better understanding of LLM behaviour, and to apply those insights to processes such as measuring performance and quantifying and representing Model Risk.

Trust in Agent Systems

People wrap LLMs in a loop and call them agents! That description creates the expectation that an agent can go off and undertake a complex sequence of tasks on its own, and that it will collaborate with other agents to achieve a goal.
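The "LLM in a loop" pattern can be sketched in a few lines. In this sketch, `call_llm`, the `TOOL:`/`DONE:` reply convention, and the `TOOLS` registry are all hypothetical stand-ins for a real model API and tool-calling protocol, not any particular framework's interface:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub standing in for a real model API call.
    # A real agent would send the prompt to an LLM and parse its reply.
    if "Observation:" in prompt:
        return "DONE: task complete"
    if "weather" in prompt:
        return "TOOL: get_weather"
    return "DONE: nothing to do"

# Toy tool registry; real agents expose many tools with typed arguments.
TOOLS = {"get_weather": lambda: "sunny, 22C"}

def run_agent(task: str, max_steps: int = 5) -> str:
    """Loop: ask the LLM what to do next, run tools, feed results back."""
    history = [task]
    for _ in range(max_steps):
        reply = call_llm("\n".join(history))
        if reply.startswith("DONE:"):
            return reply.removeprefix("DONE:").strip()
        if reply.startswith("TOOL:"):
            tool = reply.removeprefix("TOOL:").strip()
            history.append(f"Observation: {TOOLS[tool]()}")
    return "gave up"
```

The loop itself is trivial; everything interesting (and everything this project worries about) lives in how much we trust the model's replies at each step.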

Implicit in both of these expectations is that we can trust the agent, and that the agent can trust the other agents it works with. This project explores how well LLMs cope with concepts of trust and with scenarios where trust is required for successful collaboration, and, more importantly, what models of trust and processes of trust propagation we can expect from the users of LLM-powered agent systems.


LLM as a judge

LLM as a judge is a common pattern in which human judgement about model performance and quality is replaced by the output of another LLM.
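The pattern can be sketched as follows. Here `call_judge`, the rubric wording, and the 1-to-5 scale are illustrative assumptions; a real implementation would call an actual model and would need far more careful prompt design and output parsing:

```python
# Hypothetical rubric prompt; real rubrics are task-specific.
JUDGE_PROMPT = (
    "Rate the following answer from 1 to 5 for accuracy and clarity.\n"
    "Question: {question}\nAnswer: {answer}\nScore:"
)

def call_judge(prompt: str) -> str:
    # Stub standing in for a real judge-model API call.
    return "4"

def judge(question: str, answer: str) -> int:
    """Ask the judge LLM for a score and validate it against the scale."""
    reply = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(reply.strip().split()[0])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score
```

Even in this toy form, the fragile parts are visible: the score depends entirely on the rubric wording and on parsing free-form model output into a number.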

LLMs are used for a wide range of tasks, and it is unclear how the performance of LLM as a judge varies across tasks as diverse as summarisation, code generation, and classification.

Additionally, it is clear that different LLMs behave differently as judges, and this too has not been well characterised. In this project we will explore these questions and attempt to develop metrics that demonstrate these differences.
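One simple starting point for such metrics, sketched below, is the pairwise agreement rate between two judges' scores on the same items. This is an illustrative baseline of my choosing, not a metric proposed in the project description; chance-corrected measures such as Cohen's kappa would be a natural next step:

```python
def agreement_rate(scores_a: list[int], scores_b: list[int]) -> float:
    """Fraction of items on which two judges gave the same score."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be the same length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)
```

Running two judge models over the same benchmark and comparing their scores this way gives a first, crude picture of how differently they behave.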