Understanding LLM behaviour
LLMs are radically different from traditional machine learning models.
- They accept unstructured inputs of massive size, up to millions of tokens.
- They generate unstructured outputs that are far larger and more complex than those produced by a statistical model.
- Many LLM outputs are semantically identical (they mean the same thing) but syntactically different (their tokens differ).
- LLMs can generate different outputs for the same input unless they are carefully managed. For example, varying the model's temperature parameter, GPU-level optimisations, or running on a different server can all change the output.
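One of these sources of variation, temperature, is easy to illustrate with a toy softmax over next-token logits (the logit values below are invented for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before applying softmax.
    Lower temperature sharpens the distribution towards the top token;
    higher temperature flattens it towards uniform."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens
logits = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.2)  # near-greedy sampling
hot = softmax_with_temperature(logits, 2.0)   # much flatter distribution
```

Real decoding then draws a token from this distribution, so any temperature above zero means identical inputs can yield different outputs.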
Trust in Agent Systems
People wrap LLMs in a loop and call them agents! The expectation this description creates is that an agent will be able to go off and undertake a complex sequence of tasks, and that it will collaborate with other agents to achieve a goal.
Implicit in both of these expectations is that we can trust the agent, and that the agent can trust the other agents it works with. This project explores how well LLMs cope with concepts of trust and with scenarios where trust is required for successful collaboration, and, more importantly, what models of trust and processes of trust propagation we can expect in the users of LLM-powered agent systems.
LLM as a judge
LLM as a judge is a common pattern in which the output of an LLM stands in for human judgement about model performance and quality.
LLMs are used as judges across a wide range of tasks, and it is unclear how their judging performance varies across tasks as diverse as assessing summarisation quality, code production, or classification.
Additionally, it is clear that different LLMs behave differently as judges, and this too has not been well characterised. In this project we will explore these questions and attempt to develop metrics that demonstrate these differences.