The value alignment problem is a key barrier to the deployment of artificial intelligence (AI) in society. While AI has demonstrated powerful capabilities in autonomously developing solutions to a wide range of tasks, we lack a way of determining whether these solutions will respect the values of the society in which they are deployed.
In this project, we treat values as latent variables that give rise to preferences in given contexts, and examine how this view can be used to infer the values embodied by an agent. First, we ground our definition of value alignment in the literature through a systematic literature review, producing a computationally tractable objective. We then use logic programming to assess the alignment of different value sets under this definition, allowing us to inspect multiple agents and systems in a robust manner. Finally, we build a preference estimation system that observes an agent's behaviour within a system and extracts the required value sets without access to the agent's inner workings.
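As a minimal sketch of this latent-variable view of values, the Python snippet below (every action name, value dimension, and score is hypothetical, and the linear choice model is an assumption made purely for illustration) recovers the set of value weightings consistent with an agent's observed choices, using only its behaviour and no access to its internals.

    # Toy illustration: values as latent weights over two hypothetical
    # dimensions (here "fairness" and "efficiency"); the agent is assumed to
    # choose the action with the highest weighted score in each context.

    # Observed behaviour: (context, chosen_action) pairs, where a context maps
    # each available action to its scores on the two value dimensions.
    # All names and numbers are invented for illustration.
    observations = [
        ({"share": (0.9, 0.2), "hoard": (0.1, 0.8)}, "share"),
        ({"wait": (0.7, 0.3), "rush": (0.2, 0.9)}, "wait"),
        ({"audit": (0.8, 0.4), "skip": (0.3, 0.7)}, "audit"),
    ]

    def explains(weights, context, chosen):
        """Return True if these value weights make `chosen` a best action."""
        def score(action):
            return sum(w * v for w, v in zip(weights, context[action]))
        return all(score(chosen) >= score(other) for other in context)

    # Candidate latent value profiles: weight vectors on a coarse grid.
    candidates = [(w / 10, 1 - w / 10) for w in range(11)]

    # Keep every profile consistent with all observed choices; this set is the
    # (partially identified) value profile inferred from behaviour alone.
    plausible = [w for w in candidates
                 if all(explains(w, ctx, act) for ctx, act in observations)]

    print("Value weightings consistent with the observed behaviour:", plausible)

Value sets extracted in this manner could then be passed to the logic-programming component described above to be assessed for alignment.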
This research will produce a definition of value alignment that is applicable to AI agents and substantiated by previous literature, together with software for assessing value alignment between multiple agents and systems in an interpretable manner. In doing so, we lay the foundation for future research on integrating ethical values into AI systems, contributing towards the resolution of the value alignment problem in AI ethics.
AI Safety & Alignment
AI Ethics
Explainable AI
BSc in Applied Mathematics from Cardiff University.
MSc in Applied Mathematics from the University of Bath.
Four years working in actuarial science between the bachelor's and master's degrees.
Eight months working as an AI researcher after the master's degree, focusing on Bayesian approaches to game playing and on methods for training reinforcement learning agents.
Dr Marina De Vos
Dr Janina Hoffmann
Dr Andreas Theodorou (Universitat Politècnica de Catalunya)