The current technique for aligning large language models with our values is reinforcement learning from human feedback (RLHF), which was used to fine-tune ChatGPT to make it more helpful and politically correct. My research will focus on the failure modes of RLHF as AI systems become increasingly capable.
AI alignment and understanding LLMs
BSc in Mathematics at the University of Bristol
Prof Özgür Şimşek
Prof Maria Battarra
