The current technique for aligning large language models with our values is reinforcement learning from human feedback (RLHF), which was used to fine-tune ChatGPT to make it more helpful and politically correct. My research will focus on the failure modes of RLHF as AI systems become increasingly capable.
AI alignment and understanding LLMs
BSc in Mathematics at the University of Bristol
Prof Özgür Şimşek
Prof Maria Battarra
