LLMpeople

Ethan Perez

Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.

Researcher · 1 organization · 8 reports

Profile status: updated


Trust signals

Profile completeness: 41%
Public sources: 1
Official sources: 1
Last reviewed: Mar 13, 2026
Official homepage: updated · 1 public source

Public links

Website: Personal homepage

Organizations

Core: Anthropic

Reports

Alignment and RLHF · Constitutional AI: Harmlessness from AI Feedback
Alignment and Safety · Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Alignment and Safety · Alignment faking in large language models
Alignment and Safety · Auditing language models for hidden objectives
Alignment and Safety · Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Alignment and Safety · Constitutional Classifiers++: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Interpretability · On the Biology of a Large Language Model
Interpretability · Tracing the thoughts of a large language model

Official and primary sources

https://ethanperez.net/ · Official source · homepage

LLMpeople is a public atlas for discovering frontier AI researchers with context, provenance, and respect.
