Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Research scientist at Anthropic whose public work includes AI alignment, reinforcement learning from human feedback, and model behavior.

Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.

Member of Technical Staff at Anthropic and cofounder of Oulipo Labs, working on language model safety, evaluations, and scientific forecasting.

Senior research engineer at Anthropic interested in agent foundations, model organisms of misalignment, and human-computer interaction.

Researcher at Anthropic and coauthor of the Constitutional Classifiers report.

Tomás Riofrío is listed as an author of the Anthropic technical report "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming."

William Saunders is a research scientist at Anthropic working on aligning and evaluating language models. His public homepage says he works at the intersection of game theory, optimization, and deep learning, previously interned at OpenAI, DeepMind, and Mila, studied mathematics at the University of Oxford, and is a PhD student in machine learning at Carnegie Mellon University.

Member of technical staff at Anthropic focused on safe and reliable AI.

Jordan Taylor is listed as an author of the Anthropic technical report "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming."

Member of technical staff at Anthropic whose work focuses on language models, model understanding, and alignment.

Yanda Chen is a member of technical staff at Anthropic and a PhD candidate in computer science at Georgetown University advised by Kevin Knight. His homepage says he previously worked at Allen Institute for AI and focuses on AI safety, natural language processing, and deep learning.

President of METR and former team member at Anthropic whose work focuses on evaluating and forecasting frontier AI capabilities.

Jacob Hilton is a researcher and executive director at the Alignment Research Center, where he works on mechanistic approaches to outperforming random sampling. He previously worked at OpenAI on truthfulness, reinforcement learning, and interpretability for language models, and before that at Jane Street. He completed a PhD in mathematics at the University of Leeds and later coauthored Anthropic work on constitutional classifiers.

Liane Lovitt

Ethan Perez

Nicholas Schiefer

Samuel Marks

Maxwell Tegmark

Tomás Riofrío

William Saunders

Alexey Nazarov

Jordan Taylor

Alex Tamkin

Yanda Chen

Beth Barnes

Jacob Hilton