Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Alignment and Safety report from Anthropic with 13 connected researchers in the LLMpeople atlas.
Connected researchers
Liane Lovitt
Anthropic
Research scientist at Anthropic whose public work includes AI alignment, reinforcement learning from human feedback, and model behavior.
Ethan Perez
Anthropic
Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.
Nicholas Schiefer
Anthropic
Member of Technical Staff at Anthropic and cofounder of Oulipo Labs, working on language model safety, evaluations, and scientific forecasting.
Samuel Marks
Anthropic
Senior research engineer at Anthropic interested in agent foundations, model organisms of misalignment, and human-computer interaction.
Maxwell Tegmark
Anthropic
Researcher at Anthropic and coauthor of the Constitutional Classifiers report.
Tomás Riofrío
Anthropic
Tomás Riofrío is listed as an author of the Anthropic technical report Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.
William Saunders
Anthropic
William Saunders is a research scientist at Anthropic working on aligning and evaluating language models. His public homepage says he works at the intersection of game theory, optimization, and deep learning. It also says he previously interned at OpenAI, DeepMind, and Mila, studied mathematics at the University of Oxford, and is a PhD student in machine learning at Carnegie Mellon University.
Alexey Nazarov
Anthropic
Member of technical staff at Anthropic focused on safe and reliable AI.
Jordan Taylor
Anthropic
Jordan Taylor is listed as an author of the Anthropic technical report Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.
Alex Tamkin
Anthropic
Member of technical staff at Anthropic whose work focuses on language models, model understanding, and alignment.
Yanda Chen
Anthropic
Yanda Chen is a member of technical staff at Anthropic and a PhD candidate in computer science at Georgetown University advised by Kevin Knight. His homepage says he previously worked at Allen Institute for AI and focuses on AI safety, natural language processing, and deep learning.
Beth Barnes
Anthropic
President of METR and former team member at Anthropic whose work focuses on evaluating and forecasting frontier AI capabilities.
Jacob Hilton
Anthropic
Jacob Hilton is a researcher and executive director at the Alignment Research Center, where he works on mechanistic approaches to outperforming random sampling. He previously worked at OpenAI on truthfulness, reinforcement learning, and interpretability for language models, and before that at Jane Street. He completed a PhD in mathematics at the University of Leeds, and he later coauthored Anthropic work on constitutional classifiers.