Alignment faking in large language models
Alignment and Safety
Connected researchers
Samuel Marks
Anthropic
Senior research engineer at Anthropic interested in agent foundations, model organisms of misalignment, and human-computer interaction.
Samuel R. Bowman
Anthropic
Member of technical staff at Anthropic and associate professor of computer science, data science, and linguistics at New York University (on leave). His public homepage focuses on natural language processing, machine learning, and AI alignment.
David Duvenaud
Anthropic
Associate Professor at the University of Toronto whose research spans deep learning, probabilistic modeling, and machine learning methods for science and AI safety.
Linda Petrini
Anthropic
Research scientist at Anthropic focused on safety and robustness for language models and reinforcement learning.
Jared D. Kaplan
Anthropic
Anthropic co-founder and Chief Science Officer. Formerly a physicist at Johns Hopkins, he helped develop scaling laws for neural language models and works on the science and safety of large AI systems.
Sören Mindermann
Anthropic
Research scientist at Anthropic working on machine learning and AI safety.
Jack Chen
Anthropic
Researcher at Anthropic with interests in machine learning, AI alignment, and economics.
Ethan Perez
Anthropic
Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.
Buck Shlegeris
Anthropic
Member of Technical Staff at Anthropic whose public homepage focuses on AI safety, model evaluations, and alignment.
Carson Denison
Anthropic
Member of Technical Staff at Anthropic and PhD student at Carnegie Mellon University focused on AI safety, evaluations, and oversight of large language models.
Monte MacDiarmid
Anthropic
Member of technical staff at Anthropic working on alignment science and the evaluation of hidden objectives in language models.
Johannes Treutlein
Anthropic
Member of Technical Staff at Anthropic and researcher in neural circuits and mechanistic interpretability, building tools for understanding AI systems.
Evan Hubinger
Anthropic
Ryan Greenblatt
Anthropic
Akbir Khan
Anthropic
Benjamin Wright
Anthropic
Fabien Roger
Anthropic
Jonathan Uesato
Anthropic
Julian Michael
Anthropic
Tim Belonax
Anthropic