LLMpeople


Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Alignment and Safety report from Anthropic with 39 connected researchers in the LLMpeople atlas.

Anthropic · 2024-01-10 · 39 researchers
Field: Alignment and Safety
Organization: Anthropic
arXiv: 2401.05566

Canonical link

https://arxiv.org/abs/2401.05566

Connected researchers

Researcher · 7 reports

Amanda Askell

Anthropic / OpenAI

Amanda Askell is a philosopher and AI alignment researcher at Anthropic. Her personal site says she previously worked as a research scientist on the policy team at OpenAI.

United States

Researcher · 7 reports

Jack Clark

Anthropic / OpenAI

Co-founder and Head of Policy at Anthropic. His public biography also notes earlier work as Policy Director at OpenAI and as a technical journalist, and that he writes the Import AI newsletter.

Researcher · 4 reports

Yuntao Bai

Anthropic

Anthropic researcher whose work includes reinforcement learning from human feedback and Constitutional AI; previously a Sherman Fairchild Postdoctoral Scholar in theoretical high-energy physics at Caltech.

Researcher · 5 reports

Kamal Ndousse

Anthropic

Researcher at Anthropic working on alignment, reasoning, and evaluation for large language models.

Researcher · 5 reports

Nova DasSarma

Anthropic

Anthropic report author whose public publication record includes work on language model evaluations, AI safety, and model behavior.

Researcher · 6 reports

Deep Ganguli

Anthropic

Research scientist at Anthropic who leads the Societal Impacts team and works on AI evaluation, alignment, and societal impacts.

United States

Researcher · 3 reports

Shauna Kravec

Anthropic

Researcher focused on AI safety, reinforcement learning, and language models, with public work spanning red teaming, adversarial robustness, and model behavior.

United States

Researcher · 6 reports

Jared D. Kaplan

Anthropic

Jared D. Kaplan is a co-founder and Chief Science Officer at Anthropic. Anthropic's public materials also identify him as the company's Responsible Scaling Officer.

Researcher · 8 reports

Ethan Perez

Anthropic

Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.

Researcher · 5 reports

Samuel R. Bowman

Anthropic

Member of technical staff at Anthropic and associate professor of computer science, data science, and linguistics at New York University (on leave). His public homepage focuses on natural language processing, machine learning, and AI alignment.

United States

Researcher · 8 reports

Nicholas Schiefer

Anthropic

Member of Technical Staff at Anthropic and cofounder of Oulipo Labs, working on language model safety, evaluations, and scientific forecasting.

Researcher · 2 reports

Evan Hubinger

Anthropic

Evan Hubinger is Head of Alignment Stress-Testing at Anthropic, where he works on AI safety and alignment. He previously worked at MIRI and OpenAI, studied mathematics and computer science at Harvey Mudd College, and is known for work on inner alignment, deceptive alignment, and alignment stress-testing.

Researcher · 2 reports

Carson Denison

Anthropic

Member of Technical Staff at Anthropic and PhD student at Carnegie Mellon University focused on AI safety, evaluations, and oversight of large language models.

Researcher · 1 report

Jesse Mu

Anthropic

Jesse Mu is a Research Scientist at Anthropic and a visiting researcher at Stanford University. His work spans machine learning, AI safety, reinforcement learning, and deep learning theory.

Researcher · 2 reports

Monte MacDiarmid

Anthropic

Member of technical staff at Anthropic working on alignment science and the evaluation of hidden objectives in language models.

Researcher · 1 report

Daniel M. Ziegler

Anthropic

Daniel M. Ziegler is listed as an author of the Anthropic technical report Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.

Researcher · 1 report

Newton Cheng

Anthropic

Anthropic researcher on the Frontier Red Team focused on cyber misuse evaluation and threat modeling; previously a physics PhD student at UC Berkeley and now also mentors in the MATS program.

Researcher · 1 report

Adam Jermyn

Anthropic

Research scientist at Anthropic and former professor of theoretical astrophysics at Stony Brook University.

Researcher · 1 report

Cem Anil

Anthropic

Cem Anil is a research scientist at Anthropic and part of the company's Alignment Science team. His homepage says he recently completed a PhD at the University of Toronto and Vector Institute supervised by Roger Grosse and Geoffrey Hinton. He studies the intersection of deep learning and AI safety, especially robustness and generalization in large language models and scaling laws for dangerous capabilities.

Researcher · 4 reports

David Duvenaud

Anthropic

Associate Professor at the University of Toronto whose research spans deep learning, probabilistic modeling, and machine learning methods for science and AI safety.

Canada

Researcher · 1 report

Kshitij Sachan

Anthropic

Kshitij Sachan is a research scientist at Anthropic whose public homepage and Google Scholar profile highlight work on language models, reasoning, code generation, and machine learning systems.

Researcher · 1 report

Michael Sellitto

Anthropic

Research scientist at Anthropic working on trustworthy AI and deceptive alignment.

Researcher · 1 report

Mrinank Sharma

Anthropic

AI safety researcher who led Anthropic's Safeguards Research Team and worked on jailbreak robustness, automated red teaming, and monitoring for misuse and misalignment.

Researcher · 1 report

Roger Grosse

Anthropic

Associate Professor of Computer Science at the University of Toronto and director of the machine learning group, with research spanning probabilistic models and optimization algorithms.


