Alignment faking in large language models

Senior research engineer at Anthropic interested in agent foundations, model organisms of misalignment, and human-computer interaction.

Member of technical staff at Anthropic and associate professor of computer science, data science, and linguistics at New York University, currently on leave. His public homepage highlights work on natural language processing, machine learning, and AI alignment.

Associate Professor at the University of Toronto whose research spans deep learning, probabilistic modeling, and machine learning methods for science and AI safety.

Research scientist at Anthropic focused on safety and robustness for language models and reinforcement learning.

Anthropic co-founder and Chief Science Officer. Formerly a physicist at Johns Hopkins, he helped develop scaling laws for neural language models and works on the science and safety of large AI systems.

Research scientist at Anthropic working on machine learning and AI safety.

Researcher at Anthropic with interests in machine learning, AI alignment, and economics.

Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.

Buck Shlegeris is a Member of Technical Staff at Anthropic whose public homepage focuses on AI safety, model evaluations, and alignment.

Member of Technical Staff at Anthropic and PhD student at Carnegie Mellon University focused on AI safety, evaluations, and oversight of large language models.

Member of technical staff at Anthropic working on alignment science and the evaluation of hidden objectives in language models.

Member of Technical Staff at Anthropic and researcher in neural circuits and mechanistic interpretability, building tools for understanding AI systems.

Samuel Marks

Samuel R. Bowman

David Duvenaud

Linda Petrini

Jared D. Kaplan

Sören Mindermann

Jack Chen

Ethan Perez

Buck Shlegeris

Carson Denison

Monte MacDiarmid

Johannes Treutlein