Liane Lovitt portrait
Researcher 2 reports

Liane Lovitt

Anthropic

Research scientist at Anthropic whose public work includes AI alignment, reinforcement learning from human feedback, and model behavior.

Samuel Marks portrait
Researcher 6 reports

Samuel Marks

Anthropic

Senior research engineer at Anthropic interested in agent foundations, model organisms of misalignment, and human-computer interaction.

Samuel R. Bowman portrait
Researcher 5 reports

Samuel R. Bowman

Anthropic

Member of technical staff at Anthropic and associate professor of computer science, data science, and linguistics at New York University, currently on leave. His public homepage focuses on natural language processing, machine learning, and AI alignment.

Newton Cheng portrait
Researcher 1 reports

Newton Cheng

Anthropic

Anthropic researcher on the Frontier Red Team focused on cyber misuse evaluation and threat modeling; previously a physics PhD student at UC Berkeley and now also mentors in the MATS program.

Jack Clark portrait
Researcher 7 reports

Jack Clark

Anthropic / OpenAI

Co-founder and head of policy at Anthropic. He previously served as policy director at OpenAI, worked as a technology journalist, and writes the Import AI newsletter.

David Duvenaud portrait
Researcher 4 reports

David Duvenaud

Anthropic

Associate Professor at the University of Toronto whose research spans deep learning, probabilistic modeling, and machine learning methods for science and AI safety.

Shauna Kravec portrait
Researcher 3 reports

Shauna Kravec

Anthropic

Researcher focused on AI safety, reinforcement learning, and language models, with public work spanning red teaming, adversarial robustness, and model behavior.

Simon Goldstein portrait
Researcher 1 reports

Simon Goldstein

Anthropic

Assistant Professor of Philosophy at The University of Hong Kong and Research Fellow at Anthropic, working in ethics, epistemology, and social and political philosophy.

Jesse Mu portrait
Researcher 1 reports

Jesse Mu

Anthropic

Research scientist at Anthropic and visiting researcher at Stanford University whose work spans machine learning, AI safety, reinforcement learning, and deep learning theory.

Linda Petrini portrait
Researcher 1 reports

Linda Petrini

Anthropic

Research scientist at Anthropic focused on safety and robustness for language models and reinforcement learning.

Roger Grosse portrait
Researcher 1 reports

Roger Grosse

Anthropic

Associate Professor of Computer Science at the University of Toronto and director of the machine learning group, with research spanning probabilistic models and optimization algorithms.

Amanda Askell portrait
Researcher 7 reports

Amanda Askell

Anthropic / OpenAI

Alignment researcher at Anthropic, previously at OpenAI, working on making AI systems understandable to humans and aligned with human values.

Jared D. Kaplan portrait
Researcher 6 reports

Jared D. Kaplan

Anthropic

Anthropic co-founder and Chief Science Officer. Formerly a physicist at Johns Hopkins, he helped develop scaling laws for neural language models and works on the science and safety of large AI systems.

Yuntao Bai portrait
Researcher 4 reports

Yuntao Bai

Anthropic

Anthropic researcher whose work includes reinforcement learning from human feedback and Constitutional AI; previously a Sherman Fairchild Postdoctoral Scholar in theoretical high-energy physics at Caltech.

David Bau portrait
Researcher 3 reports

David Bau

Anthropic

Assistant professor of computer science at Northeastern University working on interpretability and model understanding.

Dan Hendrycks portrait
Researcher 1 reports

Dan Hendrycks

Anthropic

AI safety researcher and director of the Center for AI Safety; advisor to xAI and Scale AI, previously an advisor to OpenAI and Anthropic.

Carina Kauf portrait
Researcher 1 reports

Carina Kauf

Anthropic

Member of Anthropic's Societal Impacts team, where she studies the real-world impacts of AI systems.

Kamal Ndousse portrait
Researcher 5 reports

Kamal Ndousse

Anthropic

Researcher at Anthropic working on alignment, reasoning, and evaluation for large language models.

Sören Mindermann portrait
Researcher 3 reports

Sören Mindermann

Anthropic

Research scientist at Anthropic working on machine learning and AI safety.

Jan Leike portrait
Researcher 2 reports

Jan Leike

Anthropic

Anthropic researcher focused on AI safety, alignment, and auditing hidden objectives in language models; previously co-led the Superalignment team at OpenAI.

Josh Batson portrait
Researcher 2 reports

Josh Batson

Anthropic

Member of technical staff at Anthropic interested in understanding deep learning and AI safety; previously a research scientist at the Chan Zuckerberg Biohub.

Henry Sleight portrait
Researcher 1 reports

Henry Sleight

Anthropic

PhD student at the University of Oxford working on AI safety, including scalable oversight and interpretability.

Jack Chen portrait
Researcher 1 reports

Jack Chen

Anthropic

Researcher at Anthropic with interests in machine learning, AI alignment, and economics.

Kshitij Sachan portrait
Researcher 1 reports

Kshitij Sachan

Anthropic

Research scientist at Anthropic whose public homepage and Google Scholar profile highlight work on language models, reasoning, code generation, and machine learning systems.

Michael Sellitto portrait
Researcher 1 reports

Michael Sellitto

Anthropic

Head of global affairs at Anthropic working on AI policy; previously deputy director of the Stanford Institute for Human-Centered AI.

Mrinank Sharma portrait
Researcher 1 reports

Mrinank Sharma

Anthropic

AI safety researcher who led Anthropic's Safeguards Research Team and worked on jailbreak robustness, automated red teaming, and monitoring for misuse and misalignment.

Zachary Witten portrait
Researcher 1 reports

Zachary Witten

Anthropic

Member of technical staff at Anthropic.

Ethan Perez portrait
Researcher 8 reports

Ethan Perez

Anthropic

Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; he completed his PhD at New York University and previously worked at Google.

Nicholas Schiefer portrait
Researcher 8 reports

Nicholas Schiefer

Anthropic

Member of Technical Staff at Anthropic and cofounder of Oulipo Labs, working on language model safety, evaluations, and scientific forecasting.

Deep Ganguli portrait
Researcher 6 reports

Deep Ganguli

Anthropic

Research scientist at Anthropic who leads the Societal Impacts team, studying the societal effects of AI systems.

Dario Amodei portrait
Researcher 5 reports

Dario Amodei

Anthropic / OpenAI

CEO and co-founder of Anthropic. Before Anthropic, he served as vice president of research at OpenAI.

Nova DasSarma portrait
Researcher 5 reports

Nova DasSarma

Anthropic

Research scientist at Anthropic interested in understanding neural networks and applying that understanding to alignment.

Anna Chen portrait
Researcher 4 reports

Anna Chen

Anthropic

Researcher working on AI safety and adversarial evaluation, including Anthropic's many-shot jailbreaking research.

Saurav Kadavath portrait
Researcher 4 reports

Saurav Kadavath

Anthropic

Research scientist at Anthropic interested in understanding and steering AI systems.

Tom Conerly portrait
Researcher 4 reports

Tom Conerly

Anthropic

Software engineer at Anthropic, previously at Google, with public writing on language models, agents, and reinforcement learning.

Alex Tamkin portrait
Researcher 3 reports

Alex Tamkin

Anthropic

Member of technical staff at Anthropic whose work focuses on language models, model understanding, and alignment.

Buck Shlegeris portrait
Researcher 3 reports

Buck Shlegeris

Anthropic

CEO of Redwood Research whose public work focuses on AI control, model evaluations, and alignment.

Beth Barnes portrait
Researcher 2 reports

Beth Barnes

Anthropic

Founder and head of METR, previously an alignment researcher at OpenAI, whose work focuses on evaluating and forecasting frontier AI capabilities.

Carson Denison portrait
Researcher 2 reports

Carson Denison

Anthropic

Member of Technical Staff at Anthropic and PhD student at Carnegie Mellon University focused on AI safety, evaluations, and oversight of large language models.

Jared Kaplan portrait
Researcher 2 reports

Jared Kaplan

Anthropic

Researcher at Anthropic known for work on scaling laws and large language models.

Monte MacDiarmid portrait
Researcher 2 reports

Monte MacDiarmid

Anthropic

Member of technical staff at Anthropic working on alignment science and the evaluation of hidden objectives in language models.

Adam Jermyn portrait
Researcher 1 reports

Adam Jermyn

Anthropic

Research scientist at Anthropic, previously a theoretical astrophysicist at the Flatiron Institute.

Alexey Nazarov portrait
Researcher 1 reports

Alexey Nazarov

Anthropic

Member of technical staff at Anthropic focused on safe and reliable AI.

Daniel M. Ziegler portrait
Researcher 1 reports

Daniel M. Ziegler

Anthropic

Research scientist at Anthropic whose public work spans reinforcement learning from human feedback, AI alignment, and scalable language model training.

Esin Durmus portrait
Researcher 1 reports

Esin Durmus

Anthropic

Research scientist at Anthropic studying the societal impacts of language models; previously a postdoctoral scholar at Stanford University.

Holden Karnofsky portrait
Researcher 1 reports

Holden Karnofsky

Anthropic

Co-founder of GiveWell and Open Philanthropy, now a member of technical staff at Anthropic, and writer of the Cold Takes blog.

Jan Brauner portrait
Researcher 1 reports

Jan Brauner

Anthropic

Computer scientist at Anthropic focused on making advanced AI systems safe and beneficial.

Johannes Treutlein portrait
Researcher 1 reports

Johannes Treutlein

Anthropic

Member of Technical Staff at Anthropic and researcher in neural circuits and mechanistic interpretability, building tools for understanding AI systems.

Owain Evans portrait
Researcher 1 reports

Owain Evans

Anthropic

AI safety researcher who leads Truthful AI and was previously a research scientist at the University of Oxford; his research spans generalization, reasoning, and large language model agents.

Paul Christiano portrait
Researcher 1 reports

Paul Christiano

Anthropic

Founder of the Alignment Research Center and former OpenAI researcher focused on AI alignment, reasoning under uncertainty, and the long-term safety of advanced AI systems.

Rylan Schaeffer portrait
Researcher 1 reports

Rylan Schaeffer

Anthropic

Research scientist at Anthropic focused on AI alignment, language model behavior, and scalable oversight.

Scott Emmons portrait
Researcher 1 reports

Scott Emmons

Anthropic

Member of Technical Staff at Anthropic working on AI control, hidden objectives, alignment, and evaluations, with a background in language models, efficient training, and scientific machine learning.

Wes Gurnee portrait
Researcher 1 reports

Wes Gurnee

Anthropic

Member of technical staff at Anthropic working on deep learning, mechanistic interpretability, and AI safety.