Kamal Ndousse | Researcher Profile

Alignment and RLHF | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Alignment and RLHF | Constitutional AI: Harmlessness from AI Feedback

Alignment and RLHF | Collective Constitutional AI: Aligning a Language Model with Public Input

Alignment and Safety | Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Alignment and Safety | Constitutional Classifiers++: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming