Technical Papers
This page is intended for researchers considering pivoting to AI Safety, as well as for advanced undergraduates excited to get started in the field.
Mechanistic Interpretability
Mechanistic interpretability is the study of reverse engineering the algorithms a trained neural network has learned by analyzing its weights.
Several good papers:
Articles in Anthropic’s Transformer Circuits Thread (the thread collects several great papers)
Indirect Object Identification (IOI) in GPT-2 Small by Wang et al (one of the authors, Kevin Wang, is an AISST member!)
We recommend Neel Nanda’s materials for getting started in Mechanistic Interpretability, including:
TransformerLens, a Python library for doing mechanistic interpretability on GPT-2-style language models
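For a concrete taste of the tooling, here is a minimal sketch of loading GPT-2 Small with TransformerLens and inspecting an attention pattern on an IOI-style prompt. The chosen layer/head (9.9, reported as a name-mover head by Wang et al) and the hook name follow TransformerLens's standard conventions, but treat the details as a sketch to check against the library's documentation rather than a definitive recipe.

```python
# Minimal sketch: load GPT-2 Small with TransformerLens and inspect an
# attention pattern on an IOI-style prompt. Requires `pip install transformer_lens`.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

# Run the model and cache every intermediate activation.
logits, cache = model.run_with_cache(tokens)

# Top prediction for the next token (for this prompt we expect " Mary").
next_token = int(logits[0, -1].argmax())
print("Predicted next token:", repr(model.to_string(next_token)))

# Attention pattern of layer 9, head 9 (reported as a name-mover head in the
# IOI paper); cached patterns are indexed [batch, head, query_pos, key_pos].
pattern = cache["blocks.9.attn.hook_pattern"][0, 9]
print("Attention pattern shape:", tuple(pattern.shape))
```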
Eliciting Latent Knowledge & Hallucinations
Because language models are trained to predict the next token in naturally occurring text, they often reproduce common human errors and misconceptions, even when they "know better" in some sense. More worryingly, when models are trained to generate text that's rated highly by humans, they may learn to output false statements that human evaluators can't detect. One attempt to circumvent this issue is to directly elicit latent knowledge from the activations of a language model; a minimal sketch of this probing idea follows the papers below.
A few good papers:
Tuned Lens by Belrose et al (Eleuther AI)
Discovering Latent Knowledge in Language Models Without Supervision by Burns et al
Improving Factuality and Reasoning in Language Models through Multiagent Debate by Du et al
How Language Model Hallucinations can Snowball by Zhang et al
Language Models (Mostly) Know What They Know by Anthropic
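To make the probing idea concrete, below is a minimal sketch of a contrast-consistent probe in the spirit of Burns et al: a linear probe trained with their consistency-plus-confidence loss. The activations here are random placeholders (and the real method also normalizes the two activation sets); in practice they would be hidden states for "Yes"/"No" contrast pairs.

```python
# Sketch of a Contrast-Consistent Search (CCS)-style probe, after Burns et al.
# The activations below are random placeholders; in practice they would be
# hidden states for contrast pairs like "Q? Yes" vs. "Q? No".
import torch

n_pairs, d_model = 256, 64
acts_pos = torch.randn(n_pairs, d_model)  # placeholder activations for "Yes" completions
acts_neg = torch.randn(n_pairs, d_model)  # placeholder activations for "No" completions

probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

for step in range(200):
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)

    # Consistency: the probabilities for a statement and its negation should sum to ~1.
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate p_pos = p_neg = 0.5 solution.
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()

    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained probe (up to a sign ambiguity) scores statements as true or false.
print("final loss:", float(loss))
```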
AI Evaluations and Standards
AI evaluations and standards (or "evals") are processes that check or audit AI models. Evaluations can focus on how powerful models are (“capability evaluations”) or on whether models exhibit dangerous behaviors or are misaligned (“alignment evaluations” or "safety evaluations"). Working on AI evaluations might involve developing standards and enforcing compliance with them. Evaluations can help labs determine whether it's safe to deploy new models, and can support AI governance and regulation; a toy capability evaluation is sketched after the papers below.
A couple good papers:
Model Evaluations for Extreme Risks by Shevlane et al and the accompanying technical blog post (DeepMind)
GPT-4 System Card (OpenAI)
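For a toy sense of what a capability evaluation looks like mechanically, the sketch below grades a model on a miniature question set. The `ask_model` function is a hypothetical stub; a real harness would call an API or a local model and use far more careful grading.

```python
# Toy capability-evaluation harness. `ask_model` is a hypothetical placeholder
# for the model under evaluation (an API call, a local model, etc.).
def ask_model(prompt: str) -> str:
    # Placeholder: always answers "4". A real harness would query the model here.
    return "4"

# A miniature benchmark of (prompt, expected answer) pairs.
eval_set = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def run_eval(examples) -> float:
    correct = 0
    for prompt, expected in examples:
        answer = ask_model(prompt).strip()
        # Lenient grading: count it correct if the expected answer appears in the response.
        correct += int(expected.lower() in answer.lower())
    return correct / len(examples)

print(f"accuracy: {run_eval(eval_set):.2f}")
```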
Goal Misgeneralization & Specification Gaming
Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal; a toy illustration follows the papers below.
Goal Misgeneralization in Deep Reinforcement Learning by Langosco et al
Goal Misgeneralization by Shah et al (DeepMind)
Specification Gaming (DeepMind)
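The CoinRun example from Langosco et al can be caricatured in a few lines: the toy Q-learning agent below is trained in a corridor where the coin always sits at the right end, then evaluated with the coin moved to the left end. The environment, hyperparameters, and numbers are all invented for illustration; the point is only that the learned policy keeps competently walking right while missing the relocated coin.

```python
# Toy goal-misgeneralization demo (loosely inspired by the CoinRun example).
import random

random.seed(0)

N_CELLS = 5          # corridor cells 0..4
ACTIONS = [-1, +1]   # step left, step right
GAMMA, LR = 0.9, 0.1

def step(pos, action):
    return min(max(pos + action, 0), N_CELLS - 1)

# --- Training: tabular Q-learning with the coin always at the rightmost cell. ---
# (Q-learning is off-policy, so a uniformly random behaviour policy suffices.)
Q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
for _ in range(2000):
    pos = 2
    for _ in range(10):
        action = random.choice(ACTIONS)
        nxt = step(pos, action)
        reward = 1.0 if nxt == N_CELLS - 1 else 0.0   # coin on the right
        target = reward + GAMMA * max(Q[(nxt, a)] for a in ACTIONS)
        Q[(pos, action)] += LR * (target - Q[(pos, action)])
        pos = nxt
        if reward:
            break

def greedy(pos):
    return max(ACTIONS, key=lambda a: Q[(pos, a)])

# --- Evaluation: move the coin and watch the learned policy. ---
def rollout(coin_pos, start=2, max_steps=10):
    pos = start
    for _ in range(max_steps):
        pos = step(pos, greedy(pos))
        if pos == coin_pos:
            return True   # picked up the coin
    return False          # still navigates competently, but to the wrong place

print("coin at right end (training distribution):", rollout(coin_pos=N_CELLS - 1))
print("coin at left end (out of distribution):   ", rollout(coin_pos=0))
```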
Emergent Abilities
It sometimes seems that large language models’ abilities emerge suddenly and unpredictably. If we don’t know the capabilities of our models, we don’t know how dangerous they may be.
Two papers with contrasting perspectives:
Emergent Abilities of Large Language Models by Wei et al
Are Emergent Abilities of Large Language Models a Mirage? by Schaeffer, Miranda, and Koyejo
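One way to see the Schaeffer, Miranda, and Koyejo argument is that a smoothly improving per-token metric can look like a sudden jump under an all-or-nothing metric. The numbers in the sketch below are made up: a hypothetical per-token accuracy rises smoothly with scale, while exact-match accuracy on a 10-token answer appears to "emerge".

```python
# Toy illustration of the "emergence as a mirage" argument: a smooth
# per-token metric can look discontinuous under an exact-match metric.
# All numbers here are invented for illustration.
import math

ANSWER_LENGTH = 10  # tokens that must all be correct for an exact match

print(f"{'log10 params':>12} {'per-token acc':>14} {'exact match':>12}")
for log_params in [8, 9, 10, 11, 12, 13]:
    # Hypothetical smooth scaling curve for per-token accuracy.
    per_token = 1 / (1 + math.exp(-(log_params - 10.5)))
    # Exact match requires every token to be right (independence assumed).
    exact_match = per_token ** ANSWER_LENGTH
    print(f"{log_params:>12} {per_token:>14.3f} {exact_match:>12.4f}")
```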
Survey Papers
Other
The Alignment Problem from a Deep Learning Perspective, an introduction to the alignment problem by Ngo, Chan, and Mindermann
Locating and Editing Factual Associations in GPT by Meng et al
Constitutional AI: Harmlessness from AI Feedback by Anthropic
Power-seeking by Turner et al
Heuristic Arguments (Alignment Research Center)