Technical Papers

This page is intended for researchers considering a pivot into AI Safety, as well as for advanced undergraduates excited to get started in the field.

Mechanistic Interpretability

Mechanistic interpretability is the study of reverse-engineering the algorithms a trained neural network has learned by analyzing its weights.
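
A minimal sketch of the kind of first step this involves, using PyTorch with a small randomly initialized model standing in for a real trained network: register forward hooks so the model's intermediate activations can be recorded and inspected, rather than looking only at its outputs.

```python
# Toy illustration of a basic mechanistic-interpretability workflow:
# record each layer's activations with forward hooks so the internal
# computation can be inspected. The model here is a small, randomly
# initialized MLP standing in for a real trained network.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Attach a forward hook to every submodule so its output is captured.
for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(make_hook(name))

model(torch.randn(4, 16))

for name, act in activations.items():
    print(name, tuple(act.shape))
```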

Several good papers:

We recommend Neel Nanda’s materials for getting started in Mechanistic Interpretability, including:

Eliciting Latent Knowledge & Hallucinations

Because language models are trained to predict the next token in naturally occurring text, they often reproduce common human errors and misconceptions, even when they "know better" in some sense. More worryingly, when models are trained to generate text that's rated highly by humans, they may learn to output false statements that human evaluators can't detect. One approach to circumventing this issue is to directly elicit the latent knowledge encoded in a language model's activations.
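
As a rough illustration of what eliciting knowledge from activations can look like, here is a minimal sketch of a supervised linear "truth probe". The activations below are synthetic stand-ins (built around an assumed "truth direction"); in real work they would be hidden states extracted from a language model run on true and false statements, and some methods learn a similar probe without any labels.

```python
# Sketch of a linear "truth probe" on hidden activations. The activations
# below are synthetic stand-ins; in practice they would be hidden states
# extracted from a language model run on true and false statements.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 64
n_examples = 512

# Pretend there is a "truth direction" in activation space.
truth_direction = torch.randn(hidden_dim)
labels = torch.randint(0, 2, (n_examples,)).float()
activations = torch.randn(n_examples, hidden_dim) + torch.outer(2 * labels - 1, truth_direction)

probe = nn.Linear(hidden_dim, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    optimizer.zero_grad()
    logits = probe(activations).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

accuracy = ((probe(activations).squeeze(-1) > 0).float() == labels).float().mean()
print(f"probe accuracy: {accuracy:.2f}")
```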

A few good papers:

AI Evaluations and Standards

AI evaluations and standards (or "evals") are processes that check or audit AI models. Evaluations can focus on how powerful models are ("capability evaluations") or on whether models are exhibiting dangerous behaviors or are misaligned ("alignment evaluations" or "safety evaluations"). Working on AI evaluations might involve developing standards and enforcing compliance with them. Evaluations can help labs determine whether it's safe to deploy new models, and can inform AI governance and regulation.
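
As a very rough sketch of what a capability evaluation can look like mechanically, the hypothetical harness below runs a model over a small set of tasks with known answers and reports an exact-match score. The `dummy_model` function is a placeholder; real evaluations involve far more careful task design, elicitation, and grading.

```python
# Hypothetical, highly simplified capability-evaluation harness.
# `dummy_model` is a placeholder; a real eval would query an actual model
# and grade answers far more carefully than exact string match.
from typing import Callable

EVAL_TASKS = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

def run_eval(model: Callable[[str], str], tasks: list[dict]) -> float:
    """Return the fraction of tasks the model answers correctly (exact match)."""
    correct = 0
    for task in tasks:
        answer = model(task["prompt"]).strip()
        if answer == task["expected"]:
            correct += 1
    return correct / len(tasks)

def dummy_model(prompt: str) -> str:
    # Stand-in for a real model API call.
    return "4" if "2 + 2" in prompt else "I don't know"

print(f"score: {run_eval(dummy_model, EVAL_TASKS):.2f}")
```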

A couple of good papers:

Goal Misgeneralization & Specification Gaming

Goal misgeneralization failures occur when an RL agent retains its capabilities out of distribution yet pursues the wrong goal. Specification gaming, by contrast, occurs when an agent exploits flaws in its specified reward or objective, achieving high measured reward without doing what its designers intended.
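
A toy illustration (a hypothetical gridworld, not drawn from any particular paper): if the goal always happens to sit in the top-right corner during training, a policy that simply heads top-right earns full reward. When the goal is placed elsewhere at test time, the agent still navigates competently, but toward the wrong target.

```python
# Toy illustration of goal misgeneralization in a hypothetical gridworld.
# During training the goal always sits in the top-right corner, so a policy
# that just heads top-right looks perfect. Out of distribution, the agent
# still moves competently, but toward the wrong place.
import random

GRID = 5

def heads_top_right(pos):
    """The learned policy: always move toward the top-right corner."""
    x, y = pos
    if x < GRID - 1:
        return (x + 1, y)
    if y < GRID - 1:
        return (x, y + 1)
    return pos

def run_episode(policy, goal, steps=20):
    pos = (0, 0)
    for _ in range(steps):
        pos = policy(pos)
        if pos == goal:
            return True
    return False

train_goals = [(GRID - 1, GRID - 1)] * 100  # goal always in the top-right corner
test_goals = [(random.randrange(GRID), random.randrange(GRID)) for _ in range(100)]

train_success = sum(run_episode(heads_top_right, g) for g in train_goals) / 100
test_success = sum(run_episode(heads_top_right, g) for g in test_goals) / 100
print(f"train success: {train_success:.2f}, test success: {test_success:.2f}")
```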

Emergent Abilities

Large language models' abilities sometimes seem to emerge suddenly and unpredictably as models are scaled up. If we don't know the capabilities of our models, we don't know how dangerous they may be.
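
Part of the debate is about whether apparent emergence reflects the model or the metric. The sketch below (with entirely made-up numbers) shows how a smooth improvement in per-token accuracy can look like a sudden jump when measured by exact match over a long answer, where every token must be correct.

```python
# Rough numerical sketch (made-up numbers) of how metric choice can make
# smooth improvement look like sudden emergence: per-token accuracy rises
# smoothly with scale, but exact-match accuracy on a 20-token answer
# (all tokens must be right) stays near zero and then jumps.
import math

scales = [10**k for k in range(6, 14)]  # pretend parameter counts
answer_length = 20

for n_params in scales:
    # Hypothetical smooth relationship between scale and per-token accuracy.
    per_token_acc = 1 / (1 + math.exp(-(math.log10(n_params) - 9)))
    exact_match = per_token_acc ** answer_length
    print(f"{n_params:>14,d} params  per-token {per_token_acc:.2f}  exact-match {exact_match:.4f}")
```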

Two papers with contrasting perspectives:

Survey Papers

Other