AISST’s 2024 Summer in Research

The summer months have come and gone, and the AI Safety Student Team (AISST) has returned to Harvard! Our members were flung far, soaking up sun in San Francisco, New York, London, Cambridge (UK), Cambridge (US), Washington D.C., Chicago, and many other hotbeds of opportunity. In every case, opportunity was seized.

In service of our mission, “to reduce catastrophic risks from AI through research, education, and capacity building,” our members have, since June 1, 2024, alone, pursued and released more than 24 technical AI safety papers and AI governance works, many published in prestigious venues around the world. Many of these research projects were led or substantially contributed to by undergraduates. Some are already becoming influential works in the field.

To celebrate the accomplishments and scientific contributions of our members who pursued research this summer, and to further our mission to educate on the safety of AI systems, we present AISST’s first summer in research!

Adversarial Robustness

How can we ensure AI models act the way we want them to, even when adversaries may prompt them, change them, or analyze them in ways that can elicit bad behavior? This is the question taken up by the researchers studying the “adversarial robustness” of AI systems. Three of our members tackled this problem over the summer, yielding four papers between them. 

Rowan Wang (College ’27) contributed to two major papers that create Large Language Models (LLMs) resistant to novel types of adversarial attacks. In the first, “Improving Alignment and Robustness with Circuit Breakers,” the authors train a Llama-3-8B finetune called “Cygnet” that triggers a “circuit breaker” when it detects harmful representations in its intermediate activations, redirecting further generation toward a harmless direction in representation space. This approach is particularly novel because it targets harmful outputs rather than harmful inputs, meaning it can generalize to different types of attacks more effectively than other methods. The excellent results of this paper have already spun out multiple related research projects. One of these is the second major paper that Rowan contributed to, on creating model safeguards that persist despite adversarial fine-tuning. This method is unique in that it is designed to secure not only API-deployed models but even open-source ones, which are notoriously difficult to defend.
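
For intuition, here is a minimal sketch of the general idea (not the paper’s implementation): a forward hook watches a layer’s hidden states, and if one looks too similar to a “harmful” direction, the representation is rerouted toward a harmless one. The toy model, directions, and threshold below are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Toy stand-in for one transformer block; the real method operates on an LLM's layers.
block = nn.Linear(d_model, d_model)

# Illustrative directions; in practice these would be derived from harmful/harmless data.
harmful_direction = torch.randn(d_model)
harmful_direction /= harmful_direction.norm()
refusal_direction = torch.randn(d_model)
refusal_direction /= refusal_direction.norm()

def circuit_breaker_hook(module, inputs, output):
    """If a hidden state looks 'harmful', reroute it toward a harmless direction."""
    similarity = torch.cosine_similarity(output, harmful_direction.expand_as(output), dim=-1)
    tripped = similarity > 0.5  # illustrative threshold
    rerouted = refusal_direction.expand_as(output) * output.norm(dim=-1, keepdim=True)
    return torch.where(tripped.unsqueeze(-1), rerouted, output)

block.register_forward_hook(circuit_breaker_hook)

hidden_states = torch.randn(4, d_model)
print(block(hidden_states).shape)  # generation would continue from the (possibly rerouted) states
```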

Luke Bailey (College ’24, now pursuing a PhD in CS at Stanford) collaborated with a number of researchers at Stanford, Berkeley, and elsewhere to study the transferability of jailbreaking attacks on vision-language models (VLMs). This paper tested the ability of an adversarial image, created using gradient-based attack methods, to jailbreak open-source VLMs other than the one on which the attack was optimized. They find that adversarial images do not transfer well between VLMs, potentially implying that multimodal models are more robust to attacks delivered through image inputs than to those delivered through text.
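
To illustrate the kind of attack whose transfer is being tested, here is a rough sketch, with tiny toy models standing in for VLMs and a placeholder “harmfulness score” standing in for the log-probability of a harmful completion; it is not the paper’s attack code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for two VLMs' image pathways; real attacks target full vision-language models.
source_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))
target_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))

def harmfulness_score(model, image):
    # Stand-in for "log-probability of the harmful target completion given this image."
    return model(image).mean()

image = torch.rand(1, 3, 8, 8)
adv_image = image.clone().requires_grad_(True)

# Simple gradient-ascent (PGD-like) attack optimized against the source model only.
for _ in range(100):
    loss = harmfulness_score(source_model, adv_image)
    loss.backward()
    with torch.no_grad():
        adv_image += 0.01 * adv_image.grad.sign()
        adv_image.clamp_(0.0, 1.0)
    adv_image.grad.zero_()

# Transfer test: does the optimized image also raise the score on a different model?
print("source:", harmfulness_score(source_model, adv_image).item())
print("target:", harmfulness_score(target_model, adv_image).item())
```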

AISST member Ekdeep Singh Lubana (CS PhD ’24, University of Michigan) contributed to a study of the mechanisms through which safety fine-tuning prevents models from responding helpfully to harmful queries. They find that models often learn to project unsafe samples into the null space of the model’s weights. Further experiments show that jailbreaks often work by producing an activation distribution similar to that of safe samples, thereby avoiding this redirection.
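
As a toy illustration of what “projecting into the null space” means, the sketch below uses a random stand-in weight matrix and shows that any component of an activation lying in its null space is mapped to (numerically) zero, i.e., effectively erased by that layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# A wide toy "weight matrix": maps a 6-dim activation to a 3-dim output, so it has a null space.
W = rng.standard_normal((3, 6))

# Orthonormal basis for the null space of W (right singular vectors with zero singular values).
_, _, Vt = np.linalg.svd(W)
null_basis = Vt[3:]            # shape (3, 6)

activation = rng.standard_normal(6)

# Project the activation onto the null space: W then maps that component (numerically) to zero.
null_component = null_basis.T @ (null_basis @ activation)

print(np.linalg.norm(W @ activation))       # generally far from zero
print(np.linalg.norm(W @ null_component))   # ~0: this part of the signal is "erased" downstream
```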

All of these works further the safety of AI systems by making it more difficult for models to be misused to cause harm.

Interpretability

AI systems are notorious as “black boxes,” opaque answer-generating machines that make unpredictable and often inscrutable decisions. Can we understand how and why models create the outputs that they do? This area of research, called “interpretability,” uses techniques often derived from model internals (such as activations and gradients) to answer such questions.  Four AISST members published in this area this summer, spread across at least four papers. 

Members Logan Smith, Claudio Mayrink Verdun (SEAS postdoc), and Samuel Marks (former Northeastern postdoc, now AI safety researcher at Anthropic) collaborated on a paper, accepted as an Oral Presentation at the 2024 International Conference on Machine Learning, studying the use of Sparse Autoencoders (SAEs) on models trained to play the board games chess and Othello. Their results include a metric based on whether a board state can be reconstructed using only SAE features, as well as a new method, “p-annealing,” for training sparse and computationally cheap SAEs. Progress on SAEs is an important area of safety research, given that they can be used to identify specific behaviors and patterns in LLM responses.
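
For readers unfamiliar with SAEs, here is a minimal training sketch on random stand-in activations, with a simplified version of an annealed Lp sparsity penalty; the paper’s actual p-annealing schedule and training details differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_sae = 32, 128          # the SAE dictionary is wider than the activation space

encoder = nn.Linear(d_model, d_sae)
decoder = nn.Linear(d_sae, d_model)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

activations = torch.randn(4096, d_model)   # stand-in for cached model activations

num_steps = 500
for step in range(num_steps):
    batch = activations[torch.randint(0, len(activations), (256,))]
    features = torch.relu(encoder(batch))          # sparse feature activations
    reconstruction = decoder(features)

    # Simplified p-annealing: lower the sparsity penalty's exponent from 1 toward 0.5,
    # pushing features toward sparser solutions as training progresses.
    p = 1.0 - 0.5 * step / num_steps
    sparsity = (features.abs() + 1e-8).pow(p).sum(dim=-1).mean()
    loss = (reconstruction - batch).pow(2).mean() + 1e-3 * sparsity

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final fraction of active features: {(features > 0).float().mean().item():.3f}")
```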

Oam Patel (College ’25) contributed to a paper that, among other work, used interpretability techniques to show that conversational language models maintain a “user model.” In their residual stream activations, models track information about the age, gender, socioeconomic status, and education of the user they are conversing with. Probing demonstrates that these features are strongly linearly represented (especially in later layers) and causally linked to responses. For instance, models were more likely to recommend cheap flight options when, on the same inputs, interventions set “the internal representation of the user to low socioeconomic status.”
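
A linear probe of the kind used in such analyses can be sketched in a few lines; the synthetic activations and binary attribute below are stand-ins for real residual-stream activations and user labels, not the paper’s data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for residual-stream activations and a binary user attribute (e.g., an age bucket).
n_samples, d_model = 2000, 64
attribute = rng.integers(0, 2, size=n_samples)
direction = rng.standard_normal(d_model)
activations = rng.standard_normal((n_samples, d_model)) + np.outer(attribute, direction)

X_train, X_test, y_train, y_test = train_test_split(activations, attribute, random_state=0)

# A linear probe: if it predicts the attribute well, the attribute is linearly represented.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")

# The probe's weight vector also suggests a direction along which one could intervene
# (e.g., shifting activations to change the model's "user model"), as the paper does causally.
print(probe.coef_.shape)
```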

Samuel Marks contributed to two additional interpretability papers released this summer. The first is a broad position paper on the history and methods of this nascent field, discussing the pros and cons of different approaches to understanding neural networks; the authors argue for developing future methods that yield insight beyond linear probing while being less compute-intensive. The second paper provides an open-source package called “NNsight” for performing a number of interpretability techniques, and it introduces the National Deep Inference Fabric, which gives researchers access to large models. In the future, the project hopes to provide secure access to proprietary models as well.

These contributions to model interpretability help us understand the behavior of models, perhaps paving the way toward reliable methods for identifying misbehavior before it can cause harm.

Governance

It has become increasingly popular to believe that certain aspects of the frontier AI industry are not conducive to safe and responsible development. To take just one example, firms face intense competitive pressures that may incentivize them to skimp on safety measures and push out products before they are adequately vetted. For reasons like these, regulatory action seems to be a fruitful area for research as policymakers attempt to determine how best to govern this emerging technology. At least five members have published on this topic since June. Many more worked in governance roles that do not focus on publications.

Early in the summer, AISST Director Gabe Wu (College ’25) and Deputy Director Nikola Jurković (College ’25) released an analysis of a survey on the “influence of AI on the study habits, class choices, and career prospects of Harvard undergraduates.” The results of the 326-participant study show that 40% of respondents agree that extinction risk from AI should be a global priority, and a similar share believe that “AI systems will be more capable than humans in almost all regards within 30 years.” These observations can inform policy and governance at the university, firm, and government levels.

Nikola Jurković was also an intern at Model Evaluation and Threat Research (METR), where he worked on evaluations for general autonomous capabilities in frontier language models. Evaluations like these are critical for informing policymakers of the risks posed by AI systems now and in the future. METR released an article summarizing the work this August.  

Carson Ezell (College ’25) is a first author on “Black-Box Access is Insufficient for Rigorous AI Audits,” an argument paper published at the 2024 Association for Computing Machinery (ACM) Conference on Fairness, Accountability, and Transparency. They argue forcefully that the level of access developers currently provide to external overseers (scrutiny of inputs and outputs, dubbed “black-box access”) cannot truly determine the safety of a system. Instead, “white-box” and “outside-the-box” access to gradients, weights, activations, training information, data, and more should be provided.

Carson Ezell authored another major work this summer, this one accepted to the 2024 AAAI/ACM Conference on AI, Ethics, and Society in San Jose. This paper, a collaboration with Harvard Professor of Government Daniel Carpenter (who is not a member of AISST), discusses the “Pitfalls and Plausibility” of an “approval regulation” model of AI governance based on the US Food and Drug Administration’s drug approval process. They argue that, although such a proposal enjoys widespread support among various stakeholders and may provide benefits, it is currently far from realistic to implement.

Cole Salvador (College ’26) is the sole author of a policy report that also takes up the question of applying approval regulation to frontier AI. The report, called “Certified Safe,” proposes a scheme in which large-scale AI projects can develop and deploy models only after scrutiny and permission from a government regulator. It also diagnoses and discusses a number of the most salient challenges facing such a proposal.

Eddie Zhang (pursuing a PhD in CS at Harvard, currently on leave at OpenAI) is first author on a position paper detailing a research agenda known as “Social Environment Design.” The paper, presented at the 2024 International Conference on Machine Learning in Vienna, argues for applying AI-driven simulation to policy decision-making in pursuit of common objectives, which may better align the decisions of policymakers with the desires of constituents.

Works like these improve the entire landscape of discussion and policy around frontier AI, benefiting especially from our members’ intimate technical knowledge, a highly valuable asset in the governance space.

Model Behavior

One fundamental challenge in ensuring future safe development and use of AI is the general observation that model behavior can be unpredictable, especially under novel circumstances. Six AISST members published on this topic over the summer, trying to better understand and manipulate AI behaviors. 

Eddie Zhang first authored and Benjamin Edelman (Harvard CS PhD ’24, incoming TechCongress Fellow at US AI Safety Institute) contributed to a paper aptly titled “Transcendence: Generative Models Can Outperform the Experts that Train Them.” This work demonstrates a startling phenomenon: models can outperform their training data under specific conditions. In particular, their experiments show (among other results) that transformers trained to play chess using only data from players under a specific skill level can perform above that level.  
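
One intuition the authors develop is that low-temperature sampling can act like a majority vote over the (imperfect) experts represented in the training data. The toy simulation below illustrates that intuition with made-up numbers and a simplified setup; it is not the paper’s experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, n_positions = 51, 10_000
expert_accuracy = 0.7            # each "expert" picks the best move 70% of the time

# 1 = best move, 0 = a mistake; each expert errs independently on each position.
expert_moves = rng.random((n_experts, n_positions)) < expert_accuracy

# A model trained on the pooled data and sampled at low temperature behaves roughly like
# a majority vote over the experts' choices at each position (a simplifying assumption here).
majority_moves = expert_moves.mean(axis=0) > 0.5

print(f"average expert accuracy:  {expert_moves.mean():.3f}")   # ~0.70
print(f"majority-vote accuracy:   {majority_moves.mean():.3f}") # noticeably higher
```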

Member Core Francisco Park (pursuing a CS PhD at Harvard) first authored and Ekdeep Singh Lubana contributed to a work on the “Emergence of Hidden Capabilities.” This paper proposes that generative models acquire hidden (i.e., latent) capabilities during training that only become apparent later. This matches the observed phenomenon that some models have capabilities that cannot easily be elicited by naive prompting but can nevertheless be elicited with more effort. The work contributes to a broader scholarly conversation on the rapid emergence of unpredictable capabilities in generative AI, which has obvious implications for the safety of such systems. Lubana also first authored another paper extending this work, which analogizes emergence to phase transitions in physics and uses a toy setting to test the extent to which emergent capabilities can be predicted.

Not to be outdone by his three contributions in interpretability, Samuel Marks contributed to two prominent studies of model behavior published this summer. The first, “Sycophancy to Subterfuge,” investigates the phenomenon of “reward tampering” in language models. The authors find that training models in simple environments that reward specification gaming can encourage more serious gaming in more complex environments, a potentially worrying tendency. In the second paper, “Connecting the Dots,” experiments demonstrate that LLMs can infer information about concepts even when their training data includes only peripheral information about those concepts. For example, a model fine-tuned on a corpus of coin flip results can internally store, and verbally report, the coin’s bias. This capability may pose problems for oversight that depends on directly observing harmful language or patterns of representation.

Eric Li (College ’24) is first author on a paper introducing a “diagnostic benchmark for generalist web agents” that determines why agents fail. It includes discrete tasks (e.g., clicking a button) and end-to-end tasks (e.g., purchasing an item). Evaluations of this type may become increasingly important as more AI systems are deployed as internet-navigating agents, and this one is especially interesting because it allows evaluators to directly observe the causes of failure.
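
Purely as a hypothetical sketch (the field names and structure below are our assumptions, not the paper’s schema), a diagnostic benchmark might represent tasks and results so that a failure can be localized to a specific step:

```python
from dataclasses import dataclass, field

@dataclass
class WebAgentTask:
    """One benchmark task; an end-to-end task decomposes into discrete steps."""
    task_id: str
    goal: str                                         # natural-language objective given to the agent
    steps: list[str] = field(default_factory=list)    # discrete sub-actions, e.g. "click 'Add to cart'"

@dataclass
class TaskResult:
    task_id: str
    completed_steps: int            # how far the agent got before failing
    success: bool

    def failure_step(self, task: "WebAgentTask") -> str | None:
        """Localize the failure: the first step the agent did not complete."""
        if self.success or self.completed_steps >= len(task.steps):
            return None
        return task.steps[self.completed_steps]

task = WebAgentTask("buy-001", "Purchase a phone charger",
                    steps=["search 'phone charger'", "click first result",
                           "click 'Add to cart'", "check out"])
result = TaskResult("buy-001", completed_steps=2, success=False)
print(result.failure_step(task))    # -> "click 'Add to cart'"
```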

Other Technical Work

Some members pursued technical work that doesn’t fit neatly into one of the above categories. 

Chloe Loughridge (College ’24) is a first author on “DafnyBench,” a paper detailing a benchmark for the use of LLMs in formal software verification. The paper is part of a larger project exploring the use of formal verification, which can guarantee that software meets its specification, in developing reliable and safe AI. They find that LLMs can be quite helpful in producing formally verified programs in the language Dafny, which may mean the cost of verifying software will decline over time thanks to AI assistance.
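
A benchmark of this kind implies an evaluation loop roughly like the following sketch, where `ask_llm` is a hypothetical placeholder for a model API call and the Dafny command-line invocation is an assumption that may differ by installed version; this is not the paper’s code.

```python
import subprocess
import tempfile
from pathlib import Path

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to a language model API."""
    raise NotImplementedError

def verify_with_dafny(program_text: str) -> bool:
    """Write the candidate program to disk and run the Dafny verifier on it.

    Assumes a local Dafny installation; the exact CLI invocation may differ by version.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.dfy"
        path.write_text(program_text)
        result = subprocess.run(["dafny", "verify", str(path)], capture_output=True, text=True)
        return result.returncode == 0

def evaluate_one(program_without_hints: str) -> bool:
    """Score an LLM on restoring the annotations needed for verification to succeed."""
    prompt = (
        "Fill in the missing loop invariants, assertions, and other annotations so that "
        "this Dafny program verifies. Return only the complete program.\n\n"
        + program_without_hints
    )
    candidate = ask_llm(prompt)
    return verify_with_dafny(candidate)
```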

Claudio Mayrink Verdun has published an impressive four further works this summer. He is a first author on “Multi-Group Proportional Representation,” which introduces a new metric for reducing bias in image search and retrieval. In particular, it focuses on ensuring that intersectional identity groups (i.e., combinations of identity attributes) are sufficiently represented.
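
As a simplified illustration (not the paper’s exact formulation), one can tag each retrieved item with attributes, form intersectional groups, and measure the worst-case shortfall relative to target proportions:

```python
from collections import Counter
from itertools import product

# Each retrieved item is tagged with attributes; intersectional groups are their combinations.
retrieved = [
    {"gender": "woman", "age": "older"},
    {"gender": "woman", "age": "younger"},
    {"gender": "man", "age": "younger"},
    {"gender": "man", "age": "younger"},
    {"gender": "woman", "age": "younger"},
    {"gender": "man", "age": "older"},
]

# Target proportions, e.g. matching the underlying population.
targets = {group: 0.25 for group in product(["woman", "man"], ["older", "younger"])}

counts = Counter((item["gender"], item["age"]) for item in retrieved)
observed = {group: counts[group] / len(retrieved) for group in targets}

# Worst-case shortfall across intersectional groups (0 means every group meets its target).
shortfall = max(target - observed[group] for group, target in targets.items())
print(observed)
print(f"worst-case representation shortfall: {shortfall:.3f}")
```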

His three other papers address problems in the application of machine learning to MRI scanning. In one, the authors develop and prove guarantees for methods that use MRI data to construct a confidence interval at each pixel, including an upper bound on the amount of data needed to create these intervals; the paper won a best student paper award at ICASSP, the most prestigious conference in signal processing. In another, the authors improve on existing uncertainty quantification results for MRI scans, using a new “sampling without replacement” method to improve reconstruction. Finally, he released “Non-Asymptotic Uncertainty Quantification in High-Dimensional Learning,” which further prevents overestimation of uncertainty in MRI diagnosis while improving the stability of deep learning training.

Conclusion

In the AI Safety Student Team’s short history at Harvard, it has already amassed a substantial body of dedicated and effective scholars. Many come to AISST equipped to contribute; others are forged in the fires of the technical and policy intro fellowships, Saturday technical training, and AISST member meetings. Either way, it is clear that our members are not just learning and teaching about, but also solving, the foremost problems in the field.

We are incredibly proud of all the scholarly, impactful, and novel research published by our members over this summer and featured here. We cannot wait to see what amazing things, in research and beyond, our members do this fall!


Interested in what we’re doing? Join our mailing list!