Leveraging Mechanistic Interpretability for Red-Teaming: Haize Labs x Goodfire
Probing black-box AI systems for harmful, unexpected, and out-of-distribution behavior has historically been very hard. Canonically, the only way to test models for unexpected behaviors (i.e. red-team) has been to operate in the prompt domain, i.e. by crafting jailbreak prompts. This is of course a lot of what we think about at Haize Labs. But this need not be the only way. Red-Teaming by Manipulating Model Internals One can also red-team models in a mechanistic fashion by analyzing and manipulating their internal activations....
Simple, Safe, and Secure RAG: Haize Labs x MongoDB
RAG is a powerful and popular approach to ground GenAI responses in external knowledge. It has the potential to enable truly useful tools for high-stakes enterprise use cases, especially when paired with a powerful vector store solution like MongoDB Atlas. However, RAG apps may not be trustworthy and reliable out-of-the-box. In particular, they lack two things: Role-based access control (RBAC) when performing retrieval over sensitive enterprise documents Mechanisms to defend against malicious instructions (e....
Endless Jailbreaks with Bijection Learning: a Powerful, Scale-Agnostic Attack Method
Lately, we've been working on understanding the impact of model capabilities on model safety. Models are becoming more and more capable. Recently released models may be better aligned with human preferences through more sophisticated safety guardrails; however, when jailbroken, these models can incorporate deep world knowledge and complex reasoning into unsafe responses, leading to more severe misuse. We believe that more powerful models come with new, emergent vulnerabilities not present in small models....
Red-Teaming Language Models with DSPy
At Haize Labs, we spend a lot of time thinking about automated red-teaming. At its core, this is really an autoprompting problem: how does one search the combinatorially infinite space of language for an adversarial prompt? If you want to skip this exposition and go straight to the code, check out our GitHub Repo. Enter DSPy One way to go about this problem is via DSPy, a new framework out of Stanford NLP used for structuring (i....
Making a SOTA Adversarial Attack on LLMs 38x Faster
Last summer, a group of CMU researchers developed one of the first automated red-teaming methods for LLMs [1]. Their approach, the Greedy Coordinate Gradient (GCG) algorithm, was able to jailbreak white-box models like Vicuna convincingly and models like Llama 2 to a partial degree. Interestingly, the attacks generated by GCG also transferred to black-box models, implying the existence of a shared, underlying set of common vulnerabilities across models. GCG is a perfectly good initial attack algorithm, but one common complaint is its lack of efficiency....
Practical Mechanisms for Preventing Harmful LLM Generations
So let's say that we have procured a set of adversarial inputs that successfully elicit harmful behaviors from a language model, provided by Haize Labs or some other party. This is great, but now what do we do to defend against and prevent these attacks in practice? Background: Standard Safety Fine Tuning The standard procedure at places like OpenAI [1], Anthropic [2], and Meta [3], and from various open-source efforts [4] is to "bake in" defensive behavior in response to some adversarial attack....
Red-Teaming Resistance Leaderboard
Last week, Haize Labs released the Red-Teaming Resistance Leaderboard in collaboration with the amazing Hugging Face team! Why a New Leaderboard? While there has been no shortage of great work in the recent automated red-teaming literature, we felt that many of these attacks were extremely contrived and unlikely to appear in the wild in a way that would realistically and negatively impact language models. Moreover, the majority of these attacks were easily marred by simple and lightweight classifier-based defenses....
Content Moderation APIs are Really, Really Bad
With an ever-growing collection of text content on the web, and ever-growing LLM-powered systems that automatically interact with such data, scalable content moderation emerges as a significant challenge. One seemingly reasonable solution is from OpenAI. Since a tenet of their mission is to ensure AI benefits all of humanity, careful development and deployment of their models is top of mind. As part of this effort, OpenAI offers the free Moderation API....