Red-Teaming Language Models with DSPy

At Haize Labs, we spend a lot of time thinking about automated red-teaming. At its core, this is really an autoprompting problem: how does one search the combinatorially infinite space of language for an adversarial prompt? If you want to skip this exposition and go straight to the code, check out our GitHub Repo. One way to go about this problem is via DSPy, a new framework out of Stanford NLP used for structuring (i....

April 9, 2024

Making a SOTA Adversarial Attack on LLMs 38x Faster

Last summer, a group of CMU researchers developed one of the first automated red-teaming methods for LLMs [1]. Their approach, the Greedy Coordinate Gradient (GCG) algorithm, was able to jailbreak white-box models like Vicuna convincingly and models like Llama 2 to a partial degree. Interestingly, the attacks generated by GCG also transferred to black-box models, implying the existence of a shared, underlying set of common vulnerabilities across models. GCG is a perfectly good initial attack algorithm, but one common complaint is its lack of efficiency....
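To give a feel for the "greedy coordinate" idea behind GCG, here is a deliberately tiny sketch. The real algorithm uses model gradients to propose candidate token swaps and a model-based loss to rank them; this toy version swaps in a made-up black-box score (overlap with a hypothetical target string) so it runs without an LLM. `VOCAB`, `TARGET`, and `score` are all illustrative stand-ins, not part of GCG itself.

```python
# Toy sketch of the greedy coordinate idea behind GCG.
# Real GCG ranks candidate token swaps using model gradients and a
# model-based loss; here a made-up score stands in for that loss.

VOCAB = list("abcdefghijklmnopqrstuvwxyz")
TARGET = "jailbreak"  # hypothetical behavior the suffix should elicit


def score(suffix: str) -> int:
    """Stand-in for a model-based loss: count of positions matching TARGET."""
    return sum(a == b for a, b in zip(suffix, TARGET))


def greedy_coordinate_search(n_passes: int = 1) -> str:
    """Sweep over coordinates; at each, greedily commit the best token swap."""
    suffix = ["x"] * len(TARGET)  # arbitrary fixed starting suffix
    for _ in range(n_passes):
        for pos in range(len(suffix)):
            # Try every vocabulary token at this coordinate, keep the best.
            best_tok = max(
                VOCAB,
                key=lambda t: score("".join(suffix[:pos] + [t] + suffix[pos + 1:])),
            )
            suffix[pos] = best_tok
    return "".join(suffix)
```

The efficiency complaint is visible even in this toy: each coordinate update scores every vocabulary token, and the real algorithm does this against a full forward pass of the model, which is exactly the cost our speedup work targets.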

March 28, 2024

Practical Mechanisms for Preventing Harmful LLM Generations

So let's say that we have procured a set of adversarial inputs that successfully elicit harmful behaviors from a language model – provided by Haize Labs or some other party. This is great, but now what do we do to defend against and prevent these attacks in practice? Background: Standard Safety Fine-Tuning The standard procedure at places like OpenAI [1], Anthropic [2], and Meta [3], and in various open-source efforts [4], is to "bake in" defensive behavior in response to some adversarial attack....

March 24, 2024

Red-Teaming Resistance Leaderboard

Last week, Haize Labs released the Red-Teaming Resistance Leaderboard in collaboration with the amazing Hugging Face team! Why a New Leaderboard? While there has been no shortage of great work in the recent automated red-teaming literature, we felt that many of these attacks were extremely contrived and unlikely to appear in the wild in a way that would realistically and negatively impact language models. Moreover, the majority of these attacks are easily defeated by simple, lightweight classifier-based defenses....

February 24, 2024

Content Moderation APIs are Really, Really Bad

With an ever-growing volume of text content on the web, and ever more LLM-powered systems that automatically interact with such data, scalable content moderation emerges as a significant challenge. One seemingly reasonable solution comes from OpenAI. Since a tenet of their mission is to ensure AI benefits all of humanity, careful development and deployment of their models is top of mind. As part of this effort, OpenAI offers the free Moderation API....

January 10, 2024