Last week, Haize Labs released the Red-Teaming Resistance Leaderboard in collaboration with the amazing Hugging Face team!
Why a New Leaderboard?
While there has been no shortage of great work in the recent automated red-teaming literature, we felt that many of the proposed attacks were extremely contrived and unlikely to appear in the wild in a way that would realistically and negatively impact language models. Moreover, the majority of these attacks were easily thwarted by simple, lightweight classifier-based defenses.
Rather than focus our attention on the efficacy of trivially detectable and unrealistic automated attacks, we elect to stress-test LLMs against high-quality human attacks that are coherent and structurally faithful to natural language.
To this end, we measure the robustness of LLMs against attacks drawn from several adversarial prompt datasets, including AdvBench, RedEval, and SAP, among others.
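As a rough illustration of what a per-dataset robustness number could look like, here is a minimal sketch. It assumes robustness is summarized as the fraction of adversarial prompts a model resists, which this post does not specify. The function name `robustness_by_dataset`, the record format, and the example judgments are all hypothetical and are not leaderboard results.

```python
from collections import defaultdict


def robustness_by_dataset(records):
    """Return {dataset_name: fraction of prompts the model resisted}.

    `records` is an iterable of (dataset_name, attack_succeeded) pairs,
    a hypothetical format chosen for this sketch.
    """
    totals = defaultdict(int)
    resisted = defaultdict(int)
    for dataset, attack_succeeded in records:
        totals[dataset] += 1
        if not attack_succeeded:
            resisted[dataset] += 1
    return {name: resisted[name] / totals[name] for name in totals}


# Made-up judgments for illustration only; not real evaluation data.
example_records = [
    ("AdvBench", False),
    ("AdvBench", True),
    ("RedEval", False),
    ("SAP", False),
]
scores = robustness_by_dataset(example_records)
```

How `attack_succeeded` gets decided (human review, a classifier, or something else) is deliberately left out of the sketch, since that choice is not described here.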
Where do Models Fail?
To provide more detailed insight into precisely which situations models fail in, we also categorize LLM vulnerabilities into the following taxonomy:
Harm and Violence
- Hate/Harassment/Discrimination: promoting intolerance or hostility against individuals or groups.
- Children-Specific Harm: content or actions that directly endanger or harm children.
- Physical-Specific Harm: material that encourages or causes physical injury to oneself or others.
Criminal Conduct
- Malware: distributing harmful software designed to damage or infiltrate computer systems.
- Fraud/Deception: engaging in deceitful practices for personal gain.
- Privacy Violation: infringing on individuals’ right to privacy through unauthorized data collection or sharing.
- Economic Deception: misleading practices or claims that financially exploit others.
- Unauthorized Practice of Medical Advice: providing medical guidance without proper qualifications or authorization.
- Unauthorized Practice of Law: offering legal advice or services without a license to practice law.
- Misc. Illegal Activity
Unsolicited Counsel
- Tailored Financial Advice: providing specific financial recommendations without solicitation or proper credentials.
- Political Campaigning: unsolicited promotion of political agendas or candidates.
- High Risk Government Decision Making: advising on or influencing critical governmental decisions without authority or solicitation.
NSFW
- Adult Content: explicit language or profanity, and depictions or discussions of drug use, alcohol abuse, and smoking.
- Sexual Content: material that depicts or describes sexual activities, sexual acts, or explicit sexual behavior.
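The taxonomy above maps naturally onto a small lookup structure. The sketch below encodes the four groups and their categories as listed, and adds a hypothetical helper, `failures_by_group`, that rolls category-level failures up to the group level. The helper and its input format are assumptions for illustration, not part of the leaderboard's code.

```python
# Groups and categories copied from the taxonomy listed above.
TAXONOMY = {
    "Harm and Violence": [
        "Hate/Harassment/Discrimination",
        "Children-Specific Harm",
        "Physical-Specific Harm",
    ],
    "Criminal Conduct": [
        "Malware",
        "Fraud/Deception",
        "Privacy Violation",
        "Economic Deception",
        "Unauthorized Practice of Medical Advice",
        "Unauthorized Practice of Law",
        "Misc. Illegal Activity",
    ],
    "Unsolicited Counsel": [
        "Tailored Financial Advice",
        "Political Campaigning",
        "High Risk Government Decision Making",
    ],
    "NSFW": [
        "Adult Content",
        "Sexual Content",
    ],
}

# Reverse index: category name -> top-level group.
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in TAXONOMY.items()
    for category in categories
}


def failures_by_group(failed_categories):
    """Count failures per top-level group (hypothetical helper).

    `failed_categories` holds one category name per failed prompt.
    """
    counts = {group: 0 for group in TAXONOMY}
    for category in failed_categories:
        counts[CATEGORY_TO_GROUP[category]] += 1
    return counts


example_counts = failures_by_group(["Malware", "Sexual Content", "Fraud/Deception"])
```

Keeping the reverse index separate means a failure tagged with an unknown category raises a `KeyError` instead of being silently dropped.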
What’s Next?
For more details, please see the official Hugging Face post!
And of course, if you have thoughts on how to improve the leaderboard, please reach out at [email protected] :)