Model Leaderboard

The leaderboard below lists every model on the site, ordered by ELO from highest to lowest.

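Main Stats

| Rank | Model | Company | Country | ELO | Resilience | # Tests | # Jailbreaks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 | gpt-oss-20b | OpenAI | 🇺🇸 | 910 | 98% | 1125 | 21 |
| 🥈 | gpt-oss-120b | OpenAI | 🇺🇸 | 897 | 99% | 1071 | 13 |
| 🥉 | qwen3-235b-a22b-instruct-2507 | Alibaba | 🇨🇳 | 880 | 93% | 1086 | 74 |
| 4 | qwen3-32b | Alibaba | 🇨🇳 | 822 | 85% | 1089 | 166 |
| 5 | kimi-k2-instruct-0905 | Moonshot AI | 🇨🇳 | 779 | 79% | 1194 | 246 |
| 6 | mistral-small-3.2-24b-instruct-2506 | Mistral | 🇫🇷 | 659 | 62% | 1008 | 379 |
| 7 | mistral-nemo-instruct-2407 | Mistral / Nvidia | 🇫🇷 | 651 | 62% | 1074 | 407 |

Resilience by Request Type

| Rank | Model | Violent Crimes | Non-Violent Crimes | Sex Crimes | Child Exploitation | Defamation | Specialized Advice | Privacy | Intellectual Property | Indiscriminate Weapons | Hate | Self-Harm | Sexual Content | Elections | Code Interpreter Abuse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 | gpt-oss-20b | 97% | 100% | 95% | 97% | 100% | 100% | 99% | 98% | 100% | 97% | 96% | 100% | 95% | 100% |
| 🥈 | gpt-oss-120b | 99% | 99% | 100% | 98% | 100% | 100% | 99% | 99% | 99% | 99% | 97% | 99% | 100% | 97% |
| 🥉 | qwen3-235b-a22b-instruct-2507 | 86% | 92% | 96% | 97% | 96% | 99% | 94% | 96% | 94% | 94% | 95% | 95% | 80% | 94% |
| 4 | qwen3-32b | 82% | 81% | 85% | 90% | 85% | 93% | 79% | 78% | 93% | 93% | 92% | 88% | 68% | 79% |
| 5 | kimi-k2-instruct-0905 | 68% | 83% | 72% | 83% | 84% | 89% | 70% | 75% | 89% | 89% | 84% | 81% | 67% | 75% |
| 6 | mistral-small-3.2-24b-instruct-2506 | 51% | 70% | 61% | 73% | 58% | 72% | 52% | 61% | 80% | 77% | 73% | 49% | 39% | 64% |
| 7 | mistral-nemo-instruct-2407 | 57% | 58% | 60% | 77% | 64% | 59% | 64% | 59% | 58% | 78% | 73% | 42% | 65% | 52% |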
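
The page does not say how the Resilience figure is derived. One reading that matches the published numbers after rounding is the share of tests that did not end in a jailbreak. The sketch below checks that assumption against the # Tests and # Jailbreaks values from the leaderboard; the function name `resilience_pct` and the formula itself are illustrative assumptions, not the site's documented method.

```python
# (model, tests, jailbreaks, published resilience %) copied from the leaderboard.
ROWS = [
    ("gpt-oss-20b", 1125, 21, 98),
    ("gpt-oss-120b", 1071, 13, 99),
    ("qwen3-235b-a22b-instruct-2507", 1086, 74, 93),
    ("qwen3-32b", 1089, 166, 85),
    ("kimi-k2-instruct-0905", 1194, 246, 79),
    ("mistral-small-3.2-24b-instruct-2506", 1008, 379, 62),
    ("mistral-nemo-instruct-2407", 1074, 407, 62),
]


def resilience_pct(tests: int, jailbreaks: int) -> float:
    """Assumed definition: percentage of tests that did not end in a jailbreak."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return 100.0 * (tests - jailbreaks) / tests


for model, tests, jailbreaks, published in ROWS:
    computed = resilience_pct(tests, jailbreaks)
    # Compare the unrounded value with the whole-percent figure on the page.
    print(f"{model}: computed {computed:.1f}%, published {published}%")
```

Under this assumption the two top models swap places on Resilience alone (gpt-oss-120b is slightly higher), which is consistent with the leaderboard being ordered by ELO rather than by Resilience.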

Note: All statistics on this page are as judged by Qwen3-32B. Qwen3-32B is not a perfect judge, so these figures are a close approximation of LLM jailbreak resilience rather than an exact measurement.