Model Leaderboard
Below is the leaderboard of all models on the site, ordered by Elo rating.
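The page does not say how its Elo ratings are produced or what counts as a "match" between a model and its tests. For readers new to the scale, here is a minimal sketch of the textbook Elo expected-score and update rule. It assumes the standard 400-point logistic scale and a K-factor of 32. It is not the site's own method, and the function names are illustrative.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo: expected score of A against B on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k = 32 is an assumed, commonly used K-factor.
    """
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# How a rating gap reads under this model (only valid if the site's ratings follow it).
print(round(expected_score(986, 551), 3))
print(update(1000, 1000, 1.0))
```

Under this model, `expected_score(986, 551)` is roughly 0.92. That reading only holds if the leaderboard's ratings actually follow the standard Elo model, which the page does not confirm.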
The first eight columns (Rank through # Jailbreaks) are the main stats. The remaining fourteen columns give resilience by request type.

| Rank | Model | Company | Country | Elo | Resilience | # Tests | # Jailbreaks | Violent Crimes | Non-Violent Crimes | Sex Crimes | Child Exploitation | Defamation | Specialized Advice | Privacy | Intellectual Property | Indiscriminate Weapons | Hate | Self-Harm | Sexual Content | Elections | Code Interpreter Abuse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | gpt-oss-20b | OpenAI | 🇺🇸 | 986 | 98% | 2148 | 51 | 98% | 100% | 94% | 97% | 100% | 100% | 97% | 98% | 99% | 96% | 96% | 99% | 94% | 99% |
| 🥈 | gpt-oss-120b | OpenAI | 🇺🇸 | 970 | 98% | 2097 | 40 | 99% | 98% | 98% | 97% | 99% | 99% | 98% | 98% | 98% | 99% | 97% | 99% | 97% | 97% |
| 🥉 | kimi-k2.5 (new) | Moonshot AI | 🇨🇳 | 911 | 97% | 873 | 30 | 92% | 97% | 96% | 98% | 98% | 97% | 96% | 99% | 100% | 100% | 96% | 100% | 89% | 95% |
| 4 | qwen3-235b-a22b-instruct-2507 | Alibaba | 🇨🇳 | 893 | 91% | 2166 | 195 | 88% | 95% | 90% | 94% | 91% | 99% | 89% | 90% | 92% | 94% | 95% | 97% | 73% | 93% |
| 5 | qwen3-8b (new) | Alibaba | 🇨🇳 | 796 | 78% | 1041 | 228 | 81% | 88% | 61% | 67% | 77% | 86% | 62% | 81% | 92% | 81% | 88% | 97% | 66% | 66% |
| 6 | qwen3-32b | Alibaba | 🇨🇳 | 793 | 83% | 2205 | 383 | 84% | 81% | 76% | 88% | 81% | 90% | 72% | 83% | 93% | 88% | 94% | 91% | 61% | 79% |
| 7 | kimi-k2-instruct-0905 | Moonshot AI | 🇨🇳 | 680 | 75% | 2265 | 563 | 63% | 80% | 66% | 80% | 77% | 83% | 66% | 70% | 82% | 88% | 80% | 85% | 61% | 74% |
| 8 | mistral-small-3.2-24b-instruct-2506 | Mistral | 🇫🇷 | 614 | 62% | 2046 | 777 | 47% | 70% | 54% | 69% | 56% | 74% | 50% | 62% | 75% | 79% | 73% | 57% | 42% | 64% |
| 9 | mistral-nemo-instruct-2407 | Mistral / Nvidia | 🇫🇷 | 551 | 57% | 2085 | 898 | 49% | 54% | 50% | 73% | 56% | 53% | 53% | 49% | 47% | 83% | 74% | 46% | 59% | 47% |
Note: All statistics on this page were judged by Qwen3-32B. Because Qwen3-32B is not a perfect judge, these figures approximate LLM jailbreak resilience rather than measure it exactly.
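The page does not define the overall Resilience column. However, every row in the table is consistent with resilience = (# Tests − # Jailbreaks) / # Tests, rounded to the nearest percent. The sketch below recomputes that column from the two count columns under this assumed definition. The names `LEADERBOARD` and `resilience` are illustrative, not part of the site.

```python
# Overall figures copied from the leaderboard: (# Tests, # Jailbreaks, reported Resilience %).
LEADERBOARD = {
    "gpt-oss-20b": (2148, 51, 98),
    "gpt-oss-120b": (2097, 40, 98),
    "kimi-k2.5": (873, 30, 97),
    "qwen3-235b-a22b-instruct-2507": (2166, 195, 91),
    "qwen3-8b": (1041, 228, 78),
    "qwen3-32b": (2205, 383, 83),
    "kimi-k2-instruct-0905": (2265, 563, 75),
    "mistral-small-3.2-24b-instruct-2506": (2046, 777, 62),
    "mistral-nemo-instruct-2407": (2085, 898, 57),
}


def resilience(tests: int, jailbreaks: int) -> float:
    """Assumed definition: percentage of tests that did not end in a jailbreak."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return 100.0 * (tests - jailbreaks) / tests


for model, (tests, jailbreaks, reported) in LEADERBOARD.items():
    computed = resilience(tests, jailbreaks)
    status = "matches" if round(computed) == reported else "differs from"
    print(f"{model}: {computed:.2f}% {status} reported {reported}%")
```

Under this definition all nine rows reproduce the reported overall figure. The per-category columns cannot be checked the same way, because the page does not list test or jailbreak counts per request type.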