Model Leaderboard

Below is the leaderboard of all models on the site, ordered by ELO rating from highest to lowest.

Main Stats									Resilience by Request Type
Rank	Model	Company	Country	ELO	Resilience	# Tests	# Jailbreaks	Violent Crimes	Non-Violent Crimes	Sex Crimes	Child Exploitation	Defamation	Specialized Advice	Privacy	Intellectual Property	Indiscriminate Weapons	Hate	Self-Harm	Sexual Content	Elections	Code Interpreter Abuse
🥇	gpt-oss-20b	OpenAI	🇺🇸	986	98%	2148	51	98%	100%	94%	97%	100%	100%	97%	98%	99%	96%	96%	99%	94%	99%
🥈	gpt-oss-120b	OpenAI	🇺🇸	970	98%	2097	40	99%	98%	98%	97%	99%	99%	98%	98%	98%	99%	97%	99%	97%	97%
🥉	kimi-k2.5 (new)	Moonshot AI	🇨🇳	911	97%	873	30	92%	97%	96%	98%	98%	97%	96%	99%	100%	100%	96%	100%	89%	95%
4	qwen3-235b-a22b-instruct-2507	Alibaba	🇨🇳	893	91%	2166	195	88%	95%	90%	94%	91%	99%	89%	90%	92%	94%	95%	97%	73%	93%
5	qwen3-8b (new)	Alibaba	🇨🇳	796	78%	1041	228	81%	88%	61%	67%	77%	86%	62%	81%	92%	81%	88%	97%	66%	66%
6	qwen3-32b	Alibaba	🇨🇳	793	83%	2205	383	84%	81%	76%	88%	81%	90%	72%	83%	93%	88%	94%	91%	61%	79%
7	kimi-k2-instruct-0905	Moonshot AI	🇨🇳	680	75%	2265	563	63%	80%	66%	80%	77%	83%	66%	70%	82%	88%	80%	85%	61%	74%
8	mistral-small-3.2-24b-instruct-2506	Mistral	🇫🇷	614	62%	2046	777	47%	70%	54%	69%	56%	74%	50%	62%	75%	79%	73%	57%	42%	64%
9	mistral-nemo-instruct-2407	Mistral / Nvidia	🇫🇷	551	57%	2085	898	49%	54%	50%	73%	56%	53%	53%	49%	47%	83%	74%	46%	59%	47%

Note: All statistics on this page are as judged by Qwen3-32B. Because Qwen3-32B is not a perfect judge, these figures are a close approximation of LLM jailbreak resilience rather than an exact measurement.
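
The page does not say how the Resilience column is calculated. As a rough check, the sketch below assumes it is the share of tests that did not end in a jailbreak, (# Tests - # Jailbreaks) / # Tests, rounded to a whole percent. That formula is an assumption rather than something stated on this page, although it reproduces the published figure for every row of the table. The name resilience_pct is illustrative only.

```python
# (model, # Tests, # Jailbreaks, published Resilience %) copied from the table.
ROWS = [
    ("gpt-oss-20b", 2148, 51, 98),
    ("gpt-oss-120b", 2097, 40, 98),
    ("kimi-k2.5", 873, 30, 97),
    ("qwen3-235b-a22b-instruct-2507", 2166, 195, 91),
    ("qwen3-8b", 1041, 228, 78),
    ("qwen3-32b", 2205, 383, 83),
    ("kimi-k2-instruct-0905", 2265, 563, 75),
    ("mistral-small-3.2-24b-instruct-2506", 2046, 777, 62),
    ("mistral-nemo-instruct-2407", 2085, 898, 57),
]


def resilience_pct(tests: int, jailbreaks: int) -> int:
    """Assumed definition: percent of tests that were not jailbroken."""
    return round(100 * (tests - jailbreaks) / tests)


for model, tests, jailbreaks, published in ROWS:
    computed = resilience_pct(tests, jailbreaks)
    # Flag any row where the assumed formula disagrees with the table.
    status = "matches" if computed == published else "differs"
    print(f"{model}: computed {computed}%, published {published}% ({status})")
```

Under this assumed definition, the overall Resilience figure is driven entirely by the two count columns, so models with fewer tests (the two marked new) have less data behind their overall percentage.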