RSAC’s research team ran the largest-scale knowledge-based benchmarking study in the cybersecurity domain that pitted humans against LLMs, and the LLMs won:
- LLMs outperformed human experts in all 21 topics
- Just three of the 39 LLMs we tested had a higher failure rate than the human baseline
- But this doesn’t mean LLMs can replace human cybersecurity experts; instead, experienced professionals can use them to quickly re-learn things they used to know but can no longer recall
At RSAC 2025 Conference, we ran an experiment: we challenged cybersecurity professionals to pit their domain knowledge against a battery of 39 different LLMs in a game we called “AI Showdown.” In all, 279 attendees answered the call, submitting a total of 2,439 answers to our difficulty-calibrated questions. The results were sobering: the LLMs outperformed the humans across all 21 topical categories.[i] Nearly half of RSAC Conference attendees boast 10 or more years of cybersecurity experience, so they’re hardly novices.[ii] Nonetheless, the human experts’ best average performance (a failure rate of just 19% in the “Law” category) couldn’t match the LLMs’ worst effort (a 17% failure rate for “Open Source Tools”).
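For readers who want to reproduce this kind of per-category comparison on their own data, here is a minimal sketch of how a failure rate per topic could be computed from graded answers. The record layout (a category name paired with a correct/incorrect flag), the function name, and the sample answers are all assumptions made for illustration; they are not the study's data or tooling.

```python
from collections import defaultdict


def failure_rate_by_category(answers):
    """Return {category: fraction of answers that were wrong}.

    `answers` is an iterable of (category, is_correct) pairs. This layout
    is a hypothetical one chosen for the example.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, is_correct in answers:
        totals[category] += 1
        if not is_correct:
            failures[category] += 1
    return {category: failures[category] / totals[category] for category in totals}


# Made-up sample answers, NOT results from the study.
sample_answers = [
    ("Law", True),
    ("Law", True),
    ("Law", False),
    ("Law", True),
    ("Open Source Tools", False),
    ("Open Source Tools", True),
]

rates = failure_rate_by_category(sample_answers)
print(rates)
```

Comparing the lowest value in the human table against the highest value in the LLM table is then a one-line check with `min()` and `max()`.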
Of the 39 LLMs that participated in our evaluation, just three had a failure rate higher than the baseline human failure rate of nearly 33%. So if you’re going to compete against an LLM in a cybersecurity multiple-choice question competition, make sure your opponent is one of those three (qwen2-0.5b-instruct, qwen1.5-0.5b-chat, and llama-7b). Prior to this research, we would have expected the smallest models to perform the worst, but llama-7b had the highest failure rate of the 39 models we tested, despite being in our “medium-sized” group (between four and 15 billion parameters).[iii]
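The comparison against the human baseline amounts to a simple filter. The sketch below assumes failure rates are stored as fractions keyed by model name; the model names and their rates are invented placeholders, and only the roughly 33% human baseline comes from the text above.

```python
# Human baseline failure rate, rounded from the "nearly 33%" reported above.
HUMAN_BASELINE = 0.33

# Hypothetical model names and failure rates, for illustration only.
model_failure_rates = {
    "example-small-model": 0.41,
    "example-medium-model": 0.36,
    "example-large-model": 0.12,
}


def models_worse_than_baseline(rates, baseline):
    """Return names of models whose failure rate exceeds the baseline, worst first."""
    worse = [name for name, rate in rates.items() if rate > baseline]
    return sorted(worse, key=lambda name: rates[name], reverse=True)


worse_models = models_worse_than_baseline(model_failure_rates, HUMAN_BASELINE)
print(worse_models)
```

Note that the comparison is strict, so a model that exactly ties the baseline is not counted as worse.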
But humans shouldn’t hang up their cybersecurity tools in despair just yet. The “AI Showdown” game is a set of multiple-choice questions, which test recognition of the single correct answer among four options rather than the ability to recall the answer from memory. And pattern recognition is exactly what LLMs, having been trained on vast volumes of human-generated knowledge, excel at. When one poses the same questions in open-ended format, LLM accuracy drops substantially. And the real-world problems that cybersecurity practitioners must solve are seldom as clear-cut as well-designed multiple-choice questions.
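One way to see why a recognition format is more forgiving than recall: a four-option question with a single correct answer can be passed by luck alone. The sketch below computes the failure rate of uniform random guessing. It is a general property of the question format, not a figure from the study, and the function name is our own.

```python
from fractions import Fraction


def chance_failure_rate(num_options, num_correct=1):
    """Failure rate of uniform random guessing on one multiple-choice question."""
    if not 0 < num_correct <= num_options:
        raise ValueError("num_correct must be between 1 and num_options")
    return 1 - Fraction(num_correct, num_options)


# Four options with one correct answer, as in the game format described above.
four_option_floor = chance_failure_rate(4)
print(float(four_option_floor))  # 0.75
```

An open-ended question has no comparable floor, because there is no short list of candidate answers to guess from.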
In practice, LLMs offer a superior solution to a perennial problem for experienced cybersecurity professionals: vanishingly few of them have perfect recall. LLMs, however, provide an extremely good approximation of perfect recall, so instead of searching the internet for details they used to know but have forgotten, cybersecurity experts can just ask their favorite (search-capable) LLM![iv]
_____________________________________________________________________________________________
[i] With 625 questions across 21 subtopics, ours is the largest-scale knowledge-based benchmarking study in the cybersecurity domain, and the 279 participants make this the largest human baseline in this domain.
[ii] 49% of RSAC Conference attendees report having 10 or more years of industry experience. Source: RSAC 2026 Conference. Note also that our 279 study participants represent a convenience sample of RSAC 2025 Conference attendees who chose voluntarily to play the game; they’re not a random sample of cybersecurity practitioners.
[iii] Note that, despite the name, llama-7b is a first-generation Llama model.
[iv] With the caveat that if more “wrong” than “correct” answers exist in the LLM’s training data, it’ll provide you with the wrong answer.