LLMs for pentesting: speed isn't everything

One of the hottest topics in cybersecurity this year is the use of LLMs for pentesting. News articles claim that new models can quickly find numerous vulnerabilities, giving “meat” pentesters sleepless nights.

However, the devil is in the details. For example, using cloud-based LLMs for such purposes can be severely limited, both by their owners (who fear malicious use of their models) and by users (who fear leaking their secrets through cloud services).

But what about using local LLMs for pentesting? There are pitfalls here as well.

Our expert Ahmed Khlief analyzed how different local versions of popular LLMs (GLM, Qwen, GPT OSS, Gemma) solve the problem of finding vulnerabilities.

For the test, a custom web application was developed with several vulnerabilities that allow SQL injection. The locally deployed models had to identify the web application’s endpoints and detect the vulnerabilities using only their own reasoning and knowledge: they had no access to RAG or internet search, but could use some MCP tools, such as Chrome DevTools.
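For readers less familiar with this class of bug, here is a minimal sketch of what an SQL-injectable lookup can look like. It is not the application used in the study: the table, the function names and the payload are all hypothetical and chosen only for illustration.

```python
import sqlite3


def make_db():
    # Hypothetical in-memory database with a single "users" table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, secret TEXT)"
    )
    conn.executemany(
        "INSERT INTO users (name, secret) VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_vulnerable(conn, name):
    # Vulnerable: user input is concatenated directly into the SQL text,
    # so quotes in the input can change the structure of the query.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer: the input is passed as a bound parameter and is treated
    # purely as data, never as part of the SQL statement.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


if __name__ == "__main__":
    conn = make_db()
    payload = "' OR '1'='1"  # classic injection string (illustrative)
    print("vulnerable:", find_user_vulnerable(conn, payload))
    print("safe:      ", find_user_safe(conn, payload))
```

With the injected payload, the vulnerable function returns every row in the table, while the parameterized version returns nothing. A difference in behavior of this kind is what a pentesting agent would be expected to notice and report.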

The main result of the study: inference speed alone is not a reliable indicator of real-world AI agent performance. Many other factors strongly affect overall task completion and result quality, including the agent’s ability to follow instructions precisely. Some models are known to do things they were not asked to do, which can be dangerous, as can insecure handling of MCP tools.

And here are the models that performed best in pentesting:

— Best overall model: GLM-4.7-Flash-UD-Q8_K_XL

— Best speed/performance ratio: Qwen3.5-35B-A3B-Q8_0

— Best vulnerability detail quality: Qwen3.6-27B-UD-Q8_K_XL

— Best small-sized model: Qwen3.5-9B-UD-Q8_K_XL

Read the details of this research in Ahmed Khlief’s article “Pentesting by AI: Local LLMs Benchmark”.
