Local LLMs are advancing rapidly, with new releases delivering significantly improved performance while requiring less memory. In many cases, these newer models are approaching—or even matching—the capabilities of larger, older models, making them increasingly practical for on-premise deployment.
While cloud-based LLMs remain more powerful overall, concerns around exposing sensitive data limit their use in critical or regulated environments. As a result, investing in local LLM solutions has become an important strategic consideration for organizations seeking to leverage AI while maintaining full control over their data.
In this article, I present a comparative analysis of multiple LLMs across different quantization formats, providing practical insights into their performance, trade-offs, and expected behavior in real-world penetration testing scenarios. Additionally, I outline recommended configuration parameters for LLM inference engines (llama.cpp) to help optimize results.
Key Takeaways#
Best overall model: GLM-4.7-Flash-UD-Q8_K_XL
Best speed/performance ratio: Qwen3.5-35B-A3B-Q8_0
Best vulnerability detail quality: Qwen3.6-27B-UD-Q8_K_XL
Best small-sized model: Qwen3.5-9B-UD-Q8_K_XL (this model outperformed OpenAI GPT-OSS-120B in vulnerability identification accuracy)
Best MCP: Chrome DevTools MCP performed better than Burp MCP
Test Lab Infrastructure#
The tests were conducted on a desktop system with the following specifications:
RAM: 48 GB
GPU: RTX 3090 Ti + RTX 3090 (48 GB VRAM total)
CPU: AMD Ryzen™ 9 5900X (24 threads)
OS: Ubuntu 24.04.3 LTS
LLM Inference Engines: llama.cpp
Test Case#
The main objective of this test case is to evaluate whether local LLMs can identify web application endpoints and perform SQL injection testing to uncover vulnerabilities using only their own reasoning and embedded knowledge. The goal is to compare different local models and determine which one is better suited for penetration testing tasks and demonstrates stronger autonomous pentesting capabilities.
To ensure a fair evaluation, I developed a custom web application with multiple SQL injection vulnerabilities. Pre-built vulnerable applications were intentionally avoided to eliminate the possibility that the models had prior knowledge of them, which could skew the results and not accurately reflect their true pentesting skills.
The prompt used in this evaluation is available here
The test web application contains five intentionally vulnerable endpoints affected by SQL injection vulnerabilities:
/login -> SQL injection allows authentication bypass
/users -> SQL injection in search parameter exposes user data
/profile/<user_id> -> SQL injection allows access to any user profile
/products -> SQL injection via category and sort parameters
/orders -> SQL injection allows viewing all orders
This test web application is intended for educational and research purposes only. It should be deployed exclusively within an isolated lab environment, as it is intentionally vulnerable and not designed for production use.
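To make the first item in the list above concrete, here is a minimal sketch of how an authentication-bypass SQL injection typically arises and how a parameterized query prevents it. The source of the test application is not shown in this article, so the `users` table, its columns, and both login functions are hypothetical and only illustrate the general class of bug.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one hypothetical user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('admin', 's3cret')")
    return conn


def login_vulnerable(conn: sqlite3.Connection, username: str, password: str) -> bool:
    # User input is concatenated into the SQL text, so it can change the query logic.
    query = (
        "SELECT 1 FROM users WHERE username = '" + username
        + "' AND password = '" + password + "'"
    )
    return conn.execute(query).fetchone() is not None


def login_safe(conn: sqlite3.Connection, username: str, password: str) -> bool:
    # Placeholders keep user input as data, never as SQL syntax.
    query = "SELECT 1 FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone() is not None


conn = make_db()
payload = "' OR '1'='1"  # classic always-true condition
print("vulnerable login accepts payload:", login_vulnerable(conn, "admin", payload))
print("safe login accepts payload:", login_safe(conn, "admin", payload))
```

In the vulnerable version the payload turns the `WHERE` clause into a condition that is always true, which is the kind of behavior the models were expected to discover on their own.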
The test environment consisted of the following components:
The LLM was given access to MCP tools for penetration testing, such as Chrome DevTools MCP, as well as terminal access to execute any command.
The LLM was given access to MCP servers for storing and retrieving findings.
No further information, web search, or RAG access was provided; the LLMs relied solely on their own knowledge.
The test script allowed the LLM to iteratively retry tasks in a loop, with access to a Chrome MCP server and terminal capabilities to execute any required commands.
The test terminated if the LLM exceeded 200 tool calls, reached a 150K context limit, or explicitly ended the session using the session-end tool.
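The stop conditions described above can be sketched as a small check that a test script might run after every turn. The actual script is not shown in this article, so the `SessionState` fields, the `stop_reason` function, and the `session_end` tool name are assumptions; only the two numeric limits come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MAX_TOOL_CALLS = 200             # limit stated in the article
MAX_CONTEXT_TOKENS = 150_000     # limit stated in the article
SESSION_END_TOOL = "session_end"  # hypothetical tool name


@dataclass
class SessionState:
    tool_calls: int = 0
    context_tokens: int = 0
    last_tool: Optional[str] = None


def stop_reason(state: SessionState) -> Optional[str]:
    """Return why the session should stop, or None to keep looping."""
    if state.last_tool == SESSION_END_TOOL:
        return "model ended the session"
    if state.tool_calls > MAX_TOOL_CALLS:
        return "tool-call limit exceeded"
    if state.context_tokens >= MAX_CONTEXT_TOKENS:
        return "context limit reached"
    return None


print(stop_reason(SessionState(tool_calls=201, context_tokens=40_000)))
```

Keeping the check in one function makes it easy to log which limit ended a run, which matters when comparing models that stop for different reasons.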
The evaluation was divided into two separate phases:
Endpoint discovery phase
Vulnerability identification phase
The completion time, token usage, and number of interaction turns reported in this article were measured during the vulnerability identification phase only, after the application endpoints had already been identified successfully by the LLM in a separate session.
Each test session started with an initial 10,000-token context containing all required MCP configurations, tools, and system prompts.
Models Tested#
The following GGUF models were included in this evaluation:
GLM-4.7-Flash-UD-Q8_K_XL
GLM-4.7-Flash-UD-Q5_K_XL
GLM-4.7-Flash-Q4_K_M
Qwen3.5-35B-A3B-UD-Q8_K_XL
Qwen3.5-35B-A3B-UD-Q4_K_XL
Qwen3.5-27B-UD-Q8_K_XL
Qwen3.5-27B-UD-Q4_K_XL
Qwen3.5-9B-UD-Q8_K_XL
Qwen3.5-9B-UD-Q4_K_XL
Qwen3.6-35B-A3B-UD-Q8_K_XL
Qwen3.6-35B-A3B-UD-Q4_K_XL
Qwen3.6-27B-UD-Q8_K_XL
Qwen3.6-27B-UD-Q4_K_XL
Qwen3-Coder-Next-UD-Q5_K_XL
Qwen3-Coder-Next-UD-Q4_K_XL
gpt-oss-120b-F16
gemma-4-31B
All models were tested using Unsloth GGUF variants, which provide a reduced memory footprint while maintaining comparable — and in some cases improved — performance relative to the original releases.
Quantization Overview#
Q4 and Q8 quantization formats were used throughout the evaluation to cover medium-sized and large-sized variants of each model.
Quantization reduces the precision of a model's numeric weights (for example, from 16-bit to 4-bit) to speed up inference and lower memory usage, at the cost of a small loss in accuracy.
Precision refers to how detailed and exact those numbers are: higher precision means better accuracy but more resource usage.
A simple analogy: Q4 is like watching YouTube in 480p, and Q8 is like watching it in 1080p. Both work, but one gives a clearer picture.
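The precision loss can be seen in a toy round trip: map a few floating-point values to a small set of integers and back, then measure the error. This is a deliberately simplified symmetric scheme written for illustration; the GGUF formats used in this evaluation work block-wise with their own scale factors and are more elaborate. The sample `weights` are made-up values.

```python
def quantize_roundtrip(values, bits):
    """Quantize to signed `bits`-bit integers with one shared scale, then restore."""
    levels = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / levels
    quantized = [round(v / scale) for v in values]
    return [q * scale for q in quantized]


def max_error(values, bits):
    """Largest absolute difference between original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.82, -0.41, 0.07, -0.93, 0.55, 0.003]  # made-up sample values
print("4-bit max error:", max_error(weights, 4))
print("8-bit max error:", max_error(weights, 8))
```

With more bits there are more integer levels to choose from, so each value lands closer to its original, which is why the 8-bit error comes out smaller than the 4-bit one.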
Below is a simple comparison between Q4 and Q8.
| Feature | Q4 (4-bit) | Q8 (8-bit) |
| --- | --- | --- |
| Bits per value | 4 bits | 8 bits |
| Model size | Smaller (~50% of Q8) | Larger (about 2× Q4) |
| VRAM usage | Low | Medium to high |
| Speed | Faster | Slightly slower |
| Accuracy | Lower (more approximation) | Higher (closer to original) |
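The size relationship in the table follows from simple arithmetic: the weights take roughly (number of parameters × bits per value ÷ 8) bytes. The sketch below applies that to a 27B-parameter model. It is a back-of-the-envelope estimate only; real GGUF files are somewhat larger because they store per-block scale factors and keep some tensors at higher precision, and the KV cache needs additional memory on top. The function name is my own.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    print(f"27B model at {bits}-bit: ~{weight_size_gb(27, bits):.1f} GB")
```

On a 48 GB VRAM setup like the one used here, this kind of estimate gives a quick first indication of which model and quantization combinations are likely to fit entirely on the GPUs.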
Benchmark Results#
| Model | Quantization | Inference speed (tokens/s)* | Time taken to finish (minutes)* | Tokens used* | Number of turns* | Number of vulnerabilities identified (out of 5)* |
| --- | --- | --- | --- | --- | --- | --- |
| GLM-4.7-Flash | UD-Q8_K_XL | 80 | 2 | 26,051 | 40 | 5 |
| Qwen3.5-35B-A3B | Q8_0 | 96 | 2 | 33,307 | 51 | 5 |
| Qwen3.5-9B | UD-Q8_K_XL | 62 | 3 | 33,209 | 49 | 5 |
| Qwen3.5-35B-A3B | UD-Q8_K_XL | 83 | 4 | 54,862 | 42 | 5 |
| Qwen3.6-35B-A3B | UD-Q8_K_XL | 101 | 5 | 52,098 | 130 | 5 |
| Qwen3.6-35B-A3B | UD-Q4_K_XL | 83 | 5 | 81,817 | 89 | 5 |
| Qwen3.5-27B | UD-Q8_K_XL | 21 | 7 | 50,720 | 72 | 5 |
| Qwen3.6-27B | UD-Q8_K_XL | 25 | 16 | 56,740 | 74 | 5 |
| Qwen3.5-35B-A3B | UD-Q4_K_XL | 107 | 1 | 33,307 | 51 | 4 |
| Qwen3.5-9B | UD-Q4_K_XL | 95 | 2 | 66,797 | 54 | 4 |
| Qwen3.5-27B | UD-Q4_K_XL | 35 | 5 | 41,995 | 65 | 4 |
| Qwen3.5-122B-A10B | UD-Q4_K_XL | 22 | 11 | 42,097 | 41 | 4 |
| GLM-4.7-Flash | Q4_K_M | 74 | 2 | 40,470 | 37 | 3 |
| gpt-oss-120b | F16 | 31 | 4 | 22,344 | 20 | 3 |
| Qwen3-Coder-Next | UD-Q5_K_XL | 52 | 4 | 36,421 | 66 | 3 |
| gemma-4-31B | Q4_K_M | 21 | 4 | 50,752 | 40 | 3 |
| GLM-4.7-Flash | UD-Q5_K_XL | 69 | 5 | 54,250 | 59 | 3 |
| Qwen3.6-27B | UD-Q4_K_XL | 29 | 24 | 71,612 | 136 | 3 |
Inference speed: How fast the model generates output, measured in tokens per second.
Time taken to finish (minutes): Total time from task start to completion.
Tokens used: Total input + output tokens consumed across the entire task.
Number of turns: How many model request/response cycles were required to complete the task.
Number of Vulnerabilities Identified (out of 5): How many of the 5 known vulnerabilities the agent successfully found.
Note: All benchmark results presented in this article represent the best-performing run out of five executions per model. The purpose of selecting the best run was to evaluate the maximum practical capability of each model under stable conditions, rather than measuring average consistency. Actual results may vary depending on prompt structure, llama.cpp build version, inference configuration, and tool-calling stability.
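The difference between reporting the best run and reporting an average can be sketched as follows. The article does not state how "best" was ranked, so the ordering used here (most vulnerabilities found, then shortest time, then fewest tokens) is an assumption, and the five runs are invented numbers, not measured data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    vulns_found: int
    minutes: float
    tokens: int


def best_run(runs):
    # Assumed ranking: most vulnerabilities, then shortest time, then fewest tokens.
    return min(runs, key=lambda r: (-r.vulns_found, r.minutes, r.tokens))


# Hypothetical results for five runs of one model (not measured data).
runs = [
    Run(3, 4, 41_000),
    Run(5, 6, 52_000),
    Run(4, 3, 38_000),
    Run(5, 5, 60_000),
    Run(2, 2, 20_000),
]

print("best run:", best_run(runs))
print("average vulnerabilities found:", mean(r.vulns_found for r in runs))
```

A best-of-five figure shows what a model can achieve, while the average hints at how often it gets there, which is why the two should not be compared directly.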
Top 5 Models#
GLM-4.7-Flash-UD-Q8_K_XL: This model demonstrated the best instruction-following capabilities, consistently reaching the intended objectives with the lowest token usage among the models that found all five vulnerabilities, while still delivering detailed and high-quality results. In addition, its strong coding capabilities significantly increased its overall value for technical and automation-focused tasks.
Qwen3.5-35B-A3B-Q8_0: This model achieved the best balance between speed, low token consumption, and overall performance, while still delivering reliable and high-quality results throughout the testing process.
Qwen3.6-27B-UD-Q8_K_XL: This model provided the most detailed vulnerability analysis and the clearest technical explanations. Although it was slower than other models, the quality of its vulnerability descriptions, evidence, and reasoning made it stand out for report-quality output.
Qwen3.5-9B-UD-Q8_K_XL: Despite being a relatively small 9B model, it successfully identified all vulnerabilities. While the depth of analysis and vulnerability descriptions were not comparable to larger models, it still managed to complete the objective successfully.
Qwen3.6-35B-A3B-UD-Q4_K_XL: This model demonstrated that lower-bit quantization can still deliver high-quality pentesting results. Despite using Q4 quantization, it successfully identified all vulnerable endpoints while maintaining strong inference speed, showing that reduced VRAM usage does not necessarily result in major capability loss for agentic penetration testing tasks.
Key Findings#
After testing multiple builds of llama.cpp throughout this evaluation, I found that build b8175 (d903f30e2) was the most stable and produced a noticeable difference in performance when running Qwen3-Coder-Next and GLM-4.7-Flash. Newer builds up to the latest (b8541, ded446b34) exhibited performance regressions, with inference results deviating from those produced by b8175. I tested identical llama.cpp parameters across multiple builds, and b8175 was the only version capable of reliably completing a full pentest session end-to-end.
llama.cpp version 8942 (f53577432) introduced a significant performance improvement in token generation (from 49 token/s to 80 token/s) for GLM-4.7-Flash-UD-Q8_K_XL. In addition, the model’s reasoning quality and inference behavior were noticeably improved.
I encountered tool-calling issues with the Qwen3.5 and Qwen3.6 models until I identified the correct chat template.
With Qwen3.5 models, if the prompt instructs the LLM to return output as JSON while tool calling is enabled, the model starts emitting tool calls as raw JSON text instead of using the OpenAI tool-calling syntax. The agent then treats them as regular output rather than tool invocations, which breaks the execution flow.
The GLM-4.7-Flash-UD-Q8_K_XL model stood out as the best local LLM for instruction following. Despite its lower inference speed, the model demonstrated strong consistency across multiple tests including real-world usage with OpenClaw and the RooCode plugin in VS Code.
Qwen3-Coder-Next performed well on pentesting tasks and excelled at coding, but fell short on tool calling consistency. I observed occasional failures both during pentest sessions and when using it via the RooCode plugin in VS Code.
Qwen3.6-35B-A3B delivered the best inference speed without compromising output quality.
Using the Chrome DevTools MCP proved significantly more efficient than relying on the Burp Suite MCP or granting the agent terminal access to run curl.
The llama.cpp build version has a significant impact on LLM inference results.
Higher inference speed does not necessarily mean the LLM will complete the task faster. I observed that slower models finished before faster ones, largely due to a more methodical approach and fewer mistakes along the way.
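The Qwen3.5 finding above, where tool calls leak into the response as plain JSON text, can be handled defensively on the agent side. The sketch below checks an OpenAI-style assistant message for a regular `tool_calls` entry first and, failing that, tries to interpret `content` as a leaked call. The exact JSON shape a model emits in this failure mode is an assumption (an object with `name` and `arguments`), and `recover_tool_call` and the `navigate_page` tool name are hypothetical.

```python
import json
from typing import Optional


def recover_tool_call(message: dict) -> Optional[dict]:
    """Return a tool call from the message, including one leaked into `content`."""
    calls = message.get("tool_calls") or []
    if calls:
        # Normal path: OpenAI-style tool call, arguments arrive as a JSON string.
        fn = calls[0]["function"]
        return {"name": fn["name"], "arguments": json.loads(fn["arguments"])}

    # Fallback: the model wrote the call as raw JSON text in `content`.
    content = (message.get("content") or "").strip()
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and "name" in data and isinstance(data.get("arguments"), dict):
        return {"name": data["name"], "arguments": data["arguments"]}
    return None


leaked = {"content": '{"name": "navigate_page", "arguments": {"url": "http://lab/login"}}'}
print(recover_tool_call(leaked))
```

A fallback like this only masks the symptom; in this evaluation the more reliable fix was using the correct chat template and avoiding JSON-output instructions while tool calling is enabled.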
LLM Configuration & Performance Results (llama.cpp)#
GLM-4.7-Flash-UD-Q8_K_XL#
Llama parameters:
./llama.cpp/llama-server --model GLM-4.7-Flash-UD-Q8_K_XL.gguf \
  --alias "unsloth/GLM_4.7_flash" --temp 0.6 --top-p 1.0 --min-p 0 \
  --repeat-penalty 1.0 --top-k 20 --port 8021 --ctx-size 100000 \
  --host 0.0.0.0 --fit on
Notes:
Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Followed instructions and presented the data in the intended professional format.
Vulnerabilities were described in a detailed and well-organized manner.
Best results with llama.cpp version: 8175 (d903f30e2).
Qwen3.5-27B-UD-Q4_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --alias "unsloth/Qwen3.5-27B-UD-Q4_K_XL.gguf" --temp 0.6 --top-p 1.0 \
  --min-p 0 --repeat-penalty 1.0 --top-k 20 --port 8021 \
  --ctx-size 100000 --host 0.0.0.0 --fit on --flash-attn on
Notes:
Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Best results with llama.cpp version: 8667 (c08d28d08).
Qwen3.5-35B-A3B-UD-Q4_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.5-35B-A3B-Q8_0.gguf \
  --alias "unsloth/Qwen3.5-35b" --host 0.0.0.0 --port 8021 \
  --temp 0.6 --top-p 1.0 --min-p 0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-size 100000 --fit on --jinja \
  --chat-template-file ../llama_tests/template.txt --flash-attn on \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Best results with llama.cpp version: 8667 (c08d28d08).
Qwen3.5-35B-A3B-Q8_0#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.5-35B-A3B-Q8_0.gguf \
  --alias "unsloth/Qwen3.5-35b-Q8_0" --host 0.0.0.0 --port 8080 \
  --temp 0.6 --top-p 1.0 --min-p 0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-size 100000 --fit on --jinja \
  --chat-template-file ../llama_tests/template.txt --flash-attn on \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Best results with llama.cpp version: 8809 (b1be68e8c).
Qwen3-Coder-Next-UD-Q5_K_XL#
Llama parameters:
./llama.cpp/llama-server \
  --model Qwen3-Coder-Next-UD-Q5_K_XL-00001-of-00003.gguf \
  --alias "unsloth/Qwen3-Coder-Next-UD-Q5_K_XL" --temp 0.6 --top-p 1.0 \
  --min-p 0 --repeat-penalty 1.0 --top-k 20 --port 8021 \
  --ctx-size 100000 --host 0.0.0.0 --fit on
Notes:
Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Best results with llama.cpp version: 8175 (d903f30e2).
Qwen3.5-122B-A10B-UD-Q4_K_XL#
Llama parameters:
./llama.cpp/llama-server \
  --model Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf \
  --alias "unsloth/Qwen3.5-122B-A10B-UD-Q4_K_XL" --temp 0.6 --top-p 1.0 \
  --min-p 0 --repeat-penalty 1.0 --top-k 20 --port 8021 \
  --ctx-size 100000 --host 0.0.0.0 --fit on --flash-attn on --jinja \
  --chat-template-file ../llama_tests/template.txt \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Best results with llama.cpp version: 8809 (b1be68e8c).
Identified all endpoints.
Identified 4/5 vulnerable endpoints.
GLM-4.7-Flash-UD-Q5_K_XL#
Llama parameters:
./llama.cpp/llama-server --model GLM-4.7-Flash-UD-Q5_K_XL.gguf \
  --alias "unsloth/GLM_4.7_flash-UD-Q5_K_XL" --temp 0.6 --top-p 1.0 \
  --min-p 0 --repeat-penalty 1.0 --top-k 20 --port 8021 \
  --ctx-size 100000 --host 0.0.0.0 --fit on
Notes:
Identified all endpoints, but also reported a nonexistent endpoint (one returning a Not Found status), which was not requested.
Identified 3/5 vulnerable endpoints.
Best results with llama.cpp version: 8175 (d903f30e2).
Qwen3.5-35B-A3B-UD-Q8_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf \
  --alias "unsloth/Qwen3.5-35b-UD-Q8_K_XL" --host 0.0.0.0 --port 8021 \
  --ctx-size 100000 --fit on --split-mode layer \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --parallel 1 \
  --temp 0.6 --top-p 0.95 --min-p 0.00 --top-k 20 --kv-offload \
  --jinja --chat-template-file ../llama_tests/template.txt \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Running it without this chat template causes tool-calling issues.
GLM_4.7_flash-Q4_K_M#
Llama parameters:
./llama.cpp/llama-server --model GLM-4.7-Flash-Q4_K_M.gguf \
  --alias "unsloth/GLM_4.7_flash-Q4_K_M" --temp 0.7 --top-p 1.0 \
  --min-p 0.01 --port 8021 --ctx-size 100000 --host 0.0.0.0 --fit on
Notes:
Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Best llama.cpp version tested with most detections: 8175 (d903f30e2).
gpt-oss-120b-f16#
Llama parameters:
./llama.cpp/llama-server --model gpt-oss-120b-F16.gguf \
  --alias "gpt-oss-120b-F16" --temp 0.6 --top-p 1.0 --min-p 0.0 \
  --top-k 0.0 -ngl -1 --port 8021 --ctx-size 100000 --host 0.0.0.0 \
  -fit on
Notes:
Identified all endpoints but also included duplicate endpoints.
Identified 3/5 vulnerable endpoints.
Qwen3.5-9B-UD-Q8_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.5-9B-UD-Q8_K_XL.gguf \
  --alias "unsloth/Qwen3.5-9B-UD-Q8_K_XL" --seed 3407 --temp 0.6 \
  --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 --top-k 40 --port 8021 \
  --ctx-size 100000 --host 0.0.0.0 --fit on --jinja \
  --chat-template-file ../llama_tests/template.txt --flash-attn on \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Did not follow the exact instructions but managed to provide a good result.
Qwen3.6-35B-A3B-UD-Q8_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  --alias "unsloth/Qwen3.6-35b-A3B-UD-Q8_K_XL" --host 0.0.0.0 --port 8021 \
  --temp 0.6 --top-p 1.0 --min-p 0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-size 100000 --fit on --jinja \
  --chat-template-file ../llama_tests/template.txt --flash-attn on \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Detailed output summary.
Running it without this chat template causes tool-calling issues.
Did not consistently follow the instruction to use only Chrome DevTools MCP and ended up using curl, which took more time.
Qwen3.6-35B-A3B-UD-Q4_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --alias "unsloth/Qwen3.6-35B-A3B-UD-Q4_K_XL" --host 0.0.0.0 --port 8021 \
  --temp 0.6 --top-p 1.0 --min-p 0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-size 100000 --fit on --jinja \
  --chat-template-file ../llama_tests/template.txt --flash-attn on \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Identified all endpoints.
Identified 5/5 vulnerable endpoints along with an out-of-scope vulnerability.
Running it without this chat template causes tool-calling issues.
Qwen3.5-27B-UD-Q8_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.5-27B-UD-Q8_K_XL.gguf \
  --alias "unsloth/Qwen3.5-27B-UD-Q8_K_XL" --host 0.0.0.0 --port 8021 \
  --temp 0.6 --top-p 1.0 --min-p 0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-size 100000 --fit on --jinja \
  --chat-template-file ../llama_tests/template.txt --flash-attn on \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Identified all endpoints.
Identified 5/5 vulnerable endpoints with detailed info.
Detailed output summary.
Running it without this chat template causes tool-calling issues.
Qwen3.6-27B-UD-Q4_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --alias "unsloth/Qwen3.6-27B-UD-Q4_K_XL" --host 0.0.0.0 --port 8021 \
  --temp 0.6 --top-p 1.0 --min-p 0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-size 100000 --fit on --jinja \
  --chat-template-file ../llama_tests/template.txt --flash-attn on \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
This was the weakest performer among the tested models.
Identified 3/5 vulnerable endpoints.
Required the longest time to identify vulnerable endpoints.
Qwen3.5-9B-UD-Q4_K_XL#
Llama parameters:
./llama.cpp/llama-server --model Qwen3.5-9B-UD-Q4_K_XL.gguf \
  --alias "unsloth/Qwen3.5-9B-UD-Q4_K_XL" --host 0.0.0.0 --port 8021 \
  --temp 0.6 --top-p 1.0 --min-p 0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-size 100000 --fit on --jinja \
  --chat-template-file ../llama_tests/template.txt --flash-attn on \
  --chat-template-kwargs "{\"enable_thinking\": false}"
Notes:
Did not identify all endpoints.
Identified 4/5 vulnerable endpoints with detailed info.
Did not follow instructions.
Good results for its size.
Running it without this chat template causes tool-calling issues.
gemma-4-31B-it-UD-Q8_K_XL#
Llama parameters:
./llama.cpp/llama-server --model gemma-4-31B-it-UD-Q8_K_XL.gguf \
  --alias "unsloth/gemma4" --host 0.0.0.0 --port 8021 --temp 0.6 \
  --top-p 1.0 --min-p 0 --repeat-penalty 1.0 --top-k 20 \
  --ctx-size 100000 --fit on
Notes:
Every test run caused llama.cpp to crash. I found issues in the llama.cpp repository related to this bug.
Loading the model in LM Studio and using its built-in chat with Chrome MCP also resulted in a crash.
The issue may lie with the Unsloth build of Gemma 4, as loading the original model from Google through Ollama worked without issues.
gemma-4:31B - Ollama#
Notes:
I used Ollama to run this model.
Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Limitations#
Only SQL injection was tested.
Single application architecture.
No authentication complexity.
No RAG/web search.
No memory persistence.
Limited MCP ecosystem.
Results depend heavily on prompts/templates.
llama.cpp build instability affects reproducibility.
Conclusion#
The results of this evaluation demonstrate that modern local LLMs are becoming increasingly capable of performing autonomous penetration testing tasks when combined with MCP tooling and properly configured inference environments. While cloud-hosted models still provide stronger overall capabilities, several local models demonstrated highly competitive performance in endpoint discovery, vulnerability identification, reasoning quality, and tool-calling workflows.
The evaluation also highlighted that inference speed alone is not a reliable indicator of real-world agent performance. Factors such as instruction following, reasoning quality, tool-calling reliability, prompt structure, and llama.cpp build stability had a significantly greater impact on overall task completion and result quality.