Pentesting by AI: Local LLMs Benchmark

Table of Contents

Local LLMs are advancing rapidly, with new releases delivering significantly improved performance while requiring less memory. In many cases, these newer models are approaching—or even matching—the capabilities of larger, older models, making them increasingly practical for on-premise deployment.

While cloud-based LLMs remain more powerful overall, concerns around exposing sensitive data limit their use in critical or regulated environments. As a result, investing in local LLM solutions has become an important strategic consideration for organizations seeking to leverage AI while maintaining full control over their data.

In this article, I present a comparative analysis of multiple LLMs across different quantization formats, providing practical insights into their performance, trade-offs, and expected behavior in real-world penetration testing scenarios. Additionally, I outline recommended configuration parameters for LLM inference engines (llama.cpp) to help optimize results.

Key Takeaways
#

Best overall model: GLM-4.7-Flash-UD-Q8_K_XL

Best speed/performance ratio: Qwen3.5-35B-A3B-Q8_0

Best vulnerability detail quality: Qwen3.6-27B-UD-Q8_K_XL

Best small-sized model: Qwen3.5-9B-UD-Q8_K_XL (this model outperformed OpenAI GPT-OSS-120B in vulnerability identification accuracy)

Best MCP: Chrome DevTools MCP performed better than Burp MCP

Test Lab Infrastructure
#

The tests were conducted on a desktop system with the following specifications:

RAM: 48 GB

GPU: 3090 TI + 3090 (Total 48 GB Vram)

CPU: AMD Ryzen™ 9 5900X × 24

OS: Ubuntu 24.04.3 LTS

LLM Inference Engines: llama.cpp

Test Case
#

The main objective of this test case is to evaluate whether local LLMs can identify web application endpoints and perform SQL injection testing to uncover vulnerabilities using only their own reasoning and embedded knowledge. The goal is to compare different local models and determine which one is better suited for penetration testing tasks and demonstrates stronger autonomous pentesting capabilities.

To ensure a fair evaluation, I developed a custom web application with multiple SQL injection vulnerabilities. Pre-built vulnerable applications were intentionally avoided to eliminate the possibility that the models had prior knowledge of them, which could skew the results and not accurately reflect their true pentesting skills.

The prompt used in this evaluation is available here

The test web application contains five intentionally vulnerable endpoints affected by SQL injection vulnerabilities:

/login -> SQL injection allows authentication bypass

/users -> SQL injection in search parameter exposes user data

/profile/<user_id> -> SQL injection allows access to any user profile

/products -> SQL injection via category and sort parameters

/orders -> SQL injection allows viewing all orders

This test web application is intended for educational and research purposes only. It should be deployed exclusively within an isolated lab environment, as it is intentionally vulnerable and not designed for production use.

The test environment consisted of the following components:

providing LLM access to MCP tools to perform penetration tests like Chrome DevTools MCP and access to terminal to execute any command.
provide LLM access to MCP servers that allow it to store and retrieve findings.
No further information, web search or RAG access was provided and LLMs were working using their own knowledge.
The test script allowed the LLM to iteratively retry tasks in a loop, with access to a Chrome MCP server and terminal capabilities to execute any required commands.
The test terminated if the LLM exceeded 200 tool calls, reached a 150K context limit, or explicitly ended the session using the session-end tool.
The evaluation was divided into two separate phases:
1. Endpoint discovery phase
2. Vulnerability identification phase
The completion time, token usage, and number of interaction turns reported in this article were measured during the vulnerability identification phase only, after the application endpoints had already been identified successfully by the LLM in a separate session.
Each test session started with an initial 10,000-token context containing all required MCP configurations, tools, and system prompts.

Models Tested
#

The following GGUF models were included in this evaluation:

GLM-4.7-Flash-UD-Q8_K_XL
GLM-4.7-Flash-UD-Q5_K_XL
GLM_4.7_flash-Q4_K_M
Qwen3.5-35B-A3B-UD-Q8_K_XL
Qwen3.5-35B-A3B-UD-Q4_K_XL
Qwen3.5-27B-UD-Q8_K_XL
Qwen3.5-27B-UD-Q4_K_XL
Qwen3.5-9B-UD-Q8_K_XL
Qwen3.5-9B-UD-Q4_K_XL
Qwen3.6-35B-A3B-UD-Q8_K_XL
Qwen3.6-35B-A3B-UD-Q4_K_XL
Qwen3.6-27B-UD-Q8_K_XL
Qwen3.6-27B-UD-Q4_K_XL
Qwen3-Coder-Next-UD-Q5_K_XL
Qwen3-Coder-Next-UD-Q4_K_XL
gpt-oss-120b-F16
gemma-4-31B

All models were tested using Unsloth GGUF variants, which provide a reduced memory footprint while maintaining comparable — and in some cases improved — performance relative to the original releases.

Quantization Overview
#

Q4 and Q8 quantization formats were used throughout the evaluation to cover medium-sized and large-sized variants of each model.

Quantization is the process of reducing the precision of a model’s numbers (e.g., from 16-bit to 4-bit) to improve performance and lower memory usage, but may slightly reduce accuracy due to this loss of precision.

Precision refers to how detailed and exact those numbers are, higher precision means better accuracy but more resource usage.

In simple Analogy: Q4 is like watching YouTube in 480p, and Q8 is like watching YouTube in 1080p. Both work, but one provides clearer output quality.

Below is a simple comparison between Q4 and Q8.

Feature	Q4 (4-bit)	Q8 (8-bit)
Bits per value	4 bits	8 bits
Model size	small (~50% of Q8)	Larger (about 2× Q4)
VRAM usage	Low	Medium to high
Speed	Faster	Slightly slower
Accuracy	Lower (more approximation)	Higher (closer to original)

Benchmark Results
#

Model	Quantization	Inference speed (token/s)*	Time taken to finish (minutes)*	Tokens Used*	Number of turns*	Number of Vulnerabilities Identified (out of 5)*
GLM-4.7-Flash	UD-Q8_K_XL	80	2	26,051	40	5
Qwen3.5-35B-A3B	Q8_0	96	2	33307	51	5
Qwen3.5-9B	UD-Q8_K_XL	62	3	33,209	49	5
Qwen3.5-35B-A3B	UD-Q8_K_XL	83	4	54,862	42	5
Qwen3.6-35B-A3B	UD-Q8_K_XL	101	5	52098	130	5
Qwen3.6-35B-A3B	UD-Q4_K_XL	83	5	81817	89	5
Qwen3.5-27B	UD-Q8_K_XL	21	7	50720	72	5
Qwen3.6-27B	UD-Q8_K_XL	25	16	56740	74	5
Qwen3.5-35B-A3B	UD-Q4_K_XL	107	1	33307	51	4
Qwen3.5-9B	UD-Q4_K_XL	95	2	66,797	54	4
Qwen3.5-27B	UD-Q4_K_XL	35	5	41,995	65	4
Qwen3.5-122B-A10B	UD-Q4_K_XL	22	11	42097	41	4
GLM-4.7-flash	Q4_K_M	74	2	40,470	37	3
gpt-oss-120b	f16	31	4	22,344	20	3
Qwen3-Coder-Next	UD-Q5_K_XL	52	4	36421	66	3
gemma-4-31B	Q4_K_M	21	4	50752	40	3
GLM-4.7-Flash	UD-Q5_K_XL	69	5	54250	59	3
Qwen3.6-27B	UD-Q4_K_XL	29	24	71612	136	3

Inference speed: How fast the model generates output, measured in tokens per second.

Time taken to finish (minutes): Total time from task start to completion.

Tokens used: Total input + output tokens consumed across the entire task.

Number of turns: How many model request/response cycles required to complete the task.

Number of Vulnerabilities Identified (out of 5): How many of the 5 known vulnerabilities the agent successfully found.

Note: All benchmark results presented in this article represent the best-performing run out of five executions per model. The purpose of selecting the best run was to evaluate the maximum practical capability of each model under stable conditions, rather than measuring average consistency. Actual results may vary depending on prompt structure, llama.cpp build version, inference configuration, and tool-calling stability.

Top 5 Models
#

GLM-4.7-Flash-UD-Q8_K_XL: This model demonstrated the best instruction-following capabilities, consistently reaching the intended objectives with the lowest token usage while still delivering detailed and high-quality results. In addition, its strong coding capabilities significantly increased its overall value for technical and automation-focused tasks.

Qwen3.5-35B-A3B-Q8_0: This model achieved the best balance between speed, low token consumption, and overall performance, while still delivering reliable and high-quality results throughout the testing process.

Qwen3.6-27B-UD-Q8_K_XL: This model provided the most detailed vulnerability analysis and the clearest technical explanations. Although it was slower than other models, the quality of its vulnerability descriptions, evidence, and reasoning made it stand out for report-quality output.

Qwen3.5-9B-UD-Q8_K_XL: Despite being a relatively small 9B model, it successfully identified all vulnerabilities. While the depth of analysis and vulnerability descriptions were not comparable to larger models, it still managed to complete the objective successfully.

Qwen3.6-35B-A3B-UD-Q4_K_XL: This model demonstrated that lower-bit quantization can still deliver high-quality pentesting results. Despite using Q4 quantization, it successfully identified all vulnerable endpoints while maintaining strong inference speed, showing that reduced VRAM usage does not necessarily result in major capability loss for agentic penetration testing tasks.

Key Findings
#

After testing multiple builds of llama.cpp throughout this evaluation, I found that build b8175 (d903f30e2) is the most stable and produced a noticeable difference in performance when running Qwen3-Coder-Next and GLM4.7-Flash. Newer builds up to the latest (b8541, ded446b34) exhibit performance regressions, with inference results deviating from those produced by b8175. I tested identical llama.cpp parameters across multiple builds, and b8175 was the only version capable of reliably completing a full pentest session end-to-end.
llama.cpp version 8942 (f53577432) introduced a significant performance improvement in token generation (from 49 token/s to 80 token/s) for GLM-4.7-Flash-UD-Q8_K_XL. In addition, the model’s reasoning quality and inference behavior were noticeably improved.
I encountered tool-calling issues with Qwen3.5 and Qwen3.6 models until I identified the correct chat template.
I noticed that with Qwen3.5 models, if your prompt instructs the LLM to return output as JSON while tool calling is enabled, the model will start emitting tool calls as raw JSON text output rather than using the OpenAI tool calling syntax — causing your agent to misinterpret them as regular output instead of tool invocations, which breaks the execution flow.
The GLM-4.7-Flash-UD-Q8_K_XL model stood out as the best local LLM for instruction following. Despite its lower inference speed, the model demonstrated strong consistency across multiple tests including real-world usage with OpenClaw and the RooCode plugin in VS Code.
Qwen3-Coder-Next performed well on pentesting tasks and excelled at coding, but fell short on tool calling consistency. I observed occasional failures both during pentest sessions and when using it via the RooCode plugin in VS Code.
Qwen3.6-35B-A3B delivered the best inference speed without compromising output quality.
Using the Chrome DevTools MCP proved significantly more efficient than relying on the Burp Suite MCP or granting the agent terminal access to run curl.
The llama.cpp build version has a significant impact on LLM inference results.
Inference speed does not necessarily mean the LLM will complete the task faster. I observed that slower models finished before faster ones, largely due to a more methodical approach and fewer mistakes along the way.

LLM Configuration & Performance Results (llama.cpp)
#

GLM-4.7-Flash-UD-Q8_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model GLM-4.7-Flash-UD-Q8_K_XL.gguf –alias
“unsloth/GLM_4.7_flash” –temp 0.6 –top-p 1.0 –min-p 0 –repeat-penalty
1.0 –top-k 20 –port 8021 –ctx-size 100000 –host 0.0.0.0 –fit on

Notes:

Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Followed instructions and presented data in professional intended way.
Vulnerabilities were described in a detailed and well-organized manner.
Best results with llama.cpp version: 8175 (d903f30e2).

Qwen3.5-27B-UD-Q4_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.5-27B-UD-Q4_K_XL.gguf –alias
“unsloth/Qwen3.5-27B-UD-Q4_K_XL.gguf” –temp 0.6 –top-p 1.0 –min-p 0
–repeat-penalty 1.0 –top-k 20 –port 8021 –ctx-size 100000 –host 0.0.0.0
–fit on –flash-attn on

Notes:

Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Best results with llama.cpp version: 8667 (c08d28d08).

Qwen3.5-35B-A3B-UD-Q4_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.5-35B-A3B-Q8_0.gguf –alias
“unsloth/Qwen3.5-35b” –host 0.0.0.0 –port 8021 –temp 0.6 –top-p 1.0
–min-p 0 –repeat-penalty 1.0 –top-k 20 –ctx-size 100000 –fit on –jinja
–chat-template-file ../llama_tests/template.txt –flash-attn on
–chat-template-kwargs “{\\enable_thinking\\: false}”

Notes:

Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Best results with llama.cpp version: 8667 (c08d28d08).

Qwen3.5-35B-A3B-Q8_0.gguf
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.5-35B-A3B-Q8_0.gguf –alias
“unsloth/Qwen3.5-35b-Q8_0” –host 0.0.0.0 –port 8080 –temp 0.6 –top-p 1.0
–min-p 0 –repeat-penalty 1.0 –top-k 20 –ctx-size 100000 –fit on –jinja
–chat-template-file ../llama_tests/template.txt –flash-attn on
–chat-template-kwargs “{\\enable_thinking\\: false}”

Notes:

Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Best results with llama.cpp version: 8809 (b1be68e8c).

Qwen3-Coder-Next-UD-Q5_K_XL
#

Llama parameters:

 ./llama.cpp/llama-server     –model
Qwen3-Coder-Next-UD-Q5_K_XL-00001-of-00003.gguf     –alias
“unsloth/Qwen3-Coder-Next-UD-Q5_K_XL”    –temp 0.6 –top-p 1.0 –min-p 0 
–repeat-penalty 1.0 –top-k 20      –port 8021     –ctx-size 100000 –host
0.0.0.0 –fit on

Notes:

Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Best results with llama.cpp version: 8175 (d903f30e2).

Qwen3.5-122B-A10B-UD-Q4_K_XL
#

Llama parameters:

./llama.cpp/llama-server     –model
Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf    –alias
“unsloth/Qwen3.5-122B-A10B-UD-Q4_K_XL”       –temp 0.6 –top-p 1.0 –min-p
0  –repeat-penalty 1.0 –top-k 20 –port 8021     –ctx-size 100000 –host
0.0.0.0 –fit on –flash-attn on –jinja –chat-template-file
../llama_tests/template.txt –flash-attn on –chat-template-kwargs
“{\\enable_thinking\\: false}”

Notes:

Best results with llama.cpp version: 8809 (b1be68e8c).
Identified all endpoints.
Identified 4/5 vulnerable endpoints.

GLM-4.7-Flash-UD-Q5_K_XL
#

Llama parameters:

./llama.cpp/llama-server     –model GLM-4.7-Flash-UD-Q5_K_XL.gguf   
–alias “unsloth/GLM_4.7_flash-UD-Q5_K_XL”       –temp 0.6 –top-p 1.0 –min-p
0  –repeat-penalty 1.0 –top-k 20 –port 8021     –ctx-size 100000 –host
0.0.0.0 –fit on

Notes:

Identified all endpoints but also added an endpoint that does not exist with not found status which was not asked for.
Identified 3/5 vulnerable endpoints.
Best results with llama.cpp version: 8175 (d903f30e2).

Qwen3.5-35B-A3B-UD-Q8_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf –alias
“unsloth/Qwen3.5-35b-UD-Q8_K_XL” –host 0.0.0.0 –port 8021 –ctx-size
100000 –fit on –split-mode layer –cache-type-k q8_0 –cache-type-v q8_0
–flash-attn on –parallel 1 –temp 0.6 –top-p 0.95 –min-p 0.00 –top-k 20
–kv-offload –flash-attn on –jinja –chat-template-file
../llama_tests/template.txt –flash-attn on –chat-template-kwargs
“{\\enable_thinking\\: false}”

Notes:

Identified all endpoints.
Identified 5/5 vulnerable endpoints.
If I run it without this chat template I get issue with tool calling.

GLM_4.7_flash-Q4_K_M
#

Llama parameters:

./llama.cpp/llama-server –model GLM-4.7-Flash-Q4_K_M.gguf –alias
“unsloth/GLM_4.7_flash-Q4_K_M” –temp 0.7 –top-p 1.0 –min-p 0.01 –port
8021 –ctx-size 100000 –host 0.0.0.0 –fit on

Notes:

Identified all endpoints.
Identified 3/5 vulnerable endpoints.
Best llama.cpp version tested with most detections: 8175 (d903f30e2).

gpt-oss-120b-f16
#

Llama parameters:

./llama.cpp/llama-server     –model gpt-oss-120b-F16.gguf  –alias
“gpt-oss-120b-F16”         –temp 0.6     –top-p 1.0     –min-p 0.0   
–top-k 0.0 -ngl -1    –port 8021     –ctx-size 100000 –host 0.0.0.0 -fit
on

Notes:

Identified all endpoints but also included duplicate endpoints.
Identified 3/5 vulnerable endpoints.

Qwen3.5-9B-UD-Q8_K_XL
#

Llama parameters:

./llama.cpp/llama-server     –model Qwen3.5-9B-UD-Q8_K_XL.gguf    
–alias “unsloth/Qwen3.5-9B-UD-Q8_K_XL”  –seed 3407   –temp 0.6 –top-p
0.95 –min-p 0.01  –repeat-penalty 1.0 –top-k 40      –port 8021    
–ctx-size 100000 –host 0.0.0.0 –fit on –jinja –chat-template-file
../llama_tests/template.txt –flash-attn on –chat-template-kwargs
“{\\enable_thinking\\: false}”

Notes:

Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Did not follow the exact instructions but managed to provide a good result.

Qwen3.6-35B-A3B-UD-Q8_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf –alias
“unsloth/Qwen3.6-35b-A3B-UD-Q8_K_XL” –host 0.0.0.0 –port 8021 –temp 0.6
–top-p 1.0 –min-p 0 –repeat-penalty 1.0 –top-k 20 –ctx-size 100000 –fit
on  –jinja –chat-template-file ../llama_tests/template.txt –flash-attn
on –chat-template-kwargs “{\\enable_thinking\\: false}”

Notes:

Identified all endpoints.
Identified 5/5 vulnerable endpoints.
Detailed output summary.
If I run it without this chat template I get issue with tool calling.
Did not consistently follow instructions to only use Chrome DevTools MCP and ended using curl which consumed more time.

Qwen3.6-35B-A3B-UD-Q4_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf –alias
“unsloth/Qwen3.6-35B-A3B-UD-Q4_K_XL” –host 0.0.0.0 –port 8021 –temp 0.6
–top-p 1.0 –min-p 0 –repeat-penalty 1.0 –top-k 20 –ctx-size 100000 –fit
on  –jinja –chat-template-file ../llama_tests/template.txt –flash-attn
on –chat-template-kwargs “{\\enable_thinking\\: false}”

Notes:

Identified all endpoints.
Identified 5/5 vulnerable endpoints along with an out-of-scope vulnerability.
If I run it without this chat template I get issue with tool calling.

Qwen3.5-27B-UD-Q8_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.5-27B-UD-Q8_K_XL.gguf–alias
“unsloth/Qwen3.5-27B-UD-Q8_K_XL” –host 0.0.0.0 –port 8021 –temp 0.6
–top-p 1.0 –min-p 0 –repeat-penalty 1.0 –top-k 20 –ctx-size 100000 –fit
on  –jinja –chat-template-file ../llama_tests/template.txt –flash-attn
on –chat-template-kwargs “{\\enable_thinking\\: false}”

Notes:

Identified all endpoints.
Identified 5/5 vulnerable endpoints with detailed info.
Detailed output summary.
If you run it without this chat template you will face issue with tool calling.

Qwen3.6-27B-UD-Q4_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.6-27B-UD-Q4_K_XL.gguf–alias
“unsloth/Qwen3.6-27B-UD-Q4_K_XL” –host 0.0.0.0 –port 8021 –temp 0.6
–top-p 1.0 –min-p 0 –repeat-penalty 1.0 –top-k 20 –ctx-size 100000 –fit
on  –jinja –chat-template-file ../llama_tests/template.txt –flash-attn
on –chat-template-kwargs “{\\enable_thinking\\: false}”

Notes:

This model performed the weakest among the tested models.
Identified 3/5 vulnerable endpoints.
Required the longest time to identify vulnerable endpoints.

Qwen3.5-9B-UD-Q4_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model Qwen3.5-9B-UD-Q4_K_XL.gguf –alias
“unsloth/Qwen3.5-9B-UD-Q4_K_XL” –host 0.0.0.0 –port 8021 –temp 0.6
–top-p 1.0 –min-p 0 –repeat-penalty 1.0 –top-k 20 –ctx-size 100000 –fit
on  –jinja –chat-template-file ../llama_tests/template.txt –flash-attn
on –chat-template-kwargs “{\\enable_thinking\\: false}”

Notes:

Did not identify all endpoints.
Identified 4/5 vulnerable endpoints with detailed info.
Didn’t follow instructions.
Good results compared to its size.
If I run it without this chat template I get issue with tool calling.

gemma-4-31B-it-UD-Q8_K_XL
#

Llama parameters:

./llama.cpp/llama-server –model gemma-4-31B-it-UD-Q8_K_XL.gguf –alias
“unsloth/gemma4” –host 0.0.0.0 –port 8021 –temp 0.6 –top-p 1.0 –min-p 0
–repeat-penalty 1.0 –top-k 20 –ctx-size 100000 –fit on

Notes:

Every test run caused llama.cpp to crash. I saw issues in llama.cpp repo related to this bug.
Loading the model with LM Studio and using built-in chat to use Сhrome MCP also resulted in a crash.
The issue may be with the gemma4 unsloth version as I used Ollama to load the original model from Google and it worked without issues.

gemma-4:31B - Ollama
#

Notes:

I used Ollama to run this model.
Identified all endpoints.
Identified 3/5 vulnerable endpoints.

Limitations
#

Only SQL injection was tested.
Single application architecture.
No authentication complexity.
No RAG/web search.
No memory persistence.
Limited MCP ecosystem.
Results depend heavily on prompts/templates.
llama.cpp build instability affects reproducibility.

Conclusion
#

The results of this evaluation demonstrate that modern local LLMs are becoming increasingly capable of performing autonomous penetration testing tasks when combined with MCP tooling and properly configured inference environments. While cloud-hosted models still provide stronger overall capabilities, several local models demonstrated highly competitive performance in endpoint discovery, vulnerability identification, reasoning quality, and tool-calling workflows.

The evaluation also highlighted that inference speed alone is not a reliable indicator of real-world agent performance. Factors such as instruction following, reasoning quality, tool-calling reliability, prompt structure, and llama.cpp build stability had a significantly greater impact on overall task completion and result quality.

Key Takeaways#

Test Lab Infrastructure#

Test Case#

Models Tested#

Quantization Overview#

Benchmark Results#

Top 5 Models#

Key Findings#

LLM Configuration & Performance Results (llama.cpp)#

GLM-4.7-Flash-UD-Q8_K_XL#

Qwen3.5-27B-UD-Q4_K_XL#

Qwen3.5-35B-A3B-UD-Q4_K_XL#

Qwen3.5-35B-A3B-Q8_0.gguf#

Qwen3-Coder-Next-UD-Q5_K_XL#

Qwen3.5-122B-A10B-UD-Q4_K_XL#

GLM-4.7-Flash-UD-Q5_K_XL#

Qwen3.5-35B-A3B-UD-Q8_K_XL#

GLM_4.7_flash-Q4_K_M#

gpt-oss-120b-f16#

Qwen3.5-9B-UD-Q8_K_XL#

Qwen3.6-35B-A3B-UD-Q8_K_XL#

Qwen3.6-35B-A3B-UD-Q4_K_XL#

Qwen3.5-27B-UD-Q8_K_XL#

Qwen3.6-27B-UD-Q4_K_XL#

Qwen3.5-9B-UD-Q4_K_XL#

gemma-4-31B-it-UD-Q8_K_XL#

gemma-4:31B - Ollama#

Limitations#

Conclusion#

Key Takeaways
#

Test Lab Infrastructure
#

Test Case
#

Models Tested
#

Quantization Overview
#

Benchmark Results
#

Top 5 Models
#

Key Findings
#

LLM Configuration & Performance Results (llama.cpp)
#

GLM-4.7-Flash-UD-Q8_K_XL
#

Qwen3.5-27B-UD-Q4_K_XL
#

Qwen3.5-35B-A3B-UD-Q4_K_XL
#

Qwen3.5-35B-A3B-Q8_0.gguf
#

Qwen3-Coder-Next-UD-Q5_K_XL
#

Qwen3.5-122B-A10B-UD-Q4_K_XL
#

GLM-4.7-Flash-UD-Q5_K_XL
#

Qwen3.5-35B-A3B-UD-Q8_K_XL
#

GLM_4.7_flash-Q4_K_M
#

gpt-oss-120b-f16
#

Qwen3.5-9B-UD-Q8_K_XL
#

Qwen3.6-35B-A3B-UD-Q8_K_XL
#

Qwen3.6-35B-A3B-UD-Q4_K_XL
#

Qwen3.5-27B-UD-Q8_K_XL
#

Qwen3.6-27B-UD-Q4_K_XL
#

Qwen3.5-9B-UD-Q4_K_XL
#

gemma-4-31B-it-UD-Q8_K_XL
#

gemma-4:31B - Ollama
#

Limitations
#

Conclusion
#