GLM-4.7-Flash on Nvidia GB10

So I have gotten into running my own local LLM for privacy reasons, and I like to use it to assist with incident response tasks and collecting OSINT. I want to keep everything local and only share my searches with the “public”; for that I have a SearXNG proxy set up to help anonymize the web traffic (but that is all separate).

I ended up using Ollama, simply because nothing was working when I tried vLLM. I’m hoping that in a few weeks, after some updates, I will be able to swap Ollama out for vLLM. Hopefully this post helps whoever finds it, or a bot / search engine. The rest of the post was written by the AI – I don’t claim this is mine, I just want to share what eventually worked for me to help others.

If you’ve just taken delivery of an NVIDIA DGX Spark and want to run the latest GLM-4.7-Flash model at full F16 precision using Ollama, this post covers everything we had to figure out the hard way. Hardware: GB10, compute capability sm_121a, CUDA 13.1, 119.6 GiB unified memory, aarch64 Ubuntu 24.04.

What is GLM-4.7-Flash and Why Does It Matter?

GLM-4.7-Flash is a 47-billion parameter Mixture-of-Experts model from Zhipu AI’s zai-org. Despite the “Flash” name suggesting a smaller model, 4.7 is a version number, not a size tier: this is a full-weight, production-grade MoE model with:

  • 47B total parameters (MoE architecture)
  • 131,072 token context window (131K)
  • Native tool calling support
  • Released under MIT licence
  • BF16 weights at approximately 59GB on disk (48 safetensor shards)

Running it at full F16 precision on a single machine requires roughly 115GB of GPU memory. The DGX Spark’s 119.6 GiB unified memory pool is one of very few consumer/prosumer platforms that can actually do this without quantisation degradation.

The difference matters for complex reasoning tasks. Q4_K_M quantisation compresses each weight from 16 bits to ~4 bits — a 75% reduction. For simple queries this is largely invisible. For multi-step reasoning chains, technical precision (CVE numbers, exact API signatures, structured outputs), and reliable tool call formatting, full precision is noticeably better.
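The arithmetic behind that 75% figure is worth seeing once. A quick back-of-envelope sketch (my own numbers; raw weight bytes only, ignoring quantisation block scales and runtime buffers):

```python
# Raw weight-storage arithmetic for a 47B-parameter model. Weight bytes only:
# the loaded footprint quoted in this post is larger once the KV cache and
# runtime buffers are added on top.
PARAMS = 47e9

f16_gb = PARAMS * 2 / 1e9    # 16 bits = 2 bytes per weight -> ~94 GB
q4_gb = PARAMS * 0.5 / 1e9   # ~4 bits = 0.5 bytes per weight (Q4_K_M adds
                             # a small per-block scale overhead on top)

print(f"Reduction: {1 - q4_gb / f16_gb:.0%}")  # Reduction: 75%
```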


The Hardware: NVIDIA DGX Spark (GB10)

Quick specs relevant to this guide:

  • GPU: NVIDIA GB10 — compute capability 12.1 (sm_121a)
  • Memory: 119.6 GiB unified CPU/GPU memory
  • CUDA: 13.1
  • Architecture: aarch64 (ARM64)
  • OS: Ubuntu 24.04

The GB10 shipped in early 2026 and at the time of writing the open source ecosystem (vLLM, SGLang) hadn’t fully caught up to sm_121a. This guide uses Ollama, which works reliably on the GB10 today.


Why Not vLLM or SGLang?

We spent considerable time trying both before landing on Ollama. Here’s why they didn’t work at the time:

vLLM 0.14.0

GLM-4.7 uses a glm4_moe_lite model type with MLA (Multi-head Latent Attention). vLLM 0.14.0 doesn’t recognise this architecture — it fails at model load with a weight shape mismatch. This isn’t fixable with config patches; the model architecture support simply isn’t there yet.

SGLang (stock Docker)

SGLang supports GLM-4.7 architecture and has a --tool-call-parser glm47 flag. However, the stock lmsysorg/sglang:latest Docker image bundles a version of Triton whose ptxas compiler doesn’t recognise sm_121a. The GB10 is new enough that the GPU architecture wasn’t known to PyTorch’s published builds at the time of writing.

We got further using scitrera/dgx-spark-vllm:0.14.0rc2-t4 as a base image (which includes NVIDIA’s custom PyTorch 2.10.0-rc6 and Triton 3.5.1 with sm_121a support), but ultimately hit a wall with sgl_kernel: the PyPI binary is compiled for sm100 (H100), not sm121. Building from source on the live machine was too unstable — the CUDA kernel compilation during docker build takes 20-30 minutes on aarch64, pegs available memory, and caused SSH sessions to drop mid-build.

Bottom line: Ollama works today. SGLang on the GB10 will likely be straightforward once the ecosystem catches up — watch this space.


The Working Solution: Ollama with a Custom F16 Modelfile

Step 1: Install Ollama

If you haven’t already:

curl -fsSL https://ollama.com/install.sh | sh

Confirm it’s running:

ollama list

Step 2: Set Ollama Environment Variables

Edit the Ollama systemd service to configure parallel inference and keep models loaded permanently:

sudo systemctl edit ollama

Add the following in the override file:

[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_REQUEST_TIMEOUT=600"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

OLLAMA_NUM_PARALLEL=4 allows up to 4 concurrent requests against the same model. OLLAMA_KEEP_ALIVE=-1 keeps the model permanently loaded in memory (no reload penalty between requests). OLLAMA_REQUEST_TIMEOUT=600 is important — complex reasoning tasks on large context windows can take several minutes and the default timeout is too short.

Step 3: Download the BF16 Weights from HuggingFace

Ollama’s model registry doesn’t carry the full BF16 version of GLM-4.7-Flash. We’ll pull the weights directly from HuggingFace and import them into Ollama.

Install the HuggingFace CLI if you don’t have it:

pip install huggingface-hub --break-system-packages

Download the model (approximately 59GB, 48 safetensor shards):

huggingface-cli download zai-org/GLM-4.7-Flash \
  --local-dir ~/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/main \
  --local-dir-use-symlinks False

The --local-dir-use-symlinks False flag is critical. Ollama’s security policy refuses to follow symlinks when importing a model. Without this flag, the HF CLI creates a symlinked structure that Ollama will reject with an “insecure path” error.

Once downloaded, note the actual snapshot path:

ls ~/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/

You’ll see either a directory named main (if you downloaded to the --local-dir path above) or a commit hash like 7dd20894a642a0aa287e9827cb1a1f7f91386b67 (if the files went through the default cache). Use that exact directory name in the next step.

Step 4: Create the Modelfile

Replace <SNAPSHOT_HASH> with the hash from the previous step:

cat > ~/Modelfile-glm-f16 << 'EOF'
FROM /home/<YOUR_USER>/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/<SNAPSHOT_HASH>
PARAMETER num_gpu 999
PARAMETER num_ctx 131072
PARAMETER temperature 0.3
PARAMETER num_predict 16384
PARAMETER stop "<|user|>"
PARAMETER stop "<|observation|>"
PARAMETER stop "<|endoftext|>"
RENDERER glm-4.7
PARSER glm-4.7

TEMPLATE """[gMASK]<sop>{{ if .System }}<|system|>
{{ .System }}{{ end }}{{ range .Messages }}{{ if eq .Role "user" }}<|user|>
{{ .Content }}{{ else if eq .Role "assistant" }}<|assistant|>
{{ .Content }}{{ else if eq .Role "tool" }}<|observation|>
{{ .Content }}{{ end }}{{ end }}<|assistant|>
"""

SYSTEM You are a helpful AI assistant.
EOF

A few important notes on this Modelfile:

  • RENDERER glm-4.7 and PARSER glm-4.7 — these two lines are essential and easy to miss. They tell Ollama that this model supports GLM-4.7 tool calling format. Without them, Ollama will refuse tool call requests with a “does not support tools” 400 error, even though the model weights absolutely do support it.
  • num_ctx 131072 — sets the full 131K context window. The model supports this natively.
  • num_predict 16384 — maximum output tokens per response. Increase if you need longer generations.
  • The TEMPLATE — GLM-4.7 uses its own chat template format with [gMASK]<sop> tokens. This must be specified explicitly.
  • SYSTEM — replace the placeholder with your actual system prompt.
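The TEMPLATE logic is easier to see in plain code. A minimal Python sketch of the same rendering (my own illustration for clarity – Ollama actually renders this with Go templates; only the token layout matters):

```python
# Illustrative re-implementation of the Modelfile TEMPLATE above.
ROLE_TOKENS = {
    "user": "<|user|>",
    "assistant": "<|assistant|>",
    "tool": "<|observation|>",
}

def render(system: str, messages: list[dict]) -> str:
    prompt = "[gMASK]<sop>"
    if system:
        prompt += f"<|system|>\n{system}"
    for msg in messages:
        prompt += f"{ROLE_TOKENS[msg['role']]}\n{msg['content']}"
    # Trailing assistant token: generation continues from here.
    return prompt + "<|assistant|>"

prompt = render("You are a helpful AI assistant.",
                [{"role": "user", "content": "What is 2+2?"}])
print(prompt)
```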

Step 5: Build the Ollama Model

This step converts the 48 BF16 safetensor shards into GGUF format and imports them into Ollama’s model store. It will take 10-15 minutes and the output model will occupy approximately 115GB in Ollama’s storage.

Run it in a tmux session so it survives any SSH disconnection:

tmux new-session -s f16build
ollama create glm-forensics-f16 -f ~/Modelfile-glm-f16

Detach safely with Ctrl+B then D. Do not press Ctrl+C — this will cancel the build.

To check progress:

tmux attach -t f16build

When complete you’ll see output like:

success
Model 'glm-forensics-f16' created successfully

Confirm the model is listed:

ollama list

You should see glm-forensics-f16:latest at approximately 59GB on disk (115GB when loaded into unified memory).

Step 6: Test the Model

Quick sanity check via curl:

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-forensics-f16",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }' | python3 -m json.tool

Test tool calling is working (the RENDERER/PARSER lines are what make this work):

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-forensics-f16",
    "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}}}],
    "messages": [{"role": "user", "content": "What is the weather in London?"}]
  }' | python3 -m json.tool

You should see a finish_reason: "tool_calls" in the response with a properly structured tool call object.
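When wiring this into an application, the check amounts to inspecting the OpenAI-style response body. A hedged Python sketch against a hardcoded, abridged sample payload (field names follow the OpenAI chat-completions schema that Ollama’s /v1 endpoint mirrors; the payload itself is illustrative):

```python
import json

# Abridged sample response, reduced to the fields we care about (illustrative).
sample = json.loads("""
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "tool_calls": [{
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": "{\\"location\\": \\"London\\"}"}
      }]
    }
  }]
}
""")

choice = sample["choices"][0]
if choice["finish_reason"] == "tool_calls":
    call = choice["message"]["tool_calls"][0]["function"]
    args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    print(call["name"], args)             # get_weather {'location': 'London'}
```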


Checking Memory Usage

When GLM-4.7-Flash F16 is loaded, it occupies approximately 115GB of the 119.6GB unified memory pool. Confirm what’s loaded:

ollama ps

Example output:

NAME                     ID            SIZE      PROCESSOR    UNTIL
glm-forensics-f16:latest abc123def456  115 GiB   100% GPU     Forever

With OLLAMA_KEEP_ALIVE=-1 the model stays loaded permanently. This is the right setting for a dedicated inference machine — there’s no cold start penalty when the model is needed.


Context Window and Memory Considerations

The 131K context window is one of the most valuable features of this setup. However, there’s an important trade-off to understand:

KV cache (key-value cache, used to store conversation history for attention) grows linearly with context length. At 131K tokens, the KV cache for a single conversation can consume 10-30GB depending on the model’s configuration. Since the model itself uses ~115GB of the 119.6GB available, there is limited headroom.

In practice, Ollama handles this gracefully — it will start offloading KV cache to system RAM if GPU memory runs short, which slows generation speed. For most use cases, keeping individual conversations under 50-60K tokens keeps everything comfortably in unified memory.
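To get a feel for where those numbers come from, here is a rough KV-cache sizing sketch. The layer/head values below are placeholders, not GLM-4.7-Flash’s published config (plug in the values from the model’s config.json), and MLA compresses the KV cache further than this GQA-style formula suggests, so treat it as an upper-bound estimate:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes
# (F16) per token. Layer/head numbers are placeholder assumptions only.
def kv_cache_gb(tokens, layers=48, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

print(f"{kv_cache_gb(131_072):.1f} GB at the full 131K window")  # 25.8 GB
print(f"{kv_cache_gb(65_536):.1f} GB at 65K")                    # 12.9 GB
```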

If you’re running shorter conversations but want true maximum throughput, you can set a lower context in the Modelfile:

PARAMETER num_ctx 65536

This halves the maximum context to 65K tokens but leaves significantly more headroom in the memory pool.


Common Issues and Fixes

400 Error: “model does not support tools”

This means the RENDERER glm-4.7 and PARSER glm-4.7 lines are missing from your Modelfile. Recreate the model with those lines added. The model name you’re calling against also needs to be the one you created with those lines — not the base glm-4.7-flash:latest pulled from Ollama’s registry.

“insecure path” error during ollama create

You downloaded the HuggingFace weights with symlinks. The HF CLI creates a blob cache with symlinked snapshot directories, which Ollama refuses to import for security reasons. Re-download with --local-dir-use-symlinks False.

Ollama connection drops / timeout during long generations

Two causes:

  1. Add Environment="OLLAMA_REQUEST_TIMEOUT=600" to the Ollama systemd service as described in Step 2.
  2. Your client-side timeout may be shorter. Ensure whatever is calling Ollama (your application, curl, etc.) has a timeout of at least 600 seconds for complex tasks.
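For a Python client, the timeout lives on the connection object. A minimal stdlib sketch (using http.client rather than any particular SDK; constructing the connection does not open a socket, so this is safe to set up even before Ollama is running):

```python
import http.client
import json

# 600s covers long reasoning tasks that exceed most clients' defaults.
conn = http.client.HTTPConnection("localhost", 11434, timeout=600)

payload = json.dumps({
    "model": "glm-forensics-f16",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
})
# Uncomment on the machine running Ollama:
# conn.request("POST", "/v1/chat/completions", payload,
#              {"Content-Type": "application/json"})
# print(conn.getresponse().read().decode())

print(conn.timeout)  # 600
```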

ollama create fails immediately with “pull model manifest: file does not exist”

Your FROM path is wrong. Check the exact snapshot hash:

ls ~/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/

Copy that hash exactly into the Modelfile path.

Model is slow / generating at low tokens/sec

The DGX Spark’s unified memory architecture has lower memory bandwidth (~273 GB/s) compared to discrete GPU VRAM. GLM-4.7-Flash F16 generates at roughly 10-25 tokens/second depending on context length — noticeably slower than a quantised model on a discrete GPU with high bandwidth memory, but producing higher quality output. This is the trade-off of running full precision on this hardware.
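That range is consistent with decode being memory-bandwidth bound: each generated token requires reading the active experts’ weights from memory. The active-parameter count below is a placeholder assumption (this post doesn’t state GLM-4.7-Flash’s active parameters), so treat the result as a rough sanity check, not a spec:

```python
# Decode throughput is roughly: tokens/sec ~= bandwidth / bytes read per token.
# For MoE, only the active experts' weights are read each token.
bandwidth_gbs = 273     # DGX Spark unified memory bandwidth (GB/s)
active_params = 6e9     # ASSUMED active parameters per token, not published
bytes_per_param = 2     # F16

tok_per_sec = bandwidth_gbs * 1e9 / (active_params * bytes_per_param)
print(f"~{tok_per_sec:.0f} tokens/sec upper bound")  # ~23
```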


Comparing Q4 vs F16 in Practice

We ran both versions side by side for several weeks. The honest summary:

  • Simple Q&A, coding, summarisation: Very little difference. Q4 is faster and good enough.
  • Multi-step reasoning chains: F16 maintains coherence better over long chains. Q4 occasionally loses track of earlier context.
  • Technical precision (specific CVEs, API details, structured outputs): F16 hallucinates less. The difference is noticeable when accuracy matters.
  • Tool call reliability: F16 makes fewer formatting errors and chooses tools more consistently.
  • Long document analysis (>30K tokens): F16 draws better inferences from long contexts.

For any workload where reasoning quality matters more than raw speed, F16 is the right choice if your hardware can support it. The DGX Spark is one of the very few platforms where this is even possible with GLM-4.7-Flash.


What’s Next

A few things worth watching:

  • SGLang support for sm_121a: When this lands, it will unlock significantly higher throughput via continuous batching and PagedAttention. The Dockerfile is essentially working (see the SGLang issues above) — it just needs a stable build environment.
  • GLM-4.5 and GLM-5: Zhipu AI has released larger models in the GLM family. GLM-4.5 Air (106B total / 12B active MoE) would require two GB10 boxes at F16. GLM-5 (744B total / 40B active) is API-only for most hardware configurations.
  • Two-box setups: A second DGX Spark connected via NVLink or high-speed networking opens up the larger models. GLM-4.5 Air at F16 requires approximately 212GB — right in the sweet spot for two GB10s combined.

Summary: The Minimal Steps

  1. Install Ollama and configure OLLAMA_NUM_PARALLEL=4, OLLAMA_KEEP_ALIVE=-1, and OLLAMA_REQUEST_TIMEOUT=600 in systemd
  2. Download BF16 weights: huggingface-cli download zai-org/GLM-4.7-Flash --local-dir-use-symlinks False
  3. Create a Modelfile with the correct FROM path, RENDERER glm-4.7 and PARSER glm-4.7, the GLM chat template, and stop tokens
  4. Run ollama create glm-forensics-f16 -f ~/Modelfile-glm-f16 in tmux
  5. Test with a tool call to confirm the RENDERER/PARSER lines are working

The step that trips most people up is the RENDERER/PARSER omission — models load and respond fine without those lines, but silently fail on any tool call request. Don’t skip them.
