Context length & performance

Context length and inference speed are the two knobs that matter most for a good local Hermes experience.

Why 64k context is required

Hermes agentic sessions include:

System prompt and SOUL.md
Tool definitions (60+ tools = lots of tokens)
Conversation history
Tool call results (file contents, command output)
Memory injections

All of this adds up fast. With Ollama's default 2048 tokens, Hermes runs out of room after 1-2 tool calls.

Minimum: 64,000 tokens. I use 65,536.
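
To see how quickly a single tool result eats into the window, a rough estimate helps. The sketch below assumes about 4 bytes per token, which is a common rule of thumb for English text and not the tokenizer Hermes or your model actually uses. The function names are mine, not part of Hermes or Ollama.

```bash
# Rough token estimate for text on stdin.
# ASSUMPTION: ~4 bytes per token (rule of thumb, not a real tokenizer).
estimate_tokens() {
  wc -c | awk '{ print int(($1 + 3) / 4) }'
}

# Percentage of a context window used by a given token count.
# $1 = tokens used, $2 = window size in tokens
context_share() {
  awk -v used="$1" -v window="$2" 'BEGIN { printf "%.1f\n", 100 * used / window }'
}

# Usage (any file you expect the agent to read):
#   estimate_tokens < some_file.py
#   context_share 1500 2048     # share of a 2048-token window
#   context_share 1500 65536    # share of a 65,536-token window
```

Running the estimate on a typical source file and comparing the two shares shows why a 2048-token window is gone after a tool call or two.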

Setting context in both places

Context must match in Ollama and Hermes:

yaml

# ~/.hermes/config.yaml
model:
  context_length: 65536

bash

# Ollama
OLLAMA_CONTEXT_LENGTH=65536 ollama serve
# or Modelfile: PARAMETER num_ctx 65536
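
If you prefer the Modelfile route, the sketch below writes a minimal Modelfile that pins num_ctx on top of a base model. The helper name and the model names in the usage comment are placeholders; substitute the tag you actually pulled.

```bash
# Write a minimal Modelfile that pins the context length for a base model.
# $1 = base model tag (placeholder), $2 = context length, $3 = output path
write_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

# Usage (hypothetical model names):
#   write_modelfile my-base-model 65536 ./Modelfile
#   ollama create my-base-model-64k -f ./Modelfile
```

The new model name created by ollama create is the one Hermes then has to point at, otherwise the base model with its default context is still the one being served.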

Mismatch symptom

Example: Hermes assumes 128k of context but Ollama only serves 2048 tokens. Ollama then truncates whatever does not fit in its window, so mid-task the agent "forgets" earlier tool results and loops or hallucinates. Always verify both sides.
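
One way to verify both sides is to compare the two numbers directly. This is a sketch under assumptions: it reads context_length with a plain line match (not a YAML parser), and it only sees a num_ctx that was set through a Modelfile, as printed by ollama show --parameters. A value set through OLLAMA_CONTEXT_LENGTH lives in the server's environment and has to be checked there. The function names and the model tag are placeholders.

```bash
# Read context_length from a Hermes config file (simple line match).
hermes_ctx() {
  awk '$1 == "context_length:" { print $2; exit }' "$1"
}

# Read num_ctx from `ollama show --parameters MODEL` output on stdin.
ollama_ctx() {
  awk '$1 == "num_ctx" { print $2; exit }'
}

# Compare both values; returns non-zero on a mismatch or a missing value.
check_ctx() {
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "OK: both sides use $1 tokens"
  else
    echo "MISMATCH: hermes=${1:-unset} ollama=${2:-unset}"
    return 1
  fi
}

# Usage (model tag is a placeholder):
#   check_ctx "$(hermes_ctx ~/.hermes/config.yaml)" \
#             "$(ollama show --parameters my-model | ollama_ctx)"
```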

Performance tuning

Apple Silicon (M1/M2/M3/M4)

Qwen 3.5 27B Q4 runs well on 24 GB unified memory.
Close memory-heavy apps before long agent sessions.
ollama ps shows the CPU/GPU split for each loaded model in its PROCESSOR column; 100% GPU = fastest.
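
To check this quickly, the sketch below reads ollama ps output and flags any loaded model that is not fully on the GPU. It assumes the usual table layout with a header row and a PROCESSOR value such as 100% GPU; the helper name is mine.

```bash
# Read `ollama ps` output on stdin and report, per loaded model,
# whether it runs fully on the GPU. Skips the header row and blank lines.
gpu_report() {
  awk 'NR > 1 && NF {
    status = ($0 ~ /100% GPU/) ? "fully on GPU" : "partly or fully on CPU"
    print $1 ": " status
  }'
}

# Usage:
#   ollama ps | gpu_report
```

If a model shows up as partly on CPU, closing other apps or moving to a smaller model or quantization is usually the first thing to try.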

NVIDIA GPU

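Ollama automatically offloads as many layers to the GPU as fit in VRAM. More VRAM = larger models running fully on GPU.
For persistent serving, set OLLAMA_KEEP_ALIVE=-1 so the model stays loaded between requests.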

CPU only

Expect 2-5 tokens/sec on a 27B model when running on CPU only. The Apple Silicon GPU is much faster.
Use a 9B model for interactive work.
Hermes relaxes timeouts for local endpoints automatically; run export HERMES_STREAM_READ_TIMEOUT=1800 only if you still hit timeouts.
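
To find out where your own machine lands, you can compute tokens/sec from the timing fields Ollama returns. The final response object of a non-streaming /api/generate call includes eval_count (generated tokens) and eval_duration (nanoseconds). The helper below is a small sketch that does the division; its name is mine.

```bash
# Tokens per second from Ollama's timing fields.
# $1 = eval_count (tokens generated), $2 = eval_duration (nanoseconds)
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Usage: 100 tokens generated in 25 seconds
#   tokens_per_sec 100 25000000000
```

Running ollama run with the --verbose flag prints an eval rate as well, which is a quicker check if you do not need the raw numbers.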

Reducing context pressure

If you hit context limits even at 64k:

Start fresh sessions for unrelated tasks.
Disable unused toolsets in config.
Ask Hermes to summarize before continuing long threads.
Use smaller models for simple tasks (less compute and memory per token).

Benchmark your setup

Give Hermes this prompt and note how long it takes to finish:

text

Read every .py file in ~/Projects/FDE/hermes-course, count total lines,
and list the 3 largest files by line count.

Time	Verdict
Under 30s	Great setup
30-90s	Acceptable for offline
Over 2 min	Consider smaller model or GPU upgrade
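
Speed only counts if the answer is right. The sketch below computes the same numbers directly in the shell so you can compare them with what the agent reports. The function name is mine, and the directory is whatever you used in the prompt.

```bash
# Count lines in every .py file under a directory, print the total,
# then the 3 largest files as "<lines><TAB><path>".
# Note: wc -l counts newline characters, so a file without a trailing
# newline is reported one line short.
py_line_report() (
  counts=$(find "$1" -type f -name '*.py' -exec sh -c '
    for f; do
      printf "%d\t%s\n" "$(( $(wc -l < "$f") ))" "$f"
    done' sh {} +)
  printf '%s\n' "$counts" |
    awk -F '\t' '{ total += $1 } END { print "total lines: " (total + 0) }'
  printf '%s\n' "$counts" | sort -rn | head -n 3
)

# Usage:
#   py_line_report ~/Projects/FDE/hermes-course
```

Prefixing the call with time also gives a baseline for how long the raw file work takes without any model in the loop.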

Next: Configuration overview.
