Context length & performance

Context length and inference speed are the two knobs that matter most for a good local Hermes experience.

Why 64k context is required

Hermes agentic sessions include:

  • System prompt and SOUL.md
  • Tool definitions (60+ tools = lots of tokens)
  • Conversation history
  • Tool call results (file contents, command output)
  • Memory injections

All of this adds up fast. With a small default window such as Ollama's long-standing 2048 tokens (newer releases may default higher, so check yours), Hermes runs out of room after 1-2 tool calls.

Minimum: 64,000 tokens. I use 65,536.
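
To make the budget concrete, here is a rough back-of-the-envelope sketch. Every token count in it is an illustrative assumption, not a measurement of a real Hermes session, and the function name is hypothetical; substitute numbers from your own setup.

```python
# Rough context budget. All token counts are illustrative assumptions,
# not measurements of a real Hermes session.
FIXED_OVERHEAD = {
    "system_prompt_and_soul": 3_000,
    "tool_definitions": 12_000,  # assumed: 60 tools at roughly 200 tokens each
    "memory_injections": 1_000,
}
TOKENS_PER_TOOL_CALL = 2_500  # assumed average: one tool result plus the reply


def tool_calls_that_fit(context_length: int) -> int:
    """Return how many average-sized tool calls fit after the fixed overhead."""
    remaining = context_length - sum(FIXED_OVERHEAD.values())
    return max(0, remaining // TOKENS_PER_TOOL_CALL)


for ctx in (2_048, 65_536):
    print(f"{ctx:>6} tokens: room for about {tool_calls_that_fit(ctx)} tool calls")
```

Under these assumed numbers the fixed overhead alone is larger than a 2048-token window, while 65,536 tokens leaves room for a multi-step task.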

Setting context in both places

Set the same context length in both Hermes and Ollama:

yaml
# ~/.hermes/config.yaml
model:
  context_length: 65536

bash
# Ollama
OLLAMA_CONTEXT_LENGTH=65536 ollama serve
# or Modelfile: PARAMETER num_ctx 65536

Mismatch symptom

Example: Hermes assumes it has 128k of context, but Ollama only serves 2048 tokens. Ollama truncates prompts that exceed its window, so mid-task the agent "forgets" earlier tool results, then loops or hallucinates. Always verify both sides.
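
One way to verify both sides is to compare the two settings as text. The sketch below assumes the Hermes value appears as a `context_length:` line in config.yaml and that the Ollama value comes from a `PARAMETER num_ctx` line in a Modelfile. The helper names are hypothetical, it uses a regex rather than a YAML parser, and it does not see a value set through the OLLAMA_CONTEXT_LENGTH environment variable.

```python
import re
from typing import Optional


def hermes_context_length(config_text: str) -> Optional[int]:
    """Find an uncommented `context_length: N` line in config.yaml text."""
    match = re.search(r"^\s*context_length:\s*(\d+)", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def ollama_num_ctx(modelfile_text: str) -> Optional[int]:
    """Find an uncommented `PARAMETER num_ctx N` line in Modelfile text."""
    match = re.search(
        r"^\s*PARAMETER\s+num_ctx\s+(\d+)",
        modelfile_text,
        re.MULTILINE | re.IGNORECASE,
    )
    return int(match.group(1)) if match else None


def contexts_match(config_text: str, modelfile_text: str) -> bool:
    """True only when both values are present and equal."""
    hermes = hermes_context_length(config_text)
    ollama = ollama_num_ctx(modelfile_text)
    return hermes is not None and hermes == ollama
```

The Modelfile text for a model can be printed with `ollama show --modelfile <model>`. If no `num_ctx` line is present, the check reports a mismatch, which is the safe default when the Ollama side has not been set explicitly.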

Performance tuning

Apple Silicon (M1/M2/M3/M4)

  • Qwen 3.5 27B Q4 runs well on 24 GB unified memory.
  • Close memory-heavy apps before long agent sessions.
  • ollama ps shows how each loaded model is split between CPU and GPU; 100% GPU is fastest.
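
To check placement from a script, the sketch below reads the processor field of one `ollama ps` row. It assumes the field looks like `100% GPU`, `100% CPU`, or `48%/52% CPU/GPU`; the exact column layout can differ between Ollama versions, and the sample rows are illustrative, not captured from a real machine.

```python
import re
from typing import Optional


def gpu_percent(ps_line: str) -> Optional[int]:
    """Return the share of a model on GPU (0-100) from one `ollama ps` row."""
    # Split placement, e.g. "48%/52% CPU/GPU": the second number is the GPU share.
    split = re.search(r"(\d+)%/(\d+)% CPU/GPU", ps_line)
    if split:
        return int(split.group(2))
    # Single-device placement, e.g. "100% GPU" or "100% CPU".
    single = re.search(r"(\d+)% (GPU|CPU)\b", ps_line)
    if single:
        return int(single.group(1)) if single.group(2) == "GPU" else 0
    return None  # header row or an unrecognised format


# Illustrative rows with a made-up model name and ID.
SAMPLE_ROWS = [
    "example-model:27b  0123456789ab  17 GB  100% GPU         4 minutes from now",
    "example-model:27b  0123456789ab  17 GB  48%/52% CPU/GPU  4 minutes from now",
]
for row in SAMPLE_ROWS:
    print(gpu_percent(row))
```

Anything below 100 means part of the model is running on CPU, which usually points to memory pressure from other apps or a model that is too large for the machine.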

NVIDIA GPU

  • Ollama auto-offloads layers. More VRAM = larger models.
  • To keep the model loaded between requests, set OLLAMA_KEEP_ALIVE=-1.

CPU only

  • Expect 2-5 tokens/sec from a 27B model on CPU; the Apple Silicon GPU is much faster.
  • Use a 9B model for interactive work.
  • Hermes relaxes timeouts for local endpoints automatically. Set HERMES_STREAM_READ_TIMEOUT=1800 (for example with export) only if you still hit timeouts.

Reducing context pressure

If you hit context limits even at 64k:

  • Start fresh sessions for unrelated tasks.
  • Disable unused toolsets in config.
  • Ask Hermes to summarize before continuing long threads.
  • Use smaller models for simple tasks (less memory and compute per token).

Benchmark your setup

Give Hermes this prompt and note how long it takes:

text
Read every .py file in ~/Projects/FDE/hermes-course, count total lines,
and list the 3 largest files by line count.

Time          Verdict
Under 30s     Great setup
30-90s        Acceptable for offline
Over 2 min    Consider smaller model or GPU upgrade
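
To check the agent's answer as well as its speed, the sketch below computes the same result directly and maps an elapsed time to the verdicts above. It assumes "every .py file" means a recursive search, and it treats the unlisted 90-120s range like the last row; both are assumptions, and the function names are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def largest_py_files(root: Path, top: int = 3) -> Tuple[int, List[Tuple[Path, int]]]:
    """Return (total_lines, largest files) for .py files under root, recursively."""
    counts = []
    for path in root.rglob("*.py"):
        # errors="replace" keeps a badly encoded file from aborting the count.
        with path.open(encoding="utf-8", errors="replace") as handle:
            counts.append((path, sum(1 for _ in handle)))
    counts.sort(key=lambda item: item[1], reverse=True)
    return sum(lines for _, lines in counts), counts[:top]


def verdict(seconds: float) -> str:
    """Map elapsed seconds to the verdict table."""
    if seconds < 30:
        return "Great setup"
    if seconds <= 90:
        return "Acceptable for offline"
    # Assumption: the table does not cover 90-120s; treated as the last row.
    return "Consider smaller model or GPU upgrade"
```

For example, `largest_py_files(Path.home() / "Projects/FDE/hermes-course")` returns the total line count and the three largest files to compare against what Hermes reported.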

Next: Configuration overview.

Personal learning notes on Hermes Agent. Not affiliated with Nous Research. Verify against official docs.