Context length & performance
Context length and inference speed are the two knobs that matter most for a good local Hermes experience.
Why 64k context is required
Hermes agentic sessions include:
- System prompt and SOUL.md
- Tool definitions (60+ tools = lots of tokens)
- Conversation history
- Tool call results (file contents, command output)
- Memory injections
All of this adds up fast. With Ollama's default 2048 tokens, Hermes runs out of room after 1-2 tool calls.
Minimum: 64,000 tokens. I use 65,536.
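To see why a small window fills up so quickly, a rough budget helps. The sketch below is illustrative only: the per-component token counts are placeholder assumptions, not measurements from Hermes, and `remaining_tokens` is a hypothetical helper. Substitute your own estimates.

```python
# Rough context budget. All token counts are illustrative assumptions,
# not measured values from Hermes; replace them with your own estimates.
FIXED_OVERHEAD = {
    "system_prompt_and_soul": 3_000,  # assumed
    "tool_definitions": 12_000,       # assumed: 60 tools at ~200 tokens each
    "memory_injections": 1_000,       # assumed
}


def remaining_tokens(context_length: int, history_tokens: int = 0) -> int:
    """Tokens left for tool results and replies after fixed overhead and history."""
    used = sum(FIXED_OVERHEAD.values()) + history_tokens
    return context_length - used


for ctx in (2_048, 65_536):
    print(f"{ctx:>6} context -> {remaining_tokens(ctx):>7} tokens left")
```

A negative result means the fixed overhead alone does not fit in the window, before any conversation history or tool output is added.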
Setting context in both places
Context must match in Ollama and Hermes:
yaml
# ~/.hermes/config.yaml
model:
  context_length: 65536
bash
# Ollama
OLLAMA_CONTEXT_LENGTH=65536 ollama serve
# or Modelfile: PARAMETER num_ctx 65536
Mismatch symptom
If Hermes assumes a 128k context but Ollama only serves 2048 tokens, the agent "forgets" earlier tool results mid-task and starts looping or hallucinating. Always verify both sides.
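One way to catch a mismatch before it bites is a small pre-flight check. This is a minimal sketch, not part of Hermes: `hermes_context_length` and `context_mismatch` are hypothetical helpers, the regex assumes `context_length` appears as a plain integer in the config, and the Ollama value is whatever you set through `OLLAMA_CONTEXT_LENGTH` or `num_ctx`.

```python
import re
from typing import Optional


def hermes_context_length(config_text: str) -> Optional[int]:
    """Return the context_length value found in Hermes config text, if any."""
    match = re.search(
        r"^\s*context_length:\s*(\d+)\s*(?:#.*)?$", config_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def context_mismatch(config_text: str, ollama_num_ctx: int) -> Optional[str]:
    """Return a warning string if the two sides disagree, else None."""
    configured = hermes_context_length(config_text)
    if configured is None:
        return "context_length not found in Hermes config"
    if configured != ollama_num_ctx:
        return f"mismatch: Hermes expects {configured}, Ollama serves {ollama_num_ctx}"
    return None


# Inline sample config; in practice, read the text of ~/.hermes/config.yaml.
example = "model:\n  context_length: 65536\n"
print(context_mismatch(example, 65536))
print(context_mismatch(example, 2048))
```

The function returns `None` when both sides agree, so it can be dropped into a startup script that aborts on any non-empty warning.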
Performance tuning
Apple Silicon (M1/M2/M3/M4)
- Qwen 3.5 27B Q4 runs well on 24 GB unified memory.
- Close memory-heavy apps before long agent sessions.
- `ollama ps` shows how the loaded model is split between CPU and GPU; 100% GPU is fastest.
NVIDIA GPU
- Ollama auto-offloads layers. More VRAM = larger models.
- For persistent serving, set `OLLAMA_KEEP_ALIVE=-1` so the model stays loaded.
CPU only
- Expect 2-5 tokens/sec on 27B (CPU). Apple Silicon GPU is much faster.
- Use a 9B model for interactive work.
- Hermes relaxes timeouts for local endpoints automatically; only set `export HERMES_STREAM_READ_TIMEOUT=1800` if you still hit timeouts.
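To put the CPU figures in perspective, the arithmetic below converts a decode rate into wall-clock time. It is a simplified sketch: `generation_seconds` is a hypothetical helper, the 500-token reply length is an assumed example, and prompt processing time is ignored.

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate output_tokens at a steady decode rate (ignores prompt processing)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


# A 500-token reply (an assumed, illustrative length) at the CPU range above.
for rate in (2, 5):
    print(f"{rate} tok/s -> {generation_seconds(500, rate):.0f} s")
```

At 2 to 5 tokens/sec, a single 500-token reply takes roughly 100 to 250 seconds, which is why a smaller model is the better choice for interactive work on CPU.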
Reducing context pressure
If you hit context limits even at 64k:
- Start fresh sessions for unrelated tasks.
- Disable unused toolsets in config.
- Ask Hermes to summarize before continuing long threads.
- Use smaller models for simple tasks (less overhead per token).
Benchmark your setup
Give Hermes this prompt and note how long it takes to finish:
text
Read every .py file in ~/Projects/FDE/hermes-course, count total lines,
and list the 3 largest files by line count.
| Time | Verdict |
|---|---|
| Under 30s | Great setup |
| 30-90s | Acceptable for offline |
| Over 2 min | Consider smaller model or GPU upgrade |
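Speed only matters if the answer is right, so it can help to compute the expected result independently and compare it with what Hermes reports. The following is a minimal reference sketch, not part of Hermes; `count_lines` and `py_line_report` are hypothetical helper names.

```python
from pathlib import Path
from typing import List, Tuple


def count_lines(path: Path) -> int:
    """Count lines in a file, replacing undecodable bytes instead of failing."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def py_line_report(root: Path, top: int = 3) -> Tuple[int, List[Tuple[str, int]]]:
    """Return (total lines, the `top` largest .py files by line count) under root."""
    if not root.is_dir():
        raise NotADirectoryError(str(root))
    # Sort paths first so ties in line count are broken alphabetically.
    counts = [
        (str(path.relative_to(root)), count_lines(path))
        for path in sorted(root.rglob("*.py"))
        if path.is_file()
    ]
    largest = sorted(counts, key=lambda item: item[1], reverse=True)[:top]
    return sum(lines for _, lines in counts), largest
```

Call it as `py_line_report(Path.home() / "Projects" / "FDE" / "hermes-course")` and check the total and the three file names against the agent's output.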
Next: Configuration overview.