Why run locally

Running Hermes with a local model via Ollama is the setup this handbook is built around. Here is why I chose it.

Benefits

Benefit | What it means in practice
Privacy | Conversations, memory, and files never leave your machine
Zero API cost | No per-token billing after hardware is paid for
Offline capable | Works without internet (except web tools you choose to enable)
Full control | Pick the model, context size, and what tools are allowed
No vendor lock-in | Swap models anytime; Hermes config stays the same

Trade-offs (be honest)

Trade-off | Reality
Speed | Local Qwen 3.5 27B on CPU: 2-5 tok/s. Cloud APIs are 10-50x faster
Quality ceiling | Smaller local models miss nuance that frontier cloud models catch
Hardware limits | Your RAM/VRAM caps model size and context length
Setup effort | Context length, tool support, and timeouts need tuning
No built-in web by default | Offline = no live search unless you add it back
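
To make the speed and hardware rows concrete, here is a rough back-of-the-envelope sketch. The 500-token reply length and the 4-bit quantization are my own illustrative assumptions, not figures from Hermes or Ollama, and the memory estimate covers the weights only (no KV cache, no runtime overhead).

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to produce `tokens` output tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal GB.

    Ignores the KV cache and runtime overhead, so real usage is higher.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    reply_tokens = 500  # hypothetical reply length
    for rate in (2, 5):  # the CPU range quoted in the table
        seconds = generation_seconds(reply_tokens, rate)
        print(f"{rate} tok/s: {seconds:.0f} s for {reply_tokens} tokens")
    # Assumes 4-bit quantization, which is an illustrative choice.
    print(f"27B at 4-bit: {weight_memory_gb(27, 4):.1f} GB of weights")
```

At 2-5 tok/s a 500-token reply takes roughly 100 to 250 seconds, which is why timeouts show up in the setup-effort row.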

When local makes sense

  • Sensitive documents, code, or personal data that you don't want to send to third-party APIs.
  • Experimentation and learning without burning credits.
  • Always-on agent on a home server or laptop.
  • Air-gapped or restricted networks.

When to add a cloud fallback

Keep local as default, but configure a fallback for:

  • Complex reasoning that local models struggle with.
  • Tasks needing frontier-quality writing or analysis.
  • Long multi-step jobs where speed matters.

See Complete offline setup for hybrid config.
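
The routing rule described above can be sketched in a few lines. This is only an illustration of the decision, not how Hermes implements fallback: the `Task` fields and the `pick_backend` function are hypothetical names I made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a request; not a Hermes type."""
    prompt: str
    sensitive: bool = False               # data that must stay on this machine
    needs_frontier_quality: bool = False  # complex reasoning, polished writing
    speed_critical: bool = False          # long multi-step job where speed matters


def pick_backend(task: Task) -> str:
    """Return "local" by default and "cloud" only for the fallback cases."""
    if task.sensitive:
        # Privacy wins: sensitive work never leaves the machine.
        return "local"
    if task.needs_frontier_quality or task.speed_critical:
        return "cloud"
    return "local"


if __name__ == "__main__":
    print(pick_backend(Task("summarize my notes")))
    print(pick_backend(Task("plan a refactor", needs_frontier_quality=True)))
```

Checking the sensitive flag first is a deliberate choice: it keeps the privacy benefit intact even when a task would otherwise qualify for the cloud.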

Alternatives to Ollama

Hermes works with any OpenAI-compatible server:

Server | Best for
Ollama | Easiest local setup, macOS/Linux
vLLM | Production GPU serving, high throughput
llama.cpp server | Minimal resource usage, CPU inference
SGLang | Multi-GPU, high performance
LocalAI | Docker-based, multiple backends

All of them use the same Hermes config pattern: set provider: custom and point base_url at the server.
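
Before putting a base_url into the Hermes config, it can help to confirm that the server really answers on it. The sketch below asks an OpenAI-compatible server for its model list using only the Python standard library. The URLs in `DEFAULT_BASE_URLS` are the usual out-of-the-box ports for each project as far as I know; treat them as assumptions and adjust them to your install. The helper names are mine, not part of Hermes.

```python
import json
import urllib.request

# Common default endpoints (assumptions; every port is configurable).
DEFAULT_BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
    "llama.cpp": "http://localhost:8080/v1",
}


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}]} reply."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query the server and return the ids of the models it exposes."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


if __name__ == "__main__":
    # Requires a running server; raises URLError if nothing is listening.
    print(list_models(DEFAULT_BASE_URLS["ollama"]))
```

If the call returns the model you expect, the same URL is what goes into base_url.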

Next: Ollama setup in detail.

Personal learning notes on Hermes Agent. Not affiliated with Nous Research. Verify against official docs.