Why run locally

Running Hermes with a local model via Ollama is the setup this handbook is built around. Here is why I chose it.

Benefits

Benefit	What it means in practice
Privacy	Conversations, memory, and files never leave your machine
Zero API cost	No per-token billing after hardware is paid for
Offline capable	Works without internet (except web tools you choose to enable)
Full control	Pick the model, context size, and what tools are allowed
No vendor lock-in	Swap models anytime; Hermes config stays the same

Trade-off	Reality
Speed	Local Qwen 3.5 27B on CPU: 2-5 tok/s. Cloud APIs are 10-50x faster
Quality ceiling	Smaller local models miss nuance that frontier cloud models catch
Hardware limits	Your RAM/VRAM caps model size and context length
Setup effort	Context length, tool support, and timeouts need tuning
No built-in web by default	Offline = no live search unless you add it back

Keep local as default, but configure a fallback for:

See Complete offline setup for hybrid config.

Hermes works with any OpenAI-compatible server:

Server	Best for
Ollama	Easiest local setup, macOS/Linux
vLLM	Production GPU serving, high throughput
llama.cpp server	Minimal resource usage, CPU inference
SGLang	Multi-GPU, high performance
LocalAI	Docker-based, multiple backends

All use the same Hermes config pattern: provider: custom + base_url.

Next: Ollama setup in detail.