Appearance
Why run locally
Running Hermes with a local model via Ollama is the setup this handbook is built around. Here is why I chose it.
Benefits
| Benefit | What it means in practice |
|---|---|
| Privacy | Conversations, memory, and files never leave your machine |
| Zero API cost | No per-token billing after hardware is paid for |
| Offline capable | Works without internet (except web tools you choose to enable) |
| Full control | Pick the model, context size, and what tools are allowed |
| No vendor lock-in | Swap models anytime; Hermes config stays the same |
Trade-offs (be honest)
| Trade-off | Reality |
|---|---|
| Speed | Local Qwen 3.5 27B on CPU: 2-5 tok/s. Cloud APIs are 10-50x faster |
| Quality ceiling | Smaller local models miss nuance that frontier cloud models catch |
| Hardware limits | Your RAM/VRAM caps model size and context length |
| Setup effort | Context length, tool support, and timeouts need tuning |
| No built-in web by default | Offline = no live search unless you add it back |
When local makes sense
- Sensitive documents, code, or personal data you won't send to APIs.
- Experimentation and learning without burning credits.
- Always-on agent on a home server or laptop.
- Air-gapped or restricted networks.
When to add a cloud fallback
Keep local as default, but configure a fallback for:
- Complex reasoning that local models struggle with.
- Tasks needing frontier-quality writing or analysis.
- Long multi-step jobs where speed matters.
See Complete offline setup for hybrid config.
Alternatives to Ollama
Hermes works with any OpenAI-compatible server:
| Server | Best for |
|---|---|
| Ollama | Easiest local setup, macOS/Linux |
| vLLM | Production GPU serving, high throughput |
| llama.cpp server | Minimal resource usage, CPU inference |
| SGLang | Multi-GPU, high performance |
| LocalAI | Docker-based, multiple backends |
All use the same Hermes config pattern: provider: custom + base_url.
Next: Ollama setup in detail.