Complete offline setup
This is the core of my setup: Hermes Agent running entirely on local hardware with Ollama and Qwen 3.5. No API keys, no subscriptions, no data leaving my machine.
Architecture
```text
You (CLI / Telegram)
        │
        ▼
Hermes Agent  ←── ~/.hermes/ (config, memory, skills)
        │
        ▼
Ollama server ←── http://localhost:11434/v1
        │
        ▼
Qwen 3.5 (on CPU or GPU)
```
Step 1: Install and start Ollama
```bash
# Install (macOS)
brew install ollama
# Start the server
ollama serve
```
Pull Qwen 3.5 (native tool calling, vision, 256K context):
```bash
ollama pull qwen3.5:27b
```
Or use the official one-liner:
```bash
ollama launch hermes --model qwen3.5:27b
```
Verify:
```bash
ollama list
curl http://localhost:11434/v1/models
```
Step 2: Fix context length (critical)
Qwen 3.5 supports a 256K-token context natively, but Ollama only uses the context length you configure. Set it to at least 64,000 tokens for Hermes agentic work.
Option A: Environment variable (recommended)
```bash
export OLLAMA_CONTEXT_LENGTH=65536
ollama serve
```
Option B: Custom Modelfile (persistent per model)
```bash
cat > Modelfile << 'EOF'
FROM qwen3.5:27b
PARAMETER num_ctx 65536
EOF
ollama create qwen3.5-64k -f Modelfile
```
Option C: systemd (Linux, persistent across reboots)
```bash
sudo systemctl edit ollama.service
# In the editor that opens, add under [Service]:
#   Environment="OLLAMA_CONTEXT_LENGTH=65536"
sudo systemctl daemon-reload && sudo systemctl restart ollama
```
Most common offline failure
If Hermes starts but can't use tools or loses context mid-task, context length is almost always the cause. Set 64k before anything else.
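Before configuring Hermes, a small preflight can catch both of these problems early. The sketch below is not part of Hermes or Ollama: `ctx_ok` and `has_model` are hypothetical helper names, and `has_model` assumes `/v1/models` returns the usual OpenAI-style JSON with an `"id"` field per model.

```shell
# Hypothetical preflight helpers (not shipped with Hermes or Ollama).

# ctx_ok VALUE: succeed if VALUE is an integer of at least 64000 tokens,
# the minimum this guide recommends for agentic work.
ctx_ok() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
  esac
  [ "$1" -ge 64000 ]
}

# has_model NAME: read a /v1/models JSON body on stdin and succeed if NAME
# appears as a model id. Assumes OpenAI-style output: {"data":[{"id":"..."}]}.
has_model() {
  # Strip whitespace so pretty-printed and compact JSON match the same way.
  tr -d '[:space:]' | grep -Fq "\"id\":\"$1\""
}
```

For example, `ctx_ok "${OLLAMA_CONTEXT_LENGTH:-0}" || echo "context length below 64000"` checks the value in the current shell, which only reflects the server if `ollama serve` was started from that shell (Option A). With the server running, `curl -s http://localhost:11434/v1/models | has_model qwen3.5-64k || echo "model not served"` confirms that the Option B model exists.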
Step 3: Configure Hermes
Interactive:
```bash
hermes model
# Select: Custom endpoint (self-hosted / VLLM / etc.)
# Base URL: http://localhost:11434/v1
# API key: (leave empty or type "ollama")
# Model: qwen3.5-64k
#   (created in Option B; use qwen3.5:27b if you chose Option A or C)
# Context length: 64000
```
Or edit ~/.hermes/config.yaml directly:
```yaml
model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 64000
```
Step 4: Keep the model loaded (optional)
Ollama unloads models after 5 minutes of idle time. For a persistent gateway bot:
```bash
export OLLAMA_KEEP_ALIVE=-1 # never unload
ollama serve
```
Step 5: Verify offline agentic mode
```bash
hermes
```
```text
Create a file called /tmp/hermes-test.txt with today's date
and the text "offline mode works". Then read it back to confirm.
```
If Hermes creates and reads the file, your offline agentic setup is working.
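For a repeatable version of this check, the result can also be confirmed from the shell rather than by eye. This is a minimal sketch, assuming the agent wrote the file at the path given in the prompt. `check_hermes_test` is a hypothetical helper name, and it looks only for the fixed phrase because the date format Hermes chooses is not specified.

```shell
# check_hermes_test [FILE]: succeed if FILE exists and contains the phrase
# requested in the test prompt. Defaults to the path used in the prompt.
check_hermes_test() {
  # grep exits non-zero both when the phrase is absent and when FILE is missing.
  grep -Fq "offline mode works" "${1:-/tmp/hermes-test.txt}" 2>/dev/null
}
```

Running `check_hermes_test && echo "offline agent OK"` after the session gives a pass/fail signal that is easy to reuse in a script.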
Performance expectations
| Hardware | Speed | Notes |
|---|---|---|
| Apple Silicon (M2/M3/M4, 24GB+) | ~15-35 tok/s | Good for qwen3.5:27b |
| NVIDIA GPU (16GB+) | ~20-40 tok/s | Comfortable for 27B |
| CPU only | ~2-5 tok/s | Works, but 30-120s per response |
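The response times in the CPU row follow from generation speed: seconds ≈ output tokens ÷ tokens per second. A rough sketch of that arithmetic, ignoring prompt processing and model load time; `est_seconds` is a hypothetical helper, and the token counts in the example are illustrative rather than measured.

```shell
# est_seconds TOKENS TOK_PER_SEC: estimated generation time in whole seconds,
# rounded up. Both arguments must be positive integers.
est_seconds() {
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

For instance, `est_seconds 240 2` prints `120` and `est_seconds 150 5` prints `30`, which brackets the 30-120s range quoted for CPU-only inference.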
Hermes auto-detects local endpoints and relaxes streaming timeouts, so slow CPU responses are usually fine out of the box. If you still hit timeouts on very large contexts, raise the streaming read timeout:
```bash
export HERMES_STREAM_READ_TIMEOUT=1800
```
Optional: hybrid fallback
If a local model fails on hard tasks, configure a cloud fallback without changing your default:
```yaml
model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 64000
fallback_providers:
  - provider: openrouter
    model: anthropic/claude-sonnet-4
```
Local first, cloud only when needed.
My working config (reference)
```yaml
model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 65536
```
With Ollama started via:
```bash
OLLAMA_CONTEXT_LENGTH=65536 OLLAMA_KEEP_ALIVE=-1 ollama serve
```
Next: understand why to use local models and how to choose the right one.