Skip to content

Complete offline setup

This is the core of my setup: Hermes Agent running entirely on local hardware with Ollama and Qwen 3.5. No API keys, no subscriptions, no data leaving my machine.

Architecture

text
You (CLI / Telegram)


  Hermes Agent  ←── ~/.hermes/ (config, memory, skills)


  Ollama server  ←── http://localhost:11434/v1


  Qwen 3.5  (on CPU or GPU)

Step 1: Install and start Ollama

bash
# Install (macOS)
brew install ollama

# Start server
ollama serve

Pull Qwen 3.5 (native tool calling, vision, 256K context):

bash
ollama pull qwen3.5:27b

Or use the official one-liner:

bash
ollama launch hermes --model qwen3.5:27b

Verify:

bash
ollama list
curl http://localhost:11434/v1/models

Step 2: Fix context length (critical)

Qwen 3.5 supports 256K natively, but configure at least 64,000 for Hermes agentic work.

Option A: Environment variable (recommended)

bash
export OLLAMA_CONTEXT_LENGTH=65536
ollama serve

Option B: Custom Modelfile (persistent per model)

bash
cat > Modelfile << 'EOF'
FROM qwen3.5:27b
PARAMETER num_ctx 65536
EOF

ollama create qwen3.5-64k -f Modelfile

Option C: systemd (Linux, persistent across reboots)

bash
sudo systemctl edit ollama.service
# Add: Environment="OLLAMA_CONTEXT_LENGTH=65536"
sudo systemctl daemon-reload && sudo systemctl restart ollama

Most common offline failure

If Hermes starts but can't use tools or loses context mid-task, context length is almost always the cause. Set 64k before anything else.

Step 3: Configure Hermes

Interactive:

bash
hermes model
# Select: Custom endpoint (self-hosted / VLLM / etc.)
# Base URL: http://localhost:11434/v1
# API key: (leave empty or type "ollama")
# Model: qwen3.5-64k
# Context length: 64000

Or edit ~/.hermes/config.yaml directly:

yaml
model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 64000

Step 4: Keep the model loaded (optional)

Ollama unloads models after 5 minutes of idle time. For a persistent gateway bot:

bash
export OLLAMA_KEEP_ALIVE=-1   # never unload
ollama serve

Step 5: Verify offline agentic mode

bash
hermes
text
Create a file called /tmp/hermes-test.txt with today's date
and the text "offline mode works". Then read it back to confirm.

If Hermes creates and reads the file, your offline agentic setup is working.

Performance expectations

HardwareSpeedNotes
Apple Silicon (M2/M3/M4, 24GB+)~15-35 tok/sGood for qwen3.5:27b
NVIDIA GPU (16GB+)~20-40 tok/sComfortable for 27B
CPU only~2-5 tok/sWorks, but 30-120s per response

Hermes auto-detects local endpoints and relaxes streaming timeouts, so slow CPU responses are usually fine out of the box. If you still hit timeouts on very large contexts, raise the streaming read timeout:

bash
export HERMES_STREAM_READ_TIMEOUT=1800

Optional: hybrid fallback

If a local model fails on hard tasks, configure a cloud fallback without changing your default:

yaml
model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 64000
fallback_providers:
  - provider: openrouter
    model: anthropic/claude-sonnet-4

Local first, cloud only when needed.

My working config (reference)

yaml
model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 65536

With Ollama started via:

bash
OLLAMA_CONTEXT_LENGTH=65536 OLLAMA_KEEP_ALIVE=-1 ollama serve

Next: understand why local models and how to choose the right one.

Personal learning notes on Hermes Agent. Not affiliated with Nous Research. Verify against official docs.