Complete offline setup

This is the core of my setup: Hermes Agent running entirely on local hardware with Ollama and Qwen 3.5. No API keys, no subscriptions, no data leaving my machine.

Architecture

text

You (CLI / Telegram)
        │
        ▼
  Hermes Agent  ←── ~/.hermes/ (config, memory, skills)
        │
        ▼
  Ollama server  ←── http://localhost:11434/v1
        │
        ▼
  Qwen 3.5  (on CPU or GPU)

Step 1: Install and start Ollama

bash

# Install (macOS)
brew install ollama

# Start server
ollama serve

Pull Qwen 3.5 (native tool calling, vision, 256K context):

bash

ollama pull qwen3.5:27b

Or use the official one-liner:

bash

ollama launch hermes --model qwen3.5:27b

Verify:

bash

ollama list
curl http://localhost:11434/v1/models

Step 2: Fix context length (critical)

Qwen 3.5 supports 256K natively, but configure at least 64,000 for Hermes agentic work.

Option A: Environment variable (recommended)

bash

export OLLAMA_CONTEXT_LENGTH=65536
ollama serve

Option B: Custom Modelfile (persistent per model)

bash

cat > Modelfile << 'EOF'
FROM qwen3.5:27b
PARAMETER num_ctx 65536
EOF

ollama create qwen3.5-64k -f Modelfile

Option C: systemd (Linux, persistent across reboots)

bash

sudo systemctl edit ollama.service
# Add: Environment="OLLAMA_CONTEXT_LENGTH=65536"
sudo systemctl daemon-reload && sudo systemctl restart ollama

Most common offline failure

If Hermes starts but can't use tools or loses context mid-task, context length is almost always the cause. Set 64k before anything else.

Step 3: Configure Hermes

Interactive:

bash

hermes model
# Select: Custom endpoint (self-hosted / VLLM / etc.)
# Base URL: http://localhost:11434/v1
# API key: (leave empty or type "ollama")
# Model: qwen3.5-64k
# Context length: 64000

Or edit ~/.hermes/config.yaml directly:

yaml

model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 64000

Step 4: Keep the model loaded (optional)

Ollama unloads models after 5 minutes of idle time. For a persistent gateway bot:

bash

export OLLAMA_KEEP_ALIVE=-1   # never unload
ollama serve

Step 5: Verify offline agentic mode

bash

hermes

text

Create a file called /tmp/hermes-test.txt with today's date
and the text "offline mode works". Then read it back to confirm.

If Hermes creates and reads the file, your offline agentic setup is working.

Performance expectations

Hardware	Speed	Notes
Apple Silicon (M2/M3/M4, 24GB+)	~15-35 tok/s	Good for qwen3.5:27b
NVIDIA GPU (16GB+)	~20-40 tok/s	Comfortable for 27B
CPU only	~2-5 tok/s	Works, but 30-120s per response

Hermes auto-detects local endpoints and relaxes streaming timeouts, so slow CPU responses are usually fine out of the box. If you still hit timeouts on very large contexts, raise the streaming read timeout:

bash

export HERMES_STREAM_READ_TIMEOUT=1800

Optional: hybrid fallback

If a local model fails on hard tasks, configure a cloud fallback without changing your default:

yaml

model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 64000
fallback_providers:
  - provider: openrouter
    model: anthropic/claude-sonnet-4

Local first, cloud only when needed.

My working config (reference)

yaml

model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 65536

With Ollama started via:

bash

OLLAMA_CONTEXT_LENGTH=65536 OLLAMA_KEEP_ALIVE=-1 ollama serve

Next: understand why local models and how to choose the right one.

Complete offline setup ​

Architecture ​

Step 1: Install and start Ollama ​

Step 2: Fix context length (critical) ​

Step 3: Configure Hermes ​

Step 4: Keep the model loaded (optional) ​

Step 5: Verify offline agentic mode ​

Performance expectations ​

Optional: hybrid fallback ​

My working config (reference) ​

Complete offline setup

Architecture

Step 1: Install and start Ollama

Step 2: Fix context length (critical)

Step 3: Configure Hermes

Step 4: Keep the model loaded (optional)

Step 5: Verify offline agentic mode

Performance expectations

Optional: hybrid fallback

My working config (reference)