Ollama setup

Ollama is one of the simplest ways to run open-weight models locally. This page covers the setup I use daily with Hermes and Qwen 3.5.

Install

bash

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Start the server:

bash

ollama serve

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
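
When the server is started from a script, it helps to wait until it actually answers before pulling models or connecting a client. A minimal sketch, assuming the default port; `wait_for_ollama` and the `OLLAMA_URL` variable are names made up for this example, not part of Ollama:

```shell
# Poll the Ollama version endpoint until it responds or the timeout expires.
# OLLAMA_URL and wait_for_ollama are illustrative names, not Ollama features.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

wait_for_ollama() {
  timeout_s="${1:-30}"   # seconds to wait before giving up
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    # -f turns HTTP errors into a non-zero exit; output is discarded.
    if curl -fsS "$OLLAMA_URL/api/version" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Usage: wait_for_ollama 30 && echo "Ollama is up"
```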

Pull Qwen 3.5

From the Ollama Qwen 3.5 library:

bash

# Recommended daily driver (27B)
ollama pull qwen3.5:27b

# Lighter option (9B, default tag)
ollama pull qwen3.5:9b

# Best quality if you have 32 GB+ RAM (35B MoE)
ollama pull qwen3.5:35b

Or launch Hermes directly (official integration):

bash

ollama launch hermes --model qwen3.5:27b

List installed models:

bash

ollama list
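
For scripted setups, the same listing can be used to skip a pull when the model is already present. A small sketch; `model_installed` is a hypothetical helper, and it assumes the first column of `ollama list` is the model name including its tag (models created without a tag are listed as `name:latest`):

```shell
# Return 0 if the given model tag appears in `ollama list`, else 1.
# model_installed is an illustrative helper, not an Ollama command.
model_installed() {
  # Skip the header row and compare the NAME column exactly.
  ollama list | awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit !found }'
}

# Usage: model_installed qwen3.5:27b || ollama pull qwen3.5:27b
```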

Context length (do this before Hermes)

Qwen 3.5 supports a 256K context natively, but Ollama does not use the full window by default, and Hermes needs at least 64,000 tokens configured. Create a 64K variant (65,536 tokens):

bash

cat > Modelfile << 'EOF'
FROM qwen3.5:27b
PARAMETER num_ctx 65536
PARAMETER temperature 0.7
EOF

ollama create qwen3.5-64k -f Modelfile

Or set the context length server-wide, as the default for every model the server loads:

bash

OLLAMA_CONTEXT_LENGTH=65536 ollama serve
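
To confirm that a Modelfile variant really carries the larger window, the stored parameters can be read back with `ollama show --parameters`. A sketch, assuming that command prints one `name value` pair per line; `model_num_ctx` is a hypothetical helper name:

```shell
# Print the num_ctx value stored in a model, or nothing if it is unset.
# model_num_ctx is an illustrative helper name.
model_num_ctx() {
  ollama show "$1" --parameters | awk '$1 == "num_ctx" { print $2 }'
}

# Usage: [ "$(model_num_ctx qwen3.5-64k)" = "65536" ] && echo "context OK"
```

An empty result means the model has no `num_ctx` of its own and falls back to the server default.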

Connect Hermes

yaml

# ~/.hermes/config.yaml
model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 65536

Verify the connection:

bash

curl http://localhost:11434/v1/models
hermes model   # confirm settings match
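
The `/v1/models` response can also be checked automatically for the model name Hermes is configured with. A rough sketch that avoids a JSON parser; `model_served` and `OLLAMA_URL` are made-up names, and the pattern assumes the response lists each model under an `"id"` key, possibly with a `:latest` suffix:

```shell
# Return 0 if the OpenAI-compatible model list contains the given id.
# model_served and OLLAMA_URL are illustrative names.
model_served() {
  # Note: the name is used as a regex, so a "." in it matches any character.
  curl -fsS "${OLLAMA_URL:-http://localhost:11434}/v1/models" \
    | grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"$1(:latest)?\""
}

# Usage: model_served qwen3.5-64k || echo "model missing from /v1/models"
```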

GPU offloading

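Ollama auto-detects NVIDIA and Apple Silicon GPUs. On Apple Silicon, MLX variants are also available (e.g. qwen3.5:27b-mlx). To confirm a loaded model is actually on the GPU, check the PROCESSOR column of ollama ps: "100% GPU" means fully offloaded, while a CPU/GPU split means part of the model spilled into system memory.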

bash

ollama ps
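
The same check can be scripted, for example to warn before starting a long agent run. A sketch, assuming `ollama ps` prints the model name in the first column and the text `100% GPU` for a fully offloaded model; `fully_on_gpu` is a hypothetical helper:

```shell
# Return 0 if `ollama ps` reports the model as fully GPU-resident.
# fully_on_gpu is an illustrative helper name.
fully_on_gpu() {
  ollama ps | awk -v name="$1" '$1 == name && /100% GPU/ { ok = 1 } END { exit !ok }'
}

# Usage: fully_on_gpu qwen3.5-64k:latest || echo "model is partly on CPU"
```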

Keep model loaded

bash

# Never unload (good for always-on agents)
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Or preload manually
ollama run qwen3.5-64k
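
Preloading also works over the native API without opening an interactive session: a request to `/api/generate` with no prompt loads the model, and `keep_alive` controls how long it stays resident. A minimal sketch; `preload_model` and `OLLAMA_URL` are names invented for this example:

```shell
# Ask the server to load a model and keep it resident indefinitely.
# A generate request without a prompt only loads the model.
# preload_model and OLLAMA_URL are illustrative names.
preload_model() {
  curl -fsS "${OLLAMA_URL:-http://localhost:11434}/api/generate" \
    -d "{\"model\": \"$1\", \"keep_alive\": -1}"
}

# Usage: preload_model qwen3.5-64k
```

Setting `keep_alive` per request affects only that model, whereas `OLLAMA_KEEP_ALIVE` changes the default for the whole server.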

Quick health check

bash

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-64k",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'

If you get a response, Ollama is healthy. Next step: wire Hermes and test tool calling.
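
Tool calling can be smoke-tested against the same endpoint before involving Hermes. The sketch below offers the model a single tool and checks whether the reply contains a `tool_calls` field. `tool_call_smoke_test`, `OLLAMA_URL`, and the `get_weather` tool are all made up for illustration, and a model may legitimately answer in plain text, so a failure here is a hint rather than proof that tool calling is broken:

```shell
# Send one chat request that offers a single tool and report whether the
# reply contains a tool call. get_weather is a made-up tool for illustration.
tool_call_smoke_test() {
  curl -fsS "${OLLAMA_URL:-http://localhost:11434}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "'"$1"'",
      "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }]
    }' | grep -q '"tool_calls"'
}

# Usage: tool_call_smoke_test qwen3.5-64k && echo "tool calling works"
```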

Next: Choosing a model.
