Ollama setup
Ollama is the simplest way to run open-weight models locally. This page covers the setup I use daily with Hermes and Qwen 3.5.
Install
bash
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Start the server:
bash
ollama serve
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1.
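When the server is started from a script, it can take a moment before the endpoint accepts requests. A minimal sketch of a readiness check, assuming `curl` is installed; `wait_for_ollama` is a hypothetical helper name (not part of Ollama), and the default of 30 attempts with a one-second pause is an arbitrary choice:

```bash
# wait_for_ollama [BASE_URL] [ATTEMPTS]
# Polls BASE_URL/models until it answers successfully or ATTEMPTS run out.
# Hypothetical helper for illustration, not an Ollama command.
wait_for_ollama() {
  _url="${1:-http://localhost:11434/v1}"
  _tries="${2:-30}"
  _i=1
  while [ "$_i" -le "$_tries" ]; do
    # -f: fail on HTTP errors; -s: quiet; --max-time: never hang on one probe
    if curl -fs --max-time 2 -o /dev/null "${_url}/models"; then
      return 0
    fi
    # Pause between attempts, but not after the last one
    if [ "$_i" -lt "$_tries" ]; then
      sleep 1
    fi
    _i=$((_i + 1))
  done
  return 1
}
```

Something like `wait_for_ollama http://localhost:11434/v1 30 && ollama list` would then run the second command only once the API responds.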
Pull Qwen 3.5
From the Ollama Qwen 3.5 library:
bash
# Recommended daily driver (27B)
ollama pull qwen3.5:27b
# Lighter option (9B, default tag)
ollama pull qwen3.5:9b
# Best quality if you have 32 GB+ RAM (35B MoE)
ollama pull qwen3.5:35b
Or launch Hermes directly (official integration):
bash
ollama launch hermes --model qwen3.5:27b
List installed models:
bash
ollama list
Context length (do this before Hermes)
Qwen 3.5 supports a 256K context window natively, but Hermes needs at least 64,000 tokens of context configured in Ollama. Create a 64k variant:
bash
cat > Modelfile << 'EOF'
FROM qwen3.5:27b
PARAMETER num_ctx 65536
PARAMETER temperature 0.7
EOF
ollama create qwen3.5-64k -f Modelfile
Or set server-wide:
bash
OLLAMA_CONTEXT_LENGTH=65536 ollama serve
Connect Hermes
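Because Hermes expects at least 64,000 tokens, it may be worth confirming the context value baked into the model before editing the Hermes config. A minimal sketch: `ctx_from_modelfile` and `ctx_ok` are hypothetical helper names, and the parsing assumes the `PARAMETER num_ctx <n>` line format used in the Modelfile earlier on this page.

```bash
# ctx_from_modelfile: read Modelfile text on stdin and print the num_ctx value
# (prints nothing if there is no "PARAMETER num_ctx" line).
ctx_from_modelfile() {
  awk 'toupper($1) == "PARAMETER" && $2 == "num_ctx" { print $3; exit }'
}

# ctx_ok VALUE [MINIMUM]: succeed if VALUE is a whole number >= MINIMUM
# (MINIMUM defaults to 64000, the figure this page gives for Hermes).
ctx_ok() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
  esac
  [ "$1" -ge "${2:-64000}" ]
}

# Example with inline Modelfile text; against a live install, pipe in
# `ollama show --modelfile qwen3.5-64k` instead.
ctx=$(printf 'FROM qwen3.5:27b\nPARAMETER num_ctx 65536\n' | ctx_from_modelfile)
if ctx_ok "$ctx"; then
  echo "num_ctx=$ctx is enough for Hermes"
else
  echo "num_ctx=${ctx:-unset} is below 64000"
fi
```

If the context length was set server-wide through `OLLAMA_CONTEXT_LENGTH` instead, the Modelfile will likely carry no `num_ctx` line, and the check reports it as unset.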
yaml
# ~/.hermes/config.yaml
model:
  default: "qwen3.5-64k"
  provider: "custom"
  base_url: "http://localhost:11434/v1"
  context_length: 65536
Verify the connection:
bash
curl http://localhost:11434/v1/models
hermes model  # confirm settings match
GPU offloading
Ollama auto-detects NVIDIA and Apple Silicon GPUs. On Apple Silicon, MLX variants are also available (e.g. qwen3.5:27b-mlx).
bash
# The PROCESSOR column shows the CPU/GPU split for each loaded model
ollama ps
Keep model loaded
bash
# Never unload (good for always-on agents)
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Or preload manually
ollama run qwen3.5-64k
Quick health check
bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-64k",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'
If you get a response, Ollama is healthy. Next step: wire Hermes and test tool calling.
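For the tool-calling test, one option is to send a request with a `tools` array straight to the OpenAI-compatible endpoint before involving Hermes. A sketch under assumptions: the `get_time` tool is made up for illustration, `tool_call_payload` is a hypothetical helper name, and whether the model actually answers with a tool call depends on the model and its chat template.

```bash
# tool_call_payload MODEL: print a chat-completions request body that offers
# the model a single function tool. The get_time tool is invented for this test.
tool_call_payload() {
  cat <<EOF
{
  "model": "$1",
  "messages": [{"role": "user", "content": "What time is it in UTC?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_time",
      "description": "Return the current time for a timezone",
      "parameters": {
        "type": "object",
        "properties": {"timezone": {"type": "string"}},
        "required": ["timezone"]
      }
    }
  }]
}
EOF
}

# Send it to a running server (left commented so the helper can be sourced safely):
# tool_call_payload qwen3.5-64k |
#   curl -s http://localhost:11434/v1/chat/completions \
#     -H "Content-Type: application/json" -d @-
```

A reply whose first choice contains a `tool_calls` entry naming `get_time` would suggest the model and server handle tool calls; a plain-text answer means the model chose not to call the tool, or that its template does not support tools.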
Next: Choosing a model.