Voice mode
Hermes supports full voice interaction: talk to it with your microphone in the CLI, hear spoken replies, and even have live conversations in a Discord voice channel. The part I care about for an offline setup: both speech-to-text and text-to-speech have free, local, no-API-key options.
The offline voice stack
| Stage | Local option | Notes |
|---|---|---|
| Speech-to-text (STT) | faster-whisper | Free, runs locally, no key. Model (~150 MB for base) downloads on first use |
| Text-to-speech (TTS) | Edge TTS or NeuTTS | Edge is free with no key; NeuTTS runs fully local |
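With faster-whisper installed and Edge TTS or NeuTTS handling speech output, voice mode needs zero API keys, and your microphone audio never leaves your machine. One caveat: Edge TTS is an online Microsoft service, so reply text is sent out for synthesis. Choose NeuTTS if the setup must be strictly offline.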
Install
```bash
# CLI voice (microphone + playback)
cd ~/.hermes/hermes-agent && uv pip install -e ".[voice]"

# Local STT (no key)
pip install faster-whisper

# Local TTS (optional, fully offline); quoted so zsh does not glob the brackets
python -m pip install -U "neutts[all]"
```

System dependencies:
```bash
# macOS
brew install portaudio ffmpeg
brew install espeak-ng   # for NeuTTS

# Ubuntu/Debian
sudo apt install portaudio19-dev ffmpeg
sudo apt install espeak-ng   # for NeuTTS
```

CLI voice mode
Start the CLI, then:
```
/voice on       Enable voice mode
/voice tts      Toggle spoken replies
/voice status   Show current state
/voice off      Disable
```

How a turn flows: press Ctrl+B; a beep plays and recording starts. Speak, and after 3 seconds of silence recording stops automatically. Whisper transcribes locally, the agent replies, and if TTS is on, the reply is spoken sentence by sentence as it is generated. Recording then restarts automatically so you can keep talking hands-free.
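To make the auto-stop step concrete, here is a minimal sketch of silence-based stopping. It is not Hermes's actual implementation: the function names, the RMS loudness measure, and the threshold value of 200 are my own assumptions for illustration. Only the 3-second `silence_duration` comes from the config shown in this post.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square loudness of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def record_until_silence(
    frames: Iterable[Sequence[int]],
    frame_seconds: float,
    silence_threshold: float = 200.0,  # hypothetical value, not from the source
    silence_duration: float = 3.0,     # matches voice.silence_duration above
) -> List[Sequence[int]]:
    """Keep frames until `silence_duration` seconds of consecutive quiet frames."""
    kept: List[Sequence[int]] = []
    quiet_seconds = 0.0
    for frame in frames:
        kept.append(frame)
        if rms(frame) < silence_threshold:
            quiet_seconds += frame_seconds
            if quiet_seconds >= silence_duration:
                break  # enough continuous silence: stop recording
        else:
            quiet_seconds = 0.0  # speech resumed, restart the silence timer
    return kept


if __name__ == "__main__":
    loud, quiet = [1000] * 4, [0] * 4
    stream = [loud] * 2 + [quiet] * 3 + [loud] + [quiet] * 10
    print(len(record_until_silence(stream, frame_seconds=0.5)))
```

A short pause (the three quiet frames in the middle) does not end the recording, because the timer resets as soon as a loud frame arrives.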
Whisper sometimes hallucinates phantom text from silence (for example, "thanks for watching"). Hermes filters known hallucination phrases automatically.
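A filter like this can be sketched in a few lines. This is my guess at the general idea, not Hermes's code: the function names are hypothetical, and the phrase list contains only the example from this post plus one variant I added for illustration.

```python
import re

# Illustrative list only: the first phrase is the example from the text,
# the second is an assumed variant. The real list in Hermes may differ.
KNOWN_HALLUCINATIONS = {
    "thanks for watching",
    "thank you for watching",
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def filter_transcript(text: str) -> str:
    """Return '' when the whole transcript is a known phantom phrase."""
    if normalize(text) in KNOWN_HALLUCINATIONS:
        return ""
    return text.strip()


if __name__ == "__main__":
    print(repr(filter_transcript(" Thanks for watching! ")))
    print(repr(filter_transcript("Open the log file")))
```

Matching the whole normalized transcript, rather than searching for a substring, keeps real sentences that happen to contain the phrase.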
Configuration
```yaml
# ~/.hermes/config.yaml
voice:
  record_key: "ctrl+b"
  silence_duration: 3.0
  beep_enabled: true

stt:
  provider: "local"   # local (free) | groq | openai
  local:
    model: "base"     # tiny, base, small, medium, large-v3

tts:
  provider: "edge"    # edge (free) | neutts (local) | elevenlabs | openai
  edge:
    voice: "en-US-AriaNeural"
  neutts:
    model: neuphonic/neutts-air-q4-gguf
    device: cpu
```

Local Whisper models trade speed for quality: base is fast and good enough for commands; small or large-v3 are more accurate but slower on CPU.
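As a quick sanity check on a provider choice, the sketch below classifies an STT/TTS pair by whether it needs an API key and whether it runs fully on the machine. The function and set names are hypothetical; the provider names and their free/local labels are taken from the comments in the config above, and Edge TTS is treated as keyless but not local.

```python
from typing import Dict

# Provider labels as given in the config comments above.
KEYLESS_STT = {"local"}
KEYLESS_TTS = {"edge", "neutts"}
LOCAL_STT = {"local"}
LOCAL_TTS = {"neutts"}  # Edge TTS needs no key but is an online service


def classify(stt_provider: str, tts_provider: str) -> Dict[str, bool]:
    """Report whether a provider pair needs an API key and whether it is fully local."""
    return {
        "needs_api_key": (
            stt_provider not in KEYLESS_STT or tts_provider not in KEYLESS_TTS
        ),
        "fully_local": (
            stt_provider in LOCAL_STT and tts_provider in LOCAL_TTS
        ),
    }


if __name__ == "__main__":
    print(classify("local", "edge"))
    print(classify("local", "neutts"))
    print(classify("groq", "elevenlabs"))
```

For a strictly offline machine, the pair to aim for is the one where `fully_local` is true.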
Discord voice (optional, needs cloud Discord)
If you run a Discord gateway, the bot can join a voice channel, transcribe each speaker, and speak replies back. STT/TTS can still be local, but Discord itself is a cloud platform, so this breaks strict offline-only operation. Setup needs the Opus codec (brew install opus / apt install libopus0) and Connect + Speak bot permissions. Inside Discord: /voice join, /voice leave.
Troubleshooting
- "No audio device found": install PortAudio (see above).
- Whisper returns garbage: use a quieter room, raise silence_threshold, or try a larger STT model.
- Reply text but no speech: the TTS provider failed. Edge TTS (free, no key) is the default fallback; check the logs at ~/.hermes/logs/.
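Before digging through logs, it can help to confirm that the command-line tools from the install step are actually on your PATH. The helper below is a hypothetical sketch, not part of Hermes. It checks only the `ffmpeg` and `espeak-ng` binaries; PortAudio is a shared library rather than a command, so it cannot be detected this way.

```python
import shutil
from typing import Callable, Dict, Iterable, Optional

# Binaries installed in the system dependencies step; espeak-ng is only needed for NeuTTS.
TOOLS = ("ffmpeg", "espeak-ng")


def check_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Map each tool name to True if an executable with that name is on PATH."""
    return {name: which(name) is not None for name in names}


if __name__ == "__main__":
    for tool, found in check_tools(TOOLS).items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is what you want.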