Voice mode

Hermes supports full voice interaction: talk to it with your microphone in the CLI, hear spoken replies, and even have live conversations in a Discord voice channel. The part I care about for an offline setup: both speech-to-text and text-to-speech have free, local, no-API-key options.

The offline voice stack

| Stage | Local option | Notes |
|---|---|---|
| Speech-to-text (STT) | faster-whisper | Free, runs locally, no key. Model (~150 MB for base) downloads on first use |
| Text-to-speech (TTS) | Edge TTS or NeuTTS | Edge is free with no key; NeuTTS runs fully local |

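With faster-whisper installed and Edge or NeuTTS for TTS, voice mode works with zero API keys. Microphone audio is transcribed on your machine either way. For a fully offline setup choose NeuTTS: Edge TTS needs no key, but it sends the reply text to Microsoft's online service to synthesize speech.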

Install

```bash
# CLI voice (microphone + playback)
cd ~/.hermes/hermes-agent && uv pip install -e ".[voice]"

# Local STT (no key)
pip install faster-whisper

# Local TTS (optional, fully offline)
# Quote the extras so zsh does not treat [all] as a glob pattern
python -m pip install -U "neutts[all]"
```

System dependencies:

```bash
# macOS
brew install portaudio ffmpeg
brew install espeak-ng        # for NeuTTS

# Ubuntu/Debian
sudo apt install portaudio19-dev ffmpeg
sudo apt install espeak-ng    # for NeuTTS
```
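A quick way to confirm the install steps took effect is to check for the command-line tools and the Python package before starting voice mode. The sketch below is my own helper, not part of Hermes; it assumes `ffmpeg` and `espeak-ng` are the executable names on PATH and that `faster_whisper` is the import name of the faster-whisper package. PortAudio is a library rather than a command, so it is not covered here.

```python
import importlib.util
import shutil


def missing_tools(tools):
    """Return the command-line tools from `tools` that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def missing_modules(modules):
    """Return the top-level Python modules from `modules` that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # ffmpeg and espeak-ng come from the system packages listed above.
    print("missing tools:", missing_tools(["ffmpeg", "espeak-ng"]))
    print("missing modules:", missing_modules(["faster_whisper"]))
```

An empty list on both lines suggests the pieces are in place; anything listed points back to the matching install command.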

CLI voice mode

Start the CLI, then:

/voice on       Enable voice mode
/voice tts      Toggle spoken replies
/voice status   Show current state
/voice off      Disable

How a turn flows: press Ctrl+B; a beep plays and recording starts. Speak, and recording stops automatically after 3 seconds of silence (the silence_duration setting). Whisper transcribes locally, the agent replies, and if TTS is on, the reply is spoken sentence by sentence as it generates. Recording then restarts automatically so you can keep talking hands-free.
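The auto-stop step can be pictured as a running silence timer that resets whenever sound is heard. This is a minimal sketch of that idea, not Hermes's actual recorder: the RMS level check, the `threshold` value, and the function names are my own assumptions for illustration.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def frames_until_stop(frames, frame_seconds, silence_duration=3.0, threshold=0.01):
    """Return how many frames are recorded before auto-stop triggers.

    Recording stops once the trailing run of quiet frames lasts at least
    `silence_duration` seconds. Returns len(frames) if that never happens.
    `threshold` is a hypothetical level below which a frame counts as quiet.
    """
    quiet_seconds = 0.0
    for index, frame in enumerate(frames, start=1):
        if rms(frame) < threshold:
            quiet_seconds += frame_seconds
        else:
            quiet_seconds = 0.0  # any sound resets the silence timer
        if quiet_seconds >= silence_duration:
            return index
    return len(frames)
```

With half-second frames, two frames of speech followed by silence stop after the eighth frame: 1 second of speech plus 3 seconds of quiet.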

Whisper sometimes hallucinates phantom text from silence (for example, "thanks for watching"); Hermes filters known hallucination phrases automatically.
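A filter like this can be as simple as comparing the normalized transcript against a blocklist. The sketch below is an assumption about the general approach, not Hermes's code; only "thanks for watching" comes from these notes, and the second phrase is a made-up example.

```python
import string

# Hypothetical blocklist; the actual list Hermes uses is not shown in these notes.
KNOWN_HALLUCINATIONS = {
    "thanks for watching",
    "thank you for watching",
}


def normalize(text):
    """Lowercase, trim, and strip ASCII punctuation so variants compare equal."""
    stripped = text.strip().lower()
    return stripped.translate(str.maketrans("", "", string.punctuation)).strip()


def filter_transcript(text, blocklist=KNOWN_HALLUCINATIONS):
    """Return `text` unchanged, or "" when it is only a known phantom phrase."""
    return "" if normalize(text) in blocklist else text
```

Matching the whole transcript rather than a substring keeps real requests that happen to contain the phrase.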

Configuration

```yaml
# ~/.hermes/config.yaml
voice:
  record_key: "ctrl+b"
  silence_duration: 3.0
  beep_enabled: true

stt:
  provider: "local"      # local (free) | groq | openai
  local:
    model: "base"        # tiny, base, small, medium, large-v3

tts:
  provider: "edge"       # edge (free) | neutts (local) | elevenlabs | openai
  edge:
    voice: "en-US-AriaNeural"
  neutts:
    model: neuphonic/neutts-air-q4-gguf
    device: cpu
```
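For an offline setup, the two `provider` values are what matter. The sketch below classifies an already-parsed config dictionary; it is my own helper, and the provider groupings are assumptions read off the comments in the config (Edge needs no key but is an online service, so it does not count as offline here).

```python
# Assumption: groupings follow the provider comments in the config above.
LOCAL_STT = {"local"}
LOCAL_TTS = {"neutts"}
NO_KEY_TTS = {"edge", "neutts"}


def classify(config):
    """Report whether a voice config needs API keys or network access.

    A missing provider is treated as unknown, so it counts as neither
    key-free nor offline.
    """
    stt = config.get("stt", {}).get("provider", "")
    tts = config.get("tts", {}).get("provider", "")
    return {
        "needs_api_key": stt not in LOCAL_STT or tts not in NO_KEY_TTS,
        "fully_offline": stt in LOCAL_STT and tts in LOCAL_TTS,
    }
```

The config shown above (local STT, Edge TTS) comes out as key-free but not fully offline; switching `tts.provider` to `neutts` makes it both.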

Local Whisper models trade speed for quality: base is fast and good enough for commands; small or large-v3 are more accurate but slower on CPU.
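To compare model sizes outside Hermes, faster-whisper can be called directly. This is a minimal sketch using the library's `WhisperModel` class; how Hermes itself invokes it is not covered in these notes, and the `transcribe` and `join_segments` helper names are mine.

```python
def join_segments(segments):
    """Join transcribed segments into one line of text, skipping empty ones."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe(path, model_size="base"):
    """Transcribe an audio file on CPU with faster-whisper."""
    # Imported lazily so join_segments works without the package installed.
    from faster_whisper import WhisperModel

    # int8 keeps memory use low on CPU; the model downloads on first use.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path)
    return join_segments(segments)
```

Calling `transcribe` on the same recording with `"base"` and then `"small"` gives a feel for the speed and accuracy difference on your own hardware.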

Discord voice (optional, needs cloud Discord)

If you run a Discord gateway, the bot can join a voice channel, transcribe each speaker, and speak replies back. STT/TTS can still be local, but Discord itself is a cloud platform, so this breaks strict offline-only operation. Setup needs the Opus codec (brew install opus / apt install libopus0) and Connect + Speak bot permissions. Inside Discord: /voice join, /voice leave.

Troubleshooting

  • "No audio device found": install PortAudio (see above).
  • Whisper returns garbage: use a quieter room, raise silence_threshold, or try a larger STT model.
  • Reply text but no speech: the TTS provider failed. Edge TTS (free, no key) is the default fallback; check the logs at ~/.hermes/logs/.

Personal learning notes on Hermes Agent. Not affiliated with Nous Research. Verify against official docs.