Overview
When LLM post-processing is enabled, HootVoice sends each Whisper transcript to a local LLM endpoint so the text can be polished, made polite, or summarised automatically. The feature expects an OpenAI-compatible API such as Ollama or LM Studio.
The feature is disabled by default. Open Settings → LLM, turn on "Enable LLM post-processing", then configure the API base URL and model name to match your local server. For proofreading-focused workflows we recommend starting with gemma-3-12b-it quantised to Q4_K_M.
How it Works
- HootVoice transcribes your recording locally with Whisper.
- Once transcription finishes, it calls /v1/chat/completions on the configured LLM endpoint.
- The LLM response replaces the raw transcript in the log and clipboard.
- If auto-paste is enabled, the processed text is inserted into the frontmost app.
If the API fails or times out, HootVoice falls back to the original Whisper text. Logs capture HTTP status codes and any error payloads for quick diagnostics.
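For reference, the call follows the standard OpenAI chat-completions format, so you can reproduce it with curl. The sketch below targets the default Ollama base URL; the model name and prompt wording are placeholders, not the exact prompt HootVoice sends.

```bash
# Rough equivalent of the request HootVoice issues after transcription.
# Model name and prompt are illustrative – match them to your own settings.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:12b",
    "messages": [
      {"role": "system", "content": "Proofread the transcript and fix punctuation."},
      {"role": "user", "content": "this is the raw whisper transcript"}
    ]
  }'
```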
Setup Checklist
- An OpenAI-compatible server (Ollama or LM Studio) is running on your machine
- The target LLM model is downloaded and ready to serve
- curl requests to /v1/models or /v1/chat/completions succeed
- HootVoice settings reference the correct API base URL and model identifier
Using Ollama
Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1, which matches the default value in HootVoice.
macOS
- Install via brew install ollama (requires Homebrew).
- Run ollama run llama3.1:8b to download and cache the model.
- Keep the background service running with ollama serve or the Ollama menu bar app.
Windows
- Download the installer from ollama.com and complete setup.
- Open PowerShell and run ollama run llama3.1:8b to fetch the model.
- The service stays active in the background; manage it from the system tray.
Linux
- Run curl https://ollama.ai/install.sh | sh.
- Enable the user service with systemctl --user enable --now ollama.
- Download a model via ollama run llama3.1:8b and verify the API responds.
Test connectivity with:
curl http://localhost:11434/v1/models
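A reachable server answers with a JSON list of the installed models. The exact entries and fields depend on what you have pulled; the shape is roughly:

```
{
  "object": "list",
  "data": [
    { "id": "llama3.1:8b", "object": "model" }
  ]
}
```

The id values are the identifiers you would normally paste into HootVoice's model field.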
Using LM Studio
LM Studio offers a GUI for managing models and ships with an OpenAI-compatible server. The default port is 1234, so set HootVoice’s base URL to http://localhost:1234/v1.
macOS
- Download the DMG from the LM Studio website and install it.
- Open “Download Models” and grab the models you want.
- Click “Start Server” and enable the “OpenAI Compatible Server” option.
Windows
- Run the Windows installer with the default options.
- Download models from within the app, then switch to the “Server” tab.
- Press “Start Server” and enable auto-start if you need it on boot.
Linux
- Launch the AppImage or install the Debian package.
- Download a model, then toggle the server switch in the top-right corner.
- Allow inbound traffic on port 1234 if your firewall prompts.
Confirm the server is reachable:
curl http://localhost:1234/v1/models
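You can also exercise the chat endpoint directly. The model identifier below is a placeholder; use the exact identifier shown in LM Studio's local model list.

```bash
# Quick end-to-end test of LM Studio's OpenAI-compatible server.
# Replace the model value with the identifier LM Studio reports for your model.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b",
    "messages": [{"role": "user", "content": "Reply with OK if you can read this."}]
  }'
```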
Recommended Models
| Use case | Model | Notes |
|---|---|---|
| Japanese polishing & polite tone | google/gemma-3-12b (Ollama / LM Studio) | Excellent Japanese fluency; a 4-bit quantisation typically needs 10–12 GB VRAM. |
| English summaries | qwen2.5:7b-instruct, Phi-3.5-mini-instruct | Fast responses with concise outputs; ideal for meeting notes. |
| Maximum accuracy | llama3.1:70b or other large instruction-tuned models | Requires high-end GPU/VRAM; tune OLLAMA_NUM_PARALLEL as needed. |
Make sure the model identifier matches your runtime. Ollama lists models via ollama list, while LM Studio shows the identifier in the “Local Models” panel.
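With Ollama, for example, you can pull a model and copy the exact identifier it reports; the gemma3:12b tag below is an assumption, so use whatever tag your installation actually lists.

```bash
# Pull a model and confirm the identifier Ollama serves it under.
# The gemma3:12b tag is illustrative – copy the NAME column from `ollama list`
# into HootVoice's model field.
ollama pull gemma3:12b
ollama list
```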
Local Resource Requirements
Running Gemma-3-12B locally in 4-bit/QAT mode for proofreading tasks generally requires the following resources:
- CPU / OS: macOS 13.4+ (Apple Silicon M1/M2/M3/M4), Windows 10/11 (x64/ARM), and Ubuntu 20.04+ are supported. LM Studio runs best with 16 GB or more of RAM; larger models consume additional memory.
- GPU / Memory:
- Target at least 11 GB of GPU VRAM. Google’s guidance for 12B models: 4-bit ≈ 8.7 GB, 8-bit ≈ 12.2 GB, BF16 ≈ 20 GB. These figures only cover model loading; KV cache usage scales with context length.
- Platform-specific minimums:
- macOS (Apple Silicon): unified memory of 16 GB or more, with roughly 75% of total RAM available to the GPU.
- Windows / Linux (NVIDIA): RTX 3060 12 GB or better.
- Windows / Linux (AMD): Radeon RX 6700 XT 12 GB or better.
- Windows / Linux (Intel Arc): Arc A770 16 GB.
- Recommended quantisation & settings:
- Start with Q4 variants (e.g. Q4_K_M). Move to Q5 or Q6 if you have spare headroom.
- Use an initial context window of 8k–16k tokens; longer inputs demand additional VRAM for the KV cache.
- Disable image input for proofreading scenarios. Gemma 3 treats each image as roughly 256 tokens, reducing usable context.
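If you serve the model through Ollama, one way to pin the context window is a small Modelfile; the model tag, variant name, and 8k value below are illustrative assumptions.

```bash
# Sketch: create an Ollama model variant with an 8k context window.
# gemma3:12b and the "gemma3-proofread" name are placeholders.
cat > Modelfile <<'EOF'
FROM gemma3:12b
PARAMETER num_ctx 8192
EOF
ollama create gemma3-proofread -f Modelfile
# Point HootVoice's model field at "gemma3-proofread".
```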
Troubleshooting
- HTTP 404: Ensure the base URL includes /v1.
- Timeouts: Initial model load may take 30+ seconds; try a smaller model first.
- Wrong language: Set “Prompt language override” to Japanese or include language instructions in your prompt.
- High resource usage: Use quantised models (e.g. -q4_K_M variants) or lower the number of GPU layers.
If issues persist, copy the log entry with the failing request/response and share it with the HootVoice team.