Overview
When LLM post-processing is enabled, HootVoice sends each Whisper transcript to a local LLM endpoint so the text can be polished, made polite, or summarised automatically. The feature expects an OpenAI-compatible API such as Ollama or LM Studio.
The feature is disabled by default. Open Settings → LLM, turn on "Enable LLM post-processing", then configure the API base URL and model name to match your local server. For proofreading-focused workflows we recommend starting with gemma-3-12b-it quantised to Q4_K_M.
How it Works
- HootVoice transcribes your recording locally with Whisper.
- Once transcription finishes, it calls /v1/chat/completions on the configured LLM endpoint.
- The LLM response replaces the raw transcript in the log and clipboard.
- If auto-paste is enabled, the processed text is inserted into the frontmost app.
If the API fails or times out, HootVoice falls back to the original Whisper text. Logs capture HTTP status codes and any error payloads for quick diagnostics.
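For reference, the call follows the standard OpenAI chat-completions format, so you can reproduce it with curl. The sketch below targets the default Ollama base URL; the model name and prompt wording are placeholders, not the exact prompt HootVoice sends.

```bash
# Rough equivalent of the request HootVoice issues after transcription.
# Model name and prompt are illustrative – match them to your own settings.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:12b",
    "messages": [
      {"role": "system", "content": "Proofread the transcript and fix punctuation."},
      {"role": "user", "content": "this is the raw whisper transcript"}
    ]
  }'
```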
Setup Checklist
- An OpenAI-compatible server (Ollama or LM Studio) is running on your machine
- The target LLM model is downloaded and ready to serve
- curl requests to /v1/models or /v1/chat/completions succeed
- HootVoice settings reference the correct API base URL and model identifier
Using Ollama
Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1, which matches the default value in HootVoice.
macOS
- Install via brew install ollama (requires Homebrew).
- Run ollama run llama3.1:8b to download and cache the model.
- Keep the background service running with ollama serve or the Ollama menu bar app.
Windows
- Download the installer from ollama.com and complete setup.
- Open PowerShell and run ollama run llama3.1:8b to fetch the model.
- The service stays active in the background; manage it from the system tray.
Linux
- Run curl https://ollama.ai/install.sh | sh.
- Enable the user service with systemctl --user enable --now ollama.
- Download a model via ollama run llama3.1:8b and verify the API responds.
Test connectivity with:
curl http://localhost:11434/v1/models
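A reachable server answers with a JSON list of the installed models. The exact entries and fields depend on what you have pulled; the shape is roughly:

```
{
  "object": "list",
  "data": [
    { "id": "llama3.1:8b", "object": "model" }
  ]
}
```

The id values are the identifiers you would normally paste into HootVoice's model field.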
Using LM Studio
LM Studio offers a GUI for managing models and ships with an OpenAI-compatible server. The default port is 1234, so set HootVoice’s base URL to http://localhost:1234/v1.
macOS
- Download the DMG from the LM Studio website and install it.
- Open “Download Models” and grab the models you want.
- Click “Start Server” and enable the “OpenAI Compatible Server” option.
Windows
- Run the Windows installer with the default options.
- Download models from within the app, then switch to the “Server” tab.
- Press “Start Server” and enable auto-start if you need it on boot.
Linux
- Launch the AppImage or install the Debian package.
- Download a model, then toggle the server switch in the top-right corner.
- Allow inbound traffic on port 1234 if your firewall prompts.
Confirm the server is reachable:
curl http://localhost:1234/v1/models
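You can also exercise the chat endpoint directly. The model identifier below is a placeholder; use the exact identifier shown in LM Studio's local model list.

```bash
# Quick end-to-end test of LM Studio's OpenAI-compatible server.
# Replace the model value with the identifier LM Studio reports for your model.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-12b",
    "messages": [{"role": "user", "content": "Reply with OK if you can read this."}]
  }'
```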
Recommended Models
| Use case | Model | Notes |
|---|---|---|
| Japanese polishing & polite tone | google/gemma-3-12b (Ollama / LM Studio) | Excellent Japanese fluency; a 4-bit quantisation typically needs 10–12 GB VRAM. |
| English summaries | qwen2.5:7b-instruct, Phi-3.5-mini-instruct | Fast responses with concise outputs; ideal for meeting notes. |
| Maximum accuracy | llama3.1:70b or other large instruction-tuned models | Requires high-end GPU/VRAM; tune OLLAMA_NUM_PARALLEL as needed. |
Make sure the model identifier matches your runtime. Ollama lists models via ollama list, while LM Studio shows the identifier in the “Local Models” panel.
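With Ollama, for example, you can pull a model and copy the exact identifier it reports; the gemma3:12b tag below is an assumption, so use whatever tag your installation actually lists.

```bash
# Pull a model and confirm the identifier Ollama serves it under.
# The gemma3:12b tag is illustrative – copy the NAME column from `ollama list`
# into HootVoice's model field.
ollama pull gemma3:12b
ollama list
```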
Local Resource Requirements
Running Gemma-3-12B locally in 4-bit/QAT mode for proofreading tasks generally requires the following resources:
- CPU / OS: macOS 13.4+ (Apple Silicon M1/M2/M3/M4), Windows 10/11 (x64/ARM), and Ubuntu 20.04+ are supported. LM Studio runs best with 16 GB or more of RAM; larger models consume additional memory.
- GPU / Memory:
- Target at least 11 GB of GPU VRAM. Google’s guidance for 12B models: 4-bit ≈ 8.7 GB, 8-bit ≈ 12.2 GB, BF16 ≈ 20 GB. These figures only cover model loading; KV cache usage scales with context length.
- Platform-specific minimums:
- macOS (Apple Silicon): unified memory of 16 GB or more, with roughly 75% of total RAM available to the GPU.
- Windows / Linux (NVIDIA): RTX 3060 12 GB or better.
- Windows / Linux (AMD): Radeon RX 6700 XT 12 GB or better.
- Windows / Linux (Intel Arc): Arc A770 16 GB.
- Recommended quantisation & settings:
- Start with Q4 variants (e.g. Q4_K_M). Move to Q5 or Q6 if you have spare headroom.
- Use an initial context window of 8k–16k tokens; longer inputs demand additional VRAM for the KV cache.
- Disable image input for proofreading scenarios. Gemma 3 treats each image as roughly 256 tokens, reducing usable context.
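If you serve the model through Ollama, one way to pin the context window is a small Modelfile; the model tag, variant name, and 8k value below are illustrative assumptions.

```bash
# Sketch: create an Ollama model variant with an 8k context window.
# gemma3:12b and the "gemma3-proofread" name are placeholders.
cat > Modelfile <<'EOF'
FROM gemma3:12b
PARAMETER num_ctx 8192
EOF
ollama create gemma3-proofread -f Modelfile
# Point HootVoice's model field at "gemma3-proofread".
```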
Troubleshooting
- HTTP 404: Ensure the base URL includes /v1.
- Timeouts: Initial model load may take 30+ seconds; try a smaller model first.
- Wrong language: Set “Prompt language override” to Japanese or include language instructions in your prompt.
- High resource usage: Use quantised models (e.g. -q4_K_M variants) or lower the number of GPU layers.
If issues persist, copy the log entry with the failing request/response and share it with the HootVoice team.