Run a Local AI Voice Studio: Clone, Dictate, Generate

// Before you start

What you need

A Mac (Apple Silicon or Intel) or a Windows PC
At least 8 GB RAM (16 GB recommended for larger models)
For GPU acceleration: an NVIDIA GPU with CUDA on Windows, or Metal on an Apple Silicon Mac (CPU fallback works but is slow)
A short audio clip (5–30 seconds) if you want to clone a specific voice
About 2–10 GB free disk space for model downloads
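If you would rather check the disk and memory numbers above with a script than by eye, here is a minimal Python sketch. It is not part of Voicebox. The function names `disk_ok` and `ram_tier` are made up for illustration, and the thresholds are simply the figures from the list above.

```python
import shutil

GB = 10 ** 9  # decimal gigabyte, matching the "GB" figures in the list above


def disk_ok(free_bytes: int, needed_gb: float = 10.0) -> bool:
    """True if free space covers the upper end of the 2-10 GB download range."""
    return free_bytes >= needed_gb * GB


def ram_tier(ram_gb: float) -> str:
    """Classify installed RAM against the 8 GB minimum and 16 GB recommendation."""
    if ram_gb >= 16:
        return "recommended"
    if ram_gb >= 8:
        return "minimum"
    return "below minimum"


if __name__ == "__main__":
    # Checks the drive that holds the current directory (an assumption:
    # models may be stored on a different drive on your machine).
    free = shutil.disk_usage(".").free
    print(f"Free disk: {free / GB:.1f} GB, room for the largest models: {disk_ok(free)}")
    print("RAM tier for a 16 GB machine:", ram_tier(16))
```

Pass `needed_gb=2` if you only plan to use the smaller models.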

Step 1 of 6

Download the installer for your platform

2 min

Voicebox ships as a native app — a DMG for Mac or an MSI for Windows. You do not need to install Node, Python, or any developer tools. Just grab the right file for your machine from the official site.

Terminal · mac

$ # macOS Apple Silicon: https://voicebox.sh (click Download DMG)

$ # macOS Intel: https://voicebox.sh (click Download DMG — Intel build)

$ # Windows: https://voicebox.sh (click Download MSI)

$ # Docker (any OS): docker compose up

$ # Linux: https://voicebox.sh/linux-install (build from source; no pre-built binary yet)

What you should see

A .dmg or .msi file lands in your Downloads folder.

This might happen

macOS shows 'unidentified developer' or Gatekeeper blocks the app.

Open System Settings → Privacy & Security, scroll down, and click 'Open Anyway' next to the Voicebox entry. This is expected for apps that are distributed outside the Mac App Store and have not been notarized by Apple.
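Not sure which download matches your machine? The sketch below maps what Python reports about your system onto the options listed in this step. It is only an illustration: `installer_for` is a hypothetical helper, and the strings it returns are labels for the choices above, not real file names.

```python
import platform


def installer_for(system: str, machine: str) -> str:
    """Map platform.system() / platform.machine() values to a download option."""
    system = system.lower()
    machine = machine.lower()
    if system == "darwin":
        # Apple Silicon Macs report "arm64"; Intel Macs report "x86_64".
        # Caveat: Python running under Rosetta also reports "x86_64".
        if machine in ("arm64", "aarch64"):
            return "DMG (Apple Silicon)"
        return "DMG (Intel build)"
    if system == "windows":
        return "MSI"
    if system == "linux":
        return "see https://voicebox.sh/linux-install"
    return "Docker"


if __name__ == "__main__":
    print(installer_for(platform.system(), platform.machine()))
```

On a Mac you can get the same answer from Apple menu → About This Mac, which shows whether the chip is Apple Silicon or Intel.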

Step 2 of 6

Install and launch Voicebox

3 min

On Mac, open the DMG and drag Voicebox into your Applications folder, then double-click it. On Windows, run the MSI and follow the installer prompts. The app is built with Tauri (Rust), so it typically starts faster and uses less memory than Electron-based tools.

Terminal · mac

$ # macOS: open the .dmg → drag Voicebox to /Applications → open Voicebox

$ # Windows: double-click the .msi → click Next/Install → launch from Start Menu

What you should see

The Voicebox UI opens. You will see a sidebar with sections like Generate, Dictate, Stories, and Settings.

This might happen

Windows Defender SmartScreen warns about an unrecognized app.

Click 'More info' then 'Run anyway'. This warning appears because the binary is new and has not yet accumulated enough download history with Microsoft.

Step 3 of 6

Download your first TTS model

5–15 min (download time varies)

Voicebox does not bundle the AI models inside the installer — they are too large. On first use, go to Settings → Models and pick an engine. Kokoro is the best starting point: it is tiny (82 MB), runs fast on CPU, and comes with 50 preset voices so you can generate speech immediately without any reference audio.

Terminal · mac

$ # In the app: Settings → Models → Kokoro → Download

$ # Wait for the progress bar to complete before generating.

What you should see

The Kokoro model shows a green 'Ready' badge. You are now able to generate speech.

This might happen

Download stalls or fails partway through.

Check your internet connection, then click the retry button next to the model. Large models (Qwen3-TTS 1.7B, TADA 3B) can take 10–20 minutes on slower connections.
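To judge whether a download has really stalled or is just slow, a rough estimate helps. The sketch below is plain arithmetic. The 4 GB size is a hypothetical example, not a published figure for any Voicebox model, so use the size shown in Settings → Models instead.

```python
def download_minutes(size_gb: float, speed_mbps: float) -> float:
    """Estimate download time in minutes from size (GB) and link speed (Mbps)."""
    if speed_mbps <= 0:
        raise ValueError("speed_mbps must be positive")
    megabits = size_gb * 8000  # 1 GB = 8000 megabits in decimal units
    return megabits / speed_mbps / 60


if __name__ == "__main__":
    # Hypothetical 4 GB model at a few common connection speeds.
    for speed in (25, 50, 100):
        print(f"4 GB at {speed} Mbps: about {download_minutes(4, speed):.0f} min")
```

A hypothetical 4 GB download at 50 Mbps works out to roughly 11 minutes, which fits the 10 to 20 minute range quoted above. Real downloads usually run slower than the advertised line speed.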

Step 4 of 6

Generate your first speech clip

2 min

Go to the Generate tab. Type any text into the input box — a sentence or two is fine. Select Kokoro from the engine dropdown and pick one of the 50 preset voices from the voice list. Hit Generate. The audio appears in the output panel when done. You can play it, download it as a WAV, or apply effects.

Terminal · mac

$ # In the app:

$ # 1. Click 'Generate' in the sidebar

$ # 2. Type your text in the input field

$ # 3. Engine dropdown → Kokoro

$ # 4. Voice dropdown → pick any preset (e.g. 'af_heart')

$ # 5. Click the Generate button

What you should see

A waveform appears in the output panel within a few seconds. Click Play to hear it.

This might happen

Generation finishes instantly but produces silence or a very short clip.

Make sure the model finished downloading (green Ready badge). If the badge is missing, go back to Settings → Models and re-download.

Step 5 of 6

Clone a voice from a short audio sample

5 min

To clone a voice, go to Voice Profiles → New Profile. Give it a name, then upload a reference audio file — 5 to 30 seconds of clean speech works best. Background music or noise will reduce quality. Save the profile, then select it as your voice in the Generate tab and run a generation. The engine will match the tone and style of your sample.
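Before you upload, you can confirm that a clip falls inside the 5 to 30 second window. This is a minimal sketch using Python's standard `wave` module, so it only reads uncompressed .wav files (not .mp3). The helper names `clip_seconds` and `check_reference_clip` are invented for this example and have nothing to do with Voicebox internals.

```python
import wave


def clip_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str, min_s: float = 5.0, max_s: float = 30.0) -> str:
    """Compare a clip's length against the 5 to 30 second guideline."""
    seconds = clip_seconds(path)
    if seconds < min_s:
        return f"too short ({seconds:.1f} s)"
    if seconds > max_s:
        return f"too long ({seconds:.1f} s)"
    return "ok"


if __name__ == "__main__":
    import sys

    # Usage: python check_clip.py my_voice.wav
    if len(sys.argv) > 1:
        print(check_reference_clip(sys.argv[1]))
```

The check only measures length. It cannot tell you whether the recording is clean, so still listen for background noise yourself.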

Terminal · mac

$ # In the app:

$ # 1. Sidebar → Voice Profiles → + New Profile

$ # 2. Name the profile (e.g. 'My Voice')

$ # 3. Click 'Upload Reference Audio' → select your .mp3 or .wav file

$ # 4. Click Save

$ # 5. Go to Generate → Voice dropdown → select 'My Voice'

$ # 6. Choose Qwen3-TTS or Chatterbox Multilingual as the engine

$ # 7. Type text → Generate

What you should see

The generated audio sounds like the person in your reference clip. Quality improves with cleaner samples and with longer ones, up to the 30-second guideline.

This might happen

The cloned voice sounds robotic or nothing like the reference.

Use a recording with no background music, minimal echo, and a single speaker. Trim silence from the start and end. Qwen3-TTS and Chatterbox Multilingual give the best cloning results; Kokoro does not support zero-shot cloning.
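Trimming silence from the start and end can be done in any audio editor. If you prefer a script, here is a rough sketch that uses only the Python standard library. It assumes a 16-bit mono WAV file, and the default threshold of 500 is an arbitrary starting value that you will probably need to tune. `trim_silence` and `trim_wav` are hypothetical helper names.

```python
import array
import sys
import wave


def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose absolute value is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]  # quiet samples in the middle are kept


def trim_wav(src: str, dst: str, threshold: int = 500) -> None:
    """Read a 16-bit mono WAV, trim edge silence, and write the result to dst."""
    with wave.open(src, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono WAV")
        params = wav.getparams()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    trimmed = trim_silence(samples, threshold)
    if sys.byteorder == "big":
        trimmed.byteswap()
    with wave.open(dst, "wb") as out:
        out.setparams(params)  # the frame count is corrected when the file closes
        out.writeframes(trimmed.tobytes())
```

A per-sample threshold is crude: a click or steady hiss at the edge of the file will stop the trim early. It is usually enough for the dead air before and after a spoken sentence.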

Step 6 of 6

Enable global dictation with a hotkey

3 min

Voicebox can replace your keyboard in any app — browser, email, notes, anything. Go to Settings → Dictation, enable it, and assign a hotkey (for example, Option+Space on Mac). Grant microphone access when prompted. Now press your hotkey anywhere on your computer, speak, and the transcribed text is pasted automatically into whatever field is active.

Terminal · mac

$ # In the app:

$ # 1. Settings → Dictation → toggle Enable Dictation ON

$ # 2. Click the hotkey field → press your chosen key combo (e.g. Option+Space)

$ # 3. macOS: grant Accessibility and Microphone permissions when the system dialog appears

$ # 4. Click into a text field in another app → press your hotkey → speak

What you should see

Your spoken words appear as typed text in the active text field of whatever app you are using.

This might happen

macOS does not paste the text automatically after dictation.

Go to System Settings → Privacy & Security → Accessibility and make sure Voicebox is listed and toggled on. This permission is required for auto-paste to work.

// Status

cooked. baked. worked.

A fully local voice studio running on your machine. You can generate speech in 23 languages using 7 engines, clone a voice from a short audio clip, dictate into any app with a hotkey, and apply audio effects — all without sending data to any cloud service.

// the honest bit

The honest part

Voicebox is a young open-source project. Some engines (TADA, Qwen3-TTS 1.7B) require a capable GPU and several gigabytes of VRAM — CPU fallback is very slow for these. Linux has no pre-built binary yet; building from source requires developer tools. Voice cloning quality varies significantly with audio sample quality. The Stories editor and MCP agent integration are advanced features that work best if you already use Claude Code, Cursor, or Cline. Docker support exists but is not well-documented for non-developers.
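To put "several gigabytes of VRAM" into numbers, you can estimate the memory that a model's weights alone take up from its parameter count. The sketch below assumes 16-bit weights (2 bytes per parameter), which is a common default but not something this guide confirms for any Voicebox engine. Actual usage is higher once activations and audio buffers are added.

```python
def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GB (weights only, no overhead)."""
    # 1 billion parameters at N bytes each is N GB in decimal units.
    return params_billions * bytes_per_param


if __name__ == "__main__":
    # Parameter counts are taken from the model names mentioned above.
    for name, size in (("Qwen3-TTS 1.7B", 1.7), ("TADA 3B", 3.0)):
        print(f"{name}: about {weights_gb(size):.1f} GB at 16-bit precision")
```

Under that assumption the 1.7B model needs about 3.4 GB for its weights and the 3B model about 6 GB.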