// Before you start

What you need

Windows 10/11, Linux, or macOS (Apple Silicon supported)
Anaconda or Miniconda installed (conda command must work in your terminal)
NVIDIA GPU with 6 GB+ VRAM recommended (CPU works but is slow)
10–20 GB free disk space for models and dependencies
A short audio clip of the voice you want to clone (about 5 seconds is enough for a zero-shot test; about 1 minute is needed for fine-tuning)
Stable internet connection for the initial model download
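
The checklist above can be verified before starting. The following is a minimal sketch, not part of GPT-SoVITS: the function names are made up for illustration, and the 10 GB threshold is simply the lower end of the range listed above.

```python
import shutil

MIN_FREE_GB = 10  # lower bound of the 10–20 GB range listed above


def free_gb(path="."):
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path="."):
    """Map each requirement to True/False; nothing is installed or changed."""
    return {
        "conda on PATH": shutil.which("conda") is not None,
        "git on PATH": shutil.which("git") is not None,
        # nvidia-smi ships with the NVIDIA driver; if absent, expect CPU-only use.
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
        f"{MIN_FREE_GB} GB free disk": free_gb(path) >= MIN_FREE_GB,
    }


for name, ok in preflight().items():
    print(f"{'ok' if ok else 'missing':8}{name}")
```

A missing nvidia-smi is not an error on macOS or on CPU-only machines; it only tells you which installer option to pick later.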

Step 1 of 6

Create a fresh Python environment

3 min

Conda keeps GPT-SoVITS's many dependencies isolated so they don't break anything else on your machine. This command creates a clean Python 3.10 environment and then activates it. Every command after this must be run inside that same activated environment.

Terminal

$ conda create -n GPTSoVits python=3.10 -y

$ conda activate GPTSoVits

What you should see

A prompt that now shows (GPTSoVits) at the start of each line.
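
If the prompt is hard to read, the same check can be done from Python. This is a small sketch with a hypothetical helper name; it relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated.

```python
import os
import sys

EXPECTED_ENV = "GPTSoVits"  # name used in the conda create command above
EXPECTED_PYTHON = (3, 10)


def check_environment(env=None, version=None):
    """Return a list of problems; an empty list means the environment looks right."""
    env = os.environ.get("CONDA_DEFAULT_ENV", "") if env is None else env
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    problems = []
    if env != EXPECTED_ENV:
        problems.append(f"active conda env is {env or 'none'!r}, expected {EXPECTED_ENV!r}")
    if version != EXPECTED_PYTHON:
        problems.append(f"Python is {version[0]}.{version[1]}, expected 3.10")
    return problems


print(check_environment() or "environment looks right")
```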

This might happen

'conda' is not recognized as a command

Anaconda/Miniconda is not installed or not added to your PATH. Download Miniconda from https://docs.conda.io/en/latest/miniconda.html, install it, then open a fresh terminal.

Step 2 of 6

Download the project files

2 min

This clones the GPT-SoVITS repository to your computer — it downloads all the scripts and config files you need. If you don't have Git, you can also download the ZIP from the GitHub page and unzip it, then navigate into the folder.

Terminal

$ git clone https://github.com/RVC-Boss/GPT-SoVITS.git

$ cd GPT-SoVITS

What you should see

A folder called GPT-SoVITS appears and your terminal is now inside it.
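
To confirm the clone (or the unzipped ZIP) is complete enough for the next steps, you can look for the files this guide uses later. A minimal sketch with a hypothetical helper; the repository contains many more files than the three checked here.

```python
from pathlib import Path

# Only the files named elsewhere in this guide.
EXPECTED_FILES = ("webui.py", "install.sh", "install.ps1")


def missing_files(repo_dir="."):
    """Return the expected files that are not present in `repo_dir`."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Run from inside the GPT-SoVITS folder; an empty list means all three exist.
print(missing_files("."))
```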

This might happen

'git' is not recognized

Install Git from https://git-scm.com/downloads, or download the ZIP directly from the GitHub page and unzip it manually.

Step 3 of 6

Run the automated installer for your platform

15–30 min

This single script installs all Python packages and downloads the required pretrained models automatically. Pick the command that matches your hardware. CU126 means a CUDA 12.6 build (the usual choice for recent NVIDIA cards); use CPU if you have no NVIDIA GPU. The --source HF flag (--Source HF in the PowerShell script) downloads from Hugging Face; if you are in China, swap HF for ModelScope.

Terminal

# Windows (NVIDIA GPU; paste into PowerShell):
$ pwsh -F install.ps1 --Device CU126 --Source HF

# Windows (no GPU):
$ pwsh -F install.ps1 --Device CPU --Source HF

# Linux (NVIDIA GPU):
$ bash install.sh --device CU126 --source HF

# macOS (Apple Silicon):
$ bash install.sh --device MPS --source HF
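
The choice between these commands can be sketched as a small decision function. This is illustrative only: the helper name is hypothetical, it treats the presence of nvidia-smi as a stand-in for "has an NVIDIA GPU", it assumes every Mac is Apple Silicon, and it treats any other system as Linux. It builds the command but does not run it.

```python
import platform
import shutil


def installer_command(system=None, has_nvidia=None):
    """Return the install command from this guide as an argument list."""
    system = platform.system() if system is None else system
    if has_nvidia is None:
        has_nvidia = shutil.which("nvidia-smi") is not None
    device = "CU126" if has_nvidia else "CPU"
    if system == "Windows":
        return ["pwsh", "-F", "install.ps1", "--Device", device, "--Source", "HF"]
    if system == "Darwin":  # macOS; Apple Silicon assumed
        return ["bash", "install.sh", "--device", "MPS", "--source", "HF"]
    return ["bash", "install.sh", "--device", device, "--source", "HF"]


print(" ".join(installer_command()))
```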

What you should see

Lots of download and install progress bars, ending with no red ERROR lines. Afterwards, the pretrained_models folder inside GPT_SoVITS/ contains several .pth and .ckpt files.

This might happen

Download stalls or fails mid-way (common on slow connections or in regions with restricted access to Hugging Face)

Re-run the same install command — it resumes where it left off. Or swap --source HF for --source ModelScope to use the Chinese mirror.
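
On a flaky connection, re-running by hand gets tedious. Below is a generic retry wrapper, sketched under the assumption that the installer exits with a non-zero code when a download fails; the function name and the retry counts are arbitrary choices, not something the project provides.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, wait_seconds=10):
    """Run `command` until it exits with code 0; return the attempt that succeeded."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(wait_seconds)  # brief pause before trying again
    raise RuntimeError(f"command still failing after {attempts} attempts")


# Example (Linux, NVIDIA GPU), run from inside the GPT-SoVITS folder:
# run_with_retries(["bash", "install.sh", "--device", "CU126", "--source", "HF"])
```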

Step 4 of 6

Launch the WebUI

2 min

This starts the local web server. After a few seconds it will print a URL like http://127.0.0.1:9874. Open that URL in any browser. The full interface loads there — you never need to touch code.

Terminal

# Make sure the conda environment is still active, then:
$ python webui.py

What you should see

Terminal prints something like: Running on local URL: http://127.0.0.1:9874. Your browser may open automatically.
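
If you launch the WebUI from a script, you can wait for the server instead of watching the terminal. A minimal sketch, assuming the default address shown above; the helper name and timings are illustrative.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=9874, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # server not up yet; try again shortly
    return False


# Example: wait_for_port() returns True when http://127.0.0.1:9874 is reachable.
```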

This might happen

Port 9874 already in use, or the page shows a connection error

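Another program is using that port. Stop it and relaunch. If you need a different port instead, try python webui.py --port 9875; if your version of webui.py ignores that flag, change the port setting in the project's config.py (webui_port_main in recent versions) and relaunch.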
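
To see whether 9874 is really taken, and which nearby port is free, a short check like the following can help. It is a sketch with hypothetical helper names; how you point the WebUI at the new port depends on your version of the project.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start=9874, tries=20, host="127.0.0.1"):
    """Return the first free port in [start, start + tries), or None."""
    for port in range(start, min(start + tries, 65536)):
        if port_is_free(port, host):
            return port
    return None


print(first_free_port())
```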

Step 5 of 6

Upload your reference audio and generate speech

5 min

In the browser UI, go to the 'TTS Inference' tab. Upload your reference audio clip (WAV or MP3). About 5 seconds is enough for zero-shot, and recent versions of the inference UI reject reference clips outside roughly 3–10 seconds, so trim a longer recording to a short, clean excerpt; the full 1-minute recording is for fine-tuning in Step 6. Type the text you want spoken into the text box. Select the language of both the reference audio and the output text. Click 'Synthesize'. The result plays directly in the browser and can be downloaded. No training is needed for a quick zero-shot test: the model clones the voice on the spot.

No terminal command is needed for this step. All actions happen in your browser at http://127.0.0.1:9874

What you should see

An audio player appears in the browser with the synthesized speech in the cloned voice. A download button lets you save the WAV file.

This might happen

Output sounds robotic or doesn't match the voice

Use a cleaner reference clip: no background music, no reverb, single speaker only. The UVR5 vocal separator tool built into the WebUI (under 'Tools') can strip background noise from your clip before using it.
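
Before uploading, you can check a clip's length with Python's standard wave module. This sketch only handles uncompressed WAV files (not MP3). The helper names are hypothetical, and the 5-second default comes from the minimum in the requirements list; pass different bounds to match what your WebUI version accepts.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(str(path), "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def clip_in_range(path, min_seconds=5.0, max_seconds=None):
    """Return True if the clip is at least min_seconds and at most max_seconds long."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        return False
    return max_seconds is None or duration <= max_seconds


# Example: clip_in_range("my_voice.wav")  # "my_voice.wav" is a placeholder name
```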

Step 6 of 6

(Optional) Fine-tune for better quality with 1 minute of audio

30–60 min

Zero-shot is fast but fine-tuning gives noticeably better results. In the WebUI, go to the 'Training' tab. Upload your 1-minute audio, use the built-in tools to auto-slice it into segments and label them, then click the training buttons in order (SoVITS train, then GPT train). Training takes 20–60 minutes on a GPU. Once done, load your new model files in the Inference tab for higher-quality output. The WebUI walks you through each sub-step visually.

No terminal command is needed. All steps happen in the browser Training tab at http://127.0.0.1:9874

What you should see

Two new model files appear in your output folder (a .pth SoVITS model and a .ckpt GPT model). You can load these in the Inference tab.
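
If you are unsure which files training just produced, you can sort by modification time. This is a minimal sketch with a hypothetical helper; it searches the whole project folder because the exact output folder names vary between versions, so the downloaded pretrained models are candidates too (they are normally older than freshly trained files).

```python
from pathlib import Path


def newest_model(root=".", suffix=".pth"):
    """Return the most recently modified file under `root` with `suffix`, or None."""
    candidates = [p for p in Path(root).rglob(f"*{suffix}") if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Run from inside the GPT-SoVITS folder after training finishes.
print("SoVITS model:", newest_model(".", ".pth"))
print("GPT model:   ", newest_model(".", ".ckpt"))
```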

This might happen

CUDA out of memory error during training

In the Training tab, reduce the batch size to 2 or 1. If you have no GPU, training on CPU is extremely slow (hours) and not recommended.
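
To see how much VRAM you actually have before choosing a batch size, you can query nvidia-smi. A sketch with hypothetical helper names; it returns an empty list when no NVIDIA driver is installed, and the 6 GB comparison is the recommendation from the requirements list, not a hard limit.

```python
import shutil
import subprocess


def parse_vram_mib(nvidia_smi_output):
    """Parse one integer (MiB) per GPU line; skip lines that are not plain numbers."""
    values = []
    for line in nvidia_smi_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def total_vram_mib():
    """Return total memory in MiB for each NVIDIA GPU, or [] if none is detected."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_vram_mib(result.stdout) if result.returncode == 0 else []


for index, mib in enumerate(total_vram_mib()):
    note = "meets" if mib >= 6 * 1024 else "is below"
    print(f"GPU {index}: {mib} MiB ({note} the 6 GB recommendation)")
```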

// Status

cooked. baked. worked.

A locally running browser app that can clone a voice from a short audio clip and speak any text you type in that voice, with optional fine-tuning for higher quality output.

// the honest bit

The honest part

This project has many moving parts and a large dependency stack. The install step alone can take 30 minutes and occasionally fails on the first try due to network issues or mismatched CUDA versions. CPU inference works but is very slow (ten or more times slower than real time). Mac GPU training produces lower-quality results than NVIDIA GPU training, per the project's own docs. The WebUI is in active development and the interface changes between versions. Fine-tuning requires clean, noise-free audio; poor source audio gives poor results regardless of settings.
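
To put the "real time" comparison in concrete terms: the real-time factor is synthesis time divided by the length of the audio produced, so a factor of 10 means ten seconds of computing per second of speech. The numbers below are illustrative, not measurements.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute spent per second of generated audio."""
    return synthesis_seconds / audio_seconds


# Illustrative only: 50 s of computing for a 5 s sentence gives a factor of 10.
print(real_time_factor(50, 5))
```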

Clone Any Voice in 1 Minute with GPT-SoVITS
