Generating Text to Speech locally

Generating text to speech locally for free

Posted by : mahbub on Aug 27, 2025

Category : TTS AI ML

A few weeks ago, a colleague shared an online service that claimed to generate text-to-speech (TTS) from short voice recordings, mimicking a person’s speech. I was surprised, as I’d always assumed that a few seconds of audio wouldn’t be enough to accurately replicate someone’s speech patterns, since everyone articulates words differently.

Intrigued, I asked my colleague for the link so I could investigate. I listened to the front-page samples of “celebrity” voices (which, as I suspected, didn’t perfectly match the celebrities themselves) and learned that the service cost $30 per month with limited TTS generation, so I decided not to use it (I am not sure what is up with every company nowadays trying to take our money every month, not going to happen with me :p). However, even though the celebrity voices didn’t quite match, the TTS output was surprisingly natural-sounding. A TTS system that sounds just as natural but is free and offers unlimited usage would be truly ideal!

Before this recent surge in AI technology, I’d been relying on the ReadPleasePlus2003 app for over 10 years. This app used Windows’ built-in voices and worked well with recent versions of Windows 11. However, ReadPleasePlus2003 is an abandoned app, no longer updated, and so it could stop working at any time. Furthermore, its artificial speech wasn’t particularly pleasant to listen to.

So, I began searching GitHub for TTS or related tools that could work on Windows 11. After scouring dozens of abandoned projects, I finally found one called “TTS” by coqui-ai, thanks to a helpful post on Stack Overflow. Using a quick Python script, I was able to make it behave like ReadPleasePlus2003: it reads a block of text from anywhere, as long as I can copy the text to the clipboard and the script is running, of course. A fun fact is that with this tool you can use your own voice as a sample, resulting in output similar to the paid service that costs $30 per month for limited use.

Here’s how you can set up this tool on your Windows 11 PC with CUDA 11:

Prerequisites:

Python 3.11: Download and install the latest version of Python 3.11. If you have other Python versions installed, install this one in a separate folder and rename the python.exe file to python3.11.exe. Then add the path to this executable to your system’s environment variables.
CUDA Toolkit 11.8: Download and install the CUDA Toolkit 11.8. Extract the downloaded tarball and copy the contents of the extracted folder into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8.
CUDA DNN Library (CUDNN): Download the CUDNN v9.10.2 tarball specifically for CUDA 11.8 (the latest version compatible with CUDA 11). Extract the contents and place them within C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8.

System Environment Variables: Add the following paths to your system’s environment variables:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64

eSpeak NG: Download and install the latest 64-bit version of eSpeak NG from: https://github.com/espeak-ng/espeak-ng/releases
Git for Windows: Download and install the latest 64-bit version of Git for Windows.
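
Before moving on, it can help to confirm that the prerequisites are actually reachable from a new terminal. The following is a minimal sketch, not part of the original setup: the command names (python3.11, nvcc, espeak-ng, git) are assumptions about what each installer puts on PATH, and eSpeak NG in particular may need its install folder added to PATH by hand.

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = {
    "Python 3.11": "python3.11",
    "CUDA compiler": "nvcc",
    "eSpeak NG": "espeak-ng",
    "Git": "git",
}


def find_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return {label: path or None} for each tool looked up on PATH."""
    return {label: which(command) for label, command in tools.items()}


def report(found):
    """Build one human-readable status line per tool."""
    return [f"{label}: {path or 'NOT FOUND on PATH'}" for label, path in found.items()]


if __name__ == "__main__":
    print("\n".join(report(find_tools())))
```

Any line that says NOT FOUND usually means the matching folder is missing from the system environment variables, or that the terminal was opened before the variables were changed.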

Setup Steps:

Open a PowerShell prompt.
Navigate to a new working directory.
Run the following command to clone the coqui-ai TTS repository: git clone https://github.com/coqui-ai/TTS.git
Navigate into the TTS directory: cd TTS
Create a virtual environment: python3.11 -m venv .
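Install TTS into the virtual environment in editable mode (no separate activation step is needed, because the commands here call the environment’s executables directly): .\scripts\pip install -e .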
Install the necessary Python packages: .\scripts\pip install torch==2.2.0+cu118 torchvision==0.17.0+cu118 torchaudio==2.2.0 transformers==4.35.2 pyperclip pyaudio -f https://download.pytorch.org/whl/torch_stable.html
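
Once the packages are installed, a quick check can tell you whether PyTorch sees the GPU before the much larger TTS model is downloaded. This is a small sketch under the assumption that it is run with the virtual environment's Python; the function name torch_cuda_status is made up for illustration.

```python
import importlib.util


def torch_cuda_status():
    """Return a short status string describing the PyTorch/CUDA setup."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch  # imported lazily so the check works even without torch

    if not torch.cuda.is_available():
        return f"torch {torch.__version__} is installed, but CUDA is not available"
    return f"torch {torch.__version__} sees GPU: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(torch_cuda_status())
```

If the output says CUDA is not available, the usual suspects are a CPU-only torch build or a driver/toolkit mismatch, so it is worth re-checking the install command and the environment variables above.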

Python script

Put this script in the root folder of the TTS repository as main.py, then run it using .\scripts\python3.11 main.py. It reads the text from your clipboard, converts it to audio, and plays it. Every time you copy different text, the script converts it to audio and plays it automatically. The script uses a voice sample named male.wav, so place a recording with that name in the folder you run the script from. If the script complains about a missing module when you run it, install the module using the .\scripts\pip install <module name> command.

import sys
import time
import wave

import pyaudio
import pyperclip
import torch
from TTS.api import TTS

outputFile = "output.wav"

def initialize_tts():
    """Initialize CUDA device and TTS model."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    if torch.cuda.is_available():
        print("CUDA is available\n")
    else:
        # Stop early rather than falling back to slow CPU inference
        print("CUDA not properly installed. Stopping process...")
        sys.exit(1)

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", progress_bar=True).to(device)
    return tts

def synthesize_speech(tts, text):
    """Generate audio from text using the TTS model."""
    tts.tts_to_file(
        text=text,
        speaker_wav="male.wav",  # reference voice sample to mimic
        language="en",
        file_path=outputFile
    )

def play_output_audio():
    # Open the WAV file
    wf = wave.open(outputFile, 'rb')
    p = pyaudio.PyAudio()

    # Open stream
    stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                    channels=wf.getnchannels(),
                    rate=wf.getframerate(),
                    output=True)

    # Read and play the audio
    data = wf.readframes(1024)
    while data:
        stream.write(data)
        data = wf.readframes(1024)

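    # Cleanup: release the audio device and close the WAV file explicitly
    # so output.wav can be overwritten by the next synthesis
    stream.stop_stream()
    stream.close()
    p.terminate()
    wf.close()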


def clipboard_listener(tts):
    recent_text = ""
    print("Listening for clipboard changes... Press Ctrl+C to stop.\n")
    
    try:
        while True:
            current_text = pyperclip.paste()
            if current_text != recent_text and isinstance(current_text, str) and current_text.strip():
                print("New clipboard text detected")
                synthesize_speech(tts, current_text)
                play_output_audio()
                
                recent_text = current_text
            time.sleep(1)  # Check every second
    except KeyboardInterrupt:
        print("\n👋 Listener stopped by user.")

def main():
    tts = initialize_tts()
    clipboard_listener(tts)

if __name__ == "__main__":
    main()
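
The script above reads its reference voice from male.wav, so it may be worth inspecting that recording before starting the listener. The sketch below uses only the standard wave module; the helper names and the 3-second threshold are illustrative assumptions, not requirements stated by the TTS project.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / float(rate)
    return channels, rate, duration


def check_sample(path, min_seconds=3.0):
    """Return a list of warnings; an empty list means nothing looked off.

    min_seconds is an arbitrary illustrative threshold.
    """
    _, _, duration = describe_wav(path)
    warnings = []
    if duration < min_seconds:
        warnings.append(f"only {duration:.1f}s long (threshold {min_seconds}s)")
    return warnings
```

For example, calling check_sample("male.wav") before initialize_tts() would fail fast with a clear error if the file is missing or is not a valid WAV, instead of failing later inside the model.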

About Mahbub Mozadded

Software designer and developer.

Email : mahbub@mozadded.com

Website : http://mozadded.com