🎤 F5-TTS Voice Cloning and 🔬 Denoising Process Visualization

Clone any voice with just 5-30 seconds of reference audio and see how noise transforms into speech step by step.

Developed by Noel Triguero. Model by SWivid.

The F5-TTS model generates the final audio in 32 "denoising" steps, gradually transforming pure noise into clean speech. This demo lets you listen to the intermediate results along the way.
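To make the idea of stepwise denoising concrete, here is a minimal toy sketch in plain Python. It is not the F5-TTS implementation: the real model predicts the velocity with a neural network conditioned on text and reference audio, while `toy_velocity` below is a hypothetical stand-in that already knows the target. The sketch only assumes a fixed-step Euler integration over 32 steps and keeps snapshots at the steps this demo exposes.

```python
import random

NUM_STEPS = 32                       # number of denoising steps stated by the demo
SNAPSHOT_STEPS = (0, 12, 20, 26, 32)  # steps the demo lets you listen to


def toy_velocity(x, target, t):
    """Hypothetical stand-in for the model: velocity of a straight path to `target`."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


def euler_denoise(noise, target, num_steps=NUM_STEPS, snapshot_steps=SNAPSHOT_STEPS):
    """Integrate from noise (t=0) towards the target (t=1) with fixed-size Euler steps."""
    x = list(noise)
    snapshots = {}
    if 0 in snapshot_steps:
        snapshots[0] = list(x)       # the initial noise itself
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps         # never reaches 1.0, so no division by zero
        v = toy_velocity(x, target, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        if step + 1 in snapshot_steps:
            snapshots[step + 1] = list(x)
    return snapshots


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
    target = [0.5, -0.25, 0.0, 0.75]  # toy "clean signal"
    for step, values in euler_denoise(noise, target).items():
        print(step, [round(v, 3) for v in values])
```

In this toy setting the distance to the target shrinks linearly with the step count, so the early snapshots are still dominated by noise.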

Input

Reference Audio

Transcription

Text to Generate

Status

Intermediate Denoising Steps

Select Step

0 = Initial noise, 1 = Step 12, 2 = Step 20, 3 = Step 26, 4 = Step 32 (final)
(The first 10 steps sound like plain noise to human listeners.)
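The slider mapping can be written as a small lookup table. This is only an illustrative sketch; the names `SLIDER_TO_STEP` and `describe_slider_position` are hypothetical and not taken from the app's source.

```python
TOTAL_STEPS = 32

# Slider position -> denoising step, as listed in the help text above.
SLIDER_TO_STEP = {0: 0, 1: 12, 2: 20, 3: 26, 4: TOTAL_STEPS}


def describe_slider_position(position):
    """Return a human-readable label for a slider position (0 to 4)."""
    step = SLIDER_TO_STEP[position]
    if step == 0:
        return "Initial noise"
    if step == TOTAL_STEPS:
        return f"Step {step} (final)"
    return f"Step {step}"
```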


Audio at Selected Step

💡 Tips for Better Results

Clean audio: No background noise, music or echo
Duration: 5-30 seconds is ideal
Exact transcription: The transcription must match the audio exactly
Clear speech: Constant volume and clear pronunciation
Language: Reference audio and text should be in English or Chinese
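The duration tip is easy to check before uploading. Below is a minimal sketch using only Python's standard `wave` module; it assumes the reference clip is an uncompressed WAV file, and the helper names are hypothetical rather than part of this app.

```python
import wave

MIN_SECONDS = 5.0    # lower bound recommended in the tips
MAX_SECONDS = 30.0   # upper bound recommended in the tips


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_ideal_reference_length(path, min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """True if the clip length falls inside the recommended 5-30 second range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

For example, `is_ideal_reference_length("my_voice.wav")` (a placeholder file name) would return `False` for a 2-second clip. The other tips, such as background noise and transcription accuracy, still need to be checked by ear.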

🔧 Technical Information

Model: F5-TTS (Flow Matching Text-to-Speech)
Vocoder: Vocos
Device: CPU (may take a while...)