StyleTTS 2 and Tortoise-TTS

In recent years, thanks to advances in deep learning and the availability of large-scale datasets, the field of text-to-speech (TTS) synthesis has made remarkable progress. Modern TTS systems can produce highly realistic and expressive speech, blurring the line between synthetic and natural voices. This project guide takes a close look at two prominent models at the forefront of this revolution: StyleTTS 2 and Tortoise-TTS. Both represent innovative approaches to voice cloning and speech synthesis, each leveraging different techniques to achieve impressive results.

This guide is a starting point for exploring these powerful TTS systems, giving a brief overview of their core principles, methods, and main features. It also provides code examples showing how to use each project. Please note that these examples are for educational and research purposes only; I take no responsibility for any illegal or unethical use of these technologies. While this guide focuses on these two particular projects, it is worth acknowledging that the open-source community offers a vibrant ecosystem of TTS models, each with its own strengths and limitations. Readers are encouraged to explore other projects and compare their performance and features to get a broader view of the voice-cloning landscape.

1. StyleTTS 2: Leveraging Style Diffusion and SLM Adversarial Training

StyleTTS 2 stands out for its innovative use of style diffusion and of adversarial training with large speech language models (SLMs). The model can effectively capture and synthesize a wide range of speaking styles without relying on reference audio, producing expressive and natural-sounding speech.

Figure 1: Training and inference scheme of StyleTTS 2 for the single-speaker case. In the multi-speaker case, the acoustic and prosodic style encoders (denoted E) first take a reference audio x_ref of the target speaker and produce a reference style vector c = E(x_ref). The style diffusion model then uses c as a reference to sample s_p and s_a corresponding to the speaker in x_ref.

The key components of StyleTTS 2 include:

  • Style diffusion: This novel technique models speech styles as a latent random variable, allowing diverse styles to be sampled and manipulated efficiently based on the input text alone. By leveraging diffusion models, StyleTTS 2 can generate diverse and expressive speech without any reference audio (see the conceptual sketch after this list).
  • SLM adversarial training: A pre-trained SLM such as WavLM acts as a discriminator, providing feedback to the model and improving the naturalness of the synthesized speech. This taps into the rich acoustic and semantic knowledge embedded in SLMs to significantly boost speech quality.
  • Differentiable duration modeling: This innovation enables end-to-end training with the SLM discriminator, addressing the instability issues common in earlier approaches. It lets the model learn directly from SLM feedback, resulting in more accurate and natural duration control.
  • End-to-end training: The entire StyleTTS 2 architecture is trained jointly, optimizing all components including waveform generation. This holistic approach further improves the overall quality and coherence of the synthesized speech.
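
To make the style-diffusion idea more concrete, below is a minimal, self-contained toy sketch of the inference-time flow: a text embedding conditions an iterative denoising loop that turns pure noise into a style vector, which would then drive duration, prosody, and waveform decoding. All names, shapes, and the "denoiser" itself are illustrative placeholders rather than the StyleTTS 2 API; the real sampler used later in this guide is the DiffusionSampler with a Karras noise schedule set up in section 2.3.

import torch

torch.manual_seed(0)

def toy_style_diffusion(text_emb, num_steps=5):
    """Toy stand-in for style diffusion: iteratively denoise a random vector
    into a 256-dim style vector conditioned on a text embedding."""
    style = torch.randn(1, 256)                          # start from pure noise
    target = torch.tanh(text_emb.mean(dim=1))[:, :256]   # crude text-conditioned "denoising target"
    for step in range(num_steps, 0, -1):
        sigma = step / num_steps                         # toy noise schedule
        style = style + sigma * (target - style)         # pull the noisy style toward the target
    return style

text_emb = torch.randn(1, 50, 512)      # (batch, tokens, hidden) from a text/BERT encoder
style = toy_style_diffusion(text_emb)   # acoustic + prosodic style, sampled without reference audio
print(style.shape)                      # torch.Size([1, 256])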

2. StyleTTS 2 Voice Cloning Code Example

This initial code block gets everything ready: it clones the StyleTTS2 repository, installs the required Python packages, downloads the pretrained models, and prepares the reference audio.

%%shell
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
pip install SoundFile torchaudio munch torch pydub pyyaml librosa nltk matplotlib accelerate transformers phonemizer einops einops-exts tqdm typing-extensions git+https://github.com/resemble-ai/monotonic_align.git
sudo apt-get install espeak-ng
git-lfs clone https://huggingface.co/yl4579/StyleTTS2-LibriTTS
mv StyleTTS2-LibriTTS/Models .
mv StyleTTS2-LibriTTS/reference_audio.zip .
unzip reference_audio.zip
mv reference_audio Demo/reference_audio
pip install yt_dlp

2.1 Importing libraries

Next, we import the essential libraries, set random seeds for reproducibility, and define a few utility functions. These handle tasks such as preprocessing audio, computing style embeddings, and converting text to phonemes.

import nltk
nltk.download('punkt')

%cd StyleTTS2

import torch
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

import random
random.seed(0)

import numpy as np
np.random.seed(0)

# load packages
import time
import random
import yaml
from munch import Munch
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa
from nltk.tokenize import word_tokenize

from models import *
from utils import *
from text_utils import TextCleaner
textclenaer = TextCleaner()

2.2 Loading the pretrained models and configuration

Here we load the various pretrained models that StyleTTS 2 relies on, including the ASR model, the pitch (F0) extractor, and the PL-BERT model. We also load the configuration file, which contains the model's key parameters.

%matplotlib inline

to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def length_to_mask(lengths):
    mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
    mask = torch.gt(mask+1, lengths.unsqueeze(1))
    return mask

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(path):
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)

    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))

    return torch.cat([ref_s, ref_p], dim=1)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load phonemizer
import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True,  with_stress=True)

config = yaml.safe_load(open("Models/LibriTTS/config.yml"))

# load pretrained ASR model
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
text_aligner = load_ASR_models(ASR_path, ASR_config)

# load pretrained F0 model
F0_path = config.get('F0_path', False)
pitch_extractor = load_F0_models(F0_path)

# load BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get('PLBERT_dir', False)
plbert = load_plbert(BERT_path)

model_params = recursive_munch(config['model_params'])
model = build_model(model_params, text_aligner, pitch_extractor, plbert)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]

params_whole = torch.load("Models/LibriTTS/epochs_2nd_00020.pth", map_location='cpu')
params = params_whole['net']

for key in model:
    if key in params:
        print('%s loaded' % key)
        try:
            model[key].load_state_dict(params[key])
        except:
            from collections import OrderedDict
            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:] # remove `module.`
                new_state_dict[name] = v
            # load params
            model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]

2.3 Setting up the diffusion sampler and inference functions

This part of the code sets up the diffusion sampler, a key component of StyleTTS 2. It also defines the main inference functions we will use to generate speech (inference, LFinference, and STinference). Think of this as configuring the core engine that drives the speech synthesis process.

from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule

sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(sigma_min=0.0001, sigma_max=3.0, rho=9.0), # empirical parameters
    clamp=False
)

def inference(text, ref_s, alpha = 0.3, beta = 0.7, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)
    tokens = textclenaer(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise = torch.randn((1, 256)).unsqueeze(1).to(device),
                                          embedding=bert_dur,
                                          embedding_scale=embedding_scale,
                                            features=ref_s, # reference from the same speaker as the embedding
                                             num_steps=diffusion_steps).squeeze(1)


        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        ref = alpha * ref + (1 - alpha)  * ref_s[:, :128]
        s = beta * s + (1 - beta)  * ref_s[:, 128:]

        d = model.predictor.text_encoder(d_en,
                                         s, input_lengths, text_mask)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)

        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)


        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(en)
            asr_new[:, :, 0] = en[:, :, 0]
            asr_new[:, :, 1:] = en[:, :, 0:-1]
            en = asr_new

        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(asr)
            asr_new[:, :, 0] = asr[:, :, 0]
            asr_new[:, :, 1:] = asr[:, :, 0:-1]
            asr = asr_new

        out = model.decoder(asr,
                                F0_pred, N_pred, ref.squeeze().unsqueeze(0))


    return out.squeeze().cpu().numpy()[..., :-50] # weird pulse at the end of the model, need to be fixed later

def LFinference(text, s_prev, ref_s, alpha = 0.3, beta = 0.7, t = 0.7, diffusion_steps=5, embedding_scale=1):
  text = text.strip()
  ps = global_phonemizer.phonemize([text])
  ps = word_tokenize(ps[0])
  ps = ' '.join(ps)
  ps = ps.replace('``', '"')
  ps = ps.replace("''", '"')

  tokens = textclenaer(ps)
  tokens.insert(0, 0)
  tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

  with torch.no_grad():
      input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
      text_mask = length_to_mask(input_lengths).to(device)

      t_en = model.text_encoder(tokens, input_lengths, text_mask)
      bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
      d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

      s_pred = sampler(noise = torch.randn((1, 256)).unsqueeze(1).to(device),
                                        embedding=bert_dur,
                                        embedding_scale=embedding_scale,
                                          features=ref_s, # reference from the same speaker as the embedding
                                            num_steps=diffusion_steps).squeeze(1)

      if s_prev is not None:
          # convex combination of previous and current style
          s_pred = t * s_prev + (1 - t) * s_pred

      s = s_pred[:, 128:]
      ref = s_pred[:, :128]

      ref = alpha * ref + (1 - alpha)  * ref_s[:, :128]
      s = beta * s + (1 - beta)  * ref_s[:, 128:]

      s_pred = torch.cat([ref, s], dim=-1)

      d = model.predictor.text_encoder(d_en,
                                        s, input_lengths, text_mask)

      x, _ = model.predictor.lstm(d)
      duration = model.predictor.duration_proj(x)

      duration = torch.sigmoid(duration).sum(axis=-1)
      pred_dur = torch.round(duration.squeeze()).clamp(min=1)


      pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
      c_frame = 0
      for i in range(pred_aln_trg.size(0)):
          pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
          c_frame += int(pred_dur[i].data)

      # encode prosody
      en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
      if model_params.decoder.type == "hifigan":
          asr_new = torch.zeros_like(en)
          asr_new[:, :, 0] = en[:, :, 0]
          asr_new[:, :, 1:] = en[:, :, 0:-1]
          en = asr_new

      F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

      asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
      if model_params.decoder.type == "hifigan":
          asr_new = torch.zeros_like(asr)
          asr_new[:, :, 0] = asr[:, :, 0]
          asr_new[:, :, 1:] = asr[:, :, 0:-1]
          asr = asr_new

      out = model.decoder(asr,
                              F0_pred, N_pred, ref.squeeze().unsqueeze(0))


  return out.squeeze().cpu().numpy()[..., :-100], s_pred # weird pulse at the end of the model, need to be fixed later

def STinference(text, ref_s, ref_text, alpha = 0.3, beta = 0.7, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)

    tokens = textclenaer(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    ref_text = ref_text.strip()
    ps = global_phonemizer.phonemize([ref_text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)

    ref_tokens = textclenaer(ps)
    ref_tokens.insert(0, 0)
    ref_tokens = torch.LongTensor(ref_tokens).to(device).unsqueeze(0)


    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        ref_input_lengths = torch.LongTensor([ref_tokens.shape[-1]]).to(device)
        ref_text_mask = length_to_mask(ref_input_lengths).to(device)
        ref_bert_dur = model.bert(ref_tokens, attention_mask=(~ref_text_mask).int())
        s_pred = sampler(noise = torch.randn((1, 256)).unsqueeze(1).to(device),
                                          embedding=bert_dur,
                                          embedding_scale=embedding_scale,
                                            features=ref_s, # reference from the same speaker as the embedding
                                             num_steps=diffusion_steps).squeeze(1)


        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        ref = alpha * ref + (1 - alpha)  * ref_s[:, :128]
        s = beta * s + (1 - beta)  * ref_s[:, 128:]

        d = model.predictor.text_encoder(d_en,
                                         s, input_lengths, text_mask)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)

        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)


        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(en)
            asr_new[:, :, 0] = en[:, :, 0]
            asr_new[:, :, 1:] = en[:, :, 0:-1]
            en = asr_new

        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(asr)
            asr_new[:, :, 0] = asr[:, :, 0]
            asr_new[:, :, 1:] = asr[:, :, 0:-1]
            asr = asr_new

        out = model.decoder(asr,
                                F0_pred, N_pred, ref.squeeze().unsqueeze(0))


    return out.squeeze().cpu().numpy()[..., :-50] # weird pulse at the end of the model, need to be fixed later

2.4 Downloading reference audio from YouTube

Here we create a function that downloads audio from YouTube, which we can then use as the reference audio for voice cloning.

from yt_dlp import YoutubeDL

def download_audio(youtube_link, output_path="/content"):
  """Downloads audio from a YouTube video.

  Args:
    youtube_link: The URL of the YouTube video.
    output_path: The path to save the audio file. Defaults to "/content".
  """

  ydl_opts = {
      'format': 'bestaudio/best',
      'postprocessors': [{
          'key': 'FFmpegExtractAudio',
          'preferredcodec': 'wav',
          'preferredquality': '192',
      }],
      'outtmpl': f'{output_path}/NA.%(ext)s'  # fixed name "NA" so it matches the reference path used later
  }

  with YoutubeDL(ydl_opts) as ydl:
      ydl.download([youtube_link])

  print(f"Audio downloaded from {youtube_link} and saved to {output_path}")

download_audio("https://www.youtube.com/watch?v=VID") # replace The VID with yours

2.5 Simple inference with StyleTTS 2

Now we start using the model. We specify a reference audio file and some text, then generate speech with the inference function. We also compute the real-time factor (RTF) to see how fast the synthesis runs.

reference_dicts = {}
reference_dicts['Voice_NAME'] = "/content/NA.wav" # set a name for the voice (e.g. Trump) and the path to your reference audio (WAV format)
text = """
StyleTTS 2 is a text to speech model that leverages style diffusion and adversarial training with large speech language models to achieve human level text to speech synthesis
"""
# Simple Usage 
noise = torch.randn(1,1,256).to(device)
for k, path in reference_dicts.items():
    ref_s = compute_style(path)
    start = time.time()
    wav = inference(text, ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1)
    rtf = (time.time() - start) / (len(wav) / 24000)
    print(f"RTF = {rtf:5f}")
    import IPython.display as ipd
    print(k + ' Synthesized:')
    display(ipd.Audio(wav, rate=24000, normalize=False))
    print('Reference:')
    display(ipd.Audio(path, rate=24000, normalize=False))

2.6 Exploring different emotional styles

This section demonstrates how to control the emotional style of the generated speech by providing text prompts that carry different emotions.

texts = {}
ref_s = compute_style("/content/NA.wav")
texts['Happy'] = "We are happy to invite you to join us on a journey to the past, where we will visit the most amazing monuments ever built by human hands."
texts['Sad'] = "I am sorry to say that we have suffered a severe setback in our efforts to restore prosperity and confidence."
texts['Angry'] = "The field of astronomy is a joke! Its theories are based on flawed observations and biased interpretations!"
texts['Surprised'] = "I can't believe it! You mean to tell me that you have discovered a new species of bacteria in this pond?"

for k,v in texts.items():
    wav = inference(v, ref_s, diffusion_steps=10, alpha=0.3, beta=0.7, embedding_scale=1)
    print(k + ": ")
    display(ipd.Audio(wav, rate=24000, normalize=False))

Alpha and beta control how much the style is driven by the text rather than sampled from the reference. The higher the values of alpha and beta, the better the style fits the text, but the lower the similarity to the reference. Using a higher beta makes the synthesized speech more emotional, at the cost of lower similarity to the reference. Alpha determines the speaker's timbre, while beta determines the prosody.
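
To hear the effect of these parameters, you can sweep them over the same text and reference style. The snippet below simply reuses the inference function, ref_s, and texts dictionary defined above; the particular (alpha, beta) grid is only an illustration.

# Illustrative sweep: low alpha/beta stays close to the reference speaker,
# high alpha/beta leans more on the text-predicted style.
for alpha, beta in [(0.1, 0.1), (0.3, 0.7), (0.9, 0.9)]:
    wav = inference(texts['Happy'], ref_s, alpha=alpha, beta=beta,
                    diffusion_steps=10, embedding_scale=1)
    print(f"alpha={alpha}, beta={beta}:")
    display(ipd.Audio(wav, rate=24000, normalize=False))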

2.7 Longform narration with StyleTTS 2

Finally, this code shows how to use the LFinference function to generate speech for longer passages of text. This is particularly useful for tasks such as audiobook narration or creating longer voice-overs.

# Longform Narration
passage =  """
If the supply of fruit is greater than the family needs,it may be made a source of income by sending the fresh fruit to the market if there is one near enough, or by preserving, canning, and making jelly for sale. To make such an enterprise a success the fruit and work must be first class. There is magic in the word "Homemade," when the product appeals to the eye and the palate; but many careless and incompetent people have found to their sorrow that this word has not magic enough to float inferior goods on the market. As a rule large canning and preserving establishments are clean and have the best appliances, and they employ chemists and skilled labor. The home product must be very good to compete with the attractive goods that are sent out from such establishments. Yet for first class home made products there is a market in all large cities. All first-class grocers have customers who purchase such goods.
"""

path = "/content/NA.wav"
s_ref = compute_style(path)
sentences = passage.split('.') # simple split by period
wavs = []
s_prev = None
for text in sentences:
    if text.strip() == "": continue
    text += '.' # add it back

    wav, s_prev = LFinference(text,
                              s_prev,
                              s_ref,
                              alpha = 0.3,
                              beta = 0.9,  # make it more suitable for the text
                              t = 0.7,
                              diffusion_steps=10, embedding_scale=1.5)
    wavs.append(wav)
print('Synthesized: ')
display(ipd.Audio(np.concatenate(wavs), rate=24000, normalize=False))
print('Reference: ')
display(ipd.Audio(path, rate=24000, normalize=False))

StyleTTS 2 demonstrates state-of-the-art performance across various benchmarks, achieving impressive results in both subjective and objective evaluations. Its ability to generate diverse and expressive speech without reference audio opens up exciting possibilities for voice cloning and personalized speech synthesis applications.

3. Tortoise-TTS: Combining an Autoregressive Decoder with Diffusion Models

Tortoise-TTS takes a different approach, drawing inspiration from advances in image generation by combining an autoregressive decoder with a denoising diffusion probabilistic model (DDPM). This unique architecture lets Tortoise-TTS exploit the strengths of both methods to achieve high-quality speech synthesis.

Key features of Tortoise-TTS include:

  • Autoregressive decoder plus DDPM: An autoregressive transformer converts text into a sequence of "speech tokens", which a DDPM then decodes into a high-quality mel spectrogram. This combination enables efficient and accurate text-to-speech conversion.
  • Contrastive Language-Voice Pretraining (CLVP) model: Inspired by DALL-E, Tortoise-TTS uses a CLIP-like model trained on text-speech pairs to score the outputs of the autoregressive decoder. This lets the highest-quality candidates be selected before the computationally expensive DDPM decoding step (a toy sketch of this candidate-and-rerank flow follows this list).
  • Conditioning inputs: To improve voice characteristics and narrow the search space, Tortoise-TTS takes reference audio clips of the target speaker as additional input, conditioning both the autoregressive and diffusion models.
  • The "Tortoise trick": Fine-tuning the DDPM on the autoregressive model's latent space, rather than on discrete tokens, significantly improves efficiency and quality.
  • Large-scale training: Tortoise-TTS benefits from training on a massive dataset combining LibriTTS, HiFiTTS, and an independently collected 49,000-hour corpus of audiobooks and podcasts.
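
To make this flow concrete, here is a minimal toy sketch of the generate, score, decode pipeline: the autoregressive stage proposes several candidate speech-token sequences, a CLVP-like scorer picks the best one, and only that winner goes through the expensive diffusion decoding. Every function here is a self-contained placeholder standing in for the real Tortoise components, not the actual Tortoise API.

import torch

torch.manual_seed(0)

def toy_autoregressive_candidates(text_tokens, num_candidates=4, length=64):
    # Placeholder for the autoregressive transformer: propose several
    # "speech token" sequences for the same text (and conditioning audio).
    return [torch.randint(0, 8192, (length,)) for _ in range(num_candidates)]

def toy_clvp_score(text_tokens, speech_tokens):
    # Placeholder for CLVP: a CLIP-like model that scores how well a
    # speech-token sequence matches the text. Random score for illustration.
    return torch.rand(()).item()

def toy_ddpm_decode(speech_tokens):
    # Placeholder for the diffusion decoder: speech tokens -> mel spectrogram
    # (a vocoder would then turn the mel into a waveform).
    return torch.randn(100, speech_tokens.shape[0])  # (n_mels, frames), dummy values

text_tokens = torch.randint(0, 256, (32,))
candidates = toy_autoregressive_candidates(text_tokens)
best = max(candidates, key=lambda c: toy_clvp_score(text_tokens, c))  # CLVP re-ranking
mel = toy_ddpm_decode(best)                                           # decode only the best candidate
print(mel.shape)                                                      # torch.Size([100, 64])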

4. Tortoise-TTS Voice Cloning Code Example

Here we install the necessary packages, clone the Tortoise-TTS repository, and install its specific dependencies.

%%capture
!pip3 install -U scipy
!git clone https://github.com/jnordberg/tortoise-tts.git
%cd tortoise-tts
!pip3 install -r requirements.txt
!pip3 install transformers==4.19.0 einops==0.5.0 rotary_embedding_torch==0.1.5 unidecode==1.3.5
!python3 setup.py install
!pip install yt_dlp

4.1 Importing libraries and initializing Tortoise-TTS

Now we import the libraries needed for audio handling and for Tortoise-TTS itself. We also create a TextToSpeech object, the main interface for interacting with the model. This downloads the necessary model files from the Hugging Face hub, so it may take a moment.

import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F
import IPython
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices
# This will download all the models used by Tortoise from the HuggingFace hub.
tts = TextToSpeech()

4.2 Defining the text and voice parameters

import os
text = """
This methodology of improving performance need not be confined to images. This
paper describes a way to apply advances in the image generative domain to speech
synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system
"""
# Name of the voice
CUSTOM_VOICE_NAME = "ANY" # e.g. "Trump", or any other name
# preset controls the generation speed / quality trade-off
preset = "standard"
custom_voice_folder = f"tortoise/voices/{CUSTOM_VOICE_NAME}"
os.makedirs(custom_voice_folder, exist_ok=True)  # don't fail if the folder already exists

Here we specify the text to convert to speech and set a few parameters for our custom voice. We create a folder to store the voice data and choose a name for it. We also define the preset, which controls how fast the speech is generated (trading speed against quality).
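
Tortoise ships several presets that trade generation speed against quality; to the best of my knowledge, the available names in the repository used here are the ones listed below (defined in tortoise/api.py). Switching is just a matter of changing the preset string before generation.

# Preset names from the Tortoise repository: faster presets use fewer
# autoregressive samples and diffusion iterations, at some cost in quality.
PRESETS = ["ultra_fast", "fast", "standard", "high_quality"]
preset = "standard"  # try "fast" for quick drafts, "high_quality" for the best output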

4.3 Getting voice data (YouTube or upload)

This section gives us two options for obtaining the audio data for our custom voice: downloading it from a YouTube video with yt-dlp, or uploading WAV files directly.

Option 1: Download from YouTube:

from yt_dlp import YoutubeDL

def download_audio(youtube_link):
  """Downloads audio from a YouTube video.
  Args:
    youtube_link: The URL of the YouTube video.
  """
  ydl_opts = {
      'format': 'bestaudio/best',
      'postprocessors': [{
          'key': 'FFmpegExtractAudio',
          'preferredcodec': 'wav',
          'preferredquality': '192',
      }],
      'outtmpl': f'{custom_voice_folder}/0.%(ext)s'  # save as 0.wav inside the custom voice folder
  }
  with YoutubeDL(ydl_opts) as ydl:
      ydl.download([youtube_link])
  print(f"Audio downloaded from {youtube_link} and saved to {custom_voice_folder}")
download_audio("https://www.youtube.com/watch?v=VID") # replace The VID with yours

Option 2: Upload from your computer:

from google.colab import files
for i, file_data in enumerate(files.upload().values()):
  with open(os.path.join(custom_voice_folder, f'{i}.wav'), 'wb') as f:
    f.write(file_data)

4.4 Generating speech with the custom voice

Finally, this is where the magic happens. We load the collected voice data, use the tts_with_preset function to generate speech from the input text with the chosen voice, and save the result as a WAV file. We can even listen to it right in the notebook.

# Generating speech with the custom voice.
voice_samples, conditioning_latents = load_voice(CUSTOM_VOICE_NAME)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
                          preset=preset)
torchaudio.save(f'generated-{CUSTOM_VOICE_NAME}.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(f'generated-{CUSTOM_VOICE_NAME}.wav')

Through this combination of innovative techniques and large-scale training, Tortoise-TTS achieves impressive results in speech quality and naturalness. Its ability to borrow advances from other domains, such as image generation, highlights the potential of cross-domain learning to push the boundaries of speech synthesis.

5. Closing Thoughts

We have looked at the inner workings of Tortoise-TTS and StyleTTS 2, two impressive examples of how far text-to-speech technology has come. They offer powerful tools for generating realistic and expressive synthetic voices, opening up possibilities for a wide range of applications.

It is important to remember that these are just two of many examples in the constantly evolving field of voice cloning. The open-source community keeps pushing the boundaries, and depending on your specific needs there may be models that suit you better.

The real magic often lies in customization. You can take these pretrained models and fine-tune them on your own data to create voices that are even more distinctive and better suited to your use case. This is particularly useful for capturing specific slang, accents, or other nuances that may not be present in the original training data.

The best voice-cloning model for you depends on your individual needs and how much effort you are willing to invest in training and customization. The journey of exploring these technologies is ongoing, and with continued learning and experimentation you can unlock the full potential of voice cloning for your projects.


Original article: Hands-On with Voice Cloning: Code Examples and Insights from TorToise-TTS and StyleTTS 2

Translated and compiled by 汇智网; please credit the source when reposting.