
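Creating AI Voiceovers with Kokoro TTS

What would you think if I told you that you can now create professional-quality AI voiceovers completely free, with a tool that is not only good but possibly better than industry-leading (and expensive) services such as ElevenLabs? Sounds too good to be true?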

admin

Feb 2, 2025 • 12 min read

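That assumption is exactly what the creators of Kokoro TTS are proving wrong every day. This is not just another text-to-speech model; it is an open-source model with 82 million parameters that has shaken up the AI voice world. It was trained on a very small budget and runs on everyday computers, yet it has outperformed far larger competitors on a major leaderboard.

Curious? You should be.

This post not only explains what is behind Kokoro TTS, it also walks you step by step through using it to create impressive voiceovers for your projects.

Buckle up, because the future of AI voice is here, and it is free. This is more than a tutorial; it is your entry point into a community that is reshaping the field.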

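1. The Kokoro TTS Phenomenon: David and Goliath in the AI Arena

We all love an underdog story, and in the world of AI voice, Kokoro TTS is the ultimate underdog. Picture it: a tiny model with only 82 million parameters, trained on less than 100 hours of audio, going head to head with giants such as ElevenLabs, whose models are trained on huge datasets and backed by substantial funding. Sounds doomed to fail, right? Wrong.

Surprisingly, Kokoro TTS did not just compete; it won. It took first place on the Hugging Face TTS Arena, a leaderboard that rigorously evaluates the best text-to-speech models. How did it pull off this seemingly impossible feat?

The secret lies in its clever architecture and efficient design. It is a highly optimized model that uses a decoder-only architecture and was trained for roughly 500 GPU hours on A100 80GB GPUs. That efficiency translates into several remarkable advantages:

Ultra-fast inference: generates audio in real time, even on a CPU. No more waiting on cloud servers.
Runs anywhere: from laptops to a Raspberry Pi, Kokoro TTS runs on almost any device.
Open-source freedom: modify it, improve it, and integrate it into your projects without restrictions. The power really is in your hands.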
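One way to check the real-time claim on your own machine is the real-time factor: synthesis time divided by the duration of the audio produced, where a value below 1 means faster than real time. The sketch below is a minimal illustration, not part of Kokoro; the names real_time_factor and time_synthesis are hypothetical helpers, and the 24 kHz default is taken from the sample rate used in this post's code.

```python
import time
from typing import Callable, Sequence

SAMPLE_RATE = 24_000  # Hz; the rate used in this post's code


def real_time_factor(num_samples: int, elapsed_seconds: float,
                     sample_rate: int = SAMPLE_RATE) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def time_synthesis(synthesize: Callable[[str], Sequence[float]], text: str) -> float:
    """Time one call to a synthesis function and return its real-time factor.

    `synthesize` is a placeholder for any function that maps text to a
    sequence of audio samples; it is not a Kokoro API.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(len(audio), elapsed)
```

For example, two seconds of audio (48,000 samples at 24 kHz) generated in one second gives a factor of 0.5, which is twice as fast as real time.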

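But the most surprising part? It does all this while being completely free. This is not only about saving money; it is about making cutting-edge AI accessible to everyone.

Still, the question remains: can a free tool really match premium services? Let's put it to the test...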

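2. Unlocking the Magic: A Step-by-Step Guide to Kokoro TTS

Enough suspense. Let's get hands-on. Here is how to start creating with Kokoro TTS right now. This tutorial uses Google Colab, so everyone can follow along regardless of hardware.

2.1 Connecting to Colab (a free AI playground)

First, open this Google Colab notebook: here

2.2 Starting the engine (one-click install)

Run the first cell in the notebook. It installs all the necessary dependencies, including the kokoro library itself and a few other tools that make the magic happen. Think of it as setting up your digital voice studio.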

# 1️⃣ Install kokoro
!pip install -q kokoro soundfile

# 2️⃣ Install espeak, used for out-of-dictionary fallback
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# You can skip espeak installation, but OOD words will be skipped unless you provide a fallback

# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf

# 🇺🇸 'a' => American English
# 🇬🇧 'b' => British English
pipeline = KPipeline(lang_code='a') # make sure lang_code matches voice

# The following text is for demonstration purposes only, unseen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
'''

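2.3 Choosing your voice (meet the cast)

Kokoro TTS comes with a range of distinct voices, each with its own character. Run the second cell to load the default voice pack, an appealing blend of two different voices: Bella and Sarah.

Secret weapon: want to try other voices? You can find the list of available voice packs on the Kokoro TTS Hugging Face page. Just change VOICE_NAME in the code (in the code shown in this post, that is the name passed as the voice argument). For example, the voice packs currently include:

af: the default voice, a 50/50 mix of Bella and Sarah
af_bella, af_sarah, am_adam, am_michael,
bf_emma, bf_isabella, bm_george, bm_lewis,
af_nicole, af_sky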
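The voice names follow a visible pattern, and the install cell warns that lang_code must match the voice. The helper below is a small sketch of that check, not part of the kokoro library: describe_voice and matches_lang are hypothetical names. The first letter (a for American English, b for British English) comes from the comments in the install cell; reading the second letter as f for female and m for male is an assumption inferred from the voice names listed above.

```python
LANGUAGES = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}  # assumption inferred from the voice names


def describe_voice(voice: str) -> dict:
    """Split a voice name such as 'bm_george' into language, gender, and speaker."""
    prefix, _, speaker = voice.partition("_")
    if len(prefix) != 2 or prefix[0] not in LANGUAGES or prefix[1] not in GENDERS:
        raise ValueError(f"unrecognized voice name: {voice!r}")
    return {
        "lang_code": prefix[0],
        "language": LANGUAGES[prefix[0]],
        "gender": GENDERS[prefix[1]],
        "speaker": speaker or None,  # 'af' alone is the default blend
    }


def matches_lang(voice: str, lang_code: str) -> bool:
    """Return True if the voice's first letter equals the pipeline's lang_code."""
    return describe_voice(voice)["lang_code"] == lang_code
```

A quick check such as matches_lang('bf_emma', 'a') returns False, which is a hint to create the pipeline with lang_code='b' before using that voice.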

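2.4 Saying your piece (making the AI talk)

Now for the fun part! In the third cell you will find a text variable. Replace the sample text with whatever you want Kokoro TTS to say. Get creative!

2.5 Hearing the magic (instant audio)

Run the cell. Within a few seconds an audio player appears in your Colab notebook. Press play and listen as your text is turned into surprisingly natural speech.

Action item: generate audio from different text inputs. Try different voices and listen closely for the nuances.

But we are only getting started. What if you could create your own unique AI voice?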

from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch

# 1. Initialize the pipeline
pipeline = KPipeline(lang_code='a')  # 'a' for American English

# 2. Define the available voices (from the model's config)
# Though not directly used in mixing, it's good practice to keep this list
available_voices = [
    'af', 'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky'
]

# 3. Specify Device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# 4. Choose the voices to compare (voice packs are fetched from Hugging Face on first use)
voice_01 = 'af_bella'
voice_02 = 'am_michael'
# voice_03 = 'bm_lewis'

# 5. Define the text
text = '''
Kokoro TTS stands at the intersection of art and technology, weaving humanlike warmth into synthetic voices. 
It’s the future of audio creation, boldly forging a path where open-source innovation conquers barriers,
 granting every creator the freedom to share their story.
'''

# 6. Generate audio using individual voices
print("Generating audio with individual voices...")

generator_01 = pipeline(text, voice=voice_01, speed=1, split_pattern=r'\n+')
for i, (gs, ps, audio) in enumerate(generator_01):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    # Include the voice name in the filename so the second loop does not overwrite these files
    sf.write(f'{voice_01}_{i}.wav', audio, 24000)

generator_02 = pipeline(text, voice=voice_02, speed=1, split_pattern=r'\n+')
for i, (gs, ps, audio) in enumerate(generator_02):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{voice_02}_{i}.wav', audio, 24000) # save each segment under the voice's name

# generator_03 = pipeline(text, voice=voice_03, speed=1, split_pattern=r'\n+')
# for i, (gs, ps, audio) in enumerate(generator_03):
#     print(f"Voice 03 - Segment {i}:")
#     print(gs)
#     display(Audio(data=audio, rate=24000, autoplay=False))  # 24000 Hz, as in the loops above
#     sf.write(f'voice_03_segment_{i}.wav', audio, 24000)
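The loops above save one WAV file per text segment. For a voiceover you usually want a single file, so here is a minimal sketch that joins segments with a short pause and writes one 16-bit mono WAV using only the Python standard library. The names join_segments and write_wav and the 0.25 second default gap are illustrative assumptions, not part of kokoro; the 24 kHz rate is the one used in the code above.

```python
import struct
import wave
from typing import Iterable, Sequence

SAMPLE_RATE = 24_000  # Hz, the rate used in the code above


def join_segments(segments: Iterable[Sequence[float]], gap_seconds: float = 0.25,
                  sample_rate: int = SAMPLE_RATE) -> list:
    """Concatenate float segments, inserting silence between consecutive ones."""
    gap = [0.0] * int(gap_seconds * sample_rate)
    joined = []
    for index, segment in enumerate(segments):
        if index > 0:
            joined.extend(gap)
        joined.extend(float(sample) for sample in segment)
    return joined


def write_wav(path: str, samples: Sequence[float], sample_rate: int = SAMPLE_RATE) -> None:
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
```

In the notebook you could collect the audio value from each loop iteration into a list and pass that list to join_segments, then call write_wav on the result. Converting sample by sample is slow for long recordings, so treat this as a sketch rather than a production path.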

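3. Voice Alchemy: Crafting Your Signature Voice

This is where Kokoro TTS really shines. It is not just about picking a voice; it is about creating one. Thanks to its voice-mixing capability, you can blend different voice packs and fine-tune the ratio to produce a voice that is uniquely yours.

Imagine:

Bella's warmth combined with Sarah's crispness, for a voice that is both friendly and authoritative.
Michael's deep tone blended with Emma's gentle voice, ideal for a calming meditation app.

The possibilities are limited only by your imagination.

How does it work?

It is very simple. Each voice pack is represented by a set of numbers (an embedding). By adding these numbers together in different proportions (a weighted sum, such as 0.7 of one voice plus 0.3 of another), you effectively blend the characteristics of the voices.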

# 7. Mix the voices (70% af_bella, 30% am_michael)
print("\nGenerating audio with mixed voice (70/30 blend)...")
# Load the two voice packs as tensors. Assumption: your installed kokoro
# version provides KPipeline.load_voice, which fetches and caches the pack.
voice_01_pt = pipeline.load_voice(voice_01)
voice_02_pt = pipeline.load_voice(voice_02)
mixed_voice_bf = voice_01_pt * 0.7 + voice_02_pt * 0.3

# 8. Generate audio using the mixed voice
# Instead of passing the tensor directly, register it under a new voice key.
# Voice packs are expected to have shape (510, 1, 256); squeeze(0) only drops a
# leading dimension of size 1, so it leaves a pack of that shape unchanged.
pipeline.voices['mixed_voice'] = mixed_voice_bf.squeeze(0)

generator_mixed = pipeline(
    text, voice='mixed_voice',  # Use the registered key
    speed=1, split_pattern=r'\n+'
)

for i, (gs, ps, audio) in enumerate(generator_mixed):
    print(f"Mixed Voice - Segment {i}:")
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'mixed_{i}.wav', audio, 24000) # save each segment without overwriting earlier files

print("\nAudio generation complete!")

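Explanation:

Load the voice packs: we load two different voice packs as tensors (the .pt files are read with torch.load()); these are voice_01_pt and voice_02_pt.
Mix the voices: we multiply each voice pack by a weight (its share of the mix) and add the results to create the mixed voice. In this example we create a 70/30 blend of af_bella and am_michael.
Generate with the mixed voice: we register the blend in pipeline.voices under the key mixed_voice and call the pipeline with voice='mixed_voice' to synthesize audio.
Display the audio: we use display and Audio from IPython.display to play the result.

Action item: spend some time experimenting with different voice combinations and ratios. Create a voice that fits your project exactly, and save the settings of your favorite custom voices for later use.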
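Since a custom voice is fully described by its voice names and ratios, saving it can be as simple as writing those numbers to a JSON file. The sketch below shows one way to do that; save_recipe, load_recipe, and the file layout are hypothetical choices for illustration, not a Kokoro feature.

```python
import json
from pathlib import Path


def save_recipe(path: str, name: str, weights: dict) -> None:
    """Add or update one named blend in a JSON file of recipes."""
    if not weights or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and non-negative")
    file = Path(path)
    # Start from the existing recipes, if the file is already there.
    recipes = json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}
    recipes[name] = weights
    file.write_text(json.dumps(recipes, indent=2, sort_keys=True), encoding="utf-8")


def load_recipe(path: str, name: str) -> dict:
    """Return the voice-to-weight mapping stored under `name`."""
    recipes = json.loads(Path(path).read_text(encoding="utf-8"))
    return recipes[name]
```

For instance, save_recipe('voices.json', 'narrator', {'af_bella': 0.7, 'am_michael': 0.3}) records the 70/30 blend from the example, and load_recipe('voices.json', 'narrator') returns the same mapping in a later session.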

Example 2, a similar blend with equal weights:

# 7. Mix the voices (equal average of af_bella and am_michael)
print("\nGenerating audio with mixed voice (average)...")
# Assumption: your installed kokoro version provides KPipeline.load_voice,
# which returns the voice pack as a tensor.
voice_01_pt = pipeline.load_voice(voice_01)
voice_02_pt = pipeline.load_voice(voice_02)
mixed_voice_bf = (voice_01_pt + voice_02_pt) / 2

# 8. Generate audio using the mixed voice
# Instead of passing the tensor directly, register it under a new voice key.
# Voice packs are expected to have shape (510, 1, 256); squeeze(0) only drops a
# leading dimension of size 1, so it leaves a pack of that shape unchanged.
pipeline.voices['mixed_voice'] = mixed_voice_bf.squeeze(0)

generator_mixed = pipeline(
    text, voice='mixed_voice',  # Use the registered key
    speed=1, split_pattern=r'\n+'
)

for i, (gs, ps, audio) in enumerate(generator_mixed):
    print(f"Mixed Voice - Segment {i}:")
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'mixed_avg_{i}.wav', audio, 24000) # save each segment without overwriting earlier files

print("\nAudio generation complete!")
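Both examples are special cases of the same operation: a weighted sum of voice packs. The sketch below generalizes it to any number of voices. The blend function is a hypothetical helper, not a kokoro API, and dividing by the total weight is an assumption of this sketch, made so that ratios like 7:3 and 0.7:0.3 give the same result and the blend stays on the same scale as the original packs.

```python
def blend(voices: dict, weights: dict):
    """Return the weighted sum of voices, with weights normalized to sum to 1.

    Works with any values that support multiplication by a float and
    addition, such as the torch tensors used for voice packs, or plain floats.
    """
    if not weights:
        raise ValueError("weights must not be empty")
    missing = [name for name in weights if name not in voices]
    if missing:
        raise KeyError(f"no voice data for: {missing}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mixed = None
    for name, weight in weights.items():
        term = voices[name] * (weight / total)  # this voice's share of the mix
        mixed = term if mixed is None else mixed + term
    return mixed
```

With the tensors from the example, blend({'af_bella': voice_01_pt, 'am_michael': voice_02_pt}, {'af_bella': 7, 'am_michael': 3}) would correspond to the 70/30 mix, and equal weights to the plain average.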

Here you become the voice artist, shaping the ideal AI voice for your project.

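4. Beyond the Basics: Pro Tips and What's Next

You now have the basics of Kokoro TTS. To unlock its full professional potential, here are some advanced tips and a look ahead:

Master pronunciation: Kokoro TTS handles most words with ease, but it relies on espeak-ng as a fallback for grapheme-to-phoneme (g2p) conversion. For words that are not in its dictionary it makes a best guess, which may not always be right.
Sentence structure matters: shorter sentences generally produce better fluency and intonation. Try breaking up long sentences for best results.
The power of punctuation: commas, periods, and question marks are not just grammar; they guide the model's pauses and intonation. Use them deliberately.
Community power: join the Kokoro TTS community on GitHub and Hugging Face. Share your creations, learn from others, and contribute to the project. That is where you will find the latest tips, tricks, and voice packs.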
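The tips on sentence length and punctuation can be applied automatically before synthesis. The sketch below splits text into sentences and breaks overlong sentences at commas, semicolons, and colons. The function name split_for_tts and the 200-character default are illustrative assumptions, and the simple regular expressions will also split after abbreviations such as "Dr.", so review the output for important scripts.

```python
import re

# Split after ., !, or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Split after , ; or : when followed by whitespace.
_CLAUSE_END = re.compile(r"(?<=[,;:])\s+")


def split_for_tts(text: str, max_chars: int = 200) -> list:
    """Split text into sentences, then split overlong sentences at clause marks."""
    chunks = []
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            chunks.append(sentence)
            continue
        # Greedily pack clauses into chunks of at most max_chars characters.
        current = ""
        for clause in _CLAUSE_END.split(sentence):
            candidate = f"{current} {clause}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = clause
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Because the code in this post calls the pipeline with split_pattern=r'\n+', joining the chunks with newlines, as in "\n".join(split_for_tts(text)), should make each chunk its own segment.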

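What is next for Kokoro TTS? The developers keep improving it. The model is currently under further development with increasingly large datasets, which will likely improve output quality.

Summary: Kokoro TTS is not just a tool, it is a movement. Its goal is to put high-quality AI voice in everyone's hands. It is a revolution in the making.

5. Closing Thoughts

Let your voice and your story be amplified by AI.

Kokoro TTS is more than a free alternative to premium text-to-speech services. It demonstrates the power of open-source collaboration and offers a glimpse of the future of AI voice technology. We have explored its capabilities, learned how to use it effectively, and discovered the art of crafting unique voices through mixing. Now it is your turn to join in. Whether you are an experienced developer, a creative content creator, or simply curious about the pos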

Original article: How to Create Professional AI Voiceovers for Free with Kokoro TTS (Better than ElevenLabs?)

Translated and compiled by 汇智网; please credit the source when republishing.