Text-to-Speech (TTS) is a technology that converts written text into spoken audio. It is the reverse of Automatic Speech Recognition (ASR), which converts spoken words into written text. TTS is used in a variety of applications, from voice assistants like Siri and Alexa to screen readers and other reading tools for people with visual impairments.
The process of TTS generally involves two main steps:
Text Analysis: In this step, the system first expands numbers, abbreviations, and symbols into full words, a process called text normalization. It then determines how each word is pronounced, typically with a pronunciation dictionary or grapheme-to-phoneme conversion. It also analyzes the structure of each sentence to decide its phrasing, emphasis, and pitch contour, a process called prosody assignment.
Speech Synthesis: In this step, the system generates the speech output. Concatenative TTS does this by selecting and joining small segments of recorded speech from a large database. Statistical parametric TTS instead uses a model to predict acoustic parameters, which a vocoder then turns into a waveform. More recent neural TTS systems use deep learning models to generate speech more directly from the text.
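The two steps above can be illustrated with a deliberately tiny sketch. It is not how any production TTS system is implemented: the lookup tables, the function names (normalize, concatenate), and the sine-wave "recordings" standing in for a speech database are all assumptions made for illustration.

```python
import math
import re

# Hypothetical, tiny lookup tables; a real system uses far larger resources.
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Toy text normalization: expand abbreviations and small numbers."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Only standalone one- or two-digit numbers are spelled out here.
    return re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)


def tone(freq_hz, n_samples=80, rate=8000):
    """Generate a short sine wave as a stand-in for a recorded speech unit."""
    return [math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(n_samples)]


# Stand-in "database" of units; real systems store recorded speech segments.
UNIT_DB = {"hello": tone(220), "world": tone(330)}


def concatenate(units, overlap=16):
    """Toy concatenative synthesis: join units with a linear crossfade."""
    out = []
    for name in units:
        segment = UNIT_DB[name]
        if not out:
            out.extend(segment)
            continue
        # Blend the tail of the output with the head of the next segment.
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * segment[i]
        out.extend(segment[overlap:])
    return out


print(normalize("Dr. Smith has 42 cats."))
samples = concatenate(["hello", "world"])
print(len(samples))
```

The crossfade at each join hints at why concatenative systems need care at segment boundaries: abrupt joins between units are a common source of audible artifacts.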
One of the main challenges in TTS is making the speech sound natural and human-like. This involves getting the pronunciation, intonation, and pacing right, and also making the voice sound expressive and emotional when needed. The best TTS systems today can produce very high-quality speech that is almost indistinguishable from a human voice, but improving the quality and naturalness of synthesized speech remains an active area of research.