How speech synthesis works. Speech synthesis, speech recognition - speech signal processing

Speech synthesis is the conversion of previously unknown text into speech. Speech output is one way of implementing a voice interface that makes a system easier to use. In effect, speech synthesis provides one more channel, alongside the monitor, for transmitting data from a computer or mobile phone to a person. Of course, a drawing cannot be conveyed by voice, but listening to email or a daily schedule is quite convenient in some cases, especially when your eyes are busy with something else. For example, when you arrive at work in the morning and prepare for negotiations, you could straighten your tie or hair in front of the mirror while the computer reads aloud the latest news or mail, or reminds you of information that matters for the negotiations.

Figure 2.2 - Acoustic signal processing

Speech synthesis technology is widely used by people with visual impairments. For everyone else, it adds a new dimension of convenience, significantly reduces strain on the eyes and nervous system, and makes it possible to use auditory memory.

Figure 2.3 - Speech synthesis

Any text consists of words separated by spaces and punctuation marks. The pronunciation of words depends on their position in a sentence, and the intonation of a phrase depends on punctuation. Finally, pronunciation also depends on the meaning of the word. Accordingly, for synthesized speech to sound natural, a whole range of problems has to be solved: making the voice sound smooth and natural in its intonation, placing stress correctly, and expanding abbreviations, numbers, acronyms and special characters, all while taking into account the grammar of the Russian language.
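The step of expanding abbreviations and numbers can be illustrated with a small sketch. It is only an assumption-laden example: it works on English rather than Russian text, the abbreviation table is hypothetical, and numbers are handled only in the range 0 to 999. A real system would need a large lexicon and grammatical agreement rules.

```python
import re

# Hypothetical abbreviation table; a real system would use a large lexicon.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand known abbreviations and spell out integers of up to three digits."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Longer digit runs are left untouched in this sketch.
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)
```

The output of such a step is plain words only, which is what the later pronunciation and intonation stages expect.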

There are several approaches to solving these problems:

1) allophone synthesis systems - provide a stable but insufficiently natural, robotic sound;

2) systems based on the Unit Selection approach - provide a much more natural sound, but may contain fragments of speech with sharp dips in quality, up to loss of intelligibility;

3) hybrid technology based on the Unit Selection approach and supplemented with allophone synthesis units.

Based on this technology, the VitalVoice system was created, which provides stable and natural sound at an acoustic level.

Speech is a natural and convenient way for people to communicate. The goal of speech recognition is to remove the intermediary in communication between a person and a computer. Controlling a car by voice in real time, or entering information by speaking, would make modern life much easier. Teaching a machine to understand, without an intermediary, the language that people speak among themselves is the task of speech recognition.

Scientists and engineers have been working on the problem of spoken communication between humans and machines for many years. The first speech recognition device appeared in 1952; it could recognize digits spoken by a person. Commercial speech recognition programs appeared in the early 1990s.

All speech recognition systems can be divided into two classes:

1) Speaker-dependent systems - are tuned to the speaker’s speech during the learning process. To work with another speaker, such systems require complete reconfiguration.

Figure 2.4 - Speech recognition

2) Speaker-independent systems - the operation of which does not depend on the speaker. Such systems do not require preliminary training and are capable of recognizing the speech of any speaker.

Systems of the first type were the first to reach the market. In them, the acoustic image of each command was stored as a whole-word reference template. Dynamic programming techniques were used to compare the unknown utterance with the command templates. These systems worked well when recognizing small sets of 10-30 commands and understood only one speaker. To work with another speaker, they required complete reconfiguration.
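The text does not say which dynamic programming technique was used; a common choice for template matching is dynamic time warping (DTW), and the sketch below assumes it. For brevity each utterance is a list of single numbers, whereas real systems compare sequences of acoustic feature vectors. The function names are illustrative.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Allow a step in either sequence, or in both at once.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the command whose stored template is closest to the utterance."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))
```

Because the alignment may stretch or compress either sequence, the same command spoken faster or slower still matches its template. Every command needs its own stored template, which is why the approach does not scale to large vocabularies.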

To understand continuous speech, it was necessary to move on to much larger vocabularies, from several tens to hundreds of thousands of words. The methods used in systems of the first type were not suitable for this, since it is simply impossible to create templates for so many words.

In addition, there was a desire to make systems independent of the speaker. This is a very difficult task, since each person has an individual manner of speaking: speech rate, voice timbre, pronunciation habits. Such differences are called speech variability. To account for it, new statistical methods were proposed, based mainly on the mathematics of Hidden Markov Models (HMMs) or artificial neural networks. Instead of creating templates for each word, models are created for the individual sounds that make up words, so-called acoustic models. Acoustic models are built by statistically processing large speech databases containing recordings of hundreds of people.
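As a rough illustration of how a Hidden Markov Model scores speech, the sketch below computes the probability of an observation sequence under a discrete HMM using the forward algorithm. The discrete observation symbols and the parameter layout are simplifying assumptions for this example; practical acoustic models work with continuous feature vectors.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM, computed with the forward algorithm.

    obs   - list of observation symbol indices
    start - start[s] is the probability of starting in state s
    trans - trans[p][s] is the probability of moving from state p to state s
    emit  - emit[s][o] is the probability of state s emitting symbol o
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)
```

With one such model per sound, a recognizer can score an unknown segment against every model and prefer the one with the highest likelihood.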

Existing speech recognition systems use two fundamentally different approaches:

Lexical recognition

Note that creating speech recognition systems is an extremely difficult task.

Speech synthesizer programs become a bigger part of our lives every year. They help us study foreign languages more thoroughly, convert texts into a convenient audio format, serve as components of various utility programs, and much more. When we need to play some text aloud online, many of us turn to speech synthesis services and programs that can convert it. In this article I will talk about the online versions of such products: what an online speech synthesizer is, which online speech synthesis services exist, and how to use them.

The best online speech synthesizers

Initially, speech synthesizers were developed for people with visual impairments to reproduce text using a computer voice. But gradually their advantages were appreciated by a mass audience, and now almost anyone can download a speech synthesizer on a PC, or use alternatives that are present in some versions of operating systems.

So which online speech synthesizer can you choose? Below I will list a number of services that allow you to reproduce text to speech online.

Ivona is a great synthesizer

The voice engines of this online service are of very high quality, have a good phonetic basis, sound quite natural, and the “metallic” computer voice is felt much less often here than in competing services.

The Ivona service has support for many languages; in the Russian version there is a male voice (Maxim) and a female voice (Tatyana).

  1. To use the speech synthesizer, go to this resource; on the left there is a window into which you paste the text to be read.
  2. Paste the text, click the button with the person icon, select the language (Russian) and the voice (female or male), and click the “Play” button.

Unfortunately, the free functionality of the site is limited to 250 characters of text and is intended more to demonstrate the capabilities of the service than for serious work with text. More is available only for a fee.

Acapela - speech synthesis service

Acapela, a company that sells its voice engines for various technical solutions, lets you try its speech synthesizer online. Although the prosody of this service is not at the same level as Ivona's, the quality of pronunciation is still very good. The Acapela resource supports about 100 voices in 34 languages.

  1. To use the functionality of the resource, open the specified service, select the Russian language in the left window (Select a language – Russian).
  2. Paste the desired text below and click on the “Listen” button.

The maximum text size for audio reading is 300 characters.

Fromtexttospeech - online service

To convert text to speech online, you can also use the fromtexttospeech service. It converts text into an mp3 audio file, which you can then download to your computer. The service accepts up to 50 thousand characters of text, which is quite a significant amount.

  1. To work with the fromtexttospeech service, go to it, in the “Select Language” option, select “Russian” (there is only one voice here - Valentina).
  2. In the large window, enter (paste) the text you need for voiceover, then click on the “Create Audio File” button.
  3. The text will be processed, then you can listen to the result, and then download it to your PC.
  4. To do this, right-click on “Download audio file” and select “Save target as” from the menu that appears.

Google Translate can also be used

The well-known Google translator online has a built-in text-to-speech function, and the amount of text read here can be quite voluminous.

  1. To work with it, open this service.
  2. Select the Russian language in the window on the left, and click on the “Listen” button with a speaker at the bottom.

The playback quality is at a fairly tolerable level, but no more.

Text-to-speech - online speech synthesizer

Another resource that provides speech synthesis of normal quality. The free functionality is limited to typing text up to 1000 characters long.

  1. To work with the service, go to this website, in the window on the right next to the “Language” option, select Russian.
  2. In the window, type (or copy from an external source) the required text, and then click on the “Say It” button on the right.
  3. A link to the pronunciation of the specified text can also be placed in your email or web page by clicking on the “Yes” button just below.

Alternative PC programs for text-to-speech conversion

There are also programs for speech synthesis, such as TextSpeechPro AudioBookMaker, ESpeak, Voice Reader 15, VOICE and a number of others that can convert text into speech. They need to be downloaded and installed on your computer, and the functionality and capabilities of these products usually slightly exceed the capabilities of the online services considered. Their detailed characteristics deserve separate extensive material.


So which online speech synthesizer should you choose? In most of them, the free options are significantly limited, and in terms of sound quality the Ivona service leaves its competitors behind. If you want to quickly convert your text into an audio file, use the “fromtexttospeech” resource: it produces good-quality results in a fairly short time.

Speech synthesizers installed on computers or mobile devices no longer seem to be such unusual programs as before. Thanks to modern technologies a regular desktop PC can reproduce a human voice.

How do speech synthesizers work? Where are they used? What is the best speech synthesizer? The answers to these and other questions are presented in this article.

General concept

Speech synthesizers are special programs consisting of a number of modules that convert typed text into sentences spoken by a human voice. Do not assume that the entire database of words and phrases was recorded by real people in professional studios. Such a task is physically impossible. A library with that many phrases could not be installed on any modern computer, let alone a mobile phone. This is why developers created Text-to-Speech technology.

Scope of application

Speech synthesizers are used in learning foreign languages, listening to texts on the pages of books, creating vocal parts, issuing search queries in the form of spoken phrases, etc.

What types of programs are there? Depending on the scope of application, utilities can be divided into 2 types: regular ones that convert typed text into speech, and special vocal modules used in music applications.

Advantages and Disadvantages

At the moment, computers synthesize human speech only approximately. In the simplest programs, you can notice problems with the sound and with the correct placement of stress in various words. Speech synthesizers installed on mobile devices consume a lot of energy. Additional modules are also often downloaded without the user's permission.

The advantages include ease of perception. Many users find it much easier to assimilate audio information than any other kind.

The best speech synthesizers with Russian voices

The RHVoice program was created by Olga Yakovleva. The standard version of the application includes 3 voices. The settings are very simple. The program can be used both as a stand-alone application, compatible with SAPI5, and as an additional screen module.

The Acapela speech synthesizer differs from its analogues in its ideal text pronunciation. The application supports more than 30 languages. In the free version, only 1 female voice is available.

Vocalizer is often used in call centers. The user can adjust the emphasis, volume and reading speed. Additional dictionaries are loaded if necessary. There is 1 female voice in the application. The speech engine is automatically integrated into programs for reading books in electronic format.

The eSpeak utility supports over 50 languages. The disadvantage of the program is that it saves sound files only in WAV format, which requires a lot of space on your hard drive.

The Festival application is a powerful speech synthesis utility that even supports Finnish and Hindi.

Installing the program

How do you use this type of application? First you need to install the program. On desktop operating systems a standard installer is used, in which the user only has to select the language module supported by the utility. The installer for mobile devices can be downloaded from the official website, Google Play or the App Store. The application then installs automatically.

First launch of the program

At this stage, the user just needs to set the default language. Sometimes the sound quality also has to be selected. The standard version uses a sampling frequency of 44100 Hz, a depth of 16 bits and a bit rate of 128 kbps. On mobile operating systems, the figures may be lower. A specific voice is used as the basis.
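The relationship between these figures can be checked with simple arithmetic. The sketch below assumes the conventional 44100 Hz sampling rate and a single (mono) channel; the function names are illustrative.

```python
def pcm_bit_rate_kbps(sample_rate_hz, bit_depth, channels=1):
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


def file_size_mb(bit_rate_kbps, seconds):
    """Approximate audio size in megabytes (1 MB = 1,000,000 bytes)."""
    return bit_rate_kbps * 1000 * seconds / 8 / 1_000_000


raw_kbps = pcm_bit_rate_kbps(44100, 16)      # uncompressed mono PCM
raw_minute_mb = file_size_mb(raw_kbps, 60)   # one minute of uncompressed audio
mp3_minute_mb = file_size_mb(128, 60)        # one minute at 128 kbps
```

Uncompressed 16-bit mono audio at 44100 Hz needs about 706 kbps, so a bit rate of 128 kbps implies a compressed format. The same arithmetic shows why uncompressed WAV files take so much disk space: roughly 5.3 MB per minute against about 1 MB per minute at 128 kbps.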

Filters and equalizers help you achieve the desired sound. The user has three options for converting text. They can type sentences on the keyboard, play an existing file aloud, or install a browser extension that converts the content of web pages into speech. It is enough to select the required option, the timbre of the voice and the language in which the text will be spoken. To start playback, click the “Start” button.

Working with complex programs

In music applications, settings are much more complex. In the speech module of the FL Studio program, the user can select several types of voices, as well as specify the tone and playback speed. Stresses are placed before syllables using the “_” symbol. With the help of such a speech synthesizer, you can only create a robotic voice.

Vocaloid is a professional type application. In addition to the usual parameters, the user can select articulation and glissando. The utility has a database with professional vocals. If desired, you can adjust entire sentences to fit the notes. The library with vocals alone takes up more than 4 GB in compressed form.

"Google Speech Synthesizer": what is this program?

In May 2014, the company provided users with the opportunity to try out a new free product. What is Google Speech Synthesizer on Android? This is a program that reads text on the screen of a mobile device or tablet. Now there is no need to install third-party utilities that require a license. "Google Speech Synthesizer" is used when reading e-books, listening to the correct pronunciation of words, and launching the TalkBack application.

The new version of the Google Speech Synthesizer 3.1 program now supports English, Italian, Spanish, Korean, German, Dutch, Polish, Portuguese, Russian and French. Where can I find voice packs? They are downloaded from the application itself.

Advantages and disadvantages of the product from Google

The peculiarities of the Russian-speaking female voice are its clear, loud sound and smooth intonation. Playback speed can be adjusted in the program settings. Users using TalkBack and the Russian language localization of the Android OS should exercise caution when switching to the speech synthesizer if the application was previously set to a different voice by default. You may have trouble maintaining auditory control of your mobile device. Almost all voices, except Russian, are unable to process sentences in Cyrillic.

Among the disadvantages, one can note a delayed reaction to reading texts consisting of phrases in different languages. The Russian voice is distinguished by metallic notes of timbre. You may hear a rattling sound at low frequencies. The advantages include the stability of the application and acceptable quality of reading English words.

"Google Speech Synthesizer": how to use the program

For the utility to work properly, you need to update it to the latest version. To activate text reading, open the settings. In the “language and input” section, check the “speech synthesis” box. The “default system” line should also be selected. Do not forget that the voice packages in the program itself need to be updated as well.

Problems when working with the utility

If necessary, the user can disable the application. In the simplest utilities, the stop button is located in the program itself. Deactivating an extension installed in the browser is done by disabling the add-on or completely removing the plugin. Problems may also arise when using the program on a mobile phone. The fact is that the speech synthesizer automatically starts loading language modules that the user does not need.

This process takes a lot of time and uses a significant amount of traffic. How can you disable Google Speech Synthesizer on a mobile device and get rid of this problem? First, open the application settings. Then select the “language and voice input” section and choose the last line.

Having selected voice search, tap the cross next to the “offline speech recognition” item. It is then recommended to clear the application cache and reboot the phone. To disable the utility completely, open the “applications” section in the settings, select the speech synthesizer from the list and tap the “stop” button.

Uninstalling a program

Sometimes the user does not use Google Speech Synthesizer at all. Is it possible to remove the utility from a mobile device? To do this, open Google Play, select the speech synthesizer from the list of installed programs and tap the “delete” button.


For regular users and people with disabilities, applications with a simple interface are suitable. This can be either RHVoice or Google Speech Synthesizer. A Russian voice will read the text displayed on the screen. The average user does not need more.

Musicians are advised to choose the professional program Vocaloid. The application has additional voice libraries and many different options. The program makes it possible to get a natural-sounding voice. After all, it is important for musicians that the computer synthesis is not noticeable to the ear.

Online speech synthesizers are a useful find that previously could only be dreamed of. They allow you to voice any text you specify, adjusting the voice, timbre, tempo, etc. Initially, the utility was designed for people with poor eyesight who are unable to read text from the monitor. Nowadays it is often used as an auxiliary tool in learning foreign languages, allowing you to perceive speech by ear and get used to the correct placement of stress and intonation. Also, for convenience, you can use the synthesizer to listen to books while doing household chores.

On the Internet you can easily find a lot of such applications available for downloading on your PC. However, in order not to fill up your computer’s memory again and not to jeopardize the security of its operation, it is better to use online services. We will tell you about the three most convenient and multifunctional.

Acapela – the most famous online speech synthesizer

The Acapela website provides a huge selection of languages and voices for voicing text. This is especially true for English, which can be heard in twenty different variants: female, male, child, elderly, cheerful and so on.

It’s convenient that all parameters can be configured immediately on the main page

Unfortunately, things are worse with Russian texts - they are voiced only in one voice - a certain Alena. But nevertheless, the result is quite worthy.

The settings here are very simple - just select the language and voice, enter the desired text, then agree to the terms of use of the resource and click the “Listen!” button.

The interface is designed in English, but even without translation it is quite clear what and how to click

The limit for audio playback is 300 characters. This is the main disadvantage of most online speech synthesizers, so if you need to voice a large file, this option is clearly not suitable. To use voiceover without restrictions, you are offered the full version of the program for purchase. It is available for all operating systems on PC and phone.
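One way to live with such a limit is to split a long text into pieces that each fit and submit them one at a time. The sketch below is only an illustration of that idea, not a feature of any of the services described; the function name and the sentence-splitting rule are assumptions.

```python
import re


def split_for_limit(text, limit=300):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = (current + " " + sentence).strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries matters because intonation depends on punctuation, so a chunk that ends mid-sentence would be read with the wrong melody.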

Google Translator: fast, easy, accessible

Speaking about playing text files, we cannot fail to mention the famous Google Translate. As the name suggests, this service is designed for translating texts. In addition, here you can also listen to files - this is done literally in one click.

Everything is designed in Russian, so it’s very easy to understand the interface

To listen to the file, you need to paste your text into the appropriate window and click on the megaphone icon in the lower left corner. It’s convenient that you can do this with both the original and the translation. Note that the limit here is much larger than in Acapela - 5000 characters. There are no extensions or paid versions provided.

Since this program was created for other purposes, the functionality here leaves much to be desired. Timbre, reading speed and other important parameters are not adjustable in any way. The voice acting is unnatural, with distinct “metallic” notes. Intonation, pauses, semantic stresses - all this is done unprofessionally, so in every sentence you feel like the words are unevenly “glued” together.

This application is convenient to use, for example, if you want to understand how the text you write is perceived by ear. For this, intonation and timbre are not particularly important, because the wording itself, the presence of tautologies and dissonant statements are interesting.

Among the advantages, one can note only a huge selection of languages, which, in fact, is quite logical for an online translator

Fromtexttospeech service for voice playback of your text

The last application we want to talk about is Fromtexttospeech. Let's start with the fact that its limit on the number of characters is the most generous of all: up to 50,000. This is a serious competitive advantage, but let's see whether Fromtexttospeech has any other obvious strengths.

The algorithm of the program is approximately the same as that of Acapela:

  • configure online speech synthesizer parameters: language, timbre and speed;
  • click “Create Audio File”;
  • download or simply listen to the finished file.

So, let's try. Copy a few sentences of your article and paste them into the input field. Just below the working panel, the number of characters that can still be added is displayed.

It’s very convenient that you can choose the reading speed: slow, medium, fast and very fast

There’s nothing else to configure here, so let’s move on to the actual conversion to audio procedure. This process takes several minutes (depending on the file size), after which you can evaluate the result of the work in a separate window.

The ability to save the resulting audio file to your computer is a very convenient feature that distinguishes this service from many others.

To summarize, all the services we reviewed are quite different and have their own characteristics. If you are interested in professional voice acting, Acapela is well suited to that purpose. On the official website of the program you can test it and evaluate its sound and functionality before deciding whether to purchase the full version. If quality is not too important to you, choose the good old Google Translator or Fromtexttospeech, which can convert large texts into audio.

You can listen to how fragments of the same text sound when performed by different voice engines in our video.

Today there is a technology that can convert text information into ordinary speech. With the development of “smart machines”, this technology is becoming more and more relevant, and every day it requires more and more perfection. Actually, at the moment, a number of speech synthesis methods have been developed, which we will talk about.

Speech synthesizers are used in very different areas and solve many problems, from “reading” books aloud, producing “talking” children's toys and announcing stops on public transport or in service systems, to medicine (here it is worth remembering Stephen Hawking, who used a speech synthesizer to communicate with the world).

So, let's take a closer look at the technology and methods of speech synthesis. As already mentioned, there are several speech synthesis methods. Thus, several main approaches can be distinguished:

  • parametric synthesis;
  • concatenative (compilation) synthesis;
  • synthesis according to rules (based on printed text);

Parametric synthesis allows you to record speech for any language, but it cannot be used for texts that have not been specified in advance. Parametric speech synthesis is used when the set of messages is limited. The quality of this synthesis method can be very high.

Essentially, parametric speech synthesis is an implementation of the operating principle of a vocoder. In parametric synthesis, the audio signal is represented by a certain number of continuously changing parameters. A tone generator is used to produce vowel sounds, and a noise generator is used for consonants. In practice, however, this method is usually used to record voices in musical compositions, and more often it is a matter of modulation rather than pure voice synthesis.
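The two generators can be sketched as follows. This is a deliberately minimal example under stated assumptions: the sample rate, the amplitudes and the frame format (a voiced flag, a pitch and a duration) are invented for illustration, and a real vocoder would also shape the spectrum of each frame.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz; an assumed value for this sketch


def tone(freq_hz, duration_s, amplitude=0.5):
    """Periodic excitation for voiced sounds: a plain sine tone."""
    n = int(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def noise(duration_s, amplitude=0.2, seed=0):
    """Aperiodic excitation for unvoiced sounds: uniform white noise."""
    rng = random.Random(seed)
    n = int(SAMPLE_RATE * duration_s)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n)]


def synthesize(frames):
    """Build a signal from (voiced, pitch_hz, duration_s) parameter frames."""
    signal = []
    for voiced, pitch_hz, duration_s in frames:
        signal += tone(pitch_hz, duration_s) if voiced else noise(duration_s)
    return signal
```

The point of the sketch is that the signal is described entirely by a short list of parameters per frame rather than by stored recordings.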

The compilation synthesis method is based on assembling messages from a pre-recorded “dictionary” of elements. An element must be at least a word in size. Typically, the stock of elements is limited to several hundred words, and the content of synthesized messages is limited by the size of the dictionary. This method is widely used in everyday life, as a rule in various help services and in equipment fitted with voice response systems.
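In its simplest form, compilation from a word dictionary amounts to concatenating stored recordings. The sketch below assumes a hypothetical `recordings` mapping from words to lists of audio samples; it also shows the limitation described above, since any word outside the dictionary cannot be spoken.

```python
def compile_message(text, recordings):
    """Concatenate pre-recorded word waveforms; fail on out-of-vocabulary words."""
    samples = []
    for word in text.lower().split():
        if word not in recordings:
            raise KeyError(f"no recording for word: {word!r}")
        samples.extend(recordings[word])
    return samples
```

For example, a transport announcement system would store recordings for a fixed set of words such as "next", "stop" and the stop names, and could say nothing else.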

Full speech synthesis according to the rules can reproduce speech from a previously unknown text. This method does not use elements of human speech, but is based on programmed linguistic and acoustic algorithms.

Two approaches to this method can be distinguished. The first is formant synthesis by rules, and the second is articulatory synthesis. Formant synthesis is based on formants, the resonant frequencies of the vocal tract. The formant synthesis algorithm models the human vocal tract, which works as a set of resonators. Today, unfortunately, most synthesizers that rely exclusively on formant synthesis are difficult to understand without practice, but this is undoubtedly a universal and promising technology. The articulatory method tries to overcome the shortcomings of the formant method by adding phonetic features of the pronunciation of individual sounds to the model.
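The idea of a vocal tract modelled as a set of resonators can be sketched with second-order digital resonators connected in cascade and driven by an impulse train. This is a minimal illustration, not a complete formant synthesizer: the sample rate, the pitch and the formant frequencies and bandwidths in the usage note are assumed values.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate=16000):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at zero frequency to 1
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def vowel(formants, pitch_hz=120, duration_s=0.2, sample_rate=16000):
    """Pass an impulse train through one resonator per (frequency, bandwidth) pair."""
    n = int(sample_rate * duration_s)
    period = int(sample_rate / pitch_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal
```

A call such as `vowel([(700, 130), (1220, 70), (2600, 160)])` produces a buzzy, vowel-like sound; the three pairs are rough illustrative values, not measured data. Changing the formant frequencies over time according to rules is what turns such a filter bank into speech.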

There is also a technology for speech synthesis according to rules, which uses recorded segments of natural speech. Since compilation methods are still most often used, let’s say a few words about them in more detail.

Depending on how large the “excerpts” of speech used for synthesis are, the following types of synthesis are distinguished:

  • microsegment (micro-wave);
  • allophonic;
  • diphonic;
  • semisyllabic;
  • syllabic;
  • synthesis from units of arbitrary size.

The most commonly used are the allophone and diphone methods. For the diphone method, the basic elements are all possible two-phoneme combinations; for the allophone method, they are phonemes taken together with their left and right contexts (an allophone is a variant of a phoneme that is determined by its specific phonetic environment). The various types of contexts are grouped into classes according to their acoustic similarity.
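The difference between the two kinds of units can be shown on a phoneme sequence. In this sketch the phoneme symbols are plain strings and "_" stands for silence at the edges; both conventions are assumptions made for the example.

```python
def to_diphones(phonemes):
    """Split a phoneme sequence into overlapping two-phoneme units."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [padded[i] + "-" + padded[i + 1] for i in range(len(padded) - 1)]


def to_allophones(phonemes):
    """Describe each phoneme by its (left context, phoneme, right context)."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]
```

For the sequence k, a, t the diphone units are _-k, k-a, a-t and t-_, while the allophone description keeps one unit per phoneme and records its neighbours. Grouping similar contexts into classes keeps the allophone inventory to a manageable size.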

The advantage of such systems is that they can synthesize speech from text that is not specified in advance. The disadvantage is that the quality of the synthesized speech does not match the quality of natural speech, since distortions may occur at the boundaries where elements are joined. It is also very difficult to control the intonation of speech, since the characteristics of individual words can change depending on the context or type of phrase.

However, this is all in theory. In practice, at the current stage of development and despite active progress in this area, developers of speech synthesis technology still face difficulties, mainly associated with the artificial sound of synthesized speech, its lack of emotional coloring and its low noise immunity.

The fact is that any synthesized speech, as a rule, is difficult for a person to perceive. This is due to the fact that the gaps in the synthesized text are filled by the human brain, which uses additional resources for this, and a person can normally perceive synthesized speech for only about 20 minutes.

The perception of speech is also influenced by its emotional coloring. In the case of synthesized speech, it is absent. Although it is worth noting that some algorithms still make it possible to imitate the emotional coloring of speech to some extent by changing the duration of phonemes, pauses and timbre modulation, but so far their work is far from ideal.

As for the third named problem - low noise immunity, experiments show that the perception of synthesized text is interfered with by any, even the smallest, extraneous noise. This is again due to the fact that to process synthesized speech, the human brain uses additional centers, which are not used in the perception of natural speech.

At the end of this article I would like to give some examples of existing speech synthesizers.

Everyone knows the so-called “readers” - programs for more convenient reading of text from the monitor. Many of us use speech synthesis programs to voice text, for example, Balabolka and Govorilka.

For such programs to read texts, you must also install the SAPI (Speech API) library and voice engines. The two most common versions of the Speech API are SAPI4 and SAPI5. Both libraries can run on the same computer. The Windows XP, Windows Vista and Windows 7 operating systems already have the SAPI5 libraries installed.

In addition to e-readers, screen access programs are common. Examples of such programs are:

VIRGO 4. The program was created for comfortable work of blind and visually impaired users with Windows. It allows you to select the information that will be spoken by voice and the information that will be shown on the Braille display. For visually impaired users, the Galileo screen magnification system is provided.

Cobra 9.1 also makes working with Windows easier for blind and visually impaired users. This program can display information from a computer monitor using speech, a braille display, and has a screen magnification function.