Idea: Gábor Olaszy (BME TMIT, Hungary) 2013.

Development: Kálmán Abari (Debrecen University, Hungary), Tamás Gábor Csapó, Bálint Pál Tóth and Gábor Olaszy (BME TMIT, Hungary), 2013-2015.

TTF model: This is a new model for text-to-formant conversion. It can predict the phonetically correct, ordered formant pattern flow of any virtual speech signal, i.e. one that has never been uttered. The input of the model is text; the output is the characteristic formant pattern flow of the sentence for F1 and F2. The model consists of two main parts: 1) a carefully prepared multi-speaker parallel speech database with manually corrected sound boundaries and formant values; 2) an HMM-based predictor of formant trajectories from text. The focus is on the formant trajectories across sound combinations. These trajectories are treated as belonging not to an individual speaker but to the language in general: the trajectories over a given sentence (the sentence pattern) are assumed to be essentially the same regardless of the speaker, because all native speakers of Hungarian articulate similarly when speaking Hungarian.

Formant database: 5 female and 5 male adult native speakers of Hungarian, living in Budapest and having different professions (teacher, actor, administrator, researcher, engineer, etc.). Formant frequencies for F1, F2 and F3 (approx. 7 million values altogether) were measured with Praat, then visually checked and adjusted for 10 x 1900 sentences.

HMM-based formant trajectory predictor: The HMMs were trained with the HTS toolkit, using the F1 and F2 data of the multi-speaker parallel speech database. Two general models were finally computed, one from the 5 male voices and one from the 5 female voices. These two HMM databases are used when the model is running. The input is the text and the gender; the output is the predicted formant trajectory for F1 and F2 as a function of time for the given sentence.

Evaluation: In the evaluation, the predicted formant patterns were compared with natural ones. The similarity between the predicted and the natural formant patterns is expressed by a new measure, the Trajectory Matching Rate (TMR). TMR values are based on the correlation coefficient: the more similar the predicted formant pattern of a sentence is to that of the same sentence spoken naturally, the closer its TMR value is to +1. Every sentence was characterised by one TMR value for F1 and another for F2. The evaluation was done for 800 male and 800 female sentences. In both cases the average TMR was higher than 0.8. This value implies that the TTF converter gives a very good prediction of the F1 and F2 formant trajectories from text input.
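
To make the idea of a correlation-based similarity score concrete, here is a minimal sketch. It assumes that a TMR-like score is simply the Pearson correlation coefficient between a predicted and a natural trajectory that have already been time-aligned and sampled at the same points. The exact TMR definition is given in the paper and may differ; the function name `trajectory_matching_rate` and the sample values are hypothetical.

```python
import math


def trajectory_matching_rate(predicted, natural):
    """Pearson correlation between two equally long formant trajectories.

    Assumption: both lists hold formant values in Hz sampled at the same,
    already time-aligned positions. This is an illustrative stand-in for
    TMR, not the authors' exact formula.
    """
    if len(predicted) != len(natural) or len(predicted) < 2:
        raise ValueError("trajectories must have equal length >= 2")

    mean_p = sum(predicted) / len(predicted)
    mean_n = sum(natural) / len(natural)

    # Covariance and variances (unnormalised; the factors cancel out).
    cov = sum((p - mean_p) * (n - mean_n) for p, n in zip(predicted, natural))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_n = sum((n - mean_n) ** 2 for n in natural)

    if var_p == 0 or var_n == 0:
        raise ValueError("correlation is undefined for a constant trajectory")

    return cov / math.sqrt(var_p * var_n)


# Hypothetical F1 values in Hz, for illustration only.
predicted_f1 = [300, 450, 600, 520, 380]
natural_f1 = [320, 470, 640, 500, 400]
score = trajectory_matching_rate(predicted_f1, natural_f1)
print(f"TMR-like score for F1: {score:.3f}")
```

Because the correlation coefficient ignores offset and scale, a score close to +1 means that the two trajectories have the same shape, even if one speaker's formants are consistently higher or lower than the model's.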

Visual presentation: This page presents sample sentences for both male and female speakers, showing the predicted TTF sentence patterns together with the individual formant data of the 5 male and 5 female speakers for the same sentence. This shows how similar the shape of the formant pattern given by the TTF model is to the formant pattern of the natural sentence. The TTF models are labelled male=5sp.m and female=5sp.f. The pictures also show the TMR values, which express the similarity of the TTF model to the individual formant patterns in the given sentence.

Live demo: A live text-to-formant (TTF) conversion demo is available, which converts a given Hungarian sentence into F1 and F2 formant trajectories (male, female). Please follow Steps 1, 2 and 3. The results are available in numerical (CSV) and visual form.

Information regarding the results:

    The phone durations have identical lengths. They do not represent linguistic durations.
  • label: SAMPA symbol
  • num: the serial number of the phoneme in the sentence, beginning with 2 (number 1 belongs to the silence at the beginning)
  • time: synthesized durations in seconds (correct linguistic durations)
  • pos.: position within the phoneme, in percent
  • F1: first predicted formant value in Hz
  • F2: second predicted formant value in Hz
    * pauses are not displayed in the table
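
As an illustration of how the numerical output might be processed, the sketch below reads a CSV with the fields listed above and averages F1 and F2 per phoneme. The comma delimiter, the exact header names and the sample rows are assumptions made for this example; the file produced by the demo may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed layout; the values are illustrative only.
SAMPLE_CSV = """label,num,time,pos.,F1,F2
a:,2,0.050,0,650,1300
a:,2,0.075,50,700,1350
a:,2,0.100,100,680,1400
l,3,0.125,0,400,1500
l,3,0.150,100,380,1600
"""


def mean_formants_per_phoneme(csv_text):
    """Return a list of (num, label, mean F1, mean F2), ordered by num."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # num -> [sum F1, sum F2, count]
    labels = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        num = int(row["num"])
        labels[num] = row["label"]
        entry = sums[num]
        entry[0] += float(row["F1"])
        entry[1] += float(row["F2"])
        entry[2] += 1
    return [
        (num, labels[num], sums[num][0] / sums[num][2], sums[num][1] / sums[num][2])
        for num in sorted(sums)
    ]


for num, label, f1, f2 in mean_formants_per_phoneme(SAMPLE_CSV):
    print(f"{num:>3} {label:<3} F1={f1:7.1f} Hz  F2={f2:7.1f} Hz")
```

Grouping by num rather than by label keeps repeated occurrences of the same sound in a sentence separate.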

Paper about the research: Kálmán Abari, Tamás Gábor Csapó, Bálint Pál Tóth, Gábor Olaszy: From text to formants - indirect model for trajectory prediction based on a multi-speaker parallel speech database. Proc. of Interspeech 2015, Dresden, Germany, pp. 623-627.