This PhD thesis presents the development of a prosody system for European Portuguese (EP) for text-to-speech (TTS) applications. TTS systems automatically convert text into speech and consist of a sequence of modules. These modules implement the pre-processing of the input text, the phonetic transcription, and the supra-segmental processing, which imposes the prosodic patterns. Prosody conveys the communicative intention and contributes to the naturalness of the synthesized speech. The prosodic features are timing, characterized by the segmental durations and pauses; intonation, characterized by the fundamental frequency (F0) curve; and intensity, characterized by the intensity curve.
The thesis begins with the preparatory work that was fundamental for modelling and testing. It starts with a preliminary study of the stressed syllable, which identifies the range of variation of F0, duration and intensity in stressed syllables across contexts. The FEUP-IPB EP speech database used in the subsequent studies is then presented. The database is labelled at the phoneme, word, sentence and F0 levels. Next, the thesis presents two algorithms for the syllabic splitting of the text and of the phoneme sequences. The chapter ends with a proposed set of rules for the automatic phonetic transcription of the most problematic graphemes in EP.
The proposed prosody model consists of several sub-models, namely, the duration model to predict the segmental durations and the model to predict the F0 pattern.
Two proposals for predicting the segmental durations, both based on artificial neural networks (ANNs), are presented.
The first proposal consists of a single ANN whose architecture, type and set of input features were carefully selected to minimize the error between predicted and measured durations. The second proposal, called the alternative model, follows the same considerations as the first but uses one dedicated ANN for each phoneme, for a total of 44 ANNs. The alternative model, with its dedicated ANNs, improved the final performance.
A model for inserting pauses and predicting their durations is proposed, based on a preliminary study of the FEUP-IPB database.
The proposed model for predicting the F0 contour is based on the Fujisaki model and consists of two sub-models. One predicts the parameters of the Phrase Commands (PCs) and the other predicts the parameters of the Accent Commands (ACs).
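For reference, the Fujisaki model is usually written in the following form. This is the standard formulation from the general literature, not an equation reproduced from the thesis. The baseline frequency F_b, the constants \alpha, \beta and \gamma, and the absolute command times T_0, T_1 and T_2 are notation assumed here for illustration.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0,i})
  + \sum_{j=1}^{J} A_{a,j} \left[ G_a(t - T_{1,j}) - G_a(t - T_{2,j}) \right]

G_p(t) =
\begin{cases}
  \alpha^{2}\, t\, e^{-\alpha t}, & t \ge 0 \\
  0, & t < 0
\end{cases}
\qquad
G_a(t) =
\begin{cases}
  \min\!\left[ 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \right], & t \ge 0 \\
  0, & t < 0
\end{cases}
```

Each PC is an impulse of amplitude A_p at time T_0 that shapes the slowly decaying phrase component, and each AC is a pulse of amplitude A_a between T_1 and T_2 that shapes a local accent component. Both components are added to the baseline in the logarithmic domain.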
The PCs and the ACs were manually estimated for 101 paragraphs of the database by minimizing the error between the estimated and measured F0 contours.
The prediction of the PCs is performed in two stages. In the first stage, an algorithm inserts the PCs at positions linked to the text, based on a mathematical model obtained from experimental observations. In the second stage, the amplitude (Ap) and the anticipation (T0a) of each PC relative to its initial position are predicted. The anticipation determines the exact position of the PC in the speech signal. Both parameters are predicted with ANNs.
A strong connection between ACs and syllables was found in the database, which justified predicting the ACs on a per-syllable basis. The AC model therefore consists of one ANN that predicts whether an AC is associated with a given syllable and three further ANNs that predict the AC parameters: the amplitude (Aa) and the anticipations of the onset (T1a) and offset (T2a) instants.
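As an illustration of how predicted PC and AC parameters could be turned into an F0 contour, the following is a minimal Python sketch of the standard Fujisaki superposition. It is not the implementation used in the thesis. The function names, the default constants alpha, beta and gamma, and the example command values are assumptions chosen for illustration. The sketch takes absolute command times (T0, T1, T2) in seconds, so the anticipations T0a, T1a and T2a would first have to be converted to absolute positions.

```python
import math


def phrase_response(t, alpha=2.0):
    """Phrase control response Gp(t): alpha^2 * t * exp(-alpha * t) for t >= 0."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Accent control response Ga(t), capped at the ceiling gamma, for t >= 0."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds,
                alpha=2.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    fb          -- baseline frequency in Hz (assumed known)
    phrase_cmds -- list of (Ap, T0) tuples
    accent_cmds -- list of (Aa, T1, T2) tuples
    alpha, beta, gamma are illustrative defaults, not values from the thesis.
    """
    log_f0 = math.log(fb)
    for ap, t0 in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for aa, t1, t2 in accent_cmds:
        onset = accent_response(t - t1, beta, gamma)
        offset = accent_response(t - t2, beta, gamma)
        log_f0 += aa * (onset - offset)
    return math.exp(log_f0)


# Hypothetical commands: one PC at 0.0 s and one AC from 0.3 s to 0.5 s.
example_pcs = [(0.5, 0.0)]
example_acs = [(0.4, 0.3, 0.5)]
contour = [fujisaki_f0(0.01 * n, 80.0, example_pcs, example_acs)
           for n in range(100)]
```

The contour is sampled every 10 ms over one second. Before the first command the function returns the baseline, and each command raises F0 above it for a limited time.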
The final perceptual test, which used the category-judgment method and the mean opinion score (MOS) scale, yielded scores of 4.6 for natural speech, 4.4 for the estimated F0, 4.2 for the predicted durations, 3.1 for the predicted F0, and 2.9 for the complete proposed model (duration and F0 models combined). The MOS for the complete model corresponds to the ‘Fair’ level.