ABSTRACT

ISSN: 1139-8736
Depósito Legal: B-8714-2001

This work analyzes the problems that arise in automatic speech understanding, from both scientific and technical points of view, and concludes with the design, implementation and evaluation of a speech understanding system for Castilian Spanish.

Some of the alternatives proposed by international research groups to solve the speech understanding problem have been reviewed.

A novel non-integrated architecture for speech understanding in Spanish has been defined, taking into account specific characteristics of Spanish as a natural language that are absent or rare in other languages. This architecture is intended to be the baseline for future work on this topic in the Speech Technology Group at the Universidad Politécnica de Madrid.

To achieve the objective of understanding speech in limited semantic domains (limited by the concepts used in the domain of a specific application), this architecture has been designed with the following main characteristics:

Robustness, that is, the ability to process sentences containing errors (word insertions, deletions or substitutions) produced by the acoustic decoding module (a continuous speech recognition system), non-grammatical constructions inherent to spoken language, or gaps in lexical, syntactic or semantic coverage.
Modularity, which allows the system to be improved without redesigning or reimplementing it as a whole.
Flexibility, in order to have an application-independent architecture, within certain restrictions imposed by the characteristics of both automatic information systems and control systems.
Power, defined as the ability to process sentences with a certain degree of linguistic complexity.

The modules of this architecture incorporate linguistic knowledge of different kinds, which has allowed us to study how different linguistic knowledge sources interact and how to integrate them efficiently in the understanding process.

Several kinds of linguistic knowledge have been used. Semantic features represent semantic information, complementing that already modelled by the semantic categories of the dictionary. Simplified contextual grammars (for which specific rule languages and rule analysis or execution algorithms have been defined) mainly resolve some of the semantic ambiguity and ellipsis problems. A semantic context-free grammar, parsed with the Earley algorithm, which can process ambiguous sentences, obtains the structural information of the sentence using a taxonomy of the domain concepts that greatly reduces the number of rules needed. Moreover, this grammar helps in processing complex sentences while keeping the translation into SQL simple, by means of semantic templates. This translation is needed in information systems that access databases.
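As an illustration of how the Earley algorithm mentioned above copes with ambiguous input, the following is a minimal recognizer sketch. It is not the parser implemented in this work: the grammar, the symbol names (Query, Object, ASK, NAME, PREP) and the function earley_recognize are hypothetical, the input is assumed to be a sequence of semantic category labels, and the grammar is assumed to contain no empty productions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy grammar over semantic categories (not taken from the thesis).
# Keys are nonterminals; any symbol that is not a key is treated as a terminal.
Grammar = Dict[str, List[Tuple[str, ...]]]
State = Tuple[str, Tuple[str, ...], int, int]  # (lhs, rhs, dot, origin)

GRAMMAR: Grammar = {
    "Query": [("ASK", "Object")],
    # Left-recursive and ambiguous: "NAME PREP NAME PREP NAME" has two parses.
    "Object": [("Object", "PREP", "Object"), ("NAME",)],
}


def earley_recognize(grammar: Grammar, start: str, tokens: List[str]) -> bool:
    """Return True if `tokens` can be derived from `start`.

    Assumes the grammar has no empty productions, so every completed
    state refers back to an earlier, already finished chart column.
    """
    chart: List[Set[State]] = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predictor: expand the expected nonterminal at position i.
                    for production in grammar[symbol]:
                        new = (symbol, production, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scanner: the expected terminal matches the next token.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Completer: advance every state that was waiting for `lhs`.
                for p_lhs, p_rhs, p_dot, p_origin in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        new = (p_lhs, p_rhs, p_dot + 1, p_origin)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(tokens)]
    )
```

Calling earley_recognize(GRAMMAR, "Query", ["ASK", "NAME", "PREP", "NAME", "PREP", "NAME"]) accepts the sequence even though it has two possible structures, because the chart stores all partial analyses in parallel instead of committing to one. A full parser would additionally keep back-pointers in each state to recover the alternative trees.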

In order to evaluate the behaviour of the acoustic decoder module, a modular continuous speech recognition system has been implemented. It can integrate grammatical knowledge based on any stochastic morpho-syntactic or semantic N-gram. To keep this module efficient even when grammar information is used, a search space reduction mechanism (beam search) has been studied in depth. A new method developed in this Thesis makes it possible to analyse and determine in advance, from training data, a pruning threshold based on the probability (or distance) of the best state in the search space for every frame (stochastic), knowing the impact this threshold will have on the recognition process. In addition, two well-known variants have been evaluated: the use of one or two pruning thresholds, one based on the probability (distance) of the best last state of every model in every frame (stochastic parameter of the last state) and the other based on the probability of the best of the remaining states in every frame (stochastic parameter of the rest of the states). New conclusions have been drawn from this study, which has given us a deeper understanding of this well-known but not so well understood technique.

Moreover, the acoustic decoder has been modified to generate several output hypotheses (N-best sentences), and the relationship between the value of N (the number of paths or hypotheses) and the performance of the speech recognition system (improvement in word error rate) has been studied for applications such as the one addressed in this Thesis. We have verified that with a small number of hypotheses (very low N) the acoustic module is able to recover from many errors that would otherwise severely affect the understanding of the recognised spoken sentence.
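To make the beam-search pruning discussed above concrete, the following is a minimal sketch of Viterbi decoding with a single per-frame threshold relative to the best active state. It is a generic illustration only, not the threshold-estimation method developed in this Thesis: the function names (prune, viterbi_beam), the dictionary-based model representation and the use of log scores are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

Scores = Dict[str, float]


def prune(scores: Scores, beam: float) -> Scores:
    """Keep only states whose log score is within `beam` of the best one."""
    if not scores:
        return {}
    best = max(scores.values())
    return {state: s for state, s in scores.items() if s >= best - beam}


def viterbi_beam(
    log_init: Scores,
    log_trans: Dict[str, Scores],
    log_emit: List[Scores],
    beam: float,
) -> Tuple[float, List[int]]:
    """Return the best final log score and the number of active states per frame.

    log_init[s]     : initial log score of state s (hypothetical model format)
    log_trans[s][t] : log score of the transition s -> t
    log_emit[f][s]  : log score of observing frame f in state s
    beam            : pruning threshold, in log units below the best state
    """
    if not log_emit:
        raise ValueError("at least one frame is required")

    active = {
        s: lp + log_emit[0].get(s, -math.inf) for s, lp in log_init.items()
    }
    active = prune(active, beam)
    active_counts = [len(active)]

    for frame in log_emit[1:]:
        expanded: Scores = {}
        # Only states that survived pruning are expanded, which is where
        # the search space reduction comes from.
        for s, score in active.items():
            for t, lt in log_trans.get(s, {}).items():
                candidate = score + lt + frame.get(t, -math.inf)
                if candidate > expanded.get(t, -math.inf):
                    expanded[t] = candidate
        active = prune(expanded, beam)
        active_counts.append(len(active))

    best_final = max(active.values()) if active else -math.inf
    return best_final, active_counts
```

Running the same utterance with a wide and a narrow beam and comparing the returned counts of active states shows the trade-off involved: a narrower beam expands fewer states per frame but risks discarding the path that would eventually have been the best one.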
