
2. Processing Spoken Language



Our goal is to learn to process spontaneously spoken language at the syntactic and semantic level in a fault-tolerant manner. In this section we give motivating examples of spoken language.

2.1 ``Noise'' in Spoken Language

Our domain in this paper is the arrangement of meetings between business partners, and we currently use 184 spoken dialog turns with 314 utterances from this domain. One turn consists of one or more consecutive utterances by the same speaker. For these 314 utterances, the underlying speech recognizer can generate thousands of utterance hypotheses that have to be processed. German utterance examples from this domain are shown below together with their literal, word-for-word English translations.

  1. Käse ich meine natürlich März
    (Rubbish I mean of_course March)
  2. Der vierzehnte ist ein Mittwoch richtig
    (The fourteenth is a Wednesday right)
  3. Ähm am sechsten April bin ich leider außer Hause
    (Eh on sixth April am I unfortunately out_of home)
  4. Also ich dachte noch in der nächsten Woche auf jeden Fall noch im April
    (So I thought still in the next week in any case still in April)
  5. Gut prima vielen Dank dann ist das ja kein Problem
    (Good great many thanks then is this yeah no problem)
  6. Oh das ist schlecht da habe ich um vierzehn Uhr dreißig einen Termin beim Zahnarzt
    (Oh that is bad there have I at fourteen o'clock thirty a date at dentist)
  7. Ja genau allerdings habe ich da von neun bis vier Uhr schon einen Arzttermin
    (Yes exactly however have I there from nine to four o'clock already a doctor-appointment)






As we can see, spoken language contains many performance phenomena, among them exclamations (``rubbish'', see Example 1), interjections (``eh'', ``so'', ``oh'', see Examples 3, 4 and 6), and new starts (``there have I ...'', see Example 6). Furthermore, the syntactic and semantic constraints in spoken language are less strict than in written text. For instance, the word order in spontaneously spoken language often differs considerably from that of written language. Therefore, spoken language is ``noisier'' than written language even for these transcribed sentences, and well-known parsing strategies from text processing, which can rely more on well-formedness criteria, are not directly applicable to analyzing spoken language.

2.2 ``Noise'' from a Speech Recognizer

If we want to analyze spoken language in a computational model, there is not only the ``noise'' introduced by humans while speaking but also the ``noise'' introduced by the limitations of speech recognizers. Based on a given speech signal, a typical speech recognizer produces many separate word hypotheses with different plausibilities over time. Such word hypotheses can be connected into word hypothesis sequences, which have to be evaluated to provide a basis for further analysis. Typically, a word hypothesis consists of four parts: 1) the start time in seconds, 2) the end time in seconds, 3) the word string of the hypothesis, and 4) a plausibility of the hypothesis based on the confidence of the speech recognizer. Below we show a simple word graph. In practice, word graphs for spontaneous speech can be much larger, leading to a very large number of word hypothesis sequences. However, to illustrate the properties of the speech input, we focus on this relatively short and simple word graph (Figure 4).

[Word graph figure]



Figure 4: Simple word graph for a spoken utterance: ``ähm am sechsten April bin ich leider außer Hause'' (``eh on sixth April am I unfortunately out of home''). Each node represents a word hypothesis; each arrow points to a possible subsequent word hypothesis. Each word hypothesis is shown with its word string, its start and end time, and its acoustic plausibility.
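
To make the four-part structure of a word hypothesis concrete, the following Python sketch shows one possible representation. It is purely illustrative and not taken from the system described here; the class and field names are our own, and all time stamps and plausibility values except the boundary times 0.43 and 0.44 quoted below are invented.

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class WordHypothesis:
      start: float         # 1) start time in seconds
      end: float           # 2) end time in seconds
      word: str            # 3) word string of the hypothesis
      plausibility: float  # 4) confidence assigned by the speech recognizer

  # Two hypotheses from the example utterance; only the times 0.43 and
  # 0.44 are given in the text, the other values are assumed.
  am = WordHypothesis(start=0.21, end=0.43, word="am", plausibility=0.82)
  sechsten = WordHypothesis(start=0.44, end=0.95, word="sechsten", plausibility=0.74)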

These word hypotheses can overlap in time and constitute a directed graph called a word graph. Each node in this word graph represents one word hypothesis. Two hypotheses can be connected if the end time of the first word hypothesis lies directly before the start time of the second. For instance, the word hypothesis for ``am'' (``on'') ending at 0.43 and the hypothesis ``sechsten'' (``sixth'') starting at 0.44 can be connected into a word hypothesis sequence.
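
The connection criterion can be stated as a small predicate over such hypothesis objects. The sketch below continues the WordHypothesis class from above and is again only an assumed realization; in particular, the maximal allowed gap between two hypotheses is an invented tolerance, not a parameter of the recognizer.

  def can_connect(first, second, max_gap=0.02):
      # "first" may precede "second" if its end time lies directly
      # before the start time of "second"; the gap of 0.02 s is an
      # assumed tolerance.
      return 0.0 < second.start - first.end <= max_gap

  def build_word_graph(hypotheses):
      # Directed graph as an adjacency list: each word hypothesis is
      # mapped to the hypotheses that may directly follow it.
      return {h: [s for s in hypotheses if can_connect(h, s)]
              for h in hypotheses}

For the two hypotheses of the previous sketch, can_connect(am, sechsten) holds, since the end time 0.43 lies directly before the start time 0.44.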

[Word hypothesis sequence figure]

Figure 5: Two examples of word hypothesis sequences in a word graph

Our example word graph is very simple. However, as shown in Figure 5, a possible word hypothesis sequence is not only the desired ``Ähm am sechsten April bin ich leider außer Hause'' (``Eh on sixth April am I unfortunately out_of home''), but also the sequence ``Ähm ich am sechsten April wenn ich ich leider außer Hause'' (``Eh I on sixth April if I I unfortunately out_of home''). Consequently, we have to deal with incorrectly recognized words and with unexpected word order. Therefore, syntactic and semantic analysis has to be very fault-tolerant in order to process such noisy word hypothesis sequences.
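
The competing sequences of Figure 5 correspond to different paths through the word graph. The following function, again only an illustrative sketch building on the adjacency list above, enumerates all such paths by depth-first search; a realistic system would prune the graph with acoustic and language-model scores rather than enumerate it exhaustively.

  def all_sequences(graph, start_hypotheses):
      # Enumerate every word hypothesis sequence (path) through the
      # word graph by depth-first search; a hypothesis without
      # successors ends a sequence.
      sequences = []

      def extend(node, path):
          successors = graph.get(node, [])
          if not successors:
              sequences.append(path)
              return
          for nxt in successors:
              extend(nxt, path + [nxt])

      for start in start_hypotheses:
          extend(start, [start])
      return sequences

Each enumerated sequence can then be ranked, for instance by the mean plausibility of its word hypotheses, before it is passed on to syntactic and semantic analysis.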




