
5. Detailed Analysis with Examples

In this section we take a detailed look at how SCREEN processes the output of a speech recognizer and produces a flat syntactic and semantic interpretation of concurrent word hypothesis sequences (also called sentence hypotheses here).

5.1 The Overall Environment

The overall processing is incremental from left to right, and at any time multiple sentence hypotheses are processed in parallel. Figure 12 shows a snapshot of SCREEN after 0.95s of the utterance.

Example-Sentence-Figure

Figure 12: First snapshot for sentence ``Käse ich meine natürlich März'' (``Rubbish I mean of_course March''). The abbreviations are explained in Tables 1 to 4. Below, the second pop-up window illustrates the full preferences of the word ``meine'' (``mean'') for its basic syntactic categories. An animation of this sentence can be found in Figure 20.

At this time the snapshot shows the first three sentence hypotheses as the German words together with their (literal) English translations (``Rubbish I mean'', ``Rubbish I'', ``Rubbish I had''). The SCREEN environment allows the user to view and inspect the incremental generation of word hypothesis sequences (partial sentence hypotheses) and their most preferred syntactic and semantic categories at the basic and abstract level. Each sentence hypothesis is displayed horizontally. At any given time many sentence hypotheses can be active in parallel. They are ranked in descending order of plausibility. So in the snapshot in Figure 12 there are currently three sentence hypotheses, and the preferred current sentence hypothesis is ``Rubbish I mean''.

All these sentence hypotheses are syntactically and semantically plausible starts. The underlying variations are introduced by the speech recognizer, which produced different word hypotheses for slightly overlapping signal parts of the sentence. Besides the speech plausibility, syntax and semantics can also help with choosing better sentence hypotheses. Currently we compute the plausibility of a sentence hypothesis as the product of its speech recognition plausibility, its syntactic plausibility, and its semantic plausibility, each normalized to a value between 0 and 1. Since the speech recognizer does not contain syntactic and semantic knowledge, a sentence hypothesis rated plausible based on speech knowledge alone may neglect the potential of syntactic and semantic regularity. By using corresponding syntactic and semantic plausibility values for a sentence hypothesis we can integrate acoustic, syntactic, and semantic knowledge.
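The combination rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `SentenceHypothesis`, the function `rank`, and all numeric values are invented for this sketch and are not taken from SCREEN. It shows how a hypothesis with the best speech plausibility can still be ranked below one with better syntactic and semantic plausibility.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceHypothesis:
    """Hypothetical container; the three values are assumed normalized to [0, 1]."""
    words: List[str]
    speech: float      # speech recognition plausibility
    syntax: float      # syntactic plausibility
    semantics: float   # semantic plausibility

    def plausibility(self) -> float:
        # Combined plausibility as the product of the three normalized values.
        for value in (self.speech, self.syntax, self.semantics):
            if not 0.0 <= value <= 1.0:
                raise ValueError("plausibility values must lie in [0, 1]")
        return self.speech * self.syntax * self.semantics


def rank(hypotheses: List[SentenceHypothesis]) -> List[SentenceHypothesis]:
    # Most plausible sentence hypothesis first (descending plausibility).
    return sorted(hypotheses, key=lambda h: h.plausibility(), reverse=True)


# Invented example values, NOT measurements from the SCREEN system.
hypotheses = [
    SentenceHypothesis(["Käse", "ich", "hätte"], speech=0.8, syntax=0.5, semantics=0.6),
    SentenceHypothesis(["Käse", "ich", "meine"], speech=0.6, syntax=0.9, semantics=0.9),
    SentenceHypothesis(["Käse", "ich"], speech=0.7, syntax=0.8, semantics=0.8),
]

ranked = rank(hypotheses)
for h in ranked:
    print(" ".join(h.words), round(h.plausibility(), 3))
```

Because every factor lies between 0 and 1, a single low value pulls the whole product down, so a hypothesis must be reasonably plausible on all three knowledge sources to reach the top of the ranking.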

Each word hypothesis is shown with the preferred basic syntactic hypothesis (upper left square of a word hypothesis), the preferred abstract syntactic hypothesis (upper middle square), the preferred basic semantic hypothesis (lower left square), the preferred abstract semantic hypothesis (lower middle square), the preferred dialog act (upper right square)⁹, and the integrated acoustic, syntactic and semantic confidence of the partial sentence hypothesis up to that point (lower right square). The size of the square illustrates the strength of the hypothesis, and a full black square means that the value of a preferred hypothesis is close to one. For instance, in the word hypothesis for ``ich'' (``I'') in the first sentence hypothesis we have the hypothesis of a pronoun (U) as the basic syntactic category, a noun group (NG) as the abstract syntactic category, an animate object (ANIM) as the basic semantic category, an AGENT as the abstract semantic category, and suggestion (SUG) as dialog act. Furthermore, the length of a vertical bar between word hypotheses indicates the plausibility for a new phrase start.

As another example, we can see the representation of our example word ``meine'' (which could be the verb ``mean'' or the pronoun ``my'' in German) which we have used throughout the network descriptions (see Figure 9). The network had a correct preference for ``meine'' being a verb (V). Figure 12 shows this preference as well as a zoomed illustration of all other less favored preferences in a second pop-up window below. As we can see, the competing pronoun preference (U) received the second strongest activation while all other preferences are close to 0. The activation preferences shown are the output values of the corresponding network for basic syntactic categorization. Thus each activation value shown in our snapshots belongs to the most preferred hypothesis only, while all other hypotheses can be shown on request¹⁰.
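As a minimal sketch of how the displayed preference relates to the full output vector of a categorization network, the following code picks the strongest activation and keeps the remaining ones available for inspection. The function names and the activation values are assumptions made for illustration; only the labels V (verb) and U (pronoun) are taken from the text, while N and A are placeholder labels.

```python
from typing import Dict, List, Tuple


def rank_preferences(activations: Dict[str, float]) -> List[Tuple[str, float]]:
    # All category hypotheses, strongest activation first (the "on request" view).
    return sorted(activations.items(), key=lambda item: item[1], reverse=True)


def preferred(activations: Dict[str, float]) -> Tuple[str, float]:
    # The single hypothesis that would be drawn as a square in the snapshot.
    return rank_preferences(activations)[0]


# Invented activations for "meine"; N and A are placeholder category labels.
meine_activations = {"U": 0.30, "N": 0.02, "V": 0.85, "A": 0.01}

label, strength = preferred(meine_activations)
print("preferred:", label, strength)
print("all preferences:", rank_preferences(meine_activations))
```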

Within the display we can scroll up and down through the ranked sentence hypotheses. Furthermore, we can scroll left and right to analyze specific longer word hypothesis sequences. There is also a step mode in which the SCREEN system waits for an interactive mouse click before it processes the next incoming word hypothesis, which allows a very detailed analysis. This step mode can be adapted to a different number of steps (word hypotheses), and it can be switched off completely if one decides to analyze the sentence hypotheses later or at the end of all word hypotheses. Only the most preferred of all possible syntactic and semantic hypotheses is shown. Therefore many different hypotheses appear to have the same size. However, by clicking on one of the squares the other less confident hypotheses can be displayed as well.

5.2 Analyzing the Final Snapshot in Short Sentence Hypotheses

Example-Sentence-Figure

Figure 13: Final snapshot for sentence ``Käse ich meine natürlich März'' (``Rubbish I mean of_course March''). An animation of this sentence can be found in Figure 20.

In Figure 13 we illustrate the final state after 3.01s of the utterance. Eight possible sentence hypotheses remained, of which we see the first four in Figure 13. Starting with the fourth sentence hypothesis ``Käse ich hätte ich März'' (``Rubbish I had I March'') we can see that this lower rated sentence hypothesis is not the desired sentence. The lower ranked hypotheses are good examples showing that current state-of-the-art speech recognizers alone will not be able to produce reliable sentence hypotheses, since the problem of analyzing spontaneous speaker-independent speech is very complex. Therefore the syntactic and semantic components for spontaneous language have to take into account that there will be highly irregular sequences such as those shown in Figure 13. However, it is interesting to observe that the underlying connectionist networks always produce a preference for the syntactic and semantic interpretation at the abstract and basic level. In fact, although the lower ranked sentence hypotheses do not constitute the desired sentence, all assigned syntactic and semantic categories are correct for the individual word hypotheses. Of course, there may be cases in which a network makes a wrong decision for uncertain word hypotheses. However, the syntactic and semantic processing will never break down for any possible sentence hypothesis, and in this respect it differs from better-known methods like symbolic context-free chart parsers.

The top-ranked sentence hypothesis ``Käse ich meine natürlich März'' (``Rubbish I mean of_course March'') is also the desired sentence. It is the most plausible sentence based on speech and language plausibility. Furthermore, we can see that the assigned categories are correct: The German word ``Käse'' (``Rubbish'') is found to be a noun as part of a noun group which expresses a negation. ``Ich'' (``I'') starts a new phrase; it is a pronoun within a noun group which represents an animate being and an agent. The following German word ``meine'' is particularly interesting since it can be used as a verb in the sense of ``mean'' but also as a pronoun in the sense of ``my''. Therefore, the connectionist network for the basic syntactic classification has to disambiguate these two possibilities based on the preceding context. The network has learned to take the preceding context into consideration and is able to choose the correct basic syntactic category verb (V) rather than pronoun (U) for the word ``meine'' (``mean''). At this point a new phrase start has been found as well. The following word ``natürlich'' (``of course'') has the highest preference for an adverb and a special group. Finally, the word ``März'' (``March'') is assigned the highest plausibility for a noun and noun group as well as a time at which something happens.

5.3 Phrase Starts and Phrase Groups in Longer Sentence Hypotheses

Now we will focus on a detailed analysis of a second example: ``Ähm ja genau allerdings habe ich da von neun bis vier Uhr schon einen Arzttermin''. The literally translated sentence to be analyzed is: ``Eh yes exactly however have I there from nine to four o'clock already a doctor-appointment''. A better but non-literal translation would be: ``Eh yes exactly however then I have a doctor appointment from nine to four o'clock''. During the analysis of the first few sentence hypotheses, the interjection ``ähm'' (``eh'') is detected by the corresponding module in the correction part and is eliminated from the respective sentence hypotheses.

Example-Sentence-Figure

Figure 14: First part of the snapshot for sentence ``Ähm ja genau allerdings habe ich da von neun bis vier Uhr schon einen Arzttermin'' (literal translation: ``Yes exactly however have I there from nine to four o'clock already a doctor-appointment''; improved translation: ``Eh yes exactly however then I have a doctor appointment from nine to four o'clock''). An animation of this sentence can be found in Figure 21.

In Figure 14 and Figure 15 we show the four best sentence hypotheses found. The categories of these sentence hypotheses look similar, but we have to keep them as separate hypotheses since they differ in their time stamps and their speech confidence values.

Example-Sentence-Figure

Figure 15: Second part of the snapshot for sentence ``Ähm ja genau allerdings habe ich da von neun bis vier Uhr schon einen Arzttermin'' (``Yes exactly however have I there from nine to four o'clock already a doctor-appointment''). An animation of this sentence can be found in Figure 21.

In these two snapshots of this longer example we can also illustrate the influence of the phrase starts. The sequences ``von neun'' (``from nine'') and ``bis vier Uhr'' (``to four o'clock'') constitute two phrase groups which are clearly separated by the black bar before the prepositions ``von'' (``from'') and ``bis'' (``to''). All the other words ``neun'' (``nine''), ``vier'' (``four''), and ``Uhr'' (``o'clock'') do not start another phrase group. Since the underlying connectionist network for learning the phrase boundaries is a simple recurrent network, this example demonstrates that the network has learned to use the preceding context. A noun like ``Uhr'' (``o'clock'') does not have to be within a prepositional phrase group; in another context like ``vier Uhr paßt gut'' (``four o'clock fits well'') it could also be part of a noun phrase. Only because the network has learned that a preposition ``von'' (``from'') or ``bis'' (``to'') preceded it can it keep the noun within the prepositional phrase group here.
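The way phrase-start plausibilities partition a word sequence into phrase groups can be sketched as follows. This is not the recurrent network itself; it only illustrates how its per-word output could be consumed. The function name, the threshold of 0.5, and the plausibility values are assumptions for this sketch.

```python
from typing import List


def split_into_phrase_groups(words: List[str],
                             phrase_start: List[float],
                             threshold: float = 0.5) -> List[List[str]]:
    """Open a new phrase group whenever the phrase-start plausibility of a
    word reaches the (assumed) threshold; otherwise extend the current group."""
    if len(words) != len(phrase_start):
        raise ValueError("need exactly one phrase-start value per word")
    groups: List[List[str]] = []
    for word, plausibility in zip(words, phrase_start):
        if not groups or plausibility >= threshold:
            groups.append([word])      # long bar in the display: new phrase
        else:
            groups[-1].append(word)    # short bar: word continues the phrase
    return groups


# Invented phrase-start plausibilities: high before the two prepositions only.
words = ["von", "neun", "bis", "vier", "Uhr"]
phrase_start = [0.9, 0.1, 0.9, 0.1, 0.1]

print(split_into_phrase_groups(words, phrase_start))
```

With these assumed values the two prepositions open the groups ``von neun'' and ``bis vier Uhr'', mirroring the black bars described above.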

5.4 Dealing with Noise as Repairs

Finally we will focus on the example for the simple word graph shown at the beginning of this paper (Figure 4): ``Ähm am sechsten April bin ich leider außer Hause''. The literal translation is ``Eh on 6th April am I unfortunately out of home''. Using this sentence we will give an example of an interjection and a simple word repair. Dealing with hesitations and repairs is a large area in spontaneous language processing and is not the main topic of this paper (a more detailed discussion of repairs in SCREEN can be found in previous work, Weber & Wermter, 1996). Nevertheless, for the sake of illustration and completeness we show the ability of SCREEN to deal with interjections and word repairs. The first snapshot in Figure 16 shows the start of our example sentence after 1.39s. The leading interjection ``eh'' has been eliminated already.

Example-Sentence-Figure

Figure 16: First snapshot for sentence ``Ähm am sechsten April bin ich leider außer Hause'' (``Eh on 6th April am I unfortunately out of home''). An animation of this sentence can be found in Figure 22.

Furthermore, we can see that the second word hypothesis sequence shows two subsequent word hypotheses for ``ich'' (``I''). This is possible since there were two word hypotheses generated by the speech recognizer which could be connected. In this case there were the four word hypotheses shown below:

start time   end time   word hypothesis   speech plausibility
1.22s        1.37s      ich (I)           1.527688e-03
1.23s        1.30s      ich (I)           1.178415e-02
1.23s        1.37s      ich (I)           2.463924e-03
1.31s        1.38s      ich (I)           1.813340e-02

Just using this speech knowledge from the word hypotheses, it is possible to connect the second hypothesis which runs from 1.23s to 1.30s with the fourth hypothesis which runs from 1.31s to 1.38s. This is an example of noise generated by the speech recognizer, since the desired sentence contains only one word ``ich'' (``I'') but the sentence hypothesis at this point contains two. This repetition can be treated and eliminated in the same way as actual word repairs in language. While the reasons for the occurrence of such repairs are different the effect of a repeated word is the same. Therefore, in this case the repeated ``ich'' (``I'') is eliminated from the sentence sequence. In Figure 17 we show the final snapshot of the sentence. We can see that no word repairs occur in the top-ranked sentence hypothesis which is also the desired sentence.
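The two steps described above, connecting word hypotheses by their time stamps and then eliminating the resulting repetition, can be sketched as follows. This is a simplified illustration, not SCREEN's correction module: the names, the connection criterion, and the maximum gap of 0.02s are assumptions, and the repair step compares surface words only. The word hypotheses are the four listed above.

```python
from typing import List, NamedTuple, Tuple


class WordHypothesis(NamedTuple):
    start: float                 # start time in seconds
    end: float                   # end time in seconds
    word: str
    speech_plausibility: float


MAX_GAP = 0.02  # assumed tolerance in seconds between two connectable hypotheses


def can_connect(first: WordHypothesis, second: WordHypothesis,
                max_gap: float = MAX_GAP) -> bool:
    # Assumed criterion: the second hypothesis starts at or shortly after
    # the end of the first one.
    gap = second.start - first.end
    return 0.0 <= gap <= max_gap


def connectable_pairs(hyps: List[WordHypothesis]) -> List[Tuple[int, int]]:
    return [(i, j)
            for i, a in enumerate(hyps)
            for j, b in enumerate(hyps)
            if i != j and can_connect(a, b)]


def eliminate_repetitions(words: List[str]) -> List[str]:
    # Simplified word repair: drop a word that directly repeats its predecessor.
    repaired: List[str] = []
    for word in words:
        if not repaired or repaired[-1] != word:
            repaired.append(word)
    return repaired


ich_hypotheses = [
    WordHypothesis(1.22, 1.37, "ich", 1.527688e-03),
    WordHypothesis(1.23, 1.30, "ich", 1.178415e-02),
    WordHypothesis(1.23, 1.37, "ich", 2.463924e-03),
    WordHypothesis(1.31, 1.38, "ich", 1.813340e-02),
]

pairs = connectable_pairs(ich_hypotheses)   # zero-based indices into the list
repaired = eliminate_repetitions(["am", "sechsten", "April", "bin", "ich", "ich"])
print(pairs)
print(repaired)
```

Under this assumed criterion only the second and fourth hypotheses can be chained, which yields the doubled ``ich''; the repair step then reduces it to a single occurrence.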

Example-Sentence-Figure

Figure 17: Final snapshot for sentence ``Ähm am sechsten April bin ich leider außer Hause'' (``Eh on 6th April am I unfortunately out of home''). An animation of this sentence can be found in Figure 22.

In general, for language repairs, SCREEN can deal with the elimination of interjections and pauses, the repair of word repetitions, word corrections (where the words may be different, but their categories are the same) as well as simple forms of phrase repairs (where a phrase is repeated or replaced by another phrase).


SCREEN (screen@nats5.informatik.uni-hamburg.de)
Mon Dec 16 15:33:13 MET 1996

