
7. Discussion

First we will focus on what has been learned about processing spoken language. When we started the SCREEN project, it was not predetermined whether a deep analysis or a flat screening analysis would be more appropriate for robust analysis of spoken sentences. A deep analysis with highly structured representations is less appropriate, since the unpredictable faulty variations in spoken language limit the usefulness of deep structured knowledge representations much more than is the case for written language. Deep interpretations and highly structured representations - as possible, for instance, with HPSG grammars for text processing - make a great many assumptions and predictions which do not hold for faulty spoken language. Furthermore, we have learned that for certain tasks we do not even need a deep interpretation to generate a semantic and syntactic representation. For instance, for translating between two languages it is not necessary to resolve all prepositional phrase attachment ambiguities, since during translation the resolved attachments may become ambiguous again in the target language.

However, we do use some structure at the level of words and phrases for syntax and semantics, respectively. We learned that a single flat semantics level, rather than the four flat syntax and semantics levels, is not sufficient, since syntax is necessary for detecting phrase boundaries. One could argue that one abstract syntactic phrase representation and one abstract semantic phrase representation might be enough. However, we found that the basic syntactic and semantic representations at the word level make the task easier for the subsequent abstract analysis at the phrase level. Furthermore, the basic syntactic and semantic representations are necessary for other tasks as well, for instance for judging the plausibility of a sequence of syntactic and semantic categories. This plausibility is used as a filter for finding good word hypothesis sequences. Therefore, we argue that for processing faulty spoken language - for a task like sentence translation or question answering - we need much less structured representations than are typically used in well-known parsers, but more structured representations than those of a single-level tagger.
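
Such a plausibility filter over category sequences can be illustrated with a minimal sketch. The category names, transition scores, and thresholds below are purely hypothetical assumptions for illustration, not SCREEN's actual values or mechanism (which is learned by connectionist networks rather than tabulated):

```python
# Hypothetical sketch: plausibility of a syntactic category sequence
# used as a filter over competing word hypothesis sequences.
# The transition table and categories are illustrative only.

PLAUSIBILITY = {
    ("determiner", "noun"): 0.95,
    ("noun", "verb"): 0.9,
    ("verb", "determiner"): 0.8,
    ("verb", "verb"): 0.1,   # two finite verbs in a row is implausible
}

def sequence_plausibility(categories):
    """Multiply pairwise transition plausibilities; unseen pairs get a small floor."""
    score = 1.0
    for prev, nxt in zip(categories, categories[1:]):
        score *= PLAUSIBILITY.get((prev, nxt), 0.05)
    return score

def filter_hypotheses(hypotheses, threshold=0.1):
    """Keep only word hypothesis sequences whose category sequence is plausible enough."""
    return [h for h in hypotheses if sequence_plausibility(h["categories"]) >= threshold]

good = filter_hypotheses([
    {"words": ["the", "meeting", "starts"],
     "categories": ["determiner", "noun", "verb"]},
    {"words": ["starts", "starts", "the"],
     "categories": ["verb", "verb", "determiner"]},
])
```

Only the first hypothesis survives the filter; the implausible category transitions of the second push its score below the threshold.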

In some of our previous work we had gained early experience with related connectionist networks for analyzing text phrases. Moving from analyzing text phrases to analyzing unrestricted spoken utterances, however, reveals tremendous differences between the two tasks. We found that the phrase-oriented flat analysis used in SCAN (Wermter, 1995) is advantageous in principle for spoken-language analysis, and the phrase-oriented analysis is common to learning text and speech processing. However, we learned that spoken-language analysis needs a much more sophisticated architecture. In particular, since spoken language contains many unpredictable errors and variations, fault tolerance and robustness are much more important. Connectionist networks have an inherent implicit robustness based on their similarity-based processing in gradual numerical representations. In addition, we found that for some classes of relatively frequent mistakes there should also be explicit robustness, provided by machinery for handling interjections as well as word and phrase repairs. Furthermore, the architecture has to support the processing of a potentially large number of competing word hypothesis sequences, rather than the single sentence or phrase of text processing.
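
As a toy illustration of what such explicit repair machinery does (the interjection list and the repetition rule below are simplifying assumptions, not SCREEN's learned repair modules):

```python
# Illustrative sketch of explicit robustness for frequent speech errors:
# removing interjections and collapsing simple word repetitions ("repairs")
# in a word hypothesis sequence. The interjection list is a toy assumption.

INTERJECTIONS = {"uh", "um", "well"}

def clean_utterance(words):
    """Drop interjections and collapse immediate word repetitions."""
    cleaned = []
    for w in words:
        if w.lower() in INTERJECTIONS:
            continue                      # explicit interjection handling
        if cleaned and cleaned[-1] == w:
            continue                      # simple word repair: "on on Monday"
        cleaned.append(w)
    return cleaned

result = clean_utterance(["well", "we", "we", "meet", "uh", "on", "on", "Monday"])
```

A full system must of course also detect phrase repairs, where a whole phrase is restarted rather than a single word repeated.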

Now we will focus on what has been learned about connectionist and hybrid architectures. In the beginning we did not predetermine whether connectionist methods would be particularly useful for control, for individual modules, or for both. However, during the development of the SCREEN system it became clear that for the general task of spoken-language understanding, individual subtasks like syntactic analysis had to be very fault-tolerant because of the ``noise'' in spoken language, due both to humans and to speech recognizers. Unforeseeable variations in particular occur often in spontaneously spoken language and cannot be well predefined in advance as symbolic rules in a general manner. This fault tolerance at the task level could be supported particularly well by the inherent fault tolerance of connectionist networks for individual tasks and by the support of inductive learning algorithms. So we learned that for a flat robust understanding of spoken language, connectionist networks are particularly effective within individual subtasks.

There has been quite a lot of work on control in connectionist networks. However, in many cases these approaches have concentrated on control in single networks. Only recently has there been more work on control in modular architectures (Sumida, 1991; Jacobs et al., 1991b; Jain, 1991; Jordan & Jacobs, 1992; Miikkulainen, 1996). For instance, in the approach by Jacobs and Jordan (Jacobs et al., 1991b; Jordan & Jacobs, 1992), both task knowledge and control knowledge are learned. Task knowledge is learned in individual task networks, and higher control networks are responsible for learning which single task network is responsible for producing the output. Originally it was an open question whether connectionist control would be possible for processing spoken language. While automatic modular task decomposition (Jacobs et al., 1991a) can be done for simple forms of function approximation, more complex problems like understanding spoken language in real-world environments still need designer-based modular task decomposition into the necessary tasks.
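
The Jacobs-Jordan idea can be sketched in a few lines: expert networks hold task knowledge, and a gating network assigns responsibility among them per input. The sketch below shows only the forward pass of such a mixture; the dimensions, random weights, and class names are illustrative assumptions, not the cited authors' actual implementation:

```python
import numpy as np

# Minimal mixture-of-experts sketch in the spirit of Jacobs & Jordan:
# expert networks carry task knowledge, a gating network decides which
# expert is responsible for each input. Shapes and weights are illustrative.

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class Mixture:
    def __init__(self, n_experts, n_in, n_out):
        # each expert is a simple linear map; the gate maps input to responsibilities
        self.experts = [rng.normal(size=(n_in, n_out)) for _ in range(n_experts)]
        self.gate = rng.normal(size=(n_in, n_experts))

    def forward(self, x):
        g = softmax(x @ self.gate)                   # responsibility per expert, sums to 1
        outs = np.stack([x @ W for W in self.experts])
        return g @ outs, g                           # gated combination of expert outputs

mix = Mixture(n_experts=3, n_in=4, n_out=2)
y, g = mix.forward(rng.normal(size=4))
```

In training, the gating weights and the expert weights are both adapted, which is what makes control knowledge itself learnable in this family of architectures.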

We learned that connectionist control in an architecture with many modules and subtasks seems to be beyond the capabilities of current connectionist networks. It has been shown that connectionist control is possible for a limited number of connectionist modules (Miikkulainen, 1996; Jain, 1991). For instance, Miikkulainen shows that a connectionist segmenter and a connectionist stack can control a parser to analyze embedded clauses. However, the communication paths within these three modules still have to be very restricted. Especially for a real-world system for spoken-language understanding - from speech, over syntax and semantics, to dialog processing for translation - it is extremely difficult to learn to coordinate the different activities, especially for a large parallel stream of word hypothesis sequences. We believe that this may become possible in the future; currently, however, connectionist control in SCREEN is restricted to the detection of certain hesitation phenomena like corrections.

Considering flat screening analysis of spoken language and hybrid connectionist techniques together, we have developed and followed a general guideline (or design philosophy): use as little knowledge as necessary, get as far as possible with connectionist networks, and use symbolic representations only where necessary. This guideline led us to (1) a flat but robust representation for spoken-language analysis and (2) the use of hybrid connectionist techniques which support the task with the most appropriate knowledge structure. Many hybrid systems contain just a small portion of connectionist representations in addition to many other modules, e.g. BeRP (Wooters, 1993; Jurafsky et al., 1994a; Jurafsky et al., 1994b), JANUS (Waibel et al., 1992), and TRAINS (Allen, 1995; Allen et al., 1995). In contrast, most of the important subtasks in SCREEN are performed directly by many connectionist networks.

Furthermore, we have learned that flat syntactic and semantic representations can give surprisingly good training and test results when trained and tested with a medium-size corpus of about 2300 words in 184 dialog turns. These good results are mostly due to the learned internal weight representation and the local context, which adds sequentiality to the category assignments. Without the internal weight representation of the preceding context, the syntactic and semantic categorization does not perform equally well, so the choice of recurrent networks is crucial for many sequential category assignments. Therefore these networks and techniques hold potential especially for medium-size domains where a restricted amount of training material is available. While statistical techniques are often used for very large data sets but do not work well for medium-size data sets, the connectionist techniques we used work well for medium-size domains.
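
The role of the recurrent context can be seen in a minimal Elman-style sketch: the previous hidden state is fed back as context, so each word's category assignment depends on the preceding sequence. The dimensions and random weights below are illustrative assumptions; SCREEN's actual networks are trained, not randomly initialized and used as-is:

```python
import numpy as np

# Sketch of a simple recurrent (Elman-style) network: the hidden layer's
# previous state is fed back as context, which is what adds sequentiality
# to per-word category assignments. Sizes and weights are illustrative.

rng = np.random.default_rng(1)
n_in, n_hidden, n_cat = 10, 8, 5

W_in = rng.normal(scale=0.1, size=(n_in, n_hidden))    # input -> hidden
W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_out = rng.normal(scale=0.1, size=(n_hidden, n_cat))  # hidden -> category

def categorize(word_vectors):
    """Assign a category distribution to each word, using the recurrent context."""
    context = np.zeros(n_hidden)          # copy of the previous hidden state
    outputs = []
    for x in word_vectors:
        hidden = np.tanh(x @ W_in + context @ W_ctx)
        z = hidden @ W_out
        outputs.append(np.exp(z) / np.exp(z).sum())   # softmax over categories
        context = hidden                  # feed hidden state back as context
    return outputs

outs = categorize([rng.normal(size=n_in) for _ in range(3)])
```

Removing the `context @ W_ctx` term reduces this to a per-word feedforward classifier, which is exactly the configuration that performed worse in our experience.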

The techniques used can be ported to different domains and used for different purposes. Even if different sets of categories have to be used, the learning networks are able to extract the corresponding regularities automatically. Besides the domain of arranging business meetings, we have also ported SCREEN to the domain of interactions at a railway counter, with comparable syntactic and semantic results. These two domains differed primarily in their semantic categories, while the syntactic categories (and networks) of SCREEN could be used directly.

SCREEN has the potential to scale up. In fact, based on the imperfect output of a speech recognizer, several thousand sentence hypotheses have already been processed. If new words are to be processed, their basic syntactic and semantic categories are simply entered into the lexicon. The structure of the individual networks does not change, no new units have to be added, and therefore the networks do not have to be retrained.
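
The lexicon extension step is deliberately simple; a toy sketch (entry names and category labels here are invented for illustration, not SCREEN's actual lexicon format):

```python
# Illustrative lexicon extension: adding a new word only means entering its
# basic syntactic and semantic categories; the trained networks are untouched.

lexicon = {
    "meeting": {"syntax": "noun", "semantics": "meet-event"},
}

def add_word(word, syntax, semantics):
    """Register a new word's basic categories without retraining any network."""
    lexicon[word] = {"syntax": syntax, "semantics": semantics}

add_word("timetable", "noun", "travel-object")
```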

The amount of hand-coding is restricted primarily to the symbolic control of the module interaction and to the labeling of the training material for the individual networks. When we changed the domain to railway counter interactions, we could use the identical control, as well as the syntactic networks. Only the semantic networks had to be retrained due to the different domain.

So far we have focused on supervised learning in simple recurrent networks and feedforward networks. Supervised learning still requires a training set, and some manual labeling work still has to be done. Although, especially for medium-size corpora, labeling examples is easier than, for instance, designing complete rule bases, it would be nice to automate the knowledge acquisition even further. Currently we plan to build a more sophisticated lexicon component which will provide support for automatic lexicon design (Riloff, 1993) and dynamic lexicon entry determination using local context (Miikkulainen, 1993).

Furthermore, SCREEN could be extended in its speech construction and evaluation part. The syntactic and semantic hypotheses could be used for more interaction with the speech recognizer. Currently, syntactic and semantic hypotheses from the speech evaluation part are used to exclude unlikely word hypothesis sequences from the language modules. However, these hypotheses from the connectionist networks for syntax and semantics - in particular the modules for basic syntactic and semantic category prediction - could in the future also be integrated directly into the recognition process, in order to provide syntactic and semantic feedback to the speech recognizer at an early stage. Besides syntax and semantics, cue phrases, stress, and intonation could provide additional knowledge for speech/language processing (Hirschberg, 1993; Gupta & Touretzky, 1994). These issues will be additional major efforts for the future.





SCREEN (screen@nats5.informatik.uni-hamburg.de)
Mon Dec 16 15:33:13 MET 1996