
6. Design Analysis of SCREEN



In this section we describe our design choices in SCREEN. In particular, we focus on why we use connectionist networks, why we reach high accuracy with little training, and how SCREEN compares to other systems and other design principles.

6.1 Why Did We Use Connectionist Networks in SCREEN?

In the past, n-gram based techniques have been used successfully for tasks like syntactic category prediction or part-of-speech tagging. Therefore, one may ask why we developed simple recurrent networks in SCREEN. In this subsection we provide a detailed comparison of simple recurrent networks and n-gram techniques for the prediction of basic syntactic categories. We chose this task for a detailed comparison since it is currently the most difficult task for a simple recurrent network in SCREEN. We purposefully did not choose a subtask for which a simple recurrent network had a very high accuracy: predicting a category is more difficult than, for instance, disambiguating among categories. So we chose the difficult prediction task, with its relatively low network performance, in order to be as fair as possible in the comparison with n-gram techniques.

We are primarily interested in the generalization behavior for new unknown input. Therefore Figure 18 shows the accuracy of the syntactic prediction for the unknown test set. After each word several different syntactic categories can follow and some syntactic categories are excluded. For instance, after the determiner ``the'' an adjective or a noun can follow: ``the short ...'', ``the appointment''; but a preposition after ``the'' is implausible and should most probably be excluded. Therefore it is important to know how many categories can be ruled out, and Figure 18 shows the relationship between the prediction accuracy and the number of excluded categories for n-grams and our simple recurrent network (as described in Figure 8).


Figure 18: Comparison between simple recurrent network and n-grams

As expected, for both techniques, n-grams and recurrent networks, the prediction accuracy is higher if only a few categories have to be excluded and lower if many categories have to be excluded. More interestingly, however, the simple recurrent networks performed better than 1-grams, 2-grams, 3-grams, 4-grams and 5-grams. Furthermore, it is interesting to note that higher-order n-grams do not necessarily lead to better performance. For instance, the 4-grams and 5-grams perform worse than 2-grams, since they would probably need much larger training sets.
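
To make the comparison procedure concrete, the following minimal Python sketch (with a hypothetical category inventory and toy training sequences, not SCREEN's actual data) shows how a bigram category predictor can be scored against the number of categories it rules out:

from collections import Counter, defaultdict

# Hypothetical basic syntactic categories and toy training sequences;
# SCREEN's real inventory and corpus are not reproduced here.
CATS = ["DET", "ADJ", "NOUN", "VERB", "PREP", "PRON", "ADV", "INTERJ"]
train = [["DET", "ADJ", "NOUN", "VERB", "PREP", "DET", "NOUN"],
         ["PRON", "VERB", "DET", "NOUN", "INTERJ", "ADV"]]

# Collect bigram counts: how often each category follows another.
bigram = defaultdict(Counter)
for seq in train:
    for prev, nxt in zip(seq, seq[1:]):
        bigram[prev][nxt] += 1

def allowed(prev, n_excluded):
    # Rank follow-up categories by bigram frequency and rule out
    # the n_excluded least likely ones.
    ranked = sorted(CATS, key=lambda c: -bigram[prev][c])
    return set(ranked[:len(CATS) - n_excluded])

# A prediction counts as correct if the true next category survives
# the exclusion; accuracy drops as more categories are excluded.
test = [("DET", "NOUN"), ("VERB", "PREP"), ("PREP", "DET")]
for k in (2, 4, 6):
    hits = sum(nxt in allowed(prev, k) for prev, nxt in test)
    print(f"excluded={k}: accuracy={hits / len(test):.2f}")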

We performed the same comparison of n-grams (1-5) and simple recurrent networks for semantic prediction and obtained the same result: simple recurrent networks performed better than n-grams. The performance of the best n-gram was often only slightly worse than that of the simple recurrent network, which indicates that n-grams are a reasonably useful technique. However, in all comparisons simple recurrent networks performed at least slightly better than the best n-grams. Therefore, we used simple recurrent networks as our primary technique for connectionist sequence learning in SCREEN.

How can we explain this result? N-grams like 2-grams still perform reasonably well for our task, and simple recurrent networks are closest to their performance. However, simple recurrent networks perform slightly better since they do not rely on a fixed and limited context. In many sequences, the simple recurrent network may primarily use the directly preceding word representation to make a prediction. In some exceptions, however, more context is required, and the recurrent network has a memory of the internal reduced representation of the preceding context. Therefore, it has the potential to be more flexible with respect to the context size.

N-grams may not perform optimally but they are extremely fast. So the question arises how much time is necessary to compute a new category using new input and the current context for the network. In general our networks differ slightly in size, but typically they contain several hundred weights. For a typical representative simple recurrent network with 13 input units, 14 hidden units, 8 output units, and 14 context units, and about 500 weights, it takes about 10^-4 s on a Sparc Ultra to compute a new category within the whole forward sweep.
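
As a rough sketch of such a forward sweep, the following fragment builds an Elman-style simple recurrent network with the layer sizes quoted above (13 input, 14 hidden, 8 output, 14 context units, about 490 weights without biases); the random weights are stand-ins, not SCREEN's trained values:

import numpy as np

rng = np.random.default_rng(0)
W_ih = rng.normal(0.0, 0.1, (14, 13 + 14))  # hidden <- [input; context]
W_ho = rng.normal(0.0, 0.1, (8, 14))        # output <- hidden

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, context):
    # One forward sweep: combine the new word input with the stored
    # copy of the previous hidden layer, then compute the output.
    hidden = sigmoid(W_ih @ np.concatenate([x, context]))
    return sigmoid(W_ho @ hidden), hidden   # new context = hidden copy

context = np.zeros(14)
for word_vec in rng.random((3, 13)):        # three dummy word inputs
    output, context = forward(word_vec, context)
print("category activations:", np.round(output, 2))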

Since the techniques for smoothed n-grams basically rely on an efficient table look-up of precomputed values, typical n-gram techniques are of course still faster. However, due to their fixed-size context they may not perform as well as simple recurrent networks. Furthermore, computing the next possible categories in 10^-4 s is fast enough for our current version of SCREEN. For the sake of an explanation, one could argue that SCREEN contains about 10 network modules and a typical utterance contains 10 words, so a single utterance hypothesis could be processed in 10^-2 s. However, different from text tagging, we do not have single sentences but process word graphs. Depending on the specific utterance, about 10^5 word hypothesis sequences could be generated and have to be processed. Furthermore, some book-keeping is required for keeping the best word hypotheses, for loading the appropriate networks with the appropriate word hypotheses, etc. The potentially large number of word hypotheses, the additional book-keeping, and the number of individual modules for syntax, semantics, and dialog processing explain why the total analysis time of the whole unoptimized SCREEN system is on the order of seconds, although a single recurrent network performs on the order of 10^-4 s.

6.2 Improvement in the Hypothesis Space

In this subsection we analyze to what extent the syntactic and semantic prediction knowledge can be used to improve the best found sentence hypotheses. We illustrate the pruning performance in the hypothesis space by integrating acoustic, syntactic, and semantic knowledge. While the speech recognizer alone provides only acoustic confidence values, SCREEN adds syntactic and semantic knowledge. All these knowledge sources are weighted equally in order to compute a single plausibility value for the current word hypothesis sequence. This plausibility value is used in the speech construction part to prune the hypothesis space and to select the currently best word hypothesis sequences. Several word hypothesis sequences are processed incrementally and in parallel. At a given time the n best incremental word hypothesis sequences are kept.

The syntactic and semantic plausibility values are based on the basic syntactic and semantic prediction (BAS-SYN-PRE and BAS-SEM-PRE) of the next possible categories for a word and the selection of a preference by the determined basic syntactic or semantic category (BAS-SYN-DIS and BAS-SEM-DIS). The performance of the disambiguation modules is 86%-89% for the test set. For the prediction modules the performance is 72% and 81% for the semantic and syntactic test set, respectively, if we want to exclude at least 8 of the 12 possible categories. This performance allows us to compute a syntactic and semantic plausibility in SYN-SPEECH-ERROR and SEM-SPEECH-ERROR. Based on the combined acoustic, syntactic, and semantic knowledge, first tests on the 184 turns show that the accuracy of the constructed sentence hypotheses of SCREEN could be increased by about 30% using acoustic and syntactic plausibilities and by about 50% using acoustic, syntactic, and semantic plausibilities (Wermter & Weber, 1996a).
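
A minimal sketch of this pruning step (function and variable names are our own illustration; the real weighting and scores live in the speech construction part) combines the three equally weighted plausibilities and keeps the n best sequences:

import heapq

def plausibility(acoustic, syntactic, semantic):
    # All three knowledge sources are weighted equally.
    return (acoustic + syntactic + semantic) / 3.0

def prune(hypotheses, n_best):
    # hypotheses: (word sequence, acoustic, syntactic, semantic) tuples.
    scored = [(plausibility(a, sy, se), words)
              for words, a, sy, se in hypotheses]
    return heapq.nlargest(n_best, scored)

hyps = [(["on", "the", "sixth"], 0.9, 0.8, 0.7),
        (["on", "a", "sixth"],   0.8, 0.6, 0.5),
        (["on", "the", "six"],   0.9, 0.4, 0.3)]
for score, words in prune(hyps, n_best=2):
    print(f"{score:.2f}", " ".join(words))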

6.3 SCREEN's Network Performance and Why the Networks Yield High Accuracy with Little Training

For evaluating the performance of SCREEN's categorization part on the meeting corpus, we first show the percentages of correctly classified words for the most important categorization networks: BAS-SYN-DIS, BAS-SEM-DIS, ABS-SYN-CAT, ABS-SEM-CAT, PHRASE-START. There were 184 turns in this corpus with 314 utterances and 2355 words. 1/3 of the 2355 words and 184 turns was used for training, 2/3 for testing. Usually more data is used for training than for testing. In preliminary earlier experiments we had used 2/3 for training and 1/3 for testing. However, the performance on the unknown test set was similar for the 1/3 training set and 2/3 test set. Therefore, we used more testing than training data, since we were more interested in the generalization performance for unknown instances in the test set than in the training performance for known instances.

At first sight, this might seem to be relatively little data for training. While statistical techniques and information retrieval techniques often work on large texts and individual lexical word items, we need much less material to reach a reasonable performance, since we work on syntactic and semantic representations rather than on the words. We would like to stress that we use the syntactic and semantic category representations of 2355 words for training and testing rather than the lexical words themselves. Therefore, the category representation requires much less training data than a lexical word representation would have required. As a side effect, training time was also reduced for the 1/3 training set, while keeping the same performance on the 2/3 test set. That is, for training we used category representations from 64 dialog turns, for testing generalization the category representations from the remaining 120 dialog turns.

Module Accuracy on test set
BAS-SYN-DIS 89%
BAS-SEM-DIS 86%
ABS-SYN-CAT 84%
ABS-SEM-CAT 83%
PHRASE-START 90%
WORD-ERROR 94%
PHRASE-ERROR 98%

Table 5: Performance of the individual networks on the test set of the meeting corpus

Table 5 shows the test results for individual networks on the unknown test set. These networks were trained for 3000 epochs with a learning rate of 0.001 and 14 hidden units. This configuration had provided the best performance for most of the network architectures. In general we tested network architectures with 7 to 28 hidden units and learning rates from 0.1 to 0.0001. As learning rule we used the generalized delta rule (Rumelhart et al., 1986). An assigned output category representation for a word was counted as correct if the category with the maximum activation was the desired category.
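
This correctness criterion can be sketched as follows (with made-up activation values; the real networks output one unit per category):

import numpy as np

outputs = np.array([[0.1, 0.7, 0.2],   # maximum at category 1
                    [0.6, 0.3, 0.1],   # maximum at category 0
                    [0.2, 0.3, 0.5]])  # maximum at category 2
targets = np.array([1, 0, 1])          # desired categories

# A word is counted as correct if the category unit with the
# maximum activation is the desired category.
correct = outputs.argmax(axis=1) == targets
print(f"accuracy: {correct.mean():.0%}")  # 2 of 3 -> 67%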

The performance for the basic syntactic disambiguation was 89% on the unknown test set. Current syntactic (text) taggers can reach up to about 95% accuracy on texts. However, there is a big difference between text and speech parsing due to the spontaneous noise in spoken language. The interjections, pauses, repetitions, repairs, new starts, and more ``ungrammatical'' syntactic varieties in our spoken-language domain are the reasons why the typical accuracy of other syntactic text taggers has not been reached.

On the other hand, we see 86% accuracy for the basic semantic disambiguation, which is relatively high for semantics. So there is some evidence that the noisy ``ungrammatical'' variety of spoken language hurts syntactic more than semantic classification. Due to the domain dependence of semantic classifications it is more difficult to compare and explain semantic performance. However, in a different study within the domain of railway interactions we reached a similar performance (for details see Section 6.6). In all our experiments syntactic results were better than semantic results, indicating that the syntactic classification was easier to learn and generalize. Furthermore, our syntactic results were close to 90% for noisy spoken language, which we consider to be very good in comparison to 95% for more regular text language.

The performance for the abstract categories is somewhat lower than for the basic categories, since the evaluation at each word introduces some unavoidable errors. For instance, after ``in'' the network cannot yet know whether a time or a location will follow, but has to make an early decision nevertheless. In general, the networks perform relatively well on this difficult real-world corpus, given that we did not eliminate any sentence for any reason and took all the spontaneous sentences as they had been spoken.

Furthermore, we use transcripts of spontaneous language for training in the domain of meeting arrangements. Most utterances are questions and answers about dates and locations. This restricts the potential syntactic and semantic constructions, and we certainly benefit from the restricted domain. Furthermore, while some mappings are ambiguous for learning (e.g., a noun can be part of a noun group or a prepositional group), other mappings are relatively unambiguous (e.g., a verb is part of a verb group). We would not expect the same performance on mixed arbitrary domains, like random spoken sentences about various topics from passers-by in the city. However, somewhat more restricted domains can be learned in a promising manner (for a transfer to a different domain see Section 6.6). So there is some evidence that simple recurrent networks can provide good performance using small training sets from a restricted domain.

6.4 SCREEN's Overall Output Performance

While we just described the individual network performance, we will now focus on the performance of the running system. The performance of the running SCREEN system has to differ from the performance of the individual networks for a number of reasons. First, the individual networks are trained separately in order to support a modular architecture. In the running SCREEN system, however, connectionist networks receive their input from other underlying networks. Therefore, the actual input to a connectionist network in the running SCREEN system may differ from the original training and test sets. Second, the spoken sentences may contain errors like interjections or word repairs. These have to be part of the individual network training, but the running SCREEN system is able to detect and correct certain interjections, word corrections, and phrase corrections. Therefore, system and network performance differ at such disfluencies. Third, if we want to evaluate the performance of abstract semantic and abstract syntactic categorization, we are particularly interested in certain sentence parts. For abstract syntactic categorization, e.g., the detection of a prepositional phrase, we have to consider that the beginning of a phrase with its significant function word, e.g., a preposition, is the most important location for syntactic categorization. In contrast, for abstract semantic categorization the content word at the end of a phrase group, directly before the next phrase start, is most important.
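
As an illustration of these evaluation positions (with made-up words and phrase boundaries), syntactic decisions are scored at phrase starts and semantic decisions at the last word of each phrase group:

words = ["on", "the", "sixth", "of", "april", "i", "am", "out"]
phrase_starts = [0, 5, 6]               # e.g. phrase-group beginnings

# Function words at phrase starts carry the syntactic decision; the
# content word directly before the next phrase start (or the last
# word of the utterance) carries the semantic decision.
syntactic_positions = phrase_starts
semantic_positions = [s - 1 for s in phrase_starts[1:]] + [len(words) - 1]
print([words[i] for i in syntactic_positions])  # ['on', 'i', 'am']
print([words[i] for i in semantic_positions])   # ['april', 'i', 'out']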

Correct flat syntactic output representation 74%
Correct flat semantic output representation 72%

Table 6: Overall syntactic and semantic accuracy of the running SCREEN system on the unknown test set of the meeting corpus

As we should expect based on the explanation in the previous paragraph, the overall accuracy of the output of the complete running system is lower than the performance of the individual modules. Table 6 shows the overall syntactic and semantic phrase accuracy of the running SCREEN system: 74% of all assigned syntactic phrase representations of the unknown test set are correct, as are 72% of all assigned semantic phrase representations. The performance drop can be partially explained by the more uncertain input from other underlying networks, which are themselves influenced by other networks. On the other hand, in some cases the decisions of different modules (e.g., the three modules for lexical, syntactic, and semantic category equality of two words) can be combined in order to clean up some errors (e.g., a wrong decision by one single module). In general, given that the 120 dialog turns of the test set were completely unrestricted, unknown, real-world, spontaneous language turns, we believe that the overall performance is quite promising.

6.5 SCREEN's Overall Performance for an Incomplete Lexicon

One important property of SCREEN is its robustness. Therefore, it is an interesting question how SCREEN behaves if it receives only incomplete input from its lexicon. Such situations are realistic, since speakers may use new words which a speech recognizer has not seen before. Furthermore, this lets us test the robustness of our techniques. While standard context-free parsers usually cannot provide an analysis if words are missing from the lexicon, SCREEN does not break on missing input representations, although of course we have to expect the overall classification performance to drop if less reliable input is provided.

In order to test such a situation under the controlled influence of removing items from the lexicon, we first tested a scenario where we randomly eliminated 5% of the syntactic and semantic lexicon representations. If a word was unknown, SCREEN used a single syntactic and a single semantic average default vector instead. This average default vector contained the normalized frequency of each syntactic or semantic category, respectively, across the lexicon.
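
A small sketch of how such an average default vector can be computed (toy lexicon and categories; SCREEN's actual inventories are larger):

from collections import Counter

CATS = ["NOUN", "VERB", "ADJ", "PREP"]
lexicon = {"meeting": "NOUN", "visit": "VERB", "short": "ADJ",
           "on": "PREP", "appointment": "NOUN"}

# The normalized frequency of each category across the lexicon
# serves as the default representation for unknown words.
counts = Counter(lexicon.values())
total = sum(counts.values())
default_vector = [counts[c] / total for c in CATS]
print(dict(zip(CATS, default_vector)))  # NOUN gets 2/5 = 0.4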

Correct flat syntactic output representation 72%
Correct flat semantic output representation 67%

Table 7: Overall syntactic and semantic accuracy of the running SCREEN system for the meeting corpus on the unknown test set after 5% of all lexicon entries were eliminated

Even with 5% of all lexicon entries removed, all utterances could still be analyzed. So SCREEN does not break for missing word representations but attempts to provide an analysis that is as good as possible. As expected, Table 7 shows a performance drop for the overall syntactic and semantic accuracy. However, compared to the 74% and 72% performance for the complete lexicon (see Table 6), we still find that 72% of the syntactic output representations and 67% of the semantic output representations are correct after eliminating 5% of all lexicon entries.

Correct flat syntactic output representation 70%
Correct flat semantic output representation 67%

Table 8: Overall syntactic and semantic accuracy of the running SCREEN system for the meeting corpus on the unknown test set after 10% of all lexicon entries were eliminated

In another experiment we eliminated 10% of all syntactic and semantic lexicon entries. In this case, the syntactic accuracy was still 70% and the semantic accuracy was 67%. Eliminating 10% of the lexicon led to a syntactic accuracy reduction of only 4 percentage points (74% versus 70%) and a semantic accuracy reduction of 5 percentage points (72% versus 67%). In general, in all our experiments the accuracy reduction was much smaller than the percentage of eliminated lexicon entries, demonstrating SCREEN's robustness when working with an incomplete lexicon.

6.6 Comparison with the Results in a New Different Domain

In order to put the performance of our techniques into perspective, we also show results from experiments with a different corpus, the spoken Regensburg Train Corpus. It is not our intention to describe the experiments in this domain at the same level of detail as the experiments on our Blaubeuren Meeting Corpus. However, we provide a summary as a point of reference, offering an additional way to judge our results for the meeting corpus.

As a different domain we chose 176 dialog turns at a railway counter. People ask questions and receive answers about train connections. A typical utterance is: ``Yes I need eh a a sleeping car PAUSE from PAUSE Regensburg to Hamburg''. We used exactly the same SCREEN architecture to process spoken utterances from this domain: 1/3 of the dialog turns was used for training, 2/3 for testing on unseen unknown utterances. For syntactic processing, we even used exactly the same network structure, since we did not expect many syntactic differences between the two domains. Only for semantic processing did we retrain the semantic networks. Different categories had to be used for semantic classification, in particular for actions. While actions about meetings (e.g., visit, meet) were predominant in the meeting corpus, actions about selecting connections (e.g., choose, select) were important in the train corpus (Wermter & Weber, 1996b). To give the reader an impression of the portability of SCREEN, we estimate that 90% of the original human effort (system architecture, networks) could be reused in this new domain. Most of the remaining 10% was needed for the necessary new semantic tagging and training in the new domain.

Module Accuracy on test set
BAS-SYN-DIS 93%
BAS-SEM-DIS 84%
ABS-SYN-CAT 85%
ABS-SEM-CAT 77%
PHRASE-START 89%
WORD-ERROR 94%
PHRASE-ERROR 98%

Table 9: Performance of the individual networks on the test set in the train corpus

Table 9 shows the performance on the test set of the train corpus. If we compare our results on the meeting corpus (Table 5) with these results on the train corpus, we see in particular that the abstract syntactic processing is almost the same (84% in Table 5 compared to 85% in Table 9), but the abstract semantic processing is better in the meeting corpus (83% in Table 5 compared to 77% in Table 9). The modules dealing with explicit robustness for repairs (phrase start, word repair errors, phrase repair errors) show almost the same performance (90% vs. 89%, 94% vs. 94%, 98% vs. 98%).

Correct flat syntactic output representation 76%
Correct flat semantic output representation 64%

Table 10: Overall syntactic and semantic accuracy of the running SCREEN system on the unknown test set of a different train corpus

As a comparison, we summarize here the overall performance for this different train domain. Table 10 shows that SCREEN has about the same syntactic performance in the two domains (compare with Table 6). So in this different domain we can essentially confirm our previous results for syntactic processing performance (74% vs. 76%). However, semantic processing appears to be harder in the train domain, since the performance of 64% is lower than the 72% in the meeting domain. Semantic processing, tagging, or classification is often found to be much harder than syntactic processing in general, so the difference is still within the range of usual performance differences between syntax and semantics. Since semantic categories like agents, locations, and time expressions are about the same in these two domains, the more difficult action categorization is mainly responsible for this difference in semantic performance between the two domains.

In general, the transfer from one domain to another requires only a limited amount of hand-modeling. Of course, syntactic and semantic categories have to be specified for the lexicon and the transcripts. These syntactically and semantically tagged transcript sentences are the direct basis for generating the training sets for the networks. Generating these training sets is the main manual effort in transferring the system to a new domain. After the training sets have been generated, the training of the networks can proceed automatically. The training of a typical single recurrent network takes on the order of a few hours. So much less manual work is required than for transferring a standard symbolic parser to a new domain and writing a new syntactic and semantic grammar.

6.7 An Illustrative Comparison Argument Based on a Symbolic Parser

We have made the point that SCREEN's learned flat representations are more robust than hand-coded deeply structured representations. Here we would like to elaborate this point with an illustrative argument. Consider two variations of sentence hypotheses from a speech recognizer in Figure 19: 1. a correct sentence hypothesis: ``Am sechsten April bin ich außer Hause'' (``On 6th April am I out of home''), and 2. a partially incorrect sentence hypothesis: ``Am sechsten April ich ich außer Hause'' (``On 6th April I I out of home''). Focusing on the syntactic analysis, we used an existing chart parser and an existing grammar which had been used extensively for other real-world parsing up to the sentence level (Wermter, 1995). The only significant adaptation necessary was the addition of a rule NG -> U for pronouns, which had not been part of the original grammar. This rule states that a pronoun U (e.g., ``I'') can be a noun group (NG).

If we run the first sentence hypothesis through the symbolic context-free parser, we receive the desired syntactic analysis shown in Figure 19; but if we run the second, slightly incorrect sentence hypothesis through the parser, we do not receive any analysis. (The syntactic category abbreviations in Figure 19 are used in the same manner as throughout the paper (see Tables 1-4); furthermore, as usual, ``S'' stands for sentence, ``ADJG'' for adjective group, ``NP'' for complex nominal phrase, and ``VP'' for verb phrase. The literal English translations are shown in brackets.)

 
1. Input: AM SECHSTEN APRIL BIN ICH AUßER HAUSE ->

(ON 6th APRIL AM I OUT_OF HOME)

1. Output:

[Parse chart figure: syntactic tree for the first hypothesis]
 
2. Input: AM SECHSTEN APRIL ICH ICH AUßER HAUS ->

(ON 6th APRIL I I OUT_OF HOME)

2. Output: NIL (NO ANALYSIS POSSIBLE)



Figure 19: Two sentence hypotheses from a speech recognizer. The first hypothesis can be analyzed, the second partially incorrect hypothesis cannot be analyzed anymore by the symbolic parser.

The reason why the second sentence hypothesis could not be parsed by the context-free chart parser is that the speech recognizer generated incorrect output: there is no verb in the second sentence hypothesis and there is an additional pronoun ``I''. Such mistakes occur rather frequently due to the imperfection of current speech recognition technology. Of course, one could argue that the grammar should be relaxed and made more flexible to deal with such mistakes. However, the more rules for fault detection are integrated into the grammar or the parser, the more complicated the grammar or the parser becomes. Even more important, it is impossible to predict all possible mistakes and integrate them into a symbolic context-free grammar. Finally, relaxing the grammar with explicit specific rules for dealing with mistakes might also lead to additional mistakes, because the grammar then has to be extremely underspecified.
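
To make this brittleness argument concrete, here is a toy CYK recognizer with a hand-built grammar (our own illustration, not the actual chart parser or grammar from Wermter, 1995; umlauts simplified to ASCII); the unary table plays the role of the added NG -> U rule, and the hypothesis without a verb yields no analysis at all:

grammar = {                         # binary rules: (B, C) -> A
    ("PP", "S1"): "S", ("P", "NP"): "PP", ("ADJ", "N"): "NP",
    ("V", "NP"): "S1", ("PRON", "PP"): "NP",
}
lexicon = {"am": "P", "sechsten": "ADJ", "april": "N", "bin": "V",
           "ich": "PRON", "ausser": "P", "hause": "N"}
unary = {"N": "NP", "PRON": "NP"}   # like NG -> U: a pronoun is an NP

def parse(words):
    n = len(words)
    # chart[i][j] holds all categories spanning words[i..j]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        cat = lexicon[w]
        chart[i][i] = {cat} | ({unary[cat]} if cat in unary else set())
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for b in chart[i][k]:
                    for c in chart[k + 1][j]:
                        if (b, c) in grammar:
                            chart[i][j].add(grammar[(b, c)])
    return "S" in chart[0][n - 1]

print(parse("am sechsten april bin ich ausser hause".split()))  # True
print(parse("am sechsten april ich ich ausser hause".split()))  # False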

As we have shown, for instance in Figure 17, SCREEN does not have problems dealing with such speech recognizer variations and mistakes. The main difference between a standard context-free symbolic chart parser analysis and SCREEN's analysis is that SCREEN has learned to provide a flat analysis under noisy conditions, while the context-free parser has been hand-coded to provide a more structural analysis. It should be emphasized that we do not make an argument against structural representations per se. The more structure that can be provided the better, particularly for tasks which require structured world knowledge. However, if robustness is a major concern, as it is for lower syntactic and semantic spoken-language analysis, a learned flat analysis provides more robustness.

6.8 Comparisons with Related Hybrid Systems

Recently, connectionist networks have received a lot of attention as computational learning mechanisms for written language processing (Reilly & Sharkey, 1992; Miikkulainen, 1993; Feldman, 1993; Barnden & Holyoak, 1994; Wermter, 1995). In this paper, however, we have focused on the examination of hybrid connectionist techniques for spoken language processing. In most previous approaches to speech/language processing, the processing was sequential. That is, one module, like the speech recognizer or the syntactic analyzer, completed its work before the next module, like a semantic analyzer, started to work. In contrast, SCREEN works incrementally, which allows the system (1) to have modules running in parallel, (2) to integrate knowledge sources very early, and (3) to compute the analysis in a manner more similar to humans, since humans start to process sentences before they are complete.

We will now compare our approach with related work and systems. A head-to-head comparison with a different system is difficult because of different computing environments and the question whether systems can be accessed and adapted easily for the same input. Furthermore, different systems are typically used for different purposes with different language corpora, grammars, rules, etc. However, we have made an extensive effort to provide a fair conceptual comparison.

PARSEC (Jain, 1991) is a hybrid connectionist system which is embedded in the larger speech translation effort JANUS (Waibel et al., 1992). The input for PARSEC consists of sentences, the output of case role representations. The system consists of several connectionist modules with associated symbolic transformation rules for carrying out transformations suggested by the connectionist networks. While it is PARSEC's philosophy to use connectionist networks for triggering symbolic transformations, SCREEN uses connectionist networks for the transformations themselves. It is SCREEN's philosophy to use connectionist networks wherever possible and symbolic rules only where necessary.

We found symbolic processing particularly useful for simple known tests (like lexical equality) and for complex control tasks of the whole system (when does a module communicate with which other module). Much of the actual transformational work can be done by trained connectionist networks. This is in contrast to the design philosophy in PARSEC, where connectionist modules provide control knowledge about which transformation should be performed; the selected transformation is then actually carried out by a symbolic procedure. So SCREEN uses connectionist modules for transformations and symbolic control, while PARSEC uses connectionist modules for control and symbolic procedures for the transformations.

Different from SCREEN, PARSEC receives sentence hypotheses either as sentence transcripts or as N-best hypotheses from the JANUS system. Our approach receives incremental word hypotheses which are used in the speech construction part to build sentence hypotheses. This part is also used to prune the hypothesis space and to determine the best sentence hypotheses. So during the flat analysis in SCREEN the semantic and syntactic plausibilities of a partial sentence hypothesis can still influence which partial sentence hypotheses are processed.

For both PARSEC and SCREEN a modular architecture was tested, which has the advantage that each connectionist module only has to learn a relatively easy subtask. In contrast to the experience in the development of PARSEC, we found that modularity requires less training time. Furthermore, some modules in SCREEN are able to work independently of each other and in parallel. In addition to syntactic and semantic knowledge, PARSEC can make use of prosodic knowledge, while SCREEN currently does not use prosodic hints. On the other hand, SCREEN contains modules for learning dialog act assignment, while such modules are currently not part of PARSEC. Learning dialog act processing is important for determining the intended meaning of an utterance (Wermter & Löchel, 1996).

Recent further extensions based on PARSEC provide more structure and use annotated linguistic features (Buø et al., 1994). The authors state that they ``implemented (based on PARSEC) a connectionist system'' which is intended to approximate a shift-reduce parser. This connectionist shift-reduce parser differs substantially from the original PARSEC architecture; we will refer to it as the ``PARSEC extension''. This PARSEC extension labels a complete sentence with its first-level categories. These first-level categories are input again to the same network in order to provide second-level categories for the complete sentence, and so on, until at the highest level the sentence symbol can be added.

Using this recursion step, the PARSEC extension can provide deeper and more structural interpretations than SCREEN currently does. However, this recursion step and the construction of the structure also have their price. First, labels like NP for a noun phrase have to be defined as lexical items in the lexicon. Second, and more important, the complete utterance is labeled with the n-th level categories before processing of the (n+1)-th level categories starts. Therefore several passes through the utterance are necessary (e.g., 7 for the utterance ``his big brother loved himself''). This means that this recent PARSEC extension is more powerful than SCREEN and the original PARSEC system by Jain with respect to providing deeper and more structural interpretations. At the same time, however, this PARSEC extension loses the ability to process utterances in an incremental manner. Incrementality is a very important property in spoken-language processing and in SCREEN. Besides the fact that humans process language in an incremental left-to-right manner, incrementality also allows SCREEN to prune the search space of incoming word hypotheses very early.

Comparing PARSEC and SCREEN, PARSEC aims more at supporting symbolic rules by using symbolic transformations (triggered by connectionist networks) and by integrating linguistic features. Currently, the linguistic features in the recent PARSEC extension (Buø et al., 1994) provide more structural and morphological knowledge than SCREEN does. Therefore, it currently appears to be easier to integrate the PARSEC extension into larger systems of high-level linguistic processing. In fact, PARSEC has been used in the context of the JANUS framework. On the other hand, SCREEN aims more at robust and incremental processing by using a word hypothesis space, specific repair modules, and flatter representations. In particular, SCREEN emphasizes the robustness of spoken-language processing, since it contains explicit repair mechanisms and implicit robustness. Explicit robustness covers frequently occurring errors (interjections, pauses, word and phrase repairs) in explicit modules, while other, less predictable types of errors are only covered by the implicit similarity-based robustness of the connectionist networks themselves. In general, the representations generated by the PARSEC extension provide better support for deeper structures than SCREEN, but SCREEN provides better support for incremental robust processing. In a more recent extension based on PARSEC, called FeasPar, the overall parsing performance was a syntactic and semantic feature accuracy of 33.8%. Although additional improvements can be obtained by using subsequent search techniques on the parsing results, we did not consider such subsequent search techniques for better parses, since they would violate incremental processing (Buø, 1996). Without using subsequent search techniques, SCREEN reaches an overall semantic and syntactic accuracy between 72% and 74%, as shown in Table 6. However, it should be pointed out that SCREEN and FeasPar use different input sentences, features, and architectures.

Besides PARSEC, the BeRP and TRAINS systems also focus on hybrid spoken-language processing. BeRP (Berkeley Restaurant Project) is a current project which employs multiple different representations for speech/language analysis (Wooters, 1993; Jurafsky et al., 1994a; Jurafsky et al., 1994b). The task of BeRP is to act as a knowledge consultant that gives advice about choosing restaurants. There are different components in BeRP: the feature extractor receives digitized acoustic data and extracts features. These features are used in the connectionist phonetic probability estimation. The output of this connectionist feedforward network is used in a Viterbi decoder which uses a multiple-pronunciation lexicon and different language models (e.g., bigrams, hand-coded grammar rules). The output of the decoder consists of word strings which are transformed into database queries by a stochastic chart parser. Finally, a dialog manager controls the dialog with the user and can ask questions.

BeRP and SCREEN have in common the ability to deal with errors from humans and from the speech recognizer, as well as a relatively flat analysis. However, to reach this robustness BeRP uses a probabilistic chart parser to compute all possible fragments first. Then an additional fragment combination algorithm is used for combining these fragments so that they cover the greatest number of input words. Different from this sequential process of first computing all fragments of an utterance and then combining the fragments, SCREEN uses incremental processing and always provides the best possible interpretation. In this sense SCREEN's language analysis is weaker but more general: SCREEN's analysis will never break and will produce the best possible interpretation for all noisy utterances. This strategy may be particularly useful for incremental translation. On the other hand, BeRP's language analysis is stronger but more restricted: BeRP's analysis may stop at the fragment level if there are contradictory fragments. This strategy may be particularly useful for question answering, where additional world knowledge is necessary and available.

TRAINS is a related spoken-language project for building a planning assistant that can reason about time, actions, and events (Allen, 1995; Allen et al., 1995). Because of this goal of building a general framework for natural language processing and planning for train scheduling, TRAINS needs a lot of commonsense knowledge. In the scenario, a person interacts with the system in order to find solutions for train scheduling in a cooperative manner. The person is assumed to know more about the goals of the scheduling, while the system is supposed to have the details of the domain. The utterance of a person is parsed by a syntactic and semantic parser. Further linguistic reasoning is carried out by modules for scoping and reference resolution. After the linguistic reasoning, conversation acts are determined by a system dialog manager, and responses are generated by a template-driven natural language generator. Performance phenomena in spoken language like repairs and false starts can already be dealt with (Heeman & Allen, 1994b; Heeman & Allen, 1994a). Compared to SCREEN, the TRAINS project focuses more on processing spoken language at an in-depth planning level. While SCREEN primarily uses a flat connectionist language analysis, TRAINS uses a chart parser with a generalized phrase structure grammar.


