Improving ME accuracy

Our main goal is to find a method that will automatically obtain the best feature selection (Veenstra2000; MihalceaCOLING2002; suarezCOLING2002) from the training data. We performed a 3-fold cross-validation process: the data are divided into three folds, and three tests are run, each one using two folds as training data and the remaining fold as testing data. The final result is the average accuracy. We decided on just three tests because of the small size of the training data. We then tested several combinations of features over the training data of the SENSEVAL-2 Spanish lexical-sample task and analyzed the results obtained for each word.


Table: Three-fold Cross-Validation Results on SENSEVAL-2 Spanish Training Data: Best Averaged Accuracies per Word

 

Word          Features     Accur  MFS      Word         Features     Accur  MFS
autoridad,N   sbcp         0.589  0.503    clavar,V     sbcprdk3     0.561  0.449
bomba,N       0LWSBCk5     0.762  0.707    conducir,V   LWsBCPD      0.534  0.358
canal,N       sbcprdk3     0.579  0.307    copiar,V     0sbcprdk3    0.457  0.338
circuito,N    0LWSBCk5     0.536  0.392    coronar,V    sk5          0.698  0.327
corazón,N     0Sbcpk5      0.781  0.607    explotar,V   0LWSBCk5     0.593  0.318
corona,N      sbcp         0.722  0.489    saltar,V     LWsBC        0.403  0.132
gracia,N      0sk5         0.634  0.295    tocar,V      0sbcprdk3    0.583  0.313
grano,N       0LWSBCr      0.681  0.483    tratar,V     sbcpk5       0.527  0.208
hermano,N     0Sprd        0.731  0.602    usar,V       0Sprd        0.732  0.669
masa,N        LWSBCk5      0.756  0.455    vencer,V     sbcprdk3     0.696  0.618
naturaleza,N  sbcprdk3     0.527  0.424    brillante,A  sbcprdk3     0.756  0.512
operación,N   0LWSBCk5     0.543  0.377    ciego,A      0spdk5       0.812  0.565
órgano,N      0LWSBCPDk5   0.715  0.515    claro,A      0Sprd        0.919  0.854
partido,N     0LWSBCk5     0.839  0.524    local,A      0LWSBCr      0.798  0.750
pasaje,N      sk5          0.685  0.451    natural,A    sbcprdk10    0.471  0.267
programa,N    0LWSBCr      0.587  0.486    popular,A    sbcprdk10    0.865  0.632
tabla,N       sk5          0.663  0.488    simple,A     LWsBCPD      0.776  0.621
actuar,V      sk5          0.514  0.293    verde,A      LWSBCk5      0.601  0.317
apoyar,V      0sbcprdk3    0.730  0.635    vital,A      Sbcp         0.774  0.441
apuntar,V     0LWsBCPDk5   0.661  0.478
     


In order to perform the 3-fold cross-validation process on each word, some preprocessing of the corpus was done. For each word, all senses were uniformly distributed into the three folds (each fold contains one-third of the examples of each sense). Those senses that had fewer than three examples in the original corpus file were rejected and not processed.
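The fold construction and the cross-validation loop described above can be summarized in a short sketch. The following Python fragment is a minimal illustration, not the original implementation; the names examples_by_sense and train_and_test are assumptions made for the example (a mapping from sense labels to annotated contexts, and a routine that trains an ME classifier and returns its test accuracy).

    from statistics import mean

    FOLDS = 3  # three folds, as described above

    def make_folds(examples_by_sense, n_folds=FOLDS):
        # Distribute the examples of each sense uniformly over the folds so that
        # every fold receives roughly one-third of the examples of every sense.
        # Senses with fewer than n_folds examples are rejected, as described above.
        folds = [[] for _ in range(n_folds)]
        for sense, examples in examples_by_sense.items():
            if len(examples) < n_folds:
                continue  # rejected sense
            for i, context in enumerate(examples):
                folds[i % n_folds].append((context, sense))
        return folds

    def cross_validate(examples_by_sense, train_and_test, n_folds=FOLDS):
        # train_and_test(train, test) is assumed to train an ME classifier on
        # `train` and return its accuracy on `test`.
        folds = make_folds(examples_by_sense, n_folds)
        scores = []
        for i, test_fold in enumerate(folds):
            train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
            scores.append(train_and_test(train, test_fold))
        return mean(scores)  # final result: the average accuracy over the folds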

Table 8 shows the best results obtained using three-fold cross-validation on the training data. Several feature combinations were tested in order to find the best set for each selected word. The purpose was to obtain the most relevant information for each word from the corpus rather than applying the same combination of features to all of them. Therefore, the Features column lists only the feature selection with the best result; the string in each row represents the entire set of features used when training that word's classifier. For example, autoridad obtains its best result using nearest words, collocations of two lemmas, collocations of two words, and POS information, that is, the $s$, $b$, $c$, and $p$ features, respectively (see Figure 9). The Accur column (for ``accuracy'') shows the number of correctly classified contexts divided by the total number of contexts (because ME always assigns a sense to every context, precision equals recall). The MFS column shows the accuracy obtained when the most frequent sense is always selected.
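
In other words, the value reported in the Accur column is simply

$Accuracy = \frac{\mbox{correctly classified contexts}}{\mbox{total contexts}} = Precision = Recall,$

since the ME classifier returns an answer for every context.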

The data summarized in Table 8 reveal that using ``collapsed'' features in the ME method is useful; both ``collapsed'' and ``non-collapsed'' functions are used, even for the same word. For example, the adjective vital obtains its best result with ``$Sbcp$'' (the ``collapsed'' version of words in a $(-3..+3)$ window, collocations of two lemmas and of two words in a $(-2..+2)$ window, and POS labels, also in a $(-3..+3)$ window); from this we can infer that single-word information is less important than collocations for disambiguating vital correctly.

The target word itself (feature 0) is useful for nouns, verbs, and adjectives, but many words do not include it in their best feature selection; in general, these words do not have a relevant relationship between word shape and senses. On the other hand, POS information (the $p$ and $P$ features) is selected less often. When comparing $lemma$ features with $word$ features (e.g., $L$ versus $W$, and $B$ versus $C$), they are complementary in the majority of cases. Grammatical relationships ($r$ features) and word-word dependencies ($d$ and $D$ features) also seem very useful when combined with other types of attributes. Moreover, keyword features ($km$) are used very often, possibly because of the source and size of the contexts in the SENSEVAL-2 Spanish lexical-sample data.

Table 9 shows the best feature selections for each part of speech and for all words together. The data presented in Tables 8 and 9 were used to build four different sets of classifiers in order to compare their accuracy: MEfix uses the overall best feature selection for all words; MEbfs trains each word with its best selection of features (in Table 8); MEbfs.pos uses the best selection per POS for all nouns, verbs, and adjectives, respectively (in Table 9); and, finally, vME is a majority voting system that takes as input the answers of the three preceding systems (a sketch of the voting scheme is given below).
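
As an illustration of the voting scheme, the fragment below sketches how vME could combine the three answers for a given context. The tie-breaking rule (falling back to the MEbfs.pos answer when the three systems all disagree) is an assumption made for this sketch and is not stated in the text.

    from collections import Counter

    def vme_answer(mefix_sense, mebfs_sense, mebfs_pos_sense):
        # Majority vote among the three ME classifiers: the sense proposed by
        # at least two systems wins.  The fallback to the MEbfs.pos answer when
        # all three answers differ is an assumption for this sketch only.
        votes = Counter([mefix_sense, mebfs_sense, mebfs_pos_sense])
        sense, count = votes.most_common(1)[0]
        return sense if count >= 2 else mebfs_pos_sense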


Table: Three-fold Cross-Validation Results on SENSEVAL-2 Spanish Training Data: Best Averaged Accuracies per POS

POS          Acc    Features    System
Nouns        0.620  LWSBCk5     MEbfs.pos
Verbs        0.559  sbcprdk3    MEbfs.pos
Adjectives   0.726  0spdk5      MEbfs.pos
ALL          0.615  sbcprdk3    MEfix


Table 10 shows a comparison of the four systems. MEfix has the lowest results; this classifier applies the same set of feature types to all words. However, the best feature selection per word (MEbfs) is not the best strategy either, probably because more training examples would be necessary. The best choice seems to be selecting a fixed set of feature types for each POS (MEbfs.pos).


Table: Evaluation of ME Systems

ALL                Nouns
0.677  MEbfs.pos   0.683  MEbfs.pos
0.676  vME         0.678  vME
0.667  MEbfs       0.661  MEbfs
0.658  MEfix       0.646  MEfix

Verbs              Adjectives
0.583  vME         0.774  vME
0.583  MEbfs.pos   0.772  MEbfs.pos
0.583  MEfix       0.771  MEbfs
0.580  MEbfs       0.756  MEfix

MEfix: sbcprdk3 for all words
MEbfs: each word with its best feature selection
MEbfs.pos: LWSBCk5 for nouns, sbcprdk3 for verbs, and 0spdk5 for adjectives
vME: majority voting between MEfix, MEbfs.pos, and MEbfs


While MEbfs predicts, for each word over the training data, which individually selected features could be the best ones when evaluated on the testing data, MEbfs.pos is an averaged prediction: a selection of features that, over the training data, performed a ``good enough'' disambiguation of the majority of words belonging to a particular POS. When this averaged prediction is applied to the real testing data, MEbfs.pos performs better than MEbfs.

Another important finding is that MEbfs.pos obtains an accuracy slightly better than the best possible evaluation result achieved with ME (see Table 7); that is, a best-feature-selection-per-POS strategy derived from the training data guarantees an improvement in ME-based WSD.

In general, verbs are difficult to learn, and the accuracy of the method for them is lower than for the other POS; in our opinion, more information (knowledge-based, perhaps) is needed to build their classifiers. In this case, the voting system (vME), which is based on the agreement between the other three systems, does not improve accuracy.

Finally, in Table 11, the results of the ME method are compared with those of the systems that competed in the SENSEVAL-2 Spanish lexical-sample task (see footnote 5). The results obtained by the ME systems are excellent for nouns and adjectives, but not for verbs. However, over all POS, the ME systems perform comparably to the best SENSEVAL-2 systems.


Table: Comparison with the Spanish SENSEVAL-2 systems

ALL                 Nouns               Verbs               Adjectives
0.713  jhu(R)       0.702  jhu(R)       0.643  jhu(R)       0.802  jhu(R)
0.682  jhu          0.683  MEbfs.pos    0.609  jhu          0.774  vME
0.677  MEbfs.pos    0.681  jhu          0.595  css244       0.772  MEbfs.pos
0.676  vME          0.678  vME          0.584  umd-sst      0.772  css244
0.670  css244       0.661  MEbfs        0.583  vME          0.771  MEbfs
0.667  MEbfs        0.652  css244       0.583  MEbfs.pos    0.764  jhu
0.658  MEfix        0.646  MEfix        0.583  MEfix        0.756  MEfix
0.627  umd-sst      0.621  duluth 8     0.580  MEbfs        0.725  duluth 8
0.617  duluth 8     0.612  duluth Z     0.515  duluth 10    0.712  duluth 10
0.610  duluth 10    0.611  duluth 10    0.513  duluth 8     0.706  duluth 7
0.595  duluth Z     0.603  umd-sst      0.511  ua           0.703  umd-sst
0.595  duluth 7     0.592  duluth 6     0.498  duluth 7     0.689  duluth 6
0.582  duluth 6     0.590  duluth 7     0.490  duluth Z     0.689  duluth Z
0.578  duluth X     0.586  duluth X     0.478  duluth X     0.687  ua
0.560  duluth 9     0.557  duluth 9     0.477  duluth 9     0.678  duluth X
0.548  ua           0.514  duluth Y     0.474  duluth 6     0.655  duluth 9
0.524  duluth Y     0.464  ua           0.431  duluth Y     0.637  duluth Y




Footnotes

5. JHU(R) by Johns Hopkins University; CSS244 by Stanford University; UMD-SST by the University of Maryland; Duluth systems by the University of Minnesota - Duluth; UA by the University of Alicante.