Winner's Report

The KDD-Cup 2004 (knowledge discovery and data mining competition) is held in conjunction with the 10th annual ACM SIGKDD conference. The student project group PG 445, supervised by Katharina Morik and Martin Scholz, participated in this year's KDD-Cup and won an Honorable Mention for its excellent performance on the Rank-Last (RKL) metric of the Bio/Protein Task. The group's mean RKL of 45.62 was substantially better than that of the closest competitors. The results can be viewed on the KDD-Cup 2004 Homepage. The Honorable Mention certificates will be handed out at a ceremony on Sunday, August 22, at the KDD conference in Seattle.

Winner's Approach

After careful data inspection, two approaches were used: BlockNearestNeighbor and a Support Vector Machine (SVM). The combination of the two approaches led to the final results.

Data Inspection

The data inspection revealed that of the 153 blocks given in the training dataset, many contained only a small number of positive examples, while a few others contained up to 50. Therefore, the analysis of the numerical feature values was conducted separately for every block in the training set. Some simple statistical measures were used to characterize differences between blocks.
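As an illustration, the following sketch computes such per-block statistics with pandas. The column names ("block_id", "label") and the choice of mean and standard deviation as measures are assumptions; the original report does not list the exact statistics used.

```python
# A minimal sketch of the per-block inspection, assuming a training DataFrame
# with a "block_id" column, a binary "label" column (1 = positive), and
# numerical feature columns. Column names and the mean/std measures are
# assumptions, not taken from the original task files.
import pandas as pd

def block_statistics(train: pd.DataFrame) -> pd.DataFrame:
    """Characterize every block by simple statistics over its numerical features."""
    feature_cols = [c for c in train.columns if c not in ("block_id", "label")]
    rows = []
    for block_id, block in train.groupby("block_id"):
        stats = {
            "block_id": block_id,
            "n_examples": len(block),
            "n_positives": int(block["label"].sum()),
        }
        for col in feature_cols:
            stats[f"{col}_mean"] = block[col].mean()
            stats[f"{col}_std"] = block[col].std()
        rows.append(stats)
    return pd.DataFrame(rows).set_index("block_id")
```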

BlockNearestNeighbor

One possibility to account for differences between blocks is to quantify them appropriately. Using the statistical measures mentioned above, each block was transformed into a vector representation. A distance measure based on these vectors made it possible to find similar blocks via the k-nearest-neighbor method. The BlockNearestNeighbor method can be characterized by the following steps:

1. Before the whole learning process, the datasets were normalised.
2. To assess the importance of the numerical attributes, decision trees were learned for all blocks; the frequency of an attribute's occurrence served as an estimate of its relevance.
3. Using these estimates, a feature selection for the vector representation was carried out. The best results were achieved by including only attributes that occurred in at least two decision trees.
4. An appropriate value for k was determined, which was of utmost importance. Additionally, the SVM's parameter for positive vs. negative misclassification costs was optimized.
The following figure depicts the process for ranking previously unseen blocks; a code sketch of these steps is given below.
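The sketch below uses scikit-learn as a stand-in for the tools actually employed: per-block decision trees estimate attribute relevance via occurrence counts, and a nearest-neighbor search over the resulting block vectors finds the most similar training blocks. The helper names and the statistics table from the previous sketch are hypothetical.

```python
# A sketch of the BlockNearestNeighbor steps under the assumptions above.
# The two-tree occurrence threshold follows the text; k is a tuned parameter
# whose competition value is not stated, so the default here is a placeholder.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import NearestNeighbors

def relevant_attributes(train: pd.DataFrame, min_trees: int = 2) -> list:
    """Keep attributes that occur in at least min_trees per-block decision trees."""
    feature_cols = [c for c in train.columns if c not in ("block_id", "label")]
    counts = {c: 0 for c in feature_cols}
    for _, block in train.groupby("block_id"):
        tree = DecisionTreeClassifier().fit(block[feature_cols], block["label"])
        # tree_.feature holds the split attribute per node (negative = leaf)
        used = {feature_cols[i] for i in tree.tree_.feature if i >= 0}
        for col in used:
            counts[col] += 1
    return [c for c, n in counts.items() if n >= min_trees]

def nearest_blocks(block_vectors: pd.DataFrame, query_vector: np.ndarray, k: int = 5):
    """Find the k training blocks whose vectors are closest to an unseen block."""
    scaler = StandardScaler().fit(block_vectors)
    nn = NearestNeighbors(n_neighbors=k).fit(scaler.transform(block_vectors))
    _, idx = nn.kneighbors(scaler.transform(query_vector.reshape(1, -1)))
    return block_vectors.index[idx[0]]
```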

SVM Classification

SVMs with an RBF kernel have been used successfully for predicting protein homology. Therefore, SVMlight with an RBF kernel was trained on the normalised data. The function value of the SVM corresponds to the distance to the separating hyperplane: the larger the distance, the higher the SVM's confidence in its classification. For this reason, the function values were used to produce a ranking, rather than using the trained SVM models to make hard class assignments.
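A minimal sketch of this ranking step, using scikit-learn's SVC with an RBF kernel as a stand-in for SVMlight. The class-weight ratio stands in for the positive-vs-negative misclassification cost parameter; its value here is a placeholder, not the tuned competition setting.

```python
# Rank the candidates of one test block by their signed distance to the
# separating hyperplane, instead of using hard class predictions.
from sklearn.svm import SVC

def svm_ranking(X_train, y_train, X_block):
    """Return candidate indices of one block, sorted by descending SVM confidence."""
    # class_weight approximates the positive/negative cost ratio; the
    # actual tuned value is not given in the report.
    svm = SVC(kernel="rbf", class_weight={1: 10, 0: 1})
    svm.fit(X_train, y_train)
    # decision_function yields the distance to the hyperplane; larger values
    # mean more confident positive predictions.
    scores = svm.decision_function(X_block)
    return scores.argsort()[::-1]
```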

Final Model

In the final model, the predictions of both approaches were normalised. The combination of the two models yielded the final ranking for each block and a further improvement in performance.
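The report does not specify the exact normalisation or combination rule, so the following is only one plausible reading: min-max-normalise each model's scores within a block and average them to obtain the final ranking.

```python
# One plausible combination scheme, assuming each model emits one score per
# candidate within a block; the averaging rule is an assumption, not the
# group's confirmed method.
import numpy as np

def combine_scores(knn_scores: np.ndarray, svm_scores: np.ndarray) -> np.ndarray:
    """Min-max normalise both score vectors, average them, and rank descending."""
    def minmax(s: np.ndarray) -> np.ndarray:
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)
    combined = (minmax(knn_scores) + minmax(svm_scores)) / 2.0
    # final ranking: candidate indices sorted by descending combined score
    return combined.argsort()[::-1]
```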

The overall procedure for ranking new blocks is sketched in the following figure.