## Winner's Report

The KDD-Cup 2004 (knowledge discovery and data mining competition) is held in
conjunction with the 10th annual ACM SIGKDD conference. A student group (PG 445),
supervised by Katharina Morik and Martin Scholz, participated in this year's
KDD-Cup and won an Honorable Mention for its excellent performance on the
Rank-Last (RKL) metric of the Bio/Protein task. The group's mean RKL of 45.62
was much better than that of the closest competitors. The results can be viewed
on the KDD-Cup 2004 homepage. The award ceremony will be held on Sunday,
August 22, at the KDD conference in Seattle, where the Honorable Mention
certificates will be handed out.

## Winner's Approach

After careful data inspection, two approaches were used:
*BlockNearestNeighbor* and a Support Vector Machine (SVM). The combination
of the two approaches led to the final results.

#### Data Inspection

The data inspection revealed that, of the 153 blocks in the training dataset,
many contained only a small number of positive examples, while a few others
contained up to 50 positive ones. Therefore, the numerical feature values were
analyzed separately for every block in the training set. Some simple
statistical measures were used to characterize the differences between blocks.
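
As an illustration, a per-block inspection along these lines could look as
follows. This is a minimal sketch assuming pandas, with hypothetical column
names `block` and `label`; the report does not specify which statistics were
used.

```python
import pandas as pd

def inspect_blocks(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each block by simple statistics of its numerical features."""
    feature_cols = [c for c in df.columns if c not in ("block", "label")]
    rows = []
    for block_id, block in df.groupby("block"):
        stats = {"block": block_id,
                 "n_pos": int((block["label"] == 1).sum())}  # positives per block
        for col in feature_cols:
            stats[f"{col}_mean"] = block[col].mean()
            stats[f"{col}_std"] = block[col].std()
        rows.append(stats)
    return pd.DataFrame(rows)
```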

#### BlockNearestNeighbor

One possibility to account for the differences between blocks is to quantify
them appropriately. Using the statistical measures mentioned above, each block
was transformed into a vector representation. A distance measure based on
these vectors made it possible to find similar blocks with the
k-nearest-neighbor method. The *BlockNearestNeighbor* method can be
characterized by the following steps (a code sketch follows the list):

- To rank a block, first locate its k nearest neighbors using the vector
  representation.
- Find a ranking model for the examples of these k nearest blocks using
  SVM^{light} in ranking mode.
- Rank all examples of the given block with this model.
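
The following minimal sketch illustrates these steps. Since the original work
used SVM^{light} in ranking mode, the sketch substitutes an ordinary linear
SVM classifier from scikit-learn and ranks by its decision values; all
function and variable names (`rank_block`, `train_block_examples`, ...) are
hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def rank_block(new_block_vec, new_block_X,
               train_block_vecs, train_block_examples, k=5):
    """Rank the examples of a new block via its k nearest training blocks.

    train_block_vecs: array of per-block statistic vectors.
    train_block_examples: list of (X, y) example arrays, one pair per block.
    """
    # Step 1: locate the k nearest training blocks in vector space.
    nn = NearestNeighbors(n_neighbors=k).fit(train_block_vecs)
    _, idx = nn.kneighbors(new_block_vec.reshape(1, -1))

    # Step 2: fit a model on the pooled examples of those neighbor blocks
    # (stand-in for SVM^light in ranking mode).
    X = np.vstack([train_block_examples[i][0] for i in idx[0]])
    y = np.concatenate([train_block_examples[i][1] for i in idx[0]])
    model = LinearSVC().fit(X, y)

    # Step 3: rank the new block's examples by the model's decision values.
    scores = model.decision_function(new_block_X)
    return np.argsort(-scores)  # example indices, best-ranked first
```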

Before the whole learning process, the datasets were normalized.
To assess the importance of the numerical attributes, decision trees were
learned for all blocks. The frequency of an attribute's occurrence served as
an estimate of its relevance. Using these estimates, a feature selection for
the vector representation was carried out. We achieved our best results by
including only attributes occurring in at least two decision trees. It was of
utmost importance to find an appropriate value for *k*. Additionally, the
SVM's parameter for positive vs. negative misclassification costs was
optimized.
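
A sketch of this feature selection step, assuming scikit-learn's
`DecisionTreeClassifier` in place of the unspecified tree learner and
per-block data given as `(X, y)` pairs:

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def select_features(blocks, min_trees=2):
    """Keep attributes that occur in at least `min_trees` per-block trees."""
    counts = Counter()
    for X, y in blocks:
        tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
        # tree_.feature holds the split attribute per node; negatives are leaves.
        used = set(tree.tree_.feature[tree.tree_.feature >= 0])
        counts.update(used)
    return sorted(int(f) for f, c in counts.items() if c >= min_trees)
```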

*(Figure: the process for ranking previously unseen blocks.)*

#### SVM Classification

SVMs with an RBF kernel have been used successfully for predicting protein
homology. Thus, an SVM^{light} with an RBF kernel was trained on the
normalized data. The function value of the SVM corresponds to the distance to
the separating hyperplane: the greater the distance, the higher the SVM's
confidence in its classification. Therefore, the function values were used to
produce a ranking, rather than using the trained SVM models to make class
assignments.
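
A sketch of this ranking step, using scikit-learn's `SVC` as a stand-in for
SVM^{light}; the `class_weight` parameter plays the role of the asymmetric
misclassification costs mentioned above (labels assumed to be 0/1):

```python
from sklearn.svm import SVC

def svm_rank_scores(X_train, y_train, X_test, pos_cost=10.0):
    """Return signed distances to the hyperplane as ranking scores."""
    svm = SVC(kernel="rbf", class_weight={1: pos_cost, 0: 1.0})
    svm.fit(X_train, y_train)
    # Larger decision values mean the SVM is more confident the example
    # is positive, so they can be used directly for ranking.
    return svm.decision_function(X_test)
```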

#### Final Model

For the final model, the predictions of both methods were normalized.
Combining the two models yielded the final ranking for each block and a
further improvement in performance.
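
The report does not state the exact combination rule; the following sketch
shows one plausible choice, z-score normalizing both score vectors and
averaging them:

```python
import numpy as np

def combine_scores(scores_bnn, scores_svm):
    """Normalize both score vectors and average them into a final ranking."""
    def zscore(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-12)  # guard against zero std
    combined = (zscore(scores_bnn) + zscore(scores_svm)) / 2.0
    return np.argsort(-combined)  # example indices, best-ranked first
```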

*(Figure: the overall procedure for ranking new blocks.)*