The KDD-Cup 2004 (knowledge discovery and data mining competition) is held in conjunction with the 10th annual ACM SIGKDD conference. A student group (PG 445) supervised by Katharina Morik and Martin Scholz participated in this year's KDD-Cup and won an Honorable Mention for its excellent performance on the Rank-Last (RKL) metric of the Bio/Protein task. The group's mean RKL of 45.62 was much better than that of the closest competitors. The results can be viewed on the KDD-Cup 2004 homepage. The ceremony will be held on Sunday, August 22, at the KDD Conference in Seattle, where the Honorable Mention certificates will be handed out.
Data inspection revealed that, of the 153 blocks in the training dataset, many had only a small number of positive examples, while a few contained up to 50. The numerical feature values were therefore analyzed separately for every block in the training set. Some simple statistical measures were used to characterize differences between blocks.
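A per-block analysis of this kind could be sketched as follows. The source does not name the statistical measures that were used, so the mean and standard deviation per feature, the `(block_id, label, features)` row layout, and the function name `block_statistics` are assumptions made only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def block_statistics(rows):
    """Summarize each block separately.

    rows: iterable of (block_id, label, features) tuples, label 1 = positive.
    Returns {block_id: {"n", "n_pos", "means", "stdevs"}}.
    The choice of mean/stdev is an assumption, not the group's documented choice.
    """
    by_block = defaultdict(list)
    for block_id, label, features in rows:
        by_block[block_id].append((label, features))

    stats = {}
    for block_id, examples in by_block.items():
        # Transpose the feature rows so that each column is one feature.
        columns = list(zip(*(features for _, features in examples)))
        stats[block_id] = {
            "n": len(examples),
            "n_pos": sum(1 for label, _ in examples if label == 1),
            "means": [mean(col) for col in columns],
            "stdevs": [pstdev(col) for col in columns],
        }
    return stats


# Toy data: two blocks with two numerical features each.
rows = [
    (1, 1, [0.5, 2.0]),
    (1, 0, [1.5, 4.0]),
    (2, 0, [10.0, 0.0]),
    (2, 0, [12.0, 0.0]),
    (2, 1, [14.0, 0.0]),
]
stats = block_statistics(rows)
```

Counting positives per block alongside the feature summaries makes the imbalance between blocks visible at a glance.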
One way to take differences between blocks into account is to quantify them appropriately. Using the statistical measures mentioned above, each block was transformed into a vector representation. A distance measure on these vectors made it possible to find similar blocks with the k-nearest-neighbor method. The BlockNearestNeighbor method can be characterized by the following steps:
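As a rough illustration of the underlying idea only (this is not the group's step list), finding similar blocks from their vector representations might look like the sketch below. The Euclidean distance, the value of `k`, and the names `nearest_blocks` and `block_vectors` are assumptions; the source does not say which distance measure was used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_blocks(query_vector, block_vectors, k=3):
    """Return the ids of the k training blocks closest to query_vector.

    block_vectors: {block_id: vector of per-block statistics} (hypothetical layout).
    """
    ranked = sorted(
        block_vectors,
        key=lambda block_id: euclidean(query_vector, block_vectors[block_id]),
    )
    return ranked[:k]


# Toy block vectors, e.g. per-feature means of three training blocks.
block_vectors = {
    "A": [1.0, 3.0],
    "B": [12.0, 0.0],
    "C": [1.5, 2.5],
}
neighbors = nearest_blocks([1.2, 2.9], block_vectors, k=2)
```

In this toy example the new block is closest to blocks `A` and `C`, so those would be treated as its most similar training blocks.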
SVMs with RBF kernel have successfully been used for predicting protein homology. Thus an SVM with RBF kernel was trained on the normalised data using SVMlight. The function value of the SVM corresponds to the distance to the separating hyperplane. The bigger the distance to the hyperplane, the higher the SVM's "confidence" in its classification. Therefore the function values were used to produce a ranking, rather than using the trained SVM models to make class assignments.
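The difference between ranking by function value and assigning classes can be sketched in a few lines. The code below does not call SVMlight; it evaluates the standard kernel decision function f(x) = Σ cᵢ·K(svᵢ, x) + b by hand for a toy model whose support vectors, coefficients, bias and `gamma` are invented for illustration.

```python
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, support_vectors, coefficients, bias, gamma):
    """SVM function value; coefficients[i] stands for alpha_i * y_i."""
    return bias + sum(
        c * rbf_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, coefficients)
    )


def rank_by_decision_value(examples, score):
    """Indices of examples, ordered from highest to lowest function value."""
    return sorted(range(len(examples)), key=lambda i: score(examples[i]), reverse=True)


# Toy model (hypothetical values, not a trained SVMlight model).
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
coefficients = [1.0, -1.0]  # positive SV at the origin, negative SV at (2, 2)
bias = 0.0
gamma = 0.5


def score(x):
    return decision_value(x, support_vectors, coefficients, bias, gamma)


examples = [[1.9, 2.1], [0.1, 0.0], [1.0, 1.0]]
ranking = rank_by_decision_value(examples, score)
```

Thresholding the function value at zero would only yield a class label per example, whereas sorting by the value orders all examples of a block, which is what a rank-based metric such as RKL evaluates.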
In the final model the predictions were normalized. Combining both models produced the final ranking for each block and further improved performance.
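The source does not state how the predictions were normalized or combined. As one plausible sketch, assuming min-max normalization within a block and a weighted average of the two models' scores (both assumptions, as are the names `min_max_normalize`, `combine_rankings` and `weight_a`), the combination could look like this:

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; constant score lists map to 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def combine_rankings(scores_a, scores_b, weight_a=0.5):
    """Combine two models' scores for the examples of one block.

    Returns (order, combined): example indices from best to worst and the
    combined scores. The weighted average is an illustrative assumption.
    """
    norm_a = min_max_normalize(scores_a)
    norm_b = min_max_normalize(scores_b)
    combined = [weight_a * a + (1 - weight_a) * b for a, b in zip(norm_a, norm_b)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order, combined


# Toy scores for three examples of one block (hypothetical values).
knn_scores = [0.2, 0.8, 0.5]   # e.g. from the nearest-neighbor model
svm_scores = [-1.0, 0.0, 3.0]  # e.g. SVM function values
order, combined = combine_rankings(knn_scores, svm_scores)
```

Normalizing first matters because the two models produce scores on different scales; without it, the model with the larger numeric range would dominate the combined ranking.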
The overall procedure for ranking new blocks is sketched in the following figure: