Feature Selection and Transduction for Prediction of Molecular Bioactivity for Drug Design

Jason Weston, Fernando Perez-Cruz, Olivier Bousquet, Olivier Chapelle, Andre Elisseeff and Bernhard Schoelkopf


In drug discovery, a key task is to identify the characteristics that separate active (binding) compounds from inactive (non-binding) ones. An automated prediction system can help reduce the resources needed to carry out this task. Two methods for predicting molecular bioactivity for drug design are introduced and shown to perform well on the thrombin binding problem, which was previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001. The data is characterized by very few positive examples, a very large number of features (describing three-dimensional properties of the molecule) and rather different distributions between training and test data. Two techniques are introduced specifically to tackle these problems: a feature selection method for unbalanced data and a classifier which adapts to the distribution of the unlabeled test data (a so-called transductive method). We show that both techniques improve identification performance and that in conjunction they provide an 81.6% success rate, an improvement over the 68.4% achieved by the winner of the KDD Cup. Our results suggest the importance of taking into account the characteristics of this data, which may also be relevant to other problems of a similar type.
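To make the idea of feature selection for unbalanced data concrete, here is a minimal sketch in Python. It is not taken from the report or the Matlab source code listed on this page. It assumes binary features and labels in {+1, -1}, and it assumes one plausible form of score: the number of positive examples in which a feature is switched on, minus a penalty `lam` times the number of negative examples in which it is switched on. The function names, the parameter `lam` and the toy data are all hypothetical.

```python
def unbalanced_scores(X, y, lam=1.0):
    """Score each binary feature, treating the two classes asymmetrically.

    Assumed form (illustrative only, not the report's definition):
        score_j = sum_{i: y_i = +1} X[i][j] - lam * sum_{i: y_i = -1} X[i][j]
    """
    n_features = len(X[0])
    scores = [0.0] * n_features
    for row, label in zip(X, y):
        # Positive examples add to the score, negative ones subtract lam.
        weight = 1.0 if label == 1 else -lam
        for j, value in enumerate(row):
            scores[j] += weight * value
    return scores


def select_top_features(X, y, k, lam=1.0):
    """Return the indices of the k highest-scoring features."""
    scores = unbalanced_scores(X, y, lam)
    # sorted() is stable, so tied features keep their original index order.
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return order[:k]


# Toy data: 5 compounds, 4 binary features, only 2 positives.
X = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
y = [1, 1, -1, -1, -1]

scores = unbalanced_scores(X, y, lam=1.0)
top = select_top_features(X, y, k=2, lam=1.0)
```

In this toy example feature 0 is switched on only in the two positive compounds, so it receives the highest score. Raising `lam` penalizes features that also appear in inactive compounds more heavily; how to set it is a modelling choice that this sketch does not address.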
Main report
Link to KDD Cup 2001 Competition - data and results.
Statistics of the data: further description
Unbalanced correlation score: further description
Experimental methods & results: further description
Source code (Matlab M-Files)