
J. Zurada - Towards Better Understanding of Protein Secondary Structure: Extracting Prediction Rules

Prediction of protein secondary structure (PSS) is considered one of the central problems in bioinformatics today. A vast body of work, including computational approaches, aims to improve the accuracy of this prediction. Although numerous computational intelligence techniques have been used to predict PSS, only a few studies have dealt with the discovery of the rules underlying the prediction itself. Such rules offer interesting links between the prediction model and the underlying biology, and they make PSS prediction more interpretable by lending a degree of transparency to predicting models that are usually regarded as black boxes. In this seminar, we explore the use of C4.5 decision trees to extract relevant rules from PSS predictions modeled with two-stage support vector machines (TS-SVM). The rules were derived on the RS126 dataset of globular proteins and on the PSIPRED dataset of 1923 protein sequences. Our approach has produced sets of comprehensible and often interpretable rules underlying the PSS predictions. Moreover, many of the rules seem to be strongly supported by biological evidence.
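The general idea of extracting rules from a black-box predictor can be illustrated with a minimal sketch. It is not the method presented in the seminar: the function `black_box` is a hypothetical stand-in for a trained TS-SVM, the 3-residue windows are invented toy data (not drawn from RS126 or PSIPRED), and only the root split of a tree is chosen, using the gain ratio criterion that C4.5 applies when selecting attributes.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(rows, labels, attr):
    """C4.5 split criterion: information gain divided by split information."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    total = len(labels)
    # Expected entropy of the labels after splitting on this attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def black_box(window):
    """Hypothetical stand-in for a trained predictor.

    Returns H (helix), E (strand) or C (coil) from the centre residue only.
    The residue groupings are arbitrary and chosen for illustration.
    """
    centre = window[1]
    if centre in "AEL":
        return "H"
    if centre in "VIY":
        return "E"
    return "C"


# Invented 3-residue windows; each position is one categorical attribute.
windows = ["GAK", "KEV", "ALG", "PVS", "SIT", "TYG",
           "AGP", "VPS", "LSK", "EAV", "GVA", "IGE"]

# Label the windows with the black box, then ask which window position
# best explains those labels according to the gain ratio.
labels = [black_box(w) for w in windows]
scores = {pos: gain_ratio(windows, labels, pos) for pos in range(3)}
best = max(scores, key=scores.get)

# Read off one rule per observed residue at the chosen position:
# the majority class the black box assigned to windows with that residue.
by_residue = defaultdict(list)
for w, label in zip(windows, labels):
    by_residue[w[best]].append(label)
rules = {res: Counter(ls).most_common(1)[0][0] for res, ls in by_residue.items()}

for res, label in sorted(rules.items()):
    print(f"IF residue at position {best} is {res} THEN predict {label}")
```

Because the toy black box depends only on the centre residue, the gain ratio singles out that position, and the printed rules reproduce its behaviour. A full rule extractor would recurse on each branch and prune the tree, which this sketch omits.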