Loading...
Thumbnail Image
Item

Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation

Abstract
Detecting similarity in biological sequences is a key element to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short and long distance interactions, form 3D structures and change through evolutionary processes. Amino acids are one of the key molecules of life. Most importantly, a sequence of amino acids constitutes the building block for proteins which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks – the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a one-class to a binary one. Equipped with the necessary sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues on this aforementioned transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information. This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed length feature vector representation. Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a one-class, generative to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve on discriminative power and runtime by means of terminating the time-intense training of HMMs early, subsequently applying propositionalisation and classifying with a discriminative, binary learner.
Type
Thesis
Type of thesis
Series
Citation
Mutter, S. (2011). Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation (Thesis, Doctor of Philosophy (PhD)). University of Waikato, Hamilton, New Zealand. Retrieved from https://hdl.handle.net/10289/5299
Date
2011
Publisher
University of Waikato
Rights
All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.