Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation

dc.contributor.advisor: Pfahringer, Bernhard
dc.contributor.advisor: Holmes, Geoffrey
dc.contributor.advisor: Mayo, Michael
dc.contributor.author: Mutter, Stefan
dc.date.accessioned: 2011-04-20T03:09:46Z
dc.date.available: 2011-04-20T03:09:46Z
dc.date.issued: 2011
dc.date.updated: 2011-04-14T23:02:42Z
dc.description.abstract: Detecting similarity in biological sequences is key to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short and long distance interactions, form 3D structures and change through evolutionary processes. Amino acids are one of the key molecules of life. Most importantly, a sequence of amino acids constitutes the building block for proteins, which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks: the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially one-class classifiers, trained solely on a small number of similar proteins and neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a one-class to a binary one. Equipped with the necessary sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues to this transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information.
This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed-length feature vector representation. Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a one-class, generative to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve discriminative power and runtime: terminating the time-intensive training of HMMs early, then applying propositionalisation and classifying with a discriminative, binary learner.
dc.format.mimetype: application/pdf
dc.identifier.citation: Mutter, S. (2011). Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation (Thesis, Doctor of Philosophy (PhD)). University of Waikato, Hamilton, New Zealand. Retrieved from https://hdl.handle.net/10289/5299 [en]
dc.identifier.uri: https://hdl.handle.net/10289/5299
dc.language.iso: en
dc.publisher: University of Waikato
dc.rights: All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.
dc.subject: Machine Learning
dc.subject: Bioinformatics
dc.subject: Statistical Modelling
dc.subject: Hidden Markov Models
dc.subject: proteins
dc.subject: amino acids
dc.subject: one-class classification
dc.subject: propositionalisation
dc.subject: null model
dc.subject: discriminative learner
dc.subject: generative model
dc.subject: classification
dc.title: Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation [en]
dc.type: Thesis
pubs.place-of-publication: Hamilton, New Zealand [en_NZ]
thesis.degree.grantor: University of Waikato
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy (PhD) [en_NZ]
Files
Original bundle
Name: thesis.pdf
Size: 6.62 MB
Format: Adobe Portable Document Format