Designing similarity functions
Permanent link to Research Commons version: https://hdl.handle.net/10289/15442
The concept of similarity is important in many areas of cognitive science, computer science, and statistics. In machine learning, functions that measure similarity between two instances form the core of instance-based classifiers. Past similarity measures have been primarily based on simple Euclidean distance. As machine learning has matured, it has become obvious that a simple numeric instance representation is insufficient for most domains. Similarity functions for symbolic attributes have been developed, and simple methods for combining these functions with numeric similarity functions have been devised. This sequence of events has revealed three important issues, which this thesis addresses.

The first issue concerns combining multiple measures of similarity. There is no equivalence between units of numeric similarity and units of symbolic similarity. Existing similarity functions for numeric and symbolic attributes have no common foundation, and so various schemes have been devised to avoid biasing the overall similarity towards one type of attribute. The similarity function design framework proposed by this thesis produces probability distributions that describe the likelihood of transforming between two attribute values. Because common units of probability are employed, similarities may be combined using standard methods. It is empirically shown that the resulting similarity functions treat different attribute types coherently.

The second issue relates to the instance representation itself. The current choice of numeric and symbolic attribute types is insufficient for many domains, in which more complicated representations are required. For example, a domain may require varying numbers of features, or features with structural information. The framework proposed by this thesis is sufficiently general to permit virtually any type of instance representation: all that is required is that a set of basic transformations operating on the instances be defined. To illustrate the framework's applicability to different instance representations, several example similarity functions are developed.

The third, and perhaps most important, issue concerns the ability to incorporate domain knowledge within similarity functions. Domain information plays an important part in choosing an instance representation. However, even given an adequate instance representation, domain information is often lost. For example, numeric features that wrap around modulo some period (such as the time of day) can be represented perfectly as a numeric attribute, but simple linear similarity functions ignore the modulo nature of the attribute. Similarly, symbolic attributes may have inter-symbol relationships that should be captured in the similarity function. The design framework proposed by this thesis allows domain information to be captured in the similarity function, both in the transformation model and in the probability assigned to basic transformations. Empirical results indicate that such domain information improves classifier performance, particularly when training data is limited.
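The time-of-day example can be made concrete with a small sketch. It is not the thesis's framework, which works with probability distributions over transformations. It only contrasts a plain linear difference with a wrap-around difference, to show what a linear function loses. The function names and the 24-hour period are assumptions chosen for illustration.

```python
def linear_difference(a: float, b: float) -> float:
    """Plain absolute difference, ignoring any wrap-around."""
    return abs(a - b)


def cyclic_difference(a: float, b: float, period: float = 24.0) -> float:
    """Shortest distance between a and b on a circle of the given period.

    The period of 24.0 (hours in a day) is an illustrative assumption.
    """
    d = abs(a - b) % period
    return min(d, period - d)


# 23:00 and 01:00 are two hours apart on the clock,
# but a linear measure treats them as 22 hours apart.
late, early = 23.0, 1.0
print(linear_difference(late, early))  # 22.0
print(cyclic_difference(late, early))  # 2.0
```

A classifier using the linear measure would treat 23:00 and 01:00 as nearly maximally dissimilar, which is the kind of lost domain information the abstract describes.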
The University of Waikato