Scaling up semi-supervised learning: An efficient and effective LLGC variant

Abstract

Domains such as text classification can easily supply large amounts of unlabeled data, but labeling is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately, most of the theoretically well-founded algorithms described in recent years are cubic or worse in the total number of labeled and unlabeled training examples. In this paper we modify the standard LLGC algorithm to improve its efficiency to the point where it can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
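The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the commonly cited LLGC iteration F ← αSF + (1 − α)Y with S = D^(-1/2) W D^(-1/2), a Gaussian similarity, and sparsification by keeping each point's k nearest neighbours. The function names, the parameters `k`, `sigma`, `alpha`, and `iters`, and the dict-of-neighbours representation are all illustrative assumptions. The priming step is not shown.

```python
import math


def knn_sparse_similarity(points, k, sigma=1.0):
    """Build a symmetric sparse similarity matrix as a list of dicts.

    Only the k nearest neighbours of each point are kept (assumed
    sparsification scheme). The brute-force neighbour search here is
    quadratic and is used only to keep the sketch short.
    """
    n = len(points)
    W = [dict() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            w = math.exp(-d * d / (2.0 * sigma ** 2))  # Gaussian similarity
            W[i][j] = w
            W[j][i] = w  # keep the matrix symmetric
    return W


def llgc(W, labels, n_classes, alpha=0.9, iters=100):
    """Run the LLGC-style iteration F <- alpha*S*F + (1-alpha)*Y.

    labels[i] is a class index for labeled points and None for
    unlabeled ones. Returns one predicted class index per point.
    """
    n = len(W)
    deg = [sum(row.values()) for row in W]
    # Symmetric normalisation: S = D^(-1/2) W D^(-1/2), stored sparsely.
    S = [
        {j: w / math.sqrt(deg[i] * deg[j]) for j, w in row.items()}
        for i, row in enumerate(W)
    ]
    # Y holds one-hot rows for labeled points and zero rows otherwise.
    Y = [
        [1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
        for i in range(n)
    ]
    F = [row[:] for row in Y]
    for _ in range(iters):
        # Each update touches only the stored neighbours of each point.
        F = [
            [
                alpha * sum(s * F[j][c] for j, s in S[i].items())
                + (1.0 - alpha) * Y[i][c]
                for c in range(n_classes)
            ]
            for i in range(n)
        ]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]
```

With at most k stored neighbours per point (before symmetrisation), each iteration costs time roughly proportional to the number of stored similarities rather than to the square of the number of examples, which is the kind of saving the sparsification is meant to provide.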

Citation

Pfahringer, B., Leschi, C. & Reutemann, P. (2007). Scaling up semi-supervised learning: An efficient and effective LLGC variant. In Z.-H. Zhou, H. Li & Q. Yang (Eds.), Proceedings 11th Pacific-Asia Conference, PAKDD 2007, Nanjing, China, May 22-25, 2007 (pp. 236-247). Berlin: Springer.

Publisher

Springer, Berlin
