Pfahringer, B., Leschi, C. & Reutemann, P. (2007). Scaling up semi-supervised learning: An efficient and effective LLGC variant. In Z.-H. Zhou, H. Li & Q. Yang (Eds.), Proceedings 11th Pacific-Asia Conference, PAKDD 2007, Nanjing, China, May 22-25, 2007 (pp. 236-247). Berlin: Springer.
Permanent Research Commons link: http://hdl.handle.net/10289/1433
Domains like text classification can easily supply large amounts of unlabeled data, but labeling itself is expensive. Semi-supervised learning tries to exploit this abundance of unlabeled training data to improve classification. Unfortunately, most of the theoretically well-founded algorithms described in recent years have a cost that is cubic or worse in the total number of labeled and unlabeled training examples. In this paper we modify the standard LLGC algorithm to improve its efficiency to the point where it can handle datasets with hundreds of thousands of training examples. The modifications are priming of the unlabeled data and, most importantly, sparsification of the similarity matrix. We report promising results on large text classification problems.
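To make the sparsification idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes that LLGC refers to the usual label-spreading iteration F <- alpha * S * F + (1 - alpha) * Y over a symmetrically normalized similarity matrix S, and it assumes sparsification by keeping only each example's k most similar neighbors. The function names, the RBF similarity, and the values of k, alpha and n_iter are illustrative choices. Priming is left out because the abstract does not say how it is done.

```python
import numpy as np

def knn_sparsify(W, k):
    """Keep the k largest similarities in each row, then symmetrize (assumed scheme)."""
    n = W.shape[0]
    W = W.copy()
    np.fill_diagonal(W, 0.0)  # no self-similarity
    # Column indices of the k largest entries in every row.
    idx = np.argpartition(-W, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    mask |= mask.T  # keep an edge if either endpoint selected it
    return np.where(mask, W, 0.0)

def llgc(W, Y, alpha=0.9, n_iter=50):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y and return the arg-max class."""
    d = W.sum(axis=1)
    d[d == 0] = 1.0  # guard against isolated examples
    inv_sqrt = 1.0 / np.sqrt(d)
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]  # D^{-1/2} W D^{-1/2}
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)

# Toy data: two well-separated 1-D clusters with one labeled example each.
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W_full = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)  # RBF similarity (assumption)
W_sparse = knn_sparsify(W_full, k=2)

Y = np.zeros((6, 2))
Y[0, 0] = 1.0  # example 0 is labeled as class 0
Y[3, 1] = 1.0  # example 3 is labeled as class 1

labels = llgc(W_sparse, Y)
print(labels)
```

A dense array is used here only for brevity. At the scale the abstract mentions, the sparsified matrix would need a sparse representation, so that each iteration costs time proportional to the number of retained similarities rather than to the square of the number of examples.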