Carmona-Cejudo, J. M., Baena-García, M., Campo-Ávila, J., Bifet, A., Gama, J., & Morales-Bueno, R. (2011). Lecture Notes in Computer Science. (D. Hutchison, T. Kanade, J. Kittler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, & J. Hollmén, Eds.) (Vol. 7014, pp. 90-100).
Permanent Research Commons link: https://hdl.handle.net/10289/6957
Real-time email classification is a challenging task because of its online nature, which makes it subject to concept drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluating data streams. Therefore, other metrics, such as the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different online mining methods using state-of-the-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, including a tool for launching replicable experiments.
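To make the evaluation idea in the abstract concrete, the following is a minimal sketch of a prequential (test-then-train) error with a fading factor. It assumes the commonly used recursive form, in which the faded loss sum S_i = L_i + alpha * S_(i-1) is divided by the faded count N_i = 1 + alpha * N_(i-1), and it assumes a 0-1 loss per message. The function name `prequential_error` and the parameter `alpha` are hypothetical illustrations and are not part of GNUsmail's API.

```python
from typing import Iterable, List


def prequential_error(losses: Iterable[float], alpha: float = 1.0) -> List[float]:
    """Return the running prequential error after each example.

    losses: per-example loss, e.g. 1.0 for a misfiled email and 0.0 otherwise
            (assumed 0-1 loss; each loss is computed before the model is
            trained on that example).
    alpha:  fading factor in (0, 1]; alpha == 1.0 gives the plain
            prequential error, smaller values discount older examples.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")

    faded_loss_sum = 0.0  # S_i = L_i + alpha * S_(i-1)
    faded_count = 0.0     # N_i = 1 + alpha * N_(i-1)
    errors: List[float] = []
    for loss in losses:
        faded_loss_sum = loss + alpha * faded_loss_sum
        faded_count = 1.0 + alpha * faded_count
        errors.append(faded_loss_sum / faded_count)
    return errors


if __name__ == "__main__":
    # Hypothetical stream: two early mistakes, then two correct predictions.
    stream = [1.0, 1.0, 0.0, 0.0]
    print(prequential_error(stream, alpha=1.0))
    print(prequential_error(stream, alpha=0.5))
```

With `alpha = 1.0` the early mistakes keep weighing on the estimate indefinitely, whereas a smaller `alpha` lets the estimate follow recent performance more closely, which is the kind of problem fading factors are meant to alleviate under concept drift.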