International Science Index


FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule

Abstract:Instance selection (IS) technique is used to reduce the data size to improve the performance of data mining methods. Recently, to process very large data set, several proposed methods divide the training set into some disjoint subsets and apply IS algorithms independently to each subset. In this paper, we analyze the limitation of these methods and give our viewpoint about how to divide and conquer in IS procedure. Then, based on fast condensed nearest neighbor (FCNN) rule, we propose a large data sets instance selection method with MapReduce framework. Besides ensuring the prediction accuracy and reduction rate, it has two desirable properties: First, it reduces the work load in the aggregation node; Second and most important, it produces the same result with the sequential version, which other parallel methods cannot achieve. We evaluate the performance of FCNN-MR on one small data set and two large data sets. The experimental results show that it is effective and practical.
[1] S. Garcła, J. Luengo, and F. Herrera, “Data preprocessing in data mining,” Computer Science, vol. 72, 2015.
[2] B. J. Han, “Data mining. concepts and techniques. 3rd ed,” Data Mining Concepts Models Methods & Algorithms Second Edition, vol. 5, no. 4, pp. 1 – 18, 2000.
[3] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[4] D. R. Wilson and T. R. Martinez, “Reduction techniques for instance-based learning algorithms,” Machine Learning, vol. 38, no. 3, pp. 257–286, 2000.
[5] M. Kudo and J. Sklansky, “Comparison of algorithms that select features for pattern classifiers,” Pattern Recognition, vol. 33, no. 1, pp. 25–41, 2000.
[6] H. Liu and H. Motoda, “Feature extraction construction and selection: A data mining perspective,” Springer International, vol. 94, no. 448, p. 014004, 1999.
[7] I. Triguero, J. Derrac, S. Garcia, and F. Herrera, “A taxonomy and experimental study on prototype generation for nearest neighbor classification,” Systems Man & Cybernetics Part C Applications & Reviews IEEE Transactions on, vol. 42, no. 1, pp. 86–100, 2012.
[8] J. Hamidzadeh, R. Monsefi, and H. S. Yazdi, “Irahc: Instance reduction algorithm using hyperrectangle,” Pattern Recognition, vol. 48, no. 5, pp. 1878–1889, 2015.
[9] Haro-Garc, A. Aida, Garc, and N. A-Pedrajas, “A divide-and-conquer recursive approach for scaling up instance selection algorithms,” Data Mining and Knowledge Discovery, vol. 18, no. 3, pp. 392–418, 2009.
[10] I. Triguero, D. Peralta, J. Bacardit, S. Garcła, and F. Herrera, “Mrpr: A mapreduce solution for prototype reduction in big data classification,” Neurocomputing, vol. 150, no. 150, p. 331C345, 2015.
[11] J. Zhai, X. Wang, and X. Pang, “Voting-based instance selection from large data sets with mapreduce and random weight networks,” Information Sciences, vol. 367, pp. 1066–1077, 2016.
[12] H. Liu and H. Motoda, “On issues of instance selection.” Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 115–130, 2002.
[13] J. R. Cano, F. Herrera, and M. Lozano, “Stratification for scaling up evolutionary prototype selection,” Pattern Recognition Letters, vol. 26, no. 7, pp. 953–963, 2005.
[14] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters.” in Conference on Symposium on Opearting Systems Design & Implementation, 2004, pp. 107–113.
[15] F. Angiulli, “Fast condensed nearest neighbor rule,” in International Conference, 2005, pp. 25–32.
[16] Angiulli, “Fast nearest neighbor condensation for large data sets classification,” IEEE Transactions on Knowledge & Data Engineering, vol. 19, no. 11, pp. 1450–1464, 2007.
[17] B. P. E. Hart, “The condensed nearest neighbor rule,” in IEEE Trans. Information Theory, 1968.
[18] C. H. Chou, B. H. Kuo, and F. Chang, “The generalized condensed nearest neighbor rule as a data reduction method,” vol. 2, pp. 556–559, 2006.
[19] G. W. Gates, “The reduced nearest neighbor rule,” IEEE Transactions on Information Theory, vol. 18, no. 3, pp. 431 – 433, 1972.
[20] D. L. Wilson, “Asymptotic properties of nearest neighbor rules using edited data,” IEEE Transactions on Systems Man & Cybernetics, vol. 2, no. 3, pp. 408–421, 1972.
[21] W. C. Lin, C. F. Tsai, S. W. Ke, C. W. Hung, and W. Eberle, “Learning to detect representative data for large scale instance selection,” Journal of Systems & Software, vol. 106, no. C, pp. 1–8, 2015.
[22] A. Onan, “A fuzzy-rough nearest neighbor classifier combined with consistency-based subset evaluation and instance selection for automated diagnosis of breast cancer,” Expert Systems with Applications, vol. 42, no. 20, pp. 6844–6852, 2015.
[23] J. A. Olvera-Lpez, J. A. Carrasco-Ochoa, and J. F. Martłnez-Trinidad, “A new fast prototype selection method based on clustering,” Pattern Analysis and Applications, vol. 13, no. 2, pp. 131–141, 2010.
[24] J. R. Cano, F. Herrera, and M. Lozano, “Using evolutionary algorithms as instance selection for data reduction in kdd: an experimental study,” IEEE Transactions on Evolutionary Computation, vol. 7, no. 6, pp. 561–575, 2004.
[25] D. Borthakur, “The hadoop distributed file system: Architecture and design,” Hadoop Project Website, vol. 11, no. 11, pp. 1 – 10, 2007.
[26] D. B. Skalak, “Prototype and feature selection by sampling and random mutation hill climbing algorithms,” Machine Learning Proceedings, pp. 293–301, 1994.
[27] V. S. Devi and M. N. Murty, “An incremental prototype set building technique,” Pattern Recognition, vol. 35, no. 2, pp. 505–513, 2002.