Principle, Methodology and Application for Data Cleaning techniques
DOI: https://doi.org/10.54691/bcpbm.v26i.2032
Keywords: Data cleaning; Thermal Interpolation; K nearest neighbor algorithm; Lagrange and Newton interpolation.
Abstract
Nowadays, data has become the cornerstone of every company's decision-making. In a vast flow of information, choosing the right data is the first step toward successful predictions. Once the requirements, the purpose of the analysis and the direction of the prediction have been determined, data cleaning usually involves three tasks: outlier processing, missing value processing and duplicate value processing. This paper describes in detail the advantages, disadvantages and limitations of different methods for these tasks. It introduces several interpolation methods based on mathematical statistics, namely thermal interpolation, Lagrange interpolation and Newton interpolation. It also presents the normal distribution processing method, which is well suited to outlier problems, and the popular K-nearest neighbor algorithm. Finally, it illustrates the logic diagram of data cleaning in the data preparation stage. Overall, these results offer a guideline for selecting the appropriate treatment for each situation during the data cleaning process.
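
As an illustration of two of the methods named above, the following is a minimal Python sketch, not the paper's own implementation. It assumes the textbook form of Lagrange interpolation for filling a missing value and the common three-sigma rule as the normal distribution criterion for outliers. The function names `lagrange_fill` and `three_sigma_outliers` and the sample data are hypothetical and chosen only for this example.

```python
from statistics import mean, pstdev


def lagrange_fill(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        # Multiply by the i-th Lagrange basis polynomial evaluated at x.
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def three_sigma_outliers(values, k=3.0):
    """Return values lying more than k population standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]


# Hypothetical series following y = x^2 + 1 with the observation at x = 2 missing.
known_x = [0, 1, 3, 4]
known_y = [1.0, 2.0, 10.0, 17.0]
filled = lagrange_fill(known_x, known_y, 2)

# Hypothetical readings: twenty identical values and one extreme value.
readings = [10.0] * 20 + [100.0]
outliers = three_sigma_outliers(readings)
```

Lagrange interpolation reproduces the underlying quadratic here, so the filled value is close to 5. In practice the polynomial is usually fitted to only a few neighbors of the gap, because high-degree interpolants can oscillate. The three-sigma rule also assumes the data are roughly normal and the sample is large enough for an extreme point to exceed the threshold.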






