Features Selection in Statistical Classification of High Dimensional Image Derived Maize (Zea Mays L.) Phenomic Data

Peter Gachoki; Moses Muraya; Gladys Njoroge

American Journal of Applied Mathematics and Statistics. 2022, 10(2), 44-51
DOI: 10.12691/AJAMS-10-2-2

Original Research


Peter Gachoki¹, Moses Muraya² and Gladys Njoroge¹

¹Department of Physical Sciences, Chuka University, P.O Box 109-60400, Chuka, Kenya

²Department of Plant Sciences, Chuka University, P.O Box 109-60400, Chuka, Kenya

Pub. Date: June 06, 2022


Cite this paper

Peter Gachoki, Moses Muraya and Gladys Njoroge. Features Selection in Statistical Classification of High Dimensional Image Derived Maize (Zea Mays L.) Phenomic Data. American Journal of Applied Mathematics and Statistics. 2022; 10(2):44-51. doi: 10.12691/AJAMS-10-2-2

Abstract

Phenotyping has advanced with the application of high-throughput phenotyping techniques such as automated imaging. This has led to the derivation of large quantities of high-dimensional phenotypic data that could not have been obtained in a single run using manual phenotyping. Hence, there is a need for the parallel development of statistical techniques that can appropriately handle such large and/or high-dimensional data sets. Moreover, there is a need for statistical criteria for selecting the image-derived phenotypic features that serve as the best predictors in modelling plant growth. Information on such criteria is limited. The objective of this study was to apply feature importance, feature selection with Shapley values and LASSO regression to find the subset of features with the highest predictive power for subsequent use in modelling maize plant growth from high-dimensional image-derived phenotypic data. The study compared the statistical power of these feature extraction methods by fitting an XGBoost model using the best features from each selection method. The image-derived phenomic data were obtained from the Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben, Germany. Data analysis was performed using R statistical software. Missing values were imputed using the k-nearest neighbours technique. Feature extraction was performed using feature importance, Shapley values and LASSO regression. Shapley values extracted 25 phenotypic features, feature importance extracted 31 features and LASSO regression extracted 12 features. Of the three techniques, the feature importance criterion emerged as the best feature selection technique, followed by Shapley values and LASSO regression, respectively. The study demonstrated the potential of feature importance as a selection technique for reducing the number of input variables in high-dimensional growth data sets.
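To illustrate how LASSO regression can act as a feature selector, the following is a minimal sketch in plain Python, not the authors' implementation (the study used R, and its code is not shown here). It minimises the usual LASSO objective by cyclic coordinate descent on a tiny synthetic data set. The feature names `f0` to `f3`, the data, and the penalty `lam = 0.5` are all hypothetical and chosen only so that the behaviour is easy to follow: features whose association with the response is weaker than the penalty receive a coefficient of exactly zero and are dropped.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything in [-lam, lam] becomes exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows with centred columns; y is a centred response.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Hypothetical toy data: four centred, mutually orthogonal +/-1 features.
f0 = [1, -1, 1, -1, 1, -1, 1, -1]
f1 = [1, 1, -1, -1, 1, 1, -1, -1]
f2 = [1, -1, -1, 1, 1, -1, -1, 1]
f3 = [1, 1, 1, 1, -1, -1, -1, -1]
names = ["f0", "f1", "f2", "f3"]
X = list(zip(f0, f1, f2, f3))

# The response depends strongly on f0 and f1, weakly on f2, and not at all on f3.
y = [3.0 * a - 2.0 * b + 0.1 * c for a, b, c in zip(f0, f1, f2)]

beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [name for name, b in zip(names, beta) if b != 0.0]
print(selected)  # -> ['f0', 'f1']
```

In this sketch the weak feature `f2` and the irrelevant feature `f3` are removed, while the coefficients of `f0` and `f1` are shrunk toward zero by the penalty. On real phenomic data the features would be correlated and the penalty would normally be tuned, for example by cross-validation, so the selected subset depends on that choice.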

Keywords

high throughput phenotyping, high dimensional data, feature extraction, feature importance, Shapley values, LASSO regression

Copyright

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
