Modelling and Data Analysis
2024. Vol. 14, no. 4, 91–103
doi:10.17759/mda.2024140406
ISSN: 2219-3758 / 2311-9454 (online)
Optimization Problem of Constructing Linear Regressions with a Minimum Value of the Mean Absolute Error on Test Sets
Abstract
This article is devoted to the problem of selecting a given number of the most informative regressors in linear regressions. When the ordinary least squares method is used and the entire data set is available, the exact solution to this problem by the criterion of maximizing the coefficient of determination can be obtained by solving a specially formulated mixed 0-1 integer linear programming problem. In machine learning, however, an important stage in creating a reliable and efficient model is building it on a training set and checking its predictive accuracy on a test set. This article therefore formulates an optimization problem for subset selection in linear regressions based on the criterion of minimizing the mean absolute error on the test set. The formulation relies on a well-known technique in which each absolute error is represented as the difference of two non-negative variables. Computational experiments were carried out on the statistical data on athletes' salaries supplied with the Gretl package, using the LPSolve optimization solver. The training set was formed from 70%, 75%, and 80% of the observations. In these cases the coefficient of determination of the models decreased on average by 24.76%, 18.4%, and 12.22%, while the mean absolute error decreased by 24.8%, 26.3%, and 21.05%, respectively. The experiments also showed that the average time to solve the problems of minimizing the mean absolute error on the test sets was 2.33–2.85 times longer than the time to solve the problems of maximizing the coefficient of determination on the training sets.
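As an illustration of the absolute-error splitting technique mentioned above, the following is a minimal sketch in which each test-set residual is written as the difference of two non-negative variables $u_i^{+}, u_i^{-}$, and binary variables $z_j$ together with a big-M constant $M$ restrict the model to at most $p$ regressors. The notation ($I_{\mathrm{test}}$, $M$, $z_j$, $p$) is assumed here for illustration only and does not reproduce the article's exact mixed 0-1 formulation, which additionally links the coefficients to the training set.

```latex
% Illustrative sketch only: test-set MAE expressed as a linear objective via
% the u+ / u- splitting, with a cardinality constraint on the regressors.
\begin{aligned}
\min_{\beta,\;u^{+},\;u^{-},\;z}\quad
  & \frac{1}{\lvert I_{\mathrm{test}}\rvert}\sum_{i\in I_{\mathrm{test}}}\bigl(u_i^{+}+u_i^{-}\bigr)\\
\text{s.t.}\quad
  & y_i-\beta_0-\sum_{j=1}^{m}\beta_j x_{ij}=u_i^{+}-u_i^{-},
    \qquad u_i^{+}\ge 0,\; u_i^{-}\ge 0,\quad i\in I_{\mathrm{test}},\\
  & -M z_j\le\beta_j\le M z_j,
    \qquad z_j\in\{0,1\},\quad j=1,\dots,m,\\
  & \sum_{j=1}^{m} z_j\le p.
\end{aligned}
```

At an optimum at least one of $u_i^{+}, u_i^{-}$ is zero, so $u_i^{+}+u_i^{-}$ equals the absolute residual $\lvert y_i-\beta_0-\sum_{j}\beta_j x_{ij}\rvert$; this is what makes the mean absolute error representable as a linear objective suitable for a mixed 0-1 linear programming solver.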
General Information
Keywords: machine learning, regression analysis, ordinary least squares method, subset selection in regression, coefficient of determination, mean absolute error, training set, test set, mixed 0-1 integer linear programming problem
Journal rubric: Optimization Methods
Article type: scientific article
DOI: https://doi.org/10.17759/mda.2024140406
Received: 09.09.2024
For citation: Bazilevskiy M.P. Optimization Problem of Constructing Linear Regressions with a Minimum Value of the Mean Absolute Error on Test Sets. Modelirovanie i analiz dannikh = Modelling and Data Analysis, 2024. Vol. 14, no. 4, pp. 91–103. DOI: 10.17759/mda.2024140406. (In Russ., abstr. in Engl.)