Optimally adjusted last cluster for prediction based on balancing the bias and variance by bootstrapping

Authors: Jeongwoo Kim (1)
Author affiliations: (1) Korea Maritime Institute, Busan, Republic of Korea; (2) Biomedical Research Center, Asan Institute for Life Sciences, Seoul, Republic of Korea
Published in: PLoS ONE 14(11)
Category: Research Article
doi: 10.1371/journal.pone.0223529


Estimating a predictive model from a dataset is best started with an unbiased estimator. However, because an unbiased estimator is generally unknown, the bias-variance tradeoff arises. Rather than searching for an unbiased estimator, a convenient way to handle this tradeoff is to use clustering. Within a cluster, which is smaller than the whole sample, a simple form of estimator can be expected to predict well while avoiding overfitting. In this paper, we propose a new method for finding the optimal cluster for prediction. Based on the previous literature, this cluster is considered to lie somewhere between the whole dataset and the typical cluster obtained by partitioning the data. To obtain a reliable cluster size, we use the bootstrap method. Through experiments with simulated and real-world data, we show that the prediction error can be reduced by applying this new method. We believe that the proposed method will be useful in the many applications that rely on a clustering algorithm and need stable prediction performance.
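
The idea of choosing a cluster size between a typical cluster and the whole dataset can be illustrated with a minimal sketch. This is not the procedure of the paper, whose details are not given in this abstract. It assumes time-ordered observations, treats the most recent `base_size` points as the typical last cluster, fits a simple linear model on progressively larger windows, and scores each window by a bootstrap out-of-bag squared error measured on the base cluster. The function names (`fit_line`, `bootstrap_error`, `choose_cluster_size`), the parameters, and the synthetic data are all hypothetical.

```python
import random
from statistics import mean


def fit_line(points):
    """Least-squares fit of y = a + b*x; falls back to the mean if x is constant."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return y_bar, 0.0
    b = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return y_bar - b * x_bar, b


def bootstrap_error(data, size, base_size, n_boot, rng):
    """Out-of-bag squared error on the base cluster when fitting on the last `size` points."""
    n = len(data)
    window = list(range(n - size, n))        # candidate (adjusted) cluster
    base = set(range(n - base_size, n))      # assumed typical last cluster
    errors = []
    for _ in range(n_boot):
        drawn = [rng.choice(window) for _ in window]  # resample with replacement
        held_out = base - set(drawn)                  # base points not drawn
        if not held_out:
            continue
        a, b = fit_line([data[i] for i in drawn])
        errors.extend((data[i][1] - (a + b * data[i][0])) ** 2 for i in held_out)
    return mean(errors) if errors else float("inf")


def choose_cluster_size(data, base_size, step=10, n_boot=200, seed=0):
    """Return the window size with the smallest bootstrap error, plus all errors."""
    rng = random.Random(seed)
    sizes = list(range(base_size, len(data), step)) + [len(data)]
    errors = {m: bootstrap_error(data, m, base_size, n_boot, rng) for m in sizes}
    return min(errors, key=errors.get), errors


# Synthetic series with a change in the relationship after x = 60 (illustrative only).
demo_rng = random.Random(1)
old = [(x, 1.0 + demo_rng.gauss(0, 0.5)) for x in range(60)]
new = [(x, 0.1 * x + demo_rng.gauss(0, 0.5)) for x in range(60, 100)]
best, errors = choose_cluster_size(old + new, base_size=20)
print("chosen size:", best)
```

In this sketch, a larger window lowers the variance of the fitted line but can add bias when older observations follow a different relationship, so the bootstrap error acts as a rough proxy for that balance.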

Keywords:

Algorithms – Approximation methods – Clustering algorithms – k-means clustering – Simulation and modeling – Stock markets


Article published in: PLoS ONE, 2019, Issue 11