A high-frequency stock price prediction model based on Hermite basis expansion and LSTM neural network

Based on Hermite basis function expansion and ensemble learning, this paper proposes an improved LSTM neural network method for high-frequency stock price prediction. High-frequency time series data are high-dimensional, noisy, and unstable; to address these characteristics, the function information extracted by the Hermite basis function expansion is used to predict the residual sequence produced by the LSTM neural network. Since the underlying model structure linking the components of the function feature vector to the response variable is unknown, the proposed method processes them within a Bagging framework, which both captures the latent model structure and balances the variance and bias of the model. In addition, the number of prediction periods of the LSTM neural network is a hyperparameter, and a model averaging method based on distance covariance weighting is adopted to optimize it. Analysis of real data shows that the proposed method effectively improves the prediction accuracy of the LSTM neural network and exhibits a degree of robustness. Finally, on the one hand, this optimization framework can be used to improve other time series prediction models; on the other hand, the proposed method can play an important role in forecasting problems such as daily average temperature prediction and real-time monitoring of atmospheric environmental quality.


Introduction
As is well known, the stock market is complicated and volatile, and stock prices are affected by many factors such as the economic cycle, financial policy, and the international environment, so investors must bear substantial risk while pursuing returns. High-frequency stock prices are high-frequency time series data, characterized by high dimensionality, loud noise, and instability; at the same time, high-frequency financial data contain more information and are significantly better at capturing micro-level changes in the stock market. Therefore, using high-frequency time series to predict the trend of stock prices makes it possible to avoid risk more accurately and obtain more desirable returns.
Traditional time series prediction methods first determine a parametric time series model, solve for its parameters, and then use the fitted model for prediction. Li Zehui (2019) [1] selected China Merchants Bank as the research object, established an ARIMA model, and studied the volatility of its return rate on this basis. Yang Yun et al. (2021) [2] proposed an improved LOBNN&AR-GARCH model for short-term prediction of stock prices. However, these methods suffer from the curse of dimensionality and instability and struggle when long-term prediction is required. ARIMA, GARCH, and similar models are therefore limited, because they cannot capture nonlinear information or function information. Zhou Hua et al. (2018) [3] used principal component analysis to construct an absorption ratio that measures how the degree of risk association between various industries in China changes over time. Zhao Dandan et al. (2019) [4] used support vector machines (SVM) and kernel principal component analysis (PCA) to build a systemic risk early-warning model for China's banking industry. Such dimension-reduction-then-prediction methods do improve accuracy, but they do not take into account the model structure between the reduced variables and the response variable. Shen Zejun et al. (2019) [5] proposed a BP neural network algorithm based on granular computing. Shi Jiannan et al. (2020) [6] proposed a stock price time series prediction method based on dynamic mode decomposition and the long short-term memory neural network (DMD-LSTM) to address the difficulty of effective feature extraction and the resulting low price prediction accuracy.
LSTM and BP neural network prediction methods require no stationarity assumption and do not suffer from the curse of dimensionality, but without extracting function information they can inadvertently amplify the noise in the data. In addition, it is difficult to choose the number of periods of the predictor variables.
To sum up, this paper proposes an improved LSTM neural network prediction model based on Hermite basis function expansion and ensemble learning. The advantages of this method are as follows. While reducing the dimension of the time series, the Hermite basis function expansion extracts function information and nonlinear information, which reduces the computational complexity and improves the estimation accuracy. The Bagging method captures the potential model structure between the reduced variables and the residual sequence and balances the bias and variance of the prediction model, thereby reducing the model's variance, lowering the generalization error, and improving accuracy; moreover, the prediction methods within the Bagging framework are model-free. To solve the parameter selection problem of the LSTM neural network, a model averaging method based on distance covariance weighting is proposed. Experimental results show that the proposed method yields a significant improvement over the original LSTM neural network.

Hermite basis function expansion
The basis expansion method approximates the true function by a series in the basis functions:
$$f(x) = \sum_{n=0}^{\infty} a_n H_n(x),$$
where $a_n$ is the expansion coefficient of the basis function, and the Hermite basis function $H_n(x)$ is expressed as follows:
$$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2},$$
which is a polynomial of degree $n$: the $n$-th derivative of $e^{-x^2}$ is the product of a polynomial of degree $n$ with $e^{-x^2}$, and $H_n(x)$ is called the Hermite polynomial.
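As a minimal sketch of the expansion above, the snippet below fits a truncated Hermite series to a noisy one-dimensional signal using NumPy's physicists'-Hermite helpers; the test signal, degree, and noise level are illustrative choices, not taken from the paper.

```python
import numpy as np
from numpy.polynomial import hermite as H  # physicists' Hermite polynomials H_n

# Noisy signal standing in for a (rescaled) high-frequency series
x = np.linspace(-1.0, 1.0, 200)
y = np.sin(3.0 * x) + 0.1 * np.random.default_rng(0).normal(size=x.size)

# Least-squares fit of the coefficients a_n in f(x) ~ sum_n a_n H_n(x)
coeffs = H.hermfit(x, y, deg=8)

# Evaluate the truncated expansion: a smooth, low-dimensional summary of y
y_hat = H.hermval(x, coeffs)
print(coeffs.shape)  # one coefficient per basis function, H_0 .. H_8
```

The coefficient vector `coeffs` plays the role of the function feature vector extracted from the series.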

Bagging framework
Bagging is a method for improving the accuracy of learning algorithms. The flow of the Bagging algorithm is as follows. The input is the sample set D = {(x1,y1),(x2,y2),...,(xm,ym)}, the weak learner algorithm, and the number of weak-learner iterations T; the output is the final strong learner f(x). Specifically, for t = 1,2,...,T: first, the training set is sampled at random with replacement m times to obtain a bootstrap sample Dt containing m examples; then, the weak learner Gt(x) is trained on Dt. Finally, the predictions of the T learners are combined by simple arithmetic averaging (or another weighting method) to give the final output. The specific process is shown in Figure 1.
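The flow above can be sketched in a few lines of NumPy. The `LinearLearner` weak learner and the synthetic data are stand-ins chosen for the sake of a self-contained example; the paper's actual weak learner is not specified here.

```python
import numpy as np

class LinearLearner:
    """Minimal least-squares learner standing in for the weak learner G_t."""
    def fit(self, X, y):
        Xb = np.c_[np.ones(len(X)), X]          # add intercept column
        self.w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return self
    def predict(self, X):
        return np.c_[np.ones(len(X)), X] @ self.w

def bagging_predict(X, y, X_new, base_learner, T=25, seed=0):
    """Draw T bootstrap samples D_t of size m, train one weak learner on each,
    and return the simple arithmetic mean of their predictions on X_new."""
    rng = np.random.default_rng(seed)
    m = len(X)
    preds = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)        # m draws with replacement -> D_t
        preds.append(base_learner().fit(X[idx], y[idx]).predict(X_new))
    return np.mean(preds, axis=0)               # aggregate into the strong learner f(x)

# Illustrative usage on synthetic data
rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 50)[:, None]
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)
pred = bagging_predict(X, y, X, LinearLearner)
```

Averaging over bootstrap resamples is what reduces the variance of the aggregated predictor relative to a single weak learner.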

LSTM Neural networks for Hermite basis function expansion and ensemble learning
LSTM is a long short-term memory network, an improved recurrent neural network that solves the problem that a plain RNN cannot capture long-range dependence. In this paper, the function information extracted by the Hermite basis function expansion is used to predict the residual sequence obtained from the LSTM neural network; the Bagging framework is used to capture the latent model structure between the components of the function feature vector and the response variable and to balance the variance and bias of the model; and a model averaging method based on distance covariance weighting is used to optimize the number of prediction periods of the LSTM neural network. The calculation proceeds as follows:
Step 1: the LSTM neural network predicts the stock price directly, giving $\hat{Y}_{t+1}, \hat{Y}_{t+2}, \ldots$, and the residual $e$ between the predicted values and the true $Y$ is formed;
Step 2: the Hermite basis function expansion is applied to the data to obtain the feature vector $A$;
Step 3: the Bagging framework performs ensemble learning on $e$ to predict the new residual sequence $\hat{e}$;
Step 4: the vector $A$ is paired with the $\hat{e}$ obtained in Step 3 to form the data pairs $(A, \hat{e})$;
Step 5: the vector $A$ is used to predict the residual, and the prediction is added back to the value from Step 1 to obtain $\hat{Y}$;
Step 6: the model averaging method based on distance covariance weighting optimizes $\hat{Y}$, giving the final model.
As for the distance correlation coefficient, it is used to study the dependence between two variables. Denote the pairwise Euclidean distances between sample points by $a_{kl} = |X_k - X_l|$ and $b_{kl} = |Y_k - Y_l|$; double-centering these matrices gives $A_{kl}$ and $B_{kl}$, and the sample distance covariance is
$$\mathrm{dCov}_n^2(X, Y) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl}.$$
When the variables $X$ and $Y$ are completely independent, the distance covariance attains its minimum, 0, indicating that the two variables share no duplicated information. Conversely, the larger the distance covariance, the stronger the dependence between the two variables, that is, the more information they share.
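The sample distance covariance described above (in the standard form due to Székely et al.) can be computed directly from the double-centered distance matrices; the snippet below is a minimal implementation with an illustrative comparison between a dependent and an independent pair of variables.

```python
import numpy as np

def dist_cov(x, y):
    """Sample distance covariance: double-center the pairwise Euclidean
    distance matrices of x and y, then average their elementwise product.
    Equals 0 when the empirical dependence between x and y vanishes."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    a = np.linalg.norm(x[:, None] - x[None, :], axis=-1)   # a_kl = |x_k - x_l|
    b = np.linalg.norm(y[:, None] - y[None, :], axis=-1)   # b_kl = |y_k - y_l|
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return float(np.sqrt(max((A * B).mean(), 0.0)))

# Dependent pair vs. independent pair: the former should score higher
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y_dep = x + 0.1 * rng.normal(size=300)
y_ind = rng.normal(size=300)
```

In the model averaging step, such scores can serve as (normalized) weights favoring candidate predictions that carry more information about the response.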

Data sources and pre-analysis
All data analyzed in this paper were obtained from the WIND database (https://www.wind.com.cn/). Three stocks were selected at random to determine whether the improved method is effective for random systems of different levels of complexity. The three randomly selected stocks are as follows: SH600777 corresponds to Shandong Xinchao Energy Co., Ltd., whose business scope includes oil and gas exploration, exploitation, and sales; SH600811 corresponds to Dongfang Group Co., Ltd., which invests in and operates seven industries: modern agriculture and health food, oil and gas and new energy, information security, finance, resources and products, port transportation, and new urbanization development; SH603833 corresponds to Opai Home Furnishing Group Co., Ltd., a comprehensive modern integrated home furnishing service provider in China. The standard LSTM neural network and the improved method are each used to predict the three stock prices, to verify whether the proposed method effectively improves on the LSTM prediction method. The first 10,000 time points of each stock are taken as the research object, and the trend charts of the three stocks are shown in Figure 2.

Comparative Analysis
In this paper, 2, 3, 4, 5, and 6 were selected as values of the period-number parameter of the LSTM, and different training sets were used to test the robustness of the model. The mean square error (MSE), mean relative error (MRE), and posterior error (BE) of the experimental model and the traditional LSTM model are reported in Table 1-Table 3 and Figure 3-Figure 5. Specifically, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, $\mathrm{MRE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{|y_i|}$, and the posterior error is the ratio of the standard deviation of the residuals to the standard deviation of the data.
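The three error measures can be computed as below. Note the posterior-error definition here follows the common residual-std/data-std convention (as in the grey-model literature); this is an assumption, since the paper's own formula is not reproduced in the text.

```python
import numpy as np

def mse(y, y_hat):
    # mean square error
    return float(np.mean((y - y_hat) ** 2))

def mre(y, y_hat):
    # mean relative error (assumes y has no zero entries)
    return float(np.mean(np.abs(y - y_hat) / np.abs(y)))

def posterior_error(y, y_hat):
    # assumed definition: std of residuals over std of data; smaller is better
    return float(np.std(y - y_hat) / np.std(y))
```

All three are zero for a perfect prediction, so lower values indicate a better model in each table.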

Fig. 5 Trend comparison of error values between LSTM and AFWLSTM of SH600811
The trend diagrams above show a clear difference between the standard LSTM neural network and the improved method in predicting the three stock prices: the improved method has smaller errors with a downward trend. This indicates, on the one hand, that the model has higher prediction accuracy and, on the other hand, that it has a degree of robustness. In summary, the experimental results show that the proposed method yields a significant improvement over the original LSTM neural network. The main reasons may be as follows. First, while reducing the dimension of the time series, the improved method uses the Hermite basis function expansion to extract function information and nonlinear information, which reduces the computational complexity and improves the estimation accuracy. Second, the model averaging method balances the bias and variance of the prediction model, which reduces the model's variance and lowers the generalization error.

Conclusion and Discussion
In this paper, we propose an improved LSTM neural network method based on Hermite basis function expansion and ensemble learning to predict high-frequency stock prices. First, we use the function information extracted by the Hermite basis function expansion to predict the residual sequence obtained from the LSTM neural network, and then process it within a Bagging framework, so as to capture the underlying model structure and balance the variance and bias of the model. In addition, we use a model averaging method based on distance covariance weighting to optimize the number of prediction periods of the LSTM neural network. The data analysis above shows that the proposed method effectively improves the prediction accuracy of the LSTM neural network and has a degree of robustness.
Several directions remain for future research. First, this paper uses the Hermite basis function expansion to improve the LSTM neural network, but whether other basis function expansions would affect the improvement differently remains to be verified. Second, we optimize the number of prediction periods of the LSTM neural network with a model averaging method based on distance covariance weighting, so whether other weighting or model averaging methods would optimize the period number more effectively needs further testing. Finally, whether the proposed method remains applicable to dense or sparse time series data, and how to handle time series of different frequencies, require further study.