A Two-Stage ARIMA Model via Machine Learning and its Application in Stock Price Prediction

Abstract. Stock price prediction has long been a central topic in finance and quantitative investment. Since stock price time series tend to contain both linear and nonlinear features, the traditional ARIMA model has certain limitations in modeling such data. Motivated by this, this paper innovatively uses intraday transaction data of the stock market as auxiliary information and proposes an improved ARIMA stock price prediction model based on machine learning methods. The idea is to use the ARIMA model to predict the linear component of the data, while machine learning algorithms (RF, XGBoost, LSTM) predict the nonlinear residual component. The empirical results show that, compared with the traditional ARIMA model, the proposed model effectively improves prediction accuracy and is robust in stock price prediction. Moreover, because the framework is flexible, it can be equipped with whichever machine learning method achieves the best prediction accuracy in a given application scenario. In addition, the model averaging method can be applied within the two-stage framework to further improve accuracy, and mixed-frequency or high-frequency data can be further mined.


Introduction
As a barometer of the economy, the stock market performs the functions of value discovery and resource allocation optimization. Predicting stock prices is therefore of great significance for national macro-control and for investors' risk avoidance. However, accurately predicting stock prices remains challenging due to the noisy, nonlinear, and non-stationary nature of stock price data.
Based on the research results of scholars in related fields, it can be found that the main methods used for stock price prediction include time series analysis methods and machine learning methods.
The methods of time series analysis mainly include the autoregressive integrated moving average model (ARIMA) and the generalized autoregressive conditional heteroscedasticity model (GARCH). Yuxia Wu and Xin Wen [1] established an ARIMA model to predict the movement law and trend of the closing price of Huatai Securities using 250 periods of data. Yue Chang and Yuxu Feng [2] fitted an ARIMA-GARCH model to the daily yield of the CSI 300 for stock price prediction. Franses P. H. et al. [3] improved predictive performance by correcting residual outliers of the GARCH model. However, these models assume that stock price data obey a specific linear relationship, whereas stock price data tend to be non-stationary and contain potential nonlinear relationships.
Major machine learning methods include artificial neural networks (ANNs), support vector machines (SVMs), and long short-term memory (LSTM) neural networks. Lei L. et al. [4] used wavelet neural networks (WNNs) to predict stock price data after rough-set feature dimensionality reduction. Xinbin Yang and Xiaojuan Huang [5] used SVMs with nonlinearly extended samples in time series models for stock price prediction. Kai Chen et al. [6] used LSTM models to model and forecast Chinese equity returns. However, machine learning methods rely heavily on feature engineering for stock price prediction, and well-constructed feature engineering depends on personal experience. Different feature engineering choices have a large impact on a model's prediction accuracy, which limits machine-learning-based stock price prediction.
It is worth noting that the research on stock price prediction mentioned above fits and forecasts based on the data itself. However, due to the nonlinearity and non-stationarity of stock price data, the residuals often contain a substantial amount of valid information beyond what the fitted model captures. Extracting this information from the residuals is therefore essential to improving the accuracy of stock price predictions. In order to extract the useful information in the prediction error more effectively and eliminate the influence of systematic error on prediction accuracy, this paper proposes dividing the stock price time series into a linear main part and a nonlinear residual part and predicting them separately. First, we use the ARIMA model with a rolling window to predict the stock price data, obtaining the predicted values of the linear main part, and extract the residuals to form a new series. Subsequently, the extracted nonlinear residuals are modeled and predicted by common machine learning algorithms such as random forest, XGBoost, and LSTM. Finally, the prediction of the nonlinear residual part is added to the ARIMA prediction of the linear main part to form the final stock price forecast.
The main contributions of this paper are: 1) It proposes a method for dividing a data series into a linear main part and a nonlinear residual part and modeling and predicting the two parts separately. 2) It combines traditional time series prediction methods with machine learning algorithms to extract as much valid information as possible from the residual sequence, which improves the model's predictive performance. 3) The proposed model is developed within a general framework; it is very flexible and can be equipped with the machine learning algorithm that achieves the best prediction accuracy for the data series of a given application scenario.

ARIMA Model
The full name of the ARIMA model is the autoregressive integrated moving average model, which was jointly proposed by Box and Jenkins [7]. It is a linear model for analyzing and studying time series problems. The model can effectively capture the linear structure of time series data and performs well in short-term prediction. In the ARIMA model, three parameters need to be set: the autoregressive order $p$, the difference order $d$, and the moving average order $q$. The mathematical expression of ARIMA$(p,d,q)$ is as follows:

$$\left(1-\sum_{i=1}^{p}\varphi_i L^i\right) y_t = \left(1+\sum_{j=1}^{q}\theta_j L^j\right)\varepsilon_t, \qquad y_t=(1-L)^d x_t$$

where $y_t$ is the series value of period $t$ after the $d$-th order difference transformation of the original series $x_t$; $\varphi_i$ and $\theta_j$ denote the autoregressive parameters and moving average parameters, respectively; $L$ is the lag operator; and $\varepsilon_t$ is a white-noise error term.
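As an illustration of the linear stage, the sketch below fits only the AR part of an ARIMA(p, 1, 0) model by least squares on the first-differenced series. It is a simplified, assumed stand-in for a full ARIMA fit (the MA terms and order selection are omitted), not the estimation procedure used in the paper.

```python
import numpy as np

def fit_ar(y_diff, p):
    """Least-squares AR(p) fit on an already-differenced series."""
    n = len(y_diff)
    # Design matrix: row t holds the lagged values [y_{t-1}, ..., y_{t-p}]
    X = np.column_stack([y_diff[p - k:n - k] for k in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(X, y_diff[p:], rcond=None)
    return phi

def forecast_next(y, p=1):
    """One-step-ahead level forecast with d=1 differencing."""
    y_diff = np.diff(y)
    phi = fit_ar(y_diff, p)
    next_diff = phi @ y_diff[::-1][:p]   # most recent lags first
    return y[-1] + next_diff             # undo the differencing

# Toy series: AR(1) differences (phi = 0.6) accumulated into a price level
rng = np.random.default_rng(0)
d = np.zeros(500)
for t in range(1, 500):
    d[t] = 0.6 * d[t - 1] + 0.1 * rng.standard_normal()
y = 10 + np.cumsum(d)
print(forecast_next(y, p=1))
```

On this simulated series the recovered AR coefficient is close to the true 0.6, which is all the sketch is meant to show.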

Machine Learning related algorithms
The random forest (RF) model was proposed by Breiman [8]. As one of the bagging methods of ensemble learning, RF draws training sample subsets from the original training set by bootstrap resampling, trains a decision tree on each subset, and combines the trees into a forest; the classification or regression result is determined by the votes (or average) of the decision trees.
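The bootstrap-and-aggregate principle behind RF can be sketched with a toy base learner. The depth-1 "stump" below is an assumed stand-in for the full decision trees of an actual random forest, chosen only to keep the example short.

```python
import numpy as np

def fit_stump(x, y):
    """A depth-1 regression tree (stump): split at the median of x."""
    s = np.median(x)
    return s, y[x <= s].mean(), y[x > s].mean()

def bagged_predict(x_train, y_train, x0, n_trees=50, seed=0):
    """Bagging: average predictions of stumps fit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x_train), len(x_train))  # bootstrap sample
        s, left, right = fit_stump(x_train[idx], y_train[idx])
        preds.append(left if x0 <= s else right)
    return np.mean(preds)

# Toy data: y is roughly 2x, so the ensemble should track that slope
rng = np.random.default_rng(1)
x = rng.random(200)
y = 2 * x + 0.05 * rng.standard_normal(200)
print(bagged_predict(x, y, 0.75))
```

Averaging over bootstrap resamples reduces the variance of the individual stumps, which is the point of bagging.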
XGBoost is a gradient boosting method whose objective function consists of a training loss and a regularization term. Using a second-order Taylor approximation of the objective function, a greedy algorithm searches for the split point with the highest score and expands the leaf nodes step by step. On the one hand, this ensures that the tree structure does not become too complex and overfit while the loss is minimized; on the other hand, it improves computational efficiency [9][10].
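The second-order approximation yields the standard closed forms: optimal leaf weight $w^*=-G/(H+\lambda)$ and split gain $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{G^2}{H+\lambda}\right]-\gamma$, where $G$ and $H$ sum the per-point gradients and hessians. A minimal sketch for squared loss (where $g_i=\hat{y}_i-y_i$ and $h_i=1$), not the actual XGBoost implementation:

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, mask, lam=1.0, gamma=0.0):
    """Gain of splitting a leaf into the points in `mask` vs the rest."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)
    return 0.5 * (score(g[mask], h[mask]) + score(g[~mask], h[~mask])
                  - score(g, h)) - gamma

# Squared loss from an all-zero prediction: g_i = pred_i - y_i, h_i = 1
y = np.array([1.0, 1.2, 3.0, 3.3])
g, h = np.zeros(4) - y, np.ones(4)
mask = np.array([True, True, False, False])  # candidate split: low vs high
print(split_gain(g, h, mask), leaf_weight(g[mask], h[mask]))
```

The candidate split separating the two clusters yields a positive gain, so a greedy builder would accept it.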
Recurrent neural networks (RNNs) are mainly used to process time series data; their characteristic is that the output of a neuron at a certain moment can be fed back as input to the network, giving the network memory. Hochreiter et al. [11] proposed the long short-term memory (LSTM) neural network to overcome the problem that RNNs cannot capture long-term dependencies in time series.
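A single LSTM cell step can be sketched in a few lines. The weights below are random and purely illustrative (not a trained model); the point is the gating structure that lets the cell state carry long-term information.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W, U, b stack the four gates (input, forget, output, candidate)."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g          # cell state: gated long-term memory
    h_new = o * np.tanh(c_new)     # hidden state: gated output
    return h_new, c_new

d_in, d_hid = 3, 4
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4 * d_hid, d_in))
U = 0.1 * rng.standard_normal((4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)
h = c = np.zeros(d_hid)
for x in rng.standard_normal((5, d_in)):   # run over a length-5 sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)
```

Because the forget gate multiplies rather than repeatedly squashes the cell state, gradients decay far more slowly than in a plain RNN.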

Two-stage prediction model
Suppose that the stock price data series $y_t$ consists of two parts: the linear main part $L_t$ and the nonlinear residual part $N_t$, i.e. $y_t = L_t + N_t$. The first stage uses the ARIMA model to predict $L_t$; the second stage uses a machine learning algorithm to predict $N_t$; and the two predictions are summed to form the final forecast.

The dataset for each stock is divided into two parts: daily data and intraday trading data. The daily data contains 1,285 records; intraday transaction data are sampled at 30-minute intervals, so each daily record corresponds to 8 intraday records, giving 10,280 records of intraday transaction data. Both parts include six factors: the opening price, closing price, highest price, lowest price, volume, and amount [12].

BCP Business & Management, FIBA 2022, Volume 26 (2022)
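The two-stage recombination can be sketched with toy stand-ins for both stages: a naive lag-1 forecast is assumed in place of the rolling ARIMA forecast, and a residual-mean predictor in place of RF/XGBoost/LSTM. Only the assembly of the final forecast is the point here.

```python
import numpy as np

def two_stage_forecast(y, window=100, k=5):
    """Sketch of the two-stage scheme with toy stand-ins for both stages.

    Stage 1 (linear):   naive one-step forecast y_hat_t = y_{t-1}.
    Stage 2 (residual): predict the next residual as the mean of the
                        last k stage-1 residuals.
    """
    linear_pred = y[window - 1:-1]       # y_{t-1} predicts y_t
    actual = y[window:]
    resid = actual - linear_pred         # stage-1 residual series
    resid_pred = resid[:-1][-k:].mean()  # stage-2 forecast of the last residual
    final_pred = linear_pred[-1] + resid_pred
    return final_pred, actual[-1]

# Toy series with constant drift: stage 2 learns the drift the naive
# stage-1 forecast systematically misses, so the final forecast is exact.
y = np.arange(120.0)
final, actual = two_stage_forecast(y)
print(final, actual)  # → 119.0 119.0
```

In this contrived case the residual model fully corrects the stage-1 bias; with real stock data the correction is partial, which is what Tables 2-4 quantify.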
The candlestick charts for the three stocks over the period studied are shown in Figure 2. The basic characteristics of the opening prices of the three sample stocks are analyzed, and the results are shown in Table 1.

2) Standardization
In order to improve prediction accuracy, we standardize the residuals input to the two-stage prediction model so that the dataset has mean 0 and variance 1; that is, we adopt a z-score normalization strategy.

3) Rolling Forecast
Prediction of stock data is essentially time series prediction, which may not perform well over long horizons. Therefore, a rolling prediction method is adopted: the data in a time window of fixed length is used to predict the value at the next time point. When using the ARIMA model to predict daily data, 100, 150, and 250 periods of data are taken as rolling time windows. When using machine learning algorithms to fit the residuals, the dataset is divided into a training set and a test set at a ratio of 7:3, and the length of the training set is used as the rolling time window.
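The z-score standardization and rolling-window construction described above can be sketched as follows (a minimal illustration, not the paper's exact preprocessing code):

```python
import numpy as np

def zscore(x):
    """Standardize a series to mean 0 and variance 1 (z-score)."""
    return (x - x.mean()) / x.std()

def rolling_windows(y, window):
    """Yield (history window, next value) pairs for rolling one-step forecasts."""
    for t in range(window, len(y)):
        yield y[t - window:t], y[t]

y = np.arange(10.0)
pairs = list(rolling_windows(y, 4))
print(len(pairs))  # 6 one-step forecasting tasks from a length-10 series
z = zscore(y)
```

Each yielded pair is one forecasting task: fit on the window, predict the next point, then slide the window forward by one.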

1) Mean Squared Error
The mean squared error measures the difference between the predicted values and the true values as the average of the squared errors.

2) Mean Relative Error
The mean relative error measures the average error of the predicted values relative to the original data.

3) Posterior Difference Test
The posterior difference test is based on the residuals of each period and examines the probability that points with small residuals occur. The posterior error ratio $C$ and the small error probability $P$ are calculated as follows:

$$C = \frac{S_2}{S_1}, \qquad P = P\left\{\,\left|\varepsilon_t - \bar{\varepsilon}\right| < 0.6745\, S_1 \right\}$$

where $S_2$ is the standard deviation of the residual sequence, $S_1$ is the standard deviation of the original sequence, and $\varepsilon_t$ denotes the residual of period $t$ [13]. In the subsequent analysis, we only use the posterior error ratio $C$ as the result of the posterior difference test.
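The three indicators can be computed as follows. Here MRE is taken as the mean absolute error relative to the actual values, and C as the ratio of the residual standard deviation to the original-series standard deviation, matching the definitions above; these are assumed readings of the paper's metrics, stated for concreteness.

```python
import numpy as np

def mse(y, yhat):
    """Mean squared error."""
    return np.mean((y - yhat) ** 2)

def mre(y, yhat):
    """Mean relative error: average |residual| relative to the actual value."""
    return np.mean(np.abs(y - yhat) / np.abs(y))

def posterior_error_ratio(y, yhat):
    """Posterior error C = S2 / S1: residual std over original-series std."""
    return (y - yhat).std() / y.std()

y = np.array([10.0, 11.0, 12.0, 13.0])
yhat = np.array([10.5, 10.5, 12.5, 12.5])
print(mse(y, yhat), mre(y, yhat), posterior_error_ratio(y, yhat))
```

Lower values of all three indicators mean a better forecast; C in particular compares how much variability remains in the residuals against the variability of the series itself.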

Experimental Results
In the first stage, we take 100 periods of data as the rolling prediction time window of the ARIMA model. In the second stage, three machine learning methods (RF, XGBoost, and LSTM) are used to fit the residuals of the first stage, and the rolling method is used to forecast the residual of the opening price of the next trading day. The residual forecast is then added to the first-stage forecast. Comparing this with the original first-stage forecast, three indicators (MSE, MRE, and C) are computed to measure the error. This process is programmed in Python, and the forecast results for the three stocks are shown in Tables 2-4. The residuals of the final predicted values of the three stocks are shown in Figure 3. Analysis of the results shows that no matter which model is used to fit the residuals in the second stage, the error is smaller than that of the first-stage ARIMA prediction alone. This follows from the construction of the model itself: the second stage recovers nonlinear information that the linear ARIMA stage misses. From the fitting results, RF and XGBoost contribute most to reducing the total error of the model. It can be seen that using intraday stock data as supplementary information to forecast the opening price of the next trading day improves forecast accuracy and makes up for the limitations of the traditional ARIMA model.

Robustness Check
We change the number of rolling forecast periods, setting the rolling time windows to 150 and 250 periods, respectively. For these three stocks, the revised forecast results are shown in Tables 5-7. After re-fitting with the changed window lengths, the fitting performance of the three machine learning models remains consistent with that of the 100-period forecast, and the errors all show a clear downward trend. Therefore, the model is considered robust and stable: it can effectively forecast stock opening prices under rolling time windows of various lengths, which also supports the further promotion and generalization of the model.

Conclusion
This paper proposes an improved ARIMA stock price prediction model based on machine learning methods. The ARIMA model and machine learning algorithms (RF, XGBoost, LSTM) extract the linear main part and the nonlinear residuals of a stock price data series, respectively. Experimental results show that the improved ARIMA stock price prediction model constructed in this paper has higher prediction accuracy than the traditional ARIMA model. Robustness testing, both by using multiple stock datasets and by changing the rolling time window for the same stock data, shows that the proposed model is robust. Since many real-world time series exhibit non-stationarity and nonlinearity, the model framework presented in this paper has application value for studying this type of data.
On the basis of this paper, the selection of the prediction model for the nonlinear residual can be studied further. Different models extract different information from a data series and therefore often produce different forecasts. Which model is best? Further discussion can be based on model selection and model averaging theory. Scholars have proposed various methods and criteria for model selection, such as AIC, BIC, cross-validation, and the Lasso. However, model selection often introduces uncertainty into the modeling process, which in turn leads to underestimating the actual variance. The model averaging method, by contrast, can in most cases circumvent these drawbacks of model selection. For example, asymptotically optimal model averaging methods directly aim to reduce the estimation or prediction risk, making the goal more explicit [14]. In addition, with the development of computer technology and the advent of the big data era, stock price information can be recorded at higher frequencies. High-frequency intraday trading price data contain much information, so the research and analysis of high-frequency data also has application value. At present, high-frequency data can be studied using data smoothing methods or functional data analysis methods.