A comparative research of portfolio return prediction based on the ARIMA and LSTM models

Abstract. The ongoing development of machine learning and deep learning techniques has made time series prediction more precise. These techniques have also achieved notable success in predicting portfolio returns in finance, so quantitative investment methods continue to progress and show strong applicability, giving investors a way to pursue excess returns in the market. This study predicts portfolio returns using machine learning and deep learning and compares the two approaches. It selects stock data for five representative companies from different fields, constructs the maximum Sharpe ratio portfolio, and then forecasts the portfolio's return with an ARIMA model and an LSTM model respectively. The results indicate that, first, both the ARIMA model and the LSTM model can be applied to predict future portfolio returns; second, the ARIMA model performs better in short-term prediction and on stationary data, whereas the LSTM model performs better in long-term prediction. The results may help investors choose an appropriate time series model for portfolio prediction and analysis.


Introduction
In the past decade, the rapid development of financial markets and their increasingly close ties with the world economy have brought investors more and greater opportunities and challenges. With changes in the external political and economic environment and the impact of COVID-19, the investment environment is becoming increasingly complex. Meanwhile, driven by breakthroughs in computing technology, deep learning and machine learning models have achieved great success in time series prediction [1]. As a result, quantitative investment methods have risen in prominence and shown strong applicability, with the potential to earn substantial excess returns in the market [2].
Numerous studies have been conducted in this field. For example, Centeno and Jackson studied only the cumulative return, daily return and maximum Sharpe ratio of the best venture capital portfolio based on the ARIMA model [2][3]. Wang [4] and Sen, Dutta and Mehtab [5] studied only the advantages of the LSTM model for time series prediction in finance, using measures such as cumulative return and Sharpe ratio. Each of these studies involves only one of machine learning or deep learning. In addition, Ma, Han and Wang [6] and Chen, Zhang and Mehlawat [7] studied the combined application of the LSTM model and the mean-variance model in predicting portfolio returns, and Senneset and Gultvedt [8] and Choi [9] examined the advantages of an ARIMA-LSTM hybrid model in predicting portfolio returns, but these studies rarely compare ARIMA with LSTM. Siami-Namini, Tavakoli and Namin evaluated and compared the ARIMA and LSTM models and concluded that the LSTM model is superior to the ARIMA model [10].
This paper therefore compares the ARIMA and LSTM models in forecasting the return of the maximum Sharpe ratio portfolio and examines the conditions under which each model forecasts better. Each model is used to build a separate stock forecasting model. Through data preprocessing, calculation and analysis, and visualization, two time series forecasting models are trained to predict future portfolio returns and are then compared. The results show that both the ARIMA and LSTM models can predict the portfolio return accurately. Notably, the ARIMA model is suited to predicting the returns of stationary portfolio data over short horizons, while the LSTM model can predict multi-dimensional portfolio returns over long horizons. This research offers a reference for investors who use deep learning and machine learning techniques to study the stock market and build portfolios with good return performance.
The remainder of this paper is organized as follows. Section 2 is the main part of the paper and covers the data, methods and results. Section 3 presents the conclusion.

Data
This paper uses data for five representative companies, Apple, Amazon, Microsoft, Netflix and Tesla, obtained from Yahoo Finance (https://finance.yahoo.com). Apple, Microsoft and Amazon are all among the world's top five technology companies. They hold leading positions in mobile computing, e-commerce and software services, and each has reached a market capitalization of $1 trillion. In addition, Netflix and Tesla are the top companies in the streaming media and new energy vehicle industries respectively. The selected data cover the period from July 5, 2013, to July 5, 2022. This nine-year period yields a total of 2265 observations, whose summary statistics are shown in Table 1.
According to these statistics, TSLA has the highest mean return, 0.0021, while AMZN and MSFT have the lowest, 0.0011. TSLA also has the largest maximum, 0.1990, and NFLX has the smallest minimum, -0.3512. Figure 1 shows the range of returns of each stock; some outliers that would affect the analysis should be removed. In addition, Figure 2 shows that the cumulative returns of the stocks also differ, as can be seen on the y-axis.
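As an illustration of the quantities summarized above, the following minimal sketch computes daily and cumulative returns from a price series. It assumes simple (not logarithmic) returns, which the paper does not specify; the function names and the price values are hypothetical and are not taken from the data set.

```python
def daily_returns(prices):
    """Simple daily returns: r_t = p_t / p_{t-1} - 1 (assumed definition)."""
    return [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]


def cumulative_returns(returns):
    """Cumulative return path: product of (1 + r_s) up to each day, minus 1."""
    path, growth = [], 1.0
    for r in returns:
        growth *= 1.0 + r
        path.append(growth - 1.0)
    return path


# Hypothetical closing prices, not data from the paper.
prices = [100.0, 102.0, 99.96, 104.958]
rets = daily_returns(prices)
cum = cumulative_returns(rets)
print(rets)
print(cum)
```

The last cumulative value equals the total return from the first to the last price, which gives a quick consistency check on the calculation.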

Sharpe ratio
If the Sharpe ratio is less than one, the fund's operational risk exceeds its rate of return; if it is greater than one, the fund's return exceeds its volatility risk. The Sharpe ratio measures the excess return an investor may receive for each additional unit of risk taken on. It can therefore be calculated for each portfolio as the ratio of investment return to risk taken, and a higher ratio indicates a better portfolio. The formula is:
$$\text{Sharpe ratio} = \frac{E(R_p) - R_f}{\sigma_p}$$
where $E(R_p)$ is the expected annualized return of the portfolio, $R_f$ is the annualized risk-free interest rate and $\sigma_p$ is the standard deviation of the annualized return.
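The ratio of excess return to risk described above, (E(R_p) - R_f) / σ_p, can be sketched in a few lines. The example below is a minimal illustration and not the paper's implementation: it assumes daily returns are annualized with 252 trading days per year, and the function name `sharpe_ratio`, the constant `TRADING_DAYS` and the sample values are hypothetical.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed number of trading days per year


def sharpe_ratio(daily_returns, annual_risk_free=0.0):
    """Annualized Sharpe ratio (E(R_p) - R_f) / sigma_p from daily returns."""
    annual_return = statistics.mean(daily_returns) * TRADING_DAYS
    annual_vol = statistics.stdev(daily_returns) * math.sqrt(TRADING_DAYS)
    return (annual_return - annual_risk_free) / annual_vol


# Hypothetical daily returns, not data from the paper.
sample = [0.010, -0.005, 0.007, 0.002, -0.003]
ratio = sharpe_ratio(sample, annual_risk_free=0.02)
print(ratio)
```

A higher risk-free rate lowers the ratio for the same return series, because only the return in excess of R_f is rewarded.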

ARIMA
This paper employs an autoregressive integrated moving average (ARIMA) model. In the notation ARIMA(p, i, q), "AR" stands for autoregressive and p is the number of autoregressive terms, "MA" stands for moving average and q is the number of moving average terms, and "I" stands for integrated, with i the order of differencing required to make the series stationary.
The ARIMA model requires the data to be stationary. Statistically, a series is stationary when its distribution does not vary over time. Non-stationary data fluctuate because of trends and must be transformed before analysis.
In the ARIMA model, "AR" is the autoregressive model. Its order p must be determined first; it indicates how many past periods of the series are used to predict the current value. The formula of the p-order autoregressive model is:
$$y_t = \mu + \sum_{j=1}^{p} \gamma_j y_{t-j} + \epsilon_t$$
where $\epsilon_t$ is the error, $p$ is the order, $\gamma_j$ is the autocorrelation coefficient, $\mu$ is the constant and $y_t$ is the current value.
The moving average model "MA" focuses on the accumulation of the error terms in the autoregressive model. The formula of the q-order moving average process is:
$$y_t = \mu + \epsilon_t + \sum_{j=1}^{q} \theta_j \epsilon_{t-j}$$
where $\theta_j$ is the moving average coefficient.
In the ARIMA model, "I" is differencing. A non-stationary series can be made more stationary by differencing, and first-order differencing is the most commonly used approach. The formula for the first-order difference is:
$$\Delta y_t = y_t - y_{t-1}$$
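The two building blocks described above, first-order differencing and a p-order autoregressive prediction, can be illustrated with a short sketch. It is a simplified illustration, not the fitted model used in this paper: the function names, coefficients and series values are hypothetical, and the error term of the prediction is set to its mean of zero.

```python
def first_difference(series):
    """First-order difference: y_t - y_{t-1}."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


def ar_one_step(history, coeffs, mu=0.0):
    """One-step AR(p) prediction: mu + sum_j coeffs[j-1] * y_{t-j}.

    The error term is replaced by its mean (zero).
    """
    p = len(coeffs)
    if len(history) < p:
        raise ValueError("need at least p past values")
    recent = history[-p:][::-1] if p else []  # y_{t-1}, y_{t-2}, ..., y_{t-p}
    return mu + sum(c * y for c, y in zip(coeffs, recent))


# A series with a linear trend becomes constant after one difference.
trend = [1.0, 3.0, 5.0, 7.0, 9.0]
diffed = first_difference(trend)

# Hypothetical AR(2) coefficients and return history.
pred = ar_one_step([0.2, -0.1, 0.4], coeffs=[0.5, 0.25], mu=0.01)
print(diffed, pred)
```

The differenced trend series is constant, which shows how one difference removes a linear trend before an AR or MA model is fitted.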

LSTM model
LSTM (long short-term memory) is currently one of the most widely used time series algorithms. It is a special type of recurrent neural network (RNN) that can learn long-term dependencies. Its main purpose is to address the vanishing and exploding gradient problems that arise when training on long sequences. In short, LSTM outperforms a regular RNN on longer sequences.
In addition, LSTM adds three gate layers: the "forget gate", the "memory gate" and the "output gate". The forget gate decides which information to discard and outputs a value from 0 to 1, where 1 means fully retained and 0 means fully discarded.
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
where $f_t$ is the forget gate, $\sigma$ is the sigmoid function, $x_t$ is the current input, $h_{t-1}$ is the previous hidden state, and $W$ and $b$ denote the weights and bias of each gate. The memory gate determines what information to remember. The closer its output is to 1, the more information is remembered.
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
where $i_t$ is the memory gate, $\tilde{c}_t$ is the memory cell and $c_t$ is the cell state. The output gate combines the state information with the current memory information.
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $o_t$ is the output gate and $h_t$ is the hidden state.
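To make the interaction of the three gates concrete, the sketch below performs a single LSTM time step, assuming the standard LSTM formulation with a sigmoid on each gate and tanh on the candidate memory and the output. For readability the input and hidden state are scalars; the function `lstm_step`, the `params` layout and all weight values are hypothetical and are not the trained model of this paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for scalar input and scalar hidden state.

    params maps each gate name ("f", "i", "c", "o") to a tuple
    (w_h, w_x, b): weight on h_prev, weight on x, and bias.
    """
    def pre(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x + b

    f = sigmoid(pre("f"))           # forget gate: share of c_prev that is kept
    i = sigmoid(pre("i"))           # memory gate: share of the candidate stored
    c_tilde = math.tanh(pre("c"))   # candidate memory
    c = f * c_prev + i * c_tilde    # new cell state
    o = sigmoid(pre("o"))           # output gate
    h = o * math.tanh(c)            # new hidden state
    return h, c


# Hypothetical weights chosen only for illustration.
params = {"f": (0.1, 0.2, 1.0), "i": (0.3, -0.1, 0.0),
          "c": (0.5, 0.4, 0.0), "o": (-0.2, 0.6, 0.0)}
h, c = lstm_step(x=0.5, h_prev=0.0, c_prev=2.0, params=params)
print(h, c)
```

With a large positive forget-gate bias and a large negative memory-gate bias, the cell state passes through almost unchanged, which is the mechanism that lets the network carry information across long sequences.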

Portfolio optimization
Figure 3 shows that Tesla has the highest variance among all stocks, followed by Netflix, while MSFT has the lowest variance. In addition, AAPL and AMZN have the lowest covariance, followed by AAPL and NFLX.
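Given expected returns and a covariance matrix, the volatility and Sharpe ratio of any weight vector can be computed directly, and a maximum Sharpe ratio portfolio can be approximated by sampling many weight vectors. The sketch below is one possible approach under stated assumptions and is not necessarily the procedure used in this paper: it assumes long-only weights that sum to one and a zero risk-free rate by default, and the function names and the three-asset inputs are hypothetical.

```python
import math
import random


def portfolio_stats(weights, mean_returns, cov, risk_free=0.0):
    """Return (expected return, volatility, Sharpe ratio) for one weight vector."""
    n = len(weights)
    exp_ret = sum(w * m for w, m in zip(weights, mean_returns))
    variance = sum(weights[a] * cov[a][b] * weights[b]
                   for a in range(n) for b in range(n))
    vol = math.sqrt(variance)
    return exp_ret, vol, (exp_ret - risk_free) / vol


def random_search_max_sharpe(mean_returns, cov, n_trials=5000, seed=0):
    """Sample long-only weights that sum to 1 and keep the best Sharpe ratio."""
    rng = random.Random(seed)
    best_weights, best_sharpe = None, -math.inf
    for _ in range(n_trials):
        raw = [rng.random() for _ in mean_returns]
        total = sum(raw)
        weights = [r / total for r in raw]
        sharpe = portfolio_stats(weights, mean_returns, cov)[2]
        if sharpe > best_sharpe:
            best_weights, best_sharpe = weights, sharpe
    return best_weights, best_sharpe


# Hypothetical annualized inputs for three assets (not values from the paper).
mean_returns = [0.20, 0.15, 0.30]
cov = [[0.040, 0.010, 0.020],
       [0.010, 0.030, 0.012],
       [0.020, 0.012, 0.160]]
weights, sharpe = random_search_max_sharpe(mean_returns, cov)
print(weights, sharpe)
```

Plotting the volatility and expected return of every sampled portfolio, colored by Sharpe ratio, gives a scatter of the kind used to visualize an efficient frontier. A numerical optimizer would give a more precise answer than random sampling.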
In the portfolio optimization, $r_p$ denotes the expected return of the portfolio, $\mu = E(X)$ is the expected value of $X$, and $I$ is the 5×1 unit column vector. The efficient frontier in Figure 4 helps identify the maximum Sharpe ratio portfolio. The horizontal axis of Figure 4 represents the volatility of the portfolio, and the vertical axis represents its expected return. According to the legend on the right, the greener a point is, the greater its Sharpe ratio, that is, the ratio of expected return to volatility. Table 2 shows that the maximum Sharpe ratio is 127.46%. The weights are shown in Table 3: AAPL has the largest weight, 37.42%, while NFLX has the smallest, 1.78%.
The returns of the maximum Sharpe ratio portfolio are shown as a time series in Figure 5. Because the ARIMA model requires stationary data, the Dickey-Fuller test is used to check stationarity. The null hypothesis is that the time series is non-stationary; if it can be rejected at a given confidence level, the series is considered stationary. According to Table 4, the p-value is far below 0.05, and the test statistic is far below the critical values at the 1%, 5% and 10% levels. Therefore, the data are stationary and differencing is not necessary.
Next, Figures 6 and 7 are used to identify the order of the model, that is, the values of p and q in ARIMA(p, i, q): p equals 3 and q equals 5. The value of i is 0 because no differencing is applied.
After the return of the portfolio is predicted with the trained ARIMA and LSTM models, Figure 8 shows that the ARIMA model performs better in the first 10 days of the prediction but deteriorates over longer horizons, where the predicted values begin to revert toward the average.
In contrast, according to Figure 9, the LSTM model can predict the complete 30 days and performs better in long-term prediction.
Fig. 8. Prediction of the ARIMA model over 30 days
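The tendency of ARIMA forecasts to drift toward the average at longer horizons can be illustrated with the simplest stationary case, a first-order autoregressive model, whose h-step-ahead forecast is mean + phi^h (y_t - mean). The sketch below is a simplified illustration and not the ARIMA(3, 0, 5) model fitted in this paper; the function name and all parameter values are hypothetical.

```python
def ar1_forecast(last_value, phi, mean, horizon):
    """h-step-ahead AR(1) forecasts: mean + phi**h * (last_value - mean)."""
    return [mean + phi ** h * (last_value - mean) for h in range(1, horizon + 1)]


# Hypothetical parameters, chosen only for illustration.
forecasts = ar1_forecast(last_value=0.03, phi=0.6, mean=0.001, horizon=30)
for day in (1, 5, 10, 30):
    print(day, forecasts[day - 1])
```

Because |phi| < 1 for a stationary series, the influence of the last observation decays geometrically, so forecasts many days ahead carry little information beyond the long-run mean. This is consistent with a model that is informative over the first days and flat afterwards.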

Conclusion
At present, most relevant studies examine the application of ARIMA and LSTM models in finance separately. This paper analyzes the differences between the ARIMA and LSTM models in predicting portfolio returns, in order to help investors choose more appropriate time series prediction models and obtain greater returns. It constructs the maximum Sharpe ratio portfolio from the stock data of several companies from 2013 to 2022 and predicts its future return. The ARIMA model is found to perform better in short-term prediction and on stationary data, whereas the LSTM model is more accurate in long-term prediction. However, whether these findings generalize requires further study, and the data sources could be extended.