US Stocks Market Movements Prediction: Classification of SP-500 Using Machine Learning Technology

In the field of quantitative investment, investors focus on quantifying risk and maximizing expected return. A reliable toolkit for predicting stock price movement is also very important to them. In this paper, five stocks that are components of the SP-500 Index are selected, and the Mean-Variance method is used to optimize a portfolio of these stocks. Moreover, five machine learning methods are compared on the task of stock price movement prediction. The results show that the combination of “AMZN”, “MSFT” and “AAPL” can achieve a good expected return at a low risk. In addition, the Artificial Neural Network method has the highest accuracy in predicting the multiclass stock price movement. Our research can serve as a reference for investors in risk quantification and stock price prediction.


Introduction
Trend prediction for stocks is very useful in the field of intelligent quantitative investment [1]. However, the risk present in the market influences people's judgement [2]. The classical framework of modern portfolio theory assumes that the investor only cares about the first two moments of the return distribution: mean and variance. The variance serves as a measure of risk, and the risk-adjusted portfolio performance is measured by the Sharpe ratio, which the investor wants to maximize. However, a rational investor would not consider all variability as risk, but only the variability below a certain benchmark that depends on the investor's preferences [3]. Once the risk has been minimized, the return may be maximized if a suitable model can be found to predict the trend of the investment combination.
Many researchers have proposed methods to measure risk. For example, Sharpe [4,5] defined the Sharpe ratio (SR) as the ratio of the expected portfolio return to its standard deviation. Note that the portfolio that maximizes the SR lies on the efficient frontier in the case without a risk-free asset, and it can be obtained as a solution to Markowitz's optimization problem. Krokhmal et al. [6] proposed downside risk measures, which depend only on the positive values of the loss function or the negative values of the return. Risk theory has further established that quantile-based measures are well suited to quantifying risk [7][8][9].
The stock market is dynamic and unpredictable, so predicting actual stock prices is a very challenging task. Previous research on predicting binary future stock movements with machine learning techniques did not achieve very persuasive accuracy scores. Hence, this paper also considers the situation in which stock prices are relatively stable, and transforms the binary classification into a ternary classification.
In this paper, two steps are proposed to obtain the best investment combination. The first step is Mean-Variance-based risk optimization; the second step is machine learning-based trend prediction. Based on our work, positions can be adjusted according to the predictions, which helps to reduce investment risk. Finally, machine learning-based classifiers for predicting stock market movements are compared to analyze how different methods perform in stock price prediction. Several machine learning models are implemented, including Logistic Regression (LR), Random Forest (RF), Gradient Boosting Machine (GBM), Support Vector Machine (SVM), and Artificial Neural Network (ANN). Model performance is tested using the accuracy score.
The rest of this paper is organized as follows: Section 2 introduces the methods; Section 3 presents the results and discussion; Section 4 gives the conclusion.

Data Preparation
In this paper, the data sets of the target stocks and indexes are derived from the Yahoo Finance website [16]. To be more specific, six features (high price, low price, open price, close price, adjusted close price and volume) are extracted directly. In the trend prediction step, these features are translated into fifteen indicators as the inputs of our machine learning models, including "Simple Moving Average", "Weighted Moving Average", "Exponential Moving Average", "Momentum", "Williams %R", "Accumulation/Distribution Oscillator", "Moving Average Convergence Divergence", "Stochastic Oscillator", "Relative Strength Index", "Commodity Channel Index", "Accumulation/Distribution Index", "Negative Directional Indicator", "Positive Directional Indicator", "Aroon Down Indicator", "Aroon Up Indicator".
The formulas of these indicators are given in Table 1.
Table 1. Formulas of the technical indicators. Rows legible in this copy: Weighted n-day Moving Average (WMA), with n = 5 here; Moving Average Convergence Divergence, MACD(n); Stochastic Oscillator (SO); Aroon Down Indicator, (5 − number of days since the 5-day low) / 5 × 100; Aroon Up Indicator, (5 − number of days since the 5-day high) / 5 × 100.
In the above table, C_t is the closing price at time t, H_t is the high price and L_t is the low price. α is the smoothing factor used in the exponential averages. HH_t and LL_t are the highest and lowest prices in the last t days, respectively. UP_t and DW_t are the upward and downward price changes, respectively.
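To make a few of the indicators concrete, the following minimal sketch computes the simple moving average, the weighted moving average and the two Aroon indicators in plain Python. It follows the commonly used textbook definitions, which are an assumption here and may differ in detail from the formulas of Table 1; the function names and the choice of an (n + 1)-day Aroon window (today plus the n prior days) are illustrative only.

```python
def simple_moving_average(closes, n=5):
    """Mean of the last n closing prices."""
    window = closes[-n:]
    if len(window) < n:
        raise ValueError("need at least n closing prices")
    return sum(window) / n


def weighted_moving_average(closes, n=5):
    """Linearly weighted mean: the latest close gets weight n, the oldest weight 1."""
    window = closes[-n:]
    if len(window) < n:
        raise ValueError("need at least n closing prices")
    weights = range(1, n + 1)  # ordered from oldest to newest
    return sum(w * c for w, c in zip(weights, window)) / sum(weights)


def aroon_up(highs, n=5):
    """(n - days since the highest high) / n * 100, over today plus the n prior days."""
    window = highs[-(n + 1):]
    if len(window) < n + 1:
        raise ValueError("need at least n + 1 high prices")
    # Position of the most recent occurrence of the maximum in the window.
    last_max = max(range(n + 1), key=lambda i: (window[i], i))
    days_since_high = n - last_max
    return (n - days_since_high) / n * 100


def aroon_down(lows, n=5):
    """(n - days since the lowest low) / n * 100, over today plus the n prior days."""
    window = lows[-(n + 1):]
    if len(window) < n + 1:
        raise ValueError("need at least n + 1 low prices")
    # Position of the most recent occurrence of the minimum in the window.
    last_min = max(range(n + 1), key=lambda i: (-window[i], i))
    days_since_low = n - last_min
    return (n - days_since_low) / n * 100
```

With steadily rising highs the Aroon Up value is 100, and with steadily falling highs it is 0, which matches the intuition that the indicator measures how recently the extreme price occurred.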
Different from the ten prominent indicators selected by [14], indicators calculated using volume are also considered in this paper. Moreover, the previous movement of the stock is another important feature. All the indicators are treated as continuous variables. The movement of the stock is turned into a categorical variable. To be more specific, if the next-day close price increases by more than 0.5%, the label is "+1"; if the next-day close price decreases by more than 0.5%, the label is "-1"; in all other cases, the label is "0". All rows with null values are deleted. The data is then split into training and testing sets using the "train_test_split" function of the sklearn toolkit, with 20% of the data used as the testing set.
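The labelling rule can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code; the function name `label_movements` and the `threshold` parameter are hypothetical, and the change is assumed to be measured relative to the current day's close.

```python
def label_movements(closes, threshold=0.005):
    """Label each day by the next-day close change: +1, -1 or 0.

    The last day has no next-day close, so the result has one
    element fewer than the input.
    """
    labels = []
    for today, tomorrow in zip(closes, closes[1:]):
        change = (tomorrow - today) / today
        if change > threshold:
            labels.append(1)    # rise of more than 0.5%
        elif change < -threshold:
            labels.append(-1)   # fall of more than 0.5%
        else:
            labels.append(0)    # relatively stable
    return labels


labels = label_movements([100.0, 101.0, 100.9, 100.0])
```

Here the three transitions are a rise of 1%, a fall of about 0.1% and a fall of about 0.9%, so `labels` is `[1, 0, -1]`. The labelled rows could then be passed to sklearn's `train_test_split` with `test_size=0.2` to obtain the split described above.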

Mean Risk Optimization
In this paper, the best return should be obtained within the limits of the risk criteria. Therefore, risk quantification is of great importance. For example, the benchmark below which volatility is considered downside volatility depends on the investor's preferences, and it does not necessarily coincide with the mean portfolio return. If returns are perfectly symmetrically distributed, the investor needs to target the volatility below the benchmark directly. Finally, the downside deviation can be calculated by formula (1).
In formula (1), T denotes the length of the time series, r denotes the returns, and B is the benchmark set by the investor.
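As an illustration of the quantities T, r and B, a common form of the downside deviation is the square root of the average of min(r_t − B, 0)^2 over t = 1, …, T. The sketch below implements that form; it is an assumption that formula (1) has exactly this shape (in particular the division by T rather than T − 1), and the function name is hypothetical.

```python
import math


def downside_deviation(returns, benchmark=0.0):
    """Root mean square of the shortfalls of the returns below the benchmark."""
    if not returns:
        raise ValueError("returns must not be empty")
    # Returns at or above the benchmark contribute zero.
    shortfalls = [min(r - benchmark, 0.0) for r in returns]
    return math.sqrt(sum(s * s for s in shortfalls) / len(returns))


dd = downside_deviation([0.02, -0.01, 0.03, -0.03], benchmark=0.0)
```

Only the two negative returns contribute in this example, so the result is sqrt((0.0001 + 0.0009) / 4), roughly 0.0158. Upside variability is ignored, which is the difference from the ordinary standard deviation.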

Machine Learning for Predicting the Movement
In this paper, five classical machine learning-based methods (LR, RF, GBM, SVM, ANN) are used to predict the movement of the stock price.

Logistic Regression
In this paper, logistic regression is fitted to the features calculated in Subsection "Data Preparation". The stock price movement prediction is formulated as a three-class classification problem, with the result categorized as "up (1)", "neutral (0)" or "down (-1)". Historical data is used to train the model, which then predicts the class at the next timestamp. The probability of the class at the next timestamp can be obtained by formula (2).
where β_0 represents the constant term and x_i represents the i-th feature. Taking the natural logarithm of the likelihood function, that is, the joint density of the n samples under formula (2), gives the log-likelihood, which is maximized as shown in formula (3).
where x_ki represents the k-th feature variable at the i-th timestamp, k = 0,1,2, …, m. Hyperparameter tuning is performed for Logistic Regression: two penalties, L1 and L2, are used, and different regularization strength values are tried.
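For reference, the standard binary forms of the logistic model and its log-likelihood are written out below. It is an assumption that formulas (2) and (3) take this shape: the coefficient symbols β are introduced here for illustration, the labels are taken as y_i ∈ {0, 1}, and x_{0i} = 1 is used to absorb the constant term. Note that i indexes features in the first expression and samples in the second, following the wording of the text. For the three-class problem, such a model would typically be applied one-vs-rest or replaced by its multinomial generalization.

```latex
% Formula (2): probability of the positive class given features x_1, ..., x_m
P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{i=1}^{m} \beta_i x_i\right)\right)}

% Formula (3): log-likelihood over n samples, with x_{0i} = 1
\ln L(\beta) = \sum_{i=1}^{n} \left[ y_i \sum_{k=0}^{m} \beta_k x_{ki}
  - \ln\!\left(1 + \exp\!\left(\sum_{k=0}^{m} \beta_k x_{ki}\right)\right) \right]
```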

Random Forest
In this paper, random forest is used to aggregate a large number of decision trees, so that the trees compensate for the inherent defects of any single model or single set of model parameters.
The random forest is built in four steps. Step 1 is random sampling to train a decision tree. From a data set of size N, N samples are drawn with replacement, one at a time, to form a training set of N samples. These N samples are used to train a decision tree, serving as the samples at its root node.
Step 2 is to select random attributes for node splitting. Each sample has M attributes; whenever a node of the decision tree needs to be split, m attributes are randomly selected from these M attributes, with m << M. A criterion (such as information gain) is then used to select one of the m attributes as the split attribute of the node.
Step 3 is to repeat step 2 until the tree can no longer be split.
Step 4 is to repeat steps 1 to 3 to create a forest of decision trees. Hyperparameter tuning is also performed for the Random Forest Classifier. Different values are tried for the number of trees in the forest, the maximum depth of a tree, the minimum number of samples required to split an internal node, the minimum number of samples required at a leaf node, and the number of features considered at each split.
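The two sources of randomness in steps 1 and 2 can be sketched in a few lines of plain Python. This is only an illustration of the sampling, not a full random forest; the helper names, the seed and the sizes (N = 10 rows, M = 16 attributes, m = 4) are assumptions chosen for the example.

```python
import random


def bootstrap_sample(rows, rng):
    """Step 1: draw N rows with replacement from a data set of N rows."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


def random_attribute_subset(num_attributes, m, rng):
    """Step 2: pick m distinct candidate attributes out of M for one split."""
    return sorted(rng.sample(range(num_attributes), m))


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
rows = list(range(10))                 # stand-in for N = 10 training rows
sample = bootstrap_sample(rows, rng)   # some rows repeat, some are left out
candidates = random_attribute_subset(16, 4, rng)  # m = 4 of M = 16 attributes
```

Because each tree sees a different bootstrap sample and each split sees a different attribute subset, the trees are decorrelated, which is what makes averaging their votes useful.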

Gradient Boosting Machine
In this paper, a gradient boosting machine is used to iteratively combine a number of simple models, called "weak learners", into a "strong learner" with a higher accuracy score.
The best result in multiple classifications can be expressed as formula (4).
Hyperparameter tuning is also performed for the Gradient Boosting Classifier. Different values are tried for the number of boosting stages, the maximum depth of a tree, the minimum number of samples required to split an internal node, the minimum number of samples required at a leaf node, and the number of features considered at each split.
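A tuning run over the five hyperparameters listed above could look like the following sketch, which uses scikit-learn's `GridSearchCV`. The synthetic data from `make_classification`, the candidate values in `param_grid` and the 3-fold cross-validation are all assumptions made for illustration; the paper does not state its grid or its search procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the indicator matrix: 16 features, 3 classes.
X, y = make_classification(n_samples=200, n_features=16, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hypothetical candidate values; small lists keep the sketch fast.
param_grid = {
    "n_estimators": [20, 40],      # number of boosting stages
    "max_depth": [2, 3],           # maximum depth of each tree
    "min_samples_split": [2],      # samples required to split an internal node
    "min_samples_leaf": [1, 5],    # samples required at a leaf node
    "max_features": ["sqrt"],      # features considered at each split
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best model
```

Selecting the hyperparameters by cross-validation on the training set, and reporting accuracy on the held-out 20% only once, keeps the test score from being inflated by the search itself.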

Support Vector Machine
In this paper, the Support Vector Machine (SVM) is used to maximize the margin in the feature space built from the features calculated in Subsection "Data Preparation". Assume input vectors x_i ∈ R^n, i = 1,2 … N, and class labels y_i ∈ {+1, −1}, i = 1,2 … N; the resulting decision boundary is defined by formula (5).
The problem can then be turned into a quadratic programming problem, shown in formulas (6) to (8).
The constraint in formula (7) is 0 ≤ α_i ≤ C. The Gaussian kernel is used as the kernel function, and the classification decision function is given by formula (9).
Hyperparameter tuning is also performed for the Support Vector Classifier. Two kernel types, linear and RBF, are used, and different regularization strengths are tried.
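For reference, the standard soft-margin SVM formulation that matches this description is written out below. It is an assumption that formulas (5) to (9) correspond to these expressions in this order; the symbols w, b, α_i, C and γ are the usual ones (weight vector, bias, Lagrange multipliers, regularization constant and kernel width) and are introduced here for illustration.

```latex
% Formula (5): separating hyperplane (decision boundary)
w^{\top} x + b = 0

% Formulas (6)-(8): dual quadratic program of the soft-margin SVM
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)   % (6)
\text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, N                  % (7)
\sum_{i=1}^{N} \alpha_i y_i = 0                                                       % (8)

% Formula (9): decision function with the Gaussian (RBF) kernel
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right),
\qquad K(x_i, x) = \exp\!\left(-\gamma \lVert x_i - x \rVert^{2}\right)
```

The formulation above is binary; for the three movement classes a multiclass scheme such as one-vs-one or one-vs-rest would be needed on top of it.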

Artificial Neural Network
The artificial neural network is a typical machine learning algorithm, designed to imitate processes of the human brain and built from a collection of interconnected nodes or units called neurons. This paper uses a three-layer feed-forward neural network, which contains one input layer, one hidden layer, and one output layer. The input layer has 16 units (neurons), which correspond to the 16 technical indicators. The output layer has 3 units, which correspond to the 3 classes, and it uses the 'softmax' activation function. Hyperparameter tuning is also performed for the ANN: different numbers of hidden-layer neurons, different numbers of epochs, different batch sizes, and three different activation functions are tested. The set of parameters that gives the best test accuracy score uses 10 neurons in the hidden layer, 100 epochs, a batch size of 20, and 'softplus' as the activation function.
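The forward pass of the selected architecture (16 inputs, 10 softplus hidden units, 3 softmax outputs) can be sketched in plain Python as follows. The weights here are random and untrained, so the output only illustrates the shape of the computation; the helper names and the mapping of output positions to the labels -1, 0 and +1 are assumptions.

```python
import math
import random


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(features, hidden_w, hidden_b, out_w, out_b):
    hidden = [softplus(z) for z in dense(features, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


hidden_w, hidden_b = random_matrix(10, 16), [0.0] * 10  # 16 inputs -> 10 hidden
out_w, out_b = random_matrix(3, 10), [0.0] * 3          # 10 hidden -> 3 classes

features = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # one row of indicators
probs = forward(features, hidden_w, hidden_b, out_w, out_b)
predicted_class = [-1, 0, 1][probs.index(max(probs))]   # assumed class order
```

Training would adjust the weights over the stated 100 epochs with a batch size of 20; the predicted class is simply the output unit with the largest softmax probability.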

Results and Discussion
In this paper, the risk optimization step is carried out to minimize the risk, which yields a satisfactory investment strategy for obtaining the best return. The results of "risk optimization" and "prediction of stock movement" are presented below.

Results of Risk Optimization
Five stocks (AAPL, AMZN, FB, GOOG, MSFT), which are components of the S&P 500 Index, are chosen to form an optimal portfolio. The stock prices over ten years (from 2010-12-31 to 2021-1-1) are extracted from the Yahoo Finance website [16], and the corresponding daily returns of these five stocks are calculated. The expected returns and covariance matrix of the five stocks are estimated from the historical data. In the experiment, the risk measure used to optimize the portfolio is the standard deviation, and the objective is to maximize the Sharpe ratio of the generated portfolio. The experiment is based on the toolkit Riskfolio-Lib. The optimal weights are shown in Table 2. The results show that "AMZN", "MSFT" and "AAPL" have the largest weights, whereas "GOOG" and "FB" have the smallest; that is, among the five stocks, Facebook and Google receive relatively small weights. Note also that the weights of "AMZN", "MSFT" and "AAPL" sum to more than 0.9. In Figure 1, the x-axis represents the quantified risk and the y-axis represents the expected return. The red star marks the maximum risk-adjusted return shown in the legend. Figure 1 shows that the best expected return may lie at a risk (standard deviation) of 27.8%.
The results show that our strategy of selecting "AMZN", "MSFT" and "AAPL" can achieve a good expected return at a low risk. This may be because the three companies are large, stable corporations that can withstand adverse risk from the stock market.
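The objective being maximized can be illustrated with a small, self-contained sketch. It computes the expected return, standard deviation and Sharpe ratio of a portfolio, and finds the best weights for two assets by brute force. This is not the Riskfolio-Lib procedure used in the experiment; the two-asset inputs, the zero risk-free rate and the function names are assumptions chosen so that the arithmetic is easy to follow.

```python
import math


def portfolio_stats(weights, mean_returns, cov, risk_free=0.0):
    """Return (expected return, standard deviation, Sharpe ratio)."""
    n = len(weights)
    expected = sum(w * m for w, m in zip(weights, mean_returns))
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    volatility = math.sqrt(variance)
    return expected, volatility, (expected - risk_free) / volatility


def best_two_asset_weights(mean_returns, cov, steps=100):
    """Scan long-only weights (w, 1 - w) and keep the highest Sharpe ratio."""
    best = None
    for i in range(steps + 1):
        weights = [i / steps, 1 - i / steps]
        sharpe = portfolio_stats(weights, mean_returns, cov)[2]
        if best is None or sharpe > best[1]:
            best = (weights, sharpe)
    return best


# Hypothetical annualized inputs for two uncorrelated assets.
mean_returns = [0.10, 0.20]
cov = [[0.04, 0.0],
       [0.0, 0.09]]

weights, sharpe = best_two_asset_weights(mean_returns, cov)
```

With these inputs the scan settles on weights of about 0.53 and 0.47 with a Sharpe ratio of about 0.83. A dedicated optimizer solves the same problem for five assets without enumerating weights, but the quantity on the y-axis, the quantity on the x-axis and the ratio marked by the red star are the three values returned by `portfolio_stats`.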

Results of Movement Prediction
From the results of the risk optimization above, four stocks that have relatively large weights in the optimal portfolio are selected as the targets for movement prediction. The SP-500 Index is also considered in the experiment. The accuracy score is used to evaluate the performance of the five machine learning-based methods, and the results are shown in Table 3. The accuracy scores for the four stocks are quite low, with most of the scores below 0.500, which is of little use for prediction. This suggests that the movements of the selected stocks are close to random. However, the results for the SP-500 are much higher than those for the stocks, with the highest accuracy score of 0.640 reached by the ANN method. To sum up, ANN is the most suitable model in most cases for movement prediction.
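To put an accuracy score in context for a three-class problem, it helps to compare it with a trivial baseline that always predicts the most frequent training class. The sketch below computes both; the function names and the toy labels are hypothetical and are not taken from Table 3.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [most_common] * len(y_test))


model_accuracy = accuracy([1, 0, -1, 0], [1, 0, 0, 0])
baseline_accuracy = majority_baseline([0, 0, 1, -1], [0, 1, 0, -1])
```

In this toy case the model scores 0.75 and the baseline 0.5. A classifier is only informative to the extent that it beats such a baseline, which is a useful check when the "0" class dominates in calm markets.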
According to the combination of risk optimization and movement prediction results, it is reasonable to conclude that the US stock market has great volatility, which means that quantifying the expected return from US stocks carries a high risk. However, after our optimization, the combination of "AMZN", "MSFT" and "AAPL" can help to achieve a good expected return at a low risk. Moreover, the machine learning methods can aid in predicting stock movement to some extent.

Conclusion
This paper carried out two steps to quantify the expected return: risk control and stock price movement prediction. First, portfolio optimization is used to quantify the risk and keep it at a low level. Second, five machine learning methods are compared on US stock movement prediction.
Our proposed strategy can help investors pursue their maximum expected return while keeping the risk within their accepted range. The prediction results of the ANN method can be a meaningful reference for investors when changing their positions. Moreover, indexes are much more suitable for movement prediction than individual stocks. In conclusion, our research can serve as a reference for applications of quantitative investment.