Portfolio return prediction model based on gold and Bitcoin

Maximizing returns has always been investors' goal. Gold and Bitcoin are popular with investors because of their hedging properties and their volatility. However, markets are risky and are influenced by various economic, political, and environmental factors; as a result, Bitcoin and gold prices fluctuate wildly, leading to uncertain investments and uncertain returns. To maximize profit, this paper carries out data processing and model construction to support trading decisions. Based on a Markov decision process incorporating risk aversion, reduced transaction costs, and maintained liquidity, and assuming that the market is not affected by the reinforcement trading agent, deep reinforcement learning (DRL) is used to simulate trading. The model helps forecast portfolio returns and offers strong practical value to relevant practitioners.


Introduction
Market traders often buy and sell volatile assets with the goal of maximising their total returns, and there is usually a commission on every sale. Two such assets are gold and Bitcoin. A volatile asset's value fluctuates constantly over time, so traders can increase the total value of their holdings by buying and selling at the right moments, but different investment products charge different commissions per sale. Investing is not blind: the key issue in trading decisions is to execute the right decision at the right time [1]. Holding a variety of assets such as gold and Bitcoin requires even more strategy, as investors must decide whether to buy more of an asset, sell it, or continue to hold it in the portfolio. Given only the daily price changes known to date, we aim to derive the currently optimal trading strategy and thereby maximize the corresponding returns [2].
To explore the best trading strategy, our team studied gold and Bitcoin. Bitcoin can be traded every day, whereas gold can only be traded on trading days, and the two carry different trading commissions. We studied trading strategies by evaluating different portfolios of gold and Bitcoin over the five-year trading period from November 9, 2016 to October 9, 2021. The aim is to develop a trading model, based only on the daily price data available up to the current day, that gives the best trading strategy for that day, including how much of each asset in the portfolio to buy, sell, or continue to hold [3].

Data selection
The data provided consist of the daily closing price of a troy ounce of gold and the price of a bitcoin over the five-year period commencing 9/11/2016. The data are pre-processed in the following fashion [4]: gold is not traded when the market is closed, and as a result many dates are absent from the gold data; additionally, some dates included in the data set lack a corresponding closing price. All such dates are labelled as "not trading gold" [5].
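The preprocessing step above can be sketched as follows. This is a minimal illustration using synthetic data in place of the actual data files; the column names are assumptions, not ones stated in the paper.

```python
import pandas as pd

# Synthetic stand-ins for the two price series (values are illustrative).
btc = pd.DataFrame({
    "Date": pd.date_range("2016-11-09", periods=7, freq="D"),
    "BTC_Price": [720.0, 721.5, 714.9, 710.2, 711.6, 705.3, 712.8],
})
gold = pd.DataFrame({
    # Gold trades only on weekdays, so weekend dates are simply absent.
    "Date": pd.bdate_range("2016-11-09", periods=5),
    "Gold_Price": [1277.3, 1266.8, 1236.5, 1221.7, 1226.9],
})

# Align both series on the full Bitcoin calendar; gold gets NaN on the
# dates it was not traded.
data = btc.merge(gold, on="Date", how="left")

# Label every date without a gold closing price as "not trading gold".
data["gold_tradable"] = data["Gold_Price"].notna()
```

Keeping an explicit tradability flag, rather than dropping the rows, lets the model still trade Bitcoin on days when the gold market is closed.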

Technical indicator addition
It is assumed that the trader is able to trade both gold and bitcoin in fractional shares. To simplify the model, all prices of each equity are divided by an arbitrary constant c_equity, fixed per equity across all dates, and the resultant scaled values p_(date, equity) / c_equity are used as the minimum units of trade for that equity [6].
For the model to mimic the decisions of human traders and fully exploit the patterns in the data, several common technical indicators are added for each equity on each date:
(1) The n-day simple moving average, i.e. the average closing price over the previous n trading days, where we take n as 5, 10 and 30 to capture price trends over periods of different lengths;
(2) The difference between the price of the current day and that of the previous day;
(3) The prices of the previous k days, where we take k as 1, 2, 3 and 4, as the most recent prices have the largest influence on future prices under the common random walk hypothesis;
(4) The moving average convergence divergence (MACD), calculated by subtracting the 26-period exponential moving average (EMA) from the 12-period EMA, together with the signal line, a nine-day EMA of the MACD;
(5) The volatility, calculated as the standard deviation over the previous 20 trading days, which measures the intensity of recent fluctuations;
(6) The bias, the percentage deviation of the current day's price from the 5-day simple moving average.
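The six indicators above can be computed with standard rolling-window operations. The sketch below applies them to a single synthetic price series; the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic closing-price series standing in for gold or bitcoin prices.
rng = np.random.default_rng(0)
price = pd.Series(100 + rng.standard_normal(60).cumsum())

ind = pd.DataFrame({"close": price})
# (1) n-day simple moving averages for n = 5, 10, 30
for n in (5, 10, 30):
    ind[f"sma_{n}"] = price.rolling(n).mean()
# (2) day-over-day price difference
ind["diff_1"] = price.diff()
# (3) lagged prices for k = 1..4
for k in (1, 2, 3, 4):
    ind[f"lag_{k}"] = price.shift(k)
# (4) MACD = EMA(12) - EMA(26), with a 9-day EMA signal line
ema12 = price.ewm(span=12, adjust=False).mean()
ema26 = price.ewm(span=26, adjust=False).mean()
ind["macd"] = ema12 - ema26
ind["signal"] = ind["macd"].ewm(span=9, adjust=False).mean()
# (5) volatility: standard deviation over the previous 20 trading days
ind["vol_20"] = price.rolling(20).std()
# (6) bias: percentage deviation from the 5-day simple moving average
ind["bias_5"] = (price - ind["sma_5"]) / ind["sma_5"] * 100
# Drop the leading rows where long-window indicators are undefined,
# mirroring the truncation of the data set described in the paper.
ind = ind.dropna()
```

The longest lookback here is the 30-day moving average, so the first 29 rows are dropped, which corresponds to the truncation of the early dates described in the text.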
Because the calculation of some indicators requires data from previous dates, the rows up to 21/10/2016 are dropped so that every remaining date has all the indicators stated above [7].

The reinforcement learning environment
Similar to the tutorial FinRL: Multiple Stock Trading, we model the portfolio management process as a Markov Decision Process (MDP) and formulate our trading goal as a maximization problem [8]. The agent is trained using Deep Reinforcement Learning (DRL) algorithms, and the components of the reinforcement learning environment are:
Action: the action space describes the allowed actions an agent can take in a state. In our task, the action a(t) ∈ A corresponds to the portfolio weight vector decided at the beginning of time slot t, and must satisfy two constraints: each element lies between 0 and 1, and all elements sum to 1.
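One common way to enforce the two weight constraints above is to pass the network's raw output through a softmax; this is an illustrative assumption, not a detail stated in the paper.

```python
import numpy as np

def to_portfolio_weights(raw_action):
    """Map an unconstrained action vector onto the action space:
    every weight in [0, 1] and all weights summing to 1 (softmax)."""
    z = np.asarray(raw_action, dtype=float)
    z = z - z.max()          # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Example: raw network outputs for (cash, gold, bitcoin)
w = to_portfolio_weights([0.2, -1.0, 3.5])
```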
Reward function: the reward function r(s(t), a(t), s(t+1)) is the incentive for an agent to learn a profitable policy. We use the logarithmic rate of portfolio return, log(a(t) · y(t)), as the reward, where y(t) is the price relative vector.
DRL algorithms: we use two popular deep reinforcement learning algorithms, Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO).
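The reward just defined can be computed in one line. In this sketch the price relative vector holds each asset's next-day price divided by its current price, with the cash entry fixed at 1; the example values are illustrative.

```python
import numpy as np

def log_return_reward(weights, price_relatives):
    """Logarithmic rate of portfolio return: log(w(t) . y(t))."""
    return np.log(np.dot(weights, price_relatives))

w = np.array([0.2, 0.3, 0.5])     # weights for cash, gold, bitcoin
y = np.array([1.0, 1.01, 0.98])   # price relative vector (cash stays at 1)
r = log_return_reward(w, y)       # negative if the portfolio lost value
```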

The Model of Deep Reinforcement Learning
The model is based on Deep Reinforcement Learning (DRL), which combines the framework of reinforcement learning and deep learning techniques.
The Markov Decision Process (MDP) is used to describe the trading process. The environment is modelled to simulate that of a trader in the gold and bitcoin market. To train the model, each episode randomly picks a T-day period and slices the data to those dates only. The trader's portfolio is initialized as (1000, 0, 0), i.e. 1000 in cash and no gold or bitcoin. The environment then informs the agent of the first date's state. On each subsequent day, it sequentially takes the following steps: the environment receives the action a given by the agent and updates the trader's portfolio as P' = f(a, P) according to the equity allocation function.
The environment then shifts to the next date, calculates the reward of the action under the new date's market, and informs the agent of the reward as well as the new date's state.
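The episode loop described in the two steps above can be condensed into the following sketch. The class and method names are illustrative, and this version ignores transaction fees and the minimum trade units for brevity.

```python
import numpy as np

class TradingEnv:
    """Minimal sketch: random T-day window, portfolio (cash, gold, btc)."""

    def __init__(self, prices, horizon=30, seed=0):
        self.prices = np.asarray(prices, dtype=float)  # shape (days, 2)
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Randomly pick a T-day period and start from (1000, 0, 0).
        self.t = int(self.rng.integers(0, len(self.prices) - self.horizon))
        self.end = self.t + self.horizon
        self.portfolio = np.array([1000.0, 0.0, 0.0])  # cash, gold, btc
        return self._state()

    def _state(self):
        return np.concatenate(([self.portfolio[0]], self.prices[self.t]))

    def _value(self):
        return self.portfolio[0] + self.portfolio[1:] @ self.prices[self.t]

    def step(self, weights):
        # Reallocate the total value according to the agent's weight vector.
        value = self._value()
        self.portfolio[0] = weights[0] * value
        self.portfolio[1:] = weights[1:] * value / self.prices[self.t]
        self.t += 1                             # shift to the next date
        reward = np.log(self._value() / value)  # log rate of portfolio return
        return self._state(), reward, self.t >= self.end

# Example usage with a synthetic price grid (gold, bitcoin columns).
env = TradingEnv(np.column_stack([np.linspace(1200, 1300, 100),
                                  np.linspace(700, 60000, 100)]))
state = env.reset()
```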
A naive way of defining the allocation function is to choose the number of shares whose value ratio is closest to the target weights; we denote this definition as f_0. The agent needs to develop a set of rules that decide the best action based on the given state and reward. This is handled by a DdpgAgent from TensorFlow. The agent is trained with a learning rate of 0.05, batch size 60, and 400 iterations. After training, the model is run on the entire data set to obtain the overall profit rate of the strategy. When the transaction fee is zero, the model trained with f_0 performs quite well. However, when the transaction fee gets higher the model tends to degenerate, owing to the high frequency of transactions it takes. We mitigate this by redefining the allocation function: to avoid unnecessary transaction fees, no transaction is conducted if the daily change rates of the gold and bitcoin prices are both below a threshold percentage.
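The no-transaction filter just described can be expressed as a simple predicate. The threshold value below is an assumed placeholder, since the paper leaves the percentage unspecified.

```python
def should_trade(gold_change, btc_change, threshold=0.01):
    """Rebalance only if at least one daily price change rate reaches
    the threshold; otherwise hold and pay no commission.
    (threshold=0.01, i.e. 1%, is an illustrative placeholder.)"""
    return abs(gold_change) >= threshold or abs(btc_change) >= threshold

should_trade(0.002, 0.005)  # both moves small -> hold, no fees paid
should_trade(0.002, 0.030)  # large bitcoin move -> rebalance
```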
In this section we use the Markov Decision Process (MDP) to model the trading process; the objective is to find the policy under which the objective function attains its maximum.
To describe the stochastic process of the dynamic market, an MDP is deployed as follows:
State s = [p, h, b]: a vector that includes the stock prices p ∈ R_+^D, the stock shares h ∈ Z_+^D, and the remaining balance b, where D denotes the number of stocks and Z_+ denotes the non-negative integers.
Action a: a vector of actions over the D stocks. The actions that can be taken on each stock include selling, buying, and holding, which respectively decrease, increase, and leave unchanged the stock shares h.
Reward r(s, a, s'): the incentive mechanism for an agent to learn a better action; the immediate reward of taking action a at state s and arriving at the new state s'.
Q-value Q_π(s, a): under policy π, the expected reward of taking action a at state s.
The state transition of trading proceeds as follows. In the portfolio, one of three possible actions can be taken on each stock d (d = 1, 2, ..., D).
Selling: an action is taken at time t and the stock price then changes at time t + 1; consequently, the portfolio value may be updated, for example, from P_0 to P_1, where the portfolio value is p · h + b. Now consider the following constraints and assumptions, which are introduced mainly for practical reasons: risk aversion, transaction costs, and liquidity.
Liquidity: we assume that the market is not affected by our reinforcement trading agent, so our orders can be rapidly executed at the closing price.
Non-negative balance b: the balance should never fall below 0. At time t, the stocks are divided into sets for selling S, buying B, and holding H, where S ∪ B ∪ H = {1, ..., D} and the sets do not overlap. Let p^B = [p_d : d ∈ B] and k^B = [k_d : d ∈ B] be the vectors of prices and numbers of shares bought for the stocks in the buying set.
We can similarly define p^S and k^S for the selling stocks, and p^H and k^H for the holding stocks. The balance constraint then states:
b_{t+1} = b_t + (p_t^S)^T k_t^S − (p_t^B)^T k_t^B ≥ 0
Transaction cost: transaction costs are incurred on each trade. We assume the commission to be 0.1% of the traded value for gold and 0.2% for bitcoin:
c_gold = p_t^T k_t × 0.1%,  c_bitcoin = p_t^T k_t × 0.2%
Risk aversion for market crashes: there is always a possibility that some event causes the market to crash. To control for this, the financial turbulence index is used to measure extreme movements of asset values:
turbulence_t = (y_t − μ) Σ^{-1} (y_t − μ)^T
where y_t denotes the current period's returns, μ the mean of historical returns, and Σ the covariance of historical returns.
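The balance update with the two commission rates can be sketched as below. The function name and the example trade values are illustrative, not taken from the paper.

```python
# Commission rates assumed in the paper: 0.1% for gold, 0.2% for bitcoin.
FEES = {"gold": 0.001, "btc": 0.002}

def update_balance(balance, sells, buys):
    """Apply sale proceeds and purchase costs, each net of its commission.
    sells / buys: lists of (asset, price, shares) tuples."""
    for asset, price, shares in sells:
        balance += price * shares * (1 - FEES[asset])  # proceeds minus fee
    for asset, price, shares in buys:
        balance -= price * shares * (1 + FEES[asset])  # cost plus fee
    assert balance >= 0, "balance must remain non-negative"
    return balance

# Example: sell 0.2 oz of gold at 1800, buy 0.02 BTC at 40000.
b = update_balance(1000.0,
                   sells=[("gold", 1800.0, 0.2)],
                   buys=[("btc", 40000.0, 0.02)])
```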

Comprehensive analysis
We conducted a comprehensive analysis to obtain the optimization mode under different circumstances, calculating the maximum return when the combined gold and bitcoin transaction costs are 0.005, 0.01, 0.02, 0.02, 0.04, 0.05 and 0.1, respectively [9].
The case with gold transaction cost 0.005 and Bitcoin transaction cost 0.01 is shown in Figure 1. Using this model, we obtain the maximum investment utility and compute both how it changes and where the return attains its maximum. As can be seen from the figure, different transaction costs shift the location of the maximum, because transaction costs change investment expectations along the time series [10]. Different transaction costs do not have much impact on the maximum value itself, presumably because the maximum is largely driven by past daily prices, and long-run trading has stabilized it.

Conclusion
The model can produce an optimization and improvement scheme for the relevant trading mode in a relatively short time, so that efficiency, cost, and quality all reach a good level. It applies well to this class of optimization problems and can readily find the optimal trading mode, which is significant for future research and development, though it needs further refinement across multiple modes. At the same time, it can effectively control the potential risk: after optimizing toward the goals of risk reduction and utility improvement, the optimal value can be obtained. During optimization, the model requires comprehensive analysis of multiple data sources to meet the specific requirements, and the various conditions and values must be kept within reasonable ranges; this is more difficult, and calls for further optimization and control of the model. For reinforcement learning and similar methods, the relevant parameters must be considered carefully: hyperparameter tuning is demanding, and optimization algorithms should be used to obtain a reasonable learning rate, iteration number, and other parameters. The gold and bitcoin price prediction models based on DRL are scientific and reasonable, and can capture not only the general characteristics of the price data but also its time-series characteristics.