Time Series Analysis and Prediction on Bitcoin

Bitcoin is the best-known digital currency in the world and has become an investment asset. Prediction is a central concern in investment markets, and the economics literature includes various studies on why the price of Bitcoin changes and how its trend can be predicted. Predicting the trend of the Bitcoin price can therefore help Bitcoin investors. The data come from www.coingecko.com, with the Bitcoin price ordered by time. A time series model is fitted to the Bitcoin price over the period from 28 April 2013 to 22 August 2022 in order to predict the future price trend. Data preprocessing includes attribute removal, a stationarity test, and differencing. The ARIMA method, which can produce high accuracy in short-term prediction, is adopted to predict the price. The Akaike Information Criterion (AIC) and a residual check are used to select the best prediction model among the candidates. The results show that ARIMA(5,1,2) has the smallest AIC of all candidate models, and the residual check also supports ARIMA(5,1,2) as the best model for predicting the next fourteen periods.


Introduction
The 2008 financial crisis, triggered by problems with subprime loans, led to a global recession and the devaluation of the U.S. dollar. The collapse of Lehman Brothers, one of the largest investment banks on Wall Street, and the freezing of cash liquidity also fostered a negative attitude towards traditional currencies. It was in that period that a person or group using the name Satoshi Nakamoto introduced an innovative cryptocurrency named Bitcoin [1]. The supply of Bitcoin is capped at 21 million. The currency can be traded and acquired. Users mine bitcoins by contributing the computing power of their own computers; more specifically, a user obtains bitcoins by finding a valid hash for a block. Initially, miners received 50 BTC for each block mined; because the block reward is halved periodically, it is now about 6 BTC, and about 2.7 quadrillion hashes have to be computed to generate a single bitcoin [2]. Solving hashes is not a waste of effort. For example, when Bitcoin transactions occur, they must pass a security verification, which requires miners to solve the hashes. This is known as the Proof of Work model, and it ensures that bitcoins are unique, cannot be duplicated and are genuine. Instead of relying on traditional banks to verify the authenticity of the currency, this power is devolved to every participant [3]. In recent years, the world economy has shown parallels with the 2008 recession. Owing to the COVID-19 pandemic, interest rate hikes and the Russia-Ukraine war, the S&P 500 from 2020 to the present has followed a trend similar to that of 2008. As people lose confidence and trust in traditional currencies, interest in cryptocurrency has grown, which motivates this essay to take Bitcoin as its research object.
Bitcoin's value stems from its scarcity, decentralization and public trust [4]. First, Bitcoin has a finite supply, just like gold. At the same time, measures to prevent double spending keep improving, ensuring that the bitcoins received by the payee are unique and no longer belong to the payer. The limited supply has the advantage of preventing the high inflation that can follow from printing large amounts of paper money [5]. Second, decentralization makes Bitcoin convenient. Its price is not easily influenced by government policy on national currencies, such as an increase in interest rates. It is also not exposed to a corrupt central authority or to counterfeit transactions. Finally, since Bitcoin uses a blockchain as its distributed ledger and is difficult to steal, it is trusted by the public [6].
Although Bitcoin has innovative advantages, as a new type of currency its price can still be influenced by factors such as politics and human behavior. The global geopolitical situation is one factor that affects the Bitcoin price, and it may cause investors to overreact to the market [7]. In addition, an examination of the connection between Bitcoin trading volume and price confirms that the price can fall because of panic selling, which demonstrates the impact that human behavior and social factors can have on cryptocurrency [8]. These two phenomena suggest that Bitcoin is currently a highly volatile investment vehicle for day traders and short-term traders. Such irrational investment behavior also supports the applicability of the "Greater Fool Theory" to Bitcoin (i.e. investors tend to hold bitcoins while the price is rising and sell them to greater fools before the price falls) [4]. However, Bitcoin is a new type of currency and its market is still in its infancy. Over time, buying property with Bitcoin or shopping with Bitcoin at Brisbane Airport in Australia has become a reality [4]. Overseas transactions can also be paid easily with Bitcoin, for example airlines paying for fuel at foreign airports, which could save a lot of time [4]. In the long run, the current volatility and bubble may gradually smooth out and diminish [7]. To understand Bitcoin and its potential to transform from a vehicle for investors and speculators into a stable currency and asset, this essay focuses on predicting the price of Bitcoin with time series models, as a starting point for analyzing the future development of Bitcoin and cryptocurrency.

Methodology
The research is divided into five steps, as shown in Figure 1.

Data Collection
The data for this study were obtained from www.coingecko.com. CoinGecko conducts fundamental analysis of the cryptocurrency market. In addition to price, volume, and market capitalization, CoinGecko tracks community growth, open-source code development, major events, and on-chain metrics [9]. The data are updated continuously, which keeps the final result current. CoinGecko offers a data export feature for easy access to the price history of each cryptocurrency. The CSV file exported from the CoinGecko site contains 3,402 records covering the Bitcoin price history from 28 April 2013 to 22 August 2022. The dataset has four attributes: snapped_at, price, market_cap, and total_volume.

Preprocessing
At this point the dataset is prepared for analysis. Data preprocessing consists of attribute removal, a stationarity test, and differencing. Attributes that have no bearing on the prediction, and might even make it less accurate, are removed; the attributes to keep are chosen based on the requirements of the ARIMA approach. A stationarity test is then carried out to see whether the data are stationary. This can be done by inspecting a plot of the data directly: the series is stationary if it fluctuates evenly around a horizontal line. An ACF plot is another way of checking stationarity. If the time series is not stationary, it should be made stationary by taking logarithms or by differencing. A log transformation can stabilize the variance of a series with non-constant variance. Differencing removes the trend, so that the statistical properties of the series no longer depend on time. After differencing, the data are checked again for stationarity; if they are still not stationary, they are differenced again. First-order differencing is defined in equation (1), where $y_t$ is the observation at time $t$ and $y'_t$ is the differenced series.

$$y'_t = y_t - y_{t-1} \tag{1}$$
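The log transformation and first-order differencing described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the R workflow used in the study; the function names and the sample prices are made up for the example.

```python
import math

def log_transform(series):
    """Natural log of each observation; values must be strictly positive."""
    return [math.log(x) for x in series]

def difference(series, lag=1):
    """Differencing: y'_t = y_t - y_{t-lag}. The result is `lag` items shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Made-up prices growing by 10% per step (not values from the CoinGecko data).
prices = [100.0, 110.0, 121.0, 133.1]

# Differencing the logged series gives the log returns of the price.
log_returns = difference(log_transform(prices))
```

Because the sample prices grow by a constant factor, every element of `log_returns` is close to `log(1.1)`: the trend in the level has become a constant in the differenced series.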

ARIMA Model
The Autoregressive Integrated Moving Average (ARIMA) method was developed in 1970 by George Box and Gwilym Jenkins and is also known as the Box-Jenkins method [10]. ARIMA models are a general class of models used to forecast time series data. The method suits autocorrelated data, since it forecasts from the past values of the series alone and disregards independent (explanatory) variables. Its assumptions about autocorrelation, trend, or seasonality must also be met. An ARIMA model can capture patterns in historical data that are difficult to interpret by inspection. ARIMA is highly accurate for short-term prediction, and it can also handle seasonal fluctuations in the data.
The first component is autoregression (AR), which takes advantage of the dependence between an observation and a number of lagged observations. An autoregression model forecasts the variable of interest using a linear combination of its own past values. The order p of the AR model indicates how many earlier values the current value depends on. An AR model of order p can be written as equation (2), where $\varepsilon_t$ is white noise, $c$ is a constant, $\phi_1, \dots, \phi_p$ are parameters, and p is the number of lagged observations in the model.

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t \tag{2}$$
The second component is the moving average (MA) model, which uses past forecast errors in a regression-like model. Its order q indicates how many previous residuals influence the current value. An MA model of order q can be written as equation (3), where $\varepsilon_t$ is white noise, $c$ is a constant, and $\theta_1, \dots, \theta_q$ are parameters.

$$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} \tag{3}$$
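The ARMA model combines the AR model and the MA model: the series depends on both its own values and its forecast errors in previous periods. An ARMA model can be written as equation (4).

$$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t \tag{4}$$

The ARMA model requires stationary data, so nonstationary data must first be made stationary through differencing. An ARIMA model therefore has three parts: AR, I (integration, the differencing step) and MA. An ARIMA(p, d, q) model can be written as equation (5), where $y'_t$ is the series after differencing d times.

$$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t \tag{5}$$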
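To make the structure of the model concrete, the following sketch computes a one-step-ahead ARMA forecast on a differenced series and then undoes one round of differencing (d = 1). It is written in Python with the standard library only; the function names, coefficients and input values are hypothetical and are not the fitted values of the study. The future error term is replaced by its expected value of zero.

```python
def arma_one_step(history, residuals, c, phi, theta):
    """One-step-ahead ARMA(p, q) forecast.

    history   : past values of the (differenced) series, most recent last
    residuals : past one-step forecast errors, most recent last
    phi, theta: AR and MA coefficients, ordered from lag 1 upwards
    """
    ar_part = sum(phi[i] * history[-1 - i] for i in range(len(phi)))
    ma_part = sum(theta[j] * residuals[-1 - j] for j in range(len(theta)))
    return c + ar_part + ma_part

def undifference(last_level, diff_forecast):
    """Invert first-order differencing: y_{T+1} = y_T + y'_{T+1}."""
    return last_level + diff_forecast

# Hypothetical ARMA(2, 1) on a differenced series (illustrative numbers only).
diffs = [0.2, -0.1, 0.3]
errors = [0.05, -0.02, 0.1]
next_diff = arma_one_step(diffs, errors, c=0.01, phi=[0.5, -0.2], theta=[0.4])
next_level = undifference(10.0, next_diff)
```

Here the forecast of the differenced series is 0.01 + 0.5(0.3) - 0.2(-0.1) + 0.4(0.1) = 0.22, and adding it to the last observed level gives the forecast on the undifferenced scale.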

Model Parameters Determination
The ARIMA model has three orders: p, d, and q. They can be determined by plotting the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The ACF plot shows the correlation between a time series and lagged versions of itself, and it is used to determine the value of q. The PACF plot shows the partial correlation of the time series with its own lagged values, and it is used to determine the value of p. The value of d is the number of differencing steps required to make the data stationary.
The orders are determined by looking for tail-off and cut-off patterns in the ACF and PACF plots. A tail-off is a gradual decline of the correlation towards 0. A cut-off is an abrupt drop of the correlation to below the 0.05 threshold. Only cut-offs between the 1st and 10th lags are considered; a cut-off beyond the 10th lag indicates that the data need to be differenced once more. The Acf() and Pacf() functions of the forecast package in RStudio can be used to draw the ACF and PACF plots. Table 1 shows the orders that were determined using the ACF and PACF plots.
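For readers who want to see what the ACF and PACF plots are built from, the sketch below computes the sample autocorrelations and derives the partial autocorrelations from them with the Durbin-Levinson recursion. It is a plain Python illustration with hypothetical function names, not the implementation inside the forecast package, and it assumes the series is not constant (otherwise the variance term is zero).

```python
def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf(series, max_lag):
    """Partial autocorrelations for lags 1..max_lag (Durbin-Levinson)."""
    r = acf(series, max_lag)
    phi_prev = []   # AR coefficients of the previous order
    out = []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = r[1]
        else:
            num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out

# Tiny made-up series, only to show the call pattern.
example = [1.0, 2.0, 3.0, 4.0, 5.0]
example_acf = acf(example, 2)
example_pacf = pacf(example, 2)
```

The lag-1 partial autocorrelation always equals the lag-1 autocorrelation; from lag 2 onwards the PACF removes the part of the correlation that is already explained by the shorter lags.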

Testing Model for Prediction and Evaluation
Once the candidate models have been obtained, each of them is tested. Testing has two steps: comparing the candidates with an information criterion and checking the residuals of the model. If the candidates cannot be clearly determined from the ACF and PACF plots, an alternative is the auto.arima() function from the forecast package in RStudio, which automatically tries combinations of p, d and q to find a model. Among the candidate ARIMA models, the Akaike Information Criterion (AIC) is used to determine which one fits best: the model with the smallest AIC is preferred over the others.
The ARIMA model selected in the previous phase is then checked for residuals. The model is suitable for forecasting if its residuals are white noise, which can be verified with the Ljung-Box test. The candidate models were compared using the Akaike Information Criterion (AIC) [11]. AIC assesses how well a model fits the data from which it was created, while penalizing the number of estimated parameters. It is employed in this study to evaluate the candidate models and find the one that best fits the data; a lower AIC indicates a better model. AIC is calculated by equation (6), where k is the number of estimated parameters in the model and $\hat{L}$ is the maximum value of the likelihood function for the model.

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}) \tag{6}$$
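As a small worked example of the criterion, the sketch below computes AIC as 2k minus twice the maximized log-likelihood and picks the candidate with the smallest value. It is plain Python; the model names, log-likelihoods and parameter counts are invented for illustration and are not results from this study.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L), given ln(L) and k."""
    return 2 * k - 2 * log_likelihood

# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {
    "model_a": (-120.0, 3),
    "model_b": (-118.5, 6),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
```

In this made-up case `model_b` has the higher likelihood, but its three extra parameters cost more than the fit gains, so `model_a` has the smaller AIC (246 against 249) and is selected.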
The Ljung-Box test was used in this paper's residual checking phase [12]. It is a statistical test of whether any of a group of autocorrelations of a time series is different from zero. In this study it is used to determine whether the residuals are white noise. The null hypothesis $H_0$ is that the data are independently distributed, and the alternative $H_a$ is that the data are not independently distributed. The test statistic is defined in equation (7),

$$Q = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^{2}}{n-k} \tag{7}$$

where h is the number of lags being tested, n is the sample size, and $\hat{\rho}_k$ is the sample autocorrelation at lag k. Under $H_0$, the statistic Q asymptotically follows a $\chi^2_{(h)}$ distribution. The critical region for rejecting the hypothesis of randomness at significance level $\alpha$ is $Q > \chi^2_{1-\alpha,h}$, where $\chi^2_{1-\alpha,h}$ is the $(1-\alpha)$-quantile of the chi-squared distribution with h degrees of freedom.
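The Ljung-Box statistic and its p-value can be sketched as follows. This is a standard-library Python illustration with hypothetical function names, not the code behind checkresiduals(); the chi-squared tail probability is computed with a closed-form recurrence that is valid for positive integer degrees of freedom only. The degrees of freedom are passed in explicitly, because for residuals of a fitted ARMA model they are commonly reduced by the number of estimated AR and MA coefficients.

```python
import math

def autocorrelations(x, h):
    """Sample autocorrelations at lags 1..h."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, h + 1)]

def ljung_box_q(x, h):
    """Q = n (n + 2) * sum_{k=1..h} rho_k^2 / (n - k)."""
    n = len(x)
    rho = autocorrelations(x, h)
    return n * (n + 2) * sum(rho[k - 1] ** 2 / (n - k) for k in range(1, h + 1))

def chi2_sf(q, df):
    """P(X > q) for a chi-squared variable with positive integer df."""
    if q <= 0:
        return 1.0
    y = q / 2.0
    # Start from Q(1, y) for even df or Q(1/2, y) for odd df, then apply
    # Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1) until a = df / 2.
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-y)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(y))
    for _ in range((df - 1) // 2):
        tail += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return tail

# Made-up, strongly alternating "residuals": clearly not white noise.
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
q_stat = ljung_box_q(residuals, h=1)
p_value = chi2_sf(q_stat, df=1)
```

For this artificial series the p-value falls below 0.05, so independence is rejected; residuals of an adequate model should instead give a p-value above the chosen significance level.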

Best Model Determination
Once the AIC has been computed for all candidate models, the values are compared and the model with the smallest AIC is chosen as the best among them. The chosen model then needs to be verified with the Ljung-Box test. If its residuals are white noise, the model is suitable for forecasting. The best model is used to make a 14-day forecast.

Preprocessing
The first step is to remove the unwanted attributes from the data. Only two of the four attributes in the dataset, snapped_at and price, are used. The attributes market_cap and total_volume are not required for the forecast and are removed when the data are imported into RStudio. Next, stationarity is checked in two ways: by viewing a plot of the data and by viewing the ACF plot of the data. Both plots are shown in Fig. 2 and Fig. 3, and they show that the data are not stationary. The data are therefore log-transformed with R's log() function. The outcome is shown in Figures 4 and 5; both plots show that the data are still not stationary. The data are then differenced with R's diff() function. Figures 6 and 7 show the result: the differenced data are stationary and fluctuate around the value 0. The ACF plot also shows how the autocorrelation changes across lags.

Model Candidate Determination
The lag values in the ACF and PACF plots can be used to establish the p and q orders of the candidates. The plots can be produced with the ggAcf() and ggPacf() functions of the forecast package in RStudio, which draw them with ggplot2. The ACF and PACF values are listed in Table 2. Although the ACF and PACF values exceed 0.05 at lags 6 and 10, they neither tail off nor cut off, so it is difficult to derive p and q directly by observation. Another method is therefore needed to find p and q.

Choosing the Best Model
Since the ACF and PACF cannot be used to determine p and q, the auto.arima() function from the forecast package is used as an alternative. This function searches for the model with the smallest AIC by a stepwise search rather than by trying all possible values of p and q. Table 3 lists the AIC values of the models that auto.arima() evaluated. The ARIMA(5,1,2) model with drift has the smallest AIC, which makes it the best model among them. Whether it is a good model, however, has to be judged from its residuals. The residual check is done with the checkresiduals() function from the forecast package in RStudio, which also performs the Ljung-Box test. The p-value of the test rounds to 0.05, and the test is taken as failing to reject H0, that the data are independently distributed. The residuals of the ARIMA(5,1,2) model with drift are therefore treated as white noise, and the model is suitable for prediction. Table 4 shows the forecasts of the ARIMA(5,1,2) model with drift for the next fourteen periods. Because the data were log-transformed, the forecasts must be transformed back with the exponential function to obtain the actual predictions. The back-transformed table shows the actual predictions of the ARIMA(5,1,2) model with drift for the fourteen upcoming periods.
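The back-transformation from the log scale is a single exponential per forecast value. The sketch below shows it in Python with the standard library; the function name and the input values are made up and are not the forecasts of Table 4. One caveat worth knowing: exponentiating a forecast made on the log scale gives the median, not the mean, of the forecast distribution on the original scale.

```python
import math

def back_transform(log_forecasts):
    """Map forecasts made on the natural-log scale back to price units."""
    return [math.exp(v) for v in log_forecasts]

# Made-up log-scale forecasts for three periods (illustrative only).
log_forecasts = [math.log(20000.0), math.log(20100.0), math.log(20250.0)]
price_forecasts = back_transform(log_forecasts)
```

The same function can be applied to the lower and upper bounds of a prediction interval, since the exponential is monotonic and preserves their order.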
There are also limitations. First, the irregular variation of the Bitcoin price makes it difficult to determine the corresponding ARIMA model. Second, the ARIMA model is better suited to short-term prediction, where it is more accurate: the longer the prediction horizon, the lower the accuracy. Our model predicts the Bitcoin price for the next 14 periods, so its accuracy may be lower than that of a model that only predicts the next week. Furthermore, once the best prediction model had been found, this paper only used the residual check to assess it and did not use other methods to evaluate it. For example, the mean absolute percentage error (MAPE) was not used to evaluate the forecasts. As a result, this paper's best model may differ from the actual best model. Given these limitations, future research could apply more evaluation methods to the predicted model to reduce the deviation from the actual best model. Additionally, time series data are time-sensitive: the older the data, the less useful they are. The sampling interval could therefore be shortened to improve accuracy. The data used are daily Bitcoin prices from 28 April 2013 to 22 August 2022; with hourly data, the model could track the historical behaviour more closely and thus forecast more accurately. Finally, Bitcoin prices are affected by various factors, such as national policies or the volume of social media searches for Bitcoin, so taking the effects of such factors into account in the prediction model could also help it perform better [13].
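The MAPE evaluation mentioned above as future work could look like the following sketch. It is plain Python with a hypothetical function name, and the actual and forecast values are invented; a real evaluation would compare the 14 forecasts with the prices later observed for the same days.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Actual values must be nonzero."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)

# Made-up observed prices and forecasts (illustrative only).
observed = [100.0, 200.0]
predicted = [110.0, 180.0]
error_pct = mape(observed, predicted)
```

Both forecasts in this example miss by 10% of the observed value, so the MAPE is 10%.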

Conclusion
The purpose of this essay is to use the ARIMA model to forecast the future price of Bitcoin. The prediction process consists of data collection, data preprocessing, model parameter determination, model testing and evaluation, residual checking for white noise, and determination of the best model. The entire model build was completed around 22 August 2022. Based on the ARIMA(5,1,2) model and Bitcoin price data from April 2013 to August 2022, the Bitcoin price is forecast to rise slightly in the 14 days after 22 August. Seeking a smaller AIC would require more parameters in the model, which would lead to overfitting, so the current model is a balance between these two demands. The method can also be used to forecast the price at any other time by changing a few input values. Although the final result shows a slight upward trend in the Bitcoin price in September 2022, unpredictable factors such as the high volatility of the price and the influence of human behavior must be considered when sketching Bitcoin's future development. To predict the price more accurately, researchers could build models that analyze Google Trends data for the search keyword Bitcoin from different countries and sources (a large search volume usually means that Bitcoin is in a period of high volatility), stock market capital flows, or the Federal Reserve's reverse repurchase agreements. These data can be compared with the Bitcoin price to examine related patterns, thereby improving the reliability and stability of the predicted price trend. Predictions about the future development of Bitcoin or the cryptocurrency industry require broader and deeper data. In the short term, Bitcoin can only be used as a faster payment method for immediate cross-border transactions.