Prediction of Term Deposit in Bank: Using Logistic Model



Background
Marketing is the process of presenting products to clients and customers in appealing and attractive ways in order to serve their requirements. There are two kinds of marketing: direct marketing and mass marketing [1]. Mass marketing promotes a product broadly without targeting any particular audience, whereas direct marketing concentrates on a small group of customers with a specific goal. Telemarketing is a form of direct marketing that reaches potential customers by telephone, internet, or fax [2]. It remains one of the most popular direct marketing techniques because direct human-to-human contact can sometimes be effective where the impersonal, robotic marketing messages relayed through social and digital media are not.
Mass campaigns involve a significant amount of waste because only a very small proportion of customers that are communicated with show an interest in buying the product. However, executing direct telemarketing also requires a huge investment by the business as large call centers need to be contracted to contact clients directly [3].
As competition increases, it becomes more important for companies to keep track of customers' preferences in order to win in the market. In the current business climate, customer preferences are complicated and often change dynamically, which makes them difficult to collect. Direct marketing campaigns are essential methods for enhancing the economic gain of a firm in two respects: obtaining new customers and creating additional yield from present customers. Notably, many financial services providers have used telemarketing strategies to attract new customers and to provide better services to existing customers by satisfying their special needs. Banks are among the most important parts of the national economic structure. They offer loan and deposit services to customers. Term deposits are cash investments held at a financial institution and are a major source of revenue for banks.
However, there is limited research on banks' use of telemarketing to sell deposit subscriptions. Our study therefore aims to predict the outcome of banks' telemarketing campaigns for selling long-term deposits.

Related research
Many researchers have developed methods for predicting term deposit subscriptions [4][5][6]. In Hou's study, machine learning was applied to the creation of prediction models for bank deposit subscriptions using five algorithms (naive Bayes, decision trees, random forests, support vector machines, and neural networks) [4]. Khan Mohd Zeeshan's study uses machine learning to suggest a suitable model to banks in order to offer an explainable AI-based solution for predicting potential term deposit customers [5]. Dutta Shawni devised a method that applies a convolutional-GRU network to build a system for term deposit likelihood prediction [6].
Logistic regression has also been used in published studies on credit management and bank default. Some proposed algorithms based on logistic regression helped to provide a strong statistical background to trust prediction [7].

Objective and motivation
Logistic regression is a mature algorithm whose predictions are comparatively accurate. A number of questions regarding term deposits remain to be addressed. Most studies have relied only on machine learning methods such as neural trees, and they are not conclusive because they did not use logistic regression to select attributes and make predictions. Therefore, using a logistic model to predict term deposit subscriptions in a bank is a topic worth studying, and it fills a gap in the research.
The objective of this paper is to predict whether clients will subscribe to the term deposit, so that the product can be introduced to potential customers. This can help a bank save time and marketing costs. There are three main parts to the study. The first is data preprocessing, which codes the dummy variables and divides the data into an 80% training set and a 20% test set. The second is attribute selection using a logistic model, which includes correlation analysis and the removal of irrelevant variables. The third is to establish and optimize the prediction model and to compare it with the decision tree model in terms of prediction accuracy and AUC.

Data Sources
The data were downloaded from Kaggle.com and relate to the telephone marketing campaigns of a Portuguese bank. The data set contains 14 attributes describing the clients, together with the outcome of whether each client subscribed to the term deposit. The data cover the period from May 2008 to November 2010 and comprise 45211 instances.
The client attributes can be grouped into three parts. The first is bank client data, which includes age, job, marital status, education, credit default, yearly balance, housing loan, and personal loan. The second is information from the last contact with the client: the contact communication type and the last contact duration. The third is other information: the number of times the bank contacted the customer during the campaign, the number of days that have passed since the bank last called the customer, the number of times the bank contacted the customer before the campaign, and the result of the last marketing campaign.

Logistic regression
Logistic regression is a generalized linear model that belongs to supervised learning in machine learning and is used to solve binary (dichotomous) classification problems. The model is trained on several sets of given data, and one or more sets of given data are classified after training [8]. Each set of data is made up of indicators. The input to logistic regression is the output of a linear regression, given by function (1), where $\theta_0, \theta_1, \dots, \theta_n$ are the regression coefficients and $x_1, \dots, x_n$ are the independent variables.
$$z = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n \tag{1}$$
Using the sigmoid function (2), the result of the linear regression is mapped to the interval (0, 1).
$$g(z) = \frac{1}{1 + e^{-z}} \tag{2}$$
Assuming that 0.5 is the threshold, a value less than 0.5 is classified as 0 by default, and a value greater than 0.5 is classified as 1.
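The mapping from the linear score to a class label can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the coefficient list `theta` (intercept first), the function names, and the choice to assign a probability of exactly 0.5 to class 1 are all assumptions made for the example.

```python
import math


def linear_score(theta, x):
    """Linear predictor: theta[0] is the intercept, theta[1:] the coefficients."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def sigmoid(z):
    """Map a real-valued score into (0, 1); branches avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_label(theta, x, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0.

    Assumption: a probability exactly equal to the threshold is labeled 1.
    """
    return 1 if sigmoid(linear_score(theta, x)) >= threshold else 0
```

Keeping the threshold as a parameter makes it visible that 0.5 is a modeling choice rather than a property of the sigmoid itself.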

Decision trees
The decision tree is a highly interpretable model that can carry out both classification and regression tasks [9]. It has a tree-like structure resembling an inverted tree. When classical machine learning models such as linear regression and logistic regression are used for regression and classification, it is often necessary to make sure that the training data are free of irregularities such as missing values, outliers, and multicollinearity, so a lot of data preprocessing needs to be completed. Decision trees, by contrast, require comparatively little data preprocessing, because they are robust to many of these problems. Moreover, decision trees can deal with nonlinear data that classical linear models cannot. Decision trees are therefore versatile enough to perform both regression and classification tasks. A decision tree is constructed by asking a series of questions of the data, which mimics the human decision-making process: during tree construction, the data are repeatedly divided into subsets until a decision is reached.
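As a rough illustration of how a tree asks a question of the data, the sketch below searches one numeric attribute for the threshold that best separates the two classes. It assumes Gini impurity as the splitting criterion, which the text above does not specify, and the function names and toy values are hypothetical. A full decision tree would apply such a search recursively to each resulting subset.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the subset is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Find the question 'value <= threshold?' with the lowest weighted impurity.

    Returns (threshold, weighted_gini); threshold is None if no split helps.
    """
    n = len(labels)
    best = (None, gini(labels))
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Toy data, made up for illustration: call duration in seconds and outcome.
durations = [50, 80, 120, 400, 600, 900]
subscribed = [0, 0, 0, 1, 1, 1]
print(best_split(durations, subscribed))  # (120, 0.0)
```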

ROC and AUC
The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are valuable indicators for evaluating the performance of classification models. They help assess a model's capability to distinguish between classes.
A ROC curve measures the performance of a classification model by plotting the true positive rate against the false positive rate. AUC is the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. Graphically speaking, AUC measures the area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1) [10]. An AUC of 0 indicates that the predictions of the model are 100% wrong, and an AUC of 1 indicates that they are 100% correct. An AUC of 0.5 indicates that the model performs no better than random guessing.
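The probabilistic reading of AUC can be computed directly by comparing every positive instance with every negative one. The sketch below is an illustrative pairwise implementation, not the routine used to produce the results in this paper; the function name is hypothetical, and counting tied scores as half a win is a common convention assumed here.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This is O(P * N) and is meant for illustration, not for large data sets.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because only the ordering of the scores matters, AUC does not depend on the classification threshold.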

Code for Dummy Variable
The 14 attributes are used as the X variables and the subscription result as the Y variable. The Y variable is binary, taking the values "yes" and "no", which are coded as 1 and 0, respectively. Eight of the attribute variables need to be coded as dummy variables. The coding is done in SPSS, and the outcome is shown in Table 1.
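For readers without SPSS, the same kind of coding can be sketched in Python. This is only an illustration of dummy coding in general, not a reproduction of the SPSS procedure or of Table 1: the function names, the example category values, and the choice to drop the first category as the reference level are assumptions.

```python
def encode_target(value):
    """Code the subscription outcome: 'yes' -> 1, 'no' -> 0 (case-insensitive)."""
    v = value.strip().lower()
    if v not in ("yes", "no"):
        raise ValueError("unexpected outcome value: %r" % value)
    return 1 if v == "yes" else 0


def dummy_code(values, drop_first=True):
    """Turn one categorical column into 0/1 indicator columns.

    With drop_first=True the alphabetically first category becomes the
    reference level and gets no column, which avoids perfect collinearity
    with the intercept in a regression model.
    """
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical values for a marital status column.
marital = ["married", "single", "divorced", "single"]
print(dummy_code(marital))
# {'married': [1, 0, 0, 0], 'single': [0, 1, 0, 1]}
```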

Divide the training set and test set
The data are divided after shuffling. To guard against overfitting, 80% of the data is assigned to the training set, and the remaining 20% is assigned to the test set.
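A shuffle followed by an 80/20 split can be sketched as follows. The function name, the fixed random seed, and the rounding of the test set size are assumptions made for the example; the paper does not state how the split was implemented.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then split off the test set.

    A fixed seed makes the split reproducible; the input is left unchanged.
    Returns (train_rows, test_rows).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

Holding out the test rows before any model fitting means the reported accuracy and AUC are measured on data the model has not seen.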

Test the Model and Robustness Analysis
This paper tests and compares the prediction accuracy of the logistic regression model and the decision tree model, both implemented in Python. For both models, prediction is carried out on the original data and on the data with the irrelevant variables (age, default, previous, pdays) removed. This gives four scenarios, which are used to evaluate the performance of both models:
1) Logistic regression model under complete data.
2) Logistic regression model when irrelevant attributes are removed.
3) Decision tree model under complete data.
4) Decision tree model when irrelevant attributes are removed.
Each scenario is evaluated by prediction accuracy and AUC.

Using prediction accuracy as criterion
When prediction accuracy is used as the criterion for choosing between prediction models and data sets, the four scenarios are tested in turn, as shown in Table 4. First, the logistic regression model under complete data is tested. After running the code in Python, the prediction accuracy is 0.88698. This is higher than 0.85, indicating that the prediction accuracy is quite good. Then the data are changed to the reduced data (with the four irrelevant variables removed), and the result shows that the value remains the same. So prediction accuracy is not affected by the choice of data set when a logistic regression model is used for prediction.
Next, the decision tree model is tested under complete data. Although its prediction accuracy is still higher than 0.85, it is lower than the value for logistic regression (0.88698). So, if prediction accuracy is used to evaluate both models, the logistic regression model turns out to be the better one. When the data are changed to the reduced data, the decision tree's prediction accuracy decreases further.
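For reference, prediction accuracy is simply the share of test instances whose predicted label matches the true label. The sketch below uses a hypothetical function name and made-up labels; it is not the code that produced the values in Table 4.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: four of the five predictions are correct.
print(accuracy([1, 0, 0, 1, 0], [1, 0, 1, 1, 0]))  # 0.8
```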

Using AUC as criterion
The four scenarios are tested again with AUC as the criterion, as shown in Table 5 and Figures 1-4. First, when the logistic regression model under complete data is tested, the area under the ROC curve is 0.8524. F2.D in this paper shows that when AUC lies in the range [0.85, 0.95], the model is excellent, so the logistic regression model is an excellent choice for the forecast. Then the four irrelevant attributes are removed and the model is run again; AUC increases slightly. This indicates that, under the logistic model, predicting with the reduced data is better than predicting with the complete data.
Next, the decision tree model is tested under complete data. The Python output shows that AUC decreases to 0.67257. So using a decision tree to predict is less accurate than using the logistic regression model, which is confirmed again in the four graphs below. Finally, the model is run under reduced data, and AUC is found to decrease further.
Based on the above results, the following conclusion is reached: if prediction accuracy is used to evaluate the models and data sets, the logistic regression model is better than the decision tree model, and the irrelevant attributes do not affect its accuracy. If AUC is used to evaluate them, the logistic regression model under reduced data is the best forecasting method.

Conclusions
Under the trend of globalization, the traditional mode of operation is not enough to give businesses such as educational institutions, insurance companies, and banks a competitive advantage; therefore, various marketing strategies need to be used. Telemarketing is a direct marketing strategy in which a salesperson can learn, through several phone calls, whether prospective customers would like to purchase products or services.
In this paper, the patterns and characteristics of customers who are more interested in the product are identified from telemarketing data in order to explore the best strategies for improving the Portuguese bank's next direct marketing campaign. The best prediction model is selected by comparing the prediction accuracy and AUC of alternative models. Based on the Python results, the logistic regression model has higher prediction accuracy than the decision tree. In addition, after the insignificant attributes are removed, the optimized logistic model performs better. Therefore, the optimized logistic model is a good model for the bank to predict which target customers want to subscribe to a term deposit. Portuguese banks can use the model to identify whether clients will subscribe to the term deposit, and can thus save the time and money that would be spent calling large numbers of clients at random. It is a good way for the bank to make telemarketing efficient.
However, the paper still has some limitations and shortcomings. For example, forecasting methods other than the logistic regression model and the decision tree model have not been taken into account. In the future, other classification methods, such as random forest, support vector machine, and Gaussian naive Bayes, can be analyzed to see whether they perform better than the logistic model in forecasting the Portuguese bank's next direct marketing campaign. In addition, the deposit policy has not been considered in this paper. Since the deposit policy keeps changing and plays an essential role in predicting future marketing campaigns, the model should be updated whenever a new policy is established.