Introduction¶
What is machine learning? "Machine learning is a field that develops algorithms designed to be applied to datasets, with the main areas of focus being prediction (regression), classification, and clustering or grouping tasks. These tasks are divided into two main branches, supervised and unsupervised ML." (Athey 2019).
- supervised learning: the model learns to associate data with labels (classification) or values (regression)
- unsupervised learning: the model works on the data only, without labels
- We may add reinforcement learning: the model interacts with an environment and learns an optimal policy (for the connection with dynamic programming, see Bertsekas: http://www.mit.edu/~dimitrib/RLbook.html)
The use of machine learning in economics is not always natural: econometricians are concerned with causal inference, which is something ML was not built to do. ML however brings economics a set of tools that can be used either as an intermediary step (when one needs to classify observations before conducting regressions), or as a central method, in particular when the goal of the paper is to provide predictions.
Applications¶
For example, Björkegren and Grissen (2019) attempt to predict individual credit risk based on mobile phone usage. An example where machine learning would be used as an intermediary step: suppose you want to assess the impact of an agricultural policy on farmers' crop choices. Starting from a dataset of satellite crop images, you can classify the fields using image recognition, and then identify the treatment effect using traditional econometric methods. Image recognition on satellite images can also be used in urban economics.
In policymaking, the prediction aspect itself may be important. For example, you may want to install wind turbines in areas where the wind turbine project will not generate too much backlash. We will conduct such an analysis at the end of the notebook.
Here, we will see two packages, Scikit-Learn and TensorFlow 2, which can be used for machine learning applications. We will briefly review the general setup for adapting data to a machine learning project, and present some methods for regression and classification, in particular Ridge, Lasso, ElasticNet, and neural networks. I will also present PCA, a method for dimensionality reduction.
Applications will be done on MNIST (the standard toy dataset for deep learning applications) and a dataset on the Beijing housing market.
References:¶
The Scikit-Learn documentation:
A review of machine learning applications in economics:
- Machine Learning Methods That Economists Should Know About, Athey & Imbens, 2019, Annual Review of Economics, Vol. 11:685-725
Two textbooks on machine learning applications (not too much theory)
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, A. Géron, 2019, O'Reilly Media, Inc.
- Machine Learning for Economics and Finance in TensorFlow 2: Deep Learning Models for Research and Industry, I. Hull, 2020, Apress, 1st ed.
More theoretical resources:
- Foundations of Machine Learning, Mohri et al., MIT Press, Second Edition, 2018
- Deep Learning, Goodfellow et al, 2016, MIT Press, http://www.deeplearningbook.org
#!pip install tensorflow
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, r2_score
from sklearn.linear_model import SGDClassifier
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_probability as tfp
tfd = tfp.distributions
from math import sqrt
import warnings
warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = (5,5)
Motivation: MNIST¶
To begin with, we will solve a digit classification problem. The MNIST dataset contains images of hand-written digits as input, and the output is a vector containing the "actual" digit that was written. We will train a model on a subset of the data, and then test it. The strength of Scikit-Learn is that we can simply import a method and implement it immediately.
mnist = pd.read_csv("https://raw.githubusercontent.com/AntoineChapel/MachineLearning_CC/main/mnist_df.csv")
X, y = mnist.iloc[:, 1:785].to_numpy(), mnist['target'].to_numpy().astype(int)
%matplotlib inline
num = X[11, :].reshape(28, 28)
plt.imshow(num, cmap="binary")
plt.show()
The MNIST dataset is to machine/deep learning what the auto-mpg Stata dataset is to econometrics. We will start from here, and then move on to more econometric frameworks.
#Step 1: Data splitting, usually train/test, sometimes train/validation/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#Step 2: Model building and fitting
sgd_class = SGDClassifier(random_state=42)
sgd_class.fit(X_train, y_train)
SGDClassifier(random_state=42)
#Step 3: out of sample prediction
y_predict = sgd_class.predict(X_test)
#Step 4: Testing. Depending on the task, find the right test (here the task is multiclass classification)
confusion_mnist = confusion_matrix(y_predict, y_test)
plt.matshow(confusion_mnist, cmap = plt.cm.gray)
<matplotlib.image.AxesImage at 0x243a277cdc8>
accuracy_score(y_predict, y_test)
0.878
The SGD (stochastic gradient descent) classifier is already quite good. We see on the confusion matrix that 7s are sometimes mistaken for 9s. Now let us check whether we can do better with a neural network built with Keras.
#Model training
model_mnist = keras.models.Sequential()
model_mnist.add(keras.layers.Dense(300, activation="relu"))
model_mnist.add(keras.layers.Dense(300, activation="relu"))
model_mnist.add(keras.layers.Dense(300, activation="relu"))
model_mnist.add(keras.layers.Dense(300, activation="relu"))
model_mnist.add(keras.layers.Dense(10, activation="softmax"))
model_mnist.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
#Model fitting
model_mnist.fit(X_train, y_train, batch_size=100, epochs=10, validation_split=0.2)
Epoch 1/10
56/56 [==============================] - 1s 12ms/step - loss: 7.4912 - accuracy: 0.6782 - val_loss: 0.8829 - val_accuracy: 0.8307
Epoch 2/10
56/56 [==============================] - 0s 8ms/step - loss: 0.4215 - accuracy: 0.9027 - val_loss: 0.5941 - val_accuracy: 0.8729
Epoch 3/10
56/56 [==============================] - 0s 7ms/step - loss: 0.1447 - accuracy: 0.9548 - val_loss: 0.5285 - val_accuracy: 0.8957
Epoch 4/10
56/56 [==============================] - 0s 7ms/step - loss: 0.0600 - accuracy: 0.9816 - val_loss: 0.5082 - val_accuracy: 0.8957
Epoch 5/10
56/56 [==============================] - 0s 7ms/step - loss: 0.0213 - accuracy: 0.9937 - val_loss: 0.4691 - val_accuracy: 0.9107
Epoch 6/10
56/56 [==============================] - 0s 7ms/step - loss: 0.0083 - accuracy: 0.9986 - val_loss: 0.4525 - val_accuracy: 0.9164
Epoch 7/10
56/56 [==============================] - 0s 8ms/step - loss: 0.0021 - accuracy: 1.0000 - val_loss: 0.4449 - val_accuracy: 0.9164
Epoch 8/10
56/56 [==============================] - 0s 7ms/step - loss: 0.0010 - accuracy: 1.0000 - val_loss: 0.4449 - val_accuracy: 0.9171
Epoch 9/10
56/56 [==============================] - 0s 7ms/step - loss: 7.8686e-04 - accuracy: 1.0000 - val_loss: 0.4432 - val_accuracy: 0.9186
Epoch 10/10
56/56 [==============================] - 0s 9ms/step - loss: 6.5253e-04 - accuracy: 1.0000 - val_loss: 0.4433 - val_accuracy: 0.9186
<keras.callbacks.History at 0x243a271af48>
y_pred = model_mnist.predict(X_test)
class_prediction_nn = np.argmax(y_pred, axis=1)
confusion_mnist_nn = confusion_matrix(class_prediction_nn, y_test)
plt.matshow(confusion_mnist_nn, cmap = plt.cm.gray)
<matplotlib.image.AxesImage at 0x243ae7feb88>
#Out-of-sample prediction:
accuracy_score(class_prediction_nn, y_test)
0.9213333333333333
The power of neural networks/perceptrons for image classification tasks like the one above is impressive: with a simple four-hidden-layer perceptron, we get around 92% accuracy out of sample, and the confusion matrix shows that classification is nearly perfect.
General Introduction to the Machine Learning framework¶
As mentioned earlier, machine learning is most often concerned with classification and regression/prediction tasks. The goal is the accuracy/precision of the model, not so much its interpretability. To know how well the model performs, we need to test it out of sample. This leads to one big concern: overfitting.
Overfitting happens when the model fits the training set too closely, possibly reaching 100% accuracy/$R^2$ on it. As one can expect, such a model will definitely not be optimal when we provide it with a testing sample. So, the model needs to be constrained: either by design (as with OLS), or by avoiding training indefinitely on the same subsample.
Another form of overfitting may arise: if the scientist building the model systematically looks at how the trained model performs on the testing sample while tuning its parameters, they will end up building a model designed to give good results on that particular sample. But then, the model would simply be overfitting on the testing sample. This temptation to cheat is why the testing sample is generally reserved for the very last step: the model should first be trained on one subset of the training set and validated (pre-tested) on another subset, called the validation set.
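The split just described can be sketched with two successive calls to Scikit-Learn's train_test_split (a minimal sketch on synthetic data; the 60/20/20 proportions are an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=1000)

#First split off the final test set (kept untouched until the very end)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#Then carve a validation set out of the remaining data for model selection
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  #600 200 200
```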
Implementation of Scikit Learn¶
For now, let's go over some simple regression designs: OLS, Ridge, Lasso and ElasticNet. These tools focus on estimating the parameters of a linear regression equation. The functional form is thus the following, for each observation $i$:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_j x_{ij} + \epsilon_i$$
Where $\beta = (\beta_0, ..., \beta_j)$ is the vector of parameters we want to estimate, and $(x_i, y_i)$ is the data for individual $i$.
In matrix form, it can be rewritten \begin{align} y = X \beta + \epsilon \end{align}
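In this matrix form, OLS has the well-known closed-form solution $\hat{\beta} = (X^T X)^{-1} X^T y$, which we can check directly with NumPy (a sketch on synthetic data; the coefficient values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  #intercept + k regressors
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=n)

#Closed-form OLS: solve (X'X) beta = X'y instead of inverting explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  #close to [1.0, 2.0, -0.5, 0.3]
```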
The simplest method: OLS¶
We all know about OLS: it can be implemented very simply with Scikit-Learn, and we will use it as a benchmark against which subsequent methods will be compared.
The example at hand: the Beijing Housing market¶
To present the use of Scikit-Learn methods, we will use a dataset on the housing market in Beijing. The original dataset can be found here: https://www.kaggle.com/ruiqurm/lianjia. We are using a cleaned and pre-processed version of the data, to avoid formatting issues.
beijing_data = pd.read_csv("https://raw.githubusercontent.com/AntoineChapel/MachineLearning_CC/main/data_beijing.csv")
beijing_data.head()
id | lng | lat | followers | totalprice | price | square | kitchen | renovationcondition | buildingstructure | ... | commu_avg | dist_from_cent | logprice | logdist_from_cent | logfollowers | logcommu_avg | logladder | logsquare | logdom | log_age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 101084782030 | 116.47549 | 40.019520 | 106 | 415.0 | 31680 | 131.00 | 1 | 3 | 6 | ... | 56021 | 0.104175 | 10.363441 | -2.261679 | 4.672829 | 10.933482 | -1.527858 | 4.875197 | 7.288928 | 2.564949 |
1 | 101086012217 | 116.45392 | 39.881535 | 126 | 575.0 | 43436 | 132.38 | 1 | 4 | 6 | ... | 71539 | 0.064503 | 10.679044 | -2.741044 | 4.844187 | 11.177998 | -0.404965 | 4.885676 | 6.805723 | 2.639057 |
2 | 101086041636 | 116.56198 | 39.877144 | 48 | 1030.0 | 52021 | 198.00 | 1 | 3 | 6 | ... | 48160 | 0.156425 | 10.859403 | -1.855177 | 3.891820 | 10.782284 | -0.693147 | 5.288267 | 7.147559 | 2.564949 |
3 | 101086406841 | 116.43801 | 40.076115 | 138 | 297.5 | 22202 | 134.00 | 1 | 1 | 6 | ... | 51238 | 0.143917 | 10.007937 | -1.938518 | 4.934474 | 10.844236 | -1.298283 | 4.897840 | 6.872128 | 2.302585 |
4 | 101086920653 | 116.42839 | 39.886230 | 286 | 392.0 | 48396 | 81.00 | 1 | 2 | 2 | ... | 62588 | 0.049165 | 10.787172 | -3.012577 | 5.659482 | 11.044329 | -1.099613 | 4.394449 | 6.831954 | 4.060443 |
5 rows × 31 columns
varnames = beijing_data.columns
print(varnames)
Index(['id', 'lng', 'lat', 'followers', 'totalprice', 'price', 'square', 'kitchen', 'renovationcondition', 'buildingstructure', 'ladderratio', 'elevator', 'fiveyearsproperty', 'subway', 'district', 'dom_int', 'livingroom_int', 'drawingroom_int', 'bathroom_int', 'buildingtype_int', 'construction_int', 'commu_avg', 'dist_from_cent', 'logprice', 'logdist_from_cent', 'logfollowers', 'logcommu_avg', 'logladder', 'logsquare', 'logdom', 'log_age'], dtype='object')
y_b = beijing_data['logprice'].to_numpy().reshape(-1, 1)
X_b_short = beijing_data['dist_from_cent'].to_numpy().reshape(-1, 1)
A good practice is to divide the dataset into a training subset and a testing subset. Scikit-Learn does this for us with train_test_split, and we can also use stratified sampling if we want to ensure the representativeness of our subsets.
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b_short, y_b, test_size=0.3, random_state=42)
reg = linear_model.LinearRegression()
reg.fit(X_train_b, y_train_b)
y_pred_s = reg.predict(X_train_b) #in-sample prediction
plt.scatter(X_train_b, y_train_b, s=0.1)
plt.plot(X_train_b, y_pred_s, 'r')
plt.show()
We immediately see that a linear approximation gives us some idea of the housing market structure (the farther you are from the center, the cheaper the housing). However, it is clearly not perfect, as there is a lot of unexplained variance. To measure the accuracy of the model, we can use the $R^2$, directly from sklearn as well.
y_pred_out = reg.predict(X_test_b) #out-sample prediction
r2_score(y_test_b, y_pred_out)
0.29842841519034147
features = ['lng', 'lat', 'kitchen', 'renovationcondition', 'buildingstructure', 'ladderratio', 'elevator', 'fiveyearsproperty', 'subway', 'district', 'livingroom_int', 'drawingroom_int', 'bathroom_int', 'construction_int', 'commu_avg', 'logdist_from_cent', 'logfollowers', 'logcommu_avg', 'logladder', 'logdom', 'log_age']
X_b_long = beijing_data[features].to_numpy()
First, let's re-do an OLS regression with multiple predictors. We write it as a function.
def ols_reg(y, X):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    reg_ols = linear_model.LinearRegression()
    reg_ols.fit(X_train, y_train)
    y_predict = reg_ols.predict(X_test)
    return reg_ols.coef_, reg_ols.coef_.shape, r2_score(y_test, y_predict)
coef_ols, size_ols, r2_ols = ols_reg(y_b, X_b_long)
print(coef_ols)
print(size_ols)
print(r2_ols)
[[-1.06188976e-01 1.38460555e-02 1.45959111e-01 4.20352936e-02 1.20085318e-03 -5.16990867e-09 -2.23851540e-03 -3.23175175e-02 6.10203712e-03 1.17730704e-03 -2.77443371e-02 -2.16688498e-02 -2.63103019e-02 -2.62604515e-03 -9.98761173e-07 -2.94957081e-02 1.38484820e-02 9.97146888e-01 2.47312449e-02 9.49736702e-02 -1.97911799e-02]] (1, 21) 0.7447412224088903
So, we have our 21 coefficients, and a satisfying out-of-sample $R^2$ of 74.4%. With traditional econometric methods, we would run t-tests to assess whether the variables are significant, worry about possible endogeneity issues, and so on. This is easy in Stata or R, but a bit tedious in Python. So, if we want to reduce the number of coefficients and make the model a bit more interpretable, we can use regularised regressions, such as Ridge, Lasso, and ElasticNet.
Linear models¶
The goal of these methods is to improve the efficiency of our predictions by keeping only the most relevant coefficients. Instead of using all 21 features in our data, we may reduce the size of the problem. In OLS there is by definition no penalisation; given $k$ features, we thus solve:
\begin{align*} \min_{\beta} ||X\beta - y||_2^2 \end{align*}
Ridge:¶
\begin{align*} \min_{\beta} ||X\beta - y||_2^2 + \alpha ||\beta||_2^2 \end{align*}
Here, $||\cdot||_2^2$ denotes the squared l2-norm, which expands as $\sum_{j=1}^k \beta_j^2$.
Ridge regression is useful when the variables are highly correlated with one another (multicollinearity). It can be used to make the coefficients robust to collinearity, which is an issue for OLS. $\alpha$ is the regularisation parameter: when it equals 0, we recover OLS; the larger it is, the more robust the coefficients will be to collinearity.
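Unlike LASSO below, Ridge keeps a closed-form solution, $\hat{\beta} = (X^T X + \alpha I)^{-1} X^T y$, which shows how $\alpha$ stabilises the inversion of $X^T X$. A minimal sketch on synthetic data (without an intercept, so the penalty applies to every coefficient):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + 0.1 * rng.normal(size=200)

alpha = 1.0
#Closed form: solve (X'X + alpha*I) beta = X'y
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(4), X.T @ y)

#Matches sklearn's Ridge when the intercept is disabled
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_ridge, beta_sklearn, atol=1e-4))  #True
```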
LASSO (Least Absolute Shrinkage and Selection Operator):¶
\begin{align*} \min_{\beta} ||X \beta - y||_2^2 + \alpha ||\beta||_1 \end{align*}
LASSO penalises the absolute size of the coefficients, and thus leads to a restricted set of explanatory variables. Here, $||\cdot||_1$ denotes the l1-norm, which expands as $\sum_{j=1}^k |\beta_j|$. $\alpha$ is again the regularisation parameter, but this time, using the l1-norm instead of the l2-norm shrinks coefficients that are close to 0 to exactly 0. Doing so, we reduce the number of coefficients and obtain a sparse vector of coefficients. Obviously, this comes at the loss of some precision of the model, which will be reflected in the $R^2$.
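The sparsity induced by the l1-norm is easy to see on synthetic data where only a few coefficients truly matter (a sketch; the value $\alpha = 0.1$ is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]          #only 3 of 10 features matter
y = X @ beta_true + 0.5 * rng.normal(size=500)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.sum(lasso.coef_ != 0))           #far fewer than 10 nonzero coefficients
```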
Elastic-Net¶
\begin{align*} \min_{\beta} ||X \beta - y||_2^2 + \alpha \rho ||\beta||_1 + \frac{\alpha (1 - \rho)}{2} ||\beta||_2^2 \end{align*}
As we can see, Elastic-Net allows us to use a linear combination of l1-norm and l2-norm penalisation. It uses two parameters, $\alpha$ and $\rho$, which are respectively the regularisation parameter and the share of l1-norm penalisation in the process.
def linear_reg(y, X, method, α=0.1, l1_r=0.5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    if method=="OLS":
        lin_reg = linear_model.LinearRegression()
    elif method=="Ridge":
        lin_reg = linear_model.Ridge(alpha=α)
    elif method=="Lasso":
        lin_reg = linear_model.Lasso(alpha=α)
    elif method=="ElasticNet":
        lin_reg = linear_model.ElasticNet(alpha=α, l1_ratio=l1_r)
    lin_reg.fit(X_train, y_train)
    y_predict = lin_reg.predict(X_test)
    coef_list = lin_reg.coef_[lin_reg.coef_ != 0]  #nonzero coefficients only
    coef_ind = np.argwhere(lin_reg.coef_ != 0)     #indices of the nonzero coefficients
    return coef_list, len(coef_list), r2_score(y_test, y_predict), coef_ind
regression_ols = linear_reg(y_b, X_b_long, method="OLS")
coefs_ols, nbcoefs_ols, r2_ols = regression_ols[0], regression_ols[1], regression_ols[2]
print(coefs_ols)
print(nbcoefs_ols)
print(r2_ols)
[-1.06188976e-01 1.38460555e-02 1.45959111e-01 4.20352936e-02 1.20085318e-03 -5.16990867e-09 -2.23851540e-03 -3.23175175e-02 6.10203712e-03 1.17730704e-03 -2.77443371e-02 -2.16688498e-02 -2.63103019e-02 -2.62604515e-03 -9.98761173e-07 -2.94957081e-02 1.38484820e-02 9.97146888e-01 2.47312449e-02 9.49736702e-02 -1.97911799e-02] 21 0.7447412224088903
np.array(features)[regression_ols[3]][:, 1]
array(['lng', 'lat', 'kitchen', 'renovationcondition', 'buildingstructure', 'ladderratio', 'elevator', 'fiveyearsproperty', 'subway', 'district', 'livingroom_int', 'drawingroom_int', 'bathroom_int', 'construction_int', 'commu_avg', 'logdist_from_cent', 'logfollowers', 'logcommu_avg', 'logladder', 'logdom', 'log_age'], dtype='<U19')
regression_elastic = linear_reg(y_b, X_b_long, method="ElasticNet")
coefs_elastic, nbcoefs_elastic, r2_elastic = regression_elastic[0], regression_elastic[1], regression_elastic[2]
print(coefs_elastic)
print(nbcoefs_elastic)
print(r2_elastic)
[ 7.61869951e-03 4.74007355e-08 -1.24118333e-03 1.44728336e-05 7.28613849e-03 8.84109890e-02] 6 0.700160192657924
np.array(features)[regression_elastic[3]]
array([['renovationcondition'], ['ladderratio'], ['construction_int'], ['commu_avg'], ['logfollowers'], ['logdom']], dtype='<U19')
ElasticNet regression shows that we can obtain an $R^2$ of 70% with a simpler linear model of only 6 variables instead of 21.
Hyperparameter tuning¶
As seen earlier, some statistical models imported with Sklearn rely on hyperparameters: for ElasticNet, for example, the regularisation parameter $\alpha$ and the l1-penalisation share $\rho$. Choosing these hyperparameters optimally should not be done by testing how the model performs on the testing set, since that would again lead to overfitting. Instead, we will use a tool sklearn provides: Grid Search.
The idea is to specify a list of hyperparameter values that will be tested on the training set using the cross-validation principle (repeated training and validation on the training set). When the number of hyperparameters grows large and it becomes time-consuming to test every possible combination of values, we may rely on Randomized Grid Search. Let us try these methods using ElasticNet:
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_b_long, y_b, test_size=0.3, random_state=42)
#Here, we have two parameters, with 9 possible values each, so 81 possible combinations
parameters = {'alpha': np.logspace(0, 3, 9, base=10)/(10**5),
'l1_ratio': np.linspace(0.1, 0.9, 9)}
model_elasticnet = linear_model.ElasticNet()
elasticreg = GridSearchCV(model_elasticnet, parameters, cv=5)
elasticreg
GridSearchCV(cv=5, estimator=ElasticNet(), param_grid={'alpha': array([1.00000000e-05, 2.37137371e-05, 5.62341325e-05, 1.33352143e-04, 3.16227766e-04, 7.49894209e-04, 1.77827941e-03, 4.21696503e-03, 1.00000000e-02]), 'l1_ratio': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])})
elasticreg.fit(X_train_b, y_train_b)
GridSearchCV(cv=5, estimator=ElasticNet(), param_grid={'alpha': array([1.00000000e-05, 2.37137371e-05, 5.62341325e-05, 1.33352143e-04, 3.16227766e-04, 7.49894209e-04, 1.77827941e-03, 4.21696503e-03, 1.00000000e-02]), 'l1_ratio': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])})
elasticreg.best_params_
{'alpha': 5.6234132519034914e-05, 'l1_ratio': 0.8}
# Exercise 1: Now, instantiate an ElasticNet regression of your own. Does it perform better than the one obtained earlier? (It should)
# What is the issue with doing this with ElasticNet?
# Exercise 2: redo the grid search over the following grid (with 400 combinations) using RandomizedSearchCV.
# What is the limitation of this method?
parameters_randomized = {'alpha': np.logspace(0, 3, 20)/(10**5),
'l1_ratio': np.linspace(0.01, 0.1, 20)}
Support Vector Machines¶
SVMs are another family of machine learning models, originally designed to handle classification tasks. In one sentence, an SVM classifier works under the hood by projecting the data into a higher-dimensional space and finding a linear separator between the categories. For this part, we will rely on an example by Aurélien Géron (Hands-On Machine Learning) using polynomial features and the Pipeline technique, which can be used to streamline the building of a machine learning model with Scikit-Learn. Here, we see the classification of a noisy nonlinear dataset.
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC
X, y = make_moons(n_samples=1000, noise=0.25)
df = pd.DataFrame(dict(X1=X[:, 0], X2=X[:, 1], target = y))
plt.scatter(df['X1'], df['X2'], c=df['target'].map({0:'red', 1:'green'}))
<matplotlib.collections.PathCollection at 0x2440a64ea48>
X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(X, y, test_size=0.2)
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(C=10))
])
polynomial_svm_clf.fit(X_train_svm, y_train_svm)
Pipeline(steps=[('poly_features', PolynomialFeatures(degree=3)), ('scaler', StandardScaler()), ('svm_clf', SVC(C=10))])
y_pred = polynomial_svm_clf.predict(X_test_svm)
print(accuracy_score(y_pred, y_test_svm))
0.955
NB: SVMs are versatile and can also be used for regression.
Neural Networks: the Multi-Layer Perceptron¶
I will not go into the inner workings of a neural network here. For details, see the book "Deep Learning" by Ian Goodfellow et al.
Now let's move away from linear models and try neural networks. Unless you have lived in a cave for the last 8 years or so, you have heard of them. They have proved remarkably efficient at detecting nonlinear patterns and performing image recognition.
There is a neural network available in scikit-learn, which we will implement here, but Google has developed a much more advanced tool, TensorFlow, which gives us more control. TensorFlow can be a bit complex to learn and use, but its high-level API Keras makes it easy to build (simple) neural nets.
Tensorflow can be used to do a lot more than deep learning (optimisation, regression tasks...), so I refer the interested student to the resource "Machine Learning for Economics and Finance in TensorFlow 2" (Isaiah Hull), in which the author does nearly everything (OLS, PCA...) in TensorFlow only.
First, it is recommended to scale the data, and scikit-learn has a tool for this. You may check, by training the model on unscaled data, that it then performs clearly worse.
scaler_b = StandardScaler()
scaler_b.fit(X_train_b)
X_train_bs = scaler_b.transform(X_train_b)
X_test_bs = scaler_b.transform(X_test_b)
nn_b = MLPRegressor(max_iter=50, activation='relu', early_stopping=True, batch_size=1000)
nn_b.fit(X_train_bs, y_train_b)
MLPRegressor(batch_size=1000, early_stopping=True, max_iter=50)
#If our goal was to predict the housing price, we now have a good predictor that clearly outperforms OLS.
#We may gain some additional precision points by tweaking the parameters.
y_pred_o_nn = nn_b.predict(X_test_bs)
r2_score(y_test_b, y_pred_o_nn)
0.7606359183948814
#This increase in precision is done at the expense of interpretability
coefs_nn = nn_b.coefs_
coefs_nn[0].shape
(21, 100)
Given that the default hidden layer of our scikit-learn network contains 100 neurons, and we have 21 input variables, the first layer alone holds 21 × 100 = 2100 weights that are difficult to interpret. Now let us build a neural network with TensorFlow and Keras. Here, we will build a simple neural network with the simplest way to use the Keras API: Sequential.
model_beijing_nn = tf.keras.Sequential()
model_beijing_nn.add(tf.keras.layers.Dense(50, activation='relu'))
model_beijing_nn.add(tf.keras.layers.Dense(50, activation='relu'))
model_beijing_nn.add(tf.keras.layers.Dense(50, activation='relu'))
model_beijing_nn.add(tf.keras.layers.Dense(1))
model_beijing_nn.compile(optimizer='adam', loss='mse')
history_beijing = model_beijing_nn.fit(X_train_bs, y_train_b, batch_size=100, epochs=10, validation_split=0.2)
Epoch 1/10
852/852 [==============================] - 3s 3ms/step - loss: 4.4656 - val_loss: 1.4410
Epoch 2/10
852/852 [==============================] - 2s 3ms/step - loss: 0.1208 - val_loss: 0.9650
Epoch 3/10
852/852 [==============================] - 3s 3ms/step - loss: 0.0677 - val_loss: 0.8127
Epoch 4/10
852/852 [==============================] - 3s 3ms/step - loss: 0.0515 - val_loss: 0.6815
Epoch 5/10
852/852 [==============================] - 3s 3ms/step - loss: 0.0468 - val_loss: 0.6612
Epoch 6/10
852/852 [==============================] - 2s 3ms/step - loss: 0.0436 - val_loss: 0.6316
Epoch 7/10
852/852 [==============================] - 2s 3ms/step - loss: 0.0419 - val_loss: 0.5731
Epoch 8/10
852/852 [==============================] - 3s 3ms/step - loss: 0.0407 - val_loss: 0.5710
Epoch 9/10
852/852 [==============================] - 3s 4ms/step - loss: 0.0410 - val_loss: 0.5756
Epoch 10/10
852/852 [==============================] - 3s 3ms/step - loss: 0.0392 - val_loss: 0.5583
y_pred_bs = model_beijing_nn.predict(X_test_bs)
r2_score(y_test_b, y_pred_bs)
0.786035652906714
pd.DataFrame(history_beijing.history).plot(figsize=(8, 5))
plt.grid(True)
plt.show()
We may gain some precision by adding layers to the network, or by adding data to the model. We should be careful, however, to avoid overfitting: with too many layers, the model risks becoming "too good" at predicting housing prices within the sample while being unable to make any meaningful prediction out of sample. In any case, with an out-of-sample $R^2$ close to 80%, we have a convincing approximation of housing prices in Beijing.
Dimensionality Reduction: PCA¶
In computational sciences, we often face the "curse of dimensionality": we have so much data available that, geometrically, the dataset becomes an object with too many dimensions to be exploitable. A possible solution is PCA: principal component analysis. A bit like regularisation, we need to accept losing some of the noise in the data in order to make the data more "friendly" to the model. I won't go into the details of PCA, but briefly, it consists in projecting the data orthogonally onto several successive "axes", which are the principal components. The goal of PCA is to obtain a dataset which still allows classification/regression, but relies on a smaller number of variables. Here, we will look at what this means in practice.
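This orthogonal projection can be sketched in a few lines of NumPy: center the data, take a singular value decomposition, and project onto the first principal directions (a sketch of the idea; sklearn's PCA does essentially this, with more careful numerics):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))  #correlated features

Xc = X - X.mean(axis=0)                    #center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T                  #project on the first 2 principal axes

explained = S**2 / np.sum(S**2)            #share of variance per component
print(X_reduced.shape, explained[:2])
```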
from sklearn.decomposition import PCA
#Two options: first, you can specify the number of variables you want to end up with. Here, we want to plot it, so 2 is good.
pca1 = PCA(n_components=2)
X_reduced_bs = pca1.fit_transform(X_train_bs)
plt.scatter(X_reduced_bs[:, 0], X_reduced_bs[:, 1], s=0.1)
plt.show()
print(pca1.explained_variance_ratio_)
[0.17455936 0.12837277]
Issue: we can see with "explained_variance_ratio_" that the first two principal components only explain 0.17 + 0.13 ≈ 30% of the data variance. We would rather aim for something like 90% or 95%. To do so:
pca2 = PCA(n_components = 0.9)
X_reduced_bs2 = pca2.fit_transform(X_train_bs)
print(X_reduced_bs2.shape)
print(np.sum(pca2.explained_variance_ratio_))
plt.pie(pca2.explained_variance_ratio_)
plt.show()
(106474, 14) 0.921170098566794
New issue: we still have 14 variables, which is not a major improvement over the original 21. Here, any computer can handle the 21 variables, but when the same situation arises with thousands of variables, one needs to try alternative techniques for dimensionality reduction (see the Géron textbook).
ncomp = 144
X, y = mnist.iloc[:, 1:785].to_numpy(), mnist['target'].to_numpy().astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pca_mnist = PCA(n_components = ncomp)
X_train_reduced = pca_mnist.fit_transform(X_train)
X_test_reduced = pca_mnist.transform(X_test)
print(np.sum(pca_mnist.explained_variance_ratio_))
0.9453900824575866
print(X_train.shape)
print(X_train_reduced.shape)
(49000, 784) (49000, 144)
np.sum(pca_mnist.explained_variance_ratio_)
0.9453900824575866
num = X_train[3, :].reshape(28, 28)
plt.imshow(num, cmap="binary")
plt.show()
num = X_train_reduced[3, :].reshape(int(sqrt(ncomp)), int(sqrt(ncomp)))
plt.imshow(num, cmap="binary")
plt.show()
X_train_inversed = pca_mnist.inverse_transform(X_train_reduced)
num = X_train_inversed[3, :].reshape(28, 28)
plt.imshow(num, cmap="binary")
plt.show()
Although the image is a bit noisy, we did not lose much information. We may train the model on the PCA-transformed array X_train_reduced instead of the original X_train, making the classification faster. Feel free to experiment with other values for compression, using one or the other method described earlier.
# Exercise 3 : implement the MNIST classification task with a Keras neural network
# on the data that was compressed using PCA, and compare its accuracy with
# the original classification task. Make sure to use the right activation and loss functions.
Introduction to TensorFlow¶
This part is based on the Hull reference: Machine Learning for Economics and Finance in TensorFlow 2
#Equivalent of a numpy array in TensorFlow:
X = tf.constant([[1, 2],
[3, 4]], tf.float32)
y = tf.constant([[3],
[8]], tf.float32)
print(tf.matmul(X, y) == X@y)
tf.Tensor( [[ True] [ True]], shape=(2, 1), dtype=bool)
#OLS in TensorFlow:
beta = tf.linalg.inv(tf.transpose(X) @ X) @ (tf.transpose(X) @ y)
print(beta)
tf.Tensor( [[2.0000305 ] [0.49999237]], shape=(2, 1), dtype=float32)
def ols_predict(X_test, beta):
    y_predict = tf.matmul(X_test, beta)
    return y_predict
y_pred = ols_predict(X, beta)
print(y_pred, y)
tf.Tensor( [[3.0000153] [8.000061 ]], shape=(2, 1), dtype=float32) tf.Tensor( [[3.] [8.]], shape=(2, 1), dtype=float32)
Generalized Linear Model with TensorFlow¶
This part was taken and adapted from https://yxue-me.com/post/2019-09-21-tensorflow-for-statisticians-2019/. It relies on the TensorFlow Probability library, which is more advanced than the rest of this tutorial. If you are interested in going further, you should take a look here: https://www.tensorflow.org/probability?hl=fr
#Data generation: a quick tutorial on TensorFlow Probability
#Set up a distribution:
ndist = tfd.Normal(loc = 0, scale = 1)
#Take a sample of 5 draws:
ndist.sample(5)
<tf.Tensor: shape=(5,), dtype=float32, numpy= array([ 1.7109102 , -0.2803424 , 1.9079707 , -0.05166156, 0.6280364 ], dtype=float32)>
#Bernoulli (note: probabilities must lie in [0, 1]):
bdist = tfd.Bernoulli(probs=[0.0, 0.5, 0.9])
bdist.sample(20)
<tf.Tensor: shape=(20, 3), dtype=int32, numpy= array([[0, 1, 1], [0, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 1, 1], [0, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 1, 1]])>
#Data Simulation
data = tfd.Normal(loc= 0, scale = 1).sample([10000, 20])
pars = tfd.Beta(2, 3).sample([20, 1])
lp = data @ pars
Y = tf.reshape(tf.cast(tfd.Bernoulli(probs=lp, dtype = np.float32).sample(1), dtype = np.float32), 10000)
#Option 1: tfp.glm.fit
w, linear_response, is_converged, num_iter = tfp.glm.fit(model_matrix = data, response = Y, model = tfp.glm.Bernoulli())
#Option 2: Keras
model_glm = tf.keras.Sequential()
model_glm.add(layers.Dense(1, activation = "sigmoid"))
model_glm.compile(optimizer = tf.optimizers.Adam(0.05), loss = 'BinaryCrossentropy')
model_glm.fit(data, Y, epochs = 10, batch_size = 1024)
Epoch 1/10 10/10 [==============================] - 0s 999us/step - loss: 0.5490 Epoch 2/10 10/10 [==============================] - 0s 1ms/step - loss: 0.2946 Epoch 3/10 10/10 [==============================] - 0s 1ms/step - loss: 0.2159 Epoch 4/10 10/10 [==============================] - 0s 1ms/step - loss: 0.1810 Epoch 5/10 10/10 [==============================] - 0s 1ms/step - loss: 0.1631 Epoch 6/10 10/10 [==============================] - 0s 1ms/step - loss: 0.1524 Epoch 7/10 10/10 [==============================] - 0s 1ms/step - loss: 0.1447 Epoch 8/10 10/10 [==============================] - 0s 1ms/step - loss: 0.1391 Epoch 9/10 10/10 [==============================] - 0s 1ms/step - loss: 0.1347 Epoch 10/10 10/10 [==============================] - 0s 999us/step - loss: 0.1311
<keras.callbacks.History at 0x2017d4bff48>
model_glm.weights[0].numpy()
array([[1.5476562 ], [1.5945425 ], [1.4098365 ], [1.5348563 ], [0.8680767 ], [2.2522497 ], [1.6124176 ], [0.25305015], [0.9425288 ], [0.58567697], [2.1909125 ], [0.20936796], [0.40090698], [1.5517644 ], [1.8131294 ], [0.7720003 ], [0.85735404], [1.2410616 ], [1.3905089 ], [0.89756656]], dtype=float32)
w.numpy().reshape(20, 1)
array([[1.5615354 ], [1.6151834 ], [1.3195904 ], [1.5078776 ], [0.81928235], [2.1917088 ], [1.6208267 ], [0.25970194], [0.96157074], [0.5722857 ], [2.335817 ], [0.24222252], [0.33490622], [1.5336287 ], [1.7106426 ], [0.7676628 ], [0.84719825], [1.2232621 ], [1.4053323 ], [0.8839661 ]], dtype=float32)
To go further¶
Scikit-Learn, TensorFlow, and Keras all offer many more possibilities than what was briefly presented here. Ensemble methods, unsupervised learning, and deep reinforcement learning can all be done with the tools introduced in this notebook. The Scikit-Learn and Keras documentation is very good, and Google provides tutorials on how to use TensorFlow.