MATH2319 Machine Learning

Semester 1, 2020
Assignment 3
Assignment Rules: Please read carefully!
Assignments are to be treated as “limited open-computer” take-home exams. That is, you must not discuss your assignment solutions with anyone else (including your classmates, paid/unpaid tutors, friends, parents, relatives, etc.) and the submission you make must be your own work. In addition, no member of the teaching team will assist you with any issues that are directly related to your assignment solutions.
For other assignment Codes of Conduct, please refer to this web page on Canvas:
You must document all your work in Jupyter notebook format. Please submit one Jupyter notebook file & one HTML file per question. Specifically, you must upload the following 4 files for this assignment:
StudentID_A3_Q1.ipynb and StudentID_A3_Q1.html (example: s1234567_A3_Q1.html)
StudentID_A3_Q2_AUC.ipynb and StudentID_A3_Q2_AUC.html (here, AUC needs to be the highest AUC you can get for Q2; example: s1234567_A3_Q2_0.632.html)
Please put your Honour Code at the top of your answer to the first question. At least one of your HTML files must contain the Honour Code.
Please make sure your online submission is consistent with the checklist below:
For full Assignment Instructions and Summary of Penalties, please see this web page on Canvas:
Please note that there will be penalties for any assignment instruction or specific question instruction that you do not follow.
Programming Language Instructions
You must use Python 3.6 or above throughout this entire Assignment 3. Use of Microsoft Excel is prohibited for any part of any question in this assignment. For plotting, you can use whatever Python module you like.

Question 1
(65 points)
This question is inspired by Exercise 5 of Chapter 6 of the textbook. Our problem is based on the US Census Income Dataset that we have been using in this course. Here, the annual_income target variable is binary, which is either high_income or low_income. As usual, high income will be the positive class for this problem.

For this question, you will use different variations of the Naive Bayes (NB) classifier for predicting the annual_income target feature. You will present your results as Pandas data frames.

Bayesian classifiers are among the most popular machine learning algorithms. Your goal here is two-fold:

To gain valuable skills on how to use popular variants of the Naive Bayes classifier using Scikit-Learn and
Be able to identify which variant to use for a given particular dataset.
Throughout this question,

Use the “A3_Q1_train.csv” dataset (with 500 rows) to build NB models.
Assume that the “A3_Q1_train.csv” dataset is clean in the sense that there are no outliers or any unusual values.
Use accuracy as the evaluation metric to train models.
NOTE: In practice, you should never train and test using the same data. This is cheating (unless there is some sort of cross-validation involved). However, throughout this entire Question 1, you are instructed to do just that to make coding easier. Besides, NB is a simple parametric model and the chances that it will overfit for this particular problem are relatively small.

Part A (10 points): Data Preparation
TASK 1 (5 points):
Transform the 2 numerical features (age and education_years) into 2 (nominal) categorical features. Specifically, use equal-width binning with the following 3 bins for each numerical feature: low, mid, and high. Once you do that, all the 5 descriptive features in your dataset will be categorical. Your dataset’s name after Task 1 needs to be df_all_cat. Please make sure to run the following code for marking purposes:

# so that we can see all the columns
pd.set_option('display.max_columns', None)
# please run below in a separate cell!!!
for col in df_all_cat.columns.tolist():
    print(col + ':')
HINT: You can use the cut() function in Pandas for equal-width binning.
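As a minimal sketch of what this binning might look like with pd.cut(), using a toy frame with made-up values (not the real dataset):

```python
import pandas as pd

# toy data standing in for the two numerical features (values are made up)
df = pd.DataFrame({'age': [19, 35, 52, 67, 44],
                   'education_years': [8, 12, 16, 10, 14]})

# equal-width binning into 3 bins labelled low/mid/high
for col in ['age', 'education_years']:
    df[col] = pd.cut(df[col], bins=3, labels=['low', 'mid', 'high'])

print(df['age'].tolist())  # e.g. age 19 falls into the 'low' bin
```

With an integer `bins` argument, pd.cut() divides the observed range of each column into equal-width intervals, which is exactly the equal-width binning the task asks for.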

TASK 2 (5 points):
Next, perform one-hot-encoding (OHE) on the dataset (after the equal-width binning above). Your dataset’s name after Task 2 needs to be df_all_cat_ohe. Please make sure to run the following code for marking purposes:
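Separate from the marking code referred to above (which is not reproduced here), a minimal sketch of one-hot-encoding with Pandas, using a toy frame with made-up columns:

```python
import pandas as pd

# toy post-binning frame; column names and values are illustrative only
df_all_cat = pd.DataFrame({'age': ['low', 'mid', 'high'],
                           'workclass': ['private', 'gov', 'private']})

# one-hot-encode every categorical column at once
df_all_cat_ohe = pd.get_dummies(df_all_cat)

print(df_all_cat_ohe.columns.tolist())
# -> ['age_high', 'age_low', 'age_mid', 'workclass_gov', 'workclass_private']
```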

You will provide your solutions for Parts B, C, and D below after you have taken care of the above two data preparation tasks.

MARKING NOTE: If your data preparation steps are incorrect, you will not get full credit for a correct follow-through.

Part B (5 points): Bernoulli NB
In the Chapter 6 PPT presentation, we recently added some explanation of a useful NB variant called Bernoulli NB. Please see the updated Chapter 6 PPT presentation on Canvas.

For this part, train a Bernoulli NB model (with default parameters) using the train data and compute its accuracy, again on the train data.

Official documentation on Bernoulli NB:
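A sketch of the Part B workflow on a toy binary matrix (the real input would be your df_all_cat_ohe features and the annual_income target):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# toy binary feature matrix standing in for the OHE data (values made up)
X = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0]])
y = np.array([1, 1, 0, 0])

model = BernoulliNB()      # default parameters, as instructed
model.fit(X, y)
print(model.score(X, y))   # train accuracy (training and evaluation data are the same here)
```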

Part C (5 points): Gaussian NB
For this part, train a Gaussian NB model (with default parameters) using the train data and compute its accuracy, again on the train data.

As you know, the Gaussian NB assumes that each descriptive feature follows a Gaussian probability distribution. However, this assumption no longer holds for this problem because all features will be binary after the data preparation tasks in Part A. Thus, the purpose of this part is to see what happens if you apply Gaussian NB on binary-encoded descriptive features.

Official documentation on Gaussian NB:
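To see that Gaussian NB still runs on binary columns (even though its distributional assumption is violated), a toy sketch with made-up values:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# the same kind of toy binary matrix: Gaussian NB simply models
# each 0/1 column as if it were Gaussian-distributed
X = np.array([[1, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0]], dtype=float)
y = np.array([1, 1, 0, 0])

model = GaussianNB()       # default parameters
model.fit(X, y)
print(model.score(X, y))
```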

Part D (20 points): Tuning your Models
In this part, you will fine-tune the hyper-parameters of the Bernoulli and Gaussian NB models in the above two parts to see if you can squeeze out a bit of additional performance by hyper-parameter optimization.

TASK 1 (5 points each): Tuning:
Fine-tune the alpha parameter of the Bernoulli NB model and the var_smoothing parameter of the Gaussian NB model.

TASK 2 (5 points each): Plotting:
Display a plot (with appropriate axes labels and a title) that shows the tuning results. Specifically, you will need to include two plots:

One plot for Bernoulli NB tuning results
One plot for Gaussian NB tuning results
You must clearly state the respective optimal hyper-parameter values and the corresponding accuracy scores.

There are no hard rules for hyper-parameter fine-tuning here except that you should follow fine-tuning best practices.

HINT: You can perform these fine-tuning tasks in simple “for” loops.
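A minimal sketch of such a for-loop tuning run for the Bernoulli model (toy data; the alpha grid here is only an assumption, not a prescribed one; the same pattern applies to var_smoothing for Gaussian NB):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# toy data standing in for the OHE train set (values made up)
X = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0])

# simple grid search over alpha in a for loop,
# scoring on the train data as the question instructs
results = {}
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = BernoulliNB(alpha=alpha).fit(X, y)
    results[alpha] = model.score(X, y)

best_alpha = max(results, key=results.get)
print(best_alpha, results[best_alpha])
```

The `results` dictionary then gives you the (alpha, accuracy) pairs to plot for Task 2.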

Part E (20 points): Hybrid NB
In the real world, you will usually work with datasets containing a mix of categorical and numerical features. So far, however, we have covered two NB variants:

Bernoulli NB that assumes all descriptive features are binary, and
Gaussian NB that assumes all descriptive features are numerical and they follow a Gaussian probability distribution.
The purpose of this part is to implement a Hybrid NB Classifier on the "A3_Q1_train.csv" dataset that uses Bernoulli NB (with default parameters) for the categorical descriptive features and Gaussian NB (with default parameters) for the numerical descriptive features. You will train your Hybrid NB model using the train data and compute its accuracy, again on the train data. This part will require you to think about how NB classifiers work in general and how Bernoulli and Gaussian NB classifiers can be combined via the "naivety" assumption of the Naive Bayes classifier.

Part F (5 points): Wrapping Up
For this part, you will summarize your results as a Pandas data frame called df_summary with the following 2 columns:

method
accuracy (please round these accuracy results to 3 decimal places)
As for the method, you will need to include the following methods in the order given below:

Part B (Bernoulli NB)
Part C (Gaussian NB)
Part D (Tuned Bernoulli NB)
Part D (Tuned Gaussian NB)
Part E (Hybrid NB)
After displaying df_summary, please briefly explain the following:
(i) Whether hyper-parameter tuning improves the performance of the Bernoulli and Gaussian NB models respectively.
(ii) Whether your Hybrid NB model has more predictive power than the (untuned) Bernoulli and Gaussian NB models respectively.
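Building df_summary might look like the sketch below; the accuracy numbers here are placeholders only, to be replaced with your own computed scores:

```python
import pandas as pd

# illustrative placeholder accuracies; use the values you computed above
df_summary = pd.DataFrame(
    {'method': ['Part B (Bernoulli NB)',
                'Part C (Gaussian NB)',
                'Part D (Tuned Bernoulli NB)',
                'Part D (Tuned Gaussian NB)',
                'Part E (Hybrid NB)'],
     'accuracy': [0.81, 0.79, 0.82, 0.80, 0.83]})

# round to 3 decimal places as required
df_summary['accuracy'] = df_summary['accuracy'].round(3)
print(df_summary)
```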

Question 2:
(35 points)
This question is actually a class competition.

The purpose of this question is to come up with a machine learning algorithm that will maximize the AUC (Area Under Curve) score for a loan default prediction problem. You will use the “loan_default_train.csv” (with 40,000 rows) and “loan_default_test.csv” (with 20,000 rows) datasets for training and testing respectively, which you will read in from the Cloud. You will assume that these datasets are clean in the sense that there are no outliers or any unusual values.

A brief description of the features in these datasets is given below:

loan_ID: ID of the loan
loan_amount: amount of loan in dollars
log_annual_income: log of annual income in dollars
delinq_2yrs: number of delinquent accounts in the past 2 years
dti: debt-to-income ratio
log_credit_age: log of the customer’s credit age in years
emp_length: length of employment in years
home_ownership: home ownership status
purpose: purpose of loan
inq_last_6mths: number of credit inquiries on the customer’s accounts in the past 6 months
open_accounts: number of open accounts
total_accounts: total number of accounts
log_inc_payment_ratio: log of income to payment ratio
log_revol_income_ratio: log of revolving income ratio
revolving_util_rate: revolving utilization rate
term: term of the loan (36 or 60 months)
target: loan status (Paid or Default) with Default being the positive class.
(Data Source: Not disclosed)
Your goal here will be to use the training dataset to build a powerful ML algorithm that will give the highest AUC on the test dataset. Remember, by accurately identifying the customers who are likely to default (that is, people who will take the money and never come back), you can save your company millions of dollars.

For coming up with the best algorithm, you are free to choose WHATEVER algorithm you like, e.g., decision trees, Naive Bayes, random forests, SVMs, neural networks, deep learning, gradient boosting, ensemble methods, custom hybrid methods, whatever. The sky is the limit! If you like, you can also use customized feature selection/extraction/construction, hyper-parameter fine-tuning, pipelines, or anything else.

For simplicity, you are hereby instructed to use the prepare_dataset() function below for preparing both the training and the test datasets for modeling. You can add additional data preparation steps, but these need to be done after running our prepare_dataset() function.

Part A (30 points): Your Model’s Test Performance
We will set the "lowest AUC" as the AUC of a decision tree classifier (with default values) built on the train data and evaluated on the test data. We will then identify the highest test AUC among student submissions and use a linear scale in between for marking your model's performance. For example, suppose the lowest test AUC is 0.55, the highest is 0.65, and your model's test AUC is 0.63. Your mark for Part A will then be (0.63-0.55)/(0.65-0.55) = 0.08/0.1 = 0.8 of 30 points = 24 points.

Once we release the marks, we will announce the best algorithm without mentioning the name of the winning student (unless he/she is OK with sharing this information).

PART B (5 points): Documentation of Your Model
As part of your submission, you will need to document your algorithm with sufficient detail. You need to explain any additional data preparation steps, feature selection (if any), your pipeline (if any), any other relevant details, and your actual model. You also need to include any relevant code that you have written. The idea here is that any other student in this class should be able to replicate your results based on your documentation & your code.

For this part, please keep it short and sweet! We do not need to know how you came up with your algorithm. So, please do not document the other algorithms you tried, or how you fine-tuned your algorithm, etc. We just would like to know what worked (and we are not interested in what didn’t work).

As a clarification, your documentation will be marked separately from your model’s performance. That is, your model’s performance can be terrible, but you can still get the full mark for this part if your documentation is done properly.

Good Luck & Have Fun!

Code to Use for Question 2
You need to use the Python code chunks below in your Question 2 submission without making any changes! If you must change any of our code, you need to clearly explain both WHY and HOW.

Needless to say, you can add as much code to these as you like as part of your modelling process.

1. Setting Random States to Control Randomness
Please see the additional information in the "Controlling Randomness in Your Jupyter Notebooks" section at this link:

# Set a seed value
seed_value = 999

# 1. Initialise `PYTHONHASHSEED` environment variable
import os
os.environ['PYTHONHASHSEED'] = str(seed_value)

# 2. Initialise Python's own pseudo-random generator
import random
random.seed(seed_value)

# 3. Initialise Numpy's pseudo-random generator
import numpy as np
np.random.seed(seed_value)
2. Getting Started: Read in the Datasets from the Cloud
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import io
import requests

# so that we can see all the columns
pd.set_option('display.max_columns', None)

# for Mac OS users only!
# if you run into any SSL certification issues,
# you may need to run the following command for a Mac OS installation.
# $/Applications/Python 3.x/Install Certificates.command (replace "x" per your Python version)
# if this does not fix the issue, please run the code chunk below
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

data_url_prefix = ''
data_train = 'loan_default_train.csv'
data_test = 'loan_default_test.csv'
url_content_train = requests.get(data_url_prefix + data_train).content
url_content_test = requests.get(data_url_prefix + data_test).content
df_train = pd.read_csv(io.StringIO(url_content_train.decode('utf-8')))
df_test = pd.read_csv(io.StringIO(url_content_test.decode('utf-8')))
df_train['target'] = df_train['target'].replace({'Paid': 0, 'Default': 1})
df_test['target'] = df_test['target'].replace({'Paid': 0, 'Default': 1})
Sample output (the shapes of df_train and df_test, followed by the first three rows of df_train):
(40000, 17)
(20000, 17)
loan_ID loan_amount log_annual_income delinq_2yrs dti log_credit_age emp_length home_ownership purpose inq_last_6mths open_accounts total_accounts log_inc_payment_ratio log_revol_income_ratio revolving_util_rate term target
0 1 4800 10.819778 0 26.66 0.427705 5 MORTGAGE credit_card 1 17 26 3.230382 -0.748998 26.4 36 months 0
1 2 18000 11.289782 0 31.89 0.468968 10 OWN debt_consolidation 2 15 45 2.257057 -1.031992 82.4 36 months 0
2 3 4600 10.778956 0 10.83 0.432080 7 MORTGAGE credit_card 2 12 40 3.204049 0.319173 24.6 36 months 1
3. Function to Prepare Datasets
# you need to use the prepare_dataset() function below to prepare
# BOTH the train and test data

def prepare_dataset(df):

    from sklearn import preprocessing

    target = df['target'].values
    Data = df.drop(columns=['target'])

    # get all columns that are strings
    categorical_cols = Data.select_dtypes(include='object').columns.tolist()

    # if a nominal feature has only 2 levels:
    # encode it as a single binary variable
    for col in categorical_cols:
        n = len(Data[col].unique())
        if n == 2:
            Data[col] = pd.get_dummies(Data[col], drop_first=True)

    # for categorical features with >2 levels: use one-hot-encoding
    # below, numerical columns will be untouched
    Data = pd.get_dummies(Data)

    # also return column names for convenience
    Data_column_names = Data.columns.tolist()

    # finally, scale Data
    Data = preprocessing.MinMaxScaler().fit_transform(Data)

    # in case there are some anomalies with the scaling
    Data = np.nan_to_num(Data)

    return (Data, target, Data_column_names)
4. Preparing the Datasets for Training and Testing
D_train, t_train, Data_column_names = prepare_dataset(df_train)
D_test, t_test, Data_column_names = prepare_dataset(df_test)
(40000, 30)
(20000, 30)
['loan_ID', 'loan_amount', 'log_annual_income', 'delinq_2yrs', 'dti', 'log_credit_age', 'emp_length', 'inq_last_6mths', 'open_accounts', 'total_accounts', 'log_inc_payment_ratio', 'log_revol_income_ratio', 'revolving_util_rate', 'term', 'home_ownership_MORTGAGE', 'home_ownership_OWN', 'home_ownership_RENT', 'purpose_car', 'purpose_credit_card', 'purpose_debt_consolidation', 'purpose_home_improvement', 'purpose_house', 'purpose_major_purchase', 'purpose_medical', 'purpose_moving', 'purpose_other', 'purpose_renewable_energy', 'purpose_small_business', 'purpose_vacation', 'purpose_wedding']
5. Model Definition
This part is where you define your best model.

The example below is for a decision tree with default parameters. You will need to replace DecisionTreeClassifier(random_state=999) with the best classifier you came up with.

Before fitting your classifier, you are free to do any additional coding as needed as part of your model documentation (example: feature selection, defining new features, setting up a pipeline, etc).

If you are doing any hyper-parameter fine-tuning, we do not want to see the details. We just want to see the final model. You can just say:

My best model is DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=999)
# in this example, we're using a DecisionTreeClassifier() with default values.
# In your submission, you need to set the "best_model" variable below
# to the best model you came up with
# e.g., DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=999)
from sklearn.tree import DecisionTreeClassifier
best_model = DecisionTreeClassifier(random_state=999)
6. Model Fitting and Evaluation

best_model.fit(D_train, t_train)
t_pred = best_model.predict(D_test)

from sklearn import metrics
auc_test = metrics.roc_auc_score(t_test, t_pred)
print(f'AUC = {auc_test:.3f}')
AUC = 0.550
Important Marking Notes for Question 2
For us to figure out the highest AUC in the class easily, please name your HTML and Notebook files as instructed in Assignment Rule 3, which would be something like “s1234567_A3_Q2_0.632.html” for the HTML where hypothetically 0.632 is your highest AUC.
It is easy to “cheat” by selectively running your code cells to get an AUC higher than your model’s actual test performance. In addition, execution time might be an issue when we try to verify the contents of your notebook. For these reasons, we will not mark any Question 2 submission whose code cell execution counts do not start at 1 and/or are not consecutive.
While setting up and training your model, you must never use the test dataset! If you use the test dataset in any way while model building, it will be cheating and you will not get any points for Part A of this question. To clarify, you can do whatever you want with the test data while experimenting (you will not be documenting your experimentation anyway). However, the model documentation you submit is not allowed to use the test data during the model building process.
You will not get any credit for any “mysterious” classifiers (that you might perhaps load from a file without any explanation). We need to see what you have done so that we can give you credit for it.