- Scrap Box Dataset
- Explore the Dataset
- Data Preprocessing
- Feature Selection
- Baseline Result
- Conclusion
- Further Work
Scrap Box Dataset
Days have passed since my first practical Random Forest experiment, where I attempted to predict the weight of an Aluminium scrap box.
After spending more days going deeper into Random Forests, here you can find a revised and, I hope, improved version of that previous post.
A short learning cycle gradually showed me what matters most:
figure out the metrics properly.
The same tip and trick comes from Thakur's book, where he underlines, before any kind of splitting: understand the data and implement the right metric.
The target drives the metric; therefore, understanding the target deeply will yield the right metric.
The Problem
Initially the problem to solve included 681 classes. Now I've kept only the 11 most common.
Previously I was using the wrong metric; today I've switched to ROC AUC, which is well suited to multi-class classification problems.
So what's the target, exactly? A multi-class classification problem with imbalanced data. It took me a while to see that, but it was worth it.
Wait, imbalanced what? I don't know yet. Let's dig into imbalanced data another day.
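Before touching the real data, here is a minimal sketch, on toy labels and probabilities I made up (not the scrap dataset), of how scikit-learn computes a one-vs-rest ROC AUC for a multi-class problem:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy 3-class problem: true labels and predicted class probabilities
y_true = np.array([0, 0, 1, 1, 2, 2])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])

# One-vs-rest ROC AUC, macro-averaged over the classes;
# each row of probabilities must sum to 1
score = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(score)  # → 1.0: every class is perfectly separated in this toy example
```

`average="macro"` weights every class equally regardless of its frequency, which is what keeps the metric informative on an imbalanced target.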
Explore the Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("scraps/scrap_202210181239.csv")
df.shape
df["tare_weight"].nunique()
df["tare_weight"].value_counts().head(11)
top_classes = df["tare_weight"].value_counts().head(11)
(df.shape[0]-top_classes.sum())/df.shape[0] *100
In my case I want to reduce the target spectrum from 681 classes to 11. This target reduction costs only 4.47% of the dataset's size. The other 670 classes are the result of inappropriate software usage; I'm pretty confident the current inserts are mostly happening correctly.
df = df[df["tare_weight"].isin(top_classes.index)]
df.shape
82388 - 78708
Removed 3,680 rows belonging to the 670 surplus classes: a small cut for a big win.
Let's look at feature and target correlations with seaborn's pairplot
method.
import seaborn as sns
sns.pairplot(df[:50], hue="tare_weight")
I don't see any strong linear correlation (except a few, which are duplicated features). This suggests that Random Forest, with its ability to work around uninformative features, should take advantage of this dataset's shape.
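To see why a Random Forest shrugs off uninformative features, here is a small synthetic sketch (toy data, not the scrap dataset): one feature fully determines the target, four are pure noise, and the importance ranking reflects that.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500

# One informative feature (column 0) plus four pure-noise features
informative = rng.normal(size=n)
X = np.column_stack([informative, rng.normal(size=(n, 4))])
y = (informative > 0).astype(int)  # the target depends only on column 0

m = RandomForestClassifier(n_estimators=40, random_state=42).fit(X, y)

# The informative feature dominates the importance ranking;
# the noise columns are effectively ignored
print(m.feature_importances_.argmax())  # → 0
```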
Data Preprocessing
from fastai.tabular.all import Categorify, FillMissing, cont_cat_split, RandomSplitter
dep = "tare_weight"
df = df.drop("net_weight", axis=1)
procs = [Categorify, FillMissing]
df = df.rename(columns={"max_tickness.1": "article_max_tickness",
"min_tickness.1": "article_min_tickness",
"max_tickness": "alloy_max_tickness",
"min_tickness": "alloy_min_tickness",
"name": "location_name"})
cont,cat = cont_cat_split(df, 1, dep_var=dep)
splits = RandomSplitter(valid_pct=0.25, seed=42)(df)
from fastai.tabular.all import TabularPandas
to = TabularPandas(
df, procs, cat, cont,
y_names=dep, splits=splits)
to.train.xs.iloc[:3]
len(to.train),len(to.valid)
from fastai.tabular.all import save_pickle
save_pickle('to.pkl',to)
from fastai.tabular.all import load_pickle
to = load_pickle('to.pkl')
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
def ovr_rf(xs, y, n_estimators=40,
           max_features=0.5, min_samples_leaf=5, **kwargs):
    return OneVsRestClassifier(RandomForestClassifier(
        n_jobs=-1, n_estimators=n_estimators,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        oob_score=True)).fit(xs, y)
Here I've simply re-adapted one of Jeremy's functions to work in a One-Versus-Rest pipeline. I'm improving my buzzword game!
m = ovr_rf(xs, y)
pred_prob = m.predict_proba(valid_xs)
pred_prob
Note that I'm not using the classic predict()
method here. pred_prob
is an array, generated by the predict_proba()
method, which contains the per-class probabilities. See also the sklearn source code.
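As a quick sanity check on toy data (not the scrap dataset), here is the relationship between the two methods: predict() is just the argmax of predict_proba(), mapped back to class labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny toy dataset: one feature, three classes
X = np.array([[0.0], [0.2], [0.9], [1.1], [2.0], [2.2]])
y = np.array([0, 0, 1, 1, 2, 2])
m = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

proba = m.predict_proba(X)  # shape (n_samples, n_classes)
print(proba.shape)          # (6, 3)

# Each row is a probability distribution over the classes
assert np.allclose(proba.sum(axis=1), 1.0)

# predict() is the argmax of predict_proba(), mapped back to class labels
assert (m.classes_[proba.argmax(axis=1)] == m.predict(X)).all()
```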
ROC Curve
Now it’s time to analyze our performance with a different metric: AUC ROC.
First encode all the classes, then binarize them, and finally plot the curves.
#Lets encode target labels (y) with values between 0 and n_classes-1.
#We will use the LabelEncoder to do this.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(valid_y)
transformed_valid_y = label_encoder.transform(valid_y)
classes = label_encoder.classes_
from sklearn.preprocessing import label_binarize
#binarize the y_values
plt.figure(figsize = (15, 10))
y_test_binarized=label_binarize(valid_y,classes=np.unique(valid_y))
# roc curve for classes
fpr = {}
tpr = {}
thresh ={}
roc_auc = dict()
n_class = classes.shape[0]
for i in range(n_class):
    fpr[i], tpr[i], thresh[i] = roc_curve(y_test_binarized[:, i], pred_prob[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
    # plotting
    plt.plot(fpr[i], tpr[i], linestyle='--',
             label='%s vs Rest (AUC=%0.2f)' % (classes[i], roc_auc[i]))
plt.plot([0,1],[0,1],'b--')
plt.xlim([0,1])
plt.ylim([0,1.05])
plt.title('Multiclass ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive rate')
plt.legend(loc='lower right')
plt.show()
avg_roc_auc = pd.Series(roc_auc)
avg_roc_auc.mean()
An average ROC AUC of 94% is really good. Only the 750 box is consistently misclassified, with an AUC of 83%.
Feature Selection
Feature selection starts from feature_importances_
. I've adapted Jeremy's method to work with a multi-class model.
Feature Importances
def rf_feat_importance(m, df, i):
    return pd.DataFrame({'cols': df.columns, 'imp': m.estimators_[i].feature_importances_}
                        ).sort_values('imp', ascending=False)
Every class has its own feature importances, so I have to compress everything into a single frame and average it. I've implemented this with a simple concat
and mean
.
df_all = pd.DataFrame()
for i in range(df["tare_weight"].nunique()):
    df_all = pd.concat([df_all, rf_feat_importance(m, xs, i)])
cols = df_all["cols"].sort_index().unique()
df_all = df_all.groupby(df_all.index).mean()
df_all["cols"] = cols
df_all = df_all.sort_values('imp', ascending=False)
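The concat + groupby(index).mean() pattern above can be verified on a toy example (hypothetical frames imp_a and imp_b standing in for two per-class importance tables):

```python
import pandas as pd

# Two hypothetical per-class importance frames sharing the same row index,
# standing in for the outputs of rf_feat_importance per class
imp_a = pd.DataFrame({"cols": ["f1", "f2"], "imp": [0.8, 0.2]})
imp_b = pd.DataFrame({"cols": ["f1", "f2"], "imp": [0.6, 0.4]})

# Stack them, then average the rows that share the same index label
df_all = pd.concat([imp_a, imp_b])
mean_imp = df_all.groupby(df_all.index).mean(numeric_only=True)
print(mean_imp["imp"].tolist())  # ≈ [0.7, 0.3]
```

The trick works because each per-class frame keeps the same row index per feature, so grouping by the index averages each feature across all the classes.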
Finally plotting averaged feature importances of the whole classes.
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12, 7), legend=False)
plot_fi(df_all[:30]);
Let's remove the less significant ones.
df_all[df_all["imp"] >= 0.002]
fi = df_all[df_all["imp"] < 0.002]
filtered_xs = xs.drop(fi["cols"], axis=1)
filtered_valid_xs = valid_xs.drop(fi["cols"], axis=1)
filtered_xs.shape, filtered_valid_xs.shape
m = ovr_rf(filtered_xs, y)
pred_prob = m.predict_proba(filtered_valid_xs)
def roc_plot(classes):
    plt.figure(figsize=(15, 10))
    y_test_binarized = label_binarize(valid_y, classes=np.unique(valid_y))
    # roc curve for classes
    fpr = {}
    tpr = {}
    thresh = {}
    roc_auc = dict()
    n_class = classes.shape[0]
    for i in range(n_class):
        fpr[i], tpr[i], thresh[i] = roc_curve(y_test_binarized[:, i], pred_prob[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
        # plotting
        plt.plot(fpr[i], tpr[i], linestyle='--',
                 label='%s vs Rest (AUC=%0.2f)' % (classes[i], roc_auc[i]))
    plt.plot([0, 1], [0, 1], 'b--')
    plt.xlim([0, 1])
    plt.ylim([0, 1.05])
    plt.title('Multiclass ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.show()
roc_plot(classes)
def roc_auc_classes(pred_prob):
    # per-class one-vs-rest AUCs against the binarized validation target
    yb = label_binarize(valid_y, classes=np.unique(valid_y))
    return [roc_auc_score(yb[:, i], pred_prob[:, i]) for i in range(classes.shape[0])]

def avg_roc_auc(pred_prob):
    return pd.Series(roc_auc_classes(pred_prob)).mean()
avg_roc_auc(pred_prob)
By just removing the less important features, the model has improved by a few decimal points.
from fastai.tabular.all import save_pickle
save_pickle('filtered_xs.pkl',filtered_xs)
save_pickle('filtered_valid_xs.pkl',filtered_valid_xs)
filtered_xs = load_pickle('filtered_xs.pkl')
filtered_valid_xs = load_pickle('filtered_valid_xs.pkl')
Feature Correlation
import matplotlib.pyplot as plt
import seaborn as sns
xs_corr = filtered_xs.corr()
compressed_xs = xs_corr[((xs_corr >= .5) | (xs_corr <= -.5)) & (xs_corr != 1.000)]
plt.figure(figsize=(30, 10))
sns.heatmap(compressed_xs, annot=True, cmap="Reds")
plt.show()
def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr != 1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened
corrFilter(filtered_xs, .65)
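To sanity-check the unstack/drop_duplicates trick, here is corrFilter applied to a toy frame (made-up columns a, b, c) where one pair is correlated by construction:

```python
import numpy as np
import pandas as pd

def corrFilter(x: pd.DataFrame, bound: float):
    # Keep only correlations beyond the bound, excluding the diagonal,
    # then flatten the matrix into (feature, feature) -> correlation pairs
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr != 1.000)]
    return xFiltered.unstack().sort_values().drop_duplicates()

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of a
    "c": rng.normal(size=200),                      # independent column
})

# Only the (a, b) pair survives the 0.65 bound; everything else is NaN
pairs = corrFilter(df_toy, 0.65).dropna()
print(pairs)
```

drop_duplicates removes the symmetric copy of each pair (the correlation matrix contains both (a, b) and (b, a)), so each correlated pair is reported once.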
def oob_estimators(filtered_xs):
    m = ovr_rf(filtered_xs, y)
    return [m.estimators_[i].oob_score_ for i in range(df["tare_weight"].nunique())]
Since I'm working with a dataset of 11 classes, it's essential to evaluate the Out-of-Bag score of each class's estimator. The goal is to remove closely correlated features as long as the OOB score stays stable or improves.
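As a refresher, here is a minimal sketch of what a single forest's oob_score_ measures, on synthetic data rather than the scrap dataset: each tree is evaluated on the rows left out of its bootstrap sample, giving a validation-like accuracy for free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem standing in for one of the one-vs-rest estimators
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           random_state=42)

# With oob_score=True, each tree is scored on the rows excluded from its
# bootstrap sample, so oob_score_ estimates accuracy without a held-out set
m = RandomForestClassifier(n_estimators=100, oob_score=True,
                           random_state=42).fit(X, y)
print(m.oob_score_)  # accuracy estimate between 0 and 1
```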
oob_estimators(filtered_xs)
to_drop = ["id", "timestamp", "slim_alloy", "id_alloy", "pairing_alloy",
"international_alloy", "id_user", "address",
"location_name", "article_min_tickness", "article_max_tickness_na"]
{c:oob_estimators(filtered_xs.drop(c, axis=1)) for c in to_drop}
The features in the to_drop
list whose removal leaves the average OOB
score higher will be dropped.
to_drop = ["id", "pairing_alloy", "id_alloy",
           "article_max_tickness_na", "location_name"]
filtered_xs = filtered_xs.drop(to_drop, axis=1)
filtered_valid_xs = filtered_valid_xs.drop(to_drop, axis=1)
filtered_valid_xs.shape, filtered_xs.shape
m = ovr_rf(filtered_xs, y)
pred_prob = m.predict_proba(filtered_valid_xs)
roc_plot(classes)
avg_roc_auc(pred_prob)
Obtaining a 94.5% ROC AUC score while keeping the OOB score high is a good achievement. Breakpoint saved.
save_pickle('filtered_xs.pkl',filtered_xs)
save_pickle('filtered_valid_xs.pkl',filtered_valid_xs)
filtered_xs = load_pickle('filtered_xs.pkl')
filtered_valid_xs = load_pickle('filtered_valid_xs.pkl')
Baseline Result
Now it's time to check for out-of-domain data to minimize overfitting.
Out of Domain Data
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)
# `rf` is the plain (non-OvR) version of ovr_rf, re-adapted from Jeremy's helper
def rf(xs, y, **kwargs):
    return RandomForestClassifier(n_jobs=-1, n_estimators=40, max_features=0.5,
                                  min_samples_leaf=5, **kwargs).fit(xs, y)

df_dom = pd.concat([filtered_xs, filtered_valid_xs])
is_valid = np.array([0]*len(filtered_xs) + [1]*len(filtered_valid_xs))
m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:15]
for c in ('timestamp', 'weight', 'slim_alloy',
          'international_alloy', 'id_machine_article_description',
          'id_idp_user', 'last_name', 'id_machine', 'slim_number',
          'first_name', 'code_machine', 'description_machine'):
    m = ovr_rf(filtered_xs.drop(c, axis=1), y)
    pred_prob = m.predict_proba(filtered_valid_xs.drop(c, axis=1))
    print(c, avg_roc_auc(pred_prob))
to_drop = ['international_alloy', 'last_name', 'id_machine', 'slim_number', 'description_machine']
xs_final = filtered_xs.drop(to_drop, axis=1)
valid_xs = filtered_valid_xs.drop(to_drop, axis=1)
xs_final.shape, valid_xs.shape
def oob_estimators_avg(m):
    # average OOB score across the per-class estimators
    return np.mean([est.oob_score_ for est in m.estimators_])

m = ovr_rf(xs_final, y)
pred_prob = m.predict_proba(valid_xs)
avg_roc_auc(pred_prob), oob_estimators_avg(m)
Everything ended with fewer features (15) and higher scores on both ROC AUC and OOB.
save_pickle('final_xs.pkl',xs_final)
save_pickle('final_valid_xs.pkl',valid_xs)
Hyperparameter Tuning
Before wrapping up, I'll try some hyperparameter tuning.
xs_final = load_pickle('final_xs.pkl')
valid_xs = load_pickle('final_valid_xs.pkl')
m.get_params()
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 200, num = 4)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 50, num = 5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
# (the base estimator sets oob_score=True, which requires bootstrap=True)
bootstrap = [True]
# Create the random grid
random_grid = {'estimator__n_estimators': n_estimators,
'estimator__max_features': max_features,
'estimator__max_depth': max_depth,
'estimator__min_samples_split': min_samples_split,
'estimator__min_samples_leaf': min_samples_leaf,
'estimator__bootstrap': bootstrap}
random_grid
from sklearn.model_selection import ShuffleSplit
sp = ShuffleSplit(n_splits=2, test_size=.25, random_state=42)
rf.get_params().keys()
from sklearn.ensemble import RandomForestClassifier
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = OneVsRestClassifier(RandomForestClassifier(oob_score=True))
# Random search of parameters, using the 2-split shuffle CV defined above,
# searching across 10 combinations on 3 cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=10, cv=sp, verbose=2, random_state=42,
                               n_jobs=3)
# Fit the random search model
rf_random.fit(xs_final, y)
rf_random.best_params_
best_model = rf_random.best_estimator_
pred_prob = best_model.predict_proba(valid_xs)
avg_roc_auc(pred_prob), oob_estimators_avg(best_model)
Now let's narrow the range and try to gain a few more decimals.
# Number of trees in random forest
n_estimators = [50, 100, 150]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [30, 50, 100]
# Minimum number of samples required to split a node
min_samples_split = [5, 10, 20]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True]
# Create the random grid
random_grid = {'estimator__n_estimators': n_estimators,
'estimator__max_features': max_features,
'estimator__max_depth': max_depth,
'estimator__min_samples_split': min_samples_split,
'estimator__min_samples_leaf': min_samples_leaf,
'estimator__bootstrap': bootstrap}
random_grid
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = OneVsRestClassifier(RandomForestClassifier(oob_score=True))
# Random search of parameters, using the 2-split shuffle CV defined above,
# searching across 10 combinations on 3 cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=10, cv=sp, verbose=2, random_state=42,
                               n_jobs=3)
# Fit the random search model
rf_random.fit(xs_final, y)
rf_random.best_estimator_
narrowed_model = rf_random.best_estimator_
pred_prob = narrowed_model.predict_proba(valid_xs)
avg_roc_auc(pred_prob), oob_estimators_avg(narrowed_model)
From 94% to almost 94.7% is the final score, with the OOB score stable around 95.3%.
Conclusion
Misclassifying the tare weight of an Aluminium scrap box is expensive, causing damage to the company (less revenue) and to the environment (re-melting Aluminium).
Scoring a 94.7% average ROC AUC is a great baseline. Surely fewer scraps will be wasted.
Further Work
- Develop a service which hosts the model.
- How does the model react if I remove duplicated rows? Try it.
- I know the dataset is imbalanced. Address it.
- Compare the result with a deep learning tabular model.
- Compare the result with an XGBoost model.
- Use the same method to classify Aluminium alloys.