Scrap Box Dataset
Days have passed since my first practical Random Forest experiment, where I attempted to predict the weight of an Aluminium Scrap Box.
After spending more days going deeper into Random Forest, here you can find a revised and hopefully improved version of the previous post.
Short learning cycles gradually taught me what matters most:
figuring out the metrics properly.
The same tip and trick comes from Thakur's book, where he underlines that, before any kind of splitting, you should understand the data and implement the right metric.
The target drives the metric, therefore understanding the target deeply will lead you to the right metric.
The Problem
Initially the problem to solve included 681 classes. Now I've kept only the 11 most common.
Previously I was using the wrong metric; today I've switched to AUC ROC, a metric well suited to multi-class classification problems.
So, what's the target? A multi-class classification problem with imbalanced data. It took me a while to pin down, but it was worth it.
Wait, imbalanced what? I don't know yet. Let's dig into imbalanced data another day.
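To make the metric concrete before touching the real data, here's a minimal sketch on synthetic data (everything below is illustrative, not the scrap dataset) of computing AUC ROC one-vs-rest on an imbalanced multi-class target:

```python
# Toy sketch: AUC ROC on an imbalanced 3-class problem, averaged one-vs-rest.
# Synthetic data only; names and parameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# imbalanced target: 70% / 20% / 10% class shares
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=40, random_state=42).fit(X_train, y_train)
# multi_class='ovr' averages one AUC per class, each computed "class vs rest"
auc_ovr = roc_auc_score(y_valid, clf.predict_proba(X_valid), multi_class='ovr')
print(auc_ovr)
```

Unlike plain accuracy, this score doesn't reward always predicting the majority class, which is exactly why it suits an imbalanced target.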
Explore the Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("scraps/scrap_202210181239.csv")
df.shape
df["tare_weight"].nunique()
df["tare_weight"].value_counts().head(11)
top_classes = df["tare_weight"].value_counts().head(11)
(df.shape[0] - top_classes.sum()) / df.shape[0] * 100
In my case I want to reduce the target spectrum from 681 classes to 11. This reduction shrinks the dataset by only 4.47%. The other 670 classes are the result of inappropriate software usage; I'm pretty confident the current inserts are happening mostly right.
top_classes["top_classes"] = top_classes.index
df = df[df['tare_weight'].isin(top_classes["top_classes"])]
df.shape
82388 - 78708
Removed 3680 rows belonging to the 670 surplus classes: a small cut for a big gain.
Let's look at feature and target correlations with the pairplot method.
import seaborn as sns

# df_2 = df_2[df_2["weight"] <= 3500]
sns.pairplot(df[:50], hue="tare_weight")
I don't see any strong linear correlation (except a few duplicated features). This suggests Random Forest, thanks to its ability to cope with uninformative features, can take advantage of this dataset's shape.
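A quick toy demonstration of that claim (synthetic data, illustrative feature names): a random forest assigns almost all of its importance to the informative feature and nearly none to pure noise.

```python
# Toy sketch: random forests down-weight uninformative features on their own.
# Synthetic data; the feature names are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 1000
signal = rng.normal(size=n)
noise = rng.normal(size=n)            # uninformative feature
target = (signal > 0).astype(int)     # depends only on `signal`

X = pd.DataFrame({"signal": signal, "noise": noise})
m = RandomForestClassifier(n_estimators=40, random_state=42).fit(X, target)
imp = pd.Series(m.feature_importances_, index=X.columns)
print(imp)  # `signal` dominates, `noise` gets a tiny share
```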
Data Preprocessing
from fastai.tabular.all import Categorify, FillMissing, cont_cat_split, RandomSplitter

dep = "tare_weight"
df = df.drop("net_weight", axis=1)
procs = [Categorify, FillMissing]
df = df.rename(columns={"max_tickness.1": "article_max_tickness",
                        "min_tickness.1": "article_min_tickness",
                        "max_tickness": "alloy_max_tickness",
                        "min_tickness": "alloy_min_tickness",
                        "name": "location_name"})
cont, cat = cont_cat_split(df, 1, dep_var=dep)
splits = RandomSplitter(valid_pct=0.25, seed=42)(df)

from fastai.tabular.all import TabularPandas

to = TabularPandas(df, procs, cat, cont, y_names=dep, splits=splits)
to.train.xs.iloc[:3]
len(to.train), len(to.valid)

from fastai.tabular.all import save_pickle

save_pickle('to.pkl', to)

from fastai.tabular.all import load_pickle

to = load_pickle('to.pkl')
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.ensemble import RandomForestClassifier

def ovr_rf(xs, y, n_estimators=40,
           max_features=0.5, min_samples_leaf=5, **kwargs):
    return OneVsRestClassifier(RandomForestClassifier(
        n_jobs=-1, n_estimators=n_estimators, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True)).fit(xs, y)
Here I've simply re-adapted one of Jeremy's functions to work within a One-Versus-Rest pipeline. I'm improving my buzzword game!
m = ovr_rf(xs, y)
pred_prob = m.predict_proba(valid_xs)
pred_prob
Note that I'm not using the classic predict() method. pred_prob is an array, generated by the predict_proba() method, which contains the class probabilities. See also the sklearn source code.
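To show what that array looks like, here's a small self-contained sketch (toy data, not the scrap dataset): each row is a sample, each column the probability of one class, with columns following the order of m.classes_.

```python
# Toy sketch of predict_proba() on a OneVsRestClassifier:
# rows = samples, columns = class probabilities (normalized to sum to 1).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=200, n_classes=4, n_informative=6,
                           random_state=0)
m = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=20, random_state=0)).fit(X, y)

probs = m.predict_proba(X[:5])
print(probs.shape)   # (5, 4): 5 samples x 4 classes
```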
ROC Curve
Now it's time to analyze our performance with a different metric: AUC ROC.
First encode all classes, then binarize them, and finally plot.
# Let's encode target labels (y) with values between 0 and n_classes-1.
# We will use the LabelEncoder to do this.
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(valid_y)
transformed_valid_y = label_encoder.transform(valid_y)
classes = label_encoder.classes_
from sklearn.preprocessing import label_binarize

# binarize the y values
plt.figure(figsize=(15, 10))
y_test_binarized = label_binarize(valid_y, classes=np.unique(valid_y))

# ROC curve per class
fpr = {}
tpr = {}
thresh = {}
roc_auc = dict()

n_class = classes.shape[0]

for i in range(n_class):
    fpr[i], tpr[i], thresh[i] = roc_curve(y_test_binarized[:, i], pred_prob[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

    # plotting
    plt.plot(fpr[i], tpr[i], linestyle='--',
             label='%s vs Rest (AUC=%0.2f)' % (classes[i], roc_auc[i]))

plt.plot([0, 1], [0, 1], 'b--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.title('Multiclass ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()
avg_roc_auc = pd.Series(roc_auc)
avg_roc_auc.mean()
An average AUC of 94% is really good. Only the 750 box is regularly misclassified, with an AUC of 83%.
Feature Selection
Feature selection starts from feature_importances_. I've adapted Jeremy's method to work with a multi-class model.
Feature Importances
def rf_feat_importance(m, df, i):
    return pd.DataFrame({'cols': df.columns, 'imp': m.estimators_[i].feature_importances_}
                        ).sort_values('imp', ascending=False)
Every class has its own feature importances, so I have to compress everything into one frame and average it. I've implemented a simple concat and mean.
df_all = pd.DataFrame()
for i in range(df["tare_weight"].nunique()):
    df_all = pd.concat([df_all, rf_feat_importance(m, xs, i)])

cols = df_all["cols"].sort_index().unique()
df_all = df_all.groupby(df_all.index).mean()
df_all["cols"] = cols
df_all = df_all.sort_values('imp', ascending=False)
Finally, plot the averaged feature importances across all classes.
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12, 7), legend=False)

plot_fi(df_all[:30]);
Let's remove the less significant ones.
df_all[df_all["imp"] >= 0.002]

fi = df_all[df_all["imp"] < 0.002]
filtered_xs = xs.drop(fi["cols"], axis=1)
filtered_valid_xs = valid_xs.drop(fi["cols"], axis=1)
filtered_xs.shape, filtered_valid_xs.shape
m = ovr_rf(filtered_xs, y)
pred_prob = m.predict_proba(filtered_valid_xs)
def roc_plot(classes):
    plt.figure(figsize=(15, 10))
    y_test_binarized = label_binarize(valid_y, classes=np.unique(valid_y))

    # ROC curve per class
    fpr = {}
    tpr = {}
    thresh = {}
    roc_auc = dict()

    n_class = classes.shape[0]

    for i in range(n_class):
        fpr[i], tpr[i], thresh[i] = roc_curve(y_test_binarized[:, i], pred_prob[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

        # plotting
        plt.plot(fpr[i], tpr[i], linestyle='--',
                 label='%s vs Rest (AUC=%0.2f)' % (classes[i], roc_auc[i]))

    plt.plot([0, 1], [0, 1], 'b--')
    plt.xlim([0, 1])
    plt.ylim([0, 1.05])
    plt.title('Multiclass ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.show()

roc_plot(classes)
def avg_roc_auc(pred_prob):
    return pd.Series(roc_auc_classes(pred_prob)).mean()

avg_roc_auc(pred_prob)
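roc_auc_classes isn't defined anywhere in the post; here's a minimal reconstruction, assuming it simply recomputes the per-class AUC dictionary the same way the plotting loop does (the signature takes valid_y explicitly to keep the sketch self-contained; the original presumably reads it from the enclosing scope):

```python
# Hypothetical reconstruction of roc_auc_classes (my guess, not the author's
# code): per-class one-vs-rest AUC from binarized labels and probabilities.
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

def roc_auc_classes(pred_prob, valid_y):
    y_bin = label_binarize(valid_y, classes=np.unique(valid_y))
    scores = {}
    for i in range(y_bin.shape[1]):
        fpr, tpr, _ = roc_curve(y_bin[:, i], pred_prob[:, i])
        scores[i] = auc(fpr, tpr)
    return scores

# toy check: three classes, well-separated probabilities
valid_y = np.array([0, 0, 1, 1, 2, 2])
pred_prob = np.array([[.9, .05, .05], [.8, .1, .1], [.1, .8, .1],
                      [.2, .7, .1], [.05, .05, .9], [.1, .2, .7]])
print(roc_auc_classes(pred_prob, valid_y))
```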
Just by removing the less important features, the model has improved by a few decimal points.
from fastai.tabular.all import save_pickle

save_pickle('filtered_xs.pkl', filtered_xs)
save_pickle('filtered_valid_xs.pkl', filtered_valid_xs)
filtered_xs = load_pickle('filtered_xs.pkl')
filtered_valid_xs = load_pickle('filtered_valid_xs.pkl')
Features Correlation
import matplotlib.pyplot as plt
import seaborn as sn

xs_corr = filtered_xs.corr()
compressed_xs = xs_corr[((xs_corr >= .5) | (xs_corr <= -.5)) & (xs_corr != 1.000)]
plt.figure(figsize=(30, 10))
sn.heatmap(compressed_xs, annot=True, cmap="Reds")
plt.show()
def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr != 1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

corrFilter(filtered_xs, .65)
def oob_estimators(filtered_xs):
    m = ovr_rf(filtered_xs, y)
    return [m.estimators_[i].oob_score_ for i in range(df["tare_weight"].nunique())]
Since I'm working with a dataset of 11 classes, it's essential to evaluate the Out-of-Bag score of each class's estimator. The goal is to remove closely correlated features while keeping the OOB score stable or improving it.
oob_estimators(filtered_xs)

to_drop = ["id", "timestamp", "slim_alloy", "id_alloy", "pairing_alloy",
           "international_alloy", "id_user", "address",
           "location_name", "article_min_tickness", "article_max_tickness_na"]

{c: oob_estimators(filtered_xs.drop(c, axis=1)) for c in to_drop}
The features in the to_drop list whose removal leaves the average OOB score higher will be dropped.
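Note that oob_estimators_avg, which shows up in the score comparisons from here on, is never shown in the post. A minimal sketch, assuming it just averages the per-class OOB scores of a fitted One-Versus-Rest model (the toy data and names here are illustrative):

```python
# Hypothetical reconstruction of oob_estimators_avg (my assumption, not the
# author's code): mean OOB score across the per-class OvR estimators.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

def oob_estimators_avg(m):
    return np.mean([est.oob_score_ for est in m.estimators_])

# toy usage on synthetic 3-class data
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)
m = OneVsRestClassifier(RandomForestClassifier(
    n_estimators=40, oob_score=True, random_state=0)).fit(X, y)
print(oob_estimators_avg(m))
```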
to_drop = ["id", "pairing_alloy", "id_alloy",
           "article_max_tickness_na", "location_name"]

filtered_xs = filtered_xs.drop(to_drop, axis=1)
filtered_valid_xs = filtered_valid_xs.drop(to_drop, axis=1)
filtered_valid_xs.shape, filtered_xs.shape

m = ovr_rf(filtered_xs, y)
pred_prob = m.predict_proba(filtered_valid_xs)
roc_plot(classes)
avg_roc_auc(pred_prob)
Obtaining a 94.5% AUC ROC score while keeping the OOB score high is a good achievement. Breakpoint saved.
save_pickle('filtered_xs.pkl', filtered_xs)
save_pickle('filtered_valid_xs.pkl', filtered_valid_xs)
filtered_xs = load_pickle('filtered_xs.pkl')
filtered_valid_xs = load_pickle('filtered_valid_xs.pkl')
Baseline Result
Now it's time to deal with out-of-domain data, to minimize overfitting.
Out of Domain Data
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)

df_dom = pd.concat([filtered_xs, filtered_valid_xs])
is_valid = np.array([0]*len(filtered_xs) + [1]*len(filtered_valid_xs))

# rf() is presumably Jeremy's plain random-forest helper (not the OvR
# wrapper): here we train a classifier to tell train rows from valid rows
m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:15]
for c in ('timestamp', 'weight', 'slim_alloy',
          'international_alloy', 'id_machine_article_description',
          'id_idp_user', 'last_name', 'id_machine', 'slim_number',
          'first_name', 'code_machine', 'description_machine'):
    m = ovr_rf(filtered_xs.drop(c, axis=1), y)
    pred_prob = m.predict_proba(filtered_valid_xs.drop(c, axis=1))
    print(c, avg_roc_auc(pred_prob))
to_drop = ['international_alloy', 'last_name', 'id_machine', 'slim_number', 'description_machine']

xs_final = filtered_xs.drop(to_drop, axis=1)
valid_xs = filtered_valid_xs.drop(to_drop, axis=1)
xs_final.shape, valid_xs.shape

m = ovr_rf(xs_final, y)
pred_prob = m.predict_proba(valid_xs)
avg_roc_auc(pred_prob), oob_estimators_avg(m)
Everything ended with fewer features (15) and a higher score on both AUC ROC and OOB.
save_pickle('final_xs.pkl', xs_final)
save_pickle('final_valid_xs.pkl', valid_xs)
Hyperparameter Tuning
Before the game ends I'll try some hyperparameter tuning.
xs_final = load_pickle('final_xs.pkl')
valid_xs = load_pickle('final_valid_xs.pkl')

m.get_params()
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=50, stop=200, num=4)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(10, 50, num=5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'estimator__n_estimators': n_estimators,
               'estimator__max_features': max_features,
               'estimator__max_depth': max_depth,
               'estimator__min_samples_split': min_samples_split,
               'estimator__min_samples_leaf': min_samples_leaf,
               'estimator__bootstrap': bootstrap}
random_grid
from sklearn.model_selection import ShuffleSplit

sp = ShuffleSplit(n_splits=2, test_size=.25, random_state=42)
rf.get_params().keys()
from sklearn.ensemble import RandomForestClassifier

# Use the random grid to search for good hyperparameters.
# First create the base model to tune
rf = OneVsRestClassifier(RandomForestClassifier(oob_score=True))
# Random search over 10 parameter combinations, using the ShuffleSplit
# defined above for validation and 3 parallel jobs
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=10, cv=sp, verbose=2,
                               random_state=42, n_jobs=3)
# Fit the random search model
rf_random.fit(xs_final, y)
rf_random.best_params_
from sklearn.metrics import accuracy_score

best_model = rf_random.best_estimator_
pred_prob = best_model.predict_proba(valid_xs)
avg_roc_auc(pred_prob), oob_estimators_avg(best_model)
Now let's narrow the ranges and try to gain a few more decimals.
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the random forest
n_estimators = [50, 100, 150]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in a tree
max_depth = [30, 50, 100]
# Minimum number of samples required to split a node
min_samples_split = [5, 10, 20]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True]

# Create the narrowed grid
random_grid = {'estimator__n_estimators': n_estimators,
               'estimator__max_features': max_features,
               'estimator__max_depth': max_depth,
               'estimator__min_samples_split': min_samples_split,
               'estimator__min_samples_leaf': min_samples_leaf,
               'estimator__bootstrap': bootstrap}
random_grid
# Use the narrowed grid to search for good hyperparameters.
# First create the base model to tune
rf = OneVsRestClassifier(RandomForestClassifier(oob_score=True))
# Random search over 10 parameter combinations, using the ShuffleSplit
# defined above for validation and 3 parallel jobs
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=10, cv=sp, verbose=2,
                               random_state=42, n_jobs=3)
# Fit the random search model
rf_random.fit(xs_final, y)
rf_random.best_estimator_
narrowed_model = rf_random.best_estimator_
pred_prob = narrowed_model.predict_proba(valid_xs)
avg_roc_auc(pred_prob), oob_estimators_avg(narrowed_model)
From 94% to almost 94.7% is the final score, with the OOB score stable around 95.3%.
Conclusion
Misclassifying the tare weight of an Aluminium scrap box is expensive, causing damage to the company (lost revenue) and to the environment (re-melting Aluminium).
Scoring 94.7% at predicting the right class is a great baseline. Surely less scrap will be wasted.
Further Work
- Develop service which host the model.
- How reacts the model if I remove duplicated rows? Do it.
- I know the dataset is imbalanced. Implement it.
- Compare the result with deep learning tabular model.
- Compare the result with XGBoost model.
- Using same method to classify Aluminium alloys.