Classification Model in R with Caret Package

Muhammad Andi Yudha
Mar 23, 2023

caret, short for Classification And REgression Training, is an R package with functions that attempt to streamline the process of creating predictive models. The package contains tools for:

  • data splitting
  • pre-processing
  • feature selection
  • model tuning using resampling
  • variable importance estimation

as well as other functionality.

In this example, we predict smoke detection with a classification model in R, using the dataset from https://www.kaggle.com/datasets/deepcontractor/smoke-detection-dataset.

Library

The libraries used in this model are:

library(tidyverse)
library(caret)
library(rpart.plot)
library(corrplot)
library(ggcorrplot)

We use the tidyverse packages to modify and manipulate the dataset, and corrplot to find correlations among the variables.

Dataset

The dataset is structured like this:

Rows: 62,630
Columns: 16
$ UTC <dttm> 2022-06-09 00:08:51, 2022-06-09 00:08:52, 2022…
$ `Temperature[C]` <dbl> 20.000, 20.015, 20.029, 20.044, 20.059, 20.073,…
$ `Humidity[%]` <dbl> 57.36, 56.67, 55.96, 55.28, 54.69, 54.12, 53.61…
$ `TVOC[ppb]` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `eCO2[ppm]` <dbl> 400, 400, 400, 400, 400, 400, 400, 400, 400, 40…
$ `Raw H2` <dbl> 12306, 12345, 12374, 12390, 12403, 12419, 12432…
$ `Raw Ethanol` <dbl> 18520, 18651, 18764, 18849, 18921, 18998, 19058…
$ `Pressure[hPa]` <dbl> 939.735, 939.744, 939.738, 939.736, 939.744, 93…
$ PM1.0 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ PM2.5 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ NC0.5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ NC1.0 <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000…
$ NC2.5 <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000…
$ CNT <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
$ `Fire Alarm` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ index <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…

After checking that the dataset has no missing data, we can move on to data processing. The output above also shows an index column, which is a row counter. The dataset consists of the following variables:

1. UTC: Time when the experiment was performed
2. Temperature[C]: Temperature of the surroundings, measured in degrees Celsius
3. Humidity[%]: Air humidity during the experiment
4. TVOC[ppb]: Total Volatile Organic Compounds, measured in ppb (parts per billion)
5. eCO2[ppm]: CO2 equivalent concentration, measured in ppm (parts per million)
6. Raw H2: The amount of raw hydrogen [raw molecular hydrogen; not compensated (bias, temperature, etc.)] present in the surroundings
7. Raw Ethanol: The amount of raw ethanol present in the surroundings
8. Pressure[hPa]: Air pressure, measured in hPa
9. PM1.0: Particulate matter of diameter less than 1.0 micrometer
10. PM2.5: Particulate matter of diameter less than 2.5 micrometers
11. NC0.5: Concentration of particulate matter of diameter less than 0.5 micrometer
12. NC1.0: Concentration of particulate matter of diameter less than 1.0 micrometer
13. NC2.5: Concentration of particulate matter of diameter less than 2.5 micrometers
14. CNT: Sample count
15. Fire Alarm: Ground truth; 1 (positive) if a fire was present, otherwise 0 (not positive)
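
The missing-data check mentioned above is not shown in the article. A minimal sketch in base R could look like the following; `toy_raw` is a made-up stand-in for the real dataset, and its column names and values are illustrative only.

```r
# Toy stand-in for the real data; names and values are assumptions.
toy_raw <- data.frame(
  temp_c   = c(20.0, 20.1, NA, 20.3),
  humidity = c(57.4, 56.7, 55.9, 55.3),
  alarm    = c(0, 0, 1, 1)
)

# Count missing values per column; 0 means the column has no gaps.
na_per_column <- colSums(is.na(toy_raw))
print(na_per_column)

# TRUE only when the whole data frame is free of missing values.
has_no_missing <- sum(na_per_column) == 0
print(has_no_missing)
```

On the real data, the same two calls would be applied to `raw_data`.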

Data Processing

Process the dataset before modeling: select and rename the variables that are needed, and convert the target to a factor. CNT (the sample count) is left out, along with UTC, the NC columns, and index.

raw_data %>%
  # keep and rename the variables used for modeling
  select(
    temp_c = `Temperature[C]`,
    humidity = `Humidity[%]`,
    tvoc = `TVOC[ppb]`,
    co2 = `eCO2[ppm]`,
    h2 = `Raw H2`,
    ethanol = `Raw Ethanol`,
    pressure = `Pressure[hPa]`,
    pm1 = PM1.0,
    pm2_5 = PM2.5,
    fire_alarm = `Fire Alarm`
  ) %>%
  # encode the target as a factor with "yes" as the first level
  mutate(
    fire_alarm = factor(fire_alarm, levels = c(1, 0), labels = c("yes", "no"))
  ) %>%
  glimpse() -> df_data

Rows: 62,630
Columns: 10
$ temp_c <dbl> 20.000, 20.015, 20.029, 20.044, 20.059, 20.073, 20.08…
$ humidity <dbl> 57.36, 56.67, 55.96, 55.28, 54.69, 54.12, 53.61, 53.2…
$ tvoc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ co2 <dbl> 400, 400, 400, 400, 400, 400, 400, 400, 400, 400, 400…
$ h2 <dbl> 12306, 12345, 12374, 12390, 12403, 12419, 12432, 1243…
$ ethanol <dbl> 18520, 18651, 18764, 18849, 18921, 18998, 19058, 1911…
$ pressure <dbl> 939.735, 939.744, 939.738, 939.736, 939.744, 939.725,…
$ pm1 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ pm2_5 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ fire_alarm <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n…
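
The `factor()` call in the processing step decides both the labels and the order of the classes. A small base-R sketch with made-up labels (`raw_alarm` is hypothetical) shows the effect:

```r
# Raw labels as stored in the dataset: 1 = fire, 0 = no fire.
raw_alarm <- c(0, 1, 1, 0, 1)

# Listing level 1 first makes "yes" the first factor level.
alarm <- factor(raw_alarm, levels = c(1, 0), labels = c("yes", "no"))

print(levels(alarm))
print(table(alarm))
```

The order matters later on: caret's `confusionMatrix()` treats the first factor level as the positive class by default, which is why the outputs further down report `'Positive' Class : yes`.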

Perform a correlation plot

Based on the correlation matrix, we can draw two insights:

  1. There is no high correlation between the target feature and any other feature. The target has a small positive correlation with Humidity and Pressure, and a small negative correlation with TVOC and Raw Ethanol.
  2. There are high positive correlations between eCO2 and TVOC, eCO2 and PM1.0, Pressure and Humidity, Raw H2 and Raw Ethanol, and PM1.0 and PM2.5.
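
The code behind the correlation plot is not shown in the article. As a rough sketch of how such a matrix could be computed, the example below uses made-up data (`toy` and its columns are assumptions) and base R's `cor()`. The plotting call is optional and only runs if ggcorrplot is installed.

```r
set.seed(1)

# Made-up data: x2 is built from x1, so the two should correlate strongly.
n <- 200
toy <- data.frame(x1 = rnorm(n))
toy$x2 <- toy$x1 * 2 + rnorm(n, sd = 0.1)
toy$x3 <- rnorm(n)
toy$target <- factor(ifelse(toy$x3 > 0, "yes", "no"), levels = c("yes", "no"))

# cor() needs numeric input, so encode the target as 1 (yes) / 0 (no).
toy_numeric <- transform(toy, target = as.numeric(target == "yes"))
corr_mat <- cor(toy_numeric)
print(round(corr_mat, 2))

# Optional: build a plot object; call print(corr_plot) to display it.
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  corr_plot <- ggcorrplot::ggcorrplot(corr_mat, lab = TRUE)
}
```

For the real data, the same steps would be applied to `df_data` after converting `fire_alarm` to a numeric column.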

Splitting Data

Split the data into training and test datasets in an 80:20 proportion. Usually, the bigger the dataset you train on, the more accurate the model, but more training data also makes models take longer to train.

To split our data, we use createDataPartition() from the caret package. The function randomly samples a proportion of the indexes of the vector you pass it; for a factor such as our target, the sampling is done within each class, so the class balance is preserved. You can then use those indexes to subset the full dataset into training and testing datasets.

# set the random seed for reproducibility
set.seed(123)

# split the data: 80% of the row indexes go to training
train_index <- createDataPartition(df_data$fire_alarm, times = 1, p = 0.8, list = FALSE)

# train data
train_data <- df_data[train_index, ] %>% glimpse()

# test data
test_data <- df_data[-train_index, ] %>% glimpse()
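
To see what the split does to the class balance, here is a small sketch on made-up data (`toy` is an assumption, with 70% "yes" and 30% "no"). It uses the same `createDataPartition()` arguments as above.

```r
library(caret)

set.seed(123)

# Made-up imbalanced target: 70 "yes" rows and 30 "no" rows.
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("yes", "no"), times = c(70, 30)), levels = c("yes", "no"))
)

# Sample about 80% of the row indexes, stratified by the levels of y.
idx <- createDataPartition(toy$y, times = 1, p = 0.8, list = FALSE)

toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# The class mix should be about the same in both parts.
print(prop.table(table(toy_train$y)))
print(prop.table(table(toy_test$y)))
```

Because no row index appears in both parts, the test rows stay unseen during training.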

Develop Model

We don’t know beforehand which algorithms will perform well on this data, so we spot-check a range of methods, see which ones look good, and then double down on those.

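## Linear Algorithms:
1. Logistic Regression (LG)
2. Linear Discriminant Analysis (LDA)
3. Regularized Logistic Regression (GLMNET)

## Non-Linear Algorithms:
1. k-Nearest Neighbors (KNN)
2. Classification and Regression Trees (CART)
3. Naive Bayes (NB)
4. Support Vector Machines with Radial Basis Functions (SVM)

The code below also fits four more tree-based methods: Bagged CART (BAG), Random Forest (RF), Stochastic Gradient Boosting (GBM), and C5.0. No SVM fit appears in the code shown.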

We have a good amount of data, so we will use 10-fold cross-validation with 3 repeats, which is a good standard test harness configuration. This is a binary classification problem, and for simplicity we will use the Accuracy and Kappa metrics. Alternatively, we could have used the Area Under the ROC Curve (AUC) and looked at sensitivity and specificity to select the best algorithms.

# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3, verboseIter = TRUE)

metric <- "Accuracy"
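
To make the resampling setup concrete, the sketch below builds the folds for 10-fold cross-validation with 3 repeats on a made-up target (`y` is an assumption) and counts them. It also shows how a control object for the AUC alternative mentioned above might be configured; `control_roc` and `metric_roc` are hypothetical names, and this variant is not used in the article.

```r
library(caret)

set.seed(7)

# Made-up binary target, only used to illustrate how the folds are built.
y <- factor(rep(c("yes", "no"), times = c(60, 40)), levels = c("yes", "no"))

# 10 folds x 3 repeats gives 30 train/hold-out splits.
folds <- createMultiFolds(y, k = 10, times = 3)
print(length(folds))
print(head(names(folds), 3))

# Alternative control object for selecting models by area under the ROC curve.
# classProbs = TRUE is needed so that class probabilities are computed.
control_roc <- trainControl(
  method = "repeatedcv", number = 10, repeats = 3,
  classProbs = TRUE, summaryFunction = twoClassSummary
)
metric_roc <- "ROC"
```

The 30 splits correspond to the "Number of resamples: 30" line in the model comparison further down.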

Then, we build the models for each algorithm.

# Bagged CART
set.seed(7)
fit.treebag <- train(fire_alarm ~ ., data = train_data, method = "treebag",
                     metric = metric, trControl = trainControl)

# RF - Random Forest
set.seed(7)
fit.rf <- train(fire_alarm ~ ., data = train_data, method = "rf",
                metric = metric, trControl = trainControl)

# GBM - Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(fire_alarm ~ ., data = train_data, method = "gbm",
                 metric = metric, trControl = trainControl, verbose = FALSE)

# C5.0
set.seed(7)
fit.c50 <- train(fire_alarm ~ ., data = train_data, method = "C5.0",
                 metric = metric, trControl = trainControl)

# LG - Logistic Regression
set.seed(7)
fit.glm <- train(fire_alarm ~ ., data = train_data, method = "glm",
                 metric = metric, trControl = trainControl)

# LDA - Linear Discriminant Analysis
set.seed(7)
fit.lda <- train(fire_alarm ~ ., data = train_data, method = "lda",
                 metric = metric, trControl = trainControl)

# GLMNET - Regularized Logistic Regression
set.seed(7)
fit.glmnet <- train(fire_alarm ~ ., data = train_data, method = "glmnet",
                    metric = metric, trControl = trainControl)

# KNN - k-Nearest Neighbors
set.seed(7)
fit.knn <- train(fire_alarm ~ ., data = train_data, method = "knn",
                 metric = metric, trControl = trainControl)

# CART - Classification and Regression Trees
set.seed(7)
fit.cart <- train(fire_alarm ~ ., data = train_data, method = "rpart",
                  metric = metric, trControl = trainControl)

# NB - Naive Bayes, tuned over three values of fL
set.seed(7)
Grid <- expand.grid(usekernel = TRUE, adjust = 1, fL = c(0.2, 0.5, 0.8))
fit.nb <- train(fire_alarm ~ ., data = train_data, method = "nb",
                metric = metric, trControl = trainControl,
                tuneGrid = Grid)
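
The Naive Bayes step is the only one that passes its own tuning grid. The short sketch below repeats the same `expand.grid()` call to show what `train()` receives; the variable name `grid` is just for illustration.

```r
# Same call as in the Naive Bayes step: every combination of the listed values.
grid <- expand.grid(usekernel = TRUE, adjust = 1, fL = c(0.2, 0.5, 0.8))
print(grid)

# One row per candidate setting, so train() evaluates three candidates.
print(nrow(grid))
```

For caret's `nb` method, `fL` is the Laplace correction, `usekernel` switches on kernel density estimates, and `adjust` scales the kernel bandwidth. Only `fL` varies here, so the grid has three rows.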

After building the models, compare them to find the one with the best accuracy:

Call:
summary.resamples(object = ensembleResults)

Models: BAG, RF, GBM, C50, LG, KNN, NB, CART, GLMNET
Number of resamples: 30

Accuracy
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
BAG    0.9994013 0.9998004 0.9998004 0.9998337 1.0000000 1.0000000    0
RF     0.9996008 0.9998004 1.0000000 0.9999202 1.0000000 1.0000000    0
GBM    0.9986028 0.9996008 0.9998004 0.9996407 0.9998004 1.0000000    0
C50    0.9992014 0.9996009 0.9998004 0.9998071 1.0000000 1.0000000    0
LG     0.8467372 0.8629515 0.8861393 0.8797790 0.8933134 0.9030134    0
KNN    0.9990018 0.9996007 0.9998004 0.9996341 0.9998004 1.0000000    0
NB     0.9417282 0.9443613 0.9463074 0.9470978 0.9505089 0.9527041    0
CART   0.9590900 0.9654244 0.9809419 0.9754183 0.9829849 0.9896208    0
GLMNET 0.8848303 0.8931245 0.8957294 0.8954128 0.8985681 0.9026142    0

Kappa
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
BAG    0.9985319 0.9995105 0.9995106 0.9995922 1.0000000 1.0000000    0
RF     0.9990214 0.9995107 1.0000000 0.9998043 1.0000000 1.0000000    0
GBM    0.9965712 0.9990210 0.9995105 0.9991189 0.9995106 1.0000000    0
C50    0.9980418 0.9990214 0.9995107 0.9995269 1.0000000 1.0000000    0
LG     0.6691273 0.6941347 0.7051705 0.7105437 0.7312704 0.7512402    0
KNN    0.9975507 0.9990206 0.9995106 0.9991025 0.9995106 1.0000000    0
NB     0.8632215 0.8695786 0.8735418 0.8754793 0.8834493 0.8880163    0
CART   0.8997615 0.9156698 0.9523262 0.9397203 0.9575309 0.9747781    0
GLMNET 0.7052235 0.7262274 0.7346055 0.7332479 0.7403128 0.7496404    0
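
The call that produced the summary above is not shown in the article; from the output it appears to be caret's `resamples()` followed by `summary()`. The sketch below shows that pattern on made-up data with two quick models. The names `toy`, `ctrl`, `fit_cart`, `fit_lda`, and `toy_results` are assumptions, and the resampling is kept small so the example runs fast.

```r
library(caret)

set.seed(7)

# Made-up two-class data; the real models were trained on train_data.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"),
                levels = c("yes", "no"))

# Small resampling setup so the sketch runs quickly.
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

# Resetting the seed before each fit gives every model the same folds.
set.seed(7)
fit_cart <- train(y ~ ., data = toy, method = "rpart",
                  metric = "Accuracy", trControl = ctrl)
set.seed(7)
fit_lda <- train(y ~ ., data = toy, method = "lda",
                 metric = "Accuracy", trControl = ctrl)

# Collect the resampling results of all models under short labels.
toy_results <- resamples(list(CART = fit_cart, LDA = fit_lda))
print(summary(toy_results))
```

For the article's models, the list passed to `resamples()` would presumably hold the fitted objects under the labels shown in the output, for example `BAG = fit.treebag` and `RF = fit.rf`.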

Based on these runs, Random Forest has the highest mean accuracy (99.99%), followed by Bagged CART (BAG) (99.98%) and C5.0 (99.98%).

Finalize Model

The tree algorithms with the highest accuracy (Random Forest, Bagged CART, and C5.0) are selected for prediction, and each model's accuracy is then checked on the testing dataset.

# save the models
saveRDS(fit.c50, here::here("finalModel_c50.rds"))
saveRDS(fit.rf, here::here("finalModel_rf.rds"))
saveRDS(fit.treebag, here::here("finalModel_treebag.rds"))

# load the C5.0 model and predict on the test data
model_c50 <- readRDS(here::here("finalModel_c50.rds"))
predict_c50 <- predict(model_c50, test_data)
summary(predict_c50)

yes   no 
8950 3575

# Confusion Matrix
cf_c50 <- confusionMatrix(predict_c50, test_data$fire_alarm)

cf_c50

Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes 8949    1
       no     2 3573

Accuracy : 0.9998
95% CI : (0.9993, 1)
No Information Rate : 0.7147
P-Value [Acc > NIR] : <2e-16

Kappa : 0.9994

Mcnemar's Test P-Value : 1

Sensitivity : 0.9998
Specificity : 0.9997
Pos Pred Value : 0.9999
Neg Pred Value : 0.9994
Prevalence : 0.7147
Detection Rate : 0.7145
Detection Prevalence : 0.7146
Balanced Accuracy : 0.9997

'Positive' Class : yes
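
The headline statistics in this output follow directly from the four counts in the confusion matrix. As a worked check in base R, with the counts copied from the C5.0 matrix above (`cm` and the other variable names are chosen for illustration):

```r
# Counts from the C5.0 confusion matrix (rows = prediction, columns = reference).
cm <- matrix(c(8949, 2, 1, 3573), nrow = 2,
             dimnames = list(Prediction = c("yes", "no"),
                             Reference  = c("yes", "no")))

n <- sum(cm)

# Share of test rows predicted correctly.
accuracy <- sum(diag(cm)) / n

# Share of real fires that were detected, and of non-fires correctly rejected.
sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])
specificity <- cm["no", "no"] / sum(cm[, "no"])

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa_stat), 4))
```

Rounded to four decimals, these values agree with the Accuracy, Sensitivity, Specificity, and Kappa lines reported above.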

Load the Random Forest model and make predictions on the test data.

# Random Forest
model_rf <- readRDS(here::here("finalModel_rf.rds"))
print(model_rf)
Random Forest 

50105 samples
9 predictor
2 classes: 'yes', 'no'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 45095, 45094, 45094, 45094, 45095, 45094, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
  2     0.9999202  0.9998043
  5     0.9999202  0.9998043
  9     0.9998736  0.9996901

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

predict_rf <- predict(model_rf, test_data)
summary(predict_rf)

yes   no 
8952 3573

# Confusion Matrix
cf_rf <- confusionMatrix(predict_rf, test_data$fire_alarm)

cf_rf

Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes 8951    1
       no     0 3573

Accuracy : 0.9999
95% CI : (0.9996, 1)
No Information Rate : 0.7147
P-Value [Acc > NIR] : <2e-16

Kappa : 0.9998

Mcnemar's Test P-Value : 1

Sensitivity : 1.0000
Specificity : 0.9997
Pos Pred Value : 0.9999
Neg Pred Value : 1.0000
Prevalence : 0.7147
Detection Rate : 0.7147
Detection Prevalence : 0.7147
Balanced Accuracy : 0.9999

'Positive' Class : yes
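
Each model is saved with `saveRDS()` and loaded back with `readRDS()` before predicting. A minimal base-R sketch of that round trip, using a small linear model on the built-in `mtcars` data as a stand-in for a fitted caret model (all names here are assumptions):

```r
# Small base-R model as a stand-in for a fitted caret model.
toy_fit <- lm(mpg ~ wt, data = mtcars)

# Save to a temporary file; the article saves into the project via here::here().
path <- tempfile(fileext = ".rds")
saveRDS(toy_fit, path)

# Reload under a new name and check that predictions are unchanged.
toy_loaded  <- readRDS(path)
pred_before <- predict(toy_fit, mtcars)
pred_after  <- predict(toy_loaded, mtcars)
print(all.equal(pred_before, pred_after))
```

Unlike `save()` and `load()`, `readRDS()` returns the object, so it can be assigned to any name, as with `model_rf` above.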

Load the Bagged CART model and make predictions on the test data.

# Bagged CART
model_treebag <- readRDS(here::here("finalModel_treebag.rds"))

print(model_treebag)
Bagged CART 

50105 samples
9 predictor
2 classes: 'yes', 'no'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 45095, 45094, 45094, 45094, 45095, 45094, ...
Resampling results:

Accuracy Kappa
0.9998337 0.9995922

predict_treebag <- predict(model_treebag, test_data)
summary(predict_treebag)

# Confusion Matrix
cf_treebag <- confusionMatrix(predict_treebag, test_data$fire_alarm)

cf_treebag

Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes 8951    2
       no     0 3572

Accuracy : 0.9998
95% CI : (0.9994, 1)
No Information Rate : 0.7147
P-Value [Acc > NIR] : <2e-16

Kappa : 0.9996

Mcnemar's Test P-Value : 0.4795

Sensitivity : 1.0000
Specificity : 0.9994
Pos Pred Value : 0.9998
Neg Pred Value : 1.0000
Prevalence : 0.7147
Detection Rate : 0.7147
Detection Prevalence : 0.7148
Balanced Accuracy : 0.9997

'Positive' Class : yes
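
To compare the three finalists side by side, the test-set counts from the confusion matrices above can be put in one small table. This is only a sketch in base R; `test_results` is a name chosen for illustration.

```r
# Correct and total counts read off the three test-set confusion matrices.
test_results <- data.frame(
  model   = c("C5.0", "Random Forest", "Bagged CART"),
  correct = c(8949 + 3573, 8951 + 3573, 8951 + 3572),
  total   = 12525
)
test_results$errors   <- test_results$total - test_results$correct
test_results$accuracy <- test_results$correct / test_results$total

# Sort from most to least accurate.
test_results <- test_results[order(-test_results$accuracy), ]
print(test_results, row.names = FALSE)
```

All three models misclassify only a handful of the 12,525 test rows, with Random Forest making the fewest errors.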

Visit my GitHub repository for the details. Thank you for reading.
