Classification Model in R with Caret Package

Mar 23, 2023

caret, short for Classification And REgression Training, is an R package with functions that streamline the process of creating predictive models. The package contains tools for:

data splitting
pre-processing
feature selection
model tuning using resampling
variable importance estimation

as well as other functionality.

In this example, we predict smoke detection with a classification model in R, using the dataset from https://www.kaggle.com/datasets/deepcontractor/smoke-detection-dataset.

Library

The libraries used in this example are:

library(tidyverse)
library(caret)
library(rpart.plot)
library(corrplot)
library(ggcorrplot)

We use the tidyverse packages to modify and manipulate the dataset, and corrplot to find correlations among the variables.

Dataset

The dataset is structured as follows:

Rows: 62,630
Columns: 16
$ UTC              <dttm> 2022-06-09 00:08:51, 2022-06-09 00:08:52, 2022…
$ `Temperature[C]` <dbl> 20.000, 20.015, 20.029, 20.044, 20.059, 20.073,…
$ `Humidity[%]`    <dbl> 57.36, 56.67, 55.96, 55.28, 54.69, 54.12, 53.61…
$ `TVOC[ppb]`      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `eCO2[ppm]`      <dbl> 400, 400, 400, 400, 400, 400, 400, 400, 400, 40…
$ `Raw H2`         <dbl> 12306, 12345, 12374, 12390, 12403, 12419, 12432…
$ `Raw Ethanol`    <dbl> 18520, 18651, 18764, 18849, 18921, 18998, 19058…
$ `Pressure[hPa]`  <dbl> 939.735, 939.744, 939.738, 939.736, 939.744, 93…
$ PM1.0            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ PM2.5            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ NC0.5            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ NC1.0            <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000…
$ NC2.5            <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000…
$ CNT              <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
$ `Fire Alarm`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ index            <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…

Check that the dataset has no missing values before moving on to data processing. The dataset consists of the following variables:

1. UTC: Time when the experiment was performed
2. Temperature[C]: Temperature of surroundings, measured in Celsius
3. Humidity[%]: Air humidity during the experiment
4. TVOC[ppb]: Total Volatile Organic Compounds, measured in ppb (parts per billion)
5. eCO2[ppm]: CO2 equivalent concentration, measured in ppm (parts per million)
6. Raw H2: The amount of Raw Hydrogen [Raw Molecular Hydrogen; not compensated (Bias, Temperature, etc.)] present in surroundings
7. Raw Ethanol: The amount of Raw Ethanol present in the surroundings
8. Pressure[hPa]: Air pressure, Measured in hPa
9. PM1.0: Particulate matter of diameter less than 1.0 micrometer
10. PM2.5: Particulate matter of diameter less than 2.5 micrometer
11. NC0.5: Concentration of particulate matter of diameter less than 0.5 micrometer
12. NC1.0: Concentration of particulate matter of diameter less than 1.0 micrometer
13. NC2.5: Concentration of particulate matter of diameter less than 2.5 micrometer
14. CNT: Sample count
15. Fire Alarm: Ground truth for the alarm; 1 if a fire was present, 0 otherwise
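The loading step and the missing-value check are not shown in this article. A minimal sketch of how they might look in base R follows; it writes a tiny stand-in CSV with made-up values so that it runs without the Kaggle download, and the names csv_path, missing_per_column and no_missing are illustrative only.

```r
# Write a tiny stand-in CSV so the sketch runs without the real file.
# The three columns mimic the real header style; the values are made up.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Temperature[C],Humidity[%],Fire Alarm",
  "20.000,57.36,0",
  "20.015,,0",
  "20.029,55.96,1"
), csv_path)

# check.names = FALSE keeps names such as `Temperature[C]` unchanged
raw_data <- read.csv(csv_path, check.names = FALSE)

# Count missing values per column; all zeros would mean nothing is missing
missing_per_column <- colSums(is.na(raw_data))
print(missing_per_column)

# TRUE only if the whole table is free of NA values
no_missing <- !anyNA(raw_data)
print(no_missing)
```

With the real file, a result of all zeros from colSums(is.na(raw_data)) would confirm that no imputation is needed before modeling.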

Data Processing

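Before modeling, we process the dataset: we select and rename the variables we need and drop the rest, including CNT (the sample count) and UTC. The target, Fire Alarm, is converted to a factor with the labels "yes" and "no".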

raw_data %>%
  select(
         temp_c = `Temperature[C]`,
         humidity = `Humidity[%]`,
         tvoc = `TVOC[ppb]`,
         co2 = `eCO2[ppm]`,
         h2 = `Raw H2`,
         ethanol = `Raw Ethanol`,
         pressure = `Pressure[hPa]`,
         pm1 = PM1.0,
         pm2_5 = PM2.5,
         fire_alarm = `Fire Alarm`
         ) %>%
  mutate(
    fire_alarm = factor(fire_alarm, levels = c(1,0), labels = c("yes", "no"))
  ) %>%
  glimpse() -> df_data

Rows: 62,630
Columns: 10
$ temp_c     <dbl> 20.000, 20.015, 20.029, 20.044, 20.059, 20.073, 20.08…
$ humidity   <dbl> 57.36, 56.67, 55.96, 55.28, 54.69, 54.12, 53.61, 53.2…
$ tvoc       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ co2        <dbl> 400, 400, 400, 400, 400, 400, 400, 400, 400, 400, 400…
$ h2         <dbl> 12306, 12345, 12374, 12390, 12403, 12419, 12432, 1243…
$ ethanol    <dbl> 18520, 18651, 18764, 18849, 18921, 18998, 19058, 1911…
$ pressure   <dbl> 939.735, 939.744, 939.738, 939.736, 939.744, 939.725,…
$ pm1        <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ pm2_5      <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
$ fire_alarm <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n…
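The order of the factor levels in the mutate() step matters, because caret's confusionMatrix() treats the first level as the positive class by default. A small sketch with made-up labels (alarm_raw is a hypothetical stand-in for the Fire Alarm column) shows what the recoding does:

```r
# Raw 0/1 labels, as they appear in the `Fire Alarm` column (made-up sample)
alarm_raw <- c(0, 1, 1, 0)

# levels = c(1, 0) puts the alarm case first, so "yes" becomes the first level
alarm <- factor(alarm_raw, levels = c(1, 0), labels = c("yes", "no"))

print(levels(alarm))        # order of the levels
print(as.character(alarm))  # relabelled values
print(table(alarm))         # counts per class, in level order
```

Listing 1 before 0 is what makes "yes" the positive class in the confusion matrices later on.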

Perform a correlation plot

The correlation matrix gives the following insights:

The target feature is not highly correlated with any other feature. It has a small positive correlation with Humidity and Pressure, and a small negative correlation with TVOC and Raw Ethanol.
There are high positive correlations between eCO2 and both TVOC and PM1.0, between Pressure and Humidity, between Raw H2 and Raw Ethanol, and between PM1.0 and PM2.5.
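The plotting code is not included here. As a rough sketch of how such a correlation matrix could be computed and drawn, the example below uses toy data in place of df_data (the columns are simulated, and toy, toy_numeric and corr_mat are illustrative names). The factor target has to be recoded as a number first, since cor() only accepts numeric input.

```r
set.seed(1)

# Toy stand-in for df_data: two related numeric columns plus a factor target
n <- 200
toy <- data.frame(h2 = rnorm(n))
toy$ethanol <- toy$h2 + rnorm(n, sd = 0.3)
toy$fire_alarm <- factor(ifelse(toy$h2 + rnorm(n) > 0, "yes", "no"),
                         levels = c("yes", "no"))

# cor() needs numeric input, so recode the factor target as 1/0
toy_numeric <- transform(toy, fire_alarm = as.integer(fire_alarm == "yes"))

# Pearson correlation between every pair of columns
corr_mat <- cor(toy_numeric)
print(round(corr_mat, 2))

# Draw the lower triangle with coefficients, if ggcorrplot is installed
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  print(ggcorrplot::ggcorrplot(corr_mat, type = "lower", lab = TRUE))
}
```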

Splitting Data

We split the data into training and test datasets in an 80:20 proportion. Usually, the bigger the dataset you train on, the more accurate the model, but more training data also means longer training times.

To split our data, we use createDataPartition() from the caret package. The function randomly samples a proportion of the indexes of the vector you pass it; for a factor such as fire_alarm, it samples within each class, so the class balance is preserved. You can then use those indexes to subset the full dataset into training and test datasets.

# set the random seed so the split is reproducible
set.seed(123)

# split the data: row indexes for the 80% training portion
train_index <- createDataPartition(df_data$fire_alarm, times = 1, p = 0.8, list = FALSE)

# train data
train_data <- df_data[train_index, ] %>% glimpse()

# test data
test_data <- df_data[-train_index, ] %>% glimpse()
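One way to sanity-check a split like this is to compare the class shares in the two subsets. The sketch below does that on toy data with a 70/30 class imbalance; toy, idx, toy_train and toy_test are illustrative names, not objects from this article.

```r
library(caret)

set.seed(123)

# Toy outcome with a 70/30 class imbalance, standing in for fire_alarm
toy <- data.frame(
  x = rnorm(100),
  fire_alarm = factor(rep(c("yes", "no"), times = c(70, 30)),
                      levels = c("yes", "no"))
)

# Same call pattern as above: sample about 80% of the rows
idx <- createDataPartition(toy$fire_alarm, times = 1, p = 0.8, list = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Class shares should be close to 0.70 / 0.30 in both subsets
train_share <- prop.table(table(toy_train$fire_alarm))
test_share  <- prop.table(table(toy_test$fire_alarm))
print(train_share)
print(test_share)
```

If the shares in the two subsets differ noticeably, the test metrics may not be comparable with the cross-validation results.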

Develop Model

We don’t know beforehand which algorithms will perform well on this data. We have to spot-check a variety of methods, see which ones look good, and then double down on those.

## Linear Algorithms:
1. Logistic Regression (LG)
2. Linear Discriminant Analysis (LDA)
3. Regularized Logistic Regression (GLMNET)

## Non-Linear Algorithms:
1. k-Nearest Neighbors (KNN)
2. Classification and Regression Trees (CART)
3. Naive Bayes (NB)
4. Support Vector Machines with Radial Basis Functions (SVM)

## Ensemble Algorithms (also fitted in the code below):
1. Bagged CART (BAG)
2. Random Forest (RF)
3. Stochastic Gradient Boosting (GBM)
4. C5.0

We have a good amount of data so we will use 10-fold cross-validation with 3 repeats. This is a good standard test harness configuration. It is a binary classification problem. For simplicity, we will use Accuracy and Kappa metrics. We could have gone with the Area Under ROC Curve (AUC) and looked at the sensitivity and specificity to select the best algorithms.

# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3, verboseIter = TRUE)

metric <- "Accuracy"
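For comparison, here is a sketch of the AUC-based alternative mentioned above. It is not what this article runs: the data are simulated, the fold counts are reduced to keep it quick, and roc_control and fit_roc are illustrative names. In caret, selecting on ROC needs both classProbs = TRUE and summaryFunction = twoClassSummary in trainControl().

```r
library(caret)

set.seed(7)

# Toy two-class data standing in for train_data (the classes overlap)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$fire_alarm <- factor(ifelse(toy$x1 + rnorm(n) > 0, "yes", "no"),
                         levels = c("yes", "no"))

# classProbs = TRUE and twoClassSummary are both needed for metric = "ROC"
roc_control <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary)

# Logistic regression, evaluated by area under the ROC curve
fit_roc <- train(fire_alarm ~ ., data = toy, method = "glm",
                 metric = "ROC", trControl = roc_control)

# Resampled ROC (AUC), sensitivity and specificity
print(fit_roc$results)
```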

Then, we build the models for each algorithm.

# Bagged CART
set.seed(7)
fit.treebag <- train(fire_alarm~., data = train_data, method = "treebag", metric = metric,trControl = trainControl)

# RF
set.seed(7)
fit.rf <- train(fire_alarm~., data = train_data, method = "rf", metric = metric,trControl = trainControl)

# GBM - Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(fire_alarm~., data = train_data, method = "gbm",metric = metric,trControl = trainControl, verbose = FALSE)

# C5.0
set.seed(7)
fit.c50 <- train(fire_alarm~., data = train_data, method = "C5.0", metric = metric,trControl = trainControl)

# LG - Logistic Regression
set.seed(7)
fit.glm <- train(fire_alarm~., data = train_data, method="glm",
                 metric=metric,trControl=trainControl)

# LDA - Linear Discriminant Analysis
set.seed(7)
fit.lda <- train(fire_alarm~., data = train_data, method="lda",
                 metric=metric,trControl=trainControl)

# GLMNET - Regularized Logistic Regression
set.seed(7)
fit.glmnet <- train(fire_alarm~., data = train_data, method="glmnet",
                 metric=metric,trControl=trainControl)

# KNN - k-Nearest Neighbors 
set.seed(7)
fit.knn <- train(fire_alarm~., data = train_data, method="knn",
                 metric=metric,trControl=trainControl)

# CART - Classification and Regression Trees
set.seed(7)
fit.cart <- train(fire_alarm~., data = train_data, method="rpart",
                 metric=metric,trControl=trainControl)

# NB - Naive Bayes
set.seed(7)
# tuning grid: kernel density estimates with three Laplace correction (fL) values
Grid <- expand.grid(usekernel = TRUE, adjust = 1, fL = c(0.2, 0.5, 0.8))
fit.nb <- train(fire_alarm~., data = train_data, method="nb",
                 metric=metric,trControl=trainControl,
                tuneGrid=Grid)

After building the models, we compare them to find the one with the best accuracy.
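The call that builds ensembleResults is not shown in this article. A minimal sketch of how such a comparison is typically assembled with caret's resamples() follows, using toy data and two fast models so that it runs quickly. The names toy, ctrl, fit_cart, fit_lg and comparison are illustrative; the article's object was presumably built in the same way from the fitted models above, with the list names (BAG, RF, ...) becoming the row labels.

```r
library(caret)

set.seed(7)

# Toy two-class data standing in for train_data
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$fire_alarm <- factor(ifelse(toy$x1 + rnorm(n) > 0, "yes", "no"),
                         levels = c("yes", "no"))

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

# Re-seeding before each train() call gives every model the same folds,
# which makes the per-fold results directly comparable
set.seed(7)
fit_cart <- train(fire_alarm ~ ., data = toy, method = "rpart",
                  metric = "Accuracy", trControl = ctrl)
set.seed(7)
fit_lg <- train(fire_alarm ~ ., data = toy, method = "glm",
                metric = "Accuracy", trControl = ctrl)

# Collect the per-fold results; the list names become the model labels
comparison <- resamples(list(CART = fit_cart, LG = fit_lg))
print(summary(comparison))
```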

Call:
summary.resamples(object = ensembleResults)

Models: BAG, RF, GBM, C50, LG, KNN, NB, CART, GLMNET 
Number of resamples: 30 

Accuracy 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
BAG    0.9994013 0.9998004 0.9998004 0.9998337 1.0000000 1.0000000    0
RF     0.9996008 0.9998004 1.0000000 0.9999202 1.0000000 1.0000000    0
GBM    0.9986028 0.9996008 0.9998004 0.9996407 0.9998004 1.0000000    0
C50    0.9992014 0.9996009 0.9998004 0.9998071 1.0000000 1.0000000    0
LG     0.8467372 0.8629515 0.8861393 0.8797790 0.8933134 0.9030134    0
KNN    0.9990018 0.9996007 0.9998004 0.9996341 0.9998004 1.0000000    0
NB     0.9417282 0.9443613 0.9463074 0.9470978 0.9505089 0.9527041    0
CART   0.9590900 0.9654244 0.9809419 0.9754183 0.9829849 0.9896208    0
GLMNET 0.8848303 0.8931245 0.8957294 0.8954128 0.8985681 0.9026142    0

Kappa 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
BAG    0.9985319 0.9995105 0.9995106 0.9995922 1.0000000 1.0000000    0
RF     0.9990214 0.9995107 1.0000000 0.9998043 1.0000000 1.0000000    0
GBM    0.9965712 0.9990210 0.9995105 0.9991189 0.9995106 1.0000000    0
C50    0.9980418 0.9990214 0.9995107 0.9995269 1.0000000 1.0000000    0
LG     0.6691273 0.6941347 0.7051705 0.7105437 0.7312704 0.7512402    0
KNN    0.9975507 0.9990206 0.9995106 0.9991025 0.9995106 1.0000000    0
NB     0.8632215 0.8695786 0.8735418 0.8754793 0.8834493 0.8880163    0
CART   0.8997615 0.9156698 0.9523262 0.9397203 0.9575309 0.9747781    0
GLMNET 0.7052235 0.7262274 0.7346055 0.7332479 0.7403128 0.7496404    0

Based on the resampling results, Random Forest has the highest mean accuracy (99.99%), followed by Bagged CART (BAG) and C5.0 (both 99.98%).

Finalize Model

The tree-based algorithms with the highest accuracy (Random Forest, Bagged CART, and C5.0) are selected for prediction, and each model's accuracy is then checked against the test dataset.

# save the models
saveRDS(fit.c50, here::here("finalModel_c50.rds"))
saveRDS(fit.rf, here::here("finalModel_rf.rds"))
saveRDS(fit.treebag, here::here("finalModel_treebag.rds"))

# load the C5.0 model and predict on the test data
model_c50 <- readRDS(here::here("finalModel_c50.rds"))
predict_c50 <- predict(model_c50, test_data)
summary(predict_c50)

# Confusion Matrix
cf_c50 <- confusionMatrix(predict_c50, test_data$fire_alarm)

cf_c50

yes   no 
8950 3575 
Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes 8949    1
       no     2 3573
                                     
               Accuracy : 0.9998     
                 95% CI : (0.9993, 1)
    No Information Rate : 0.7147     
    P-Value [Acc > NIR] : <2e-16     
                                     
                  Kappa : 0.9994     
                                     
 Mcnemar's Test P-Value : 1          
                                     
            Sensitivity : 0.9998     
            Specificity : 0.9997     
         Pos Pred Value : 0.9999     
         Neg Pred Value : 0.9994     
             Prevalence : 0.7147     
         Detection Rate : 0.7145     
   Detection Prevalence : 0.7146     
      Balanced Accuracy : 0.9997     
                                     
       'Positive' Class : yes
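The statistics in this output can be reproduced by hand from the four cells of the C5.0 confusion matrix. The sketch below is plain arithmetic in base R; the variable names are illustrative.

```r
# Cells of the C5.0 confusion matrix (rows = prediction, columns = reference)
tp <- 8949  # predicted yes, actually yes
fp <- 1     # predicted yes, actually no
fn <- 2     # predicted no,  actually yes
tn <- 3573  # predicted no,  actually no
total <- tp + fp + fn + tn

accuracy    <- (tp + tn) / total
sensitivity <- tp / (tp + fn)   # share of real alarms that were caught
specificity <- tn / (tn + fp)   # share of non-alarms correctly rejected
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
expected    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Kappa = cohen_kappa,
              Sensitivity = sensitivity, Specificity = specificity,
              PPV = ppv, NPV = npv, Balanced = balanced), 4))
```

The rounded values match the report above: accuracy 0.9998, kappa 0.9994, sensitivity 0.9998, specificity 0.9997.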

Load the Random Forest model and make predictions on the test data.

# Random Forest
model_rf <- readRDS(here::here("finalModel_rf.rds"))
print(model_rf)

Random Forest 

50105 samples
    9 predictor
    2 classes: 'yes', 'no' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 45095, 45094, 45094, 45094, 45095, 45094, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.9999202  0.9998043
  5     0.9999202  0.9998043
  9     0.9998736  0.9996901

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

predict_rf <- predict(model_rf, test_data)
summary(predict_rf)

# Confusion Matrix
cf_rf <- confusionMatrix(predict_rf, test_data$fire_alarm)

cf_rf

yes   no 
8952 3573 
Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes 8951    1
       no     0 3573
                                     
               Accuracy : 0.9999     
                 95% CI : (0.9996, 1)
    No Information Rate : 0.7147     
    P-Value [Acc > NIR] : <2e-16     
                                     
                  Kappa : 0.9998     
                                     
 Mcnemar's Test P-Value : 1          
                                     
            Sensitivity : 1.0000     
            Specificity : 0.9997     
         Pos Pred Value : 0.9999     
         Neg Pred Value : 1.0000     
             Prevalence : 0.7147     
         Detection Rate : 0.7147     
   Detection Prevalence : 0.7147     
      Balanced Accuracy : 0.9999     
                                     
       'Positive' Class : yes

Load the Bagged CART model and make predictions on the test data.

# Bagged CART
model_treebag <- readRDS(here::here("finalModel_treebag.rds"))

print(model_treebag)

Bagged CART 

50105 samples
    9 predictor
    2 classes: 'yes', 'no' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 45095, 45094, 45094, 45094, 45095, 45094, ... 
Resampling results:

  Accuracy   Kappa    
  0.9998337  0.9995922

predict_treebag <- predict(model_treebag, test_data)
summary(predict_treebag)

# Confusion Matrix
cf_treebag <- confusionMatrix(predict_treebag, test_data$fire_alarm)

cf_treebag

Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes 8951    2
       no     0 3572
                                     
               Accuracy : 0.9998     
                 95% CI : (0.9994, 1)
    No Information Rate : 0.7147     
    P-Value [Acc > NIR] : <2e-16     
                                     
                  Kappa : 0.9996     
                                     
 Mcnemar's Test P-Value : 0.4795     
                                     
            Sensitivity : 1.0000     
            Specificity : 0.9994     
         Pos Pred Value : 0.9998     
         Neg Pred Value : 1.0000     
             Prevalence : 0.7147     
         Detection Rate : 0.7147     
   Detection Prevalence : 0.7148     
      Balanced Accuracy : 0.9997     
                                     
       'Positive' Class : yes
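To put the three test results side by side, the off-diagonal counts from the confusion matrices above can be collected into one small table. This is only a summary sketch; results and n_test are illustrative names, and the counts are copied from the outputs above.

```r
# Misclassified test rows, copied from the three confusion matrices
results <- data.frame(
  model     = c("C5.0", "Random Forest", "Bagged CART"),
  false_pos = c(1, 1, 2),  # predicted yes, actually no
  false_neg = c(2, 0, 0)   # predicted no, actually yes
)

n_test <- 12525  # rows in the test dataset

results$errors   <- results$false_pos + results$false_neg
results$accuracy <- 1 - results$errors / n_test

# Fewest errors first
print(results[order(results$errors), ])
```

On this test set, Random Forest makes the fewest errors (1 of 12,525 rows), followed by Bagged CART (2) and C5.0 (3).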

Visit my GitHub repository for the details. Thank you for reading.