Applied Machine Learning - Observing Data

This is the first entry in a series where I will be using R to replicate a graduate-level applied machine learning course that is taught purely in Python.

The goal of the first class period was to introduce students to some of the data manipulation packages in Python. As such, most of the code here will be pretty simple to replicate, since data manipulation is the bread-and-butter of R. However, there is a bit of interactivity in the final section that I am looking forward to working on. All of the code chunks are included, so feel free to take your time and understand what each piece is doing.

Preparing Data

For this lab, we used two datasets from the UCI Data Repository:
Forest Fires Data
Auto MPG Data
First we need to load in the packages we’re going to use. In Python we used NumPy and pandas for data handling, and Matplotlib for visualization. The R equivalents are tidyr, dplyr, and ggplot2, though getting familiar with more of the tidyverse library doesn’t hurt. gridExtra will let us arrange multiple plots on one page.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)
library(shiny)

After downloading the data and placing it in your current working directory, you can load it like so:

ff_data <- read_csv('forestfires.csv', show_col_types = F)
ff_data # Let's take a look at what we've loaded.
## # A tibble: 517 x 13
##        X     Y month day    FFMC   DMC    DC   ISI  temp    RH  wind  rain  area
##    <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     7     5 mar   fri    86.2  26.2  94.3   5.1   8.2    51   6.7   0       0
##  2     7     4 oct   tue    90.6  35.4 669.    6.7  18      33   0.9   0       0
##  3     7     4 oct   sat    90.6  43.7 687.    6.7  14.6    33   1.3   0       0
##  4     8     6 mar   fri    91.7  33.3  77.5   9     8.3    97   4     0.2     0
##  5     8     6 mar   sun    89.3  51.3 102.    9.6  11.4    99   1.8   0       0
##  6     8     6 aug   sun    92.3  85.3 488    14.7  22.2    29   5.4   0       0
##  7     8     6 aug   mon    92.3  88.9 496.    8.5  24.1    27   3.1   0       0
##  8     8     6 aug   mon    91.5 145.  608.   10.7   8      86   2.2   0       0
##  9     8     6 sep   tue    91   130.  693.    7    13.1    63   5.4   0       0
## 10     7     5 sep   sat    92.5  88   699.    7.1  22.8    40   4     0       0
## # ... with 507 more rows

In Python, we completed this first task using ‘np.loadtxt()’, which intentionally fails on the non-numeric month and day columns, requiring students to convert them to numeric values. For the sake of being thorough, one way (of many) to do this is in the chunk below.
Due to the length of the data, I’ll only print the last 20 observations when displaying dataframes.

ff_data_num <- as.data.frame(ff_data %>%
  mutate(month = case_when(month =='jan' ~ 1,
                           month =='feb' ~ 2,
                           month =='mar' ~ 3,
                           month =='apr' ~ 4,
                           month =='may' ~ 5,
                           month =='jun' ~ 6,
                           month =='jul' ~ 7,
                           month =='aug' ~ 8,
                           month =='sep' ~ 9,
                           month =='oct' ~ 10,
                           month =='nov' ~ 11,
                           month =='dec' ~ 12),
         day = case_when(day =='mon' ~ 1,
                         day =='tue' ~ 2,
                         day =='wed' ~ 3,
                         day =='thu' ~ 4,
                         day =='fri' ~ 5,
                         day =='sat' ~ 6,
                         day =='sun' ~ 7)))
tail(ff_data_num, 20) # Once again, let's see what we did.
##     X Y month day FFMC   DMC    DC  ISI temp RH wind rain  area
## 498 3 4     8   2 96.1 181.1 671.2 14.3 32.3 27  2.2  0.0 14.68
## 499 6 5     8   2 96.1 181.1 671.2 14.3 33.3 26  2.7  0.0 40.54
## 500 7 5     8   2 96.1 181.1 671.2 14.3 27.3 63  4.9  6.4 10.82
## 501 8 6     8   2 96.1 181.1 671.2 14.3 21.6 65  4.9  0.8  0.00
## 502 7 5     8   2 96.1 181.1 671.2 14.3 21.6 65  4.9  0.8  0.00
## 503 4 4     8   2 96.1 181.1 671.2 14.3 20.7 69  4.9  0.4  0.00
## 504 2 4     8   3 94.5 139.4 689.1 20.0 29.2 30  4.9  0.0  1.95
## 505 4 3     8   3 94.5 139.4 689.1 20.0 28.9 29  4.9  0.0 49.59
## 506 1 2     8   4 91.0 163.2 744.4 10.1 26.7 35  1.8  0.0  5.80
## 507 1 2     8   5 91.0 166.9 752.6  7.1 18.5 73  8.5  0.0  0.00
## 508 2 4     8   5 91.0 166.9 752.6  7.1 25.9 41  3.6  0.0  0.00
## 509 1 2     8   5 91.0 166.9 752.6  7.1 25.9 41  3.6  0.0  0.00
## 510 5 4     8   5 91.0 166.9 752.6  7.1 21.1 71  7.6  1.4  2.17
## 511 6 5     8   5 91.0 166.9 752.6  7.1 18.2 62  5.4  0.0  0.43
## 512 8 6     8   7 81.6  56.7 665.6  1.9 27.8 35  2.7  0.0  0.00
## 513 4 3     8   7 81.6  56.7 665.6  1.9 27.8 32  2.7  0.0  6.44
## 514 2 4     8   7 81.6  56.7 665.6  1.9 21.9 71  5.8  0.0 54.29
## 515 7 4     8   7 81.6  56.7 665.6  1.9 21.2 70  6.7  0.0 11.16
## 516 1 4     8   6 94.4 146.0 614.7 11.3 25.6 42  4.0  0.0  0.00
## 517 6 3    11   2 79.5   3.0 106.7  1.1 11.8 31  4.5  0.0  0.00

The case_when function evaluates a sequence of logical statements and returns the value paired with the first one that is TRUE. In that way it’s very similar to chained if-else statements, though I prefer case_when in these situations.
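As an aside, the same recode can be written much more compactly with base R’s match(), which returns each value’s position in a lookup vector. A sketch on a few sample values rather than the full data:

```r
# Positions in these vectors are exactly the numeric codes we want.
month_levels <- c('jan', 'feb', 'mar', 'apr', 'may', 'jun',
                  'jul', 'aug', 'sep', 'oct', 'nov', 'dec')

months <- c('mar', 'oct', 'aug')   # sample month abbreviations
match(months, month_levels)        # 3 10 8
```

The full recode would then be `mutate(month = match(month, month_levels))`, and likewise for day with a vector of day abbreviations.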

Next we’ll use a few functions to learn some details about our data.

head(ff_data_num, 5) # returns first 5 rows
##   X Y month day FFMC  DMC    DC ISI temp RH wind rain area
## 1 7 5     3   5 86.2 26.2  94.3 5.1  8.2 51  6.7  0.0    0
## 2 7 4    10   2 90.6 35.4 669.1 6.7 18.0 33  0.9  0.0    0
## 3 7 4    10   6 90.6 43.7 686.9 6.7 14.6 33  1.3  0.0    0
## 4 8 6     3   5 91.7 33.3  77.5 9.0  8.3 97  4.0  0.2    0
## 5 8 6     3   7 89.3 51.3 102.2 9.6 11.4 99  1.8  0.0    0
colnames(ff_data_num) # returns the column names
##  [1] "X"     "Y"     "month" "day"   "FFMC"  "DMC"   "DC"    "ISI"   "temp" 
## [10] "RH"    "wind"  "rain"  "area"
dim(ff_data_num) # returns the number of rows and columns in the data
## [1] 517  13
length(which(is.na(ff_data_num))) # returns number of NA values in the data
## [1] 0

Slicing and indexing dataframes is a useful way to access pieces of your data. This can be done both within and outside of the tidyverse relatively easily.

tail(ff_data_num['area'], 20)
##      area
## 498 14.68
## 499 40.54
## 500 10.82
## 501  0.00
## 502  0.00
## 503  0.00
## 504  1.95
## 505 49.59
## 506  5.80
## 507  0.00
## 508  0.00
## 509  0.00
## 510  2.17
## 511  0.43
## 512  0.00
## 513  6.44
## 514 54.29
## 515 11.16
## 516  0.00
## 517  0.00
# these indexing forms following the dataframe name produce the same
# one-column data frame: ['area'], [13], [-1:-12]
# (note: adding a comma, as in [, 'area'], drops the result to a plain vector)
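As a quick sanity check, the base and dplyr forms really do return the same column. Demonstrated here on a hypothetical toy frame standing in for the forest fires data:

```r
library(dplyr)

# A toy frame standing in for ff_data_num.
toy <- data.frame(X = 1:3, area = c(0, 2.1, 14.7))

# Single-bracket indexing without a comma keeps the data-frame shape,
# matching select(); with a comma it would drop to a plain vector.
identical(toy['area']$area, (toy %>% select(area))$area)  # TRUE
```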

tail(ff_data_num %>%
  select(area), 20)
##      area
## 498 14.68
## 499 40.54
## 500 10.82
## 501  0.00
## 502  0.00
## 503  0.00
## 504  1.95
## 505 49.59
## 506  5.80
## 507  0.00
## 508  0.00
## 509  0.00
## 510  2.17
## 511  0.43
## 512  0.00
## 513  6.44
## 514 54.29
## 515 11.16
## 516  0.00
## 517  0.00

Unlike pandas’ describe(), R’s summary function doesn’t include the count or standard deviation for numeric columns.

summary(ff_data_num)
##        X               Y           month             day             FFMC      
##  Min.   :1.000   Min.   :2.0   Min.   : 1.000   Min.   :1.000   Min.   :18.70  
##  1st Qu.:3.000   1st Qu.:4.0   1st Qu.: 7.000   1st Qu.:2.000   1st Qu.:90.20  
##  Median :4.000   Median :4.0   Median : 8.000   Median :5.000   Median :91.60  
##  Mean   :4.669   Mean   :4.3   Mean   : 7.476   Mean   :4.259   Mean   :90.64  
##  3rd Qu.:7.000   3rd Qu.:5.0   3rd Qu.: 9.000   3rd Qu.:6.000   3rd Qu.:92.90  
##  Max.   :9.000   Max.   :9.0   Max.   :12.000   Max.   :7.000   Max.   :96.20  
##       DMC              DC             ISI              temp      
##  Min.   :  1.1   Min.   :  7.9   Min.   : 0.000   Min.   : 2.20  
##  1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500   1st Qu.:15.50  
##  Median :108.3   Median :664.2   Median : 8.400   Median :19.30  
##  Mean   :110.9   Mean   :547.9   Mean   : 9.022   Mean   :18.89  
##  3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800   3rd Qu.:22.80  
##  Max.   :291.3   Max.   :860.6   Max.   :56.100   Max.   :33.30  
##        RH              wind            rain              area        
##  Min.   : 15.00   Min.   :0.400   Min.   :0.00000   Min.   :   0.00  
##  1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000   1st Qu.:   0.00  
##  Median : 42.00   Median :4.000   Median :0.00000   Median :   0.52  
##  Mean   : 44.29   Mean   :4.018   Mean   :0.02166   Mean   :  12.85  
##  3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000   3rd Qu.:   6.57  
##  Max.   :100.00   Max.   :9.400   Max.   :6.40000   Max.   :1090.84
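If you do want the count and standard deviation, a pandas-style describe can be assembled with dplyr’s summarise() and across(). A sketch, shown on a toy frame for brevity (the same call applies to ff_data_num):

```r
library(dplyr)

# A small stand-in for ff_data_num.
toy <- data.frame(temp = c(8.2, 18.0, 14.6), wind = c(6.7, 0.9, 1.3))

# One column per (variable, statistic) pair, named like temp_mean, wind_sd.
describe <- toy %>%
  summarise(across(everything(),
                   list(count = ~ sum(!is.na(.x)),
                        mean = mean, sd = sd,
                        min = min, max = max)))
describe
```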

Visualizing Data

The grammar of graphics package known as ggplot2 is the primary graphical visualization package in R. We’ll flex its muscles just a bit to demonstrate a simple line chart and histogram. Notice there are a large number of forest fires with an area of 0. This is the reason I’ve opted to display the last twenty observations instead of the first twenty. This is also something to keep in mind for the Shiny app at the end.

ggplot(ff_data_num) +
  # color is set outside aes() so it is applied directly rather than
  # mapped as a legend entry labeled 'coral2'
  geom_line(aes(seq_along(area), area), color = 'coral2') +
  labs(x = 'Row Number') +
  theme_classic()

ggplot(ff_data_num, mapping = aes(area)) +
  geom_histogram(binwidth = 100, fill = 'coral2') +
  theme_classic()

Putting it all together

Splitting data into features and targets

In later posts, we’ll want to predict a given feature, which we refer to as our target. The remaining features will act as our input data for predicting the target. Thus it can be useful to split our input features and target into two separate variables.

t <- ff_data_num %>% select(area)
head(t)
##   area
## 1    0
## 2    0
## 3    0
## 4    0
## 5    0
## 6    0
X <- ff_data_num %>% select(-area)
head(X)
##   X Y month day FFMC  DMC    DC  ISI temp RH wind rain
## 1 7 5     3   5 86.2 26.2  94.3  5.1  8.2 51  6.7  0.0
## 2 7 4    10   2 90.6 35.4 669.1  6.7 18.0 33  0.9  0.0
## 3 7 4    10   6 90.6 43.7 686.9  6.7 14.6 33  1.3  0.0
## 4 8 6     3   5 91.7 33.3  77.5  9.0  8.3 97  4.0  0.2
## 5 8 6     3   7 89.3 51.3 102.2  9.6 11.4 99  1.8  0.0
## 6 8 6     8   7 92.3 85.3 488.0 14.7 22.2 29  5.4  0.0

Generating subplots in R is just a bit clunkier than it is in Python, as you need to pass each plot as a graphical object into grid.arrange. Because of that, I made each graph individually rather than in a loop.

p1 <- qplot(1:length(X), X, data = X, geom = "line")
p2 <- qplot(1:length(Y), Y, data = X, geom = "line")
p3 <- qplot(1:length(month), month, data = X, geom = "line")
p4 <- qplot(1:length(day), day, data = X, geom = "line")
p5 <- qplot(1:length(FFMC), FFMC, data = X, geom = "line")
p6 <- qplot(1:length(DMC), DMC, data = X, geom = "line")
p7 <- qplot(1:length(DC), DC, data = X, geom = "line")
p8 <- qplot(1:length(ISI), ISI, data = X, geom = "line")
p9 <- qplot(1:length(temp), temp, data = X, geom = "line")
p10 <- qplot(1:length(RH), RH, data = X, geom = "line")
p11 <- qplot(1:length(wind), wind, data = X, geom = "line")
p12 <- qplot(1:length(rain), rain, data = X, geom = "line")

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, nrow=3, ncol=4)
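For reference, the loop version is possible: build the plots inside lapply() and hand the whole list to grid.arrange() through its grobs argument. A sketch, using a hypothetical two-column toy frame in place of X:

```r
library(ggplot2)
library(gridExtra)

# A toy stand-in for the feature frame X (two columns instead of twelve).
X_toy <- data.frame(temp = c(8.2, 18.0, 14.6), wind = c(6.7, 0.9, 1.3))

# One line plot per column, built in a loop instead of by hand.
# The .data pronoun lets us refer to a column by its name as a string.
plots <- lapply(names(X_toy), function(col) {
  ggplot(X_toy, aes(x = seq_along(.data[[col]]), y = .data[[col]])) +
    geom_line() +
    labs(x = 'Row Number', y = col)
})

# grid.arrange() accepts the whole list through its grobs argument.
grid.arrange(grobs = plots, nrow = 1)
```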

Plotting the data with lines doesn’t always make sense, so creating a scatterplot will often make it easier to identify relationships between variables.

p1 <- qplot(x = X$X, y = t$area, geom = "point")
p2 <- qplot(x = X$Y, y = t$area, geom = "point")
p3 <- qplot(x = X$month, y = t$area, geom = "point")
p4 <- qplot(x = X$day, y =  t$area, geom = "point")
p5 <- qplot(x = X$FFMC, y =  t$area, geom = "point")
p6 <- qplot(x = X$DMC,  y =  t$area, geom = "point")
p7 <- qplot(x = X$DC, y =  t$area, geom = "point")
p8 <- qplot(x = X$ISI, y =  t$area, geom = "point")
p9 <- qplot(x = X$temp, y =  t$area, geom = "point")
p10 <- qplot(x = X$RH, y =  t$area, geom = "point")
p11 <- qplot(x = X$wind, y =  t$area, geom = "point")
p12 <- qplot(x = X$rain, y =  t$area, geom = "point")

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, nrow=3, ncol=4)

Remember the histogram from earlier? The skew in the data can be a massive issue for further analysis or ML model application. To tame it, we’ll take the log, which compresses the long right tail and spreads out the values clustered near zero. The second histogram demonstrates the new distribution. Notice we added 1 to area, since you can’t take the log of 0.

ap1 <- ggplot(t, aes(x = area)) + geom_histogram(binwidth = 100)
t$logarea <- log(t$area + 1)
ap2 <- ggplot(t, aes(x = logarea)) + geom_histogram(binwidth = 1)
grid.arrange(ap1, ap2, nrow = 1)

We’ll then create the scatter plots again to gain a better understanding of the correlation between variables.

p1 <- qplot(x = X$X, y =  t$logarea, geom = "point")
p2 <- qplot(x = X$Y, y =  t$logarea, geom = "point")
p3 <- qplot(x = X$month, y =  t$logarea, geom = "point")
p4 <- qplot(x = X$day, y =  t$logarea, geom = "point")
p5 <- qplot(x = X$FFMC, y =  t$logarea, geom = "point")
p6 <- qplot(x = X$DMC,  y =  t$logarea, geom = "point")
p7 <- qplot(x = X$DC, y =  t$logarea, geom = "point")
p8 <- qplot(x = X$ISI, y =  t$logarea, geom = "point")
p9 <- qplot(x = X$temp, y =  t$logarea, geom = "point")
p10 <- qplot(x = X$RH, y =  t$logarea, geom = "point")
p11 <- qplot(x = X$wind, y =  t$logarea, geom = "point")
p12 <- qplot(x = X$rain, y =  t$logarea, geom = "point")

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, nrow=3, ncol=4)
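A more idiomatic alternative to twelve hand-built panels is to pivot the features into long format with tidyr and let facet_wrap() lay out the grid. A sketch on a hypothetical toy frame with two features plus the target:

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# A toy stand-in with two features plus the target.
toy <- data.frame(area = c(0, 2.1, 14.7),
                  temp = c(8.2, 18.0, 14.6),
                  wind = c(6.7, 0.9, 1.3))

# One row per (feature, value) pair, keeping area alongside each.
long <- toy %>%
  pivot_longer(-area, names_to = 'feature', values_to = 'value')

# One panel per feature, each with its own x scale.
ggplot(long, aes(value, log(area + 1))) +
  geom_point() +
  facet_wrap(~ feature, scales = 'free_x')
```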

Practice Using the Auto MPG dataset

Here we’ll be performing some of the same operations done above. Afterwards, to emulate Streamlit’s ability to build interactive tools / dashboards, we’ll create our first R Shiny app! (My first at least).

columns <- c('mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
             'acceleration', 'model year', 'origin', 'car name')
auto_mpg <- read_table("auto-mpg.data-original", 
    col_names = columns)
## 
## -- Column specification --------------------------------------------------------
## cols(
##   mpg = col_double(),
##   cylinders = col_double(),
##   displacement = col_double(),
##   horsepower = col_double(),
##   weight = col_double(),
##   acceleration = col_double(),
##   `model year` = col_double(),
##   origin = col_double(),
##   `car name` = col_character()
## )
head(auto_mpg)
## # A tibble: 6 x 9
##     mpg cylinders displacement horsepower weight acceleration `model year`
##   <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl>        <dbl>
## 1    18         8          307        130   3504         12             70
## 2    15         8          350        165   3693         11.5           70
## 3    18         8          318        150   3436         11             70
## 4    16         8          304        150   3433         12             70
## 5    17         8          302        140   3449         10.5           70
## 6    15         8          429        198   4341         10             70
## # ... with 2 more variables: origin <dbl>, `car name` <chr>
length(which(is.na(auto_mpg)))
## [1] 14
auto_mpg <- drop_na(auto_mpg)
dim(auto_mpg)
## [1] 392   9
length(which(is.na(auto_mpg)))
## [1] 0
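Before dropping the rows, it can be worth seeing which columns actually hold the 14 missing values; colSums() over is.na() gives a per-column count. Illustrated on a hypothetical toy frame, though the same call applies to auto_mpg directly:

```r
# Count NAs per column; here the toy frame has one NA in each column.
toy <- data.frame(mpg = c(18, NA, 15), horsepower = c(130, 165, NA))
na_counts <- colSums(is.na(toy))
na_counts
```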

In the following app, we’ll use sliders and selection tools to filter data for description and visualization. The code to create it is below, with a few comments to serve as guidance!

In addition, here’s R Shiny’s very helpful documentation on deploying an app to the web, along with a blog post on embedding Shiny apps into a blogdown site. Just load the packages you need within the app before you deploy it!

# Define UI for application.
ui <- fluidPage(

    # Application title
    titlePanel("Forest Fires Data"),

    # Sidebar with a slider and select inputs.
    sidebarLayout(
        sidebarPanel(
            helpText('Forest fire data filtered by burned area.'),
            selectInput("sum", "Show / Hide Data Summary",
                        c(Show = "show", Hide = "notshow")),
            sliderInput("range",
                        "Choose a minimum and maximum area:",
                        min = min(ff_data_num$area),
                        max = max(ff_data_num$area),
                        value = c(0, 30), sep = '')
        ),

        # Show table outputs and plot output.
        mainPanel(
            conditionalPanel(
                condition = "input.sum == 'show'",
                tableOutput('Summary')),
            tableOutput("DataTable"),
            plotOutput("Plots")
        )
    )
)

# Define server logic required to create desired outputs
server <- function(input, output, session) {

    output$Summary <- renderTable({
        # Build a data frame whose rows are the six summary() statistics.
        sumdt <- as.data.frame(do.call(cbind, lapply(ff_data_num, summary)))
        sumdt[4, ] <- round(sumdt[4, ], digits = 4)
        sumdt <- sumdt %>%
            add_column('Metric' = c('Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max'))
        sumdt[, c(14, 1:13)] })
    
    output$DataTable <- renderTable({
        dt <- ff_data_num[ff_data_num$area >= input$range[1] & ff_data_num$area <= input$range[2],]
        dt })
    
    output$Plots <- renderPlot({
      dt <- ff_data_num[ff_data_num$area >= input$range[1] & ff_data_num$area <= input$range[2],]
      plot1 <- ggplot(data = dt) +
        geom_point(aes(X, Y))
      plot2 <- ggplot(data = dt) +
        geom_histogram(aes(area))
      plot3 <- ggplot(data = dt) +
        geom_point(aes(temp, area))
      plot4 <- ggplot(data = dt) +
        geom_point(aes(wind, area))
      grid.arrange(plot1, plot2, plot3, plot4, nrow = 2, ncol = 2,
                   top = textGrob(paste("The number of filtered data samples:", dim(dt)[1])))
    }) }

# Run the application 
shinyApp(ui = ui, server = server)
Samuel Louissaint
Analyst, Econ Enthusiast, M.S. in Data Science