Chapter 8 Working with Real-World Data

8.1 Handling Large Datasets

8.1.1 Challenges of Large Datasets

When working with data in R, one can encounter large data sets that are challenging to work with. These are some of the challenges:

  • Data I/O takes a long time: transferring large files to or from disk or across a network is slow, which in turn slows down crucial processes such as network operations and data sharing.
  • Memory limits: on 32-bit systems R can only address roughly 2 to 4 GB of memory, and a single vector cannot hold more than about 2 billion elements, so a data set larger than the available memory (for example a 5 GB file on a 4 GB machine) is a challenge to work with.
  • Indexing data sets with an extremely large number of rows and columns becomes more difficult.
  • Several algorithms and pre-trained models process large inputs more slowly.
  • Large data sets put pressure on memory management. R stores data entirely in memory, so an oversized data set can slow the session down or, in extreme cases, crash it.

As a data analyst or statistician, large data sets are inevitable; researchers have therefore worked on the issue and come up with the following solutions:

  • Optimize memory usage through data type conversion and row-wise processing.
  • Process large data sets in batches or chunks.
  • Use memory-efficient objects and programming tricks such as nested functions, lazy loading (loading data into memory only when it is needed) and garbage collection, where objects that are no longer useful are disposed of.
  • Partition and stream the data, loading only small pieces into memory at any point in time.
  • Use specialized packages for large-scale analysis, such as data.table.
  • Apply data sampling: instead of processing the whole data set at once, take a random, manageable sample and process that (a short sketch follows this list).
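
Below is a minimal sketch of two of these ideas, partial loading and sampling, assuming a hypothetical file named large_file.csv:

# Load the data.table package
library(data.table)

# Load only the first 100,000 rows while prototyping the analysis
prototype <- fread("large_file.csv", nrows = 100000)

# Work on a random 1% sample of the loaded rows instead of all of them
set.seed(42)
sample_dt <- prototype[sample(.N, size = round(0.01 * .N))]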

8.1.2 Efficient Data Handling Techniques

Let's dive into how to work with large data sets in R by applying popular methods from data science.

8.1.2.1 Using data.table Package

The data.table package delivers an improved version of the data.frame structure. The data structure from this package (also called data.table) is high performance and memory efficient, making it a better fit for large data sets than a plain data.frame.

Let's create a simple data.table from a small random data set.

# Load the library
library(data.table)

# Create the data.table 
student_dtb <- data.table(stuid = c(2, 5, 3, 4, 6, 7, 4, 2, 0),# student id
                          age = c(23, 45, 67, 23, 41, 43, 54, 67, 89)
                          )

student_dtb
##    stuid   age
##    <num> <num>
## 1:     2    23
## 2:     5    45
## 3:     3    67
## 4:     4    23
## 5:     6    41
## 6:     7    43
## 7:     4    54
## 8:     2    67
## 9:     0    89

A data.frame is created with the same arguments, just by calling data.frame() instead of data.table(). Here we keep the second object as a data.table (named student_df for comparison) so that the data.table operations used later in this section also work on it.

# Create a second data.table (a data.frame would use data.frame() here)
student_df <- data.table(stuid = c(2, 5, 3, 4, 6, 7, 4, 2, 0),# student id
                          age = c(23, 45, 67, 23, 41, 43, 54, 67, 89)
                          )

student_df
##    stuid   age
##    <num> <num>
## 1:     2    23
## 2:     5    45
## 3:     3    67
## 4:     4    23
## 5:     6    41
## 6:     7    43
## 7:     4    54
## 8:     2    67
## 9:     0    89

They look almost identical when printed, but they differ in how they operate. Here are some of the differences:

  • Speed: data.table operations are much faster (often cited as around 20 times) than the equivalent data.frame operations.
  • Scale: data.table is designed for complex data structures and big data; data.frame suits smaller tables and matrices and is friendlier to a beginner.
  • Features: data.table has built-in features such as rolling joins and overlapping range joins, which data.frame lacks.
  • Code: data.table syntax is concise; data.frame usually needs more code to get the same thing done.
  • Conversion: setDF(dt) converts a data.table dt to a data.frame, and setDT(df) converts a data.frame df to a data.table (see the example below).
  • Creation: data.table() versus data.frame().
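
As a quick illustration of the conversion functions mentioned above, using the student_dtb object created earlier:

# Convert the data.table to a data.frame in place
setDF(student_dtb)
class(student_dtb)   # "data.frame"

# Convert it back to a data.table in place
setDT(student_dtb)
class(student_dtb)   # "data.table" "data.frame"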

There is a variety of data.table operations that can be applied to large data sets. Let's explore a few of them.

  • Filtering rows

Rows in a data.table can be filtered using conditions. For instance, let's filter the data above to find the students above 40 years of age.

# Get the ones above 40 years
dtb1 <- student_df[age > 40]
print(dtb1)
##    stuid   age
##    <num> <num>
## 1:     5    45
## 2:     3    67
## 3:     6    41
## 4:     7    43
## 5:     4    54
## 6:     2    67
## 7:     0    89
  • Selecting Specific Columns

You can select specific columns in a data.table. Let's select the age column.

# Select the age column 
age_dt <- student_df[, .(age)]
print(age_dt)
##      age
##    <num>
## 1:    23
## 2:    45
## 3:    67
## 4:    23
## 5:    41
## 6:    43
## 7:    54
## 8:    67
## 9:    89
  • Aggregating data

Aggregation can be done with the help of functions like mean() and sum().

# Create a sample data.table
dt <- data.table(
  ID = 1:10,
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Hannah", "Isaac", "Jack"),
  Age = c(25, 30, 35, 40, 28, 32, 30, 29, 33, 36),
  Score = c(85, 90, 88, 92, 76, 95, 89, 78, 82, 91)
)

# Print the data.table
dt
##        ID    Name   Age Score
##     <int>  <char> <num> <num>
##  1:     1   Alice    25    85
##  2:     2     Bob    30    90
##  3:     3 Charlie    35    88
##  4:     4   David    40    92
##  5:     5     Eve    28    76
##  6:     6   Frank    32    95
##  7:     7   Grace    30    89
##  8:     8  Hannah    29    78
##  9:     9   Isaac    33    82
## 10:    10    Jack    36    91
# Calculate the average score by Age
aggregated_dt <- dt[, .(Average_Score = mean(Score)), by = Age]
print(aggregated_dt)
##      Age Average_Score
##    <num>         <num>
## 1:    25          85.0
## 2:    30          89.5
## 3:    35          88.0
## 4:    40          92.0
## 5:    28          76.0
## 6:    32          95.0
## 7:    29          78.0
## 8:    33          82.0
## 9:    36          91.0
  • Chaining

Multiple operations can be chained together for efficiency. In this case, we are:

  • filtering to keep the students with a score above 85
  • aggregating the average score by Age
# Chain operations: Filter and aggregate
result_dt <- dt[Score > 85, .(Average_Score = mean(Score)), by = Age]
print(result_dt)
##      Age Average_Score
##    <num>         <num>
## 1:    30          89.5
## 2:    35          88.0
## 3:    40          92.0
## 4:    32          95.0
## 5:    36          91.0

Alternatively, we can filter to get the records with a score above 85 and select the Name and Score columns.

# Filter and select columns
filtered_selected_dt <- dt[Score > 85, .(Name, Score)]
print(filtered_selected_dt)
##       Name Score
##     <char> <num>
## 1:     Bob    90
## 2: Charlie    88
## 3:   David    92
## 4:   Frank    95
## 5:   Grace    89
## 6:    Jack    91

Practical exercise

In this exercise, you will be required to use the US Accidents (2016 - 2023) data set, which can be downloaded from here. Download the data and extract it to a destination path of your choice. Use the data.table package to:

  • read the data set into R using the fread("filepath") function.
  • filter to get the rows with Severity greater than 2
  • select the Source, Severity and Distance(mi) columns
  • aggregate to find the average distance for each Source

Solution

read the data set into R using the fread("filepath") function

library(data.table)

# Read the data 
us_accidents_dtb <- fread("big-data/us_accidents/US_Accidents_March23.csv")

# head(us_accidents_dtb) # uncomment to show the data table

filter to get the rows with Severity of more than 2

more_severe <- us_accidents_dtb[Severity > 2]
# head(more_severe) # uncomment to show the data table

select the Source, Severity and the Distance(mi) columns

selected_data <- us_accidents_dtb[, .(Source, Severity, `Distance(mi)`)]
head(selected_data)
##     Source Severity Distance(mi)
##     <char>    <int>        <num>
## 1: Source2        3         0.01
## 2: Source2        2         0.01
## 3: Source2        2         0.01
## 4: Source2        3         0.01
## 5: Source2        2         0.01
## 6: Source2        3         0.01

Aggregate to find the average distance for each Source

# Calculate the average distance by Source
aggregated_dt <- selected_data[, .(Average_Distance = mean(`Distance(mi)`)), by = Source]
print(aggregated_dt)
##     Source Average_Distance
##     <char>            <num>
## 1: Source2       0.21666965
## 2: Source3       0.05937252
## 3: Source1       0.83691407

________________________________________________________________________________

8.1.2.2 Memory Management in R

A clear grasp of memory management in R helps the analyst or statistician predict how much computer memory a given task will need and get the most out of the memory that is available. Proper memory management is also crucial for effective use of system resources, keeping the program running smoothly. R objects are stored in Random Access Memory (RAM).

Let's explore some of the key memory management techniques:

  1. Removing unused objects

Objects, such as variables, that are no longer required should be removed to free up memory; loaded packages can likewise be detached when no longer needed. The rm() function is used to delete objects.

# Create a variable x
x <- rnorm(11000)

# Remove the variable x
rm(x) # variable x is deleted

You can call ls() (with no arguments) to list the objects in the workspace and confirm that x has been deleted.
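
For instance, after the removal above:

# List the remaining objects and check that x is gone
ls()
"x" %in% ls()   # FALSE once x has been removed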

  2. Monitoring memory usage

R has several functions that help in monitoring the memory usage. This helps the programmer identify if the session is consuming too much memory.

object.size() is one of the functions used to compute the memory usage of an object.

# Create a variable x
x <- rnorm(100000)

# Compute the memory usage
object.size(x)
## 800048 bytes

The variable x consumes approximately 800,048 bytes of memory: 100,000 double-precision values at 8 bytes each, plus a small amount of object overhead.
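
For a more readable figure, the print method for object.size() accepts a units argument:

# Print the object size in human-readable units
print(object.size(x), units = "auto")   # roughly 781.3 Kb for the vector above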

In addition, one can apply memory profiling, that is, tracking memory usage while a script executes. Rprof() is R's built-in profiler and can record memory information as the code runs. Let's demonstrate:

# Start memory profiling
Rprof(memory.profiling = TRUE)

# Run some computations
x <- rnorm(1e6)
y <- rnorm(1e6)

# Stop profiling
Rprof(NULL)

# Summary of memory usage
summaryRprof()
## $by.self
##         self.time self.pct total.time total.pct
## "rnorm"      0.12      100       0.12       100
## 
## $by.total
##                           total.time total.pct self.time self.pct
## "rnorm"                         0.12       100      0.12      100
## "block_exec"                    0.12       100      0.00        0
## "call_block"                    0.12       100      0.00        0
## "eng_r"                         0.12       100      0.00        0
## "eval_with_user_handlers"       0.12       100      0.00        0
## "eval"                          0.12       100      0.00        0
## "evaluate_call"                 0.12       100      0.00        0
## "evaluate::evaluate"            0.12       100      0.00        0
## "evaluate"                      0.12       100      0.00        0
## "generator$render"              0.12       100      0.00        0
## "handle"                        0.12       100      0.00        0
## "in_dir"                        0.12       100      0.00        0
## "in_input_dir"                  0.12       100      0.00        0
## "knitr::knit"                   0.12       100      0.00        0
## "process_file"                  0.12       100      0.00        0
## "process_group"                 0.12       100      0.00        0
## "render_book_script"            0.12       100      0.00        0
## "render_book"                   0.12       100      0.00        0
## "render_cur_session"            0.12       100      0.00        0
## "rmarkdown::render_site"        0.12       100      0.00        0
## "rmarkdown::render"             0.12       100      0.00        0
## "timing_fn"                     0.12       100      0.00        0
## "withCallingHandlers"           0.12       100      0.00        0
## "withVisible"                   0.12       100      0.00        0
## "xfun:::handle_error"           0.12       100      0.00        0
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 0.12

Additionally, the pryr package provides further functions for monitoring memory usage. It can be installed with:

install.packages("pryr")

Below is a demonstration of how the pryr package works.

# Load the library
library(pryr)
## 
## Attaching package: 'pryr'
## The following object is masked from 'package:data.table':
## 
##     address
## The following object is masked from 'package:dplyr':
## 
##     where
## The following objects are masked from 'package:purrr':
## 
##     compose, partial
# Check memory usage of the entire R session
mem_used()
## 5.08 GB
# Create a variable 
y <- rnorm(1e6)

# Check memory usage of a specific object
object_size(y)
## 8.00 MB
  3. Freeing up the Memory

R has an automatic memory management system; however, the programmer can request garbage collection explicitly. The gc() function reclaims memory held by objects that are no longer referenced, which is helpful after removing large objects.

# Collect the garbage to free up unused memory
gc()

Calling gc() triggers garbage collection immediately rather than waiting for R to free up the memory automatically, and it prints a summary of current memory usage.

Practical exercise

After working on the previous practical exercise, you will manage the memory by;

  • Finding the objects that fill up the memory and list them.
  • Deleting the variables not required
  • Using gc() to trigger garbage collection

Solution

  • Find the objects that fill up the memory
ls()

"more_severe", "selected_data", "us_accidents_dtb" are the objects that fill up the memory.

  • Delete the variables not required Answer: the three variables are required for future data analysis. If any is to be deleted for instance "selected_data" will be deleted by rm(selected_data) command.

  • Use gc() to trigger garbage collection

library(pryr)

# Compute the memory usage 
mem_used()

# Trigger garbage collection
gc()

# Recalculate the memory usage to see the effect of garbage collection
mem_used()

The results below show that mem_used() initially reported about 5 GB of memory usage, and after running gc() (garbage collection) the figure remained the same. This indicates that there was no significant unused memory to free up. The output shows the memory statistics, with Ncells and Vcells representing internal R objects and vector cells respectively. The gc() call did not reduce memory usage because no unneeded objects were awaiting removal, meaning the session was already as lean as it could be at that point.

5 GB
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  13700827  731.8   28312494 1512.1  18470724  986.5
Vcells 529470462 4039.6  770624900 5879.5 639364455 4878.0
5 GB

________________________________________________________________________________

8.1.3 Reading and Writing Large Files

8.1.3.1 Optimized File I/O

The data.table package offers highly optimized functions, fread() and fwrite(), that are much faster for reading and writing large files. They provide significant speedups thanks to efficient file handling and parallel processing. The I/O process can be optimized by:

  1. Efficient reading
  2. Efficient writing
  3. Parallel Processing for faster import/export
  1. Efficient Reading

The fread() function is designed to read large CSV files much faster than read.csv() by using multiple threads and optimized parsing. Here is how you can read a file:

# Load the library
library(data.table)

# Read large csv file 
data <- fread("GlobalLandTemperaturesByCity.csv")

Download the csv file from here. Extract the downloaded file and read the first csv file.

These are the advantages of reading data with the fread() function:

  • Parallel Reading: Uses multiple threads to read files faster.
  • Automatic Type Detection: Automatically detects column types without needing to pre-specify them.
  • Handling Big Data: Efficient for reading gigabytes of data (a sketch of reading only part of a file follows this list).
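
When only part of a file is needed, fread() can also limit what gets read. A small sketch, using a hypothetical file name and illustrative column names:

# Read only two columns and the first 100,000 rows
partial <- fread("large_dataset.csv",
                 select = c("Severity", "Distance(mi)"),  # column names are illustrative
                 nrows  = 100000)
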
  2. Efficient Writing

A large data.table can be saved locally using the fwrite() function. It handles large data volumes efficiently and supports various formats.

# Writing a data frame to a CSV file using fwrite()
fwrite(data, "output_file.csv")

The "output_file" is the destination path where the data.table is written.

These are the features of fwrite():

  • Fast Export: Writes data using optimized I/O to speed up the process.
  • Compression: Supports compressed output for efficient storage (see the example after this list).
  • Parallel Writing: Uses multiple threads to write files faster.
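
As a brief example of the compression feature (the output file name is arbitrary, and gzip output requires a reasonably recent version of data.table):

# Write a gzip-compressed CSV
fwrite(data, "output_file.csv.gz", compress = "gzip")
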
  3. Parallel Processing for faster Import/Export

fread() and fwrite() use parallel processing by default; however, you can configure the number of threads used for reading and writing with the setDTthreads() function. Let's now read and write the data using 4 threads.

# Set the number of threads to use for parallel processing
setDTthreads(4)

# Now fread() and fwrite() will use 4 threads
data <- fread("large_dataset.csv")
fwrite(data, "output_file.csv")

Practical exercise

In this exercise, you are required to use the US Accidents (2016 - 2023) data set and solve the problems below:

  1. Read the data set using fread()
  2. Time the fread() call when reading the file. Hint: replace "large_dataset.csv" with the file path of the downloaded file
# Timing fread() for a large file
system.time(data <- fread("large_dataset.csv"))
  3. Add a new column to the data set: convert the temperature from \(^\circ F\) (degrees Fahrenheit) to \(^\circ C\) (degrees Celsius)
  4. Write the file locally and check the time taken to write the file. Hint: here is how you can achieve steps 3 and 4
# Do the Calculations

# Timing fwrite() for a large file
data[, new_column := rnorm(nrow(data))]  # Add a new column
system.time(fwrite(data, "large_output.csv"))

Solution

  1. Read the data set using fread()
  2. Time the fread() when reading the file. Hint: replace the "large_dataset.csv" by the file path of the downloaded file
# Timing fread() for a large file
system.time(data <- fread("big-data/us_accidents/US_Accidents_March23.csv"))
##    user  system elapsed 
##  34.849   1.982  22.882
  3. Add a new column to the data set: convert the temperature from \(^\circ F\) (degrees Fahrenheit) to \(^\circ C\) (degrees Celsius)
data[, `Temperature(C)` := (5/9) * (`Temperature(F)` - 32)]

head(data$`Temperature(C)`)
## [1] 2.722222 3.277778 2.222222 1.722222 2.222222 3.277778
  4. Write the file locally and check the time taken to write the file
system.time(fwrite(data, "temp-data/us_accidents.csv"))
##    user  system elapsed 
##  18.610   2.192  12.563

________________________________________________________________________________

8.1.4 Hands-On Exercises

Download the Amazon Fine Food Reviews data set from here and work on the following problems;

  1. Remove any unwanted objects from the workspace. Use gc() to view the garbage collection statistics.
  2. Read the data set using read.csv() and fread() and time each method. Compare the time taken by each function to read the file.
  3. Remove the data set from memory using rm() and call gc() again.
  4. Reread the data set using fread(), but control the number of threads. Set different thread values (e.g. 1, 2, 3, 4 and 5) and compare the time taken for processing.
  5. Write the file locally (use a different file name) and use gc() for garbage collection.

Finally, discuss the importance of scalability when dealing with real-world data.


Solution

  1. Remove any unwanted objects from the workspace. Use gc() to view the garbage collection statistics
library(data.table)

# Remove unwanted objects and show memory usage
gc()
##             used   (Mb) gc trigger    (Mb)  max used   (Mb)
## Ncells  14310300  764.3   29068151  1552.5  14697689  785.0
## Vcells 920369636 7021.9 1580223884 12056.2 920453284 7022.6
  2. Read the data set using read.csv() and fread() and time each method. Compare the time taken by each function to read the file.
  • Using read.csv
# time taken when reading with the `read.csv()` function 
system.time(amazon_reviews_df <- 
              read.csv("big-data/amazon-fine-food-reviews/Reviews.csv"))
##    user  system elapsed 
##   6.022   0.452   6.651
  • Using fread()
# time taken when reading with the `fread()` function 
system.time(amazon_reviews_dtB <- 
              fread("big-data/amazon-fine-food-reviews/Reviews.csv"))
##    user  system elapsed 
##   2.180   0.042   1.133

The fread() function reads the data about six times faster (by elapsed time) than the read.csv() function.

  3. Remove the data set from the memory using rm() and call gc() again
# Remove the two data objects 
rm(amazon_reviews_df, amazon_reviews_dtB)

# Call gc() to trigger garbage collection and show memory usage
gc()
##             used   (Mb) gc trigger    (Mb)  max used   (Mb)
## Ncells  14310332  764.3   29068151  1552.5  16124320  861.2
## Vcells 920371936 7021.9 1580223884 12056.2 975727152 7444.3
  4. Reread the data set using fread(), but control the number of threads. Set different thread values (e.g. 1, 2, 3, 4 and 5) and compare the time taken for processing
library(data.table)

# The filepath 
file_path <- "big-data/amazon-fine-food-reviews/Reviews.csv"

# Loop through 1 to 5 threads
for (threads in 1:5) {
  
  # Set the number of threads
  setDTthreads(threads)
  
  # Print the time taken
  time_taken <- system.time(amazon_reviews_dtB <- fread(file_path))
  cat("Time taken with", threads, "thread(s):", time_taken, "\n")
  
  # Remove the data object to free memory
  rm(amazon_reviews_dtB)
  
  # Run garbage collection
  gc()
}
## Time taken with 1 thread(s): 1.886 0.035 1.92 0 0 
## Time taken with 2 thread(s): 2.969 0.074 1.548 0 0 
## Time taken with 3 thread(s): 4.344 0.081 1.504 0 0 
## Time taken with 4 thread(s): 5.959 0.061 1.686 0 0 
## Time taken with 5 thread(s): 5.489 0.031 1.488 0 0

Reading with more threads reduces the elapsed time somewhat (from about 1.9 seconds with one thread to roughly 1.5 seconds with five), although the gains level off quickly on this data set.

  5. Write the file locally (use a different file name) and use gc() for garbage collection
# Set the threads and read the data 
setDTthreads(5)
amazon_reviews_dtB <-
  fread("big-data/amazon-fine-food-reviews/Reviews.csv")

# Write the file locally
fwrite(amazon_reviews_dtB, "temp-data/Reviews.csv")

# Garbage collection 
rm(amazon_reviews_dtB)
gc()
##             used   (Mb) gc trigger    (Mb)  max used   (Mb)
## Ncells  14310860  764.3   29068151  1552.5  16124320  861.2
## Vcells 920373405 7021.9 1580223884 12056.2 975727152 7444.3

Finally, discuss the importance of scalability when dealing with real-world data.

Scalability is critical when handling real-world data because datasets can grow significantly in size and complexity, often comprising millions of rows and gigabytes of information. As data grows, processing it efficiently becomes more challenging, making scalability essential for:

- Performance: Scalable systems allow operations, such as reading, analyzing, and transforming data, to remain efficient even as data size increases, preventing long processing times.

- Resource Optimization: Scalable solutions can adjust resource usage, such as memory and CPU, to match the data’s size, ensuring optimal performance without overwhelming the system.

- Cost Efficiency: Scalable approaches prevent the need for costly hardware upgrades by making better use of existing resources, reducing infrastructure costs in the long term.

- Handling Big Data: Many industries, like finance, healthcare, and e-commerce, deal with vast amounts of data. Scalable systems ensure that businesses can process and extract value from this data effectively.

- Future-Proofing: As data needs continue to evolve, scalability ensures that systems can accommodate future growth without requiring significant overhauls or replacements.

In short, scalability ensures that as data grows, the system’s capacity to process it efficiently grows too, enabling sustainable data management and analysis.

________________________________________________________________________________

8.2 Data Cleaning and Transformation

8.2.1 Introduction to Data Cleaning

8.2.1.1 Common Data Issues

Raw data from the field can be inaccurate and inconsistent, posing threats to data operations and leading to false analysis and decision making. This type of data is known as dirty, unclean or rogue data. Here are some of the common data issues that render data unclean:

  • Incomplete data: Some data points may be missing or left blank.
  • Duplicate data: Some records may occur multiple times, often coming from different sources.
  • Outdated data: Values that were relevant when collected may now be obsolete and irrelevant.
  • Inconsistent data formats: Data may be represented differently across sources, for example as JSON documents, SQL tables, NoSQL collections or even image formats (.png, .jpeg).
  • Outlier Values: Outliers are extremely high or low values in the data set, such as a student age of 304 years or a body temperature of \(-51^\circ C\). Such values are often impossible and can indicate errors, anomalies and exceptions in the data.

Data integrity is crucial in data analysis, and poor data quality can hinder an organization's operations. The foundation of strategic planning, efficient operations and sound decision-making is accurate, complete and consistent data. Here are some of the threats that unclean data poses to data analytics and businesses in general:

  • Inconsistent Results: For instance, conflicting or duplicate reports on customer behavior lead to loss of trust in the results and their interpretation.
  • Increased Costs: Companies spend a lot of time and money on data cleaning, processing and verification.
  • Inaccurate Insights and Decision Making: Poor data quality, such as incorrect values, leads to poor insights and therefore poor decisions. For instance, inaccurate sales data may lead to flawed forecasting, poor inventory management and weak marketing strategies.
  • Missed Opportunities: Poor data quality may prevent a company from identifying key trends and gaps in the market, such as customer segments or emerging markets.

8.2.1.2 Data Cleaning techniques

8.2.1.2.1 Handling Missing Data

The presence of missing values is inevitable, especially when integrating data from different sources. While there is no single set method (one-time solution) for handling missing data points in R, researchers have come up with different methodologies to tackle the issue. Here are some of them:

  • Deletion: a simple method that deletes all records/rows containing null values (or only the rows with null values in important columns). na.omit() is one such function.
  • Imputation: several R packages can fill null values by imputation, using the remaining values in the data to model a plausible value for the missing one. These packages include imputeR, Amelia and missForest, which use automated variable selection, bootstrap-based multiple imputation and single imputation respectively.
  • Use of algorithms that tolerate null values, for instance K-Nearest Neighbors (kNN) and Naive Bayes.
  • Mean Imputation: simply filling the null values with the average of the remaining values (a short sketch follows this list).
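
A minimal sketch of mean imputation on a single numeric vector (the vector v is made up for illustration):

# Mean imputation: replace NAs with the mean of the observed values
v <- c(4, NA, 7, 10, NA, 5)
v[is.na(v)] <- mean(v, na.rm = TRUE)
v   # 4.0 6.5 7.0 10.0 6.5 5.0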

Let's create a random data set and handle its missing values. Before the null values can be fixed, however, they need to be identified in the data set.

Create a random athlete data set

set.seed(123)  # Set seed for reproducibility

# Generate a data frame with random data for athletes
athlete_data <- data.frame(
  athlete_id = 1:100,  # 100 athletes
  name = paste0("Athlete", 1:100),
  age = sample(18:40, 100, replace = TRUE),
  sport = sample(c("Basketball", "Soccer", "Tennis", "Swimming", "Running"), 100, replace = TRUE),
  country = sample(c("USA", "Kenya", "China", "Brazil", "Germany"), 100, replace = TRUE),
  score = round(runif(100, 60, 100), 1)  # Scores between 60 and 100
)

# Introduce 15% missing values randomly across the entire data frame
total_cells <- prod(dim(athlete_data))  # Total number of cells
missing_cells <- round(0.15 * total_cells)  # 15% of cells will have missing values

# Randomly select indices to be set to NA
missing_indices <- sample(seq_len(total_cells), missing_cells)

# Convert the data frame to a matrix for easier manipulation of specific indices
athlete_data_matrix <- as.matrix(athlete_data)
athlete_data_matrix[missing_indices] <- NA

# Convert back to a data frame
athlete_data <- as.data.frame(athlete_data_matrix)

# View the dataset
head(athlete_data)
##   athlete_id     name  age      sport country score
## 1          1 Athlete1   32 Basketball Germany  95.6
## 2          2 Athlete2   36     Tennis    <NA>  92.5
## 3          3 Athlete3 <NA>       <NA>     USA  89.9
## 4          4 Athlete4   20       <NA>  Brazil  66.2
## 5          5     <NA>   27     Tennis   Kenya  65.0
## 6       <NA> Athlete6   35    Running Germany  99.0

Let's count all the null values. is.na() from base R identifies the null values, while sum() adds up all the identified nulls.

sum(is.na(athlete_data))
## [1] 90

Alternatively, the null values can be counted per column, using sapply() together with is.na() and sum().

# Count the number of null values in each column
null_counts <- sapply(athlete_data, function(x) sum(is.na(x)))

# Display the counts of null values
null_counts
## athlete_id       name        age      sport    country      score 
##         17         18         10         14         18         13

Let's now delete the rows containing null values with na.omit() and name the resulting data frame athlete_data_clean.

athlete_data_clean <- na.omit(athlete_data)

Confirm the operation by counting the null values in each column and overall

# Count the null values by column 
null_count_by_column <-  sapply(athlete_data_clean,
                                function(x) sum(is.na(x)))
null_count_by_column
## athlete_id       name        age      sport    country      score 
##          0          0          0          0          0          0
# Overall null count 
overall_null_count <- sum(is.na(athlete_data_clean))
print(paste("Overall Null Count: ", overall_null_count))
## [1] "Overall Null Count:  0"

Now that we have seen how null values can be handled by deleting the rows that contain them, let's perform imputation on the original data set instead. The null values will not be deleted; they will be replaced by filling each one with the previous or next value.

We will be using the original athlete_data, which still has null values. The tidyr package will be used; it can be installed with:

# If not already installed run the command below on the console
install.packages("tidyr")

Count the null values once again and view a sample of the data set.

# Count the null values by column 
null_count_by_column <-  sapply(athlete_data,
                                function(x) sum(is.na(x)))
null_count_by_column
## athlete_id       name        age      sport    country      score 
##         17         18         10         14         18         13
# View the sample of the data - first 6 records 
head(athlete_data)
##   athlete_id     name  age      sport country score
## 1          1 Athlete1   32 Basketball Germany  95.6
## 2          2 Athlete2   36     Tennis    <NA>  92.5
## 3          3 Athlete3 <NA>       <NA>     USA  89.9
## 4          4 Athlete4   20       <NA>  Brazil  66.2
## 5          5     <NA>   27     Tennis   Kenya  65.0
## 6       <NA> Athlete6   35    Running Germany  99.0

Now fill the null values with the previous values, specifically in the sport, athlete_id and age columns.

# Import the package 
library(tidyr)

# Fill the null value with the previous value 
athlete_data_filled <- athlete_data %>%
  fill(athlete_id, age, sport, .direction = "down")

# Count the null values by column 
null_count_by_column <-  sapply(athlete_data_filled,
                                function(x) sum(is.na(x)))
null_count_by_column
## athlete_id       name        age      sport    country      score 
##          0         18          0          0         18         13
# View the first few rows 
head(athlete_data_filled)
##   athlete_id     name age      sport country score
## 1          1 Athlete1  32 Basketball Germany  95.6
## 2          2 Athlete2  36     Tennis    <NA>  92.5
## 3          3 Athlete3  36     Tennis     USA  89.9
## 4          4 Athlete4  20     Tennis  Brazil  66.2
## 5          5     <NA>  27     Tennis   Kenya  65.0
## 6          5 Athlete6  35    Running Germany  99.0

You can see the target columns (age, athlete_id and sport) are filled and have zero null values. The fill() function has an argument, .direction, that specifies the direction in which to fill the missing values. In this case the value of the argument is "down" because we are filling based on the previous value. It can also be "up", "downup" (fill down first, then up) or "updown" (fill up first, then down).
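
For instance, to also fill the country column in both directions (so that any leading missing values pick up the next non-missing value), one could write:

# Fill country downwards first, then upwards for any leading NAs
athlete_data_filled2 <- athlete_data_filled %>%
  fill(country, .direction = "downup")

sum(is.na(athlete_data_filled2$country))   # expected to be 0 after filling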

Practical exercise

Download the Detailed NFL Play-by-Play Data 2009-2018 data set from here. Read the data set using fread() and subset it to the game_date, drive, home_timeouts_remaining and defteam columns. Name the resulting data table subset_dT1. Perform the following operations on subset_dT1.

  1. Count the total null values
  2. Create a copy of the data set and rename it subset_dT1_copy. Drop rows that contain any null value.
  3. Create another copy from subset_dT1, name it subset_dT1_copy2 and impute the missing values based on the previous value (.direction = "down").
  4. Count the null values on the copy data set

Solution

Read the data set

library(data.table)

# Load the data 
nfl_dtB <- fread("big-data/NFL-Play-data/data.csv")


# Subset the data table to find the required columns
subset_dT1 <- nfl_dtB[, .(game_date , drive, home_timeouts_remaining, defteam)]

# remove the nfl_dtB to free up space 
rm(nfl_dtB)
gc()
##             used   (Mb) gc trigger    (Mb)   max used   (Mb)
## Ncells  14314646  764.5   29068151  1552.5   16124320  861.2
## Vcells 921515539 7030.7 1580223884 12056.2 1129940175 8620.8
  1. Count the null values
# Count the null values 
sum(is.na(subset_dT1))
## [1] 12879

There are more than 12 thousand missing values in the data set

  2. Create a copy of the data table and name it subset_dT1_copy. Drop rows that contain any null value.
# Create a copy to avoid referencing issues 
subset_dT1_copy <- copy(subset_dT1)

# Drop rows that contain null values 
subset_dT1_copy <- na.omit(subset_dT1_copy)

# Confirm the operation 
sum(is.na(subset_dT1_copy))
## [1] 0
  3. Create another copy from subset_dT1, name it subset_dT1_copy2 and impute the missing values based on the previous value (.direction = "down").
subset_dT1_copy2 <- copy(subset_dT1)

# Impute the missing values based on the previous ones 
library(tidyr)

subset_dT1_copy2 <- subset_dT1_copy2 %>%
  fill(game_date , drive, home_timeouts_remaining, defteam,
       .direction="down")
  4. Count the null values on the copy data set
# Count the null values 
sum(is.na(subset_dT1_copy2))
## [1] 0

________________________________________________________________________________

8.2.1.2.2 Dealing with Outliers

Outliers are values that are completely different from the other data points in the data set. They may be extremely small or extremely large compared with the rest of the values. Outliers can also be non-realistic values, for instance a body temperature of \(143^\circ C\) or an athlete height of 0.23 meters. They are brought about by:

  • Measurement error, such as human error when collecting the data.
  • A sample that does not accurately represent the population being studied, for instance soil samples picked near a cow shed will have unusually high acidity levels.
  • A mixture of distributions caused by collecting data from two distinct populations.
  • Nevertheless, outliers can also indicate new behavior in the data, especially when the outlier values are the most recently collected data.

When cleaning data, it is advisable to remove outliers because they can have a significant impact on downstream analysis, for instance skewed and biased results, misleading interpretations, an inflated range and standard deviation (giving a false picture of the data's variability), and erroneous assumptions.

Let's generate a random data set of villagers' health metrics (blood pressure, heart rate, body temperature, etc.) and use it to detect the outliers within.

  • Generate a random data set
# Set seed for reproducibility
set.seed(123)

# Create a data frame with health-related metrics for villagers
health_data <- data.frame(
  villager_id = 1:100,
  body_temperature = c(rnorm(95, mean = 98.6, sd = 1), rnorm(5, mean = 102, sd = 1)),  # Include 5 outliers
  blood_pressure = c(rnorm(95, mean = 120, sd = 10), rnorm(5, mean = 180, sd = 5)),   # Include 5 outliers
  heart_rate = c(rnorm(95, mean = 70, sd = 8), rnorm(5, mean = 110, sd = 5)),         # Include 5 outliers
  cholesterol = c(rnorm(95, mean = 200, sd = 30), rnorm(5, mean = 300, sd = 10))      # Include 5 outliers
)

# View the first few rows of the dataset
head(health_data)
##   villager_id body_temperature blood_pressure heart_rate cholesterol
## 1           1         98.03952       112.8959   87.59048    178.5427
## 2           2         98.36982       122.5688   80.49930    177.4193
## 3           3        100.15871       117.5331   67.87884    171.8438
## 4           4         98.67051       116.5246   74.34555    168.4246
## 5           5         98.72929       110.4838   66.68528    186.8852
## 6           6        100.31506       119.5497   66.19002    209.9354
  • Find outliers using the IQR (interquartile range). Any value larger than \(Q_3 + 1.5\times IQR\) (the upper quartile plus 1.5 times the IQR) or smaller than \(Q_1 - 1.5\times IQR\) (the lower quartile minus 1.5 times the IQR) is considered an outlier.
# Function to identify outliers using IQR
identify_outliers_iqr <- function(data_column) {
  Q1 <- quantile(data_column, 0.25, na.rm = TRUE)
  Q3 <- quantile(data_column, 0.75, na.rm = TRUE)
  IQR_value <- IQR(data_column, na.rm = TRUE)
  
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  
  return(data_column < lower_bound | data_column > upper_bound)
}

# Apply IQR method to identify outliers in each health metric
health_data$outlier_body_temp_iqr <- identify_outliers_iqr(health_data$body_temperature)
health_data$outlier_blood_pressure_iqr <- identify_outliers_iqr(health_data$blood_pressure)
health_data$outlier_heart_rate_iqr <- identify_outliers_iqr(health_data$heart_rate)
health_data$outlier_cholesterol_iqr <- identify_outliers_iqr(health_data$cholesterol)

# View the data with outlier flags for body temperature
health_data[health_data$outlier_body_temp_iqr, c("villager_id", "body_temperature")]
##    villager_id body_temperature
## 96          96         101.3997
## 97          97         104.1873
## 98          98         103.5326
## 99          99         101.7643
  • Find the outliers using the Z-score method. The Z-score method standardizes the data to have a mean of 0 and a standard deviation of 1; outliers are values with a Z-score greater than 3 or less than -3.
# Function to identify outliers using the Z-score
identify_outliers_zscore <- function(data_column) {
  z_scores <- scale(data_column)
  return(abs(z_scores) > 3)
}

# Apply Z-score method to identify outliers in each health metric
health_data$outlier_body_temp_zscore <- identify_outliers_zscore(health_data$body_temperature)
health_data$outlier_blood_pressure_zscore <- identify_outliers_zscore(health_data$blood_pressure)
health_data$outlier_heart_rate_zscore <- identify_outliers_zscore(health_data$heart_rate)
health_data$outlier_cholesterol_zscore <- identify_outliers_zscore(health_data$cholesterol)

# View the data with Z-score outlier flags for blood pressure
health_data[health_data$outlier_blood_pressure_zscore, c("villager_id", "blood_pressure")]
##     villager_id blood_pressure
## 96           96       189.9861
## 97           97       183.0035
## 98           98       173.7436
## 99           99       176.9442
## 100         100       174.0726

Practical exercise

Download the Brazil’s House of Deputies Reimbursements from here. Read the data set, identify numerical columns that contain the outliers and remove them


Solution

Read the data set

library(data.table)

# Load the data 
dtB <- fread("big-data/brazil-house-of-deputies/deputies_dataset.csv")

# Shape of the data set 
dim(dtB)
## [1] 3014902      10

Remove the outliers using the IQR method. The receipt_value column is the only numerical column with intrinsic meaning.

# Function to identify outliers using IQR
identify_outliers_iqr <- function(data_column) {
  Q1 <- quantile(data_column, 0.25, na.rm = TRUE)
  Q3 <- quantile(data_column, 0.75, na.rm = TRUE)
  IQR_value <- IQR(data_column, na.rm = TRUE)
  
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  
  return(data_column < lower_bound | data_column > upper_bound)
}

# Apply the IQR method
outliers <- identify_outliers_iqr(dtB$receipt_value)

# Remove the outliers
clean_dtB_1 <- dtB[!outliers, ]

# View the shape of the data set
dim(clean_dtB_1)
## [1] 2708043      10
rm(clean_dtB_1) # clean the memory

Use the Z-score method

# Function to identify outliers using the Z-score
identify_outliers_zscore <- function(data_column) {
  z_scores <- scale(data_column)
  return(abs(z_scores) > 3)
}

# Apply the Z-score function
outliers <- identify_outliers_zscore(dtB$receipt_value)

# Convert outliers (matrix) to a logical vector
outliers_logical <- as.vector(outliers)

# Remove the outliers
clean_dtB_2 <- dtB[!outliers_logical, ]

# View the shape of the data set
dim(clean_dtB_2)
## [1] 2975608      10
# Clean the memory
rm(clean_dtB_2)

More outliers were identified using the IQR method (about 307,000 rows removed) than the Z-score method (about 39,000 rows removed).

________________________________________________________________________________

8.2.1.2.3 Data Transformation Techniques
  • Normalization and Standardization

Normalization and standardization are both data preprocessing techniques that adjust the range of input values to make them easier to analyze and process. They are often used interchangeably; however, they have different applications and techniques. To be specific:

Normalization involves rescaling data into a specific range, for instance between 0 and 1, where the maximum value becomes 1, the minimum becomes 0 and all other values fall in between. Ranges of -1 to 1 and 0 to 100 are also commonly used. This technique is useful when the distribution of the data is unknown or not normal, but it is sensitive to outliers because of its restricted range.
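
For reference, min-max scaling of a value \(x\) to the range \([0, 1]\) can be written as:

\[x' = \frac{x - \min(x)}{\max(x) - \min(x)}\]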

In contrast, standardization involves rescaling the data values to have a mean of zero and a standard deviation of one. This technique is useful when the distribution of the data is known and approximately normal. It is less affected by outliers because it does not have a restricted range. The Z-score is one of the most popular standardization methods, calculated as \[Z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}\]

Let's normalize/standardize a vector v1 <- c(1200, 34567, 3456, 12, 3456, 0985, 1211) in three easy steps.

  1. Min-Max Scaling (normalization)

The vector will be scaled to the range 0 to 1 using the preProcess() function from the caret package.

# Load the library
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
v1 <- c(1200,34567,3456,12,3456,0985,1211)

# train the preprocesser(scaler)
minMaxScaler <- preProcess(as.data.frame(v1), method=c("range"))

# Apply the scaler
v1_scaled <- predict(minMaxScaler, as.data.frame(v1))
v1_scaled
##           v1
## 1 0.03437997
## 2 1.00000000
## 3 0.09966720
## 4 0.00000000
## 5 0.09966720
## 6 0.02815801
## 7 0.03469831
  2. Using the scale() function (standardization)
# Create the vector
v1 <- c(1200,34567,3456,12,3456,0985,1211)

v1_scaled <- as.data.frame(scale(v1))
v1_scaled
##           V1
## 1 -0.4175944
## 2  2.2556070
## 3 -0.2368546
## 4 -0.5127711
## 5 -0.2368546
## 6 -0.4348191
## 7 -0.4167131
  3. Log transform

The log transform is used when the data is unevenly distributed, for instance right-skewed: it compresses large values and spreads out small ones, making the distribution more symmetric.

# Create the vector
v1 <- c(1200,34567,3456,12,3456,0985,1211)

log_scaled_v1 <- log(as.data.frame(v1))
log_scaled_v1
##          v1
## 1  7.090077
## 2 10.450655
## 3  8.147867
## 4  2.484907
## 5  8.147867
## 6  6.892642
## 7  7.099202
  • Encoding Categorical Variables

Encoding is the process of converting categorical variables into numerical variables during data preprocessing. Most machine learning models only accept numerical inputs, so encoding is a crucial preprocessing step. The most common methods are one-hot encoding and label encoding; we will discuss how to perform both in R.

  1. One-hot encoding

One-hot encoding represents each category as a separate column and records a 1 or 0 depending on whether that category is present in the row. This avoids machine learning models assuming an ordering between the categories. Let's perform one-hot encoding on a color vector.
library(caret)

# Sample categorical variable
color_df <- data.frame(
  Color = c("Red", "Blue", "Green", "Red", "Blue")
)

# dummify the data 
color_dmy <- dummyVars(" ~ .", data = color_df)
new_color_df <- data.frame(predict(color_dmy, newdata = color_df))
new_color_df
##   ColorBlue ColorGreen ColorRed
## 1         0          0        1
## 2         1          0        0
## 3         0          1        0
## 4         0          0        1
## 5         1          0        0

Alternatively, you can use model.matrix(), a built-in R function.

# Using model.matrix() to one-hot encode
color_df$Color <- as.factor(color_df$Color)  # Convert Color to a factor
one_hot_encoded <- model.matrix(~ Color - 1, color_df)

# View the encoded matrix
one_hot_encoded
##   ColorBlue ColorGreen ColorRed
## 1         0          0        1
## 2         1          0        0
## 3         0          1        0
## 4         0          0        1
## 5         1          0        0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Color
## [1] "contr.treatment"

One-hot encoding is preferred in cases where the categories are nominal, for example colors or countries, since it does not assume any order between the categories.

  2. Label encoding

This method assigns a unique integer to each category. It is simpler but may impose unintended ordinality.

# Sample categorical variable
color_df <- data.frame(
  Color = c("Red", "Blue", "Green", "Red", "Blue")
)

# Label encoding
color_df$Color_encoded <- as.numeric(as.factor(color_df$Color))

# View the label encoded data
color_df
##   Color Color_encoded
## 1   Red             3
## 2  Blue             1
## 3 Green             2
## 4   Red             3
## 5  Blue             1

Label encoding is preferred where the categories are ordinal, for instance education levels, living standards or age groups, that is, categories with a natural order.
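
When the natural order matters, it can be made explicit by setting the factor levels yourself before converting to numbers; a small sketch with a made-up education variable:

# Encode an ordinal variable with an explicit level order
education <- c("Primary", "Tertiary", "Secondary", "Primary")
education_factor <- factor(education,
                           levels = c("Primary", "Secondary", "Tertiary"),
                           ordered = TRUE)
as.numeric(education_factor)   # Primary = 1, Secondary = 2, Tertiary = 3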

Practical exercise

Using the same Brazil’s House of Deputies data set, encode the deputy_state and the political_party to have numerical values.


Solution

Read the data set

library(data.table)

# Load the data
dtB <- fread("big-data/brazil-house-of-deputies/deputies_dataset.csv")

# View the column 
head(dtB$political_party)
## [1] "PSB" "PSB" "PSB" "PSB" "PSB" "PSB"
# Label encode the column 
dtB$political_party_encoded <- as.numeric(factor(dtB$political_party))

# Confirm the encoding
head(dtB[, .(political_party, political_party_encoded)])
##    political_party political_party_encoded
##             <char>                   <num>
## 1:             PSB                      17
## 2:             PSB                      17
## 3:             PSB                      17
## 4:             PSB                      17
## 5:             PSB                      17
## 6:             PSB                      17

________________________________________________________________________________

8.2.2 Hands-On Exercises

In this exercise you will be required to use the NHANES data set from the NHANES package. Install the NHANES package with:

install.packages("NHANES")

then import the library and the data set

# Load the library 
library(NHANES)

# Load the data set
data("NHANES")

thereafter you are required to answer the following questions:

  1. Count the total number of missing values in each column in the data set and find which column has the most null values
sapply(NHANES, function(x) sum(is.na(x)))
  2. Remove the rows that have any missing values and store the results in a new data set
  3. For columns with numeric data, impute missing values by replacing them with the mean or median of the column
  4. For categorical columns (such as Gender or SmokeNow), impute missing values using the mode.
  5. For the BMI and Pulse columns, calculate the interquartile range (IQR) and identify any data points that fall outside the 1.5 * IQR range (both lower and upper bounds) as outliers
  6. Remove rows where outliers are found in the BMI and Pulse columns
  7. Apply Min-Max normalization to the BMI and Pulse columns, rescaling the data between 0 and 1
  8. Use one-hot encoding to convert the Gender column into a numerical format (i.e., create separate columns for male and female)

Solution

Install the NHANES package

install.packages("NHANES")

then import the library and the data set

# Load the library 
library(NHANES)

# Load the data set
data("NHANES")

# Count the total null values 
sum(is.na(NHANES))
## [1] 277677
  1. Count the total number of missing values in each column in the data set and find which column has the most null values
# Count the null values per column 
missing_counts <- sapply(NHANES, function(x) sum(is.na(x)))
print(missing_counts)
##               ID         SurveyYr           Gender              Age 
##                0                0                0                0 
##        AgeDecade        AgeMonths            Race1            Race3 
##              333             5038                0             5000 
##        Education    MaritalStatus         HHIncome      HHIncomeMid 
##             2779             2769              811              811 
##          Poverty        HomeRooms          HomeOwn             Work 
##              726               69               63             2229 
##           Weight           Length         HeadCirc           Height 
##               78             9457             9912              353 
##              BMI BMICatUnder20yrs          BMI_WHO            Pulse 
##              366             8726              397             1437 
##         BPSysAve         BPDiaAve           BPSys1           BPDia1 
##             1449             1449             1763             1763 
##           BPSys2           BPDia2           BPSys3           BPDia3 
##             1647             1647             1635             1635 
##     Testosterone       DirectChol          TotChol        UrineVol1 
##             5874             1526             1526              987 
##       UrineFlow1        UrineVol2       UrineFlow2         Diabetes 
##             1603             8522             8524              142 
##      DiabetesAge        HealthGen  DaysPhysHlthBad  DaysMentHlthBad 
##             9371             2461             2468             2466 
##   LittleInterest        Depressed     nPregnancies          nBabies 
##             3333             3327             7396             7584 
##       Age1stBaby    SleepHrsNight     SleepTrouble       PhysActive 
##             8116             2245             2228             1674 
##   PhysActiveDays         TVHrsDay       CompHrsDay    TVHrsDayChild 
##             5337             5141             5137             9347 
##  CompHrsDayChild  Alcohol12PlusYr       AlcoholDay      AlcoholYear 
##             9347             3420             5086             4078 
##         SmokeNow         Smoke100        Smoke100n         SmokeAge 
##             6789             2765             2765             6920 
##        Marijuana    AgeFirstMarij     RegularMarij      AgeRegMarij 
##             5059             7109             5059             8634 
##        HardDrugs          SexEver           SexAge  SexNumPartnLife 
##             4235             4233             4460             4275 
##   SexNumPartYear          SameSex   SexOrientation      PregnantNow 
##             5072             4232             5158             8304
# Find the column with the most missing values
column_with_most_na <- names(which.max(missing_counts))
print(paste("Column with the most missing values:", column_with_most_na))
## [1] "Column with the most missing values: HeadCirc"
  2. Remove the rows that have any missing values and store the results in a new data set
# Remove rows with missing values
clean_NHANES <- na.omit(NHANES)

# Check the new data set dimensions
dim(clean_NHANES)
## [1]  0 76

All the records are dropped (every row has at least one missing value), so simply removing rows with null values is not an effective way to handle the missing data here.

  3. For columns with numeric data, impute missing values by replacing them with the mean or median of the column
library(dplyr)

# Impute missing values for numeric columns with the mean
NHANES_imputed <- NHANES %>%
  mutate_if(is.numeric, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# Count the null values 
sum(is.na(NHANES_imputed))
## [1] 98569

If you prefer the median, replace mean(x, na.rm = TRUE) with median(x, na.rm = TRUE).

The number of null values drops from 277,677 to 98,569, a reduction of almost two thirds.

  4. For categorical columns (such as Gender or SmokeNow), impute missing values using the mode.
# Mode function
get_mode <- function(v) {
  uniqv <- unique(na.omit(v))
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Impute missing values for categorical columns with the mode
NHANES_imputed <- NHANES_imputed %>%
  mutate_if(is.factor, function(x) ifelse(is.na(x), get_mode(x), x))


# Count the null values 
sum(is.na(NHANES_imputed))
## [1] 0

The null values are completely imputed. The data set is complete.

  5. For the BMI and Pulse columns, calculate the interquartile range (IQR) and identify any data points that fall outside the 1.5 * IQR range (both lower and upper bounds) as outliers
# Calculate IQR for BMI
IQR_BMI <- IQR(NHANES$BMI, na.rm = TRUE)
Q1_BMI <- quantile(NHANES$BMI, 0.25, na.rm = TRUE)
Q3_BMI <- quantile(NHANES$BMI, 0.75, na.rm = TRUE)
lower_bound_BMI <- Q1_BMI - 1.5 * IQR_BMI
upper_bound_BMI <- Q3_BMI + 1.5 * IQR_BMI

# Calculate IQR for Pulse
IQR_Pulse <- IQR(NHANES$Pulse, na.rm = TRUE)
Q1_Pulse <- quantile(NHANES$Pulse, 0.25, na.rm = TRUE)
Q3_Pulse <- quantile(NHANES$Pulse, 0.75, na.rm = TRUE)
lower_bound_Pulse <- Q1_Pulse - 1.5 * IQR_Pulse
upper_bound_Pulse <- Q3_Pulse + 1.5 * IQR_Pulse

# Identify outliers for BMI and Pulse
outliers_BMI <- NHANES$BMI < lower_bound_BMI | NHANES$BMI > upper_bound_BMI
outliers_Pulse <- NHANES$Pulse < lower_bound_Pulse | NHANES$Pulse > upper_bound_Pulse
  6. Remove rows where outliers are found in the BMI and Pulse columns.
# Remove rows with outliers in BMI or Pulse
NHANES_no_outliers <- NHANES[!outliers_BMI & !outliers_Pulse, ]

# Check the new data set dimensions
dim(NHANES_no_outliers)
## [1] 9766   76

Removing the rows flagged as outliers in the BMI and Pulse columns reduces the number of rows, so outright removal based on the IQR rule discards data and may not be the most effective way of dealing with outliers in the NHANES data set.

  7. Apply Min-Max normalization to the BMI and Pulse columns, rescaling the data between 0 and 1.
# Min-Max normalization
min_max_norm <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

# Apply normalization to BMI and Pulse
NHANES_no_outliers$BMI <- min_max_norm(NHANES_no_outliers$BMI)
NHANES_no_outliers$Pulse <- min_max_norm(NHANES_no_outliers$Pulse)

# Create a temporary data set for that
#df_temp <- NHANES_no_outliers %>%
#  select(BMI, Pulse)
  8. Use one-hot encoding to convert the Gender column into a numerical format (i.e., create separate columns for male and female)
# One-hot encoding for Gender column
NHANES_encoded <- NHANES_no_outliers %>%
  mutate(Male = ifelse(Gender == "Male", 1, 0),
         Female = ifelse(Gender == "Female", 1, 0))

# Drop the original Gender column
# NHANES_encoded <- NHANES_encoded %>% select(-Gender)

# Check the Male column
head(NHANES_encoded$Male)
## [1]  0  0  0 NA  0  0
# Check the Female column
head(NHANES_encoded$Female)
## [1]  0  0  0 NA  0  0

Both columns come out as all zeros, which indicates that the comparison strings do not match the stored values: the Gender factor in the NHANES package is coded in lowercase ("female", "male"), so the comparisons should use Gender == "male" and Gender == "female". The NA in row four comes from an all-NA row introduced earlier when the data was subset with a logical vector containing NA values from the outlier flags.

________________________________________________________________________________