Chapter 8 Working with Real-World Data
8.1 Handling Large Datasets
8.1.1 Challenges of Large Datasets
When working with data in R, you will often encounter large data sets that are challenging to work with. Some of the challenges are:
- Data I/O takes a long time: large files are slow to transfer to or from the computer, which slows down crucial processes such as network operations, communication with peripheral devices, and data sharing.
- R can only address a limited amount of memory (roughly 2 to 4 GB on 32-bit systems), so data sets above that limit, say 5 GB, cannot be loaded in full.
- Indexing data sets with an extremely large number of rows and columns becomes more difficult.
- Many algorithms and pre-trained models run noticeably slower.
- Large data sets strain memory management. R stores data entirely in memory, so a very large data set can slow down or even crash the program in extreme cases.
For a data analyst or statistician, large data sets are inevitable, so researchers have worked on the issue and come up with the following solutions:
- Optimize memory usage through data type conversion and row-wise processing.
- Process large data sets in batches or chunks (see the sketch after this list).
- Use memory-efficient objects and programming tricks such as nested functions, lazy loading (loading data into memory only when it is needed) and garbage collection, where objects that are no longer useful are disposed of.
- Partition and stream the data, loading only small pieces into memory at any point in time.
- Use specialized packages for large-scale analysis such as data.table.
- Apply data sampling: instead of processing the whole data set at once, take a random, manageable sample and process it.
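As a brief illustration of chunked processing, the sketch below reads a hypothetical file large_file.csv in blocks of 100,000 rows with data.table's fread(); the file name, chunk size and per-chunk summary are placeholders to adapt to your own data.
# Chunked reading sketch: process a large CSV in 100,000-row blocks
library(data.table)
infile     <- "large_file.csv"                  # hypothetical file path
chunk_size <- 1e5
col_names  <- names(fread(infile, nrows = 0))   # read only the header row
n_rows     <- nrow(fread(infile, select = 1L))  # cheap row count using one column
chunk_summaries <- list()
for (offset in seq(0, n_rows - 1, by = chunk_size)) {
  chunk <- fread(infile, skip = offset + 1, nrows = chunk_size,
                 header = FALSE, col.names = col_names)
  # process each chunk independently, e.g. record its row count
  chunk_summaries[[length(chunk_summaries) + 1]] <- nrow(chunk)
}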
8.1.2 Efficient Data Handling Techniques
Let's dive into how to work with large data sets in R by applying some of the popular methods in data science.
8.1.2.1 Using data.table Package
The data.table package delivers an improved version of the data.frame structure. The data structure it provides (which shares the package's name, data.table) is high-performance and memory-efficient, making it better suited to large data sets than a plain data.frame.
Let's create a simple data.table from a random data set.
# Load the library
library(data.table)
# Create the data.table
student_dtb <- data.table(stuid = c(2, 5, 3, 4, 6, 7, 4, 2, 0),# student id
age = c(23, 45, 67, 23, 41, 43, 54, 67, 89)
)
student_dtb
## stuid age
## <num> <num>
## 1: 2 23
## 2: 5 45
## 3: 3 67
## 4: 4 23
## 5: 6 41
## 6: 7 43
## 7: 4 54
## 8: 2 67
## 9: 0 89
A data.frame is created in much the same way as a data.table:
# Create the data.frame
student_df <- data.frame(stuid = c(2, 5, 3, 4, 6, 7, 4, 2, 0), # student id
age = c(23, 45, 67, 23, 41, 43, 54, 67, 89)
)
student_df
##   stuid age
## 1     2  23
## 2     5  45
## 3     3  67
## 4     4  23
## 5     6  41
## 6     7  43
## 7     4  54
## 8     2  67
## 9     0  89
They look similar when printed, but they differ considerably in how they operate. Here are some of the differences:

| data.table | data.frame |
|---|---|
| Faster than data.frame | Up to about 20x slower than data.table |
| Suited to more complex data structures and big data | Suited to smaller tables and matrices |
| Has built-in features such as rolling joins and overlapping-range joins | Has fewer features but is beginner-friendly |
| Code-efficient (concise syntax) | Requires more code to get the same thing done |
| setDF(dt) converts it to a data.frame, where the argument dt is the data.table | setDT(df) converts it to a data.table, where the argument df is the data.frame |
| Syntax: data.table() | Syntax: data.frame() |
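As a quick illustration of the conversion functions listed in the table, the snippet below converts the student_df data frame created above to a data.table and back; both functions modify the object in place.
# Convert between the two structures in place
library(data.table)
setDT(student_df)   # student_df is now a data.table
class(student_df)
setDF(student_df)   # and back to a data.frame
class(student_df)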
There is a variety of data.table operations that can be applied to large data sets. Let's explore a few of them.
- Filtering rows
Rows in a data.table can be filtered by placing a condition in the first (row) argument. For instance, let's filter the table above to find the students older than 40.
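Using the student_dtb table created earlier, the filter is written as a condition inside the square brackets:
# Filter rows where age is greater than 40
student_dtb[age > 40]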
## stuid age
## <num> <num>
## 1: 5 45
## 2: 3 67
## 3: 6 41
## 4: 7 43
## 5: 4 54
## 6: 2 67
## 7: 0 89
- Selecting Specific Columns
You can select specific columns of a data.table in the second (column) argument. Let's select the age column.
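Using the same student_dtb table, the column is selected inside .() in the second argument:
# Select only the age column (returns a one-column data.table)
student_dtb[, .(age)]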
## age
## <num>
## 1: 23
## 2: 45
## 3: 67
## 4: 23
## 5: 41
## 6: 43
## 7: 54
## 8: 67
## 9: 89
- Aggregating data
Aggregation can be done with the help of functions like mean() and sum().
# Create a sample data.table
dt <- data.table(
ID = 1:10,
Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Hannah", "Isaac", "Jack"),
Age = c(25, 30, 35, 40, 28, 32, 30, 29, 33, 36),
Score = c(85, 90, 88, 92, 76, 95, 89, 78, 82, 91)
)
# Print the data.table
dt
## ID Name Age Score
## <int> <char> <num> <num>
## 1: 1 Alice 25 85
## 2: 2 Bob 30 90
## 3: 3 Charlie 35 88
## 4: 4 David 40 92
## 5: 5 Eve 28 76
## 6: 6 Frank 32 95
## 7: 7 Grace 30 89
## 8: 8 Hannah 29 78
## 9: 9 Isaac 33 82
## 10: 10 Jack 36 91
# Calculate the average score by Age
aggregated_dt <- dt[, .(Average_Score = mean(Score)), by = Age]
print(aggregated_dt)
## Age Average_Score
## <num> <num>
## 1: 25 85.0
## 2: 30 89.5
## 3: 35 88.0
## 4: 40 92.0
## 5: 28 76.0
## 6: 32 95.0
## 7: 29 78.0
## 8: 33 82.0
## 9: 36 91.0
- Chaining
Multiple operations can be combined for efficiency. In this case, we:
- filter to get the students with a score above 85
- aggregate the scores by computing the average per age
# Chain operations: Filter and aggregate
result_dt <- dt[Score > 85, .(Average_Score = mean(Score)), by = Age]
print(result_dt)
## Age Average_Score
## <num> <num>
## 1: 30 89.5
## 2: 35 88.0
## 3: 40 92.0
## 4: 32 95.0
## 5: 36 91.0
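For reference, the same result can also be written as two chained bracket calls, which is the more literal chaining form in data.table: the first bracket filters, the second aggregates the filtered rows.
# Equivalent chained form: filter first, then aggregate the result
result_dt_chained <- dt[Score > 85][, .(Average_Score = mean(Score)), by = Age]
print(result_dt_chained)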
Alternatively, we can filter to get the records with a score above 85 and select only the Name and Score columns.
# Filter and select columns
filtered_selected_dt <- dt[Score > 85, .(Name, Score)]
print(filtered_selected_dt)
## Name Score
## <char> <num>
## 1: Bob 90
## 2: Charlie 88
## 3: David 92
## 4: Frank 95
## 5: Grace 89
## 6: Jack 91
Practical exercise
In this exercise, you will use the US Accidents (2016 - 2023) data set, which can be downloaded from here. Download the data and extract it to a destination path of your choice. Use the data.table package to:
- Read the data set into R using the fread("filepath") function.
- Filter to get the rows with Severity greater than 2.
- Select the Source, Severity and Distance(mi) columns.
- Aggregate to find the distance for each Source.
8.1.2.2 Memory Management in R
A clear grasp of memory management in R helps the analyst or statistician predict how much computer memory a given task will need and make the most of the memory available. Proper memory management is also crucial for effective use of system resources, and hence for the program running smoothly. R objects are stored in Random Access Memory (RAM).
Let's explore some of the key techniques of memory management.
- Removing unused objects
Objects such as variables and data sets that are no longer required should be removed to free up memory. The rm() function is used to delete them.
You can then call ls() (with no arguments) to confirm that the deleted object no longer appears in the list of variables.
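A minimal illustration, using a throwaway vector x:
# Create a large object, then remove it to free memory
x <- rnorm(1e6)
rm(x)   # delete x from the workspace
ls()    # x no longer appears in the list of objects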
- Monitoring memory usage
R has several functions that help monitor memory usage. They help the programmer identify whether the session is consuming too much memory.
object.size() is one of the functions used to compute how much memory an object occupies.
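For example, a numeric vector of 100,000 random values occupies roughly 800 KB:
# Measure the memory used by a numeric vector of 100,000 values
x <- rnorm(1e5)
object.size(x)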
## 800048 bytes
The variable x consumes approximately 800048 bytes of memory.
In addition, you can apply memory profiling, that is, tracking memory usage while a script executes. Rprof() is R's built-in profiler and gets the job done in real time. Let's demonstrate:
# Start memory profiling
Rprof(memory.profiling = TRUE)
# Run some computations
x <- rnorm(1e6)
y <- rnorm(1e6)
# Stop profiling
Rprof(NULL)
# Summary of memory usage
summaryRprof()
## $by.self
## self.time self.pct total.time total.pct
## "rnorm" 0.34 100 0.34 100
##
## $by.total
## total.time total.pct self.time self.pct
## "rnorm" 0.34 100 0.34 100
## "block_exec" 0.34 100 0.00 0
## "call_block" 0.34 100 0.00 0
## "eng_r" 0.34 100 0.00 0
## "eval_with_user_handlers" 0.34 100 0.00 0
## "eval" 0.34 100 0.00 0
## "evaluate_call" 0.34 100 0.00 0
## "evaluate::evaluate" 0.34 100 0.00 0
## "evaluate" 0.34 100 0.00 0
## "generator$render" 0.34 100 0.00 0
## "handle" 0.34 100 0.00 0
## "in_dir" 0.34 100 0.00 0
## "in_input_dir" 0.34 100 0.00 0
## "knitr::knit" 0.34 100 0.00 0
## "process_file" 0.34 100 0.00 0
## "process_group" 0.34 100 0.00 0
## "render_book_script" 0.34 100 0.00 0
## "render_book" 0.34 100 0.00 0
## "render_cur_session" 0.34 100 0.00 0
## "rmarkdown::render_site" 0.34 100 0.00 0
## "rmarkdown::render" 0.34 100 0.00 0
## "timing_fn" 0.34 100 0.00 0
## "withCallingHandlers" 0.34 100 0.00 0
## "withVisible" 0.34 100 0.00 0
## "xfun:::handle_error" 0.34 100 0.00 0
##
## $sample.interval
## [1] 0.02
##
## $sampling.time
## [1] 0.34
Additionally, the pryr package provides further functions for monitoring memory usage. It can be installed with:
install.packages("pryr")
Below is a demonstration of how the pryr library works.
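One plausible sequence that produces output like the one shown below is given here as a sketch; the exact objects inspected are an assumption, with x taken from the profiling example above. mem_used() reports the total memory used by R objects and object_size() the size of a single object.
# Load pryr and inspect memory usage
library(pryr)
mem_used()       # total memory currently used by R objects
object_size(x)   # size of the vector x from the profiling example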
##
## Attaching package: 'pryr'
## The following object is masked from 'package:data.table':
##
## address
## The following object is masked from 'package:dplyr':
##
## where
## The following objects are masked from 'package:purrr':
##
## compose, partial
## 245 MB
## 8.00 MB
- Freeing up the Memory
R has an automatic memory management system; however, the programmer can request garbage collection explicitly. The gc() function cleans up unused memory, which is helpful after removing large objects.
# Collect the garbage to free up unused memory
gc()
Calling gc() triggers garbage collection immediately rather than waiting for R to free up the memory automatically.
Practical exercise
After working on the previous practical exercise, manage the memory by:
- Finding the objects that fill up the memory and listing them.
- Deleting the variables that are not required.
- Using gc() to trigger garbage collection.
8.1.3 Reading and Writing Large Files
8.1.3.1 Optimized File I/O
The data.table package offers highly optimized functions, fread() and fwrite(), that are much faster for reading and writing large files. They provide significant speedups thanks to efficient file handling and parallel processing. The I/O processes can be optimized through:
- Efficient reading
- Efficient writing
- Parallel processing for faster import/export
- Efficient Reading
The fread() function is designed to read large CSV files much faster than read.csv() by using multiple threads and optimized parsing. Here is how you can read a file:
# Load the library
library(data.table)
# Read large csv file
data <- fread("GlobalLandTemperaturesByCity.csv")
Download the csv file from here, extract the downloaded archive and read the first csv file.
These are the perks of reading data with the fread() function:
- Parallel Reading: Uses multiple threads to read files faster.
- Automatic Type Detection: Automatically detects column types without needing to pre-specify them.
- Handling Big Data: Efficient for reading gigabytes of data.
- Efficient Writing
A large data.table can be saved locally using the fwrite() function. It handles large data volumes efficiently and supports various formats.
# Writing a data frame to a CSV file using fwrite()
fwrite(data, "output_file.csv")
The "output_file"
is the destination path where the data.table is written.
These are the features of fwrite()
;
- Fast Export: Writes data using optimized I/O to speed up the process.
- Compression: Supports compression for efficient storage.
- Parallel Writing: Uses multiple threads to write files faster.
- Parallel Processing for faster Import/Export
fread() and fwrite() use parallel processing by default, but you can configure the number of threads used for reading and writing with setDTthreads(). Let's now read and write the data using 4 threads.
# Set the number of threads to use for parallel processing
setDTthreads(4)
# Now fread() and fwrite() will use 4 threads
data <- fread("large_dataset.csv")
fwrite(data, "output_file.csv")
Practical exercise
In this exercise, you are required to use the US Accidents (2016 - 2023) data set and solve the problems below:
- Read the data set using fread().
- Time fread() when reading the file. Hint: replace "large_dataset.csv" with the file path of the downloaded file.
# Timing fread() for a large file
system.time(data <- fread("large_dataset.csv"))
- Add a new column to the data set by converting the temperature from \(^\circ F\) (degrees Fahrenheit) to \(^\circ C\) (degrees Celsius).
- Write the file locally and check the time taken to write it. Hint: here is how you can achieve steps iii and iv.
# Do the Calculations
# Timing fwrite() for a large file
data[, new_column := rnorm(nrow(data))] # Add a new column
system.time(fwrite(data, "large_output.csv"))
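For the temperature-conversion step, assuming the column in the data set is named Temperature(F), one possible data.table expression is:
# Convert degrees Fahrenheit to degrees Celsius (column names are assumptions)
data[, `Temperature(C)` := (`Temperature(F)` - 32) * 5 / 9]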
8.1.4 Hands-On Exercises
Download the Amazon Fine Food Reviews data set from here and work on the following problems:
- Remove any unwanted objects from the R environment. Use gc() to view the garbage collection statistics.
- Read the data set using read.csv() and fread() and time each method. Compare the time taken by each function to read the file.
- Remove the data set from memory using rm() and call gc() again.
- Reread the data set using fread(), but control the number of threads. Set different thread values (e.g. 1, 2, 3, 4 and 5) and compare the time taken for processing.
- Write the file locally (use a different file name) and use gc() for garbage collection.
Finally, discuss the importance of scalability when dealing with real-world data.
8.2 Data Cleaning and Transformation
8.2.1 Introduction to Data Cleaning
8.2.1.1 Common Data Issues
Raw data from the field can be inaccurate and inconsistent, posing a threat to data operations and leading to false analysis and decision-making. This type of data is known as dirty, unclean or rogue data. Here are some of the common issues that render data unclean:
- Incomplete data: Some of the data points may be missing or left blank.
- Duplicate data: Some records in the data sets may occur multiple times in different sources.
- Outdated data: Values that were relevant when they were collected may now be obsolete and irrelevant.
- Inconsistent data formats: Data may be represented in different ways across multiple sources, for example as JSON records, SQL tables, NoSQL tables or even image formats (.png, .jpeg).
- Outlier values: Outliers are extremely high or low values in the data set, such as a student-age data set containing a student who is 304 years old, or a body-temperature data set containing values of \(-51^\circ C\). Such values are often impossible and can indicate errors, anomalies or exceptions in the data.
Data integrity is crucial in data analysis, and poor-quality data can hinder an organization's operations. The foundation of strategic planning, efficient operations and sound decision-making is accurate, complete and consistent data. Here are some of the threats that unclean data poses to data analytics and to businesses in general:
- Inconsistent Results: For instance, if a business finds conflicting or duplicate reports on customer behavior, this leads to a loss of trust in the results and their interpretation.
- Increased Costs: Companies spend a lot of time and money on data cleaning, processing and verification.
- Inaccurate Insights and Decision making: Poor data quality like incorrect values may lead to poor insights thereby leading to poor decision-making. For instance inaccurate sales data may lead to flawed forecasting, poor inventory management and poor marketing strategies.
- Missed Opportunities: Poor data quality may prevent the company from identifying key trends and opportunity gaps in the market like customer segments or emerging markets
8.2.1.2 Data Cleaning techniques
8.2.1.2.1 Handling Missing Data
Missing data values are inevitable, especially when integrating data from different sources. While there is no single, one-size-fits-all method to handle missing data points in R, researchers have come up with different methodologies to tackle the issue. Here are some of them:
- Deletion: a simple method that involves deleting all records/rows that contain null values. Alternatively, only rows that have null values in important columns can be deleted. na.omit() is one such method.
- Imputation: there are packages in R that can fill the null values by imputation. They use the remaining values in the data to build a model that estimates a value for the missing one. These packages include imputeR, Amelia and missForest, which use automated variable selection, bootstrap multiple imputation and single imputation respectively to handle null values.
- Use of algorithms that tolerate null values, for instance k-Nearest Neighbors (kNN) and Naive Bayes.
- Mean imputation: simply filling the null values with the average of the remaining values (see the sketch after this list).
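As a quick sketch of mean imputation on a single numeric vector (v is a made-up example):
# Replace missing values with the mean of the observed values
v <- c(4, NA, 8, 10, NA, 6)
v[is.na(v)] <- mean(v, na.rm = TRUE)
v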
Let's create a random data set and handle its missing values. Before the issue of null values can be fixed, however, the null values first need to be identified in the data set.
Create a random athlete data set
set.seed(123) # Set seed for reproducibility
# Generate a data frame with random data for athletes
athlete_data <- data.frame(
athlete_id = 1:100, # 100 athletes
name = paste0("Athlete", 1:100),
age = sample(18:40, 100, replace = TRUE),
sport = sample(c("Basketball", "Soccer", "Tennis", "Swimming", "Running"), 100, replace = TRUE),
country = sample(c("USA", "Kenya", "China", "Brazil", "Germany"), 100, replace = TRUE),
score = round(runif(100, 60, 100), 1) # Scores between 60 and 100
)
# Introduce 15% missing values randomly across the entire data frame
total_cells <- prod(dim(athlete_data)) # Total number of cells
missing_cells <- round(0.15 * total_cells) # 15% of cells will have missing values
# Randomly select indices to be set to NA
missing_indices <- sample(seq_len(total_cells), missing_cells)
# Convert the data frame to a matrix for easier manipulation of specific indices
athlete_data_matrix <- as.matrix(athlete_data)
athlete_data_matrix[missing_indices] <- NA
# Convert back to a data frame
athlete_data <- as.data.frame(athlete_data_matrix)
# View the dataset
head(athlete_data)
## athlete_id name age sport country score
## 1 1 Athlete1 32 Basketball Germany 95.6
## 2 2 Athlete2 36 Tennis <NA> 92.5
## 3 3 Athlete3 <NA> <NA> USA 89.9
## 4 4 Athlete4 20 <NA> Brazil 66.2
## 5 5 <NA> 27 Tennis Kenya 65.0
## 6 <NA> Athlete6 35 Running Germany 99.0
Let's count all the null values. is.na() from base R identifies the null values, while sum() adds up all the identified nulls.
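The total reported below comes from a single call on the athlete_data frame created above:
# Count all null values in the data set
sum(is.na(athlete_data))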
## [1] 90
Alternatively, the null values can be counted per column. The sapply() function is used along with is.na() and sum().
# Count the number of null values in each column
null_counts <- sapply(athlete_data, function(x) sum(is.na(x)))
# Display the counts of null values
null_counts
## athlete_id name age sport country score
## 17 18 10 14 18 13
Let's now delete the rows containing null values with na.omit() and name the resulting data frame athlete_data_clean.
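A single call does this, using the athlete_data frame from above:
# Drop every row that contains at least one null value
athlete_data_clean <- na.omit(athlete_data)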
Confirm the operation by counting the null values in each column and overall
# Count the null values by column
null_count_by_column <- sapply(athlete_data_clean,
function(x) sum(is.na(x)))
null_count_by_column
## athlete_id name age sport country score
## 0 0 0 0 0 0
# Overall null count
overall_null_count <- sum(is.na(athlete_data_clean))
print(paste("Overall Null Count: ", overall_null_count))
## [1] "Overall Null Count: 0"
Now that we have seen how null values can be handled by deleting the rows that contain them, let's perform imputation on the original data set instead. The null values will not be deleted; they will be replaced by filling them with the previous or next value.
We will use the original athlete_data, which still has null values. The tidyr package will be used; it can be installed with:
# If not already installed run the command below on the console
install.packages("tidyr")
Count the null values once again and view a sample of the data set.
# Count the null values by column
null_count_by_column <- sapply(athlete_data,
function(x) sum(is.na(x)))
null_count_by_column
## athlete_id name age sport country score
## 17 18 10 14 18 13
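A sample of the rows can be viewed with head():
# View the first few rows of the data set
head(athlete_data)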
## athlete_id name age sport country score
## 1 1 Athlete1 32 Basketball Germany 95.6
## 2 2 Athlete2 36 Tennis <NA> 92.5
## 3 3 Athlete3 <NA> <NA> USA 89.9
## 4 4 Athlete4 20 <NA> Brazil 66.2
## 5 5 <NA> 27 Tennis Kenya 65.0
## 6 <NA> Athlete6 35 Running Germany 99.0
Now fill the null values with the previous values, specifically in the sport, athlete_id and age columns.
# Import the package
library(tidyr)
# Fill the null value with the previous value
athlete_data_filled <- athlete_data %>%
fill(athlete_id, age, sport, .direction = "down")
# Count the null values by column
null_count_by_column <- sapply(athlete_data_filled,
function(x) sum(is.na(x)))
null_count_by_column
## athlete_id name age sport country score
## 0 18 0 0 18 13
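The first rows of the filled data can be previewed with head():
# View the first few rows of the filled data
head(athlete_data_filled)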
## athlete_id name age sport country score
## 1 1 Athlete1 32 Basketball Germany 95.6
## 2 2 Athlete2 36 Tennis <NA> 92.5
## 3 3 Athlete3 36 Tennis USA 89.9
## 4 4 Athlete4 20 Tennis Brazil 66.2
## 5 5 <NA> 27 Tennis Kenya 65.0
## 6 5 Athlete6 35 Running Germany 99.0
You can see that the target columns (age, athlete_id and sport) are now filled and have zero null values. The fill() function has an argument, .direction, that specifies the direction in which to fill the missing values. Here its value is "down" because each gap is filled with the previous value. It can also be "up" (fill with the next value), "downup" (fill down first, then up) or "updown" (fill up first, then down).
Practical exercise
Download the Detailed NFL Play-by-Play Data 2009-2018 data set from here. Read the data set using fread() and subset it to keep the game_date, drive, home_timeouts_remaining and defteam columns. Name the resulting data table subset_dT1. Perform the following operations on subset_dT1:
- Count the total null values.
- Create a copy of the data set and rename it subset_dT1_copy. Drop rows that contain any null value.
- Create another copy, subset_dT1_copy2, and impute the missing values based on the previous value (.direction = "down").
- Count the null values in each copy of the data set.
8.2.1.2.2 Dealing with Outliers
Outliers are values that are completely different from the other data points in the data set. They may be far smaller or far larger than the rest of the values. Outliers can also be non-realistic values, for instance a body temperature of \(143^\circ C\) or an athlete height of 0.23 meters. They are brought about by:
- Measurement error, such as human error when collecting the data.
- A sample that does not accurately represent the population being studied; for instance, soil samples picked near a cow shed will have unusually high acidity levels.
- A mixture of distributions caused by collecting data from two distinct populations.
- Nevertheless, outliers can also indicate new behavior in the data, especially when the outlying values are the most recently collected.
When cleaning data, it is advisable to remove outliers, as they can have a significant impact on subsequent analysis: skewed and biased results, misleading interpretations, an inflated range and standard deviation (giving a false picture of the data's variability), and erroneous assumptions.
Let’s generate a random data set of villagers’ health metrics (blood pressure, heart rate, body temperature, etc.) and use it to detect the outliers within.
- Generate a random data set
# Set seed for reproducibility
set.seed(123)
# Create a data frame with health-related metrics for villagers
health_data <- data.frame(
villager_id = 1:100,
body_temperature = c(rnorm(95, mean = 98.6, sd = 1), rnorm(5, mean = 102, sd = 1)), # Include 5 outliers
blood_pressure = c(rnorm(95, mean = 120, sd = 10), rnorm(5, mean = 180, sd = 5)), # Include 5 outliers
heart_rate = c(rnorm(95, mean = 70, sd = 8), rnorm(5, mean = 110, sd = 5)), # Include 5 outliers
cholesterol = c(rnorm(95, mean = 200, sd = 30), rnorm(5, mean = 300, sd = 10)) # Include 5 outliers
)
# View the first few rows of the dataset
head(health_data)
## villager_id body_temperature blood_pressure heart_rate cholesterol
## 1 1 98.03952 112.8959 87.59048 178.5427
## 2 2 98.36982 122.5688 80.49930 177.4193
## 3 3 100.15871 117.5331 67.87884 171.8438
## 4 4 98.67051 116.5246 74.34555 168.4246
## 5 5 98.72929 110.4838 66.68528 186.8852
## 6 6 100.31506 119.5497 66.19002 209.9354
- Find outliers using the IQR (interquartile range). Any value larger than \(Q_3 + 1.5 \times IQR\) or smaller than \(Q_1 - 1.5 \times IQR\), where \(Q_1\) and \(Q_3\) are the lower and upper quartiles, is considered an outlier.
# Function to identify outliers using IQR
identify_outliers_iqr <- function(data_column) {
Q1 <- quantile(data_column, 0.25, na.rm = TRUE)
Q3 <- quantile(data_column, 0.75, na.rm = TRUE)
IQR_value <- IQR(data_column, na.rm = TRUE)
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
return(data_column < lower_bound | data_column > upper_bound)
}
# Apply IQR method to identify outliers in each health metric
health_data$outlier_body_temp_iqr <- identify_outliers_iqr(health_data$body_temperature)
health_data$outlier_blood_pressure_iqr <- identify_outliers_iqr(health_data$blood_pressure)
health_data$outlier_heart_rate_iqr <- identify_outliers_iqr(health_data$heart_rate)
health_data$outlier_cholesterol_iqr <- identify_outliers_iqr(health_data$cholesterol)
# View the data with outlier flags for body temperature
health_data[health_data$outlier_body_temp_iqr, c("villager_id", "body_temperature")]
## villager_id body_temperature
## 96 96 101.3997
## 97 97 104.1873
## 98 98 103.5326
## 99 99 101.7643
- Find the outliers using the Z-score method. The Z-score method standardizes the data to have a mean of 0 and a standard deviation of 1. Outliers are values with a Z-score greater than 3 or less than -3.
# Function to identify outliers using the Z-score
identify_outliers_zscore <- function(data_column) {
z_scores <- scale(data_column)
return(abs(z_scores) > 3)
}
# Apply Z-score method to identify outliers in each health metric
health_data$outlier_body_temp_zscore <- identify_outliers_zscore(health_data$body_temperature)
health_data$outlier_blood_pressure_zscore <- identify_outliers_zscore(health_data$blood_pressure)
health_data$outlier_heart_rate_zscore <- identify_outliers_zscore(health_data$heart_rate)
health_data$outlier_cholesterol_zscore <- identify_outliers_zscore(health_data$cholesterol)
# View the data with Z-score outlier flags for blood pressure
health_data[health_data$outlier_blood_pressure_zscore, c("villager_id", "blood_pressure")]
## villager_id blood_pressure
## 96 96 189.9861
## 97 97 183.0035
## 98 98 173.7436
## 99 99 176.9442
## 100 100 174.0726
Practical exercise
Download Brazil’s House of Deputies Reimbursements data set from here. Read the data set, identify the numerical columns that contain outliers and remove the outlying values.
8.2.1.2.3 Data Transformation Techniques
- Normalization and Standardization
Normalization and standardization are both data preprocessing techniques that adjust the range of input values to make them easier to analyze and process. They are often used interchangeably; however, they have different applications and techniques. To be specific:
Normalization involves rescaling data to a specific range, for instance between 0 and 1, where the maximum value becomes 1, the minimum becomes 0 and all other values fall in between. Ranges of -1 to 1 and 0 to 100 are also commonly used. This technique is useful when the distribution of the data is unknown or not normal, but it is sensitive to outliers because of its restricted range.
Standardization, on the contrary, rescales the data values to have a mean of zero and a standard deviation of one. This technique is useful when the data are known to be normally distributed. It is less affected by outliers because it does not impose a restricted range. The Z-score is one of the most popular standardization methods; here is the formula: \[Z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}\]
Let's normalize/standardize a vector v1 <- c(1200, 34567, 3456, 12, 3456, 0985, 1211) in three easy ways.
- Min-Max Scaling (normalization)
The vector will be scaled to a range of 0 to 1 using the preProcess() function from the caret package.
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
v1 <- c(1200,34567,3456,12,3456,0985,1211)
# train the preprocesser(scaler)
minMaxScaler <- preProcess(as.data.frame(v1), method=c("range"))
# Apply the scaler
v1_scaled <- predict(minMaxScaler, as.data.frame(v1))
v1_scaled
## v1
## 1 0.03437997
## 2 1.00000000
## 3 0.09966720
## 4 0.00000000
## 5 0.09966720
## 6 0.02815801
## 7 0.03469831
- Using the scale() function (standardization)
# Create the vector
v1 <- c(1200,34567,3456,12,3456,0985,1211)
v1_scaled <- as.data.frame(scale(v1))
v1_scaled
## V1
## 1 -0.4175944
## 2 2.2556070
## 3 -0.2368546
## 4 -0.5127711
## 5 -0.2368546
## 6 -0.4348191
## 7 -0.4167131
- Log transform
The log transform is used when the data are unevenly distributed, for instance right-skewed: taking logs pulls in the long right tail, making the distribution more symmetric.
# Create the vector
v1 <- c(1200,34567,3456,12,3456,0985,1211)
log_scaled_v1 <- log(as.data.frame(v1))
log_scaled_v1
## v1
## 1 7.090077
## 2 10.450655
## 3 8.147867
## 4 2.484907
## 5 8.147867
## 6 6.892642
## 7 7.099202
- Encoding Categorical Variables
Encoding is the process of converting categorical variables to numerical variables during data preprocessing. Machine learning models only understand numerical values, so encoding is a crucial step. The most common methods are one-hot encoding and label encoding; we will discuss how to perform both in R.
- One-hot encoding
One-hot encoding represents each category as a separate column and fills it with 1 or 0 depending on whether that category is present in a given row. This prevents machine learning models from assuming any ordinality. Let's perform one-hot encoding on a color vector.
library(caret)
# Sample categorical variable
color_df <- data.frame(
Color = c("Red", "Blue", "Green", "Red", "Blue")
)
# dummify the data
color_dmy <- dummyVars(" ~ .", data = color_df)
new_color_df <- data.frame(predict(color_dmy, newdata = color_df))
new_color_df
## ColorBlue ColorGreen ColorRed
## 1 0 0 1
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
## 5 1 0 0
Alternatively, you can use model.matrix(), a built-in R function.
# Using model.matrix() to one-hot encode
color_df$Color <- as.factor(color_df$Color) # Convert Color to a factor
one_hot_encoded <- model.matrix(~ Color - 1, color_df)
# View the encoded matrix
one_hot_encoded
## ColorBlue ColorGreen ColorRed
## 1 0 0 1
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
## 5 1 0 0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Color
## [1] "contr.treatment"
One-hot encoding is preferred when the categories are nominal, for example colors or countries, since it does not assume any order between the categories.
- Label encoding
Label encoding assigns a unique integer to each category. It is simpler, but it may impose unintended ordinality.
# Sample categorical variable
color_df <- data.frame(
Color = c("Red", "Blue", "Green", "Red", "Blue")
)
# Label encoding
color_df$Color_encoded <- as.numeric(as.factor(color_df$Color))
# View the label encoded data
color_df
## Color Color_encoded
## 1 Red 3
## 2 Blue 1
## 3 Green 2
## 4 Red 3
## 5 Blue 1
Label encoding is preferred where the categories are ordinal, for instance education levels, living standards or age groups, that is, categories with a natural order.
Practical exercise
Using the same Brazil’s House of Deputies data set, encode the deputy_state and political_party columns to have numerical values.
8.2.2 Hands-On Exercises
In this exercise you will be required to use the NHANES data set from the NHANES package. Install the NHANES package with:
install.packages("NHANES")
Then import the library and the data set:
# Load the library
library(NHANES)
# Load the data set
data("NHANES")
Thereafter, you are required to answer the following questions:
- Count the total number of missing values in each column in the data set and find which column has the most null values.
sapply(NHANES, function(x) sum(is.na(x)))
- Remove the rows that have any missing values and store the result in a new data set.
- For columns with numeric data, impute missing values by replacing them with the mean or median of the column.
- For categorical columns (such as Gender or SmokeNow), impute missing values using the mode.
- For the BMI and Pulse columns, calculate the interquartile range (IQR) and flag any data points that fall outside the 1.5 * IQR range (both lower and upper bounds) as outliers.
- Remove rows where outliers are found in the BMI and Pulse columns.
- Apply Min-Max normalization to the BMI and Pulse columns, rescaling the data between 0 and 1.
- Use one-hot encoding to convert the Gender column into a numerical format (i.e., create separate columns for male and female).