Chapter 4 Data Manipulation
4.1 Basic Data Manipulation
4.1.1 Introduction to Data Manipulation
Data manipulation is the process of adjusting, organizing and transforming raw data into a more useful and suitable format for data analysis. These are some of the reasons that make data manipulation essential in the data analysis process:
Improving Data Quality
Raw data may be incomplete or messy, containing irrelevant information, errors or duplicates that need to be cleaned and rectified. This ensures the data is reliable, preventing incorrect conclusions or decisions.
Making Data Usable
Data is often collected from different sources and is not ready for analysis. Data manipulation transforms it into a structured, consistent format that is easy to analyze.
Enhancing Data Exploration
Clean data lets analysts explore the data and understand its different aspects.
Enabling Complex Analysis
Some types of analysis require data in a specific format or structure; for instance, time series analysis requires data sorted by date.
Supporting Decision Making
Data manipulation ensures that the data fed into decision-making models and reports is timely, accurate and reliable.
These are the key tasks in data manipulation:
- Cleaning: removing inaccurate and incomplete data entries.
- Filtering: selecting certain rows or columns based on given criteria.
- Reshaping: changing the structure of the data, for instance by pivoting.
- Merging: combining multiple data sets into one.
- Transforming: modifying existing data with mathematical or logical operations.
- Aggregation: summarizing the data with operations like sum, average and count.
4.1.2 Subsetting and Filtering Data
Subsetting is a data management strategy that involves creating a coherent slice of data from a larger data set for a specific use case. This topic is best explained practically, so we will use the Titanic data set. It contains information about the passengers on the Titanic, including their age, gender, passenger class, whether they survived, and other details. Since the data set is not included in base R, the titanic package is installed with:
install.packages("titanic")
Then load the package:
library(titanic)
The data set will be indexed using different indexing techniques, such as indexing a single element, a row, or a column.
First we load the data set and view the first few records before indexing.
data("titanic_train")
titanic <- titanic_train
head(titanic) # view the first few rows of the titanic data set
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
Subsetting by indexing can be done in three different ways:
- Extracting a row
- Extracting a column
- Extracting a single element
- Extract a row
When subsetting to extract a single row, the square brackets [ ] are used with the index of the row you want to extract. Let's extract all the information for the 10th passenger.
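A sketch of the call that yields the output below (the setup lines repeat the earlier loading so the snippet stands alone):

```r
library(titanic)          # provides titanic_train
titanic <- titanic_train  # as loaded earlier
titanic[10, ]             # all columns for the 10th passenger
```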
## PassengerId Survived Pclass Name Sex Age
## 10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14
## SibSp Parch Ticket Fare Cabin Embarked
## 10 1 0 237736 30.0708 C
A range of rows can also be subsetted with the format [i:j, ], where i is the starting index and j is the ending index. Let's extract rows 7 to 10 of the titanic data.
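Presumably the chunk that produced the output below was (setup lines repeated for self-containedness):

```r
library(titanic)
titanic <- titanic_train
titanic[7:10, ]  # rows 7 through 10, all columns
```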
## PassengerId Survived Pclass
## 7 7 0 1
## 8 8 0 3
## 9 9 1 3
## 10 10 1 2
## Name Sex Age SibSp Parch
## 7 McCarthy, Mr. Timothy J male 54 0 0
## 8 Palsson, Master. Gosta Leonard male 2 3 1
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## Ticket Fare Cabin Embarked
## 7 17463 51.8625 E46 S
## 8 349909 21.0750 S
## 9 347742 11.1333 S
## 10 237736 30.0708 C
- Extract a column
When subsetting to extract a single column, the square brackets [ ] are used as before, with the index or name of the column you want to extract. Let's extract the column Name.
titanic[, "Name"] # note the comma before "Name"
A column index can be used in place of the column name. For instance, "PassengerId" is the first column, so its index is 1. Let's subset the column by its index.
titanic[, 1] # note the comma before the column index
- Extracting a single element
To extract a single element at a defined position in a data frame, both the row index and the column name (or index) are supplied:
dataframe[row_index, column_name_or_index]
Let's extract the Name of the fifth passenger.
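The call behind the output below is presumably:

```r
library(titanic)
titanic <- titanic_train
titanic[5, "Name"]  # row 5, column "Name"
```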
## [1] "Allen, Mr. William Henry"
Instead of the column name, the column index can be used. In this data set, the column "Name" is the fourth column, so its index is 4.
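The equivalent call by index, which presumably produced the output below:

```r
library(titanic)
titanic <- titanic_train
titanic[5, 4]  # "Name" is the 4th column
```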
## [1] "Allen, Mr. William Henry"
A data set can also be subsetted by filtering on logical conditions to extract rows that meet certain criteria. These conditions use comparison operators such as >, <, == and !=, or logical operators such as & (and), | (or) and ! (not).
With the titanic data set we will:
- Filter based on a single condition
Let's find the passengers who survived on the Titanic.
survivors <- titanic[titanic$Survived == 1, ]
head(survivors) # view the first few rows of survivors
## PassengerId Survived Pclass
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 9 9 1 3
## 10 10 1 2
## 11 11 1 3
## Name Sex Age SibSp Parch
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## 11 Sandstrom, Miss. Marguerite Rut female 4 1 1
## Ticket Fare Cabin Embarked
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 9 347742 11.1333 S
## 10 237736 30.0708 C
## 11 PP 9549 16.7000 G6 S
The data set above consists of the Titanic passengers who survived.
Who were the passengers who boarded first class on the Titanic?
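A sketch of the chunk behind the output below (the name first_class is illustrative):

```r
library(titanic)
titanic <- titanic_train
first_class <- titanic[titanic$Pclass == 1, ]
head(first_class)  # first few first-class passengers
```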
## PassengerId Survived Pclass
## 2 2 1 1
## 4 4 1 1
## 7 7 0 1
## 12 12 1 1
## 24 24 1 1
## 28 28 0 1
## Name Sex Age SibSp Parch
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 7 McCarthy, Mr. Timothy J male 54 0 0
## 12 Bonnell, Miss. Elizabeth female 58 0 0
## 24 Sloper, Mr. William Thompson male 28 0 0
## 28 Fortune, Mr. Charles Alexander male 19 3 2
## Ticket Fare Cabin Embarked
## 2 PC 17599 71.2833 C85 C
## 4 113803 53.1000 C123 S
## 7 17463 51.8625 E46 S
## 12 113783 26.5500 C103 S
## 24 113788 35.5000 A6 S
## 28 19950 263.0000 C23 C25 C27 S
In the examples above, the extracted data met a single condition.
- Filtering based on Multiple Conditions
Data can be subsetted by filtering on more than one condition. To demonstrate this, let's find the female passengers who survived. Here there are two conditions:
- the passenger must be female,
- the passenger must have survived.
The resulting data set must meet both conditions.
female_survivors <- titanic[titanic$Sex == "female" & titanic$Survived == 1, ]
head(female_survivors) #view the first few rows
## PassengerId Survived Pclass
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 9 9 1 3
## 10 10 1 2
## 11 11 1 3
## Name Sex Age SibSp Parch
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## 11 Sandstrom, Miss. Marguerite Rut female 4 1 1
## Ticket Fare Cabin Embarked
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 9 347742 11.1333 S
## 10 237736 30.0708 C
## 11 PP 9549 16.7000 G6 S
Let's add one more condition: the female survivor must be under 18.
minor_female_survivors <- titanic[titanic$Sex == "female" &
titanic$Survived == 1 &
titanic$Age < 18, ] # the comma comes after all the conditions
head(minor_female_survivors)
## PassengerId Survived Pclass Name Sex Age
## 10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14
## 11 11 1 3 Sandstrom, Miss. Marguerite Rut female 4
## NA NA NA NA <NA> <NA> NA
## 23 23 1 3 McGowan, Miss. Anna "Annie" female 15
## NA.1 NA NA NA <NA> <NA> NA
## NA.2 NA NA NA <NA> <NA> NA
## SibSp Parch Ticket Fare Cabin Embarked
## 10 1 0 237736 30.0708 C
## 11 1 1 PP 9549 16.7000 G6 S
## NA NA NA <NA> NA <NA> <NA>
## 23 0 0 330923 8.0292 Q
## NA.1 NA NA <NA> NA <NA> <NA>
## NA.2 NA NA <NA> NA <NA> <NA>
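The rows labelled NA appear because Age is missing for some passengers: in base R subsetting, a condition that evaluates to NA yields an NA placeholder row instead of dropping it. One way around this is to wrap the conditions in which(), which discards NA results (a sketch, reusing the variable name from above):

```r
library(titanic)
titanic <- titanic_train
# which() returns only the row numbers where the condition is TRUE,
# so rows with a missing Age are silently dropped
minor_female_survivors <- titanic[which(titanic$Sex == "female" &
                                        titanic$Survived == 1 &
                                        titanic$Age < 18), ]
head(minor_female_survivors)
```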
- Filtering using Negation
The != sign is a logical operator used to negate a condition. Let's use it to find the passengers who did not survive.
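The output below (showing the last few rows) presumably comes from a chunk like:

```r
library(titanic)
titanic <- titanic_train
non_survivors <- titanic[titanic$Survived != 1, ]
tail(non_survivors)  # last few passengers who did not survive
```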
## PassengerId Survived Pclass Name Sex
## 884 884 0 2 Banfield, Mr. Frederick James male
## 885 885 0 3 Sutehall, Mr. Henry Jr male
## 886 886 0 3 Rice, Mrs. William (Margaret Norton) female
## 887 887 0 2 Montvila, Rev. Juozas male
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## 891 891 0 3 Dooley, Mr. Patrick male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 884 28 0 0 C.A./SOTON 34068 10.500 S
## 885 25 0 0 SOTON/OQ 392076 7.050 S
## 886 39 0 5 382652 29.125 Q
## 887 27 0 0 211536 13.000 S
## 889 NA 1 2 W./C. 6607 23.450 S
## 891 32 0 0 370376 7.750 Q
Alternatively you can use
non_survivors <- titanic[!titanic$Survived == 1, ]
Let's also find the passengers who were not in third class.
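A sketch of the chunk behind the output below (the name not_third_class matches the alternative shown afterwards):

```r
library(titanic)
titanic <- titanic_train
not_third_class <- titanic[titanic$Pclass != 3, ]
head(not_third_class)
```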
## PassengerId Survived Pclass
## 2 2 1 1
## 4 4 1 1
## 7 7 0 1
## 10 10 1 2
## 12 12 1 1
## 16 16 1 2
## Name Sex Age SibSp Parch
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 7 McCarthy, Mr. Timothy J male 54 0 0
## 10 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0
## 12 Bonnell, Miss. Elizabeth female 58 0 0
## 16 Hewlett, Mrs. (Mary D Kingcome) female 55 0 0
## Ticket Fare Cabin Embarked
## 2 PC 17599 71.2833 C85 C
## 4 113803 53.1000 C123 S
## 7 17463 51.8625 E46 S
## 10 237736 30.0708 C
## 12 113783 26.5500 C103 S
## 16 248706 16.0000 S
Alternatively
not_third_class <- titanic[!titanic$Pclass == 3, ]
4.1.3 Sorting Data
Sorting is the ordering of elements in a data structure (vectors, lists, matrices and data frames) based on particular criteria. It is a fundamental operation in data analysis, as it organizes data in a meaningful way for easier visualization and interpretation. Base R provides several functions for sorting:
sort()
- Let's create a vector v with five elements
- Sort the elements in descending order
v = c(43, 82, 11, 73, 34) # Create a vector
v1 = sort(v, decreasing = TRUE) #sort the elements in a descending order
v1
## [1] 82 73 43 34 11
- To order the same vector in ascending order, the decreasing argument is set to FALSE.
v = c(43, 82, 11, 73, 34) # Create a vector
v2 = sort(v, decreasing = FALSE) #sort the elements in an ascending order
v2
## [1] 11 34 43 73 82
- Character vectors can also be sorted alphabetically. For instance, let's sort the names "Alice", "Charlie", "Bob" into alphabetical order.
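The call behind the output below is presumably:

```r
names <- c("Alice", "Charlie", "Bob")
sort(names)  # ascending (alphabetical) order by default
```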
## [1] "Alice" "Bob" "Charlie"
- Alternatively, the names can be ordered in reverse alphabetical order by setting the decreasing argument to TRUE.
names <- c("Alice", "Charlie", "Bob", "Zach")
names_1 <- sort(names, decreasing = TRUE) # order in reverse alphabetical order
names_1
## [1] "Zach" "Charlie" "Bob" "Alice"
order()
This function returns the indices that would sort a vector. For instance, let's order the vector v = c(43, 82, 11, 73, 34) in ascending order (smallest to largest). The smallest value, 11, sits at position 3, so order() returns 3 first; the largest value, 82, sits at position 2, so 2 comes last.
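The call that produces the output below:

```r
v <- c(43, 82, 11, 73, 34)
order(v)  # indices that would sort v in ascending order
```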
## [1] 3 5 1 4 2
Let's repeat the process, but this time ordering the indices of the vector v in descending order.
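Presumably via:

```r
v <- c(43, 82, 11, 73, 34)
order(v, decreasing = TRUE)  # indices that would sort v in descending order
```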
## [1] 2 4 1 5 3
You can see that index 2, which holds the value 82 in the original vector, comes first, while index 3, which holds the value 11, comes last.
The default value of the decreasing argument is FALSE, so order(v) without it gives the ascending result:
order(v) # decreasing defaults to FALSE
## [1] 3 5 1 4 2
rank()
Returns the rank of each element in a vector. The smallest element is ranked 1 (here, 11) while the largest element is ranked last (82 is ranked 5 here).
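The call behind the output below:

```r
v <- c(43, 82, 11, 73, 34)
rank(v)  # rank of each element: 11 -> 1, 82 -> 5
```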
## [1] 3 5 1 4 2
rev()
This function simply reverses the order of elements: the first element of a vector becomes last and the last becomes first.
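Presumably via:

```r
v <- c(43, 82, 11, 73, 34)
rev(v)  # reverse the element order
```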
## [1] 34 73 11 82 43
- Sorting Data Frames
A data frame can be sorted in ascending or descending order of a given column using order(). For instance, let's sort the titanic data set by age in ascending order.
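A sketch of the chunk behind the output below (the name titanic_by_age is illustrative):

```r
library(titanic)
titanic <- titanic_train
titanic_by_age <- titanic[order(titanic$Age), ]  # ascending by Age
head(titanic_by_age)
```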
## PassengerId Survived Pclass Name Sex Age
## 804 804 1 3 Thomas, Master. Assad Alexander male 0.42
## 756 756 1 2 Hamalainen, Master. Viljo male 0.67
## 470 470 1 3 Baclini, Miss. Helene Barbara female 0.75
## 645 645 1 3 Baclini, Miss. Eugenie female 0.75
## 79 79 1 2 Caldwell, Master. Alden Gates male 0.83
## 832 832 1 2 Richards, Master. George Sibley male 0.83
## SibSp Parch Ticket Fare Cabin Embarked
## 804 0 1 2625 8.5167 C
## 756 1 1 250649 14.5000 S
## 470 2 1 2666 19.2583 C
## 645 2 1 2666 19.2583 C
## 79 0 2 248738 29.0000 S
## 832 1 1 29106 18.7500 S
To sort the titanic data by age in descending order, a minus sign is placed in front of titanic$Age, giving -titanic$Age.
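Presumably via (titanic_by_age_desc is an illustrative name):

```r
library(titanic)
titanic <- titanic_train
titanic_by_age_desc <- titanic[order(-titanic$Age), ]  # descending by Age
head(titanic_by_age_desc)
```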
## PassengerId Survived Pclass Name Sex Age
## 631 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80.0
## 852 852 0 3 Svensson, Mr. Johan male 74.0
## 97 97 0 1 Goldschmidt, Mr. George B male 71.0
## 494 494 0 1 Artagaveytia, Mr. Ramon male 71.0
## 117 117 0 3 Connors, Mr. Patrick male 70.5
## 673 673 0 2 Mitchell, Mr. Henry Michael male 70.0
## SibSp Parch Ticket Fare Cabin Embarked
## 631 0 0 27042 30.0000 A23 S
## 852 0 0 347060 7.7750 S
## 97 0 0 PC 17754 34.6542 A5 C
## 494 0 0 PC 17609 49.5042 C
## 117 0 0 370369 7.7500 Q
## 673 0 0 C.A. 24580 10.5000 S
Data frames can also be sorted by multiple columns. Let's sort the titanic data set by passenger class (Pclass) in ascending order and by age in descending order at once.
# Pclass in ascending order, Age in descending order
titanic_sorted_by_class_and_age <- titanic[order(titanic$Pclass, -titanic$Age), ]
head(titanic_sorted_by_class_and_age)
## PassengerId Survived Pclass Name Sex Age
## 631 631 1 1 Barkworth, Mr. Algernon Henry Wilson male 80
## 97 97 0 1 Goldschmidt, Mr. George B male 71
## 494 494 0 1 Artagaveytia, Mr. Ramon male 71
## 746 746 0 1 Crosby, Capt. Edward Gifford male 70
## 55 55 0 1 Ostby, Mr. Engelhart Cornelius male 65
## 457 457 0 1 Millet, Mr. Francis Davis male 65
## SibSp Parch Ticket Fare Cabin Embarked
## 631 0 0 27042 30.0000 A23 S
## 97 0 0 PC 17754 34.6542 A5 C
## 494 0 0 PC 17609 49.5042 C
## 746 1 1 WE/P 5735 71.0000 B22 S
## 55 0 1 113509 61.9792 B30 C
## 457 0 0 13509 26.5500 E38 S
4.1.4 Basic Data Cleaning
Data cleaning is the process of fixing or removing incorrect, incomplete or otherwise problematic data from a data set. It is a crucial step in data analysis, as it leads to more reliable analyses and insights.
Here are some commonly used data cleaning techniques:
- Handling Missing Data
Null values are identified with the is.na() function. If null values exist, they can be removed with the na.omit() function, or replaced with appropriate substitutes such as the mean, median, zero or a placeholder; this replacement is referred to as imputation.
Let's create a vector my_vector with null values, identify the null values, and remove or impute them.
# Create a vector that has missing values
my_vector <- c(12, 43, NA, 32, 65, 11, NA, NA, 34, 98, 57) # NA is the missing value
# Identify existence of the null values
is.na(my_vector)
## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
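The remaining outputs below (the NA count, the na.omit() result, and the mean-imputed vector) presumably come from a chunk like this; the helper name imputed is illustrative:

```r
my_vector <- c(12, 43, NA, 32, 65, 11, NA, NA, 34, 98, 57)
# Count the null values
sum(is.na(my_vector))
# Remove the null values
na.omit(my_vector)
# Impute: replace each NA with the mean of the observed values (44 here)
imputed <- my_vector
imputed[is.na(imputed)] <- mean(my_vector, na.rm = TRUE)
imputed
```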
## [1] 3
## [1] 12 43 32 65 11 34 98 57
## attr(,"na.action")
## [1] 3 7 8
## attr(,"class")
## [1] "omit"
## [1] 12 43 44 32 65 11 44 44 34 98 57
- Removing Duplicates
A raw data set may contain duplicated entries/records/rows that skew the results of an analysis, leading to poor insights and decision-making. Duplicates are identified with the duplicated() function and, if present, removed by subsetting or by calling the unique() function.
Let's create a data frame with duplicate values and remove them.
# Creating a simple data frame
data <- data.frame(
Position = c(1, 2, 3, 4, 5, 3), # Notice the duplicate position
Name = c("Alice", "Bob", "Charlie", "David", "Eva", "Charlie"),
Age = c(25, 30, NA, 40, 29, NA), # NA represents null values
Score = c(85, 90, 88, NA, 92, 88) # NA represents null values
) # Position 3 is duplicated
# Display the data frame
print(data)
## Position Name Age Score
## 1 1 Alice 25 85
## 2 2 Bob 30 90
## 3 3 Charlie NA 88
## 4 4 David 40 NA
## 5 5 Eva 29 92
## 6 3 Charlie NA 88
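The "CLEAN DATA" banner and the deduplicated table that follow presumably come from a chunk like this (duplicated() flags repeats of earlier rows; clean_data is an illustrative name; the data frame is recreated so the snippet stands alone):

```r
data <- data.frame(
  Position = c(1, 2, 3, 4, 5, 3),
  Name = c("Alice", "Bob", "Charlie", "David", "Eva", "Charlie"),
  Age = c(25, 30, NA, 40, 29, NA),
  Score = c(85, 90, 88, NA, 92, 88)
)
print("CLEAN DATA")
clean_data <- data[!duplicated(data), ]  # keep the first occurrence of each row
print(clean_data)
```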
## [1] "CLEAN DATA"
## Position Name Age Score
## 1 1 Alice 25 85
## 2 2 Bob 30 90
## 3 3 Charlie NA 88
## 4 4 David 40 NA
## 5 5 Eva 29 92
The duplicated rows can also be removed with:
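A sketch using unique(), recreating the same data frame so the snippet stands alone:

```r
data <- data.frame(
  Position = c(1, 2, 3, 4, 5, 3),
  Name = c("Alice", "Bob", "Charlie", "David", "Eva", "Charlie"),
  Age = c(25, 30, NA, 40, 29, NA),
  Score = c(85, 90, 88, NA, 92, 88)
)
unique(data)  # drops the duplicated row
```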
## Position Name Age Score
## 1 1 Alice 25 85
## 2 2 Bob 30 90
## 3 3 Charlie NA 88
## 4 4 David 40 NA
## 5 5 Eva 29 92
- Handling Outliers
Outliers are extreme values that differ markedly from the other data points in a data set. They can be detected by identifying an upper bound and a lower bound, for example using boxplots. They are removed by:
clean_data <- raw_data[raw_data$column < upper_bound & raw_data$column > lower_bound, ]
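A sketch of how the bounds themselves can be computed with the common 1.5 × IQR rule; the data here is illustrative, not from the chapter's data sets:

```r
raw <- c(10, 12, 11, 14, 13, 12, 98)   # 98 is an obvious outlier
q <- quantile(raw, c(0.25, 0.75))      # first and third quartiles
iqr <- IQR(raw)                        # interquartile range
lower_bound <- q[1] - 1.5 * iqr
upper_bound <- q[2] + 1.5 * iqr
clean <- raw[raw < upper_bound & raw > lower_bound]
clean  # 98 has been removed
```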
- Standardizing Data
There are numerous ways data can be standardized, such as ensuring a consistent date format across the data set or correcting letter case.
Let's create a vector with different date formats and make it consistent.
# Creating a vector with different date formats
date_vector <- c("2023-08-01", "01/08/2023", "August 1, 2023", "20230801", "08-01-2023")
# Converting the date_vector to a consistent format
clean_date_vector <- as.Date(date_vector, format="%Y-%m-%d")
# Display the clean date vector
print(clean_date_vector)
## [1] "2023-08-01" NA NA NA "8-01-20"
The lubridate package is designed to handle dates more conveniently; below, after attaching it, each element is parsed with the format string that matches it.
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Correcting each date format using lubridate functions
clean_date_vector <- c(
as.Date(date_vector[1], format="%Y-%m-%d"), # "2023-08-01"
as.Date(date_vector[2], format="%d/%m/%Y"), # "01/08/2023"
as.Date(date_vector[3], format="%B %d, %Y"), # "August 1, 2023"
as.Date(date_vector[4], format="%Y%m%d"), # "20230801"
as.Date(date_vector[5], format="%m-%d-%Y") # "08-01-2023"
)
# Display the clean date vector
print(clean_date_vector)
## [1] "2023-08-01" "2023-08-01" NA "2023-08-01" "2023-08-01"
Practical exercise
You will use the car sales data set found here to:
- Delete all the records that have any null value
- Remove duplicated observations
4.1.5 Hands-on Exercises
Download the Bengaluru Restaurants Dataset from here and use it for the activities below.
- Subsetting and Filtering
- Subset the data set to include only restaurants with a rating of 4.0 and above.
- Filter out restaurants that have zero reviews or null ratings.
- Sorting
- Sort the data set by restaurant ratings in descending order.
- Sort the data set by the number of reviews a restaurant received, with the highest reviews appearing last
- Data Cleaning
- Identify and remove rows with missing or null values in any key columns (e.g., ratings, cost, cuisine type).
4.2 Data Manipulation with dplyr
4.2.1 Introduction to the dplyr package
dplyr is a package designed for data manipulation, equipped with a set of intuitive functions for tasks like filtering rows, selecting columns, rearranging data and summarizing information. The package is part of a larger collection, the tidyverse: a set of packages designed for data science that share an underlying design philosophy, grammar and data structures. The tidyverse packages are widely used for data manipulation, exploration, and visualization in R. Here are some of its core packages:
ggplot2
dplyr
tidyr
readr
purrr
tibble
The tidyverse package is installed by
install.packages("tidyverse")
To load the package into the session:
library(tidyverse)
## ── Attaching core tidyverse packages ───────────────────
## ✔ dplyr 1.1.4 ✔ stringr 1.5.1
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.1 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In this course we will focus on dplyr, since it is an essential tool in data analysis. If you want to use dplyr on its own, it can be installed with:
install.packages("dplyr")
To load the library into the session:
library(dplyr)
4.2.2 Key Functions in dplyr
Data scientists rely on a handful of core dplyr functions, referred to as dplyr verbs. To explain these verbs we will use the famous iris data set, a favourite of beginners in data science. It contains measurements of various characteristics of iris flowers: sepal length, sepal width, petal length, and petal width. There are three species of iris flowers in the data set: setosa, versicolor, and virginica.
The data must be loaded into R before assessment and wrangling.
Let's load the iris data and explore the first few records.
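A sketch of the loading chunk that produces the output below:

```r
data(iris)  # built into base R
head(iris)  # first six rows
```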
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
4.2.3 select
This dplyr verb is used to select or drop specific columns. In this lesson we will list the iris column names and then pick some of them using select.
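The column names shown below come from a call like:

```r
data(iris)
names(iris)  # column names of the iris data frame
```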
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Remember that the data frame to work on must be specified as the first argument:
selected_data = select(data_frame, col1, col2, col3)
Therefore, we will select the columns; Species, Petal length and petal width.
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
selected_iris_data = select(iris,
Petal.Length, Petal.Width, Species)
# view the first few rows of the selected data
head(selected_iris_data)
## Petal.Length Petal.Width Species
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## 4 1.5 0.2 setosa
## 5 1.4 0.2 setosa
## 6 1.7 0.4 setosa
The three selected columns are displayed in the data frame above.
Specific columns can be dropped by putting - before the column name:
# Drop specified columns
remaining_data = select(data_frame, -col1_to_drop, -col2_to_drop)
In this lesson we will drop the Petal.Length, Petal.Width and Species columns:
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# Drop some columns
remaining_iris_data = select(iris,
-Petal.Length, -Petal.Width, -Species)
# view the first few rows of the selected data
head(remaining_iris_data)
## Sepal.Length Sepal.Width
## 1 5.1 3.5
## 2 4.9 3.0
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5.0 3.6
## 6 5.4 3.9
Practical exercise
You will be required to use the car_sales
data set from https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/car_sales.csv. Read the data using read.csv and select the columns car, price, body, mileage, engV, engType, year, model. Save the resulting data frame as selected_cars_df and show its first few rows.
4.2.4 filter
filter() is a dplyr verb used to extract the records in a data frame that meet a specific condition. It allows the analyst to retrieve only the records of interest and work with a smaller subset.
With filter(), the data frame and the condition are passed as arguments:
# Filtering rows where a certain column meets a condition
filtered_data = filter(data_frame,
column_name > 5) # this is the condition
Let's select the species 'setosa' from the iris data set.
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# Filter to select Setosa
setosa_iris = filter(iris, # the data frame
Species == "setosa" # the condition
)
# First few records of setosa data
head(setosa_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Records with a sepal width greater than 3.0 can also be filtered. Here is how to obtain such a subset.
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# Filtered to select records with more than 3.0 sepal width
wide_sepal_iris = filter(iris, #the data frame
Sepal.Width>3.0 # the condition
)
head(wide_sepal_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.7 3.2 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 5.0 3.6 1.4 0.2 setosa
## 5 5.4 3.9 1.7 0.4 setosa
## 6 4.6 3.4 1.4 0.3 setosa
Practical exercise
With the car_sales data set that you used above, use the filter() function to get the cars that were sold from 2008 to the present and name them latest_car_sales. Count the number of observations and show the first few rows.
4.2.5 arrange
arrange() is the dplyr verb used for sorting rows into a specific order. Here is how to use the arrange() function:
arranged_data = arrange(data_frame, column_name)
This sorts the data in the default ascending order. To arrange in descending order, the desc() function is added:
# Note the additional desc function
arranged_data = arrange(data_frame, desc(column_name))
Now let's order the iris data in ascending order of petal length and view the 6 records with the shortest petals.
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# Sort the data
by_petal_length = arrange(iris, # data frame
Petal.Length # order by column
)
# View the data
head(by_petal_length)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.6 3.6 1.0 0.2 setosa
## 2 4.3 3.0 1.1 0.1 setosa
## 3 5.8 4.0 1.2 0.2 setosa
## 4 5.0 3.2 1.2 0.2 setosa
## 5 4.7 3.2 1.3 0.2 setosa
## 6 5.4 3.9 1.3 0.4 setosa
Let's repeat the process, this time ordering the data in descending order.
# Sort the data
by_petal_length = arrange(iris, # data frame
desc(Petal.Length) # order by column
)
# View the data
head(by_petal_length)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.7 2.6 6.9 2.3 virginica
## 2 7.7 3.8 6.7 2.2 virginica
## 3 7.7 2.8 6.7 2.0 virginica
## 4 7.6 3.0 6.6 2.1 virginica
## 5 7.9 3.8 6.4 2.0 virginica
## 6 7.3 2.9 6.3 1.8 virginica
Practical exercise
Arrange the rows of the car_sales data set by mileage in descending order. Show the last few rows.
4.2.6 mutate
mutate() is a dplyr verb used to modify existing variables or create new variables in a data set.
For example, we can take the log of Sepal.Length in the iris data.
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# modify Sepal.Length
new_iris = mutate(iris, Sepal.Length=log(Sepal.Length))
head(new_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 1.629241 3.5 1.4 0.2 setosa
## 2 1.589235 3.0 1.4 0.2 setosa
## 3 1.547563 3.2 1.3 0.2 setosa
## 4 1.526056 3.1 1.5 0.2 setosa
## 5 1.609438 3.6 1.4 0.2 setosa
## 6 1.686399 3.9 1.7 0.4 setosa
Additionally, an entirely new variable can be created with mutate(). Here we compute the ratio between petal length and petal width; the new variable will be called "Petal.Length.Width.Ratio".
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# Create a new column in the data set
new_iris = mutate(iris,
Petal.Length.Width.Ratio = Petal.Length/Petal.Width)
head(new_iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Petal.Length.Width.Ratio
## 1 7.00
## 2 7.00
## 3 6.50
## 4 7.50
## 5 7.00
## 6 4.25
"Petal.Length.Width.Ratio" is found by dividing Petal.Length by Petal.Width.
Practical exercise
Using the car_sales data set, create a new column, "distance_covered_km", calculated from mileage: just multiply mileage by 1.609. Show the first few rows of the mutated data frame.
4.2.7 Summarise
To calculate summary statistics such as the average, median or maximum, the summarise() function is used. It collapses multiple rows into a single summary row; for instance, calculating the mean petal width:
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# Calculate the mean petal width
summarise(iris,
mean_petal_width=mean(Petal.Width))
## mean_petal_width
## 1 1.199333
To find the mean petal width for each iris species:
- the iris data will be grouped by species
- a mean value for each group will be calculated
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# To find the mean petal width for each iris species;
# - the iris data will be grouped by species
# - a mean value for each group will be calculated
grouped_iris = group_by(iris, Species)
mean_petal_widths = summarise(grouped_iris, mean_value=mean(Petal.Width))
mean_petal_widths
## # A tibble: 3 × 2
## Species mean_value
## <fct> <dbl>
## 1 setosa 0.246
## 2 versicolor 1.33
## 3 virginica 2.03
Practical exercise
You will be required to use the car_sales data set once again. Calculate descriptive statistics using the summarise() command.
4.2.8 group_by
group_by() is a function used to group the records in a data frame by one or more variables, allowing the analyst to form groups based on given criteria. It is often chained together with summarise().
Let's group the iris data by the Species variable and find the mean of every other variable.
# Load the required libraries
library(dplyr)
# Load the iris data
data(iris)
# Group the iris based on their Species
iris_groups = group_by(iris, Species) %>%
summarise(across(everything(), \(x) mean(x, na.rm = TRUE)))
iris_groups
## # A tibble: 3 × 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
Such groupings let the analyst drill down to a finer level and uncover insights that would not be visible when analyzing the data set as a whole.
Practical exercise
Use the car_sales data set provided before to work on this activity. Load the data and group the sales by model to get the sum of every quantitative feature/variable. Name the resultant data frame car_sales_by_model and display it. Hint: use across(where(is.numeric), sum) as an argument to summarise() instead of the mean used above to find the sum of the quantitative variables.
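The across(where(is.numeric), ...) pattern from the hint can be tried on the iris data first. A sketch (the result name `iris_sums` is chosen here for illustration):

```r
library(dplyr)

# Sum every numeric column within each Species group;
# where(is.numeric) selects only the quantitative variables,
# so the grouping column Species is left untouched
iris_sums <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), sum))
iris_sums
```

Replacing iris with car_sales and Species with model gives the structure the exercise asks for.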
4.3 Chaining
Chaining is the process of combining several operations using the %>% (forward pipe) operator. Each step in a chained workflow passes its result on to the next until the whole process is done. To understand chaining, the mtcars (Motor Trend cars) data set will be used. mtcars is a well-known data set containing several attributes of 32 different cars from 1974. Here's a brief explanation of the variables in the mtcars data set:
- mpg: Miles per gallon (fuel efficiency).
- cyl: Number of cylinders.
- disp: Displacement (cubic inches).
- hp: Horsepower.
- drat: Rear axle ratio.
- wt: Weight (in 1000 lbs).
- qsec: Quarter-mile time (in seconds).
- vs: Engine type (0 = V-shaped, 1 = Straight).
- am: Transmission type (0 = Automatic, 1 = Manual).
- gear: Number of forward gears.
- carb: Number of carburetors.
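Before building longer chains, it helps to see what the pipe actually does: `x %>% f(y)` is just another way of writing `f(x, y)`; the left-hand side becomes the first argument of the function on the right. A minimal illustration:

```r
library(dplyr)

# Without the pipe: the data frame is passed as the first argument
head(select(mtcars, mpg, hp))

# With the pipe: mtcars flows into select(), whose result flows into head()
mtcars %>% select(mpg, hp) %>% head()

# Both calls produce exactly the same six rows
```

Reading a pipeline left to right ("take mtcars, then select, then head") is usually much easier than unpicking nested function calls from the inside out.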
Let's start by loading the data into the program and viewing its first few records;
# Load the data
data(mtcars)
# View the first few records
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Let's select the 6 most important columns in this analysis;
# Load the library
library(dplyr)
# Load the data
data(mtcars)
# Lets `select` 6 most important columns in this analysis
cars1 = mtcars %>% select(mpg, cyl, disp, hp, qsec, am)
head(cars1)
## mpg cyl disp hp qsec am
## Mazda RX4 21.0 6 160 110 16.46 1
## Mazda RX4 Wag 21.0 6 160 110 17.02 1
## Datsun 710 22.8 4 108 93 18.61 1
## Hornet 4 Drive 21.4 6 258 110 19.44 0
## Hornet Sportabout 18.7 8 360 175 17.02 0
## Valiant 18.1 6 225 105 20.22 0
Let's now filter to find vehicles with an automatic transmission. The filter verb will be chained to the select verb with %>%.
# Load the library
library(dplyr)
# Load the data
data(mtcars)
# Select and filter chained together
cars2 = mtcars %>% select(mpg, cyl, disp, hp, qsec, am) %>%
filter(am==0)
head(cars2)
## mpg cyl disp hp qsec am
## Hornet 4 Drive 21.4 6 258.0 110 19.44 0
## Hornet Sportabout 18.7 8 360.0 175 17.02 0
## Valiant 18.1 6 225.0 105 20.22 0
## Duster 360 14.3 8 360.0 245 15.84 0
## Merc 240D 24.4 4 146.7 62 20.00 0
## Merc 230 22.8 4 140.8 95 22.90 0
All these vehicles have an automatic transmission; let's rank them according to horsepower in descending order.
# Load the library
library(dplyr)
# Load the data
data(mtcars)
# Select, filter and arrange chained together
cars3 = mtcars %>% select(mpg, cyl, disp, hp, qsec, am, wt) %>%
filter(am==0) %>%
arrange(desc(hp))
head(cars3)
## mpg cyl disp hp qsec am wt
## Duster 360 14.3 8 360.0 245 15.84 0 3.570
## Camaro Z28 13.3 8 350.0 245 15.41 0 3.840
## Chrysler Imperial 14.7 8 440.0 230 17.42 0 5.345
## Lincoln Continental 10.4 8 460.0 215 17.82 0 5.424
## Cadillac Fleetwood 10.4 8 472.0 205 17.98 0 5.250
## Merc 450SE 16.4 8 275.8 180 17.40 0 4.070
A new column of weight in 1000 kgs (wt_1000kgs) can be created by dividing the weight in 1000 lbs by 2.20462. The mutate verb will also be chained.
# Load the library
library(dplyr)
# Load the data
data(mtcars)
# Multiple chains
cars4 = mtcars %>% select(mpg, cyl, disp, hp, qsec, am, wt) %>%
filter(am==0) %>%
arrange(desc(hp)) %>%
mutate(wt_1000kgs=wt/2.20462)
head(cars4)
## mpg cyl disp hp qsec am wt wt_1000kgs
## Duster 360 14.3 8 360.0 245 15.84 0 3.570 1.619327
## Camaro Z28 13.3 8 350.0 245 15.41 0 3.840 1.741797
## Chrysler Imperial 14.7 8 440.0 230 17.42 0 5.345 2.424454
## Lincoln Continental 10.4 8 460.0 215 17.82 0 5.424 2.460288
## Cadillac Fleetwood 10.4 8 472.0 205 17.98 0 5.250 2.381363
## Merc 450SE 16.4 8 275.8 180 17.40 0 4.070 1.846123
The process above has shown how chaining works in dplyr. Many functions/processes can be chained together to manipulate data into the desired output. The next section applies chaining to a practical data set and poses a few questions that will cement your understanding of R.
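As a closing sketch, the verbs covered above can be combined with group_by() and summarise() in a single pipeline. The example below (the names `hp_by_cyl`, `mean_hp` and `n_cars` are chosen here for illustration) computes the mean horsepower per cylinder count for the automatic-transmission cars:

```r
library(dplyr)

# One pipeline: filter, group, summarise, then order the result
hp_by_cyl <- mtcars %>%
  filter(am == 0) %>%             # automatic transmission only
  group_by(cyl) %>%               # one group per cylinder count
  summarise(mean_hp = mean(hp),   # average horsepower per group
            n_cars  = n()) %>%    # how many cars in each group
  arrange(desc(mean_hp))          # strongest group first
hp_by_cyl
```

Each verb consumes the previous step's output, so the whole analysis reads top to bottom as a single sentence.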
4.3.1 Hands-on Exercises
You will be required to download the Furniture Sales Dataset from here. Use the data set to answer the questions below;
1. Perform the following tasks using a combination of select(), filter(), arrange(), mutate(), and group_by():
- Select the columns "category", "sales", and "profit_margin". Filter the data set to include only orders with a "sales" amount greater than 25, and arrange the data by "profit_margin" in descending order.
- Group the data by "category" and calculate the total sales and profit_margin for each category. Arrange the results by total sales in ascending order.
2. Answer the following questions by chaining two or three dplyr verbs:
- What store type had products made of fabric material sold?
- What is the average "profit margin" for each "category" of products?