Chapter 4 Data Manipulation

4.1 Basic Data Manipulation

4.1.1 Introduction to Data Manipulation

Data manipulation is the adjusting, organizing and transforming of raw data into a more useful and suitable format for data analysis. These are some of the reasons that make data manipulation essential in the data analysis process:

  1. Improving Data Quality
    Raw data may be incomplete, messy, or contain irrelevant information, errors or duplicates that need to be cleaned and rectified. This ensures the data is reliable, thereby preventing incorrect conclusions or decisions.

  2. Making Data Usable
    Data is often collected from different sources in a form that is not ready for analysis. Data manipulation transforms it into a structured, consistent format for easy analysis.

  3. Enhancing Data Exploration
    Cleaning the data allows analysts to explore it more effectively and understand its different aspects.

  4. Enabling Complex Analysis
    Some types of analysis require data in a specific format or structure; for instance, time series analysis requires data sorted by date.

  5. Supporting Decision Making
    Data manipulation ensures that the data fed into decision-making models and reports is timely, accurate and reliable.

These are the key tasks in data manipulation:

  • Cleaning: removing inaccurate and incomplete data entries.
  • Filtering: selecting certain rows or columns based on specific criteria.
  • Reshaping: changing the structure of the data, for instance by pivoting.
  • Merging: combining multiple data sets into one.
  • Transforming: modifying existing data with mathematical or logical operations.
  • Aggregation: summarizing the data with operations such as sum, average and count.

4.1.2 Subsetting and Filtering Data

Subsetting is a data management strategy that involves creating a coherent slice of data from a larger data set for a specific use case. This topic is best explained practically, so we will use the titanic data set. The data set contains information about the passengers on the Titanic, including their age, gender, passenger class, whether they survived and other details. Since the titanic data set is not part of base R, the titanic package is installed with:

install.packages("titanic")

Load the library:

library("titanic")

The data set will be indexed using different indexing techniques, such as single-element, row and column indexing.

First, we load the data set and view the first few records before indexing.

data("titanic_train")

titanic <- titanic_train 
head(titanic) # view the first few rows of the titanic data set
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Subsetting by indexing can be done in three different ways:

  1. Extracting a Row
  2. Extracting a column
  3. Extracting a Single Element
  1. Extracting a Row

When subsetting to extract a single row, the square brackets [ ] are used with the position of the index you want to extract. Let's extract all the information for the 10th passenger.

titanic[10, ] # note the comma after the index 10
##    PassengerId Survived Pclass                                Name    Sex Age
## 10          10        1      2 Nasser, Mrs. Nicholas (Adele Achem) female  14
##    SibSp Parch Ticket    Fare Cabin Embarked
## 10     1     0 237736 30.0708              C

More rows can also be subsetted with the format [i:j, ], where i is the starting index and j is the ending index. Let's extract the rows of the titanic data from index 7 to 10.

titanic[7:10, ]
##    PassengerId Survived Pclass
## 7            7        0      1
## 8            8        0      3
## 9            9        1      3
## 10          10        1      2
##                                                 Name    Sex Age SibSp Parch
## 7                            McCarthy, Mr. Timothy J   male  54     0     0
## 8                     Palsson, Master. Gosta Leonard   male   2     3     1
## 9  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0     2
## 10               Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
##    Ticket    Fare Cabin Embarked
## 7   17463 51.8625   E46        S
## 8  349909 21.0750              S
## 9  347742 11.1333              S
## 10 237736 30.0708              C
  2. Extracting a Column

When subsetting to extract a single column, the square brackets [ ] are used as before, with the index or name of the column you want to extract. Let's extract all the information in the column Name.

titanic[, "Name"] # note the comma before "Name"

An index can be used in place of the column name. For instance, the column "PassengerId" is the first column, so its index is 1. Let's subset the column by calling its index.

titanic[, 1] # note the comma before the column index

  3. Extracting a Single Element

To extract a single element with a defined position in a data frame, both the row index and the column name/index are supplied:

dataframe[row_index, column_index_or_name]

Let's extract the name of the fifth passenger.

titanic[5, "Name"]
## [1] "Allen, Mr. William Henry"

Instead of using the column name, let's use the column index. In this data set, the column "Name" is the fourth column, so its index is 4.

titanic[5, 4]
## [1] "Allen, Mr. William Henry"

A data set can also be subsetted by filtering on logical conditions to extract rows that meet certain criteria. These conditions use comparison operators such as >, <, ==, != and logical operators such as & (and), | (or) and ! (not).

Using the titanic data set we will:

  1. Filtering Based on a Single Condition

Let's find the passengers who survived on the Titanic.

survivors <- titanic[titanic$Survived == 1, ]
head(survivors) # view the first few rows of survivors
##    PassengerId Survived Pclass
## 2            2        1      1
## 3            3        1      3
## 4            4        1      1
## 9            9        1      3
## 10          10        1      2
## 11          11        1      3
##                                                   Name    Sex Age SibSp Parch
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                               Heikkinen, Miss. Laina female  26     0     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0     2
## 10                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
## 11                     Sandstrom, Miss. Marguerite Rut female   4     1     1
##              Ticket    Fare Cabin Embarked
## 2          PC 17599 71.2833   C85        C
## 3  STON/O2. 3101282  7.9250              S
## 4            113803 53.1000  C123        S
## 9            347742 11.1333              S
## 10           237736 30.0708              C
## 11          PP 9549 16.7000    G6        S

The above data set consists of the Titanic passengers who survived.

Who were the passengers who boarded first class on the Titanic?

first_class_passengers <- titanic[titanic$Pclass == 1, ]
head(first_class_passengers)
##    PassengerId Survived Pclass
## 2            2        1      1
## 4            4        1      1
## 7            7        0      1
## 12          12        1      1
## 24          24        1      1
## 28          28        0      1
##                                                   Name    Sex Age SibSp Parch
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 7                              McCarthy, Mr. Timothy J   male  54     0     0
## 12                            Bonnell, Miss. Elizabeth female  58     0     0
## 24                        Sloper, Mr. William Thompson   male  28     0     0
## 28                      Fortune, Mr. Charles Alexander   male  19     3     2
##      Ticket     Fare       Cabin Embarked
## 2  PC 17599  71.2833         C85        C
## 4    113803  53.1000        C123        S
## 7     17463  51.8625         E46        S
## 12   113783  26.5500        C103        S
## 24   113788  35.5000          A6        S
## 28    19950 263.0000 C23 C25 C27        S

In the above examples, the extracted data sets met a single condition.

  2. Filtering Based on Multiple Conditions

Data can be subsetted by filtering on more than one condition. To demonstrate this, let's find the female passengers who survived. Here there are two conditions:

  • the passenger must be a female,
  • the passenger must have survived.

The resultant data set must meet both of the above conditions.

female_survivors <- titanic[titanic$Sex == "female" & titanic$Survived == 1, ]

head(female_survivors) #view the first few rows 
##    PassengerId Survived Pclass
## 2            2        1      1
## 3            3        1      3
## 4            4        1      1
## 9            9        1      3
## 10          10        1      2
## 11          11        1      3
##                                                   Name    Sex Age SibSp Parch
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                               Heikkinen, Miss. Laina female  26     0     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0     2
## 10                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
## 11                     Sandstrom, Miss. Marguerite Rut female   4     1     1
##              Ticket    Fare Cabin Embarked
## 2          PC 17599 71.2833   C85        C
## 3  STON/O2. 3101282  7.9250              S
## 4            113803 53.1000  C123        S
## 9            347742 11.1333              S
## 10           237736 30.0708              C
## 11          PP 9549 16.7000    G6        S

Let's also add one more condition: the female survivor must be under 18.

minor_female_survivors <- titanic[titanic$Sex == "female" & 
                                    titanic$Survived == 1 &
                                    titanic$Age < 18, ] # the comma comes after all the conditions

head(minor_female_survivors)
##      PassengerId Survived Pclass                                Name    Sex Age
## 10            10        1      2 Nasser, Mrs. Nicholas (Adele Achem) female  14
## 11            11        1      3     Sandstrom, Miss. Marguerite Rut female   4
## NA            NA       NA     NA                                <NA>   <NA>  NA
## 23            23        1      3         McGowan, Miss. Anna "Annie" female  15
## NA.1          NA       NA     NA                                <NA>   <NA>  NA
## NA.2          NA       NA     NA                                <NA>   <NA>  NA
##      SibSp Parch  Ticket    Fare Cabin Embarked
## 10       1     0  237736 30.0708              C
## 11       1     1 PP 9549 16.7000    G6        S
## NA      NA    NA    <NA>      NA  <NA>     <NA>
## 23       0     0  330923  8.0292              Q
## NA.1    NA    NA    <NA>      NA  <NA>     <NA>
## NA.2    NA    NA    <NA>      NA  <NA>     <NA>
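Notice the rows of NA values above: when Age is missing, the comparison titanic$Age < 18 evaluates to NA, and base R keeps such rows as all-NA records. A minimal sketch that drops them by adding an explicit !is.na() condition:

minor_female_survivors <- titanic[titanic$Sex == "female" &
                                    titanic$Survived == 1 &
                                    !is.na(titanic$Age) &   # drop passengers with a missing Age
                                    titanic$Age < 18, ]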
  3. Filtering Using Negation

The != sign is a logical operator used to negate a condition. Let's use it to find the passengers who did not survive.

non_survivors <- titanic[titanic$Survived != 1, ]
tail(non_survivors) # view the last few records 
##     PassengerId Survived Pclass                                     Name    Sex
## 884         884        0      2            Banfield, Mr. Frederick James   male
## 885         885        0      3                   Sutehall, Mr. Henry Jr   male
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch           Ticket   Fare Cabin Embarked
## 884  28     0     0 C.A./SOTON 34068 10.500              S
## 885  25     0     0  SOTON/OQ 392076  7.050              S
## 886  39     0     5           382652 29.125              Q
## 887  27     0     0           211536 13.000              S
## 889  NA     1     2       W./C. 6607 23.450              S
## 891  32     0     0           370376  7.750              Q

Alternatively, you can use:

non_survivors <- titanic[!titanic$Survived == 1, ]

Also, let's find the passengers who were not in third class.

not_third_class <- titanic[titanic$Pclass != 3, ]
head(not_third_class)
##    PassengerId Survived Pclass
## 2            2        1      1
## 4            4        1      1
## 7            7        0      1
## 10          10        1      2
## 12          12        1      1
## 16          16        1      2
##                                                   Name    Sex Age SibSp Parch
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 7                              McCarthy, Mr. Timothy J   male  54     0     0
## 10                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1     0
## 12                            Bonnell, Miss. Elizabeth female  58     0     0
## 16                    Hewlett, Mrs. (Mary D Kingcome)  female  55     0     0
##      Ticket    Fare Cabin Embarked
## 2  PC 17599 71.2833   C85        C
## 4    113803 53.1000  C123        S
## 7     17463 51.8625   E46        S
## 10   237736 30.0708              C
## 12   113783 26.5500  C103        S
## 16   248706 16.0000              S

Alternatively

not_third_class <- titanic[!titanic$Pclass == 3, ]
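Base R also provides the subset() helper, which applies a filtering condition (and optionally selects columns) in one call; a brief sketch:

# subset() keeps rows where the condition is TRUE (rows where the condition is NA are dropped)
not_third_class <- subset(titanic, Pclass != 3)
survivor_details <- subset(titanic, Survived == 1, select = c(Name, Sex, Age))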

4.1.3 Sorting Data

Sorting is the ordering of elements in a data set (vectors, lists, matrices and data frames) based on a particular criterion. This is a fundamental operation in data analysis, as it organizes data in a meaningful way for easier visualization and interpretation. These are several functions in base R used for sorting:

  1. sort()
  • Let's create a vector v with five elements
  • Sort the elements in descending order
v = c(43, 82, 11, 73, 34) # Create a vector
v1 = sort(v, decreasing = TRUE) #sort the elements in a descending order
v1
## [1] 82 73 43 34 11
  • To order the same vector in ascending order, the decreasing argument is set to FALSE.
v = c(43, 82, 11, 73, 34) # Create a vector
v2 = sort(v, decreasing = FALSE) #sort the elements in an ascending order
v2
## [1] 11 34 43 73 82
  • Character vectors can also be sorted alphabetically. For instance, let's sort the names "Alice", "Charlie", "Bob" in alphabetical order.
names <- c("Alice", "Charlie", "Bob")
sorted_names <- sort(names)
sorted_names
## [1] "Alice"   "Bob"     "Charlie"
  • Alternatively, the names can be ordered in reverse alphabetical order by setting the decreasing argument to TRUE.
names <- c("Alice", "Charlie", "Bob", "Zach")
names_1 <- sort(names, decreasing = TRUE) # order in reverse alphabetical order
names_1
## [1] "Zach"    "Charlie" "Bob"     "Alice"
  2. order()

This function returns the indices that would sort a vector. For instance, let's sort the vector v = c(43, 82, 11, 73, 34) in ascending order (from smallest to largest). The smallest number, 11, sits at position 3 of the original vector, so order() returns 3 first; the largest number, 82, sits at position 2, so 2 is returned last.

v = c(43, 82, 11, 73, 34)
order(v, decreasing = FALSE)
## [1] 3 5 1 4 2

Let's repeat the process, but this time order the indices of the vector v in descending order.

v = c(43, 82, 11, 73, 34)
order(v, decreasing = TRUE)
## [1] 2 4 1 5 3

You can see that index 2, which holds the value 82 in the original vector, comes first, while index 3, which holds the value 11, comes last.

The default value for the decreasing argument is FALSE.

v = c(43, 82, 11, 73, 34)
order(v) # no `decreasing` argument
## [1] 3 5 1 4 2
  3. rank()

rank() returns the rank of each element in a vector or list. The smallest element is ranked 1 (in this case 11), while the largest element is ranked last (82 is ranked 5 here).

v = c(43, 82, 11, 73, 34)
rank(v, ties.method = "average", na.last = TRUE)
## [1] 3 5 1 4 2
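The ties.method argument matters when values repeat: with the default "average", tied elements share the mean of the ranks they span. A small example:

w = c(10, 20, 20, 30)
rank(w)                      # tied values share the average rank
## [1] 1.0 2.5 2.5 4.0
rank(w, ties.method = "min") # tied values take the smallest rank instead
## [1] 1 2 2 4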
  4. rev()

This function simply reverses the order of the elements: the first element of a vector becomes the last, while the last becomes the first.

v = c(43, 82, 11, 73, 34)
rev(v)
## [1] 34 73 11 82 43
  5. Sorting Data Frames

A data frame can be sorted in ascending or descending order of a certain column. For instance, we will sort the titanic data set by age in ascending order.

titanic_by_age <- titanic[order(titanic$Age), ]
head(titanic_by_age)
##     PassengerId Survived Pclass                            Name    Sex  Age
## 804         804        1      3 Thomas, Master. Assad Alexander   male 0.42
## 756         756        1      2       Hamalainen, Master. Viljo   male 0.67
## 470         470        1      3   Baclini, Miss. Helene Barbara female 0.75
## 645         645        1      3          Baclini, Miss. Eugenie female 0.75
## 79           79        1      2   Caldwell, Master. Alden Gates   male 0.83
## 832         832        1      2 Richards, Master. George Sibley   male 0.83
##     SibSp Parch Ticket    Fare Cabin Embarked
## 804     0     1   2625  8.5167              C
## 756     1     1 250649 14.5000              S
## 470     2     1   2666 19.2583              C
## 645     2     1   2666 19.2583              C
## 79      0     2 248738 29.0000              S
## 832     1     1  29106 18.7500              S

To sort the titanic data by age in descending order, a - sign is added in front of the argument titanic$Age to give -titanic$Age.

titanic_by_age <- titanic[order(-titanic$Age), ] # note the - sign
head(titanic_by_age)
##     PassengerId Survived Pclass                                 Name  Sex  Age
## 631         631        1      1 Barkworth, Mr. Algernon Henry Wilson male 80.0
## 852         852        0      3                  Svensson, Mr. Johan male 74.0
## 97           97        0      1            Goldschmidt, Mr. George B male 71.0
## 494         494        0      1              Artagaveytia, Mr. Ramon male 71.0
## 117         117        0      3                 Connors, Mr. Patrick male 70.5
## 673         673        0      2          Mitchell, Mr. Henry Michael male 70.0
##     SibSp Parch     Ticket    Fare Cabin Embarked
## 631     0     0      27042 30.0000   A23        S
## 852     0     0     347060  7.7750              S
## 97      0     0   PC 17754 34.6542    A5        C
## 494     0     0   PC 17609 49.5042              C
## 117     0     0     370369  7.7500              Q
## 673     0     0 C.A. 24580 10.5000              S

Data frames can also be sorted by multiple columns. Let's sort the titanic data set by passenger class (Pclass) in ascending order and by age in descending order at once.

# Pclass in ascending order, Age in descending order
titanic_sorted_by_class_and_age <- titanic[order(titanic$Pclass, -titanic$Age), ]

head(titanic_sorted_by_class_and_age)
##     PassengerId Survived Pclass                                 Name  Sex Age
## 631         631        1      1 Barkworth, Mr. Algernon Henry Wilson male  80
## 97           97        0      1            Goldschmidt, Mr. George B male  71
## 494         494        0      1              Artagaveytia, Mr. Ramon male  71
## 746         746        0      1         Crosby, Capt. Edward Gifford male  70
## 55           55        0      1       Ostby, Mr. Engelhart Cornelius male  65
## 457         457        0      1            Millet, Mr. Francis Davis male  65
##     SibSp Parch    Ticket    Fare Cabin Embarked
## 631     0     0     27042 30.0000   A23        S
## 97      0     0  PC 17754 34.6542    A5        C
## 494     0     0  PC 17609 49.5042              C
## 746     1     1 WE/P 5735 71.0000   B22        S
## 55      0     1    113509 61.9792   B30        C
## 457     0     0     13509 26.5500   E38        S

4.1.4 Basic Data Cleaning

Data cleaning is the process of fixing or removing incorrect, incomplete or otherwise problematic data from a data set. It is a crucial step in data analysis as it leads to more reliable analyses and insights.

Here are some of the data cleaning techniques used:

  1. Handling Missing Data

Missing (null) values are identified with the is.na() function. If any exist, they can be removed with the na.omit() function. They can also be replaced with appropriate substitutes such as the mean, median, zero or a placeholder; this is referred to as imputation.
Let's create a vector my_vector with missing values, identify them, and then remove or impute them.

# Create a vector that has missing values
my_vector <- c(12, 43, NA, 32, 65, 11, NA, NA, 34, 98, 57) # NA is the missing value

# Identify existence of the null values
is.na(my_vector)
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
# Count the null values 
sum(is.na(my_vector))
## [1] 3
# remove null values 
clean_vector <- na.omit(my_vector)
clean_vector
## [1] 12 43 32 65 11 34 98 57
## attr(,"na.action")
## [1] 3 7 8
## attr(,"class")
## [1] "omit"
# impute missing values
my_vector[is.na(my_vector)] <- mean(my_vector, na.rm = TRUE)
my_vector
##  [1] 12 43 44 32 65 11 44 44 34 98 57
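The mean is sensitive to extreme values, so the median is sometimes preferred for imputation; a short sketch on a fresh copy of the vector:

my_vector <- c(12, 43, NA, 32, 65, 11, NA, NA, 34, 98, 57)     # recreate the raw vector
my_vector[is.na(my_vector)] <- median(my_vector, na.rm = TRUE) # impute with the median (38.5)
my_vector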
  2. Removing Duplicates

A raw data set may contain duplicated entries/records/rows that give false results in data analysis, leading to poor insights and decision-making. Duplicates are identified with the duplicated() function. If any exist, they are removed by subsetting or by calling the unique() function.

Let's create a data frame with duplicate values and remove them.

# Creating a simple data frame
data <- data.frame(
  Position = c(1, 2, 3, 4, 5, 3),  # Notice the duplicate position
  Name = c("Alice", "Bob", "Charlie", "David", "Eva", "Charlie"),
  Age = c(25, 30, NA, 40, 29, NA),  # NA represents null values
  Score = c(85, 90, 88, NA, 92, 88)  # NA represents null values
) # Position 3 is duplicated

# Display the data frame
print(data)
##   Position    Name Age Score
## 1        1   Alice  25    85
## 2        2     Bob  30    90
## 3        3 Charlie  NA    88
## 4        4   David  40    NA
## 5        5     Eva  29    92
## 6        3 Charlie  NA    88
print("CLEAN DATA")
## [1] "CLEAN DATA"
# Remove the duplicate row
clean_data <- unique(data)
clean_data
##   Position    Name Age Score
## 1        1   Alice  25    85
## 2        2     Bob  30    90
## 3        3 Charlie  NA    88
## 4        4   David  40    NA
## 5        5     Eva  29    92

The duplicated rows can also be removed with:

clean_data2 <- data[!duplicated(data), ]
clean_data2
##   Position    Name Age Score
## 1        1   Alice  25    85
## 2        2     Bob  30    90
## 3        3 Charlie  NA    88
## 4        4   David  40    NA
## 5        5     Eva  29    92
  3. Handling Outliers

Outliers are extreme values that differ from most other data points in a data set. They can be detected by identifying the upper and lower bounds, for instance with a boxplot. They are removed by:

clean_data <- raw_data[raw_data$column < upper_bound & raw_data$column > lower_bound, ]
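As a concrete sketch, the bounds can be computed with the common 1.5 × IQR rule; here the Fare column of the titanic data is assumed as the example column:

# Compute 1.5 * IQR bounds for Fare and keep only the values inside them
q1 <- quantile(titanic$Fare, 0.25, na.rm = TRUE)
q3 <- quantile(titanic$Fare, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
titanic_fare_trimmed <- titanic[titanic$Fare > lower_bound &
                                  titanic$Fare < upper_bound, ]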
  4. Standardizing Data

There are numerous ways that data can be standardized, such as ensuring a consistent date format across the data set or correcting letter case.

Let's create a vector with different date formats and make it consistent.

# Creating a vector with different date formats
date_vector <- c("2023-08-01", "01/08/2023", "August 1, 2023", "20230801", "08-01-2023")

# Converting the date_vector to a consistent format
clean_date_vector <- as.Date(date_vector, format="%Y-%m-%d")

# Display the clean date vector
print(clean_date_vector)
## [1] "2023-08-01" NA           NA           NA           "8-01-20"

A single format string only parses the dates that match it, which is why most entries above came back as NA. One fix is to supply the matching format for each element; the lubridate package, loaded below, also offers more convenient parsers (see the sketch after this block).

# Load necessary library
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# Correcting each date by supplying its matching format string to as.Date()
clean_date_vector <- c(
  as.Date(date_vector[1], format="%Y-%m-%d"),       # "2023-08-01"
  as.Date(date_vector[2], format="%d/%m/%Y"),       # "01/08/2023"
  as.Date(date_vector[3], format="%B %d, %Y"),      # "August 1, 2023"
  as.Date(date_vector[4], format="%Y%m%d"),         # "20230801"
  as.Date(date_vector[5], format="%m-%d-%Y")        # "08-01-2023"
)

# Display the clean date vector
print(clean_date_vector)
## [1] "2023-08-01" "2023-08-01" NA           "2023-08-01" "2023-08-01"

Practical exercise

You will use the car sales data set found here to:

  1. Delete all the records that have any null value
  2. Remove duplicated observations

4.1.5 Hands-on Exercises

Download the Bengaluru Restaurants Dataset from here and use it for the activities below.

  1. Subsetting and Filtering
  • Subset the data set to include only restaurants with a rating of 4.0 and above.
  • Filter out restaurants that have zero reviews or null ratings.
  2. Sorting
  • Sort the data set by restaurant ratings in descending order.
  • Sort the data set by the number of reviews each restaurant received, with the highest number of reviews appearing last.
  3. Data Cleaning
  • Identify and remove rows with missing or null values in any key columns (e.g., ratings, cost, cuisine type).

4.2 Data Manipulation with dplyr

4.2.1 Introduction to the dplyr package

dplyr is a package designed for data manipulation, equipped with a set of intuitive functions for tasks such as filtering rows, selecting columns, rearranging data and summarizing information. The package is part of the tidyverse, a collection of data science packages that share an underlying design philosophy, grammar and data structures. The packages within the tidyverse are widely used for data manipulation, exploration and visualization in R. Here are some of the core packages in the tidyverse:

  • ggplot2
  • dplyr
  • tidyr
  • readr
  • purrr
  • tibble

The tidyverse package is installed with:

install.packages("tidyverse")

To load the package into the session, the command below is used:

library(tidyverse)
## ── Attaching core tidyverse packages ───────────────────
## ✔ dplyr   1.1.4     ✔ stringr 1.5.1
## ✔ forcats 1.0.0     ✔ tibble  3.2.1
## ✔ ggplot2 3.5.1     ✔ tidyr   1.3.1
## ✔ purrr   1.0.2     
## ── Conflicts ────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

In this course we will focus on dplyr, since it is an essential tool in data analysis. If you want to use dplyr alone, it can be installed with:

install.packages("dplyr")

To load the library:

library(dplyr)

4.2.2 Key Functions in dplyr

There are core functions used by data scientists when working with dplyr; they are referred to as dplyr verbs. To explain these verbs, we will use an example data set: the famous iris data set, commonly used by beginners in data science. The data set contains measurements of various characteristics of iris flowers: sepal length, sepal width, petal length and petal width. There are three species of iris flowers in the data set: setosa, versicolor and virginica. The data will be loaded into R before assessment and wrangling.

Let's load the iris data and explore the first few records.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

4.2.3 select

This dplyr verb is used for selecting or dropping specific columns. In this lesson we will find the iris column names and select some of them using select().

data(iris)

# Find the column names 
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Remember that the data frame to work on needs to be specified as the first argument, such that:

selected_data = select(data_frame, col1, col2, col3)

Therefore, we will select the columns Petal.Length, Petal.Width and Species.

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

selected_iris_data = select(iris,
                       Petal.Length, Petal.Width, Species)

# view the first few rows of the selected data
head(selected_iris_data)
##   Petal.Length Petal.Width Species
## 1          1.4         0.2  setosa
## 2          1.4         0.2  setosa
## 3          1.3         0.2  setosa
## 4          1.5         0.2  setosa
## 5          1.4         0.2  setosa
## 6          1.7         0.4  setosa

The three selected columns are displayed in the data frame above.

Specific columns can be dropped by putting - before the column name, as in:

# Drop specified columns 
remaining_data = select(data_frame, -col1_to_drop, -col2_to_drop)

In this lesson, we will drop the Petal.Length, Petal.Width and Species columns:

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# Drop some columns 
remaining_iris_data = select(iris,
                       -Petal.Length, -Petal.Width, -Species)

# view the first few rows of the selected data
head(remaining_iris_data)
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6
## 6          5.4         3.9
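select() also accepts tidyselect helpers such as starts_with(), ends_with() and contains(), which is handy when many columns share a common prefix; a short sketch:

# Keep every column whose name starts with "Petal"
petal_columns = select(iris, starts_with("Petal"))
head(petal_columns)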

Practical exercise

You will be required to use the car_sales data set from https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/car_sales.csv. Read the data using read.csv() and select the columns car, price, body, mileage, engV, engType, year and model. Save the data frame with the selected columns as selected_cars_df. Show the first few rows of selected_cars_df.

# Load the dplyr library
# CODE HERE

# Read the data 
# CODE HERE

# select the required columns 
# CODE HERE

# Show the first few rows of the data frame
# CODE HERE

4.2.4 filter

filter() is a dplyr verb/function used to filter the records in a data frame based on a specific condition. It allows the analyst to retrieve only the records of interest and work with a smaller subset. With filter(), the data frame and the condition are passed as arguments:

# Filtering rows where a certain column meets a condition
filtered_data = filter(data_frame,
                       column_name > 5 # This is the condition
                       )

Let's select the species "setosa" from the iris data set.

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# Filter to select Setosa 
setosa_iris = filter(iris, # the data frame
                     Species == "setosa" # the condition
                     )
# First few records of setosa data
head(setosa_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Records with a sepal width of more than 3.0 can also be filtered. Here is how we achieve such a subset:

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# Filtered to select records with more than 3.0 sepal width
wide_sepal_iris = filter(iris, #the data frame
                         Sepal.Width>3.0 # the condition
                         )

head(wide_sepal_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.7         3.2          1.3         0.2  setosa
## 3          4.6         3.1          1.5         0.2  setosa
## 4          5.0         3.6          1.4         0.2  setosa
## 5          5.4         3.9          1.7         0.4  setosa
## 6          4.6         3.4          1.4         0.3  setosa
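filter() also accepts several conditions at once; separating them with commas combines them with a logical AND. A quick sketch:

# Setosa flowers with a sepal width greater than 3.5
wide_setosa = filter(iris,
                     Species == "setosa", # first condition
                     Sepal.Width > 3.5    # second condition (combined with AND)
                     )
head(wide_setosa)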

Practical exercise

With the car_sales data set that you used above, use the filter() function to get the cars that were sold from 2008 to the present and name the result latest_car_sales. Count the number of observations and show the first few rows.

# Load the dplyr library
# CODE HERE 

# Read the data 
# CODE HERE

# Filter to find cars sold from 2008
# CODE HERE 

# Count the observations made. Use the nrow function
# CODE HERE 

# Show the first few rows
# CODE HERE 

4.2.5 arrange

This is a dplyr verb/function used for sorting rows into a specific order. Here is how to use the arrange() function:

arranged_data = arrange(data_frame, column_name)

By default, this arranges the data in ascending order. To arrange in descending order, the desc() function is added:

# Note the additional desc function
arranged_data = arrange(data_frame, desc(column_name))

Now let's order the iris data in ascending order of petal length and view the 6 records with the shortest petals.

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# Sort the data 
by_petal_length = arrange(iris, # data frame 
                          Petal.Length # order by column
                          )

# View the data 
head(by_petal_length)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          4.6         3.6          1.0         0.2  setosa
## 2          4.3         3.0          1.1         0.1  setosa
## 3          5.8         4.0          1.2         0.2  setosa
## 4          5.0         3.2          1.2         0.2  setosa
## 5          4.7         3.2          1.3         0.2  setosa
## 6          5.4         3.9          1.3         0.4  setosa

Let's repeat the same process, but now order the data in descending order.

# Sort the data 
by_petal_length = arrange(iris, # data frame 
                          desc(Petal.Length) # order by column
                          )

# View the data 
head(by_petal_length)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          7.7         2.6          6.9         2.3 virginica
## 2          7.7         3.8          6.7         2.2 virginica
## 3          7.7         2.8          6.7         2.0 virginica
## 4          7.6         3.0          6.6         2.1 virginica
## 5          7.9         3.8          6.4         2.0 virginica
## 6          7.3         2.9          6.3         1.8 virginica
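arrange() can also sort by more than one column, with ties in the first column broken by the second; a short sketch:

# Sort by Species alphabetically, then by Petal.Length from longest to shortest
by_species_then_petal = arrange(iris,
                                Species,
                                desc(Petal.Length))
head(by_species_then_petal)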

Practical exercise

Arrange the car_sales data set according to mileage in descending order. Show the last few rows.

# Load the dplyr library
# CODE HERE 

# Read the data into a data frame
# CODE HERE 

# Order according to mileage in descending order
# CODE HERE 

# Show the last few rows of the data set
# CODE HERE

4.2.6 mutate

mutate() is a dplyr verb used to modify existing variables or create new variables in a data set. In this case we can calculate the log of Sepal.Length in the iris data.

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# modify Sepal.Length
new_iris = mutate(iris, Sepal.Length=log(Sepal.Length))

head(new_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     1.629241         3.5          1.4         0.2  setosa
## 2     1.589235         3.0          1.4         0.2  setosa
## 3     1.547563         3.2          1.3         0.2  setosa
## 4     1.526056         3.1          1.5         0.2  setosa
## 5     1.609438         3.6          1.4         0.2  setosa
## 6     1.686399         3.9          1.7         0.4  setosa

Additionally, we can create an entirely new variable with mutate(). In this case we will find the ratio between petal length and petal width. The new variable will be called "Petal.Length.Width.Ratio".

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# Create a new column in the data set 
new_iris = mutate(iris,
                  Petal.Length.Width.Ratio = Petal.Length/Petal.Width)

head(new_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##   Petal.Length.Width.Ratio
## 1                     7.00
## 2                     7.00
## 3                     6.50
## 4                     7.50
## 5                     7.00
## 6                     4.25

The "Petal.Length.Width.Ratio" is found by dividing Petal.Length by Petal.Width.
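Several columns can also be created or modified in a single mutate() call by separating the definitions with commas; a brief sketch:

# Create two new columns at once
new_iris = mutate(iris,
                  Sepal.Area = Sepal.Length * Sepal.Width,
                  Petal.Area = Petal.Length * Petal.Width)
head(new_iris)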

Practical exercise

Using the car_sales data set, create a new column, "distance_covered_km", calculated from the mileage by multiplying mileage by 1.609. Show the first few rows of the mutated data frame.

# Load the dplyr library
# CODE HERE 

# Read the data set 
# CODE HERE 

# Create new column "distance_covered_km"
# CODE HERE 

# Show first few rows 
# CODE HERE 

4.2.7 summarise

To calculate summary statistics such as the average, median and maximum, the summarise() function is used. It collapses multiple rows into a summary row. For instance, calculating the mean petal width:

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# Calculate the mean petal width
summarise(iris, 
          mean_petal_width=mean(Petal.Width))
##   mean_petal_width
## 1         1.199333

To find the mean petal width for each iris species:

  • the iris data will be grouped by species
  • a mean value for each group will be calculated

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# To find the mean petal width for each iris species;
# - the iris data will be grouped by species
# - a mean value for each group will be calculated
grouped_iris = group_by(iris, Species)
mean_petal_widths = summarise(grouped_iris, mean_value=mean(Petal.Width))
mean_petal_widths
## # A tibble: 3 × 2
##   Species    mean_value
##   <fct>           <dbl>
## 1 setosa          0.246
## 2 versicolor      1.33 
## 3 virginica       2.03
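summarise() can also return several statistics at once, and n() counts the rows in each group; a short sketch on the grouped data:

petal_width_summary = summarise(grouped_iris,
                                mean_value = mean(Petal.Width),
                                sd_value = sd(Petal.Width), # standard deviation per species
                                count = n())                # number of flowers per species
petal_width_summary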

Practical exercise

You will be required to use the car_sales data set once again. Calculate the descriptive statistics using the summarise() command.

# Load the dplyr library
# CODE HERE 

# Load the car sales data 
# CODE HERE 

# Calculate the summary statistics
# CODE HERE

4.2.8 group_by

group_by() is a function used to group the records in a data frame by one or more variables. It allows the analyst to create groups based on certain criteria. It is typically chained together with summarise().

Let's group the iris data by the Species variable and find the mean of each variable:

# Load the required libraries 
library(dplyr)

# Load the iris data 
data(iris)

# Group the iris based on their Species 
iris_groups = group_by(iris, Species) %>%
  summarise(across(everything(), mean, na.rm = TRUE))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(everything(), mean, na.rm =
##   TRUE)`.
## ℹ In group 1: `Species = setosa`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of
##   dplyr 1.1.0.
## Supply arguments directly to `.fns` through an
## anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
head(iris_groups)
## # A tibble: 3 × 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa             5.01        3.43         1.46       0.246
## 2 versicolor         5.94        2.77         4.26       1.33 
## 3 virginica          6.59        2.97         5.55       2.03

These groupings allow the analyst to retrieve insights at a more granular level and uncover patterns that would not be visible when analyzing the entire data set.
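The warning above comes from the older way of passing mean directly to across(); since dplyr 1.1.0 the function is wrapped in an anonymous function instead. A sketch of the current syntax (the \(x) shorthand needs R 4.1 or later):

# Same grouped means without the deprecation warning
iris_groups = group_by(iris, Species) %>%
  summarise(across(everything(), \(x) mean(x, na.rm = TRUE)))
head(iris_groups)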

Practical exercise

Use the car_sales data set provided before to work on this activity. Load the data and group the sales by model to get the sum of every quantitative feature/variable. Name the resultant data frame car_sales_by_model and display it. Hint: use across(where(is.numeric), sum) as an argument to summarise instead of across(everything(), mean, na.rm = TRUE) to find the sum of the quantitative variables.

4.3 Chaining

Chaining is the process of combining several operations using the %>% (forward pipe) operator. Each step in the chained workflow passes its result to the next until the whole process is done. To understand chaining, the mtcars (Motor Trend Cars) data set will be used. mtcars is a well-known data set containing several attributes of 32 different cars from 1974. Here's a brief explanation of the variables in the mtcars data set:

  1. mpg: Miles per gallon (fuel efficiency).
  2. cyl: Number of cylinders.
  3. disp: Displacement (cubic inches).
  4. hp: Horsepower.
  5. drat: Rear axle ratio.
  6. wt: Weight (in 1000 lbs).
  7. qsec: Quarter mile time (in seconds).
  8. vs: Engine type (0 = V-shaped, 1 = Straight).
  9. am: Transmission type (0 = Automatic, 1 = Manual).
  10. gear: Number of forward gears.
  11. carb: Number of carburetors.

Let's start by loading the data and viewing its first few records.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let's select the 6 most important columns for this analysis.

# Load the library
library(dplyr)

# Load the data
data(mtcars)

# Lets `select` 6 most important columns in this analysis
cars1 = mtcars %>% select(mpg, cyl, disp, hp, qsec, am)

head(cars1)
##                    mpg cyl disp  hp  qsec am
## Mazda RX4         21.0   6  160 110 16.46  1
## Mazda RX4 Wag     21.0   6  160 110 17.02  1
## Datsun 710        22.8   4  108  93 18.61  1
## Hornet 4 Drive    21.4   6  258 110 19.44  0
## Hornet Sportabout 18.7   8  360 175 17.02  0
## Valiant           18.1   6  225 105 20.22  0

Let's now filter to find vehicles with an automatic transmission. The filter verb will be chained to the select verb with %>%.

# Load the library
library(dplyr)

# Load the data
data(mtcars)

# Select and filter chained together
cars2 = mtcars %>% select(mpg, cyl, disp, hp, qsec, am) %>%
  filter(am==0)
head(cars2)
##                    mpg cyl  disp  hp  qsec am
## Hornet 4 Drive    21.4   6 258.0 110 19.44  0
## Hornet Sportabout 18.7   8 360.0 175 17.02  0
## Valiant           18.1   6 225.0 105 20.22  0
## Duster 360        14.3   8 360.0 245 15.84  0
## Merc 240D         24.4   4 146.7  62 20.00  0
## Merc 230          22.8   4 140.8  95 22.90  0

All these vehicles have an automatic transmission; let's rank them by horsepower in descending order.

# Load the library
library(dplyr)

# Load the data
data(mtcars)

# Select, filter and arrange chained together
cars3 = mtcars %>% select(mpg, cyl, disp, hp, qsec, am, wt) %>%
  filter(am==0) %>%
  arrange(desc(hp))

head(cars3)
##                      mpg cyl  disp  hp  qsec am    wt
## Duster 360          14.3   8 360.0 245 15.84  0 3.570
## Camaro Z28          13.3   8 350.0 245 15.41  0 3.840
## Chrysler Imperial   14.7   8 440.0 230 17.42  0 5.345
## Lincoln Continental 10.4   8 460.0 215 17.82  0 5.424
## Cadillac Fleetwood  10.4   8 472.0 205 17.98  0 5.250
## Merc 450SE          16.4   8 275.8 180 17.40  0 4.070

A new column of weight in 1000 kg (wt_1000kgs) can be created by dividing the weight in 1000 lbs by 2.20462. The mutate verb will also be chained.

# Load the library
library(dplyr)

# Load the data
data(mtcars)

# Multiple chains 
cars4 = mtcars %>% select(mpg, cyl, disp, hp, qsec, am, wt) %>%
  filter(am==0) %>%
  arrange(desc(hp)) %>%
  mutate(wt_1000kgs=wt/2.20462)

head(cars4)
##                      mpg cyl  disp  hp  qsec am    wt wt_1000kgs
## Duster 360          14.3   8 360.0 245 15.84  0 3.570   1.619327
## Camaro Z28          13.3   8 350.0 245 15.41  0 3.840   1.741797
## Chrysler Imperial   14.7   8 440.0 230 17.42  0 5.345   2.424454
## Lincoln Continental 10.4   8 460.0 215 17.82  0 5.424   2.460288
## Cadillac Fleetwood  10.4   8 472.0 205 17.98  0 5.250   2.381363
## Merc 450SE          16.4   8 275.8 180 17.40  0 4.070   1.846123
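group_by() and summarise() can join the same chain. A sketch that ends the pipeline with the average horsepower per cylinder count among the automatic cars (mean_hp and n_cars are illustrative names):

# Load the library
library(dplyr)

# Load the data
data(mtcars)

# Chain select, filter, group_by and summarise together
hp_by_cyl = mtcars %>%
  select(mpg, cyl, disp, hp, qsec, am) %>%
  filter(am == 0) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp), # average horsepower per cylinder count
            n_cars = n())       # number of cars in each group

hp_by_cyl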

The above process has shown how chaining works in dplyr. Many functions/processes can be chained together to manipulate data into the desired output. The next section applies chaining in hands-on exercises that will cement your understanding of dplyr.

4.3.1 Hands-on Exercises

You will be required to download the Furniture Sales Dataset from here. Use the data set to answer the questions below:
1. Perform the following tasks using a combination of select(), filter(), arrange(), mutate(), and group_by().

  1. Select the columns "category", "sales", and "profit_margin". Filter the data set to include only orders with a "sales" amount greater than 25, and arrange the data by "profit_margin" in descending order.
  2. Group the data by "category" and calculate the total sales and profit_margin for each category. Arrange the results by total sales in ascending order.

  2. Answer the following questions by chaining two or three dplyr verbs:
  1. Which store types sold products made of fabric material?
  2. What is the average "profit_margin" for each "category" of products?