Chapter 3 Group Manipulation
Group manipulation in R refers to the process of grouping data based on certain categories and then performing operations based on each group separately. This is useful when you want to summarize, analyze or transform subsets of your data independently. In simple terms, group manipulation involves splitting the data into groups, applying a function to each group, and then combining the results. We will explore different methods designed by researchers for group manipulation. They are group manipulation using;
- The apply
family,
- The aggregate
from plyr
package,
- Data reshaping
3.1 Apply Family
The apply family in R is a collection of functions that helps you apply operations to data structures like vectors, lists, matrices and data frames in a more efficient way than using loops. Think of these functions as a way to give commands to your data in bulk, telling each piece what to do without repeating yourself.
Let’s make this fun! Imagine you’re running a café, and you have tables (rows of data) with customer orders (columns of data). You want to calculate the total for each table or find out how much each customer spent on average. The apply
family is like hiring a helper who goes to each table and collects information without you having to ask each customer individually!
We will have a quick overview of the members of the apply family;
apply()
- Works with matrices or data frames, applying a function to rows or columns.lapply()
- Loops over elements in a list, applying a function to each element and returning a list.sapply()
- Similar tolapply
, but it returns a vector or matrix when possible.tapply()
- Applies a function over subsets of data, especially useful for factors or groups.mapply()
- Applies a function to multiple arguments simultaneously.
Try it:
Here is the apply
family in action using the built-in R data set that contains information about flowers.
- Use apply to calculate the mean of each column in the iris data set at once(No need of specifying the columns)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Calculate the mean of each numeric column
col_means <- apply(iris[, 1:4], 2, mean)
print(col_means)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
- The
2
inapply
means “apply the function to columns” and themean
was used to find the average of each column. This is simple as asking a helper to calculate the the average for all types of flowers for each characteristic (sepal length, petal length, etc.). Let’s repeat the same for a each row, instead of argument value2
we will put argument value1
in the second position.
row_means <- apply(iris[, 1:4], 1, mean) # Calculate the mean for each row
head(row_means, 15) # Show the first fifteen averages of the row
## [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500
## [13] 2.325 2.125 2.800
- Now lets use the
lapply
function to find the range for each numeric column. This function applies to each element and returns a list. No need to specify if its a column or a row
# Calculate the range of each numeric column in the iris dataset
column_ranges <- lapply(iris[, 1:4], range)
print(column_ranges)
## $Sepal.Length
## [1] 4.3 7.9
##
## $Sepal.Width
## [1] 2.0 4.4
##
## $Petal.Length
## [1] 1.0 6.9
##
## $Petal.Width
## [1] 0.1 2.5
Repeating the function with mean
function instead of the range
function.
# Calculate the mean of each numeric column in the iris dataset
col_means <- lapply(iris[, 1:4], mean)
print(col_means)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
You see! lapply
function works column wise instead of row wise when working with data frames. Lets create a function that will add 10 to the input value and use the lapply
function to work on a vector.
# Create a vector
current_ages <- c(21, 43, 12, 56, 32)
# Create a function that adds 10 to an input value
add_10 <- function(value){
return(value + 10)
}
# Test the function
add_10(27)
## [1] 37
# Apply the function to vector ages
ages_10_years_later <- lapply(current_ages, add_10)
ages_10_years_later # Show the result
## [[1]]
## [1] 31
##
## [[2]]
## [1] 53
##
## [[3]]
## [1] 22
##
## [[4]]
## [1] 66
##
## [[5]]
## [1] 42
It returns a list
with values in the vector current_ages
add 10 to each value.
- The
sapply()
function works similarly tolapply()
, but it tries to simplify the output. If possible, it will return a vector or matrix instead of a list. Let`s calculate the variance for each numeric column;
# Calculate the variance for each numeric column
col_variance <- sapply(iris[, 1:4], var)
print(col_variance)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.6856935 0.1899794 3.1162779 0.5810063
Remember that we created a function add_10
that adds 10 to the current ages of the clients. Lets repeat the same using the sapply
function instead of lapply
function.
# Calculate the variance for each numeric column
ages_10_years_later <- sapply(current_ages, add_10)
print(ages_10_years_later)
## [1] 31 53 22 66 42
It is now evident that sapply
has a simpler output than the lapply
function.
- The
tapply()
function applies a function to subsets of data grouped by a factor (e.g., species in our case). Let’s calculate the average sepal length for each species:
# Calculate the average Sepal.Length for each Species
avg_sepal_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(avg_sepal_by_species)
## setosa versicolor virginica
## 5.006 5.936 6.588
This is like sending your helper to collect the sepal lengths for each species separately, and then calculating the average for each group.
- Finally the
mapply()
function is useful when you want to apply a function to multiple sets of arguments at once. Let’s calculate the sum ofSepal.Length
andSepal.Width
for each row:
# Sum Sepal.Length and Sepal.Width for each row
sepal_sum <- mapply(sum, iris$Sepal.Length, iris$Sepal.Width)
head(sepal_sum)
## [1] 8.6 7.9 7.9 7.7 8.6 9.3
This function adds the sepal length and width for each flower row by row. It’s like your helper asking every customer for two values and summing them up together.
Practical Exercise
Now it’s time to test your skills!
- Use
apply()
to calculate the maximum for each column in the iris data set. - Use
lapply()
to find the summary statistics (use thesummary()
function) for each numeric column in the iris data set. - Use
tapply()
to find the average petal width for each species in the iris data set.
3.2 Aggregate Plyr
The aggregate()
function from plyr
package is a powerful tool for grouping and summarizing data in R. This is similar to the SQL GROUP BY
command or the tapply()
that we have discussed above. The difference is that aggregate()
allows to summarize data based on one or more grouping factors.
Try it!
Let’s explore an example using the built-in mtcars
data set to show how to use the aggregate()
from the plyr
package. The plyr
package can be installed by:
install.packages("plyr")
Lets start
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
# Load the data set
data("mtcars")
# Use aggregate to find the average 'mpg' (miles per gallon) grouped by the number of cylinders ('cyl')
avg_mpg_by_cyl <- aggregate(mpg ~ cyl,
data = mtcars,
FUN = mean)
avg_mpg_by_cyl
## cyl mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
If we break done the code;
mpg ~ cyl
tells R to calculate the averagempg
(dependent variable) for each unique value ofcyl
(grouping factor).data = mtcars
specifies the data set.FUN = mean
applies themean
function to compute the averagempg
for each group ofcyl
.
We have just calculated the average mpg
(miles per gallon) grouped by the number of cyl
(cylinders). Let’s make it a little bit more complex by grouping with multiple variables and summarize multiple columns as well. We will calculate the mean horsepower(hp
) and the weight(wt
) by the number of cylinders(cyl
) and the number of transmission(am
).
# Use aggregate to find the mean hp and wt by cylinders and transmission type
avg_hp_wt_by_cyl_am <- aggregate(cbind(hp, wt) ~ cyl + am,
data = mtcars,
FUN = mean)
avg_hp_wt_by_cyl_am
## cyl am hp wt
## 1 4 0 84.66667 2.935000
## 2 6 0 115.25000 3.388750
## 3 8 0 194.16667 4.104083
## 4 4 1 81.87500 2.042250
## 5 6 1 131.66667 2.755000
## 6 8 1 299.50000 3.370000
If we breakdown the code;
cbind(hp, wt)
allows you to summarize multiple columns (hp
andwt
).cyl + am
groups the data by the number of cylinders and the transmission type (am
= 0 for automatic, 1 for manual`).- The argument
FUN
defines the function to be used here therefore,FUN = mean
calculates the mean values forhp
andwt
for each group ofcyl
andam
.
Practical Exercise
Try using the aggregate()
with the iris
data set to find the mean sepal length (Sepal.Length
) and petal length(Petal.Length
) for each species.
3.3 Data Reshaping
Data reshaping is the process of transforming the layout or structure of a data set without changing the actual data. You typically reshape data to suit different analyses, visualizations, or reporting formats. Common operations for reshaping include pivoting data between wide and long formats.
- Wide format: Each subject(row) has its own columns for measurements at different time points or categories.
- Long format: The data has one measurement per row, making it easier to analyze in some cases, especially with repeated measures.
In R, the most common function for reshaping data include;
pivot_longer()
andpivot_wider()
from thetidyr
package.melt()
anddcast()
from thereshape2
package.
Try it!
Let’s have some fun by working on the mtcars
data set where we will demonstrate reshaping between wide and long formats
Step 1: Inspect the Data
The mtcars
data set is already in a wide format where each row represents a car, and columns represent different variables for instance mpg, cyl, hp
.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Step2: Converting from Wide to Long Format
We will use the pivot_longer()
function from the tidyr
package to convert the data set from wide to long format. In this case, we will shape the mpg
, hp
and wt
columns into a longer format making it easier to work with.
library(tidyr)
# Reshape the data from wide to long format
mtcars_long <- mtcars %>%
pivot_longer(cols=c(mpg, hp, wt),
names_to = "variable",
values_to = "value")
# View the respaed data
head(mtcars_long)
## # A tibble: 6 × 10
## cyl disp drat qsec vs am gear carb variable value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6 160 3.9 16.5 0 1 4 4 mpg 21
## 2 6 160 3.9 16.5 0 1 4 4 hp 110
## 3 6 160 3.9 16.5 0 1 4 4 wt 2.62
## 4 6 160 3.9 17.0 0 1 4 4 mpg 21
## 5 6 160 3.9 17.0 0 1 4 4 hp 110
## 6 6 160 3.9 17.0 0 1 4 4 wt 2.88
If we break down the code;
pivot_longer()
function moves the selected columns (mpg
,hp
,wt
) into a new “long” format, with eah row representing a unique combination of car characteristics(variable
) and their correspondingvalue
.names_to = "variable"
: The variable names (e.g.,mpg
,hp
,wt
) are moved to a column named “variable”.values_to = "value"
: The data for each variable is placed in a column named"value"
.
Also, data in long format can be converted to a wide format. The pivot_wider
function from dplyr
gets the work done.
Try it!
Lets put the pivot_wider
function into practice. We will convert the ntcars_long
data set that we just recently generated to a wider format.
# Reshape from long to wide format
mtcars_wide <- mtcars_long %>%
pivot_wider(names_from = "variable", values_from = "value")
# View the reshaped data
head(mtcars_wide)
## # A tibble: 6 × 11
## cyl disp drat qsec vs am gear carb mpg hp wt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 160 3.9 16.5 0 1 4 4 21 110 2.62
## 2 6 160 3.9 17.0 0 1 4 4 21 110 2.88
## 3 4 108 3.85 18.6 1 1 4 1 22.8 93 2.32
## 4 6 258 3.08 19.4 1 0 3 1 21.4 110 3.22
## 5 8 360 3.15 17.0 0 0 3 2 18.7 175 3.44
## 6 6 225 2.76 20.2 1 0 3 1 18.1 105 3.46
If we break down the code;
pivot_wider()
converts the long format back into the wide format, with separate columns for each variable (mpg
,hp
,wt
).names_from = "variable"
: Moves the unique values from the"variable
” column into their own columns (e.g.,mpg
,hp
,wt
).values_from = "value"
: Populates the new columns with values from the “value” column.
Practical Exercise
Use the pivot_longer() function to convert the iris dataset (which contains measurements for different flower features) into a long format. Focus on converting the numeric columns like Sepal.Length and Sepal.Width.
Then, use pivot_wider() to convert it back to a wide format.
3.4 Hands-on Exercise
In this practical exercise, you will be required to download and import the Furniture sales data set from here and perform the following;
- Calculate the average profit margin for each material using the
tapply
function. - Find the difference between the cost and the price of each item on the
mapply
function. - Calculate the maximum price, cost and sales columns using the
sapply
andlapply
function. Note the difference between the two functions. - Use aggregation to find the maximum sales for each product category.
- Subset the
brand
,sales
columns and name itbrand_df
. Convert the data to a longer format and store in a variablepbrand_df_long
.