Chapter 3 Group Manipulation
Group manipulation in R refers to the process of grouping data based on certain categories and then performing operations based on each group separately. This is useful when you want to summarize, analyze or transform subsets of your data independently. In simple terms, group manipulation involves splitting the data into groups, applying a function to each group, and then combining the results. We will explore different methods designed by researchers for group manipulation. They are group manipulation using;
- The
apply
family, - The
aggregate
fromplyr
package, - Data reshaping
3.1 Apply Family
The apply family in R is a collection of functions that helps you apply operations to data structures like vectors, lists, matrices and data frames in a more efficient way than using loops. Think of these functions as a way to give commands to your data in bulk, telling each piece what to do without repeating yourself.
Let’s make this fun! Imagine you’re running a café, and you have tables (rows of data) with customer orders (columns of data). You want to calculate the total for each table or find out how much each customer spent on average. The apply
family is like hiring a helper who goes to each table and collects information without you having to ask each customer individually!
We will have a quick overview of the members of the apply family;
apply()
- Works with matrices or data frames, applying a function to rows or columns.lapply()
- Loops over elements in a list, applying a function to each element and returning a list.sapply()
- Similar tolapply
, but it returns a vector or matrix when possible.tapply()
- Applies a function over subsets of data, especially useful for factors or groups.mapply()
- Applies a function to multiple arguments simultaneously.
Try it:
Here is the apply
family in action using the built-in R data set that contains information about flowers.
- Use apply to calculate the mean of each column in the iris data set at once(No need of specifying the columns)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Calculate the mean of each numeric column
col_means <- apply(iris[, 1:4], 2, mean)
print(col_means)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
- The
2
inapply
means “apply the function to columns” and themean
was used to find the average of each column. This is simple as asking a helper to calculate the the average for all types of flowers for each characteristic (sepal length, petal length, etc.). Let’s repeat the same for a each row, instead of argument value2
we will put argument value1
in the second position.
row_means <- apply(iris[, 1:4], 1, mean) # Calculate the mean for each row
head(row_means, 15) # Show the first fifteen averages of the row
## [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500
## [13] 2.325 2.125 2.800
- Now lets use the
lapply
function to find the range for each numeric column. This function applies to each element and returns a list. No need to specify if its a column or a row
# Calculate the range of each numeric column in the iris dataset
column_ranges <- lapply(iris[, 1:4], range)
print(column_ranges)
## $Sepal.Length
## [1] 4.3 7.9
##
## $Sepal.Width
## [1] 2.0 4.4
##
## $Petal.Length
## [1] 1.0 6.9
##
## $Petal.Width
## [1] 0.1 2.5
Repeating the function with mean
function instead of the range
function.
# Calculate the mean of each numeric column in the iris dataset
col_means <- lapply(iris[, 1:4], mean)
print(col_means)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
You see! lapply
function works column wise instead of row wise when working with data frames. Lets create a function that will add 10 to the input value and use the lapply
function to work on a vector.
# Create a vector
current_ages <- c(21, 43, 12, 56, 32)
# Create a function that adds 10 to an input value
add_10 <- function(value){
return(value + 10)
}
# Test the function
add_10(27)
## [1] 37
# Apply the function to vector ages
ages_10_years_later <- lapply(current_ages, add_10)
ages_10_years_later # Show the result
## [[1]]
## [1] 31
##
## [[2]]
## [1] 53
##
## [[3]]
## [1] 22
##
## [[4]]
## [1] 66
##
## [[5]]
## [1] 42
It returns a list
with values in the vector current_ages
add 10 to each value.
- The
sapply()
function works similarly tolapply()
, but it tries to simplify the output. If possible, it will return a vector or matrix instead of a list. Let`s calculate the variance for each numeric column;
# Calculate the variance for each numeric column
col_variance <- sapply(iris[, 1:4], var)
print(col_variance)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.6856935 0.1899794 3.1162779 0.5810063
Remember that we created a function add_10
that adds 10 to the current ages of the clients. Lets repeat the same using the sapply
function instead of lapply
function.
# Calculate the variance for each numeric column
ages_10_years_later <- sapply(current_ages, add_10)
print(ages_10_years_later)
## [1] 31 53 22 66 42
It is now evident that sapply
has a simpler output than the lapply
function.
- The
tapply()
function applies a function to subsets of data grouped by a factor (e.g., species in our case). Let’s calculate the average sepal length for each species:
# Calculate the average Sepal.Length for each Species
avg_sepal_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(avg_sepal_by_species)
## setosa versicolor virginica
## 5.006 5.936 6.588
This is like sending your helper to collect the sepal lengths for each species separately, and then calculating the average for each group.
- Finally the
mapply()
function is useful when you want to apply a function to multiple sets of arguments at once. Let’s calculate the sum ofSepal.Length
andSepal.Width
for each row:
# Sum Sepal.Length and Sepal.Width for each row
sepal_sum <- mapply(sum, iris$Sepal.Length, iris$Sepal.Width)
head(sepal_sum)
## [1] 8.6 7.9 7.9 7.7 8.6 9.3
This function adds the sepal length and width for each flower row by row. It’s like your helper asking every customer for two values and summing them up together.
Practical Exercise
Now it’s time to test your skills!
- Use
apply()
to calculate the maximum for each column in the iris data set. - Use
lapply()
to find the summary statistics (use thesummary()
function) for each numeric column in the iris data set. - Use
tapply()
to find the average petal width for each species in the iris data set.
Solution
- Use
apply()
to calculate the maximum for each column in the iris data set.
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 7.9 4.4 6.9 2.5
- Use
lapply()
to find the summary statistics (use thesummary()
function) for each numeric column in the iris data set.
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $Sepal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
##
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
##
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
- Use
tapply()
to find the average petal width for each species in the iris data set.
# Calculate the average Petal.Width for each Species
avg_petal_width_by_species <- tapply(iris$Petal.Width, iris$Species, mean)
print(avg_petal_width_by_species)
## setosa versicolor virginica
## 0.246 1.326 2.026
________________________________________________________________________________
3.2 Aggregate Plyr
The aggregate()
function from plyr
package is a powerful tool for grouping and summarizing data in R. This is similar to the SQL GROUP BY
command or the tapply()
that we have discussed above. The difference is that aggregate()
allows to summarize data based on one or more grouping factors.
Try it!
Let’s explore an example using the built-in mtcars
data set to show how to use the aggregate()
from the plyr
package. The plyr
package can be installed by:
install.packages("plyr")
Lets start
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
# Load the data set
data("mtcars")
# Use aggregate to find the average 'mpg' (miles per gallon) grouped by the number of cylinders ('cyl')
avg_mpg_by_cyl <- aggregate(mpg ~ cyl,
data = mtcars,
FUN = mean)
avg_mpg_by_cyl
## cyl mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
If we break done the code;
mpg ~ cyl
tells R to calculate the averagempg
(dependent variable) for each unique value ofcyl
(grouping factor).data = mtcars
specifies the data set.FUN = mean
applies themean
function to compute the averagempg
for each group ofcyl
.
We have just calculated the average mpg
(miles per gallon) grouped by the number of cyl
(cylinders). Let’s make it a little bit more complex by grouping with multiple variables and summarize multiple columns as well. We will calculate the mean horsepower(hp
) and the weight(wt
) by the number of cylinders(cyl
) and the number of transmission(am
).
# Use aggregate to find the mean hp and wt by cylinders and transmission type
avg_hp_wt_by_cyl_am <- aggregate(cbind(hp, wt) ~ cyl + am,
data = mtcars,
FUN = mean)
avg_hp_wt_by_cyl_am
## cyl am hp wt
## 1 4 0 84.66667 2.935000
## 2 6 0 115.25000 3.388750
## 3 8 0 194.16667 4.104083
## 4 4 1 81.87500 2.042250
## 5 6 1 131.66667 2.755000
## 6 8 1 299.50000 3.370000
If we breakdown the code;
cbind(hp, wt)
allows you to summarize multiple columns (hp
andwt
).cyl + am
groups the data by the number of cylinders and the transmission type (am
= 0 for automatic, 1 for manual`).- The argument
FUN
defines the function to be used here therefore,FUN = mean
calculates the mean values forhp
andwt
for each group ofcyl
andam
.
Practical Exercise
Try using the aggregate()
with the iris
data set to find the mean sepal length (Sepal.Length
) and petal length(Petal.Length
) for each species.
Solution
library(plyr)
# Load the iris data set
data(iris)
# Calculate the averages as per the instructions
avg_sepal_petal_by_species <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species,
data = iris,
FUN = mean)
avg_sepal_petal_by_species
## Species Sepal.Length Petal.Length
## 1 setosa 5.006 1.462
## 2 versicolor 5.936 4.260
## 3 virginica 6.588 5.552
________________________________________________________________________________
3.3 Data Reshaping
Data reshaping is the process of transforming the layout or structure of a data set without changing the actual data. You typically reshape data to suit different analyses, visualizations, or reporting formats. Common operations for reshaping include pivoting data between wide and long formats.
- Wide format: Each subject(row) has its own columns for measurements at different time points or categories.
- Long format: The data has one measurement per row, making it easier to analyze in some cases, especially with repeated measures.
In R, the most common function for reshaping data include;
pivot_longer()
andpivot_wider()
from thetidyr
package.melt()
anddcast()
from thereshape2
package.
Try it!
Let’s have some fun by working on the mtcars
data set where we will demonstrate reshaping between wide and long formats
Step 1: Inspect the Data
The mtcars
data set is already in a wide format where each row represents a car, and columns represent different variables for instance mpg, cyl, hp
.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Step2: Converting from Wide to Long Format
We will use the pivot_longer()
function from the tidyr
package to convert the data set from wide to long format. In this case, we will shape the mpg
, hp
and wt
columns into a longer format making it easier to work with.
library(tidyr)
# Reshape the data from wide to long format
mtcars_long <- mtcars %>%
pivot_longer(cols=c(mpg, hp, wt),
names_to = "variable",
values_to = "value")
# View the respaed data
head(mtcars_long)
## # A tibble: 6 × 10
## cyl disp drat qsec vs am gear carb variable value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6 160 3.9 16.5 0 1 4 4 mpg 21
## 2 6 160 3.9 16.5 0 1 4 4 hp 110
## 3 6 160 3.9 16.5 0 1 4 4 wt 2.62
## 4 6 160 3.9 17.0 0 1 4 4 mpg 21
## 5 6 160 3.9 17.0 0 1 4 4 hp 110
## 6 6 160 3.9 17.0 0 1 4 4 wt 2.88
If we break down the code;
pivot_longer()
function moves the selected columns (mpg
,hp
,wt
) into a new “long” format, with eah row representing a unique combination of car characteristics(variable
) and their correspondingvalue
.names_to = "variable"
: The variable names (e.g.,mpg
,hp
,wt
) are moved to a column named “variable”.values_to = "value"
: The data for each variable is placed in a column named"value"
.
Also, data in long format can be converted to a wide format. The pivot_wider
function from dplyr
gets the work done.
Try it!
Lets put the pivot_wider
function into practice. We will convert the ntcars_long
data set that we just recently generated to a wider format.
# Reshape from long to wide format
mtcars_wide <- mtcars_long %>%
pivot_wider(names_from = "variable", values_from = "value")
# View the reshaped data
head(mtcars_wide)
## # A tibble: 6 × 11
## cyl disp drat qsec vs am gear carb mpg hp wt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 160 3.9 16.5 0 1 4 4 21 110 2.62
## 2 6 160 3.9 17.0 0 1 4 4 21 110 2.88
## 3 4 108 3.85 18.6 1 1 4 1 22.8 93 2.32
## 4 6 258 3.08 19.4 1 0 3 1 21.4 110 3.22
## 5 8 360 3.15 17.0 0 0 3 2 18.7 175 3.44
## 6 6 225 2.76 20.2 1 0 3 1 18.1 105 3.46
If we break down the code;
pivot_wider()
converts the long format back into the wide format, with separate columns for each variable (mpg
,hp
,wt
).names_from = "variable"
: Moves the unique values from the"variable
” column into their own columns (e.g.,mpg
,hp
,wt
).values_from = "value"
: Populates the new columns with values from the “value” column.
Practical Exercise
Use the pivot_longer() function to convert the iris dataset (which contains measurements for different flower features) into a long format. Focus on converting the numeric columns like Sepal.Length and Sepal.Width.
Then, use pivot_wider() to convert it back to a wide format.
Solution
Convert to long format
library(tidyr)
# Load the data
data(iris)
# Load the iris dataset and reshape it
iris_long <- iris %>%
pivot_longer(cols = starts_with("Sepal"),
names_to = "feature",
values_to = "measurement")
# View the reshaped data
head(iris_long)
## # A tibble: 6 × 5
## Petal.Length Petal.Width Species feature measurement
## <dbl> <dbl> <fct> <chr> <dbl>
## 1 1.4 0.2 setosa Sepal.Length 5.1
## 2 1.4 0.2 setosa Sepal.Width 3.5
## 3 1.4 0.2 setosa Sepal.Length 4.9
## 4 1.4 0.2 setosa Sepal.Width 3
## 5 1.3 0.2 setosa Sepal.Length 4.7
## 6 1.3 0.2 setosa Sepal.Width 3.2
Back to wide
# Now reshape it back to wide format
iris_wide <- iris_long %>%
pivot_wider(names_from = "feature", values_from = "measurement")
# View the reshaped data
head(iris_wide)
## # A tibble: 6 × 5
## Petal.Length Petal.Width Species Sepal.Length Sepal.Width
## <dbl> <dbl> <fct> <list> <list>
## 1 1.4 0.2 setosa <dbl [8]> <dbl [8]>
## 2 1.3 0.2 setosa <dbl [4]> <dbl [4]>
## 3 1.5 0.2 setosa <dbl [7]> <dbl [7]>
## 4 1.7 0.4 setosa <dbl [1]> <dbl [1]>
## 5 1.4 0.3 setosa <dbl [3]> <dbl [3]>
## 6 1.5 0.1 setosa <dbl [2]> <dbl [2]>
3.4 Hands-on Exercise
In this practical exercise, you will be required to download and import the Furniture sales data set from here and perform the following;
- Calculate the average profit margin for each material using the
tapply
function. - Find the difference between the cost and the price of each item on the
mapply
function. - Calculate the maximum price, cost and sales columns using the
sapply
andlapply
function. Note the difference between the two functions. - Use aggregation to find the maximum sales for each product category.
- Subset the
brand
,sales
columns and name itbrand_df
. Convert the data to a longer format and store in a variablepbrand_df_long
.
Solution
Import the data set
## price cost sales profit_margin inventory discount_percentage
## 1 218.5431 181.61093 40 16.89924 105 27.796433
## 2 477.8214 385.03383 7 19.41889 192 26.943715
## 3 379.3973 276.73677 32 27.05884 59 21.948130
## 4 319.3963 281.84133 48 11.75811 45 11.009944
## 5 120.2084 69.74368 19 41.98102 35 3.183763
## 6 120.1975 65.35208 6 45.62943 185 20.659352
## delivery_days category material color location season store_type brand
## 1 9 Bed Plastic Red Rural Spring Online BrandA
## 2 6 Chair Glass Blue Rural Summer Online BrandD
## 3 2 Table Metal Black Suburban Fall Online BrandD
## 4 2 Table Glass Green Rural Summer Retail BrandD
## 5 9 Chair Glass Brown Rural Fall Online BrandD
## 6 8 Table Plastic Brown Urban Fall Retail BrandB
## revenue
## 1 3949.165
## 2 -3521.002
## 3 14285.560
## 4 12261.074
## 5 -4588.256
## 6 9136.302
- Calculate the average profit margin for each material using the
tapply
function.
## Fabric Glass Metal Plastic Wood
## 29.23317 30.31741 30.67792 30.09190 30.57714
- Find the difference between the cost and the price of each item on the
mapply
function.
# Create a function that finds difference between two numbers
diff_a_b <- function(a, b){
d = a - b
return(d)
}
# Difference between cost and price
cost_price_difference <- mapply(diff_a_b, df$price, df$cost)
head(cost_price_difference)
## [1] 36.93212 92.78761 102.66051 37.55498 50.46471 54.84545
- Calculate the maximum
price
,cost
andsales
columns using thesapply
andlapply
function. Note the difference between the two functions.
# Subset the data set
library(dplyr)
subset_df <- df %>%
select(price, cost, sales)
# Using sapply function
sapply(subset_df, max)
## price cost sales
## 499.8730 447.0229 49.0000
## $price
## [1] 499.873
##
## $cost
## [1] 447.0229
##
## $sales
## [1] 49
- Use aggregation to find the maximum sales for each product category.
library(plyr)
# Perform aggregation
aggregated_sales <- aggregate(sales ~ category,
data = df,
FUN = max)
aggregated_sales
## category sales
## 1 Bed 49
## 2 Chair 49
## 3 Desk 49
## 4 Sofa 49
## 5 Table 49
- Subset the
brand
,sales
columns and name itbrand_df
. Convert the data to a longer format and store in a variablepbrand_df_long
.
library(tidyr)
library(dplyr)
# Subset the data set
brand_df <- df %>%
select(brand, sales)
head(brand_df)
## brand sales
## 1 BrandA 40
## 2 BrandD 7
## 3 BrandD 32
## 4 BrandD 48
## 5 BrandD 19
## 6 BrandB 6
# To a longer format
brand_df_long <- brand_df %>%
pivot_longer( cols = "brand")
head(brand_df_long)
## # A tibble: 6 × 3
## sales name value
## <int> <chr> <chr>
## 1 40 brand BrandA
## 2 7 brand BrandD
## 3 32 brand BrandD
## 4 48 brand BrandD
## 5 19 brand BrandD
## 6 6 brand BrandB
________________________________________________________________________________