Chapter 3 Group Manipulation

Group manipulation in R refers to the process of grouping data based on certain categories and then performing operations based on each group separately. This is useful when you want to summarize, analyze or transform subsets of your data independently. In simple terms, group manipulation involves splitting the data into groups, applying a function to each group, and then combining the results. We will explore different methods designed by researchers for group manipulation. They are group manipulation using;

  • The apply family,
  • The aggregate from plyr package,
  • Data reshaping

3.1 Apply Family

The apply family in R is a collection of functions that helps you apply operations to data structures like vectors, lists, matrices and data frames in a more efficient way than using loops. Think of these functions as a way to give commands to your data in bulk, telling each piece what to do without repeating yourself.

Let’s make this fun! Imagine you’re running a café, and you have tables (rows of data) with customer orders (columns of data). You want to calculate the total for each table or find out how much each customer spent on average. The apply family is like hiring a helper who goes to each table and collects information without you having to ask each customer individually!

We will have a quick overview of the members of the apply family;

  • apply() - Works with matrices or data frames, applying a function to rows or columns.
  • lapply() - Loops over elements in a list, applying a function to each element and returning a list.
  • sapply() - Similar to lapply, but it returns a vector or matrix when possible.
  • tapply() - Applies a function over subsets of data, especially useful for factors or groups.
  • mapply() - Applies a function to multiple arguments simultaneously.

Try it:

Here is the apply family in action using the built-in R data set that contains information about flowers.

  • Use apply to calculate the mean of each column in the iris data set at once(No need of specifying the columns)
#  Load and view the first few rows of the iris data set
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Calculate the mean of each numeric column
col_means <- apply(iris[, 1:4], 2, mean)
print(col_means)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
  • The 2 in apply means “apply the function to columns” and the mean was used to find the average of each column. This is simple as asking a helper to calculate the the average for all types of flowers for each characteristic (sepal length, petal length, etc.). Let’s repeat the same for a each row, instead of argument value 2 we will put argument value 1 in the second position.
row_means <- apply(iris[, 1:4], 1, mean) # Calculate the mean for each row 
head(row_means, 15) # Show the first fifteen averages of the row
##  [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500
## [13] 2.325 2.125 2.800
  • Now lets use the lapply function to find the range for each numeric column. This function applies to each element and returns a list. No need to specify if its a column or a row
# Calculate the range of each numeric column in the iris dataset
column_ranges <- lapply(iris[, 1:4], range)
print(column_ranges)
## $Sepal.Length
## [1] 4.3 7.9
## 
## $Sepal.Width
## [1] 2.0 4.4
## 
## $Petal.Length
## [1] 1.0 6.9
## 
## $Petal.Width
## [1] 0.1 2.5

Repeating the function with mean function instead of the range function.

# Calculate the mean of each numeric column in the iris dataset
col_means <- lapply(iris[, 1:4], mean)
print(col_means)
## $Sepal.Length
## [1] 5.843333
## 
## $Sepal.Width
## [1] 3.057333
## 
## $Petal.Length
## [1] 3.758
## 
## $Petal.Width
## [1] 1.199333

You see! lapply function works column wise instead of row wise when working with data frames. Lets create a function that will add 10 to the input value and use the lapply function to work on a vector.

# Create a vector
current_ages <- c(21, 43, 12, 56, 32)

# Create a function that adds 10 to an input value 
add_10 <- function(value){
  return(value + 10)
}

# Test the function 
add_10(27)
## [1] 37
# Apply the function to vector ages 
ages_10_years_later <- lapply(current_ages, add_10)
ages_10_years_later # Show the result
## [[1]]
## [1] 31
## 
## [[2]]
## [1] 53
## 
## [[3]]
## [1] 22
## 
## [[4]]
## [1] 66
## 
## [[5]]
## [1] 42

It returns a list with values in the vector current_ages add 10 to each value.

  • The sapply() function works similarly to lapply(), but it tries to simplify the output. If possible, it will return a vector or matrix instead of a list. Let`s calculate the variance for each numeric column;
# Calculate the variance for each numeric column
col_variance <- sapply(iris[, 1:4], var)
print(col_variance)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.6856935    0.1899794    3.1162779    0.5810063

Remember that we created a function add_10 that adds 10 to the current ages of the clients. Lets repeat the same using the sapply function instead of lapply function.

# Calculate the variance for each numeric column
ages_10_years_later <- sapply(current_ages, add_10)
print(ages_10_years_later)
## [1] 31 53 22 66 42

It is now evident that sapply has a simpler output than the lapply function.

  • The tapply() function applies a function to subsets of data grouped by a factor (e.g., species in our case). Let’s calculate the average sepal length for each species:
# Calculate the average Sepal.Length for each Species
avg_sepal_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(avg_sepal_by_species)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588

This is like sending your helper to collect the sepal lengths for each species separately, and then calculating the average for each group.

  • Finally the mapply() function is useful when you want to apply a function to multiple sets of arguments at once. Let’s calculate the sum of Sepal.Length and Sepal.Width for each row:
# Sum Sepal.Length and Sepal.Width for each row
sepal_sum <- mapply(sum, iris$Sepal.Length, iris$Sepal.Width)
head(sepal_sum)
## [1] 8.6 7.9 7.9 7.7 8.6 9.3

This function adds the sepal length and width for each flower row by row. It’s like your helper asking every customer for two values and summing them up together.

Practical Exercise

Now it’s time to test your skills!

  1. Use apply() to calculate the maximum for each column in the iris data set.
  2. Use lapply() to find the summary statistics (use the summary() function) for each numeric column in the iris data set.
  3. Use tapply() to find the average petal width for each species in the iris data set.

Solution

  1. Use apply() to calculate the maximum for each column in the iris data set.
max_values <- apply(iris[, 1:4], 2, max)
print(max_values)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          7.9          4.4          6.9          2.5
  1. Use lapply() to find the summary statistics (use the summary() function) for each numeric column in the iris data set.
sum_stats <- lapply(iris[,1:4], summary)
print(sum_stats)
## $Sepal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900 
## 
## $Sepal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400 
## 
## $Petal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900 
## 
## $Petal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500
  1. Use tapply() to find the average petal width for each species in the iris data set.
# Calculate the average Petal.Width for each Species
avg_petal_width_by_species <- tapply(iris$Petal.Width, iris$Species, mean)
print(avg_petal_width_by_species)
##     setosa versicolor  virginica 
##      0.246      1.326      2.026

________________________________________________________________________________

3.2 Aggregate Plyr

The aggregate() function from plyr package is a powerful tool for grouping and summarizing data in R. This is similar to the SQL GROUP BY command or the tapply() that we have discussed above. The difference is that aggregate() allows to summarize data based on one or more grouping factors.

Try it!

Let’s explore an example using the built-in mtcars data set to show how to use the aggregate() from the plyr package. The plyr package can be installed by:

install.packages("plyr")

Lets start

library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
# Load the data set
data("mtcars")

# Use aggregate to find the average 'mpg' (miles per gallon) grouped by the number of cylinders ('cyl')
avg_mpg_by_cyl <- aggregate(mpg ~ cyl, 
                            data = mtcars, 
                            FUN = mean)
avg_mpg_by_cyl
##   cyl      mpg
## 1   4 26.66364
## 2   6 19.74286
## 3   8 15.10000

If we break done the code;

  1. mpg ~ cyl tells R to calculate the average mpg(dependent variable) for each unique value of cyl(grouping factor).
  2. data = mtcars specifies the data set.
  3. FUN = mean applies the mean function to compute the average mpg for each group of cyl.

We have just calculated the average mpg (miles per gallon) grouped by the number of cyl(cylinders). Let’s make it a little bit more complex by grouping with multiple variables and summarize multiple columns as well. We will calculate the mean horsepower(hp) and the weight(wt) by the number of cylinders(cyl) and the number of transmission(am).

# Use aggregate to find the mean hp and wt by cylinders and transmission type
avg_hp_wt_by_cyl_am <- aggregate(cbind(hp, wt) ~ cyl + am, 
                                 data = mtcars, 
                                 FUN = mean)

avg_hp_wt_by_cyl_am
##   cyl am        hp       wt
## 1   4  0  84.66667 2.935000
## 2   6  0 115.25000 3.388750
## 3   8  0 194.16667 4.104083
## 4   4  1  81.87500 2.042250
## 5   6  1 131.66667 2.755000
## 6   8  1 299.50000 3.370000

If we breakdown the code;

  1. cbind(hp, wt) allows you to summarize multiple columns (hp and wt).
  2. cyl + am groups the data by the number of cylinders and the transmission type (am = 0 for automatic, 1 for manual`).
  3. The argument FUN defines the function to be used here therefore, FUN = mean calculates the mean values for hp and wt for each group of cyl and am.

Practical Exercise

Try using the aggregate() with the iris data set to find the mean sepal length (Sepal.Length) and petal length(Petal.Length) for each species.


Solution

library(plyr)

# Load the iris data set
data(iris)

# Calculate the averages as per the instructions
avg_sepal_petal_by_species <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species, 
                                        data = iris,
                                        FUN = mean)

avg_sepal_petal_by_species
##      Species Sepal.Length Petal.Length
## 1     setosa        5.006        1.462
## 2 versicolor        5.936        4.260
## 3  virginica        6.588        5.552

________________________________________________________________________________

3.3 Data Reshaping

Data reshaping is the process of transforming the layout or structure of a data set without changing the actual data. You typically reshape data to suit different analyses, visualizations, or reporting formats. Common operations for reshaping include pivoting data between wide and long formats.

  • Wide format: Each subject(row) has its own columns for measurements at different time points or categories.
  • Long format: The data has one measurement per row, making it easier to analyze in some cases, especially with repeated measures.

In R, the most common function for reshaping data include;

  • pivot_longer() and pivot_wider() from the tidyr package.
  • melt() and dcast() from the reshape2 package.

Try it!

Let’s have some fun by working on the mtcars data set where we will demonstrate reshaping between wide and long formats

Step 1: Inspect the Data

The mtcars data set is already in a wide format where each row represents a car, and columns represent different variables for instance mpg, cyl, hp.

data(mtcars) # Load the data set

# First few records of the data set
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Step2: Converting from Wide to Long Format

We will use the pivot_longer() function from the tidyr package to convert the data set from wide to long format. In this case, we will shape the mpg, hp and wt columns into a longer format making it easier to work with.

library(tidyr)

# Reshape the data from wide to long format
mtcars_long <- mtcars %>% 
  pivot_longer(cols=c(mpg, hp, wt),
               names_to = "variable",
               values_to = "value")

# View the respaed data 
head(mtcars_long)
## # A tibble: 6 × 10
##     cyl  disp  drat  qsec    vs    am  gear  carb variable  value
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>     <dbl>
## 1     6   160   3.9  16.5     0     1     4     4 mpg       21   
## 2     6   160   3.9  16.5     0     1     4     4 hp       110   
## 3     6   160   3.9  16.5     0     1     4     4 wt         2.62
## 4     6   160   3.9  17.0     0     1     4     4 mpg       21   
## 5     6   160   3.9  17.0     0     1     4     4 hp       110   
## 6     6   160   3.9  17.0     0     1     4     4 wt         2.88

If we break down the code;

  1. pivot_longer() function moves the selected columns (mpg, hp, wt) into a new “long” format, with eah row representing a unique combination of car characteristics(variable) and their corresponding value.
  2. names_to = "variable": The variable names (e.g., mpg, hp, wt) are moved to a column named “variable”.
  3. values_to = "value": The data for each variable is placed in a column named "value".

Also, data in long format can be converted to a wide format. The pivot_wider function from dplyr gets the work done.

Try it!

Lets put the pivot_wider function into practice. We will convert the ntcars_long data set that we just recently generated to a wider format.

# Reshape from long to wide format
mtcars_wide <- mtcars_long %>%
  pivot_wider(names_from = "variable", values_from = "value")

# View the reshaped data
head(mtcars_wide)
## # A tibble: 6 × 11
##     cyl  disp  drat  qsec    vs    am  gear  carb   mpg    hp    wt
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     6   160  3.9   16.5     0     1     4     4  21     110  2.62
## 2     6   160  3.9   17.0     0     1     4     4  21     110  2.88
## 3     4   108  3.85  18.6     1     1     4     1  22.8    93  2.32
## 4     6   258  3.08  19.4     1     0     3     1  21.4   110  3.22
## 5     8   360  3.15  17.0     0     0     3     2  18.7   175  3.44
## 6     6   225  2.76  20.2     1     0     3     1  18.1   105  3.46

If we break down the code;

  1. pivot_wider() converts the long format back into the wide format, with separate columns for each variable (mpg, hp, wt).
  2. names_from = "variable": Moves the unique values from the "variable” column into their own columns (e.g., mpg, hp, wt).
  3. values_from = "value": Populates the new columns with values from the “value” column.

Practical Exercise

Use the pivot_longer() function to convert the iris dataset (which contains measurements for different flower features) into a long format. Focus on converting the numeric columns like Sepal.Length and Sepal.Width.

Then, use pivot_wider() to convert it back to a wide format.


Solution

Convert to long format

library(tidyr)

# Load the data 
data(iris)

# Load the iris dataset and reshape it
iris_long <- iris %>%
  pivot_longer(cols = starts_with("Sepal"), 
               names_to = "feature", 
               values_to = "measurement")

# View the reshaped data
head(iris_long)
## # A tibble: 6 × 5
##   Petal.Length Petal.Width Species feature      measurement
##          <dbl>       <dbl> <fct>   <chr>              <dbl>
## 1          1.4         0.2 setosa  Sepal.Length         5.1
## 2          1.4         0.2 setosa  Sepal.Width          3.5
## 3          1.4         0.2 setosa  Sepal.Length         4.9
## 4          1.4         0.2 setosa  Sepal.Width          3  
## 5          1.3         0.2 setosa  Sepal.Length         4.7
## 6          1.3         0.2 setosa  Sepal.Width          3.2

Back to wide

# Now reshape it back to wide format
iris_wide <- iris_long %>%
  pivot_wider(names_from = "feature", values_from = "measurement")

# View the reshaped data
head(iris_wide)
## # A tibble: 6 × 5
##   Petal.Length Petal.Width Species Sepal.Length Sepal.Width
##          <dbl>       <dbl> <fct>   <list>       <list>     
## 1          1.4         0.2 setosa  <dbl [8]>    <dbl [8]>  
## 2          1.3         0.2 setosa  <dbl [4]>    <dbl [4]>  
## 3          1.5         0.2 setosa  <dbl [7]>    <dbl [7]>  
## 4          1.7         0.4 setosa  <dbl [1]>    <dbl [1]>  
## 5          1.4         0.3 setosa  <dbl [3]>    <dbl [3]>  
## 6          1.5         0.1 setosa  <dbl [2]>    <dbl [2]>

3.4 Hands-on Exercise

In this practical exercise, you will be required to download and import the Furniture sales data set from here and perform the following;

  1. Calculate the average profit margin for each material using the tapply function.
  2. Find the difference between the cost and the price of each item on the mapply function.
  3. Calculate the maximum price, cost and sales columns using the sapply and lapply function. Note the difference between the two functions.
  4. Use aggregation to find the maximum sales for each product category.
  5. Subset the brand, sales columns and name it brand_df. Convert the data to a longer format and store in a variable pbrand_df_long.

Solution

Import the data set

# Load the data set 
df <- read.csv("data/Furniture.csv")
head(df)
##      price      cost sales profit_margin inventory discount_percentage
## 1 218.5431 181.61093    40      16.89924       105           27.796433
## 2 477.8214 385.03383     7      19.41889       192           26.943715
## 3 379.3973 276.73677    32      27.05884        59           21.948130
## 4 319.3963 281.84133    48      11.75811        45           11.009944
## 5 120.2084  69.74368    19      41.98102        35            3.183763
## 6 120.1975  65.35208     6      45.62943       185           20.659352
##   delivery_days category material color location season store_type  brand
## 1             9      Bed  Plastic   Red    Rural Spring     Online BrandA
## 2             6    Chair    Glass  Blue    Rural Summer     Online BrandD
## 3             2    Table    Metal Black Suburban   Fall     Online BrandD
## 4             2    Table    Glass Green    Rural Summer     Retail BrandD
## 5             9    Chair    Glass Brown    Rural   Fall     Online BrandD
## 6             8    Table  Plastic Brown    Urban   Fall     Retail BrandB
##     revenue
## 1  3949.165
## 2 -3521.002
## 3 14285.560
## 4 12261.074
## 5 -4588.256
## 6  9136.302
  1. Calculate the average profit margin for each material using the tapply function.
average_profit_margin <- tapply(df$profit_margin, df$material,
                                mean)

average_profit_margin
##   Fabric    Glass    Metal  Plastic     Wood 
## 29.23317 30.31741 30.67792 30.09190 30.57714
  1. Find the difference between the cost and the price of each item on the mapply function.
# Create a function that finds difference between two numbers 
diff_a_b <- function(a, b){
  d = a - b
  return(d)
}

# Difference between cost and price 
cost_price_difference <- mapply(diff_a_b, df$price, df$cost)

head(cost_price_difference)
## [1]  36.93212  92.78761 102.66051  37.55498  50.46471  54.84545
  1. Calculate the maximum price, cost and sales columns using the sapply and lapply function. Note the difference between the two functions.
# Subset the data set 
library(dplyr)

subset_df <- df %>%
  select(price, cost, sales)

# Using sapply function
sapply(subset_df, max)
##    price     cost    sales 
## 499.8730 447.0229  49.0000
# Using the lapply function 
lapply(subset_df, max)
## $price
## [1] 499.873
## 
## $cost
## [1] 447.0229
## 
## $sales
## [1] 49
  1. Use aggregation to find the maximum sales for each product category.
library(plyr)

# Perform aggregation 
aggregated_sales <- aggregate(sales ~ category,
                              data = df,
                              FUN = max)

aggregated_sales
##   category sales
## 1      Bed    49
## 2    Chair    49
## 3     Desk    49
## 4     Sofa    49
## 5    Table    49
  1. Subset the brand, sales columns and name it brand_df. Convert the data to a longer format and store in a variable pbrand_df_long.
library(tidyr)
library(dplyr)

# Subset the data set 
brand_df <- df %>%
  select(brand, sales)

head(brand_df)
##    brand sales
## 1 BrandA    40
## 2 BrandD     7
## 3 BrandD    32
## 4 BrandD    48
## 5 BrandD    19
## 6 BrandB     6
# To a longer format 
brand_df_long <- brand_df %>% 
  pivot_longer( cols = "brand")

head(brand_df_long)
## # A tibble: 6 × 3
##   sales name  value 
##   <int> <chr> <chr> 
## 1    40 brand BrandA
## 2     7 brand BrandD
## 3    32 brand BrandD
## 4    48 brand BrandD
## 5    19 brand BrandD
## 6     6 brand BrandB

________________________________________________________________________________