Chapter 4 General Statistics


4.1 Tabulating Factors and Creating Contingency Tables

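This section explores how to handle categorical data using factors and contingency tables in R. We will learn how to create factors, build frequency and contingency tables, add margins, compute row and column proportions, and visualize the results.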


4.1.1 Understanding Factors in R

Factors store categorical variables as integer codes with a fixed set of labelled levels, so statistical functions can treat each level as a distinct category.

4.1.1.1 Example: Creating a Factor Variable

# Creating a categorical variable
survey_response <- c("Agree", "Disagree", "Neutral", "Agree", "Agree", "Disagree")

# Convert to factor
survey_factor <- factor(survey_response)

# Print the factor
print(survey_factor)
## [1] Agree    Disagree Neutral  Agree    Agree    Disagree
## Levels: Agree Disagree Neutral

4.1.1.2 Example: Reordering Factor Levels

survey_factor_ordered <- factor(survey_response, levels = c("Disagree", "Neutral", "Agree"), ordered = TRUE)
print(survey_factor_ordered)
## [1] Agree    Disagree Neutral  Agree    Agree    Disagree
## Levels: Disagree < Neutral < Agree
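
Because the levels are ordered, comparison operators work on the factor. The following is a minimal sketch that reuses the survey data from above; the variable names at_least_neutral and level_codes are introduced here for illustration only.

```r
survey_response <- c("Agree", "Disagree", "Neutral", "Agree", "Agree", "Disagree")
survey_factor_ordered <- factor(survey_response,
                                levels = c("Disagree", "Neutral", "Agree"),
                                ordered = TRUE)

# Which responses are at least "Neutral"?
at_least_neutral <- survey_factor_ordered >= "Neutral"
print(at_least_neutral)

# Integer codes follow the level order (Disagree = 1, Neutral = 2, Agree = 3)
level_codes <- as.integer(survey_factor_ordered)
print(level_codes)
```

The same comparison on an unordered factor is not meaningful, which is one reason to set ordered = TRUE for ranked categories.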

4.1.2 Creating Frequency Tables

A frequency table counts the number of occurrences of each category.

4.1.2.1 Example: Using table() to Create a Frequency Table

table(survey_factor)
## survey_factor
##    Agree Disagree  Neutral 
##        3        2        1

4.1.2.2 Example: Getting Proportions with prop.table()

prop.table(table(survey_factor))
## survey_factor
##     Agree  Disagree   Neutral 
## 0.5000000 0.3333333 0.1666667

4.1.3 Creating Contingency Tables

A contingency table (cross-tabulation) is used to summarize two categorical variables.

4.1.3.1 Example: 2-Way Contingency Table

# Sample data
gender <- c("Male", "Female", "Female", "Male", "Male", "Female")

# Creating a contingency table
contingency_table <- table(gender, survey_factor)

# Display the table
print(contingency_table)
##         survey_factor
## gender   Agree Disagree Neutral
##   Female     0        2       1
##   Male       3        0       0
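
When the variables live in a data frame, the same cross-tabulation can be built with a formula. This is a sketch under the assumption that the data are stored in a data frame; survey_df and its column names are hypothetical and chosen to mirror the vectors above.

```r
# Hypothetical data frame holding the same six observations
survey_df <- data.frame(
  gender   = c("Male", "Female", "Female", "Male", "Male", "Female"),
  response = c("Agree", "Disagree", "Neutral", "Agree", "Agree", "Disagree")
)

# Rows come from the first term of the formula, columns from the second
xt <- xtabs(~ gender + response, data = survey_df)
print(xt)
```

The result behaves like the output of table(), so functions such as addmargins() and prop.table() can be applied to it.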

4.1.4 Adding Margins to Contingency Tables

We can add row and column totals using addmargins().

4.1.4.1 Example: Adding Margins

addmargins(contingency_table)
##         survey_factor
## gender   Agree Disagree Neutral Sum
##   Female     0        2       1   3
##   Male       3        0       0   3
##   Sum        3        2       1   6

4.1.5 Computing Row and Column Proportions

The margin argument of prop.table() controls the direction: margin = 1 divides each cell by its row total, and margin = 2 divides each cell by its column total.

4.1.5.1 Example: Row Proportions

prop.table(contingency_table, margin = 1)
##         survey_factor
## gender       Agree  Disagree   Neutral
##   Female 0.0000000 0.6666667 0.3333333
##   Male   1.0000000 0.0000000 0.0000000

4.1.5.2 Example: Column Proportions

prop.table(contingency_table, margin = 2)
##         survey_factor
## gender   Agree Disagree Neutral
##   Female     0        1       1
##   Male       1        0       0

4.1.6 Visualizing Contingency Tables

4.1.6.1 Example: Bar Plot

barplot(contingency_table, beside = TRUE, legend = TRUE, col = c("blue", "red"))

4.1.6.2 Example: Mosaic Plot

mosaicplot(contingency_table, main = "Survey Responses by Gender", col = c("skyblue", "pink"))


4.1.7 Practical Exercises

4.1.7.1 Exercise 1: Working with Factors

  1. Create a factor variable from the following data:
   education <- c("High School", "College", "College", "PhD", "Masters", "High School")
  2. Convert it into an ordered factor with levels: "High School" < "College" < "Masters" < "PhD"
  3. Print the ordered factor.

Solution:

education_factor <- factor(education, levels = c("High School", "College", "Masters", "PhD"), ordered = TRUE)
print(education_factor)
## [1] High School College     College     PhD         Masters     High School
## Levels: High School < College < Masters < PhD

4.1.7.2 Exercise 2: Creating a Contingency Table

  1. Create two categorical variables:
   department <- c("Sales", "HR", "IT", "Sales", "HR", "IT", "Sales", "IT")
   status <- c("Full-Time", "Part-Time", "Full-Time", "Part-Time", "Full-Time", "Full-Time", "Part-Time", "Full-Time")
  2. Generate a contingency table.
  3. Compute row and column proportions.
  4. Add margins to the table.

Solution:

# Creating a contingency table
dept_table <- table(department, status)

# Display the table
print(dept_table)
##           status
## department Full-Time Part-Time
##      HR            1         1
##      IT            3         0
##      Sales         1         2
# Row proportions
prop.table(dept_table, margin = 1)
##           status
## department Full-Time Part-Time
##      HR    0.5000000 0.5000000
##      IT    1.0000000 0.0000000
##      Sales 0.3333333 0.6666667
# Column proportions
prop.table(dept_table, margin = 2)
##           status
## department Full-Time Part-Time
##      HR    0.2000000 0.3333333
##      IT    0.6000000 0.0000000
##      Sales 0.2000000 0.6666667
# Adding margins
addmargins(dept_table)
##           status
## department Full-Time Part-Time Sum
##      HR            1         1   2
##      IT            3         0   3
##      Sales         1         2   3
##      Sum           5         3   8

4.1.7.3 Exercise 3: Titanic Dataset Analysis

Use the built-in Titanic dataset to:

  1. Create a contingency table of passenger class (Class) and survival (Survived).

  2. Compute row and column proportions.

  3. Create a bar plot.

Solution:

# Load required packages
library(dplyr)
library(tidyr)

# Load Titanic dataset
data(Titanic)

# Convert to a proper dataframe and expand the frequency count
Titanic_df <- as.data.frame(Titanic) %>%
  uncount(Freq)  # Expands rows based on the frequency column

# Check the structure of the dataframe
str(Titanic_df)
## 'data.frame':    2201 obs. of  4 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
# Create a contingency table using the correct column names
titanic_table <- table(Titanic_df$Class, Titanic_df$Survived)

# Print the table
print(titanic_table)
##       
##         No Yes
##   1st  122 203
##   2nd  167 118
##   3rd  528 178
##   Crew 673 212
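
The solution above covers only the first step. Steps 2 and 3 could be completed along the following lines. This sketch swaps in base R's margin.table() to collapse the built-in table instead of the dplyr/tidyr approach, so it runs without extra packages; the names row_props and col_props are illustrative.

```r
# Collapse the built-in 4-way Titanic table to Class x Survived
# (dimension 1 is Class, dimension 4 is Survived)
titanic_table <- margin.table(Titanic, margin = c(1, 4))

# Step 2: row proportions (survival rate within each class)
row_props <- prop.table(titanic_table, margin = 1)
print(round(row_props, 3))

# Step 2: column proportions (class make-up of survivors and non-survivors)
col_props <- prop.table(titanic_table, margin = 2)
print(round(col_props, 3))

# Step 3: grouped bar plot, one pair of bars per class
barplot(t(titanic_table), beside = TRUE, legend.text = TRUE,
        col = c("red", "blue"),
        main = "Survival by Passenger Class",
        xlab = "Class", ylab = "Number of people")
```

The row proportions are usually the more informative view here, because they give the survival rate within each class.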

4.2 Calculating Quantiles

Quantiles are statistical measures that divide a dataset into equal parts. They help summarize distributions, identify outliers, and assess skewness.

4.2.1 Understanding Quantiles

Quantiles divide data into equal-sized groups:

  • Median (50th percentile): The middle value of a dataset.

  • Quartiles (25th, 50th, 75th percentiles): Divide data into four equal parts.

  • Deciles (10th, 20th, …, 90th percentiles): Divide data into ten equal parts.

  • Percentiles (1st, 2nd, …, 99th percentiles): Divide data into 100 equal parts.

4.2.2 Computing Quantiles in R

4.2.2.1 Example: Finding Quartiles

# Create a dataset
data <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)

# Compute quartiles
quantile(data)
##   0%  25%  50%  75% 100% 
##    3    7   12   14   21

  • 0%: Minimum value
  • 25% (Q1): First quartile
  • 50% (Q2): Median
  • 75% (Q3): Third quartile
  • 100%: Maximum value

4.2.3 Computing Specific Quantiles

4.2.3.1 Example: Finding the 10th and 90th Percentiles

quantile(data, probs = c(0.10, 0.90))
##  10%  90% 
##  4.6 18.6
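
The values 4.6 and 18.6 are not in the dataset because quantile() interpolates linearly between neighbouring sorted values by default (its type = 7 method). The sketch below reproduces that calculation by hand; manual_quantile is a hypothetical helper written for illustration, not part of R.

```r
data <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)

# Hypothetical helper: default (type 7) quantile by linear interpolation
manual_quantile <- function(x, p) {
  x <- sort(x)
  h <- (length(x) - 1) * p + 1   # fractional position in the sorted data
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

q10 <- manual_quantile(data, 0.10)
q90 <- manual_quantile(data, 0.90)
print(c(q10, q90))
```

For the 10th percentile the position is 1.8, so the result lies 80% of the way from the smallest value (3) to the second smallest (5).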

4.2.4 Using summary() for Quick Insights

summary(data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    7.00   12.00   11.22   14.00   21.00

The summary() function provides:

  • Min: Minimum value

  • 1st Qu. (Q1, 25%): First quartile

  • Median (Q2, 50%): Second quartile

  • Mean: Average value

  • 3rd Qu. (Q3, 75%): Third quartile

  • Max: Maximum value


4.2.5 Visualizing Quantiles

4.2.5.1 Example: Boxplot to Show Quartiles

boxplot(data, main = "Boxplot of Data", col = "lightblue")

A boxplot shows:

  • The median (Q2) as a thick line in the box

  • The interquartile range (IQR, Q1 to Q3) as the box

  • Outliers as individual points


4.2.6 Finding Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of the data:

IQR(data)
## [1] 7

Or compute it manually:

iqr_value <- quantile(data, 0.75) - quantile(data, 0.25)
print(iqr_value)
## 75% 
##   7

4.2.7 Finding Outliers Using IQR

Under the IQR rule, outliers are values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

4.2.7.1 Example: Detecting Outliers

# Compute quartiles
q1 <- quantile(data, 0.25)
q3 <- quantile(data, 0.75)
iqr_value <- IQR(data)

# Define outlier thresholds
lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value

# Find outliers
outliers <- data[data < lower_bound | data > upper_bound]
print(outliers)
## numeric(0)
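
The steps above can be wrapped in a small reusable function. This is a sketch only: iqr_outliers and its argument k are hypothetical names, and the value 40 appended in the second call is made up to show a flagged point.

```r
# Hypothetical helper implementing the 1.5 * IQR rule
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

data <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)

no_outliers <- iqr_outliers(data)           # nothing is flagged
with_outlier <- iqr_outliers(c(data, 40))   # the added value 40 is flagged
print(no_outliers)
print(with_outlier)
```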

4.2.8 Hands-on Exercises

4.2.8.1 Exercise 1: Computing Quartiles

  1. Create a dataset:
scores <- c(55, 78, 85, 90, 92, 60, 73, 81, 95, 88)
  2. Compute Q1, Q2 (median), and Q3.
  3. Calculate the IQR.
  4. Plot a boxplot.

Solution:

quantile(scores)
##    0%   25%   50%   75%  100% 
## 55.00 74.25 83.00 89.50 95.00
IQR(scores)
## [1] 15.25
boxplot(scores, main = "Exam Scores", col = "lightblue")


4.2.8.2 Exercise 2: Computing Custom Quantiles

  1. Use the dataset:
heights <- c(150, 160, 165, 170, 175, 180, 185, 190, 195, 200)
  2. Find the 5th, 25th, 50th, 75th, and 95th percentiles.

Solution:

quantile(heights, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
##     5%    25%    50%    75%    95% 
## 154.50 166.25 177.50 188.75 197.75

4.2.8.3 Exercise 3: Finding Outliers

  1. Use the dataset:
salaries <- c(40000, 42000, 45000, 47000, 50000, 52000, 55000, 58000, 60000, 100000)
  2. Compute Q1, Q3, and IQR.
  3. Identify outliers.

Solution:

q1 <- quantile(salaries, 0.25)
q3 <- quantile(salaries, 0.75)
iqr_value <- IQR(salaries)

lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value

outliers <- salaries[salaries < lower_bound | salaries > upper_bound]
print(outliers)
## [1] 1e+05


4.3 z-Scores

The z-score (also called the standard score) tells us how many standard deviations a data point is from the mean. It is a useful measure for comparing values across different distributions and detecting outliers.

4.3.1 Understanding z-Scores

The z-score formula is:

\[ z = \frac{x - \mu}{\sigma} \]

Where:

  • \(x\) = data point

  • \(\mu\) = mean of the dataset

  • \(\sigma\) = standard deviation of the dataset

Interpretation:

  • \(z = 0\) → The value is equal to the mean.

  • \(z > 0\) → The value is above the mean.

  • \(z < 0\) → The value is below the mean.

  • \(|z| > 2\) → The value is unusual.

  • \(|z| > 3\) → The value is potentially an outlier.


4.3.2 Computing z-Scores in R

4.3.2.1 Example: Computing z-Scores Manually

# Sample data
data <- c(50, 55, 60, 65, 70, 75, 80, 85, 90, 95)

# Compute mean and standard deviation
mean_value <- mean(data)
sd_value <- sd(data)

# Compute z-scores
z_scores <- (data - mean_value) / sd_value
print(z_scores)
##  [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446  0.1651446  0.4954337
##  [8]  0.8257228  1.1560120  1.4863011

4.3.3 Computing z-Scores Using scale()

The scale() function standardizes a dataset (converts it into z-scores):

# Compute z-scores using scale()
z_scaled <- scale(data)
print(z_scaled)
##             [,1]
##  [1,] -1.4863011
##  [2,] -1.1560120
##  [3,] -0.8257228
##  [4,] -0.4954337
##  [5,] -0.1651446
##  [6,]  0.1651446
##  [7,]  0.4954337
##  [8,]  0.8257228
##  [9,]  1.1560120
## [10,]  1.4863011
## attr(,"scaled:center")
## [1] 72.5
## attr(,"scaled:scale")
## [1] 15.13825

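The scale() function centers the data on its mean and divides by its standard deviation. It returns a one-column matrix with the centering and scaling values stored as attributes; wrap the result in as.numeric() to get a plain vector.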


4.3.4 Interpreting z-Scores

Let’s compute the z-score of 80 from our dataset:

z_80 <- (80 - mean_value) / sd_value
print(z_80)
## [1] 0.4954337

Since \(z \approx 0.5\), the value 80 is about 0.5 standard deviations above the mean.
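
If the data can be treated as roughly normal, a z-score can also be converted into an approximate percentile with pnorm(). That normality assumption is not stated in the text and does not really hold for this evenly spaced toy dataset, so the sketch below is an illustration only; percentile_80 is a name introduced here.

```r
data <- c(50, 55, 60, 65, 70, 75, 80, 85, 90, 95)

# z-score of the value 80
z_80 <- (80 - mean(data)) / sd(data)

# Share of a standard normal distribution lying below z_80
percentile_80 <- pnorm(z_80)
print(percentile_80)
```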


4.3.5 Using z-Scores to Detect Outliers

A common rule of thumb treats values with \(|z| > 3\) as outliers.

4.3.5.1 Example: Finding Outliers

# Identify values with |z| > 3
outliers <- data[abs(z_scores) > 3]
print(outliers)
## numeric(0)

4.3.6 Visualizing z-Scores

4.3.6.1 Example: Histogram of z-Scores

hist(z_scores, main = "Histogram of z-Scores", col = "skyblue", xlab = "z-Scores")
abline(v = c(-3, 3), col = "red", lwd = 2) # Marking outlier thresholds

4.3.6.2 Example: Standard Normal Curve with z-Scores

x <- seq(-4, 4, length=100)
y <- dnorm(x)

plot(x, y, type="l", lwd=2, col="blue", main="Standard Normal Distribution")
abline(v = c(-3, -2, -1, 0, 1, 2, 3), col="red", lty=2) # Mark z-scores


4.3.7 Hands-on Exercises

4.3.7.1 Exercise 1: Compute z-Scores

  1. Use the dataset:
   heights <- c(150, 160, 165, 170, 175, 180, 185, 190, 195, 200)
  2. Compute the mean and standard deviation.

  3. Calculate the z-scores.

  4. Find any outliers (\(|z| > 3\)).

Solution:

mean_height <- mean(heights)
sd_height <- sd(heights)
z_scores <- (heights - mean_height) / sd_height
outliers <- heights[abs(z_scores) > 3]
print(z_scores)
##  [1] -1.6853070 -1.0611192 -0.7490253 -0.4369314 -0.1248376  0.1872563  0.4993502
##  [8]  0.8114441  1.1235380  1.4356319
print(outliers)
## numeric(0)

4.3.7.2 Exercise 2: Standardize Data Using scale()

  1. Use the dataset:
   weights <- c(55, 60, 65, 70, 75, 80, 85, 90, 95, 100)
  2. Standardize the data using scale().

  3. Plot a histogram of the z-scores.

Solution:

z_weights <- scale(weights)
hist(z_weights, main = "Histogram of Standardized Weights", col = "lightgreen")


4.3.7.3 Exercise 3: Identifying Outliers

  1. Use the dataset:
   salaries <- c(40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 200000)
  2. Compute z-scores.

  3. Identify outliers.

Solution:

mean_salary <- mean(salaries)
sd_salary <- sd(salaries)
z_salaries <- (salaries - mean_salary) / sd_salary
outliers <- salaries[abs(z_salaries) > 3]
print(outliers)
## numeric(0)
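
The salary of 200000 is not flagged even though it is far from the rest. In a sample of size n, the largest possible |z| computed with sd() is (n - 1) / sqrt(n), which is below 3 when n = 10, so the \(|z| > 3\) rule cannot flag anything in a sample this small. The sketch below checks this and compares it with the IQR rule from Section 4.2.7; the variable names are illustrative.

```r
salaries <- c(40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 200000)
n <- length(salaries)

z_salaries <- (salaries - mean(salaries)) / sd(salaries)
max_abs_z <- max(abs(z_salaries))   # about 2.73, from the 200000 salary
z_ceiling <- (n - 1) / sqrt(n)      # about 2.85, the largest |z| possible for n = 10
print(c(max_abs_z, z_ceiling))

# The IQR rule does flag the extreme salary
q <- quantile(salaries, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
iqr_flagged <- salaries[salaries < q[1] - fence | salaries > q[2] + fence]
print(iqr_flagged)
```

For small samples, the IQR rule is therefore often the more useful screening tool.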

4.4 Inferential Statistics

Inferential statistics allows us to draw conclusions about a population based on a sample.

4.4.1 Basic Concepts

4.4.1.1 Population vs. Sample

  • Population: The entire group we want to study.

  • Sample: A subset of the population used for analysis.

4.4.1.2 Parameter vs. Statistic

  • Parameter: A value that describes the population.

  • Statistic: A value computed from a sample.

4.4.1.3 Common Inferential Techniques

  1. Confidence Intervals – Estimating population values.

  2. Hypothesis Testing – Testing claims about populations.


4.4.2 Confidence Intervals

A confidence interval (CI) gives a range where we expect a population parameter to lie.

4.4.2.1 Example: Confidence Interval for a Mean

# Sample data
data <- c(50, 55, 60, 65, 70, 75, 80, 85, 90, 95)

# Mean and standard deviation
mean_data <- mean(data)
sd_data <- sd(data)
n <- length(data)

# Compute confidence interval (95% confidence)
error_margin <- qt(0.975, df=n-1) * (sd_data / sqrt(n))
lower_bound <- mean_data - error_margin
upper_bound <- mean_data + error_margin

# Print confidence interval
c(lower_bound, upper_bound)
## [1] 61.67075 83.32925

We are 95% confident that the population mean lies between about 61.7 and 83.3.
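
The same interval can be obtained without the manual arithmetic, since t.test() reports a confidence interval for the mean. A minimal sketch using the data from the example above:

```r
data <- c(50, 55, 60, 65, 70, 75, 80, 85, 90, 95)

# t.test() returns the 95% confidence interval in its conf.int component
ci <- t.test(data)$conf.int
print(ci)

# A different confidence level can be requested with conf.level
ci_99 <- t.test(data, conf.level = 0.99)$conf.int
print(ci_99)
```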


4.4.3 Hypothesis Testing

Hypothesis testing helps us determine whether a claim about a population is supported by sample data.

4.4.4 Steps in Hypothesis Testing

  1. State the null (\(H_0\)) and alternative (\(H_A\)) hypotheses.

  2. Select a significance level (\(\alpha\)).

  3. Compute the test statistic.

  4. Compare the test statistic to a critical value or p-value.

  5. Make a conclusion.
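
The five steps can be traced by hand for a one-sample t-test. This is a sketch rather than the recommended workflow (t.test() does all of this in one call); it uses the sample that appears in the one-sample example of this chapter, and mu0, alpha, and reject_h0 are names introduced here.

```r
sample_data <- c(65, 68, 72, 75, 70, 66, 71, 69, 74, 67)

# Step 1: hypotheses, H0: mu = 70 versus HA: mu != 70
mu0 <- 70

# Step 2: significance level
alpha <- 0.05

# Step 3: test statistic
n <- length(sample_data)
t_stat <- (mean(sample_data) - mu0) / (sd(sample_data) / sqrt(n))

# Step 4: two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Step 5: conclusion
reject_h0 <- p_value < alpha
print(c(t = t_stat, p = p_value))
print(reject_h0)
```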


4.4.5 One-Sample t-Test

A one-sample t-test checks whether the sample mean differs significantly from a hypothesized population mean.

4.4.5.1 Example: Testing If a Sample Mean Differs from 70

# Sample data
sample_data <- c(65, 68, 72, 75, 70, 66, 71, 69, 74, 67)

# One-sample t-test (H0: Mean = 70)

t.test(sample_data, mu = 70)
## 
##  One Sample t-test
## 
## data:  sample_data
## t = -0.28446, df = 9, p-value = 0.7825
## alternative hypothesis: true mean is not equal to 70
## 95 percent confidence interval:
##  67.31429 72.08571
## sample estimates:
## mean of x 
##      69.7
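
  • If p-value < 0.05, reject \(H_0\).

  • If p-value ≥ 0.05, fail to reject \(H_0\).

Here the p-value is 0.7825, so we fail to reject \(H_0\): the data are consistent with a population mean of 70.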


4.4.6 Comparing Two Sample Means

A two-sample t-test compares the means of two independent groups.

4.4.6.1 Example: Comparing Male vs. Female Heights

# Sample data
male_heights <- c(170, 175, 180, 185, 190, 195)
female_heights <- c(160, 165, 168, 170, 175, 178)

# Two-sample t-test
t.test(male_heights, female_heights)
## 
##  Welch Two Sample t-test
## 
## data:  male_heights and female_heights
## t = 2.8225, df = 8.9621, p-value = 0.02005
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.607164 23.726169
## sample estimates:
## mean of x mean of y 
##  182.5000  169.3333
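
  • If p-value < 0.05, the means are significantly different.

  • If p-value ≥ 0.05, the means are not significantly different.

Here the p-value is 0.02005, which is below 0.05, so the difference between the male and female mean heights is statistically significant.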


4.4.7 Testing Proportions

A proportion test is used for categorical data.

4.4.7.1 Example: Testing if 60% of People Prefer Brand A

# Sample data: 55 people prefer Brand A out of 100
prop.test(55, 100, p = 0.60)
## 
##  1-sample proportions test with continuity correction
## 
## data:  55 out of 100, null probability 0.6
## X-squared = 0.84375, df = 1, p-value = 0.3583
## alternative hypothesis: true p is not equal to 0.6
## 95 percent confidence interval:
##  0.4475426 0.6485719
## sample estimates:
##    p 
## 0.55
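
  • If p-value < 0.05, reject \(H_0\).

  • If p-value ≥ 0.05, fail to reject \(H_0\).

Here the p-value is 0.3583, so we fail to reject \(H_0\): the data are consistent with 60% of people preferring Brand A.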


4.4.8 Practical Exercises

4.4.8.1 Exercise 1: Confidence Interval for Population Mean

  1. Use the dataset:
weights <- c(55, 60, 65, 70, 75, 80, 85, 90, 95, 100)
  2. Compute a 95% confidence interval for the population mean.

Solution:

mean_weights <- mean(weights)
sd_weights <- sd(weights)
n_weights <- length(weights)

# Compute CI
error_margin <- qt(0.975, df=n_weights-1) * (sd_weights / sqrt(n_weights))
c(mean_weights - error_margin, mean_weights + error_margin)
## [1] 66.67075 88.32925

4.4.8.2 Exercise 2: Hypothesis Test for a Mean

  1. Use the dataset:
test_scores <- c(78, 82, 85, 90, 88, 79, 84, 87, 92, 81)
  2. Test whether the mean test score is greater than 80.

Solution:

t.test(test_scores, mu = 80, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  test_scores
## t = 3.1139, df = 9, p-value = 0.00622
## alternative hypothesis: true mean is greater than 80
## 95 percent confidence interval:
##  81.89206      Inf
## sample estimates:
## mean of x 
##      84.6

4.4.8.3 Exercise 3: Comparing Two Sample Means

  1. Create two samples:
group_A <- c(15, 18, 20, 22, 25, 27, 30)
group_B <- c(17, 19, 21, 24, 26, 28, 32)
  2. Perform a two-sample t-test.

Solution:

t.test(group_A, group_B)
## 
##  Welch Two Sample t-test
## 
## data:  group_A and group_B
## t = -0.50767, df = 12, p-value = 0.6209
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.559670  4.702527
## sample estimates:
## mean of x mean of y 
##  22.42857  23.85714

4.4.8.4 Exercise 4: Testing a Sample Proportion

  1. Suppose 45 out of 100 people prefer Product X.

  2. Test if the true proportion is different from 50%.

Solution:

prop.test(45, 100, p = 0.50)
## 
##  1-sample proportions test with continuity correction
## 
## data:  45 out of 100, null probability 0.5
## X-squared = 0.81, df = 1, p-value = 0.3681
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.3514281 0.5524574
## sample estimates:
##    p 
## 0.45



4.5 Testing the Mean of a Sample (t-Test) and its Confidence Interval

A t-test is used to test whether the mean of a sample is significantly different from a hypothesized population mean. It helps answer questions like:

  • “Is the average test score significantly different from 70?”

  • “Does the sample data suggest a real effect, or is it due to random chance?”


4.5.1 Understanding the t-Test

The one-sample t-test formula:

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]

Where:

  • \(\bar{x}\) = sample mean

  • \(\mu\) = hypothesized population mean

  • \(s\) = sample standard deviation

  • \(n\) = sample size

4.5.1.1 Key Assumptions:

  • The data are approximately normally distributed (or the sample is large, commonly \(n > 30\)).

  • The sample is randomly selected.

  • The population standard deviation is unknown (if known, use a z-test instead).


4.5.2 Computing Confidence Intervals for the Mean

A confidence interval (CI) provides a range where the true population mean is expected to lie.

4.5.2.1 Example: 95% Confidence Interval for a Sample Mean

# Sample data
data <- c(55, 60, 65, 70, 75, 80, 85, 90, 95, 100)

# Compute mean, standard deviation, and sample size
mean_data <- mean(data)
sd_data <- sd(data)
n <- length(data)

# Compute confidence interval (95% confidence level)
error_margin <- qt(0.975, df=n-1) * (sd_data / sqrt(n))
lower_bound <- mean_data - error_margin
upper_bound <- mean_data + error_margin

# Print confidence interval
c(lower_bound, upper_bound)
## [1] 66.67075 88.32925

We are 95% confident that the population mean lies between about 66.7 and 88.3.


4.5.3 Performing a One-Sample t-Test

A one-sample t-test checks whether the sample mean is significantly different from a given value.

4.5.3.1 Example: Testing if the Mean is Different from 70

# Sample data
sample_data <- c(65, 68, 72, 75, 70, 66, 71, 69, 74, 67)

# Perform one-sample t-test
t.test(sample_data, mu = 70)
## 
##  One Sample t-test
## 
## data:  sample_data
## t = -0.28446, df = 9, p-value = 0.7825
## alternative hypothesis: true mean is not equal to 70
## 95 percent confidence interval:
##  67.31429 72.08571
## sample estimates:
## mean of x 
##      69.7

  • If p-value < 0.05, reject \(H_0\) → The mean is significantly different from 70.

  • If p-value ≥ 0.05, fail to reject \(H_0\) → No significant difference.

Here the p-value is 0.7825, so we fail to reject \(H_0\).


4.5.4 One-Sided vs. Two-Sided Tests

By default, t.test() performs a two-sided test (\(H_A: \mu \neq 70\)).

To test whether the mean is greater than or less than a value, set the alternative argument:

4.5.4.1 Example: Testing if Mean is Greater Than 70

t.test(sample_data, mu = 70, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  sample_data
## t = -0.28446, df = 9, p-value = 0.6088
## alternative hypothesis: true mean is greater than 70
## 95 percent confidence interval:
##  67.76676      Inf
## sample estimates:
## mean of x 
##      69.7

4.5.4.2 Example: Testing if Mean is Less Than 70

t.test(sample_data, mu = 70, alternative = "less")
## 
##  One Sample t-test
## 
## data:  sample_data
## t = -0.28446, df = 9, p-value = 0.3912
## alternative hypothesis: true mean is less than 70
## 95 percent confidence interval:
##      -Inf 71.63324
## sample estimates:
## mean of x 
##      69.7
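
The three p-values reported for this sample are related: the two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them. The sketch below recomputes them from the t statistic with pt(); the variable names are illustrative.

```r
sample_data <- c(65, 68, 72, 75, 70, 66, 71, 69, 74, 67)
n <- length(sample_data)
t_stat <- (mean(sample_data) - 70) / (sd(sample_data) / sqrt(n))

# One-sided p-values: area below and above the observed t statistic
p_less    <- pt(t_stat, df = n - 1)
p_greater <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# Two-sided p-value: twice the smaller tail area
p_two_sided <- 2 * min(p_less, p_greater)

print(c(less = p_less, greater = p_greater, two_sided = p_two_sided))
```

Because the sample mean (69.7) is below 70, the "less" alternative has the smaller p-value, although neither one-sided test is significant here.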

4.5.5 Comparing Two Sample Means (Independent t-Test)

A two-sample t-test checks if two groups have significantly different means.

4.5.5.1 Example: Comparing Male vs. Female Heights

# Sample data
male_heights <- c(170, 175, 180, 185, 190, 195)
female_heights <- c(160, 165, 168, 170, 175, 178)

# Perform independent t-test
t.test(male_heights, female_heights)
## 
##  Welch Two Sample t-test
## 
## data:  male_heights and female_heights
## t = 2.8225, df = 8.9621, p-value = 0.02005
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.607164 23.726169
## sample estimates:
## mean of x mean of y 
##  182.5000  169.3333

4.5.6 Paired t-Test (Dependent Samples)

A paired t-test compares two measurements taken on the same subjects, such as scores before and after a treatment.

4.5.6.1 Example: Testing Before vs. After Training Scores

# Scores before and after training
before <- c(60, 65, 70, 75, 80, 85, 90)
after  <- c(65, 68, 75, 78, 85, 88, 92)

# Perform paired t-test
t.test(before, after, paired = TRUE)
## 
##  Paired t-test
## 
## data:  before and after
## t = -7.8393, df = 6, p-value = 0.0002277
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -4.873641 -2.554930
## sample estimates:
## mean difference 
##       -3.714286
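
A paired t-test is equivalent to a one-sample t-test on the within-pair differences, which explains why the output reports a single "mean difference". A short sketch confirming this with the training scores above (paired_result and diff_result are names introduced here):

```r
before <- c(60, 65, 70, 75, 80, 85, 90)
after  <- c(65, 68, 75, 78, 85, 88, 92)

# Paired test on the two vectors
paired_result <- t.test(before, after, paired = TRUE)

# One-sample test on the differences against a mean of 0
diff_result <- t.test(before - after, mu = 0)

print(c(paired = unname(paired_result$statistic),
        one_sample = unname(diff_result$statistic)))
```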

4.5.7 Practical Exercises

4.5.7.1 Exercise 1: Compute a Confidence Interval

  1. Use the dataset:
scores <- c(78, 82, 85, 90, 88, 79, 84, 87, 92, 81)
  2. Compute a 95% confidence interval for the mean.

Solution:

mean_scores <- mean(scores)
sd_scores <- sd(scores)
n_scores <- length(scores)

# Compute CI
error_margin <- qt(0.975, df=n_scores-1) * (sd_scores / sqrt(n_scores))
c(mean_scores - error_margin, mean_scores + error_margin)
## [1] 81.25826 87.94174

4.5.7.2 Exercise 2: One-Sample t-Test

  1. Use the dataset:
weights <- c(55, 60, 65, 70, 75, 80, 85, 90, 95, 100)
  2. Test whether the mean is different from 72.

Solution:

t.test(weights, mu = 72)
## 
##  One Sample t-test
## 
## data:  weights
## t = 1.1489, df = 9, p-value = 0.2802
## alternative hypothesis: true mean is not equal to 72
## 95 percent confidence interval:
##  66.67075 88.32925
## sample estimates:
## mean of x 
##      77.5

4.5.7.3 Exercise 3: Comparing Two Groups

  1. Use the dataset:
group_A <- c(15, 18, 20, 22, 25, 27, 30)
group_B <- c(17, 19, 21, 24, 26, 28, 32)
  2. Perform an independent two-sample t-test.

Solution:

t.test(group_A, group_B)
## 
##  Welch Two Sample t-test
## 
## data:  group_A and group_B
## t = -0.50767, df = 12, p-value = 0.6209
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.559670  4.702527
## sample estimates:
## mean of x mean of y 
##  22.42857  23.85714

4.5.7.4 Exercise 4: Paired t-Test

  1. A study measures reaction time before and after caffeine consumption:
before_caffeine <- c(300, 320, 310, 305, 315, 290, 295)
after_caffeine  <- c(280, 300, 290, 285, 295, 275, 280)
  2. Perform a paired t-test to determine if caffeine affects reaction time.

Solution:

t.test(before_caffeine, after_caffeine, paired = TRUE)
## 
##  Paired t-test
## 
## data:  before_caffeine and after_caffeine
## t = 20.14, df = 6, p-value = 9.733e-07
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  16.31504 20.82782
## sample estimates:
## mean difference 
##        18.57143

4.6 Testing a Sample Proportion and its Confidence Interval

A proportion test is used when we want to make inferences about categorical data. This test helps us:

  • Estimate the proportion of a population with a certain characteristic.

  • Determine whether a sample proportion differs significantly from a hypothesized value.


4.6.1 Understanding Proportion Testing

A sample proportion is calculated as:

\[ \hat{p} = \frac{x}{n} \]

Where:

  • \(x\) = Number of successes (e.g., people who answered “Yes”)

  • \(n\) = Total number of observations

The confidence interval (CI) for a proportion is given by:

\[ \hat{p} \pm Z \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

Where:

  • \(Z\) = Critical value for the confidence level (e.g., 1.96 for 95%)

  • \(\hat{p}\) = Sample proportion

  • \(n\) = Sample size


4.6.2 Computing Confidence Intervals for Proportions

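We can calculate confidence intervals for proportions using prop.test(). Note that prop.test() reports a Wilson score interval rather than the normal-approximation interval given by the formula above, so its limits differ slightly from a hand calculation with that formula.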

4.6.2.1 Example: 95% Confidence Interval for Proportion

Suppose 60 out of 100 people prefer Brand A.

# Number of successes (people preferring Brand A)
x <- 60
# Total sample size
n <- 100

# Compute confidence interval
prop.test(x, n, conf.level = 0.95, correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  x out of n, null probability 0.5
## X-squared = 4, df = 1, p-value = 0.0455
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.5020026 0.6905987
## sample estimates:
##   p 
## 0.6

We are 95% confident that the true proportion of people who prefer Brand A lies between about 0.50 and 0.69.
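
The interval from prop.test() does not exactly match the formula in Section 4.6.1, because prop.test() uses a score (Wilson) interval while the formula is the simpler normal-approximation (Wald) interval. The sketch below computes both for comparison; wald_ci and wilson_ci are names introduced here.

```r
x <- 60
n <- 100
p_hat <- x / n

# Normal-approximation (Wald) interval from the formula
z <- qnorm(0.975)
wald_ci <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Score (Wilson) interval as reported by prop.test()
wilson_ci <- prop.test(x, n, correct = FALSE)$conf.int

print(round(wald_ci, 4))
print(round(as.numeric(wilson_ci), 4))
```

With 60 successes out of 100 the two intervals agree to about two decimal places; the gap tends to grow for small samples or proportions near 0 or 1.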


4.6.3 Performing a One-Sample Proportion Test

We test whether a sample proportion is significantly different from a hypothesized proportion \(p_0\).

4.6.3.1 Example: Testing if 60% Prefer Brand A

We test:

\[ H_0: p = 0.60 \]

\[ H_A: p \neq 0.60 \]

prop.test(x, n, p = 0.60, correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  x out of n, null probability 0.6
## X-squared = 0, df = 1, p-value = 1
## alternative hypothesis: true p is not equal to 0.6
## 95 percent confidence interval:
##  0.5020026 0.6905987
## sample estimates:
##   p 
## 0.6
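
  • If p-value < 0.05, reject \(H_0\) → The sample proportion is significantly different from 60%.

  • If p-value ≥ 0.05, fail to reject \(H_0\) → No significant difference.

Here the sample proportion is exactly 0.60, so the test statistic is 0 and the p-value is 1: we fail to reject \(H_0\).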


4.6.4 One-Sided Proportion Tests

To test whether the proportion is greater than or less than a given value, set the alternative argument:

4.6.4.1 Example: Testing if Proportion is Greater Than 50%

prop.test(x, n, p = 0.50, alternative = "greater", correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  x out of n, null probability 0.5
## X-squared = 4, df = 1, p-value = 0.02275
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.5178095 1.0000000
## sample estimates:
##   p 
## 0.6

4.6.4.2 Example: Testing if Proportion is Less Than 70%

prop.test(x, n, p = 0.70, alternative = "less", correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  x out of n, null probability 0.7
## X-squared = 4.7619, df = 1, p-value = 0.01455
## alternative hypothesis: true p is less than 0.7
## 95 percent confidence interval:
##  0.0000000 0.6769219
## sample estimates:
##   p 
## 0.6

4.6.5 Comparing Two Sample Proportions

We can compare two proportions to determine if they are significantly different.

4.6.5.1 Example: Comparing Success Rates of Two Groups

  • Group 1: 30 successes out of 50

  • Group 2: 45 successes out of 80

prop.test(c(30, 45), c(50, 80), correct = FALSE)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(30, 45) out of c(50, 80)
## X-squared = 0.17727, df = 1, p-value = 0.6737
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1364425  0.2114425
## sample estimates:
## prop 1 prop 2 
## 0.6000 0.5625

  • If p-value < 0.05, the two proportions are significantly different.

  • If p-value ≥ 0.05, no significant difference.

Here the p-value is 0.6737, so there is no significant difference between the two success rates.
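
Without the continuity correction, the X-squared value from the two-sample prop.test() is the square of the familiar pooled two-proportion z statistic. The sketch below computes that z statistic by hand for the two groups above; the pooled-proportion formula is an addition not shown in the text, and the variable names are illustrative.

```r
successes <- c(30, 45)
total <- c(50, 80)

p_hat <- successes / total               # 0.6000 and 0.5625
p_pool <- sum(successes) / sum(total)    # pooled proportion under H0

# Standard error of the difference under H0, then the z statistic
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / total))
z <- (p_hat[1] - p_hat[2]) / se

z_squared <- z^2                         # matches X-squared from prop.test()
p_value <- 2 * pnorm(-abs(z))            # two-sided p-value
print(c(z = z, z_squared = z_squared, p = p_value))
```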


4.6.6 Visualizing Proportions

4.6.6.1 Example: Bar Plot of Proportions

successes <- c(30, 45)
total <- c(50, 80)
proportions <- successes / total

barplot(proportions, names.arg = c("Group 1", "Group 2"), col = c("blue", "red"),
        main = "Comparison of Two Proportions", ylim = c(0, 1), ylab = "Proportion")


4.6.7 Practical Exercises

4.6.7.1 Exercise 1: Compute a Confidence Interval

  1. A survey shows that 150 out of 500 people support a new policy.

  2. Compute a 95% confidence interval for the proportion.

Solution:

prop.test(150, 500, conf.level = 0.95, correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  150 out of 500, null probability 0.5
## X-squared = 80, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2614819 0.3415678
## sample estimates:
##   p 
## 0.3

4.6.7.2 Exercise 2: One-Sample Proportion Test

  1. A sample of 200 students finds that 140 prefer online learning.

  2. Test if the proportion is different from 65%.

Solution:

prop.test(140, 200, p = 0.65, correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  140 out of 200, null probability 0.65
## X-squared = 2.1978, df = 1, p-value = 0.1382
## alternative hypothesis: true p is not equal to 0.65
## 95 percent confidence interval:
##  0.6332093 0.7592526
## sample estimates:
##   p 
## 0.7

4.6.7.3 Exercise 3: One-Sided Proportion Test

  1. In a company, 45 out of 100 employees prefer remote work.

  2. Test if the proportion is greater than 40%.

Solution:

prop.test(45, 100, p = 0.40, alternative = "greater", correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  45 out of 100, null probability 0.4
## X-squared = 1.0417, df = 1, p-value = 0.1537
## alternative hypothesis: true p is greater than 0.4
## 95 percent confidence interval:
##  0.370561 1.000000
## sample estimates:
##    p 
## 0.45

4.6.7.4 Exercise 4: Comparing Two Proportions

  1. Two groups were surveyed:
  • Group A: 85 out of 150 prefer a new product.

  • Group B: 75 out of 130 prefer the new product.

  2. Test whether the proportions are significantly different.

Solution:

prop.test(c(85, 75), c(150, 130), correct = FALSE)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(85, 75) out of c(150, 130)
## X-squared = 0.029915, df = 1, p-value = 0.8627
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1264510  0.1059382
## sample estimates:
##    prop 1    prop 2 
## 0.5666667 0.5769231
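
To see where the X-squared value comes from, the following sketch recomputes the test as a pooled two-proportion z-test; without the continuity correction, the square of the z statistic should equal the chi-squared statistic from prop.test(). The names p_pool, se_pool and z_squared are illustrative.

```r
# Pooled two-proportion z-test for Group A (85/150) vs Group B (75/130)
successes <- c(85, 75)
totals    <- c(150, 130)

p_hat   <- successes / totals               # sample proportions
p_pool  <- sum(successes) / sum(totals)     # pooled proportion under H0
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / totals))

z         <- (p_hat[1] - p_hat[2]) / se_pool
z_squared <- z^2                            # comparable to X-squared
p_value   <- 2 * pnorm(-abs(z))             # two-sided p-value

print(c(z = z, z_squared = z_squared, p_value = p_value))
```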


4.7 Comparing the Means of Two Samples

Comparing the means of two samples lets us determine whether two groups, or two measurements on the same subjects, differ significantly.


4.7.1 Understanding Two-Sample t-Test

The two-sample t-test checks whether the means of two independent groups are significantly different.

Hypotheses:

  • Null Hypothesis (\(H_0\)): The two group means are equal.

  • Alternative Hypothesis (\(H_A\)): The two group means are different.

Test statistic (unequal-variance form):

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

Where:

  • \(\bar{x}_1, \bar{x}_2\) = Sample means

  • \(s_1, s_2\) = Sample standard deviations

  • \(n_1, n_2\) = Sample sizes


4.7.2 Independent (Unpaired) t-Test

This test is used when the two samples are independent. By default, t.test() runs Welch’s version of the test, which does not assume equal variances.

4.7.2.1 Example: Comparing Heights of Males and Females

# Sample data
male_heights <- c(170, 175, 180, 185, 190, 195)
female_heights <- c(160, 165, 168, 170, 175, 178)

# Perform independent t-test
t.test(male_heights, female_heights)
## 
##  Welch Two Sample t-test
## 
## data:  male_heights and female_heights
## t = 2.8225, df = 8.9621, p-value = 0.02005
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.607164 23.726169
## sample estimates:
## mean of x mean of y 
##  182.5000  169.3333
  • If p-value < 0.05, reject \(H_0\) → The two means are significantly different.

  • If p-value > 0.05, fail to reject \(H_0\) → No significant difference.
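
As a rough check on the output above, the t statistic and the Welch degrees of freedom can be computed directly from the sample means and variances. This is a sketch; v1, v2, t_stat and welch_df are illustrative names, and the degrees of freedom use the Welch-Satterthwaite approximation.

```r
male_heights   <- c(170, 175, 180, 185, 190, 195)
female_heights <- c(160, 165, 168, 170, 175, 178)

n1 <- length(male_heights)
n2 <- length(female_heights)

# Squared standard error contributed by each sample: s^2 / n
v1 <- var(male_heights) / n1
v2 <- var(female_heights) / n2

# t statistic from the formula in 4.7.1
t_stat <- (mean(male_heights) - mean(female_heights)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
welch_df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = welch_df)

print(c(t = t_stat, df = welch_df, p = p_value))
```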


4.7.3 Checking Assumptions

Before running a t-test, we must check:

  1. Normality (Use Shapiro-Wilk test)

  2. Equal Variances (Use F-test)

4.7.3.1 Example: Checking Normality

shapiro.test(male_heights)
## 
##  Shapiro-Wilk normality test
## 
## data:  male_heights
## W = 0.98189, p-value = 0.9606
shapiro.test(female_heights)
## 
##  Shapiro-Wilk normality test
## 
## data:  female_heights
## W = 0.9841, p-value = 0.97

4.7.3.2 Example: Checking Equal Variances

var.test(male_heights, female_heights)
## 
##  F test to compare two variances
## 
## data:  male_heights and female_heights
## F = 2.0317, num df = 5, denom df = 5, p-value = 0.4551
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   0.2843024 14.5195451
## sample estimates:
## ratio of variances 
##           2.031734
  • If p-value < 0.05, the variances differ significantly → Use var.equal = FALSE in t.test().

  • If p-value ≥ 0.05, there is no evidence of unequal variances → Using var.equal = TRUE is reasonable.


4.7.4 Performing t-Test with Unequal or Equal Variances

4.7.4.1 Example: When Variances are Unequal

t.test(male_heights, female_heights, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  male_heights and female_heights
## t = 2.8225, df = 8.9621, p-value = 0.02005
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.607164 23.726169
## sample estimates:
## mean of x mean of y 
##  182.5000  169.3333

4.7.4.2 Example: When Variances are Equal

t.test(male_heights, female_heights, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  male_heights and female_heights
## t = 2.8225, df = 10, p-value = 0.01808
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   2.772665 23.560668
## sample estimates:
## mean of x mean of y 
##  182.5000  169.3333
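
Both versions report the same t statistic here because the two groups have equal sample sizes; only the degrees of freedom (and therefore the p-value and interval) differ. The sketch below computes the pooled-variance statistic by hand to illustrate this; sp2 and t_pooled are illustrative names.

```r
male_heights   <- c(170, 175, 180, 185, 190, 195)
female_heights <- c(160, 165, 168, 170, 175, 178)

n1 <- length(male_heights)
n2 <- length(female_heights)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(male_heights) + (n2 - 1) * var(female_heights)) /
  (n1 + n2 - 2)

# Pooled t statistic with n1 + n2 - 2 degrees of freedom
t_pooled  <- (mean(male_heights) - mean(female_heights)) /
  sqrt(sp2 * (1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2

# Welch statistic for comparison
t_welch <- unname(t.test(male_heights, female_heights)$statistic)

print(c(t_pooled = t_pooled, t_welch = t_welch, df_pooled = df_pooled))
```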

4.7.5 Paired t-Test (Dependent Samples)

A paired t-test is used when the same subjects are measured twice (e.g., before and after treatment).

4.7.5.1 Example: Testing Before vs. After Training Scores

# Scores before and after training
before <- c(60, 65, 70, 75, 80, 85, 90)
after  <- c(65, 68, 75, 78, 85, 88, 92)

# Perform paired t-test
t.test(before, after, paired = TRUE)
## 
##  Paired t-test
## 
## data:  before and after
## t = -7.8393, df = 6, p-value = 0.0002277
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -4.873641 -2.554930
## sample estimates:
## mean difference 
##       -3.714286
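
A paired t-test is equivalent to a one-sample t-test on the within-subject differences. The following sketch shows the equivalence on the same data; differences, paired_result and one_sample_result are illustrative names.

```r
before <- c(60, 65, 70, 75, 80, 85, 90)
after  <- c(65, 68, 75, 78, 85, 88, 92)

# Within-subject differences (before minus after, matching t.test's order)
differences <- before - after

paired_result     <- t.test(before, after, paired = TRUE)
one_sample_result <- t.test(differences, mu = 0)

# Both approaches give the same statistic and p-value
print(c(paired = unname(paired_result$statistic),
        one_sample = unname(one_sample_result$statistic)))
print(c(paired = paired_result$p.value,
        one_sample = one_sample_result$p.value))
```

Because the difference is computed as before minus after, a negative mean difference indicates that scores increased after training.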

4.7.6 Visualizing Group Differences

4.7.6.1 Example: Boxplot Comparing Two Groups

# Combine data into a dataframe
data <- data.frame(
  Height = c(male_heights, female_heights),
  Gender = rep(c("Male", "Female"), each = 6)
)

# Plot boxplot
boxplot(Height ~ Gender, data = data, col = c("blue", "red"), main = "Height Comparison")


4.7.7 Practical Exercises

4.7.7.1 Exercise 1: Independent t-Test

  1. Two groups take an exam:
group_A <- c(78, 80, 85, 88, 90, 92, 95)
group_B <- c(75, 78, 82, 85, 87, 89, 91)
  2. Test if their mean scores are significantly different.

Solution:

t.test(group_A, group_B)
## 
##  Welch Two Sample t-test
## 
## data:  group_A and group_B
## t = 0.92929, df = 11.951, p-value = 0.3711
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.037004 10.037004
## sample estimates:
## mean of x mean of y 
##  86.85714  83.85714

4.7.7.2 Exercise 2: Checking Assumptions

  1. Use the dataset:
data_1 <- c(10, 12, 14, 16, 18, 20)
data_2 <- c(8, 9, 10, 12, 14, 15)
  2. Check for normality and equal variances.

Solution:

shapiro.test(data_1)
## 
##  Shapiro-Wilk normality test
## 
## data:  data_1
## W = 0.98189, p-value = 0.9606
shapiro.test(data_2)
## 
##  Shapiro-Wilk normality test
## 
## data:  data_2
## W = 0.94009, p-value = 0.6599
var.test(data_1, data_2)
## 
##  F test to compare two variances
## 
## data:  data_1 and data_2
## F = 1.7797, num df = 5, denom df = 5, p-value = 0.5423
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   0.2490297 12.7181372
## sample estimates:
## ratio of variances 
##           1.779661

4.7.7.3 Exercise 3: Paired t-Test

  1. A fitness test is conducted before and after training:
before_fitness <- c(50, 55, 60, 62, 65, 67, 70)
after_fitness  <- c(55, 58, 63, 65, 68, 70, 73)
  2. Test if there is a significant improvement after training.

Solution:

t.test(before_fitness, after_fitness, paired = TRUE)
## 
##  Paired t-test
## 
## data:  before_fitness and after_fitness
## t = -11.5, df = 6, p-value = 2.597e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -3.984832 -2.586597
## sample estimates:
## mean difference 
##       -3.285714
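
The solution above uses the default two-sided alternative. Since the question asks specifically about an improvement, a one-sided test is also a reasonable choice. The sketch below assumes that "improvement" means higher scores after training, so the alternative is that the mean of before minus after is less than zero.

```r
before_fitness <- c(50, 55, 60, 62, 65, 67, 70)
after_fitness  <- c(55, 58, 63, 65, 68, 70, 73)

# Two-sided test, as in the solution above
two_sided <- t.test(before_fitness, after_fitness, paired = TRUE)

# One-sided test: H_A is mean(before - after) < 0, i.e. scores increased
one_sided <- t.test(before_fitness, after_fitness, paired = TRUE,
                    alternative = "less")

print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
```

When the observed difference lies in the hypothesized direction, the one-sided p-value is half the two-sided one.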

4.7.7.4 Exercise 4: Visualizing Group Differences

  1. Create two datasets:
treatment <- c(100, 110, 120, 130, 140)
control <- c(95, 105, 115, 125, 135)
  2. Plot a boxplot to compare the groups.

Solution:

# Combine data into dataframe
data <- data.frame(
  Score = c(treatment, control),
  Group = rep(c("Treatment", "Control"), each = 5)
)

# Plot boxplot
boxplot(Score ~ Group, data = data, col = c("blue", "red"), main = "Treatment vs. Control")


4.8 Performing Pairwise Comparisons Between Group Means

When comparing more than two groups, pairwise comparisons allow us to identify which groups differ significantly.
Common methods include:

  • t-Tests with adjustments for multiple comparisons

  • Tukey’s Honest Significant Difference (HSD) test

  • Bonferroni correction

  • Dunnett’s test (comparing to a control group)


4.8.1 Understanding Pairwise Comparisons

When comparing multiple groups, running multiple t-tests increases the risk of Type I errors (false positives).
To correct this, we apply multiple comparison adjustments like:

  • Bonferroni correction (divides alpha by the number of comparisons)

  • Holm correction (stepwise adjustment)

  • Tukey’s HSD (for ANOVA post-hoc comparisons)
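
To illustrate how quickly the Type I error risk grows, the sketch below computes the family-wise error rate for all pairwise comparisons among k groups. It assumes the tests are independent, which is only an approximation for pairwise comparisons that share groups; k, m, fwer and bonferroni_alpha are illustrative names.

```r
alpha <- 0.05

# Number of pairwise comparisons among k groups
k <- 2:5
m <- choose(k, 2)

# Probability of at least one false positive across m independent tests
fwer <- 1 - (1 - alpha)^m

# Per-comparison threshold under the Bonferroni correction
bonferroni_alpha <- alpha / m

print(data.frame(groups = k, comparisons = m,
                 fwer = round(fwer, 3),
                 bonferroni_alpha = round(bonferroni_alpha, 4)))
```

With three groups (three comparisons), the chance of at least one false positive is already about 14% rather than 5%.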


4.8.2 Performing Pairwise t-Tests

The pairwise.t.test() function performs multiple t-tests while adjusting for multiple comparisons.

4.8.2.1 Example: Comparing Exam Scores Across Three Groups

# Sample data
group <- rep(c("Group A", "Group B", "Group C"), each = 5)
scores <- c(85, 88, 90, 92, 94, 78, 80, 83, 85, 87, 70, 72, 75, 77, 79)

# Perform pairwise t-tests with Bonferroni correction
pairwise.t.test(scores, group, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  scores and group 
## 
##         Group A Group B
## Group B 0.024   -      
## Group C 6.8e-05 0.013  
## 
## P value adjustment method: bonferroni
  • The output provides an adjusted p-value for each pairwise comparison.

  • If an adjusted p-value < 0.05, the two groups differ significantly.


4.8.3 Tukey’s HSD Test

Tukey’s Honest Significant Difference (HSD) test is used after ANOVA to compare every pair of group means while controlling the family-wise error rate.

4.8.3.1 Example: Tukey’s HSD Test

# Create dataset
data <- data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 5)),
  Score = c(85, 88, 90, 92, 94, 78, 80, 83, 85, 87, 70, 72, 75, 77, 79)
)

# Perform ANOVA
anova_model <- aov(Score ~ Group, data = data)

# Tukey's HSD Test
TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Score ~ Group, data = data)
## 
## $Group
##      diff       lwr       upr     p adj
## B-A  -7.2 -13.26805 -1.131954 0.0206029
## C-A -15.2 -21.26805 -9.131954 0.0000620
## C-B  -8.0 -14.06805 -1.931954 0.0109527
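  • The test provides confidence intervals (lwr, upr) for the difference between each pair of groups.

  • If the adjusted p-value (p adj) < 0.05, the groups significantly differ; equivalently, the confidence interval does not contain 0.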
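
Since Tukey’s HSD is a post-hoc test, it is usually read alongside the overall ANOVA F-test. The following sketch extracts the F statistic and p-value from the fitted model; score_data, anova_table, f_value and p_value are illustrative names, and the data are the same as in the example above.

```r
score_data <- data.frame(
  Group = factor(rep(c("A", "B", "C"), each = 5)),
  Score = c(85, 88, 90, 92, 94, 78, 80, 83, 85, 87, 70, 72, 75, 77, 79)
)

anova_model <- aov(Score ~ Group, data = score_data)

# summary() returns a list; the first element is the ANOVA table
anova_table <- summary(anova_model)[[1]]
print(anova_table)

# Overall F-test for any difference among the group means
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
print(c(F = f_value, p = p_value))
```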


4.8.4 Bonferroni and Holm Corrections

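The Bonferroni correction divides alpha (0.05) by the number of comparisons. Equivalently, as reported by pairwise.t.test(), each p-value is multiplied by the number of comparisons (capped at 1) and compared with the original alpha.

The Holm correction adjusts p-values stepwise, from smallest to largest, and is never less powerful than Bonferroni.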

4.8.4.1 Example: Comparing Methods

pairwise.t.test(scores, group, p.adjust.method = "holm")  # Holm
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  scores and group 
## 
##         Group A Group B
## Group B 0.0085  -      
## Group C 6.8e-05 0.0085 
## 
## P value adjustment method: holm
pairwise.t.test(scores, group, p.adjust.method = "bonferroni")  # Bonferroni
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  scores and group 
## 
##         Group A Group B
## Group B 0.024   -      
## Group C 6.8e-05 0.013  
## 
## P value adjustment method: bonferroni
pairwise.t.test(scores, group, p.adjust.method = "BH")  # Benjamini-Hochberg
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  scores and group 
## 
##         Group A Group B
## Group B 0.0081  -      
## Group C 6.8e-05 0.0064 
## 
## P value adjustment method: BH
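  • Bonferroni is the most conservative of the three.

  • Holm controls the same family-wise error rate as Bonferroni with more statistical power.

  • BH (Benjamini-Hochberg) controls the false discovery rate rather than the family-wise error rate.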
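
The three adjustments can also be applied directly to the unadjusted p-values with p.adjust(), which makes the differences easier to compare side by side. This is a sketch using the same scores and groups as above; raw_matrix, raw_p and adjusted are illustrative names.

```r
group  <- rep(c("Group A", "Group B", "Group C"), each = 5)
scores <- c(85, 88, 90, 92, 94, 78, 80, 83, 85, 87, 70, 72, 75, 77, 79)

# Unadjusted pairwise p-values (lower-triangular matrix with NA elsewhere)
raw_matrix <- pairwise.t.test(scores, group, p.adjust.method = "none")$p.value
raw_p <- raw_matrix[!is.na(raw_matrix)]

# Apply each correction to the same raw p-values
adjusted <- data.frame(
  raw        = raw_p,
  bonferroni = p.adjust(raw_p, method = "bonferroni"),
  holm       = p.adjust(raw_p, method = "holm"),
  BH         = p.adjust(raw_p, method = "BH")
)
print(signif(adjusted, 2))
```

For every comparison, the Bonferroni-adjusted value is at least as large as the Holm value, which in turn is at least as large as the BH value.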


4.8.5 Dunnett’s Test (Comparing to a Control Group)

Dunnett’s test compares each treatment group against a single control group, rather than comparing every pair of groups.

4.8.5.1 Example: Comparing Treatment Groups to a Control

# Create dataset
data <- data.frame(
  Treatment = factor(rep(c("Control", "Drug A", "Drug B"), each = 5)),
  Response = c(50, 55, 53, 52, 54, 60, 62, 65, 67, 64, 70, 72, 75, 78, 77)
)

# Perform ANOVA
anova_model <- aov(Response ~ Treatment, data = data)

# Perform Dunnett's test (requires the multcomp package to be installed)
library(multcomp)
summary(glht(anova_model, linfct = mcp(Treatment = "Dunnett")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Dunnett Contrasts
## 
## 
## Fit: aov(formula = Response ~ Treatment, data = data)
## 
## Linear Hypotheses:
##                       Estimate Std. Error t value Pr(>|t|)    
## Drug A - Control == 0   10.800      1.724   6.263 7.98e-05 ***
## Drug B - Control == 0   21.600      1.724  12.527 5.79e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
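  • Each row compares one treatment (Drug A or Drug B) to the Control, which is the first level of the factor.

  • The adjusted p-values indicate whether each treatment differs significantly from the control.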
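
Dunnett contrasts treat the first factor level as the reference group. In the example above, "Control" happens to come first alphabetically; when the control label does not, the factor can be releveled before fitting the model. The sketch below uses a hypothetical control label, "Placebo", to show the idea with base R only.

```r
# Hypothetical labels: "Placebo" sorts after "Drug A" and "Drug B"
treatment <- factor(rep(c("Placebo", "Drug A", "Drug B"), each = 5))

# Default level order is alphabetical, so "Drug A" would be the reference
print(levels(treatment))

# Move the control group to the first position
treatment <- relevel(treatment, ref = "Placebo")
print(levels(treatment))
```

After releveling, fitting aov() and requesting Dunnett contrasts would compare each drug against the placebo group.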


4.8.6 Practical Exercises

4.8.6.1 Exercise 1: Perform Pairwise Comparisons

  1. Create three groups of exam scores:
students <- rep(c("Class A", "Class B", "Class C"), each = 6)
scores <- c(78, 80, 82, 85, 88, 90, 70, 73, 75, 77, 78, 80, 60, 62, 65, 68, 70, 72)
  2. Perform pairwise t-tests with Holm correction.

Solution:

pairwise.t.test(scores, students, p.adjust.method = "holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  scores and students 
## 
##         Class A Class B
## Class B 0.0046  -      
## Class C 1.2e-05 0.0041 
## 
## P value adjustment method: holm

4.8.6.2 Exercise 2: Tukey’s HSD Test

  1. Create three treatment groups:
group <- rep(c("Control", "Treatment A", "Treatment B"), each = 5)
values <- c(10, 12, 15, 13, 14, 18, 20, 22, 21, 23, 25, 27, 30, 29, 31)
  2. Perform ANOVA and Tukey’s HSD test.

Solution:

anova_model <- aov(values ~ group)
TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = values ~ group)
## 
## $group
##                         diff       lwr      upr     p adj
## Treatment A-Control      8.0  4.460679 11.53932 0.0001624
## Treatment B-Control     15.6 12.060679 19.13932 0.0000002
## Treatment B-Treatment A  7.6  4.060679 11.13932 0.0002584

4.8.6.3 Exercise 3: Comparing to a Control Group

  1. A clinical trial tests three conditions:
condition <- rep(c("Control", "Low Dose", "High Dose"), each = 6)
blood_pressure <- c(130, 128, 132, 129, 131, 130, 125, 123, 120, 124, 126, 122, 115, 113, 118, 116, 117, 114)
  2. Perform Dunnett’s test to compare treatments against the control.

Solution:

# Ensure condition is a factor
condition <- factor(condition)
anova_model <- aov(blood_pressure ~ condition)
library(multcomp)
summary(glht(anova_model, linfct = mcp(condition = "Dunnett")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Dunnett Contrasts
## 
## 
## Fit: aov(formula = blood_pressure ~ condition)
## 
## Linear Hypotheses:
##                          Estimate Std. Error t value Pr(>|t|)    
## High Dose - Control == 0  -14.500      1.063 -13.643 1.44e-09 ***
## Low Dose - Control == 0    -6.667      1.063  -6.273 2.90e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)


4.9 Hands-on Exercise

4.9.1 Exercise 1: Descriptive Statistics

  1. Create a dataset of monthly sales revenue:
revenue <- c(12000, 13500, 14200, 16000, 17000, 12500, 14000, 15000, 15500, 16500)
  2. Compute:
  • Mean, median, standard deviation

  • Minimum and maximum values

  • Interquartile range (IQR)

Solution

# Compute summary statistics
summary(revenue)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12000   13625   14600   14620   15875   17000
# Standard deviation
sd(revenue)
## [1] 1673.187
# Interquartile range
IQR(revenue)
## [1] 2250

4.9.2 Exercise 2: Confidence Interval for Mean

  1. Use the dataset:
test_scores <- c(65, 70, 75, 80, 85, 90, 95, 100, 105, 110)
  2. Compute a 95% confidence interval for the mean.

Solution

mean_test_scores <- mean(test_scores)
sd_test_scores <- sd(test_scores)
n <- length(test_scores)

# Compute confidence interval
error_margin <- qt(0.975, df=n-1) * (sd_test_scores / sqrt(n))
c(mean_test_scores - error_margin, mean_test_scores + error_margin)
## [1] 76.67075 98.32925
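
The same interval can be obtained from t.test(), which is a convenient cross-check on the manual calculation. This is a sketch; manual_ci and ci_from_t_test are illustrative names.

```r
test_scores <- c(65, 70, 75, 80, 85, 90, 95, 100, 105, 110)

# Manual interval: mean +/- t critical value * standard error
n <- length(test_scores)
error_margin <- qt(0.975, df = n - 1) * sd(test_scores) / sqrt(n)
manual_ci <- mean(test_scores) + c(-1, 1) * error_margin

# Interval reported by t.test()
ci_from_t_test <- t.test(test_scores, conf.level = 0.95)$conf.int

print(manual_ci)
print(as.numeric(ci_from_t_test))
```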

4.9.3 Exercise 3: One-Sample t-Test

  1. Use the dataset:
weights <- c(55, 60, 65, 70, 75, 80, 85, 90, 95, 100)
  2. Test if the mean weight is significantly different from 72.

Solution

t.test(weights, mu = 72)
## 
##  One Sample t-test
## 
## data:  weights
## t = 1.1489, df = 9, p-value = 0.2802
## alternative hypothesis: true mean is not equal to 72
## 95 percent confidence interval:
##  66.67075 88.32925
## sample estimates:
## mean of x 
##      77.5

4.9.4 Exercise 4: Proportion Test

  1. A survey finds that 65 out of 120 respondents prefer a new product.

  2. Test if the proportion is different from 50%.

Solution

prop.test(65, 120, p = 0.50, correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  65 out of 120, null probability 0.5
## X-squared = 0.83333, df = 1, p-value = 0.3613
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4526097 0.6281387
## sample estimates:
##         p 
## 0.5416667

4.9.5 Exercise 5: Comparing Two Sample Means

  1. Two classes take a math test:
class_A <- c(78, 80, 85, 88, 90, 92, 95)
class_B <- c(75, 78, 82, 85, 87, 89, 91)
  2. Test if their mean scores are significantly different.

Solution

t.test(class_A, class_B)
## 
##  Welch Two Sample t-test
## 
## data:  class_A and class_B
## t = 0.92929, df = 11.951, p-value = 0.3711
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.037004 10.037004
## sample estimates:
## mean of x mean of y 
##  86.85714  83.85714

4.9.6 Exercise 6: Data Visualization

  1. Use the dataset:
categories <- c("A", "B", "C", "A", "A", "B", "C", "A", "B", "C")
  2. Create a bar chart.

Solution

category_table <- table(categories)
barplot(category_table, col = c("blue", "red", "green"), main = "Category Distribution",
        xlab = "Category", ylab = "Count")


4.9.7 Exercise 7: Scatterplot with Regression Line

  1. Create two variables:
experience <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
salary <- c(40000, 42000, 45000, 47000, 50000, 52000, 55000, 58000, 60000, 63000)
  2. Create a scatterplot with a regression line.

Solution

plot(experience, salary, col = "blue", pch = 19, main = "Experience vs Salary",
     xlab = "Years of Experience", ylab = "Salary ($)")
abline(lm(salary ~ experience), col = "red", lwd = 2)
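
The line drawn by abline() comes from the fitted model's intercept and slope. As a small sketch, these can be inspected directly; fit and slope are illustrative names.

```r
experience <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
salary <- c(40000, 42000, 45000, 47000, 50000, 52000, 55000, 58000, 60000, 63000)

# Fit the simple linear regression used for the plotted line
fit <- lm(salary ~ experience)

# Intercept and slope (the slope is the estimated salary change per year)
print(coef(fit))
slope <- unname(coef(fit)["experience"])
```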


4.9.8 Exercise 8: Pairwise Comparisons

  1. Create a dataset with three groups:
group <- rep(c("Group A", "Group B", "Group C"), each = 5)
values <- c(85, 88, 90, 92, 94, 78, 80, 83, 85, 87, 70, 72, 75, 77, 79)
  2. Perform pairwise t-tests.

Solution

pairwise.t.test(values, group, p.adjust.method = "holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  values and group 
## 
##         Group A Group B
## Group B 0.0085  -      
## Group C 6.8e-05 0.0085 
## 
## P value adjustment method: holm

4.9.9 Exercise 9: Tukey’s HSD Test

  1. Use the dataset:
treatment <- rep(c("Control", "Treatment A", "Treatment B"), each = 5)
response <- c(50, 55, 53, 52, 54, 60, 62, 65, 67, 64, 70, 72, 75, 78, 77)
  2. Perform ANOVA and Tukey’s HSD test.

Solution

anova_model <- aov(response ~ treatment)
TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = response ~ treatment)
## 
## $treatment
##                         diff       lwr      upr     p adj
## Treatment A-Control     10.8  6.199708 15.40029 0.0001144
## Treatment B-Control     21.6 16.999708 26.20029 0.0000001
## Treatment B-Treatment A 10.8  6.199708 15.40029 0.0001144

4.9.10 Exercise 10: Comparing a Treatment to a Control

  1. A clinical trial tests three conditions:
condition <- rep(c("Control", "Low Dose", "High Dose"), each = 6)
blood_pressure <- c(130, 128, 132, 129, 131, 130, 125, 123, 120, 124, 126, 122, 115, 113, 118, 116, 117, 114)
  2. Perform Dunnett’s test to compare treatments against the control.

Solution

# Ensure condition is a factor
condition <- factor(condition)
anova_model <- aov(blood_pressure ~ condition)
library(multcomp)
summary(glht(anova_model, linfct = mcp(condition = "Dunnett")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Dunnett Contrasts
## 
## 
## Fit: aov(formula = blood_pressure ~ condition)
## 
## Linear Hypotheses:
##                          Estimate Std. Error t value Pr(>|t|)    
## High Dose - Control == 0  -14.500      1.063 -13.643 1.44e-09 ***
## Low Dose - Control == 0    -6.667      1.063  -6.273 2.90e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)





