Ecological Analysis
2024-11-16
Chapter 1 INTRODUCTION
The distinction between a data sample and the underlying data population is often overlooked. Consider, for instance, the mean concentration of a chemical contaminant at a contaminated site. Here, the data sample is the set of soil samples taken to the laboratory to measure the contaminant concentration, while the data population consists of all possible soil measurements of the chemical contaminant at the site. A few terms need to be defined to make this clear.
- Data Population: the entire set of individuals, items or observations of interest in a study, for instance all possible soil samples from the contaminated site.
- Data Sample: the subset of the population selected for analysis, for example the heights of randomly selected trees in a forest.
- Data Distribution: describes how data points are spread or arranged.
Data Sample vs Data Population
A clear understanding of the difference between a data sample and a data population is important in inferential statistics, which seeks to generalize results obtained from data samples to the wider population. A data population consists of all possible observations or data points concerning the characteristic of interest. For instance, when estimating the mean height of all the oak trees in a forest, the data population is the height of every single oak tree in the forest.
In contrast, a data sample is a limited subset of observations drawn from the entire population. In this case we can obtain a data sample by randomly selecting 100 oak trees and measuring their heights.
Try it!
Let's simulate oak tree data with R.
# Seed for reproducibility
set.seed(123)
# A population of 12500 oak trees
population_heights <- rnorm(12500, mean = 20, sd=5)
head(population_heights)
## [1] 17.19762 18.84911 27.79354 20.35254 20.64644 28.57532
We have simulated the heights of 12500 oak trees in a forest. The variable population_heights represents the data population for the oak tree heights. Let's randomly select 86 trees from this data population of oak tree heights.
# Randomly select 86 trees
sample_heights <- sample(population_heights, size=86, replace = FALSE)
tail(sample_heights)
## [1] 25.88792 29.03996 29.00965 37.10547 17.76631 19.81021
The variable sample_heights represents a data sample of the oak tree heights.
To make this experiment more interesting, let's compare the mean and standard deviation of the data population with those of the data sample. Let's do it!
# Mean and standard deviation of the entire population
population_stats <- c(
Mean = mean(population_heights),
SD = sd(population_heights)
)
population_stats
## Mean SD
## 19.995609 4.996295
The entire oak tree data population has a mean height of approximately 20 meters and a standard deviation of approximately 5 meters. Let's see how this compares with the data sample of oak tree heights.
# Mean and standard deviation of the sample
sample_stats <- c(
Mean = mean(sample_heights),
SD = sd(sample_heights)
)
sample_stats
## Mean SD
## 21.094995 5.507379
There is only a minor difference between the data sample and the entire data population of oak tree heights: the selected sample of oak trees has a mean height of about 21 meters and a standard deviation of about 5.5 meters.
Practical Exercise
Run the code below to generate data for soil samples collected on a farm with livestock. The main objective was to find the pH of the soil.
Use the code below to simulate the collection of the data.
# Set seed for reproducibility
set.seed(76)
# Simulate the data collection
farm_population_pH <- rnorm(n = 5342, mean=3.61, sd=0.8)
head(farm_population_pH)
## [1] 3.864037 3.471632 3.544874 3.192810 3.239502 3.724600
You are required to work on the problems below;
- What is the Mean and Standard Deviation of the pH of the entire population of farm soil samples?
- Select 65 random samples from the entire population.
- Calculate the Mean and Standard Deviation of the sample.
- How does the Mean and Standard deviation differ from the entire population?
Solution
- What is the Mean and Standard Deviation of the pH of the entire population of farm soil samples?
population_stats <- c(
Mean = mean(farm_population_pH),
standard_deviation = sd(farm_population_pH)
)
population_stats # show the results
## Mean standard_deviation
## 3.6218315 0.7934042
- Select 65 random samples from the entire population.
# Randomly select 65 soil samples from the farm population
farm_sample_pH <- sample(farm_population_pH, size = 65, replace = FALSE)
head(farm_sample_pH)
## [1] 3.478034 4.268278 4.084203 4.044815 2.613637 3.575463
- Calculate the Mean and Standard Deviation of the sample.
sample_stats <- c(
Mean = mean(farm_sample_pH),
standard_deviation = (sd(farm_sample_pH))
)
sample_stats # show the results
## Mean standard_deviation
## 3.7237092 0.7554283
- How does the Mean and Standard deviation differ from the entire population?
The sample mean is slightly higher than the population mean, while the sample standard deviation is almost equal to that of the whole population.
________________________________________________________________________________
1.1 Data Distributions
Data distribution describes how data values in a data population are spread across the range of possible values. For instance, in our case the heights of oak trees in the forest might be normally distributed, meaning that most trees cluster around the average height. Understanding data distribution is crucial in ecological analysis since it helps in;
- identifying patterns ( for instance seasonal changes in bird populations).
- choosing appropriate statistical tests.
- modelling ecological phenomena accurately.
When dealing with continuous data values such as the height of oak trees, soil acidity or rainfall, probability distribution functions are used to find the likelihood of different outcomes in a population. Two such functions, used throughout this section, are;
- the Cumulative Distribution Function (CDF),
- the Empirical Cumulative Distribution Function (ECDF).
Below is the distribution of oak tree heights represented in a density plot.
library(ggplot2)
# Create a data frame from the data
oak_df <- data.frame(Height = population_heights)
# Plot the density plot
ggplot(oak_df,
aes(x = Height)) +
geom_density(fill = "blue", alpha=0.5) +
labs(
title = "Distribution of tree heights",
x = "Tree Height(m)",
y = "Density"
) +
theme_minimal()
Above is the distribution of oak tree heights; let's now describe it using the two distribution functions.
Cumulative Distribution Function
The cumulative distribution function gives the probability that a random variable takes a value less than or equal to a given value.
We can randomly select a sample of heights from the population, for example:
# One possible call (the exact code was not shown): draw a sample and inspect the first values
head(sample(population_heights, size = 100))
## [1] 16.77457 21.71678 21.76080 36.45259 25.02581 14.43727
For example, if we analyze heights, the CDF could help answer: “What percentage of people have a height less than or equal to 170 cm?”
The CDF of a normal distribution can be visualized using:
x <- seq(min(population_heights), max(population_heights), length = 100)
cdf_values <- pnorm(x, mean = mean(population_heights), sd = sd(population_heights))
plot(x, cdf_values, type="l", col="blue", lwd=2,
xlab="Height", ylab="Cumulative Probability",
main="Cumulative Distribution Function (CDF)")
Empirical Cumulative Distribution Function
The Empirical Cumulative Distribution Function (ECDF) is a way to estimate the CDF based on the actual observed data (your sample). Instead of assuming a theoretical distribution, it calculates the proportion of values that are less than or equal to each observed value in the sample.
Let's create an ECDF plot in R using the sample heights.
# calculate CDF
CDF <- ecdf(sample_heights)
# draw the cdf plot
plot(CDF,
main="Empirical Cumulative Distribution Function (ECDF)",
xlab="Height", ylab="Cumulative Probability", col="red")
As you can see, the ECDF graph shows a step-like increase whenever a new height is encountered and the curve provides a data-driven approximation of how heights are distributed in the sample.
Here are the main differences between the CDF (Cumulative Distribution Function) and the ECDF (Empirical Cumulative Distribution Function);
- The CDF is based on an assumed distribution for the entire population, while the ECDF is based on the observed sample data.
- The CDF curve is smooth, while the ECDF is a step-like plot.
- The CDF is used for model assumptions and probability estimates, while the ECDF is used for data-driven insights and exploratory analysis.
Practical Exercise
Solution
________________________________________________________________________________
1.1.1 Types of Distribution
In statistical tests, it is assumed that the data sample represents the underlying data population. Similarly, the distribution of the data sample is assumed to be similar to the one of the data population.
A data distribution is characterized by whether the data are;
- discrete or continuous
- symmetrical or asymmetrical
- bounded by lower and/or upper limits or unbounded.
Let's discuss the different types of data distributions.
1.1.1.1 Normal distribution
The normal distribution is a continuous symmetrical distribution in the real number domain (i.e., with the set of possible values ranging between -∞ and ∞) whose probability density function plots as a smooth bell-shaped curve, and whose cumulative distribution is S-shaped.
Try it!
To demonstrate this, let's generate 100 random data values with a mean of 5 and a standard deviation of 3.
set.seed(1)
data <- rnorm(100, mean=5, sd=3)
df <- data.frame(data)
# Plot the data
ggplot(df,
aes(x = data)) +
geom_density(fill = "blue", alpha=0.5) +
labs(
title = "Distribution of data sample",
x = "Sample Value",
y = "Density"
) +
theme_minimal()
The resulting data distribution shows a near bell-shaped curve.
A normal distribution with a mean of zero and a variance of 1 (or a standard deviation of 1, since the standard deviation is the square root of the variance) is referred to as a standard normal distribution.
A standard normal distribution has many applications in statistics such as computation of cumulative probabilities and critical values for hypothesis tests.
Try it!
Let's generate a random data set that is already standardized.
set.seed(100)
data <- rnorm(100, mean=0, sd=1)
df <- data.frame(data)
# Plot the data
ggplot(df,
aes(x = data)) +
geom_density(fill = "blue", alpha=0.5) +
labs(
title = "Distribution of data sample",
x = "Sample Value",
y = "Density"
) +
theme_minimal()
Any data sample that follows a normal distribution can be standardized to a standard normal distribution by subtracting its mean from each individual data value and dividing the result by its standard deviation. The resultant values are referred to as standard scores, normal scores, normal quantiles or z-values.
Here is the formula for calculating the z-value (standardized score): \[z = {{x - \overline x}\over{\sigma}}\]
Where;
- \(x\) is the actual value
- \(\overline x\) is the data sample mean
- \(\sigma\) is the standard deviation
Try it
We will generate a data distribution with a known mean and standard deviation, standardize it, and then plot both samples: the raw data and the standardized one.
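A minimal sketch of that idea (the mean and standard deviation used here are illustrative): simulate the raw values, convert them to z-values with the formula above, and plot the two densities side by side.
set.seed(10)
# Raw data with a known mean and standard deviation (illustrative values)
raw <- rnorm(500, mean = 20, sd = 5)
# Standardize: subtract the mean, divide by the standard deviation
z <- (raw - mean(raw)) / sd(raw)
# Plot the raw and standardized distributions side by side
par(mfrow = c(1, 2))
plot(density(raw), main = "Raw data", xlab = "Value")
plot(density(z), main = "Standardized (z-values)", xlab = "z")
par(mfrow = c(1, 1))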
Example: CUMULATIVE AND EXCEEDANCE PROBABILITIES FOR A NORMAL DISTRIBUTION
Groundwater Manganese Concentrations in mg/L
Suppose groundwater manganese concentrations (in mg/L) follow a normal distribution with a mean of 0.52 and a standard deviation of 0.18. The cumulative probability is the chance of observing a concentration at or below a given value, while the exceedance probability is the chance of observing a concentration above it. The code below draws the density curve and shades the region below 0.4 mg/L (a cumulative probability) and the region above 0.9 mg/L (an exceedance probability).
# Density curve for manganese concentrations: Normal(mean = 0.52, sd = 0.18)
curve(dnorm(x, mean = 0.52, sd = 0.18), from = 0, to = 1.2,
xlab = "Manganese concentration (mg/L)", ylab = "Density",
main = "Cumulative and exceedance probabilities")
# Shade the cumulative region P(X <= 0.4)
x_low <- seq(0, 0.4, length = 100)
polygon(c(0, x_low, 0.4), c(0, dnorm(x_low, 0.52, 0.18), 0), col = "gray")
# Shade the exceedance region P(X > 0.9), drawn up to the right edge of the plot
x_high <- seq(0.9, 1.2, length = 25)
polygon(c(0.9, x_high, 1.2), c(0, dnorm(x_high, 0.52, 0.18), 0), col = "gray")
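The shaded areas correspond to probabilities that can be computed directly with pnorm(), under the same assumed mean and standard deviation:
# Cumulative probability: P(concentration <= 0.4 mg/L)
pnorm(0.4, mean = 0.52, sd = 0.18)
# Exceedance probability: P(concentration > 0.9 mg/L)
1 - pnorm(0.9, mean = 0.52, sd = 0.18)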
Goodness-of-Fit (GOF) Tests for the Normal Distribution
In parametric statistical tests, it is assumed that the data are normally distributed or can be normalized by a data transformation such as a log transform. The parameters used in parametric tests, such as the mean and standard deviation, must be specified.
Verifying normality is therefore crucial before conducting such tests. If the data fail the test of normality, other types of tests (non-parametric) are considered. GOF tests are used to assess the normality of a data sample and typically require at least 8 to 10 randomly selected data values.
The tests are;
- Normal probability plot, also known as the normal quantile plot
- Coefficient of Variation (CV)
- Coefficient of Skewness
- Shapiro-Wilk (SW) and Shapiro-Francia (SF) procedures
- Filliben's probability plot correlation coefficient (PPCC)
- Shapiro-Wilk multiple group normality test
The CV test is the most commonly used method. It is computed simply by dividing the standard deviation by the mean. If the resultant value is greater than 1 then the data fails the normality test, otherwise it passes. \[CV = {\sigma\over{\overline x}}\]
Other GOF tests that are used in testing for normal and other distributions are the Anderson–Darling (AD) test and the Lilliefors–Kolmogorov–Smirnov test.
The Coefficient of Skewness mentioned above provides a more direct measure of skewness: a skewness magnitude greater than 0.5 indicates a moderate to substantial degree of skewness.
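As an example, both statistics can be computed for the simulated sample_heights from earlier in this chapter; the coefficient of skewness is computed here by hand as the mean cubed z-score (one common definition).
# Coefficient of Variation: standard deviation divided by the mean
cv <- sd(sample_heights) / mean(sample_heights)
cv  # values greater than 1 fail this informal normality screen
# Coefficient of skewness: the mean cubed z-score
z <- (sample_heights - mean(sample_heights)) / sd(sample_heights)
mean(z^3)  # magnitudes above 0.5 suggest moderate to substantial skewness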
Central Limit Theorem
The central limit theorem states that, as the sample size increases, the distribution of sample means approaches a normal distribution, regardless of the shape of the underlying population distribution. This is why sample means from even strongly skewed data can be treated as approximately normal when samples are reasonably large.
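A minimal simulation sketch of the theorem (the skewed population and sample sizes below are illustrative): draw many samples from a skewed population and look at the distribution of their means.
set.seed(42)
# A skewed (gamma) population
skewed_population <- rgamma(10000, shape = 2, scale = 3)
# Means of 1000 samples of size 30 drawn from the skewed population
sample_means <- replicate(1000, mean(sample(skewed_population, size = 30)))
# The sample means are approximately normally distributed
hist(sample_means, breaks = 30,
main = "Central Limit Theorem: Distribution of Sample Means",
xlab = "Sample mean")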
1.1.1.2 Lognormal, Gamma, and Other Continuous Distributions
Before we dive into the gamma distribution, we need to define the exponential distribution. The exponential distribution describes the waiting time between events in a Poisson process.
Imagine you're analyzing environmental data, such as soil contamination levels or water flow rates. Often, these values are not normally distributed because they can't be negative and tend to have a few extremely high values. This is where the lognormal distribution comes in: it describes data whose logarithms follow a normal distribution. For example, if you take the natural logarithm of highly skewed mercury concentration values, their distribution becomes nearly normal, making statistical analysis easier. However, when back-transforming results, the arithmetic mean can be underestimated, leading to what's called transformation bias; the median, often equal to the geometric mean, provides a useful estimate instead.
Aside from the lognormal, environmental studies frequently rely on other distributions such as the gamma, logistic, and uniform distributions, each useful in different scenarios. The gamma distribution, for instance, is often applied to model waiting times or rainfall amounts, offering flexibility through its shape and scale parameters. Understanding these distributions allows us to apply the right statistical methods, improve predictions, and make sense of complex real-world data patterns effectively.
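As a small sketch of the log-transformation idea (the concentration values are simulated purely for illustration): the raw values are strongly right-skewed, but their natural logarithms are approximately normal.
set.seed(7)
# Hypothetical right-skewed contaminant concentrations (lognormal by construction)
concentrations <- rlnorm(1000, meanlog = 0, sdlog = 1)
# Compare the raw and log-transformed distributions
par(mfrow = c(1, 2))
hist(concentrations, breaks = 30, main = "Raw (skewed)", xlab = "Concentration")
hist(log(concentrations), breaks = 30, main = "Log-transformed (near normal)", xlab = "log(Concentration)")
par(mfrow = c(1, 1))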
Let's look more closely at each of these distributions.
Gamma Distribution
The exponential distribution gives the waiting time until the first event occurs, while the gamma distribution gives the waiting time until the \(n^{th}\) event.
The gamma distribution is bounded by zero on the left (no negative values) and can stretch infinitely to the right, making it suitable for data that exhibit long right tails. It is a powerful tool for modeling real-world data that is positively skewed, such as rainfall amounts, insurance claims, and even environmental contaminant levels.
When dealing with skewed data, both the lognormal and gamma distributions are commonly used. However, the gamma distribution has an advantage: it avoids the pitfalls of transformation bias, which can arise when converting data back from logarithmic scales to the original scale.
The gamma distribution is controlled by two key parameters:
- Shape parameter (k) – Determines the shape of the distribution. A higher k makes the distribution look more like a normal distribution.
- Scale parameter (θ) – Controls the spread or range of the distribution.
Want to see how these parameters influence the distribution? Let’s generate some random data and visualize it!
# Load necessary library
library(ggplot2)
# Generate random data from a gamma distribution
set.seed(42) # Ensures reproducibility
gamma_data <- rgamma(1000, shape = 5, scale = 2)
# Visualize the data distribution
ggplot(data.frame(gamma_data), aes(x = gamma_data)) +
geom_histogram(binwidth = 2, fill = "skyblue", color = "black", alpha = 0.7) +
ggtitle("Histogram of Gamma-Distributed Data") +
xlab("Values") + ylab("Frequency") +
theme_minimal()
Try it
Try tweaking the shape and scale values in the code above to see how they affect the distribution.
- Increase the shape parameter (k) to see how the distribution approaches normality.
- Reduce the scale parameter (θ) to create a sharper peak.
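For instance, overlaying a few densities makes the effect of each parameter visible (the parameter values below are illustrative):
x <- seq(0, 40, length.out = 400)
# Overlay gamma densities for different shape (k) and scale (theta) values
plot(x, dgamma(x, shape = 2, scale = 2), type = "l", lty = 1, ylim = c(0, 0.25),
ylab = "Density", xlab = "Value", main = "Gamma densities")
lines(x, dgamma(x, shape = 5, scale = 2), lty = 2)   # larger shape: more symmetric
lines(x, dgamma(x, shape = 5, scale = 1), lty = 3)   # smaller scale: sharper peak
legend("topright", legend = c("k = 2, theta = 2", "k = 5, theta = 2", "k = 5, theta = 1"),
lty = 1:3, bty = "n")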
Want to check if your data follows a gamma distribution? You can perform Goodness-of-Fit (GOF) tests using the fitdistrplus package in R.
The package is installed and loaded with:
install.packages("fitdistrplus")
library(fitdistrplus)
## Loading required package: MASS
## Loading required package: survival
# Fit a gamma distribution to the data
fit <- fitdist(gamma_data, "gamma")
# Print summary of the fit
summary(fit)
## Fitting of the distribution ' gamma ' by maximum likelihood
## Parameters :
## estimate Std. Error
## shape 4.851288 0.20991639
## rate 0.491825 0.02242213
## Loglikelihood: -2845.914 AIC: 5695.827 BIC: 5705.643
## Correlation matrix:
## shape rate
## shape 1.0000000 0.9491175
## rate 0.9491175 1.0000000
The gamma distribution is an excellent choice for modeling skewed data, offering flexibility and reliable statistical methods without the transformation bias of lognormal distributions. By experimenting with different parameters and visualizing the results, you can better understand its applications in environmental science, finance, and beyond!
Practical Exercise
Solution
________________________________________________________________________________
Logistic Distribution
The logistic distribution, like the normal distribution, is symmetric and unbounded. However, it differs in having fatter tails, meaning it contains more extreme values on both sides. This makes it useful in various fields, such as:
- Logistic regression, where it models binary outcomes (e.g., success/failure).
- Ecological modeling, where population growth follows an S-shaped logistic curve.
- Machine learning, for classification problems.
The image below shows the slight difference between the logistic and normal distribution

Here are the key characteristics of the logistic distribution;
- Symmetry: Centered around a location parameter (mean).
- Scale Parameter: Determines the spread, similar to the standard deviation in normal distribution.
- Fat Tails: Greater occurrence of extreme values compared to the normal distribution.
The cumulative distribution function (CDF) of the logistic distribution resembles the logistic function: \[{P(t)}={{e^t}\over{1 + e^t}}\]
In logistic regression, the probability of success is expressed as: \[{\pi(x)}={{e^{(b_0 +b_1x)}}\over{1+e^{(b_0+b_1x)}}}\]
The logit function, which serves as the response variable in logistic regression, is defined as: \[{log({{\pi(x)}\over{1-\pi(x)}})}={b_0 + b_1x}\]
Try it
Let’s generate some logistic-distributed data and compare it with the normal distribution to understand their differences.
Below is a logistic distribution:
# Load necessary library
library(ggplot2)
# Generate random data from the logistic distribution
set.seed(42) # Ensures reproducibility
logistic_data <- rlogis(1000, location = 0, scale = 1)
# Visualize the data distribution
ggplot(data.frame(logistic_data), aes(x = logistic_data)) +
geom_histogram(binwidth = 0.5, fill = "lightcoral", color = "black", alpha = 0.7) +
ggtitle("Histogram of Logistic-Distributed Data") +
xlab("Values") + ylab("Frequency") +
theme_minimal()
Now let’s compare their probability density functions (PDFs).
# Define x-axis range
x <- seq(-5, 5, length.out = 300)
# Compute density for normal and logistic distributions
normal_density <- dnorm(x, mean = 0, sd = 1)
logistic_density <- dlogis(x, location = 0, scale = 1)
# Plot the distributions
plot(x, normal_density, type = "l", lwd = 2, col = "blue",
ylab = "Density", xlab = "x", main = "Normal vs Logistic Distribution")
lines(x, logistic_density, lwd = 2, col = "red", lty = 2)
legend("topright", legend = c("Normal Distribution", "Logistic Distribution"),
col = c("blue", "red"), lty = c(1, 2), lwd = 2)
You can experiment with different location and scale parameters in the code above to see how the logistic distribution changes: changing the location shifts the distribution left or right, while increasing the scale spreads it out more.
Want to see logistic regression in action? Here’s a quick example of modeling a binary outcome.
# Simulating data
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 + 1.2 * x)) # Logistic function
# Fit logistic regression
logistic_model <- glm(y ~ x, family = binomial)
# Summary of model
summary(logistic_model)
##
## Call:
## glm(formula = y ~ x, family = binomial)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2892 0.2357 1.227 0.22
## x 1.3348 0.3044 4.385 1.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 107.25 on 98 degrees of freedom
## AIC: 111.25
##
## Number of Fisher Scoring iterations: 4
The logistic distribution is a useful tool for modeling data with heavier tails than the normal distribution and, through logistic regression, binary outcomes. Understanding its properties, generating data, and visualizing it can provide valuable insights, especially in statistics, machine learning, and population modeling.
Practical Exercise
Solution
________________________________________________________________________________
Other Continuous Distributions
Other types of continuous distributions include;
- Uniform distribution: appropriate when the only available information consists of a minimum value \(a\) and a maximum value \(b\). All values within this range have an equal probability of occurring, meaning the probability density function (PDF) is constant at \(1/(b-a)\) within the interval \([a, b]\), and zero elsewhere.
- Triangular distribution: suitable when the minimum value \(a\), maximum value \(b\), and most likely value \(c\) are known. This distribution assumes that values increase linearly from \(a\) to the peak at \(c\), and then decrease linearly to \(b\). It is available through the triangle package, as shown below.
install.packages("triangle")
library(triangle)
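A small sketch of simulating triangular data with this package, assuming a minimum of 2, a most likely value of 4, and a maximum of 10 (illustrative numbers):
# Simulate 1000 values from a triangular distribution with min = 2, max = 10, mode = 4
tri_data <- rtriangle(1000, a = 2, b = 10, c = 4)
hist(tri_data, breaks = 30, main = "Triangular distribution", xlab = "Value")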
- Extreme Value distributions: Extreme value distributions model the smallest or largest values in a dataset. A common example is analyzing annual peak streamflow rates at a monitoring station to estimate flood magnitudes expected within a specific period (e.g., 50-year or 100-year floods).
One commonly used extreme value distribution is the Log-Pearson Type III distribution, which is right-skewed and bounded on the left by zero (since negative values are not possible). This distribution is widely used in flood frequency analysis and is related to the gamma distribution.
For analyzing low flow values, distributions such as the Weibull and Gumbel distributions are commonly employed. These distributions are particularly useful for estimating probabilities of extreme low flow conditions over time.
Try it!
Practical Exercise!
Solution
________________________________________________________________________________
1.1.1.3 Distributions Used in Inferential Statistics
When the sample size is large (typically ≥30) and the data are not highly skewed (skewness coefficient ≤ 0.5 or 1), the central limit theorem allows for approximate normality, enabling extrapolations from sample statistics to population parameters. However, for smaller samples—common in environmental data—or when the population standard deviation is unknown, exact sampling theory is preferred, leading to the use of distributions like Student’s t, chi-square, and F distributions. These distributions are essential for statistical inference, including confidence interval estimation, hypothesis testing for population means, and significance tests for regression coefficients.
1.1.1.3.1 Student’s t Distribution
This type of distribution is essential for inference when the population standard deviation is unknown. The t-distribution is used for hypothesis testing, confidence intervals, and regression analysis, among other applications.
Given a random sample \(x_1, x_2, ..., x_n\) from a normal distribution with an unknown mean and standard deviation, the t-statistic is given by: \[t = {{\overline x - \mu}\over{s/\sqrt n}}\] Where:
- \(\overline x\) is the sample mean
- \(\mu\) is the hypothesized population mean
- \(s\) is the sample standard deviation
- \(n\) is the sample size
It can be used in ecology to;
- Compare species populations, e.g. test whether two populations have the same mean
- Assess biodiversity changes, e.g. whether richness before and after habitat restoration differs significantly
- Analyze environmental variables, e.g. whether pollution levels differ between two forest areas.
Try it!
We will investigate differences in tree heights in a forested area.
- Generating and plotting the t-distribution
# Load necessary library
library(ggplot2)
# Define x values
t_values <- seq(-4, 4, length=100)
# Compute density for different degrees of freedom
df2 <- dt(t_values, df=2)
df12 <- dt(t_values, df=12)
norm <- dnorm(t_values) # Standard normal
data <- data.frame(x=t_values, StdNormal=norm, df2=df2, df12=df12)
# Plot the distributions
ggplot(data, aes(x=x)) +
geom_line(aes(y=StdNormal, linetype="Std Normal")) +
geom_line(aes(y=df2, linetype="t, df = 2")) +
geom_line(aes(y=df12, linetype="t, df = 12")) +
labs(y="Probability Density", x="t values") +
scale_linetype_manual("", values=c("solid", "dashed", "dotdash")) +
theme_minimal()
- Perform a two-sample t-test
# Sample data (species richness in two forest areas)
area_A <- c(23, 25, 27, 30, 28, 26, 29, 31)
area_B <- c(19, 20, 22, 24, 21, 23, 22, 25)
# Perform two-sample t-test
t.test(area_A, area_B, var.equal=TRUE)
##
## Two Sample t-test
##
## data: area_A and area_B
## t = 4.558, df = 14, p-value = 0.0004468
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.845765 7.904235
## sample estimates:
## mean of x mean of y
## 27.375 22.000
Let's now visualize the critical regions of the t-distribution.
# Define x values
x_vals <- seq(-4, 4, length=200)
y_vals <- dt(x_vals, df=10)
# Critical region (alpha = 0.05, two-tailed)
critical_val <- qt(0.975, df=10)
# Plot
plot(x_vals, y_vals, type="l", ylab="Probability Density", xlab="t values", main="t-Distribution with Critical Regions")
abline(v=c(-critical_val, critical_val), col="red", lwd=2, lty=2)
To help with the interpretation:
- The distribution plot compares the t-distribution for different degrees of freedom with the normal distribution; the t-distribution has heavier tails for small degrees of freedom.
- One-Sample t-Test: tests if the mean tree height significantly differs from 14 m (a sketch of this test is shown after this list).
- Two-Sample t-Test: tests if species richness significantly differs between the two areas.
- Critical Region Visualization: shows the rejection regions for hypothesis testing.
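The one-sample test mentioned above is not included in the code shown; here is a minimal sketch, using a small hypothetical sample of tree heights (the values are invented for illustration), testing whether the mean differs from 14 m.
# Hypothetical tree heights in metres (illustrative values only)
tree_heights <- c(12.5, 14.8, 15.2, 13.9, 16.1, 14.4, 15.7, 13.2)
# One-sample t-test: does the mean height differ from 14 m?
t.test(tree_heights, mu = 14)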
Practical Exercise
Solution
________________________________________________________________________________
1.1.1.3.2 Chi-Square Distribution
The Chi-square (\(\chi^2\)) distribution is widely used in ecological research for analyzing categorical data and testing statistical hypotheses. Many ecological studies involve count data, presence-absence data, or species distribution models, where the Chi-square test provides insights into species interactions, habitat preferences, and environmental associations.
This type of chi-square distribution arises when we sum the squares of independent standard normal variables.
In ecology, chi-square distribution may be used to analyze species-habitat relationships and species interactions.
Try it!
We will analyze the presence and absence of species A in a habitat in a contingency table.
Null Hypothesis: Species A's occurrence is independent of habitat type.
Alternative Hypothesis: Species A prefers a particular habitat.
# Create contingency table
species_habitat <- matrix(c(30, 20, 15, 35), nrow=2, byrow=TRUE)
colnames(species_habitat) <- c("Present", "Absent")
rownames(species_habitat) <- c("Forest", "Grassland")
# Perform Chi-square test
chisq.test(species_habitat)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: species_habitat
## X-squared = 7.9192, df = 1, p-value = 0.004891
The p-value is well below 0.05, so we reject the null hypothesis. We have enough evidence to conclude that Species A prefers a particular habitat.
Let's repeat the analysis with observed and expected bird counts at different wetland sites, using a chi-square goodness-of-fit test.
Null hypothesis: Birds are evenly distributed across the sites. Alternative hypothesis: Birds are not evenly distributed.
# Observed and expected bird counts
observed <- c(45, 55, 60, 40)
expected <- c(50, 50, 50, 50)
# Perform Chi-square goodness-of-fit test
chisq.test(observed, p = rep(1/4, 4)) # Equal probability across sites
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 5, df = 3, p-value = 0.1718
The p-value is above 0.05, so we fail to reject the null hypothesis; there is no evidence that the birds are unevenly distributed across the sites.
The chi-square distribution is used to construct confidence intervals for variance estimates such as error variance in ecological regression models.
The F-distribution, which is the ratio of two chi-square distributed variables, is used in ANOVA tests for comparing species abundance across multiple environmental conditions.
Let's plot both distributions to demonstrate this.
par(mfrow = c(1,2))
# Chi-square distribution for different degrees of freedom
curve(dchisq(x, df=2), from=0, to=10, ylab="Probability density",
main="Chi-square Distributions in Ecology", lwd=1, lty=1)
curve(dchisq(x, df=5), from=0, to=10, add=TRUE, lty=2)
curve(dchisq(x, df=10), from=0, to=10, add=TRUE, lty=3)
legend("topright", legend=c("df = 2", "df = 5", "df = 10"),
lty=1:3, bty="n")
# F-distribution for different df values
curve(df(x, df1=1, df2=10), from=0, to=5, ylab="Probability density",
main="F Distributions in Ecology", lwd=1, lty=1)
curve(df(x, df1=6, df2=60), from=0, to=5, add=TRUE, lty=2)
curve(df(x, df1=10, df2=100), from=0, to=5, add=TRUE, lty=3)
legend("topright", legend=c("df1 = 1, df2 = 10", "df1 = 6, df2 = 60", "df1 = 10, df2 = 100"),
lty=1:3, bty="n")
1.1.1.3.3 F Distribution
The F-distribution, named after the British statistician Ronald Fisher, is commonly used in ecological studies for testing hypotheses involving variance and regression models. It is the distribution of the F-statistic, which is computed as the ratio of two chi-square distributed variables, each normalized by their respective degrees of freedom.
Mathematically, the F-statistic is expressed as:
\[F = {{X_{k1}^2/k1}\over{X_{k2}^2/k2}}\] where:
- \(X_{k1}^2/k1\) is a chi-square distributed variable with \(k1\) degrees of freedom,
- \(X_{k2}^2/k2\) is another chi-square distributed variable with \(k2\) degrees of freedom.
Try it!
Let's visualize the F-distribution in ecology.
par(mfrow = c(1,1))
# F-distributions for different degrees of freedom
curve(df(x, df1=1, df2=10), from=0, to=5, ylab="Probability Density",
main="F Distribution in Ecology", lwd=1, lty=1)
curve(df(x, df1=6, df2=60), from=0, to=5, add=TRUE, lty=2)
curve(df(x, df1=10, df2=100), from=0, to=5, add=TRUE, lty=3)
legend("topright", legend=c("df1 = 1, df2 = 10", "df1 = 6, df2 = 60", "df1 = 10, df2 = 100"),
lty=1:3, bty="n")
As df1 and df2 increase, the F-distribution becomes less-skewed and more bell-shaped.
1.1.1.4 Discrete Distributions
A random variable is considered discrete if it takes only whole-number values, rather than fractions (for example, a class cannot have 10.5 students). Discrete data include both count data and categorical or qualitative data, such as whether an organism is present or absent or whether a value exceeds a control limit. Two commonly used distributions for modeling discrete data are the binomial distribution, which applies to categorical data, and the Poisson distribution, which is used for count data.
1.1.1.4.1 Binomial Distribution
The binomial distribution models the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes: success (1) or failure (0).
Imagine we are testing 20 groundwater monitoring wells for a contaminant. Each well either detects the contaminant (success) or does not detect it (failure). For instance;
- Probability of detecting the contaminant (success): p = 0.4
- Probability of not detecting the contaminant (failure): q = 1 - p = 0.6
The probability of getting exactly \(k\) successes in \(n\) trials is given by:
\[{P(X=k)}={\begin{pmatrix} n \\ k \end{pmatrix}p^k(1-p)^{n-k}}\]
Where:
- \({\begin{pmatrix} n \\ k \end{pmatrix}}={{n!}\over{k!(n-k)!}}\) is the binomial coefficient.
- \(p\) is the probability of success.
- \((1-p)\) is the probability of failure.
- \(k\) is the number of successes.
For a Binomial(n, p) distribution:
Mean (Expected Value): \[E(X) = np\] This tells us the average number of successes over many trials.
Variance \[\sigma^2 = np(1-p)\]
Standard deviation \[\sigma = \sqrt{np(1-p)}\]
In this case, for our groundwater monitoring example:
- Expected Value: \(E(X) = 20 \times 0.4 = 8\)
- Standard Deviation: \[\sigma = \sqrt{20 \times 0.4 \times 0.6} = \sqrt {4.8} \approx 2.19\]
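These formula values can be checked with a quick simulation (the simulation itself is just an illustrative sketch):
set.seed(1)
# Simulate many repetitions of testing 20 wells with detection probability 0.4
sims <- rbinom(100000, size = 20, prob = 0.4)
mean(sims)  # close to the expected value n*p = 8
sd(sims)    # close to sqrt(n*p*(1-p)) = 2.19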
Let's apply R to this groundwater contamination example. R provides four main functions to work with the binomial distribution:
- dbinom(x, size, prob): computes \(P(X=x)\), the probability of exactly x successes.
- pbinom(x, size, prob): computes \(P(X \leq x)\), the cumulative probability.
- qbinom(p, size, prob): computes the quantile for a given probability.
- rbinom(n, size, prob): generates random binomial values (simulating real-world experiments).
Try it!
Let's simulate binomial outcomes.
# Simulate 10 surveys of 20 wells, each well with a 40% detection probability
rbinom(10, size = 20, prob = 0.4)
## [1] 7 10 7 11 11 4 8 11 8 8
This simulates 10 trials where we sample 20 wells each time, and each well has a 40% chance of detecting the contaminant.
Let's compute the probability of detecting the contaminant in exactly 9 wells.
# Probability of exactly 9 detections out of 20 wells
dbinom(9, size = 20, prob = 0.4)
## [1] 0.1597385
This means there’s a 15.97% chance that exactly 9 out of 20 wells will detect the contaminant.
Now let's find the cumulative probability of detecting the contaminant in at most 9 wells.
# Probability of 9 or fewer detections out of 20 wells
pbinom(9, size = 20, prob = 0.4)
## [1] 0.7553372
Let's plot it.
par(lend = 2) # Adjust bar ends
plot(0:15, dbinom(0:15, 20, 0.4), type = "h", lwd = 6,
xlab = "# of Wells with Contaminant",
ylab = "Probability",
main = "Binomial Distribution (n=20, p=0.4)")
From the graph;
- The most probable outcome is finding the contaminant in 8 wells.
- Extreme values (0 or 15 wells detecting contaminants) are very unlikely.
What if?
When we increase n (e.g., testing 100 wells instead of 20), the distribution becomes smoother and begins to resemble a normal distribution.
plot(0:40, dbinom(0:40, 100, 0.4), type = "h", lwd = 6,
xlab = "# of Wells with Contaminant",
ylab = "Probability",
main = "Binomial Distribution (n=100, p=0.4)")
Note that this plot only extends to 40 wells, which is exactly the expected count (\(np = 100 \times 0.4 = 40\)), so only the rising left half of the distribution is visible and the graph appears left skewed. Plotting a wider range (for example 0 to 70 wells) would show the full, nearly symmetric bell shape.
1.1.1.4.2 Poisson Distribution
Suppose a count of the wood turtle, a protected species under the New Jersey Endangered Species Statute, is conducted in a specific geographic region. What would be a suitable data distribution model for the results obtained? Since counts are discrete rather than continuous (e.g., there can be 5 or 12 turtles, but not 7.4 turtles), continuous distribution models such as the normal, lognormal, and gamma distributions would not be appropriate. Additionally, because we are interested in the number of turtles rather than their presence or absence, the binomial distribution—though discrete—would also not be suitable, as it is designed for binary data (i.e., yes/no types of data).
The Poisson distribution, named after Siméon Poisson, is a discrete probability distribution used to model the number of events occurring within a fixed space or time interval, given an estimated average rate of occurrence (denoted by λ). If our focus were on the presence or absence of turtles in a given area rather than their actual count, the binomial distribution would be the appropriate choice.
The probability mass function (PMF) for the Poisson distribution is given by:
\[P(X=k) = {{e^{-\lambda}\lambda^k}\over{k!}}\] Where:
- \(P(X=k)\) is the probability that the Poisson-distributed random variable \(X\) takes a value \(k\) (where \(k\) can be 0, 1, 2, etc)
- \(\lambda\) is the mean or expected number of occurrences.
- \(e\) is the base of the natural logarithm (approximately 2.71828)
- \(k!\)(k factorial) is computed as \(k \times (k-1) \times(k-2) \times...\times 1\)
For the Poisson distribution, the mean (expected value) is \(\lambda\), and the variance is also equal to \(\lambda\), meaning the distribution’s spread is directly tied to the mean.
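A quick simulation sketch of this mean-variance property, using \(\lambda = 5\) as in the turtle example below:
set.seed(3)
# Simulate Poisson counts with lambda = 5
counts <- rpois(100000, lambda = 5)
mean(counts)  # close to lambda
var(counts)   # also close to lambda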
Consider a scenario where the expected number of turtles per hectare is 5. The probability of finding exactly 10 turtles per hectare can be computed in R as:
# P(X = 10) when lambda = 5
dpois(10, lambda = 5)
## [1] 0.01813279
This yields approximately 0.0181 (or 1.81%). Conversely, the probability of finding 10 or fewer turtles can be obtained using:
# P(X <= 10) when lambda = 5
ppois(10, lambda = 5)
## [1] 0.9863047
which results in 0.9863 (or 98.63%). Thus, the probability of finding more than 10 turtles is: \[1 - 0.9863 = 0.0137\]
Let's visualize this.
The following R script generates Poisson probability mass function plots for different values of \(\lambda\) (2, 5, and 20):
par(mfrow = c(1, 3))
par(lend = 2)
plot(0:10, dpois(0:10, 2), type = 'h', lwd = 6, xlab = 'Count or # of Occurrences (k)',
ylab = 'Probability (P(X = k))')
mtext(side = 3, line = 0.5, expression(lambda == 2))
plot(0:15, dpois(0:15, 5), type = 'h', lwd = 4, xlab = 'Count or # of Occurrences (k)',
ylab = 'Probability (P(X = k))')
mtext(side = 3, line = 0.5, expression(lambda == 5))
plot(0:40, dpois(0:40, 20), type = 'h', lwd = 2, xlab = 'Count or # of Occurrences (k)',
ylab = 'Probability (P(X = k))')
mtext(side = 3, line = 0.5, expression(lambda == 20))
As \(\lambda\) increases, the Poisson distribution becomes more symmetric and begins to resemble the normal distribution. The Poisson distribution can also serve as an approximation for the binomial distribution when the number of trials (\(n\)) is large and the probability of success (\(p\)) is small, such that \(\lambda = np\). Typically, the Poisson approximation to the binomial is valid when \(n \geq 20\) and \(p\leq 0.05\), or when \(n \geq 100\) and \(np \geq 10\).
A common application of the Poisson distribution in environmental science is Poisson regression, where the response variable consists of event counts assumed to follow a Poisson distribution, while predictor variables influence these counts. For instance, turtle population density per hectare might be modeled as a function of habitat type (e.g., forested wetlands, upland forests, agricultural fields) and proximity to roadways or urban areas.
A key issue in Poisson regression is overdispersion—when the observed data variance significantly exceeds the mean. Overdispersion can lead to poor model fit and necessitate alternative models, such as the negative binomial distribution.
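As a minimal sketch of what such a model looks like in R (the habitat categories, rates, and counts below are simulated purely for illustration), the ratio of residual deviance to residual degrees of freedom gives a rough check for overdispersion:
set.seed(11)
# Simulated habitat type and expected turtle counts per hectare
habitat <- rep(c("wetland", "upland", "agricultural"), each = 40)
lambda <- c(wetland = 6, upland = 3, agricultural = 1)[habitat]
# Simulate observed counts and fit a Poisson regression on habitat type
turtle_counts <- rpois(length(habitat), lambda = lambda)
pois_model <- glm(turtle_counts ~ factor(habitat), family = poisson)
summary(pois_model)
# Rough overdispersion check: residual deviance / residual df (values near 1 are fine)
deviance(pois_model) / df.residual(pois_model)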
In summary, the Poisson distribution provides a useful model for count data, particularly when events occur independently over a fixed interval of time or space.
1.2 Hands-on Exercises
Solve the following questions;
- Generate a data set of 100 random values from a normal distribution with mean = 5 and standard deviation = 2. Plot the empirical CDF and overlay the theoretical CDF. Compare the two.
- You will be provided with the minerals.csv data file. Use it to answer the questions henceforth. Test whether the mineral concentration data follows a normal distribution using the Shapiro-Wilk test. Interpret the results.
- Take 1000 samples of size 30 from the mineral concentration data. Plot the distribution of the sample means and comment on its shape.
- Fit a lognormal and gamma distribution to the mineral concentration data. Compare the AIC values to determine which distribution fits better.
- Perform a t-test to compare the mean concentration of two minerals. Also, perform a chi-square test to check if the observed frequencies of mineral categories match expected frequencies.
- Simulate a data set where the presence of a rare mineral follows a Poisson distribution. Calculate the probability of finding more than 3 occurrences in a sample.
Solution
- Generate a data set of 100 random values from a normal distribution with mean = 5 and standard deviation = 2. Plot the empirical CDF and overlay the theoretical CDF. Compare the two.
# Generate 100 random values from a normal distribution
data <- rnorm(100, mean = 5, sd = 2)
# Plot empirical CDF
plot(ecdf(data), main = "Empirical vs Theoretical CDF", col = "blue")
# Overlay theoretical CDF
curve(pnorm(x, mean = 5, sd = 2), add = TRUE, col = "red", lwd = 2)
legend("topleft", legend = c("Empirical CDF", "Theoretical CDF"), col = c("blue", "red"), lwd = 2)
- Test whether the mineral concentration data follows a normal distribution using the Shapiro-Wilk test. Interpret the results.
# Load the data
minerals <- read.csv("data/minerals.csv")
# Perform Shapiro-Wilk test
shapiro_test <- shapiro.test(minerals$Concentration)
shapiro_test
##
## Shapiro-Wilk normality test
##
## data: minerals$Concentration
## W = 0.88019, p-value < 2.2e-16
# Interpretation
if (shapiro_test$p.value < 0.05) {
print("The data does not follow a normal distribution.")
} else {
print("The data follows a normal distribution.")
}
## [1] "The data does not follow a normal distribution."
- Take 1000 samples of size 30 from the mineral concentration data. Plot the distribution of the sample means and comment on its shape.
# Take 1000 samples of size 30
sample_means <- replicate(1000, mean(sample(minerals$Concentration, 30)))
# Plot the distribution of sample means
hist(sample_means, breaks = 30, main = "Distribution of Sample Means", xlab = "Sample Means")
- Fit a lognormal and gamma distribution to the mineral concentration data. Compare the AIC values to determine which distribution fits better.
# Load required libraries
library(fitdistrplus)
# Fit lognormal distribution
fit_lognormal <- fitdist(minerals$Concentration, "lnorm")
# Fit gamma distribution
fit_gamma <- fitdist(minerals$Concentration, "gamma")
# Compare AIC values
aic_comparison <- data.frame(
Lognormal = fit_lognormal$aic,
Gamma = fit_gamma$aic
)
aic_comparison
## Lognormal Gamma
## 1 8771.871 8872.361
The lognormal distribution has the lower AIC, so it fits the mineral concentration data better than the gamma distribution.
- Perform a t-test to compare the mean concentration of two minerals. Also, perform a chi-square test to check if the observed frequencies of mineral categories match expected frequencies.
# Load dplyr for filter() and the %>% pipe used below
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Use calcium and sodium (any combination of minerals can work)
calcium <- minerals %>%
filter(Mineral == "Calcium")
sodium <- minerals %>%
filter(Mineral == "Sodium")
# Perform the t-test on the two concentration columns
t_test_result <- t.test(calcium$Concentration, sodium$Concentration)
t_test_result
# Chi-square test for observed vs expected frequencies
observed <- c(20, 30, 25, 25) # Observed frequencies
expected <- c(25, 25, 25, 25) # Expected frequencies
chisq_test_result <- chisq.test(observed, p = expected / sum(expected))
chisq_test_result
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 2, df = 3, p-value = 0.5724
- Simulate a data set where the presence of a rare mineral follows a Poisson distribution. Calculate the probability of finding more than 3 occurrences in a sample.
# Simulate Poisson data
poisson_data <- rpois(100, lambda = 2) # Lambda = average rate
# Calculate probability of more than 3 occurrences
prob_more_than_3 <- 1 - ppois(3, lambda = 2)
prob_more_than_3
## [1] 0.1428765
________________________________________________________________________________