Chapter 3 TESTS FOR AUTOCORRELATION AND TEMPORAL EFFECTS

Environmental data are not always stable or stationary over time or space; they are frequently subject to sustained or cyclical changes in the factors that affect them. The following concepts will help us understand temporal effects:

Temporal trends: refer to changes in a variable or phenomenon over time. Consider monitoring soil nitrogen levels in an agricultural field over several years. Nitrogen may rise temporarily immediately after a nitrogen-based fertilizer is applied and then return to roughly its average level over time. There may also be a gradual long-term increase in soil nitrogen, but it will not match the immediate spikes that follow each fertilizer application.

Autocorrelation or serial correlation: is the correlation between values of the same variable measured at different times (or locations). Positive autocorrelation means that values close together in time or space tend to be similar, while negative autocorrelation means that they tend to differ.

Monotonic trend: a trend that moves in one direction only, either consistently upward or consistently downward. A good example is the concentration of carbon dioxide (CO2) in the atmosphere, which has been rising steadily due to industrial activities and deforestation.

Cyclic or seasonal trend: refers to patterns that repeat at regular intervals, such as hourly, daily, or annually. These trends are usually driven by natural cycles, human activities, or climatic conditions. For instance, phosphorus levels in water bodies may increase during rainy seasons due to agricultural runoff and decline during dry seasons.

Many statistical procedures assume that the values in a data sample are independent (uncorrelated) and identically distributed. When this assumption is violated, the data are said to be autocorrelated or serially correlated.

These are the tests that are used to check for autocorrelation:

  • Sample Autocorrelation Function (ACF): used for a single time series where the data are normally distributed.

  • Rank von Neumann ratio test: used when the data sample shows evidence of non-normality.

  • Complete block design ANOVA or Friedman test: used for multiple data samples, for instance when testing the concentration of a groundwater contaminant from multiple monitoring wells at a site.

3.1 Test for Autocorrelation Using the Sample Autocorrelation Function

Before testing for autocorrelation using the ACF method, the data sample should meet the following requirements (a quick screening sketch in R follows the list):

  • be normally distributed.
  • be stationary (that is, not trending).
  • be free of outliers.
  • contain at least 10 to 12 observations.
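These assumptions can be screened quickly in R before computing the ACF. The sketch below is only an illustration; the vector x holds hypothetical data, not part of the original example.

# Quick screening of the ACF assumptions on a hypothetical sample `x`
x <- c(4.2, 3.9, 4.5, 4.1, 4.8, 4.4, 4.6, 4.3, 4.7, 4.5, 4.9, 4.4)

shapiro.test(x)          # Shapiro-Wilk test: p > 0.05 is consistent with normality
plot(x, type = "b")      # time series plot to eyeball a trend (non-stationarity)
boxplot(x)               # boxplot to flag potential outliers
length(x) >= 10          # check the minimum sample size guideline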

Here is how the ACF is computed:

  1. Arrange the data in lagged data pairs (\(x_{i}, x_{i + k}\)), for \(i\) = 1, 2, …, (\(n-k\)), where:
  • \(n\) is the size of the data sample (the number of data values in it).
  • \(k\) is the lag, i.e. the number of sampling events (dates) separating one data value in the pair from the second. For instance, a lag of 1 means that the two values in each pair were collected in consecutive time intervals (e.g. days), while a lag of 2 means they were collected two time intervals apart.
  2. Calculate the sample autocorrelation coefficient \[r_k = \frac{\sum^{n-k}_{i=1}(x_i-\bar{x})(x_{i+k}- \bar{x})}{\sum^n_{i=1}(x_i - \bar{x})^2}\] where:
  • \(n\) is the data sample size.
  • \(k\) is the lag.
  • \(\bar{x}\) is the data sample mean.
  • \(x_{i}, x_{i + k}\) are the components of the data pairs formed based on the lag.
  3. If \(k\) = 1, the sample autocorrelation coefficient \(r_1\) is referred to as the first-order sample autocorrelation coefficient. If \(k\) = 2, \(r_2\) is the second-order coefficient, and so on.

Let's break down the possible results of the autocorrelation analysis:

  • \(r_0\) = 1: Autocorrelation at lag 0 is always 1 because a value is perfectly correlated with itself.
  • Random data: if the data are random, most autocorrelation coefficients will be close to zero at all lags.
  • Autocorrelated data: if there is autocorrelation, some coefficients (\(r_k\)) will be significantly different from zero, and their strength typically decreases at higher lags.
  • Trend in data: if there is a trend, the coefficients will not die out quickly; they remain large across many lags, showing persistent correlation.

In summary, for normally distributed data with no correlation, each \(r_k\) should be close to zero. At the 95% confidence level, autocorrelation is considered insignificant if no \(r_k\) exceeds the threshold \(\frac{2}{\sqrt{n}}\) in absolute value. The autocorrelation coefficients are plotted in a correlogram, with the confidence limits shown as horizontal lines.
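To make the computation concrete, here is a minimal sketch that calculates \(r_k\) by hand and compares it with R's built-in acf() function; the series x below is hypothetical.

# Compute r_k by hand for a hypothetical series and compare with acf()
x <- c(5.1, 4.8, 5.3, 5.0, 5.6, 5.2, 5.8, 5.5, 6.0, 5.7, 6.2, 5.9)
n <- length(x)
xbar <- mean(x)

r_k <- function(k) {
  # lagged pairs (x_i, x_{i+k}) for i = 1, ..., n - k
  sum((x[1:(n - k)] - xbar) * (x[(1 + k):n] - xbar)) / sum((x - xbar)^2)
}

sapply(1:4, r_k)        # first four sample autocorrelation coefficients
2 / sqrt(n)             # approximate 95% significance threshold
acf(x, plot = FALSE)    # built-in ACF for comparison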

Try it!

Climate and weather data are central to ecological research, helping us understand environmental trends and variability. In this exercise, you’ll examine whether daily temperature data from Melbourne exhibit autocorrelation over time. Autocorrelation indicates that today’s temperature might be correlated with yesterday’s, the day before, and so on. Detecting significant autocorrelation is important because it affects how we model and predict future climate patterns.

You are required to retrieve the daily minimum temperature data for Melbourne and save it as a CSV file. Formulate the hypotheses:

  • Null hypothesis: The daily minimum temperature series is random (i.e., there is no significant autocorrelation).
  • Alternate hypothesis: The daily temperature time series exhibits significant autocorrelation (i.e., past values influence future values).

Perform the analysis with R

# Load necessary libraries
library(readr)
library(ggplot2)

# Load the data set
melbourne_temp <- read_csv("data/melboune-daily-temperature.csv")
## Warning: One or more parsing issues, call `problems()` on
## your data frame for details, e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 3651 Columns: 2
## ── Column specification ─────────────────────────
## Delimiter: ","
## chr (2): Date, Daily minimum temperatures in Melbourne, Australia, 1981-1990
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect the data structure
str(melbourne_temp)
## spc_tbl_ [3,651 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Date                                                         : chr [1:3651] "1981-01-01" "1981-01-02" "1981-01-03" "1981-01-04" ...
##  $ Daily minimum temperatures in Melbourne, Australia, 1981-1990: chr [1:3651] "20.7" "17.9" "18.8" "14.6" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date = col_character(),
##   ..   `Daily minimum temperatures in Melbourne, Australia, 1981-1990` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Convert the 'Date' column to Date format.
melbourne_temp$Date <- as.Date(melbourne_temp$Date, format = "%Y-%m-%d")

# Rename the temperature column
names(melbourne_temp)[names(melbourne_temp) == 'Daily minimum temperatures in Melbourne, Australia, 1981-1990'] <- 'Temperature'

# Convert to numeric 
melbourne_temp <- transform(melbourne_temp, Temperature = as.numeric(Temperature))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
# Remove rows with missing values 
melbourne_temp <- na.omit(melbourne_temp)

# Sort data by date if not already sorted
melbourne_temp <- melbourne_temp[order(melbourne_temp$Date), ]

# Create a time series object (assuming daily data; adjust the 'frequency' parameter if needed)
# Here, we assume the data is daily, so frequency = 365
temp_ts <- ts(melbourne_temp$Temperature, frequency = 365)

# Plot the time series to visualize trends
plot(temp_ts, main = "Daily Temperature Time Series (Melbourne)",
     xlab = "Time", ylab = "Temperature (°C)")

# Compute and plot the Sample Autocorrelation Function (ACF)
acf(temp_ts, main = "ACF of Melbourne Daily Temperature")

The results show the time series of daily temperature in Melbourne and its corresponding autocorrelation function (ACF). There is a cyclic trend in temperature over time, and the ACF plot shows strong autocorrelation at shorter lags, indicating temporal dependency in the data.
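As a complement to the visual ACF check (this step is an addition to the workflow above, not a requirement), the null hypothesis of randomness can also be tested formally with the Ljung-Box test from base R:

# Ljung-Box test: a small p-value indicates significant autocorrelation
# up to the chosen lag (lag = 30 is an arbitrary choice here)
Box.test(temp_ts, lag = 30, type = "Ljung-Box")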

3.2 An Example on Site-Wide Temporal Effects

Analysis of Site-Wide Temporal Effects on Groundwater Manganese Concentrations

Generate the data

# Load necessary libraries
library(dplyr)

# Set seed for reproducibility
set.seed(42)

# Generate synthetic manganese concentration data
Mn_values <- c(3.5, 2.8, 4.1, 3.9, 5.2, 4.8, 6.0, 5.7,  # Well 1
               3.2, 2.5, 3.8, 3.7, 5.0, 4.6, 5.8, 5.5,  # Well 2
               3.6, 3.0, 4.2, 4.0, 5.3, 4.9, 6.1, 5.8,  # Well 3
               3.3, 2.6, 3.9, 3.8, 5.1, 4.7, 5.9, 5.6)  # Well 4

# Create a data frame
Mn2Data8 <- data.frame(
  Mn = Mn_values,
  Location = rep(c("Well1", "Well2", "Well3", "Well4"), each = 8),
  Time = rep(paste0("Q", 1:8), times = 4)
)

# Convert factors
Mn2Data8$Location <- as.factor(Mn2Data8$Location)
Mn2Data8$Time <- as.factor(Mn2Data8$Time)

# Display the first few rows
head(Mn2Data8)
##    Mn Location Time
## 1 3.5    Well1   Q1
## 2 2.8    Well1   Q2
## 3 4.1    Well1   Q3
## 4 3.9    Well1   Q4
## 5 5.2    Well1   Q5
## 6 4.8    Well1   Q6

Groundwater manganese concentrations have been monitored quarterly across four wells for a period of two years (eight quarters). The data set above includes one measurement from each well per quarter. The goal is to determine whether there is a site-wide temporal trend in manganese concentrations over time.

To establish whether a time trend exists, we must first check for spatial variability among the four wells. If the manganese concentrations do not significantly differ across wells, we can then assess whether concentrations change over time.

Here is the process that will be followed;

  1. Check for spatial variability (homogeneity) by:
  • Testing whether the means or medians of the four well data sets are statistically equal.
  • Using a two-way ANOVA without replication (complete block design) if the data sets show normality and homogeneity of variances; otherwise, using the non-parametric Friedman test.
  2. Check for temporal trends by:
  • Testing for a significant trend over time if no spatial variability is found.
  • Performing the two-way ANOVA again with the roles of Time and Location switched: Time becomes the factor of interest and Location the blocking factor.

The ANOVA model follows the equation: \[{Y_{ij}}={\mu + \alpha_i + \beta_j + \epsilon_{ij}}\] Where:

  • \(Y_{ij}\) is the Manganese concentration
  • \(\mu\) is the overall mean
  • \(\alpha_i\) represents the effect of the well (location)
  • \(\beta_j\) represents the effect of time (quarter)
  • \(\epsilon_{ij}\) is the error term

The hypothesis tests:

  • For spatial variability: \(H_0:\mu_1=\mu_2=\mu_3=\mu_4\) (no significant difference between wells)
  • For temporal trend: \(H_0:\mu_{Q1}=\mu_{Q2}=...=\mu_{Q8}\) (no significant difference across quarters)

Check for variability

# Step 1: Check for Spatial Variability
Mn2.anova1 <- aov(Mn ~ Location + Time, data = Mn2Data8)
summary(Mn2.anova1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Location     3   0.57   0.189   125.5 1.44e-13 ***
## Time         7  35.63   5.091  3387.1  < 2e-16 ***
## Residuals   21   0.03   0.002                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results:

  • Location: the p-value (denoted Pr(>F) in the table above) is less than 0.05, so there is significant spatial variability.
  • Time: the p-value is less than 0.05, so there is a significant temporal effect.

If there were no spatial variability, we could proceed directly to test for a time trend. However, let's continue anyway to show how it is done.

Check the code below

# Step 2: Check for Temporal Effects
Mn2.anova2 <- aov(Mn ~ Time + Location, data = Mn2Data8)
summary(Mn2.anova2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Time         7  35.63   5.091  3387.1  < 2e-16 ***
## Location     3   0.57   0.189   125.5 1.44e-13 ***
## Residuals   21   0.03   0.002                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for Time is far less than 0.05. The results confirm that manganese concentrations vary significantly over time.
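The Friedman test mentioned earlier is the non-parametric counterpart of this blocked ANOVA, used when normality or homogeneity of variances cannot be assumed. A minimal sketch on the same synthetic data would be:

# Non-parametric alternative: Friedman test with Time as the factor of interest
# and Location (well) as the blocking factor
friedman.test(Mn ~ Time | Location, data = Mn2Data8)

# Reversing the roles screens for spatial variability across wells instead
friedman.test(Mn ~ Location | Time, data = Mn2Data8)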

3.3 TESTS FOR TREND

When working with data that has a time component (i.e., data collected at different points in time), a time series plot is a useful first step in exploring whether the values are randomly distributed or if a trend or pattern exists. If the plot suggests a trend, the next step is to determine whether this trend is statistically significant or just a random occurrence.

To assess this, we can use parametric or nonparametric methods, depending on whether the data meets certain assumptions such as normality, absence of outliers, and constant variance.

  • Parametric methods assume that the data follows a specific distribution (usually normal). These methods rely on parameters like mean and variance. A common parametric approach for trend detection is simple linear regression.

  • Nonparametric methods do not require the data to follow any particular distribution, making them more flexible. Two widely used nonparametric trend tests are:

    • The Mann-Kendall test, which determines whether a trend exists but does not measure its magnitude.
    • The Theil-Sen method, which not only detects trends but also estimates the slope (rate of change) of the trend line.

In some cases, external factors like temperature or streamflow rate can influence the trend results. If these factors are not accounted for, they may lead to incorrect conclusions. Therefore, it is important to adjust for such variables before performing trend analysis.
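As a minimal sketch of such an adjustment (the data frame df and its nitrate and flow columns are hypothetical), a common approach is to regress the variable of interest on the covariate and then run the trend test on the residuals:

# Adjust for a covariate (streamflow) before trend testing; df is hypothetical data
library(Kendall)

df <- data.frame(
  nitrate = c(2.1, 2.4, 2.2, 2.8, 2.6, 3.0, 2.9, 3.2, 3.1, 3.4, 3.3, 3.6),
  flow    = c(10, 14, 11, 18, 15, 20, 17, 22, 19, 24, 21, 26)
)

# Remove the influence of streamflow, then test the residuals for a monotonic trend
adj_model <- lm(nitrate ~ flow, data = df)
MannKendall(residuals(adj_model))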

3.4 Hands-On Exercises

You will be required to download the Global Land Temperature by City data set hosted on Kaggle from here. The data provides historical monthly average temperatures covering multiple countries and cities around the world.

Use the data set to find solutions to the problems below. Note: we will focus on New York City.

  1. Load the data set, handle any missing values appropriately, and sort the data chronologically.
  2. Determine whether there is any autocorrelation in the historical monthly temperature by:
  • Computing and plotting the ACF of the temperature time series.
  • Identifying significant lags where autocorrelation exists in the plot.
  3. Assess whether there is a linear trend in the temperature data over time by:
  • Fitting a simple linear regression model with time as the predictor and temperature as the response variable.
  • Examining the slope coefficient of the regression line to determine the direction and significance of the trend.
  4. Perform different non-parametric tests for trend as instructed below:
  • Apply the Mann-Kendall test to the time series to test for a monotonic trend in temperature. Determine whether there is a significant upward or downward trend.
  • Account for seasonality in the data by applying the Seasonal Mann-Kendall test to the monthly temperature data.
  • Use the Theil-Sen estimator to estimate the median slope of the temperature trend, which is robust to outliers. Evaluate the magnitude and direction of the trend.

Solution

  1. Load the data set, handle any missing values appropriately and sort the data chronologically.
library(dplyr)
# Load data 
data <- read.csv("data/global_land_temperature_by_city/data.csv")

# Handle missing values 
data <- na.omit(data)

# Sort the data chronologically for the New York City
sorted_data <- data %>%
  filter(City == "New York") %>%
  mutate(date = as.Date(dt)) %>%
  select(date, AverageTemperature) %>% 
  arrange(date)
  2. Determine whether there is any autocorrelation in the historical monthly temperature by:
  • Compute and plot the ACF of the temperature time series.
  • Identify significant lags where autocorrelation exists in the plot.
# Test for Autocorrelation using ACF
acf(sorted_data$AverageTemperature, 
    main = "ACF of New York Monthly Average Temperature")

  3. Assess whether there is a linear trend in the temperature data over time by:
  • Fitting a simple linear regression model with time as the predictor and temperature as the response variable.
  • Examining the slope coefficient of the regression line to determine the direction and significance of the trend.
# Parametric Test for Trends 
## Simple Linear Regression
sorted_data <- sorted_data %>%
  mutate(time = as.numeric(date - min(date))) # get difference in days

lm_model <- lm(AverageTemperature ~ time, data = sorted_data)

summary(lm_model)
## 
## Call:
## lm(formula = AverageTemperature ~ time, data = sorted_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.9985  -8.5816   0.0129   8.7943  20.4073 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.619e+00  3.475e-01  24.805  < 2e-16 ***
## time        1.773e-05  5.994e-06   2.959  0.00311 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.239 on 3117 degrees of freedom
## Multiple R-squared:  0.0028, Adjusted R-squared:  0.00248 
## F-statistic: 8.753 on 1 and 3117 DF,  p-value: 0.003115

Remember to add the regression line to a scatter plot of temperature against time to visualize the fitted trend.
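Based on the output above, the slope is small (about 1.8e-05 °C per day) but statistically significant (p ≈ 0.003). A minimal sketch of the scatter plot with the fitted line, using the objects already created, is:

# Scatter plot of temperature against time with the fitted regression line
plot(sorted_data$time, sorted_data$AverageTemperature,
     pch = 16, cex = 0.4, col = "grey60",
     xlab = "Days since first observation", ylab = "Average temperature (°C)",
     main = "New York Monthly Average Temperature with Fitted Trend")
abline(lm_model, col = "red", lwd = 2)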

  4. Perform different non-parametric tests for trend as instructed below:
  • Apply the Mann-Kendall test to the time series to test for a monotonic trend in temperature. Determine whether there is a significant upward or downward trend.
# Load libraries
library(Kendall)

# Perform the test
# Perform the test on the temperature series
mk_test <- MannKendall(sorted_data$AverageTemperature)
print(mk_test)
  • Account for seasonality in the data by applying the Seasonal Mann-Kendall test to the monthly temperature data.
# Seasonal Mann Kendall test
smk_test <- SeasonalMannKendall(ts(sorted_data$AverageTemperature, frequency = 12))
print(smk_test)
## tau = 0.154, 2-sided pvalue =< 2.22e-16
  • Use the Theil-Sen estimator to estimate the median slope of the temperature trend, which is robust to outliers. Evaluate the magnitude and direction of the trend.
library(trend)
# Theil-Sen Trend test 
ts_test <- sens.slope(sorted_data$AverageTemperature)
print(ts_test)
## 
##  Sen's slope
## 
## data:  sorted_data$AverageTemperature
## z = 3.179, n = 3119, p-value = 0.001478
## alternative hypothesis: true z is not equal to 0
## 95 percent confidence interval:
##  0.0001963213 0.0008278986
## sample estimates:
##  Sen's slope 
## 0.0005120968
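Note that sens.slope() treats the series as equally spaced observations with a unit time step, so the estimated slope (about 0.0005 °C) is expressed per observation, i.e., per month for these data. Together with the small p-value, this indicates a small but statistically significant upward trend.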

________________________________________________________________________________