Statistical Programming, Simulation and Modeling

Dr. Basim Alsaedi

2025-04-12

What is Statistical Programming?

Statistical programming combines:

Programming skills
Statistical knowledge

To perform tasks like:

Data cleaning and preparation
Exploration and visualization
Statistical modeling
Simulation and hypothesis testing
Reproducible analyses

Why Learn Statistical Programming?

Data literacy: Understand and communicate with data
Problem-solving: Apply statistics to real-world problems
Career opportunities: Skills for data science and analytics
Research capabilities: Design and conduct statistical analyses

Probability Distributions and Simulation

Why Simulate Data?

Simulation helps us:

Understand theoretical distribution properties
Test hypotheses when analytical solutions are difficult
Generate synthetic data for testing
Model real-world processes

Normal Distribution

The normal distribution has the familiar bell shape:

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{- \frac{(x - \mu)^{2}}{2\sigma^{2}}}\]

Simulating Normal Data in R

# Generate random values from a normal distribution
normal_data <- rnorm(1000, mean = 0, sd = 1)

# Plot a histogram
hist(normal_data, breaks = 30, col = "skyblue", border = "white",
     main = "Normal Distribution",
     xlab = "Value", prob = TRUE)

# Add theoretical density curve
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, col = "red", lwd = 2)

Normal Distribution: Real-world Example

Human Heights:

Binomial Distribution

Models the number of successes in fixed trials:

\[P(X = k) = \binom{n}{k}p^{k}(1 - p)^{n - k}\]

Simulating Binomial Data in R

# Generate binomial random values
binomial_data <- rbinom(1000, size = 10, prob = 0.3)

# Plot the distribution
barplot(table(binomial_data)/1000, col = "salmon", 
        main = "Binomial Distribution (n = 10, p = 0.3)",
        xlab = "Number of Successes", ylab = "Probability")

# Add theoretical probabilities
points(0:10, dbinom(0:10, size = 10, prob = 0.3), 
       pch = 16, col = "blue")
lines(0:10, dbinom(0:10, size = 10, prob = 0.3), 
      col = "blue", lwd = 2)

Binomial Distribution: Real-world Example

Coin Flips:

Poisson Distribution

Models events in fixed time/space intervals:

\[P(X = k) = \frac{\lambda^{k}e^{- \lambda}}{k!}\]

Simulating Poisson Data in R

# Generate Poisson random values
poisson_data <- rpois(1000, lambda = 3)

# Plot the distribution
barplot(table(poisson_data)/1000, col = "lightblue", 
        main = "Poisson Distribution (λ = 3)",
        xlab = "Number of Events", ylab = "Probability")

# Add theoretical probabilities
points(0:max(poisson_data), dpois(0:max(poisson_data), lambda = 3), 
       pch = 16, col = "red")
lines(0:max(poisson_data), dpois(0:max(poisson_data), lambda = 3), 
      col = "red", lwd = 2)

Poisson Distribution: Real-world Example

Call Center Arrivals:

Uniform Distribution

Equal probability across all values in a range:

\[f(x) = \frac{1}{b - a}, \quad a \leq x \leq b\]

Simulating Uniform Data in R

# Generate uniform random values
uniform_data <- runif(1000, min = 0, max = 1)

# Plot a histogram
hist(uniform_data, breaks = 20, col = "lightgoldenrod", border = "white",
     main = "Uniform Distribution (min = 0, max = 1)",
     xlab = "Value", prob = TRUE)

# Add the theoretical density line
abline(h = 1, col = "red", lwd = 2)

Exponential Distribution

Models time between events:

\[f(x) = \lambda e^{-\lambda x}, \quad x \geq 0\]

Simulating Exponential Data in R

# Generate exponential random values
exponential_data <- rexp(1000, rate = 0.5)

# Plot a histogram
hist(exponential_data, breaks = 30, col = "thistle", border = "white",
     main = "Exponential Distribution (rate = 0.5)",
     xlab = "Value", prob = TRUE)

# Add the theoretical density curve
curve(dexp(x, rate = 0.5), from = 0, to = 10, add = TRUE, 
      col = "purple", lwd = 2)

Exponential Distribution: Real-world Example

Customer Service Times:

Monte Carlo Simulation

What is Monte Carlo Simulation?

Monte Carlo methods use repeated random sampling to:

Define the domain of possible inputs
Generate inputs randomly from the domain
Perform computations using these inputs
Aggregate the results

Applications of Monte Carlo Simulation

Statistics and probability calculations
Finance for risk assessment
Physics and chemistry simulations
Engineering for reliability analysis
Machine learning for uncertainty estimation

Estimating Probabilities: Example 1

Probability of rolling a sum of 7 with two dice:

Monte Carlo for Probability: Code

# Function to simulate rolling two dice
roll_two_dice <- function() {
  sum(sample(1:6, 2, replace = TRUE))
}

# Run many simulations
rolls <- replicate(10000, roll_two_dice())

# Estimate the probability
prob_sum_7 <- mean(rolls == 7)
cat("Estimated probability:", prob_sum_7, "\n")
cat("Theoretical probability:", 6/36, "\n")

Estimating π Using Monte Carlo

Using random points in a square to estimate π:

Monte Carlo Integration

Numerical integration using random numbers:

Estimating Expected Values

Monte Carlo for expected values of complex distributions:

Portfolio Returns Simulation

Portfolio Returns: Key Statistics

For a portfolio with 60% stocks, 35% bonds, and 5% cash:

## Expected portfolio return: 6.36 %

## Portfolio risk (standard deviation): 11.2 %

## Probability of negative return: 28.61 %

Central Limit Theorem

As sample size increases, the distribution of sample means approaches a normal distribution:

Effects of Outliers on Regression

Regression Coefficients Comparison

## Clean data model: y = 1.93 + 2.99 x

## Outlier data model: y = 1.43 + 3.14 x

A single outlier dramatically changed the regression line!

Bootstrap Resampling

Estimating sampling distributions by resampling the original data:

Bootstrap: Key Results

For our original sample:

## Original sample mean: 81.33333

## 95% bootstrap confidence interval: 76.25 to 86.08333

Practice Exercise: Basic Probability

Problem: What’s the probability of at least one 6 in three dice rolls?

Practice Exercise: Confidence Intervals

How sample size affects confidence interval width:

Key Takeaways

Statistical Programming combines programming and statistics for data analysis
Probability Distributions model different types of random phenomena
Simulation Techniques allow us to:
- Generate synthetic data for testing
- Estimate complex probabilities
- Calculate integrals and expected values
- Verify statistical theories
- Assess statistical method robustness