12 GBM Variable Selection
12.1 1. Introduction to Variable Selection in GBM
12.1.1 Why Variable Selection Matters in GBM for SDM
Gradient Boosting Machines (GBM) are powerful for Species Distribution Modeling (SDM), but they can become computationally expensive and prone to overfitting when too many predictors are included. Selecting the most important environmental variables improves:
✅ Model Accuracy – Reduces noise from irrelevant predictors.
✅ Model Simplicity – Fewer variables make the model easier to interpret.
✅ Faster Computation – Training and predictions become more efficient.
✅ Better Generalization – The model performs well on unseen data.
12.1.2 Challenges of Using Too Many Predictors
While GBM can handle many variables, not all predictors contribute equally to the model. Using too many predictors leads to:
⚠️ Overfitting – The model memorizes noise instead of learning general patterns.
⚠️ Longer Training Time – More variables mean more computations, slowing down model fitting.
⚠️ Difficult Interpretation – It becomes harder to explain why the model makes certain predictions.
⚠️ Collinearity Issues – Highly correlated variables can distort the model’s ability to learn independent relationships.
Example of a Poorly Selected Model
Imagine modeling the distribution of a bird species with 20 climate variables. If 10 of them are highly correlated, the model might give redundant or misleading predictions, overcomplicating interpretation.
12.1.3 Goal: Selecting the Most Relevant Environmental Variables
The aim of variable selection in GBM is to:
🔹 Identify which predictors strongly influence species distribution.
🔹 Remove weak or redundant variables that add noise.
🔹 Ensure the selected variables align with ecological understanding.
12.1.4 What’s Next?
In the next section, we will explore different methods for selecting the best variables for GBM models. These include:
📌 Feature Importance Scores – Identify which variables matter most.
📌 Recursive Feature Elimination (RFE) – Iteratively remove the weakest variables.
📌 Correlation Analysis – Avoid redundancy by removing correlated predictors.
📌 Cross-Validation-Based Selection – Keep only variables that improve test set performance.
🚀 Let’s dive into the different selection methods!
Selecting the right predictors is essential for efficient and accurate species distribution modeling (SDM) using Gradient Boosting Machines (GBM). Below are five key methods used to identify and retain the most relevant environmental variables.
12.1.5 1. Feature Importance Scores
GBM automatically assigns an importance score to each variable based on how often it is used to split the data across decision trees.
🔹 High-importance variables contribute significantly to model accuracy.
🔹 Low-importance variables can be removed to simplify the model without reducing performance.
How to use it?
- Train an initial GBM model using all predictors.
- Extract feature importance rankings and remove the weakest variables.
When to Use This?
Use feature importance as the first step before applying other variable selection methods.
12.1.6 2. Recursive Feature Elimination (RFE)
Recursive Feature Elimination (RFE) is an iterative approach where the least important variables are removed one by one, and the model is retrained each time.
🔹 Helps identify the optimal number of variables.
🔹 Ensures weak predictors do not dilute model accuracy.
How to use it?
- Start with all predictors and rank their importance.
- Remove the least important variable and retrain the GBM model.
- Repeat until performance no longer improves.
Downside:
RFE is computationally expensive because the model is retrained multiple times.
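The loop described above can be sketched in R. This is a minimal illustration (not the `caret::rfe` implementation): it assumes a data frame with a binary `presence` column, repeatedly drops the predictor with the lowest relative influence, and records the cross-validated deviance at each subset size.

```r
library(gbm)

# Backwards elimination sketch: drop the weakest predictor each round
rfe_gbm <- function(data, response = "presence", min_vars = 3) {
  vars <- setdiff(names(data), response)
  history <- list()
  while (length(vars) >= min_vars) {
    fit <- gbm(reformulate(vars, response), data = data,
               distribution = "bernoulli", n.trees = 500,
               shrinkage = 0.01, interaction.depth = 3, cv.folds = 5)
    cv_err  <- min(fit$cv.error)             # best CV deviance for this subset
    imp     <- summary(fit, plotit = FALSE)  # relative influence ranking
    weakest <- as.character(imp$var[which.min(imp$rel.inf)])
    history[[length(history) + 1]] <-
      data.frame(n_vars = length(vars), dropped_next = weakest, cv_error = cv_err)
    vars <- setdiff(vars, weakest)           # remove weakest, then retrain
  }
  do.call(rbind, history)                    # CV error vs. subset size
}
```

Plotting `cv_error` against `n_vars` from the returned table shows where removing further variables starts to hurt performance.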
12.1.7 3. Correlation Analysis
Many environmental variables (e.g., temperature, precipitation) are highly correlated, which can distort GBM’s ability to learn independent patterns.
🔹 Goal: Identify and remove redundant variables.
🔹 Solution: Use a correlation matrix and Variance Inflation Factor (VIF) to detect collinearity.
How to use it?
- Compute correlation coefficients between predictors.
- If two variables are highly correlated (|r| > 0.7), remove one.
Example:
If Bio1 (Annual Mean Temperature) and Bio5 (Maximum Temperature of the Warmest Month) are highly correlated, keep only one.
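The pairwise check can be sketched in base R with toy data (the variable names and values below are illustrative, not real Bioclim layers):

```r
# Toy example: two nearly collinear temperature variables plus one independent one
set.seed(1)
bio1  <- rnorm(100)                    # stand-in for annual mean temperature
bio5  <- bio1 + rnorm(100, sd = 0.1)   # stand-in for max temp of warmest month
bio12 <- rnorm(100)                    # stand-in for annual precipitation

env <- data.frame(bio1, bio5, bio12)
cor_matrix <- cor(env)
print(round(cor_matrix, 2))

# Flag pairs exceeding the |r| > 0.7 threshold (upper triangle only)
high_cor <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
print(high_cor)   # the bio1-bio5 pair is flagged; keep only one of the two
```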
12.1.8 4. AIC/BIC Model Comparison
Model selection using Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) helps determine the best predictor set based on model complexity vs. accuracy.
🔹 AIC favors models with fewer predictors while maintaining accuracy.
🔹 BIC penalizes models with too many variables, ensuring simplicity.
How to use it?
- Fit GBM models with different sets of variables.
- Compute AIC/BIC scores for each model.
- Select the model with the lowest AIC/BIC score.
Best Practice:
Use AIC/BIC alongside feature importance and correlation analysis for optimal variable selection.
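One caveat: `gbm` does not report AIC/BIC directly, since these are likelihood-based criteria designed for parametric models. A practical stand-in with the same accuracy-versus-complexity spirit is to compare candidate predictor sets by cross-validated deviance and prefer the smaller set when scores are close. The sketch below uses synthetic data and illustrative variable names so it runs end-to-end; it is not a standard `gbm` API.

```r
library(gbm)

# Synthetic stand-in data so the sketch is self-contained
set.seed(42)
n <- 300
train_data <- data.frame(bio1 = rnorm(n), bio5 = rnorm(n),
                         bio6 = rnorm(n), bio12 = rnorm(n))
train_data$presence <- rbinom(n, 1, plogis(train_data$bio1 - train_data$bio12))

# Candidate predictor sets to compare
candidates <- list(full    = c("bio1", "bio5", "bio6", "bio12"),
                   reduced = c("bio1", "bio12"))

cv_deviance <- function(vars) {
  fit <- gbm(reformulate(vars, "presence"), data = train_data,
             distribution = "bernoulli", n.trees = 500,
             shrinkage = 0.01, cv.folds = 5)
  min(fit$cv.error)  # best cross-validated deviance over the tree sequence
}

scores <- sapply(candidates, cv_deviance)
print(scores)  # prefer the smaller predictor set when deviances are close
```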
12.1.9 5. Cross-Validation-Based Selection
Cross-validation ensures that variable selection improves real-world predictive performance, not just training accuracy.
🔹 Goal: Keep only predictors that improve test set performance.
🔹 Method: Use AUC (Area Under Curve) and Accuracy on a validation dataset.
How to use it?
- Train a GBM model with all variables.
- Remove a predictor and check if AUC/accuracy decreases.
- Keep only the variables that consistently improve test set predictions.
Warning:
Cross-validation can be computationally expensive but ensures the final model is robust.
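The drop-one check above can be sketched as a helper function. It assumes train/test data frames with a binary `presence` column and uses `pROC` for the AUC; `drop_one_auc` is an illustrative name, not a library function.

```r
library(gbm)
library(pROC)

# For each predictor, refit without it and measure the change in test-set AUC
drop_one_auc <- function(train, test, response = "presence") {
  vars <- setdiff(names(train), response)

  # Baseline AUC with all predictors
  base_fit <- gbm(reformulate(vars, response), data = train,
                  distribution = "bernoulli", n.trees = 500, shrinkage = 0.01)
  base_auc <- auc(roc(test[[response]],
                      predict(base_fit, test, n.trees = 500, type = "response")))

  sapply(vars, function(v) {
    keep <- setdiff(vars, v)
    fit  <- gbm(reformulate(keep, response), data = train,
                distribution = "bernoulli", n.trees = 500, shrinkage = 0.01)
    a <- auc(roc(test[[response]],
                 predict(fit, test, n.trees = 500, type = "response")))
    as.numeric(base_auc - a)  # large positive = variable helps; near zero = droppable
  })
}
```

Variables whose removal leaves the AUC essentially unchanged are candidates for elimination.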
12.1.10 Summary of Variable Selection Methods
Method | When to Use | Pros | Cons |
---|---|---|---|
Feature Importance Scores | First step for identifying key variables. | Quick and easy. | May not remove all redundant variables. |
Recursive Feature Elimination (RFE) | When you want the best subset of features. | Finds optimal variable set. | Computationally expensive. |
Correlation Analysis | To remove redundant variables. | Improves model stability. | Doesn’t detect weakly correlated but irrelevant features. |
AIC/BIC Model Comparison | When balancing accuracy and simplicity. | Ensures model is not overly complex. | May remove useful predictors if over-penalized. |
Cross-Validation Selection | To optimize test set performance. | Ensures best real-world predictions. | Computationally expensive. |
Now that we understand methods for variable selection, we will move to the coding demonstration, applying these techniques in R to improve a GBM-based species distribution model. 🚀
12.3 3. Coding Demonstration: Variable Selection in GBM
This coding exercise will guide you through the process of selecting the most important variables for a GBM-based Species Distribution Model (SDM) in R. We will use built-in datasets and packages to ensure the workflow is reproducible.
12.3.1 Step 1: Load an SDM Dataset with Multiple Environmental Variables
We will use species presence-absence data together with bioclimatic predictors, such as the example data available through the dismo package.
12.3.2 Step 2: Train an Initial GBM Model with All Predictors
Before selecting variables, let’s train a baseline GBM model with all environmental predictors.
12.3.2.1 Split Data into Training and Testing Sets
# Load required package
library(caret)   # for createDataPartition()
# Set seed for reproducibility
set.seed(123)
# Create training (70%) and testing (30%) sets
# (assumes `data` is a data frame with a binary `presence` column)
trainIndex <- createDataPartition(data$presence, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
12.3.2.2 Train the Full GBM Model
# Load the gbm package
library(gbm)
# Train an initial GBM model
gbm_full <- gbm(presence ~ .,
data = train_data,
distribution = "bernoulli", # binary classification task
n.trees = 500,
shrinkage = 0.01,
interaction.depth = 3,
cv.folds = 5) # Cross-validation to prevent overfitting
# View feature importance (relative influence)
summary(gbm_full)
✅ What to Look For?
- Higher scores indicate variables that contribute the most to predictions.
- Low-importance variables should be considered for removal.
12.3.3 Step 3: Compute Feature Importance Scores and Remove Low-Importance Variables
GBM provides relative influence scores for each predictor. We will remove variables that contribute little to model accuracy.
12.3.3.1 Remove Low-Importance Variables
# Define a cutoff threshold for importance (e.g., remove variables < 2%)
importance <- summary(gbm_full, plotit = FALSE)
important_vars <- as.character(importance$var[importance$rel.inf > 2])
# Retain only important variables (plus the response)
train_data_reduced <- train_data[, c("presence", important_vars)]
test_data_reduced <- test_data[, c("presence", important_vars)]
✅ What to Look For?
- The feature importance plot highlights which variables significantly impact the model.
- Removing low-importance variables improves efficiency without reducing accuracy.
12.3.4 Step 4: Perform Correlation Analysis to Eliminate Redundant Variables
Environmental predictors are often highly correlated, which can introduce redundancy.
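The original text does not include code for this step, so here is a minimal sketch. It assumes the reduced data frames from Step 3 (`train_data_reduced`, `test_data_reduced`, with numeric predictors) and uses `caret::findCorrelation` to drop one variable from each highly correlated pair (|r| > 0.7), producing the `train_data_final` and `test_data_final` objects used in Step 5.

```r
library(caret)   # for findCorrelation()

# Compute the correlation matrix on the predictors only (drop the response)
predictors <- setdiff(names(train_data_reduced), "presence")
cor_matrix <- cor(train_data_reduced[, predictors])

# findCorrelation() flags one variable from each pair with |r| above the cutoff
to_drop    <- findCorrelation(cor_matrix, cutoff = 0.7, names = TRUE)
final_vars <- setdiff(predictors, to_drop)

# Keep the response plus the retained, weakly correlated predictors
train_data_final <- train_data_reduced[, c("presence", final_vars)]
test_data_final  <- test_data_reduced[, c("presence", final_vars)]
```

When deciding which member of a correlated pair to drop, ecological relevance can override the purely statistical choice.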
12.3.5 Step 5: Retrain GBM with Reduced Features and Compare Accuracy/AUC
12.3.5.1 Train the Reduced GBM Model
# Train a GBM model with selected variables
gbm_reduced <- gbm(presence ~ .,
data = train_data_final,
distribution = "bernoulli",
n.trees = 500,
shrinkage = 0.01,
interaction.depth = 3,
cv.folds = 5)
# View feature importance of reduced model
summary(gbm_reduced)
12.3.5.2 Evaluate Model Performance (AUC and Accuracy)
library(pROC)
# Predict on test set
full_pred <- predict(gbm_full, test_data, n.trees = 500, type = "response")
reduced_pred <- predict(gbm_reduced, test_data_final, n.trees = 500, type = "response")
# Compute AUC for full model
full_auc <- auc(roc(test_data$presence, full_pred))
# Compute AUC for reduced model
reduced_auc <- auc(roc(test_data_final$presence, reduced_pred))
# Print AUC Scores
print(paste("Full GBM AUC:", full_auc))
print(paste("Reduced GBM AUC:", reduced_auc))
12.3.5.3 Compare Accuracy
# Convert predicted probabilities to binary classes (0.5 threshold)
full_pred_class <- ifelse(full_pred > 0.5, 1, 0)
reduced_pred_class <- ifelse(reduced_pred > 0.5, 1, 0)
# Compute accuracy
full_acc <- mean(full_pred_class == test_data$presence)
reduced_acc <- mean(reduced_pred_class == test_data_final$presence)
print(paste("Full Model Accuracy:", full_acc))
print(paste("Reduced Model Accuracy:", reduced_acc))
✅ Expected Outcome:
- The reduced model should have similar AUC and accuracy to the full model but with fewer predictors.
- Computation time is reduced, making the model more efficient.
12.3.6 Key Observations
- Feature importance analysis helps remove weak variables.
- Correlation filtering prevents redundancy.
- The reduced model performs as well as the full model but is more efficient.
12.3.7 Summary of GBM Variable Selection Process
Step | Purpose | Outcome |
---|---|---|
Train Full Model | Baseline model with all variables. | Initial AUC and accuracy. |
Feature Importance Filtering | Remove low-impact variables. | Simplifies model without losing accuracy. |
Correlation Analysis | Remove highly correlated predictors. | Improves model interpretability. |
Train Reduced Model | Use only important, independent variables. | Faster and more efficient predictions. |
Compare Accuracy & AUC | Ensure reduced model performs as well as full. | Similar accuracy but improved efficiency. |
## 4. Model Evaluation After Variable Selection
After selecting the most relevant variables for GBM-based Species Distribution Modeling (SDM), we must evaluate the impact of variable selection on model performance, ecological relevance, and computational efficiency.
12.3.8 1. Compare Model Performance Before and After Variable Selection
We compare the full model (with all variables) and the reduced model (with selected variables) using AUC (Area Under Curve) and accuracy.
12.3.8.1 Code Example: AUC and Accuracy Comparison
library(pROC)
# Compute AUC for Full Model
full_pred <- predict(gbm_full, test_data, n.trees = 500, type = "response")
full_auc <- auc(roc(test_data$presence, full_pred))
# Compute AUC for Reduced Model
reduced_pred <- predict(gbm_reduced, test_data_final, n.trees = 500, type = "response")
reduced_auc <- auc(roc(test_data_final$presence, reduced_pred))
# Print AUC Scores
print(paste("Full Model AUC:", full_auc))
print(paste("Reduced Model AUC:", reduced_auc))
✅ Expected Result:
- If the AUC remains similar, the reduced model is just as effective while being more efficient.
- If the AUC decreases significantly, important predictors may have been removed.
12.3.9 2. Visualizing Response Curves
Response curves show how environmental variables influence species suitability. Ensuring that response curves remain biologically meaningful after variable selection is crucial.
12.3.9.1 Code Example: Response Curve Visualization
# Compare response curves for Bio1 in the full vs. reduced models
par(mfrow = c(1, 2))
plot.gbm(gbm_full, i.var = "bio1", main = "Full Model: Bio1")
plot.gbm(gbm_reduced, i.var = "bio1", main = "Reduced Model: Bio1")
✅ What to Look For?
- Similar response curves between the full and reduced models indicate that key environmental drivers are preserved.
- If response curves change drastically, an important predictor may have been removed.
12.3.10 3. Assessing Computational Efficiency Gains
A key advantage of variable selection is reducing computational time. We compare training times before and after variable selection.
12.3.10.1 Code Example: Compute Training Time
# Measure time for full model
start_time <- Sys.time()
gbm_full <- gbm(presence ~ ., data = train_data, distribution = "bernoulli", n.trees = 500)
end_time <- Sys.time()
full_time <- end_time - start_time
# Measure time for reduced model
start_time <- Sys.time()
gbm_reduced <- gbm(presence ~ ., data = train_data_final, distribution = "bernoulli", n.trees = 500)
end_time <- Sys.time()
reduced_time <- end_time - start_time
# Print time comparison
print(paste("Full Model Training Time:", full_time))
print(paste("Reduced Model Training Time:", reduced_time))
✅ Expected Result:
- The reduced model should train faster, improving efficiency without losing predictive power.
- In large datasets, this speed improvement can be significant.
12.4 5. Best Practices & Common Pitfalls
12.4.1 1. Avoid Over-Removing Important Variables
- Pitfall: Removing slightly less important variables may still affect model performance.
- Solution: Gradually remove variables and compare AUC/accuracy at each step.
12.4.2 2. Ensure Biological/Ecological Relevance
- Pitfall: Some variables may be statistically weak but ecologically essential.
- Solution: Consult ecological knowledge before eliminating predictors.
Example:
Even if elevation has low statistical importance, it might still be critical for mountain species.
12.4.3 3. Balance Model Simplicity with Accuracy
- Pitfall: Keeping too many predictors makes the model complex and slow.
- Solution: Use AIC, BIC, or cross-validation to find the best trade-off between accuracy and simplicity.
✅ Key Takeaway: The goal is to maintain high predictive power while removing unnecessary complexity.
12.5 6. Summary & Key Takeaways
12.5.1 1. Why Variable Selection Matters
- Reducing redundant variables improves model accuracy and interpretability.
- Removing unnecessary predictors speeds up training and prediction times.
12.5.2 2. Recommended Workflow for GBM Variable Selection
Step | Purpose | Method |
---|---|---|
Train Full Model | Establish a baseline performance. | Train GBM with all variables. |
Compute Feature Importance | Identify key predictors. | Use GBM’s built-in importance ranking. |
Remove Low-Importance Variables | Simplify the model. | Drop predictors with very low scores. |
Perform Correlation Analysis | Eliminate redundant variables. | Remove highly correlated predictors (r > 0.7). |
Retrain GBM | Improve computational efficiency. | Use only selected variables. |
Evaluate Performance | Ensure no loss in predictive power. | Compare AUC, accuracy, and response curves. |
12.5.3 3. Iterative Refinement for SDM Models
- Reassess after each removal step to avoid losing important predictors.
- Validate response curves to ensure ecological interpretability.
- Adjust hyperparameters to optimize performance after feature reduction.
✅ Final Takeaway:
By selecting the right variables, GBM models remain accurate, efficient, and ecologically meaningful, making them powerful tools for Species Distribution Modeling (SDM). 🚀