Boruta feature selection


Boruta is an all-relevant feature selection method built on Random Forest: rather than finding a minimal feature subset, it aims to identify every feature that has a meaningful impact on the target variable. The algorithm works as follows:

1. Create "shadow features" by shuffling the values of each original feature, destroying any real relationship with the target.
2. Train a Random Forest model on both the original and shadow features.
3. Calculate importance scores for all features.
4. Compare each real feature with the most important shadow feature: if a feature scores higher, it's marked important; if lower, unimportant.
5. Re-evaluate uncertain (tentative) features over multiple iterations.

Boruta is robust and captures complex feature interactions. However, it is computationally intensive, especially with large datasets.
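The shadow-feature comparison at the heart of the steps above can be sketched in a few lines. This is a simplified, hypothetical illustration only: it uses absolute Pearson correlation as a stand-in importance score (real Boruta uses Random Forest importance and repeats the test over many iterations), and the data, feature names, and `boruta_step` helper are all made up for the example.

```python
import random

def importance(feature, target):
    """Stand-in importance score: absolute Pearson correlation.
    (Boruta uses Random Forest importance; this is just for illustration.)"""
    n = len(feature)
    mx = sum(feature) / n
    my = sum(target) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(feature, target))
    sx = sum((a - mx) ** 2 for a in feature) ** 0.5
    sy = sum((b - my) ** 2 for b in target) ** 0.5
    return abs(cov / (sx * sy))

def boruta_step(features, target, rng):
    """One Boruta-style round: build a shuffled shadow copy of each
    feature, take the best shadow score as the threshold (shadowMax),
    and label each real feature by comparing against it."""
    shadows = {name: rng.sample(col, len(col)) for name, col in features.items()}
    shadow_max = max(importance(col, target) for col in shadows.values())
    return {name: ("important" if importance(col, target) > shadow_max
                   else "unimportant")
            for name, col in features.items()}

rng = random.Random(0)
n = 60
# Toy data: three random features, but only f0 actually drives the target y.
features = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("f0", "f1", "f2")}
y = [3 * a + rng.gauss(0, 0.1) for a in features["f0"]]

print(boruta_step(features, y, rng))
```

Because the shadow copies are shuffled, their scores reflect pure chance; a real feature only counts as important when it beats the best of them, which is what makes the threshold data-driven rather than arbitrary.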

Interpreting the Boruta plot by color:

🟢 WIS, HE, AFB, DC: Marked green — these are important features with higher importance than the shadowMax.
🔴 ANCC: Marked red — consistently less important than shadow features, so it’s unimportant.
🟡 shadowMax: Sets the threshold — real features must exceed this to be considered important.
🔵 Blue features: Could be tentative or rejected, depending on their exact comparison to shadowMax.
✅ Focus on green features for modeling; consider removing red and re-evaluating blue ones.

Boruta Feature Selection in R

Here is the R code I used to perform feature selection with Boruta:


# Load the Boruta package
library(Boruta)

# Read the training data and inspect its structure
traindata <- read.csv("data.csv", header = TRUE, stringsAsFactors = FALSE)
str(traindata)

# Run Boruta with Birth_interval as the target (doTrace = 2 prints progress)
set.seed(123)
boruta_result <- Boruta(Birth_interval ~ ., data = traindata, doTrace = 2)

# Extract selected attributes, including tentative ones
selected_boruta <- getSelectedAttributes(boruta_result, withTentative = TRUE)
print(selected_boruta)
print(boruta_result)

# Plot feature importance boxplots for all attributes
plot(boruta_result, xlab = "")