Filter Based Feature Selection for mlr3 • mlr3filters

Package website: release | dev

{mlr3filters} adds feature selection filters to mlr3. The implemented filters can be used stand-alone, or as part of a machine learning pipeline in combination with mlr3pipelines and the filter operator.

Wrapper methods for feature selection are implemented in mlr3fselect. Learners that support the extraction of feature importance scores can be combined with a filter from this package for embedded feature selection.

Installation

CRAN version

install.packages("mlr3filters")

Development version

remotes::install_github("mlr-org/mlr3filters")

Filters

Filter Example

set.seed(1)
library("mlr3")
library("mlr3filters")

task = tsk("sonar")
filter = flt("auc")
head(as.data.table(filter$calculate(task)))

##    feature     score
## 1:     V11 0.2811368
## 2:     V12 0.2429182
## 3:     V10 0.2327018
## 4:     V49 0.2312622
## 5:      V9 0.2308442
## 6:     V48 0.2062784
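As a stand-alone use of the scores above, a task can be reduced to its highest-scoring features. The following is a minimal sketch; the cutoff of ten features and the variable name `top10` are arbitrary choices made for illustration, not a recommendation from the package.

```r
library(mlr3)
library(mlr3filters)

task = tsk("sonar")
filter = flt("auc")
filter$calculate(task)

# `filter$scores` is a named numeric vector, sorted from best to worst
top10 = names(head(filter$scores, 10))

# restrict the task to the ten highest-scoring features (modifies the task in place)
task$select(top10)
task$feature_names
```

Because `task$select()` changes the task by reference, clone the task first (`task$clone()`) if the full feature set is still needed afterwards.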

Implemented Filters

Name	Label	Task Types	Feature Types	Package
anova	ANOVA F-Test	Classif	Integer, Numeric	stats
auc	Area Under the ROC Curve Score	Classif	Integer, Numeric	mlr3measures
carscore	Correlation-Adjusted coRrelation Score	Regr	Logical, Integer, Numeric	care
carsurvscore	Correlation-Adjusted coRrelation Survival Score	Surv	Integer, Numeric	carSurv, mlr3proba
cmim	Minimal Conditional Mutual Information Maximization	Classif & Regr	Integer, Numeric, Factor, Ordered	praznik
correlation	Correlation	Regr	Integer, Numeric	stats
disr	Double Input Symmetrical Relevance	Classif & Regr	Integer, Numeric, Factor, Ordered	praznik
find_correlation	Correlation-based Score	Universal	Integer, Numeric	stats
importance	Importance Score	Universal	Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct
information_gain	Information Gain	Classif & Regr	Integer, Numeric, Factor, Ordered	FSelectorRcpp
jmi	Joint Mutual Information	Classif & Regr	Integer, Numeric, Factor, Ordered	praznik
jmim	Minimal Joint Mutual Information Maximization	Classif & Regr	Integer, Numeric, Factor, Ordered	praznik
kruskal_test	Kruskal-Wallis Test	Classif	Integer, Numeric	stats
mim	Mutual Information Maximization	Classif & Regr	Integer, Numeric, Factor, Ordered	praznik
mrmr	Minimum Redundancy Maximal Relevancy	Classif & Regr	Integer, Numeric, Factor, Ordered	praznik
njmim	Minimal Normalised Joint Mutual Information Maximization	Classif & Regr	Integer, Numeric, Factor, Ordered	praznik
performance	Predictive Performance	Universal	Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct
permutation	Permutation Score	Universal	Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct
relief	RELIEF	Classif & Regr	Integer, Numeric, Factor, Ordered	FSelectorRcpp
selected_features	Embedded Feature Selection	Universal	Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct
univariate_cox	Univariate Cox Survival Score	Surv	Integer, Numeric, Logical	survival
variance	Variance	Universal	Integer, Numeric	stats
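
A comparable overview can be queried from within R, since the filters are registered in the `mlr_filters` dictionary. This is a small sketch; the exact columns of the resulting table may differ between package versions.

```r
library(mlr3)
library(mlr3filters)

# keys of all filters registered in the `mlr_filters` dictionary
filter_keys = mlr_filters$keys()
filter_keys

# one row per filter, including its metadata
as.data.table(mlr_filters)
```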

Variable Importance Filters

The following learners allow the extraction of variable importance and therefore are supported by FilterImportance:

## [1] "classif.featureless" "classif.ranger"      "classif.rpart"      
## [4] "classif.xgboost"     "regr.featureless"    "regr.ranger"        
## [7] "regr.rpart"          "regr.xgboost"

If your learner is not listed here but can extract variable importance from the fitted model, it has most likely not yet been integrated into the package mlr3learners or the extra learner extension. Please open an issue so we can add your package.
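
One way to produce such a listing for your own installation is to filter the learner dictionary on the "importance" property. This is a sketch that assumes only mlr3 is loaded, in which case fewer learners appear than in the list above; loading mlr3learners beforehand should add the ranger and xgboost learners. The variable names are illustrative.

```r
library(mlr3)

# table of all learners currently registered in the `mlr_learners` dictionary
learners = as.data.table(mlr_learners)

# keep the keys of learners whose properties include "importance"
importance_learners = learners[
  sapply(properties, function(p) "importance" %in% p), key]
importance_learners
```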

Some learners need to have their variable importance measure “activated” during learner creation. For example, to use the “impurity” measure of Random Forest via the {ranger} package:

task = tsk("iris")
# set `importance` at construction: assigning a new list to
# `param_set$values` afterwards would discard `seed`
learner = lrn("classif.ranger", seed = 42, importance = "impurity")

filter = flt("importance", learner = learner)
filter$calculate(task)
head(as.data.table(filter), 3)

##         feature     score
## 1: Petal.Length 44.682462
## 2:  Petal.Width 43.113031
## 3: Sepal.Length  9.039099

Performance Filter

FilterPerformance is a univariate filter method that calls resample() once per predictor variable, using that variable as the only feature, and ranks the variables by the resulting value of the supplied measure. Any learner can be passed to this filter, with classif.rpart being the default. Regression learners can be passed as well if the task is of type “regr”.
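
A minimal sketch of the performance filter on the iris task follows. The learner and the 3-fold cross-validation are illustrative choices; the measure is left unset here, on the assumption that the default measure for the task type is then used.

```r
library(mlr3)
library(mlr3filters)

task = tsk("iris")

# one resample() run per feature, each using that feature alone
filter = flt("performance",
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3))
filter$calculate(task)

scores = as.data.table(filter)
scores
```

Since every feature triggers its own resampling run, the runtime grows with the number of features, so a cheap learner and resampling scheme are a sensible starting point.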

Filter-based Feature Selection

In many cases filtering is only one step in the modeling pipeline. To select features based on filter values, one can use PipeOpFilter from mlr3pipelines.

library(mlr3pipelines)
task = tsk("spam")

# the `filter.frac` should be tuned
graph = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))

learner = as_learner(graph)
rr = resample(task, learner, rsmp("holdout"))
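
To illustrate why `filter.frac` deserves tuning, the sketch below compares a few hand-picked values on identical resampling splits. It is a manual comparison for illustration only, not a substitute for proper tuning (for example with mlr3tuning); the candidate values, the 3-fold cross-validation and the variable names are assumptions.

```r
library(mlr3)
library(mlr3filters)
library(mlr3pipelines)

task = tsk("spam")

# instantiate once so every candidate is evaluated on the same splits
resampling = rsmp("cv", folds = 3)
resampling$instantiate(task)

fracs = c(0.25, 0.5, 1)
errors = sapply(fracs, function(frac) {
  graph = po("filter", filter = flt("auc"), filter.frac = frac) %>>%
    po("learner", lrn("classif.rpart"))
  rr = resample(task, as_learner(graph), resampling)
  # aggregated classification error across the folds
  unname(rr$aggregate(msr("classif.ce")))
})

data.frame(filter.frac = fracs, classif.ce = errors)
```

Because the filter is part of the graph, it is recalculated on the training set of each fold, so the feature selection does not see the corresponding test data.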