step_tomek creates a specification of a recipe step that removes majority class instances of Tomek links, using unbalanced::ubTomek().
step_tomek(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  column = NULL,
  skip = TRUE,
  seed = sample.int(10^5, 1),
  id = rand_id("tomek")
)

# S3 method for step_tomek
tidy(x, ...)
Argument | Description |
---|---|
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose which variable is used to sample the data. See |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
column | A character string of the variable name that will be populated (eventually) by the |
skip | A logical. Should the step be skipped when the recipe is baked by |
seed | An integer that will be used as the seed when applied. |
id | A character string that is unique to this step to identify it. |
x | A |
An updated version of recipe with the new step added to the sequence of existing steps (if any). For the tidy method, a tibble with the column terms, which is the variable used to sample.
The factor variable used to balance around must only have 2 levels. All other variables must be numerics with no missing data.
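The requirements above can be checked before the step is added to a recipe. The following is a minimal base R sketch; `check_tomek_input()` is a hypothetical helper written for illustration and is not part of any package.

```r
# Hypothetical pre-check (not a package function): returns TRUE when
# `data` satisfies the requirements stated above for the class column.
check_tomek_input <- function(data, class_col) {
  cls <- data[[class_col]]
  others <- data[setdiff(names(data), class_col)]
  is.factor(cls) &&
    nlevels(cls) == 2 &&                            # exactly two classes
    all(vapply(others, is.numeric, logical(1))) &&  # all other columns numeric
    !any(vapply(others, anyNA, logical(1)))         # no missing values
}

ok  <- data.frame(class = factor(c("a", "b", "a")), x = c(1, 2, 3))
bad <- data.frame(class = factor(c("a", "b", "a")), x = c(1, NA, 3))
check_tomek_input(ok, "class")
check_tomek_input(bad, "class")
```

The second call fails the check because `x` contains a missing value; imputing first, as the examples do, would resolve this.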
A Tomek link is defined as a pair of points from different classes that are each other's nearest neighbors.
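To make the definition concrete, here is a minimal base R sketch that finds the majority class members of Tomek links in a small toy data set. It assumes Euclidean distance and a brute-force distance matrix; `find_tomek_majority()` is a hypothetical helper for illustration only and is not how unbalanced::ubTomek() is necessarily implemented.

```r
# Hypothetical helper (not from themis or unbalanced): returns the row
# indices of majority class points that take part in a Tomek link.
find_tomek_majority <- function(x, y) {
  stopifnot(is.matrix(x), is.numeric(x), is.factor(y), nlevels(y) == 2)
  d <- as.matrix(dist(x))                 # pairwise Euclidean distances
  diag(d) <- Inf                          # a point is not its own neighbor
  nn <- unname(apply(d, 1, which.min))    # nearest neighbor of each point
  idx <- seq_len(nrow(x))
  mutual <- nn[nn] == idx                 # i and nn[i] are mutual nearest neighbors
  cross <- y != y[nn]                     # the pair spans both classes
  majority <- names(which.max(table(y)))
  idx[mutual & cross & y == majority]
}

# Toy data: rows 3 ("a") and 4 ("b") sit next to each other
x <- rbind(c(0, 0), c(1, 0), c(5, 5), c(5.2, 5), c(9, 9), c(9, 9.5), c(0, 2))
y <- factor(c("a", "a", "a", "b", "b", "b", "a"))
find_tomek_majority(x, y)
#> [1] 3
```

Rows 3 and 4 form the only Tomek link, and row 3 belongs to the majority class "a", so it is the row such a step would remove.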
All columns in the data are sampled and returned by juice() and bake().
When used in modeling, users should strongly consider using the option skip = TRUE so that the extra sampling is not conducted outside of the training set.
Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern., 6:769-772, 1976.
#>
#>  <NA>  stem other
#>     0  9539 50316

ds_rec <- recipe(Class ~ age + height, data = okc) %>%
  step_meanimpute(all_predictors()) %>%
  step_tomek(Class) %>%
  prep()

sort(table(bake(ds_rec, new_data = NULL)$Class, useNA = "always"))
#>
#>  <NA>  stem other
#>     0  9539 49710

# since `skip` defaults to TRUE, baking the step has no effect
baked_okc <- bake(ds_rec, new_data = okc)
table(baked_okc$Class, useNA = "always")
#>
#>  stem other  <NA>
#>  9539 50316     0

library(ggplot2)

ggplot(circle_example, aes(x, y, color = class)) +
  geom_point() +
  labs(title = "Without Tomek") +
  xlim(c(1, 15)) +
  ylim(c(1, 15))

recipe(class ~ ., data = circle_example) %>%
  step_tomek(class) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(x, y, color = class)) +
  geom_point() +
  labs(title = "With Tomek") +
  xlim(c(1, 15)) +
  ylim(c(1, 15))