Introduction
Many of the sampling steps in themis accept an
over_ratio or under_ratio argument that
controls how much sampling is performed. This article explains what
these arguments do and how to set them.
Setup
To illustrate, we’ll use a small data set with a known class
imbalance. In this data set, class "a" has 100
observations, "b" has 65, and "c" has 20.
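The code that builds this data set is not shown above, so the following is a minimal sketch of one way to create it. It assumes a single numeric predictor `x` and the name `imbalanced_data`, both taken from the recipes used later in the article; the seed and the values of `x` are arbitrary, and only the class counts come from the text. The examples that follow also assume that the recipes, themis, and dplyr packages are attached.

```r
# Hypothetical construction of the example data; only the class counts
# (100, 65, 20) come from the article, everything else is illustrative.
set.seed(1234)

class_sizes <- c(a = 100, b = 65, c = 20)

imbalanced_data <- data.frame(
  class = factor(rep(names(class_sizes), times = class_sizes)),
  x = rnorm(sum(class_sizes))
)

# Confirm the class frequencies
table(imbalanced_data$class)
```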
over_ratio
The over_ratio argument is used by over-sampling steps. It
controls the ratio of the minority-to-majority frequencies after
sampling.
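The target number of observations for each class is calculated as:
$$\text{target} = \lfloor n_{\text{majority}} \times \text{over\_ratio} \rfloor$$
where $n_{\text{majority}}$ is the number of observations in the most common class.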
over_ratio = 1 (default): All classes are upsampled to the frequency of the majority class, resulting in a perfectly balanced data set.
over_ratio < 1: Minority classes are upsampled to a fraction of the majority class frequency, resulting in a partially balanced data set.
over_ratio > 1: All classes, including the majority, are upsampled to the same target, resulting in a balanced data set that is larger than the original.
If a class already has at least as many observations as the target, it is left unchanged.
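The rules above can be checked with a little arithmetic before running any recipe. The sketch below uses a hypothetical helper, `over_sampled_counts()`, which is not part of themis; it only mirrors the description in this section (a floored target based on the largest class, with larger classes left alone) and may not match the package internals in every detail.

```r
# Hypothetical helper, not part of themis: expected class counts after
# over-sampling, following the rules described in this section.
over_sampled_counts <- function(counts, over_ratio = 1) {
  target <- floor(max(counts) * over_ratio)
  # Classes already at or above the target keep their original count
  pmax(counts, target)
}

counts <- c(a = 100, b = 65, c = 20)

over_sampled_counts(counts, over_ratio = 1)    # a = 100, b = 100, c = 100
over_sampled_counts(counts, over_ratio = 0.5)  # a = 100, b = 65,  c = 50
over_sampled_counts(counts, over_ratio = 1.5)  # a = 150, b = 150, c = 150
```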
Examples
With the default over_ratio = 1, all classes are brought
up to 100 observations (the size of the majority class):
recipe(class ~ x, data = imbalanced_data) |>
step_upsample(class, over_ratio = 1) |>
prep() |>
bake(new_data = NULL) |>
count(class)
#> # A tibble: 3 × 2
#>   class     n
#>   <fct> <int>
#> 1 a       100
#> 2 b       100
#> 3 c       100
With over_ratio = 0.5, the target is
floor(100 * 0.5) = 50. Only class "c" is
upsampled to 50. Classes "a" and "b" already
exceed the target and are left unchanged:
recipe(class ~ x, data = imbalanced_data) |>
step_upsample(class, over_ratio = 0.5) |>
prep() |>
bake(new_data = NULL) |>
count(class)
#> # A tibble: 3 × 2
#>   class     n
#>   <fct> <int>
#> 1 a       100
#> 2 b        65
#> 3 c        50
With over_ratio = 0.3, the target is
floor(100 * 0.3) = 30. Class "c" is upsampled
from 20 to 30. Classes "a" and "b" both exceed
the target and are left unchanged:
recipe(class ~ x, data = imbalanced_data) |>
step_upsample(class, over_ratio = 0.3) |>
prep() |>
bake(new_data = NULL) |>
count(class)
#> # A tibble: 3 × 2
#>   class     n
#>   <fct> <int>
#> 1 a       100
#> 2 b        65
#> 3 c        30
under_ratio
The under_ratio argument is used by under-sampling steps.
It controls the ratio of the majority-to-minority frequencies after
sampling.
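The target number of observations for each class is calculated as:
$$\text{target} = \lfloor n_{\text{minority}} \times \text{under\_ratio} \rfloor$$
where $n_{\text{minority}}$ is the number of observations in the least common class.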
under_ratio = 1 (default): All classes are downsampled to the frequency of the minority class, resulting in a perfectly balanced data set.
under_ratio > 1: Majority classes are downsampled to a multiple of the minority class frequency, resulting in a partially balanced data set.
under_ratio < 1: All classes, including the minority, are downsampled to the same target, resulting in a balanced data set that is smaller than the original.
If a class already has at most as many observations as the target, it is left unchanged.
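As with over-sampling, the expected class counts can be worked out by hand. The sketch below uses a hypothetical helper, `under_sampled_counts()`, which is not part of themis; it only mirrors the description in this section (a floored target based on the smallest class, with smaller classes left alone) and may not match the package internals in every detail.

```r
# Hypothetical helper, not part of themis: expected class counts after
# under-sampling, following the rules described in this section.
under_sampled_counts <- function(counts, under_ratio = 1) {
  target <- floor(min(counts) * under_ratio)
  # Classes already at or below the target keep their original count
  pmin(counts, target)
}

counts <- c(a = 100, b = 65, c = 20)

under_sampled_counts(counts, under_ratio = 1)    # a = 20, b = 20, c = 20
under_sampled_counts(counts, under_ratio = 2)    # a = 40, b = 40, c = 20
under_sampled_counts(counts, under_ratio = 0.5)  # a = 10, b = 10, c = 10
```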
Examples
With the default under_ratio = 1, all classes are
brought down to 20 observations (the size of the minority class):
recipe(class ~ x, data = imbalanced_data) |>
step_downsample(class, under_ratio = 1) |>
prep() |>
bake(new_data = NULL) |>
count(class)
#> # A tibble: 3 × 2
#>   class     n
#>   <fct> <int>
#> 1 a        20
#> 2 b        20
#> 3 c        20
With under_ratio = 2, the target is
floor(20 * 2) = 40. Classes "a" and
"b" are both downsampled to 40. Class "c"
already has fewer than 40 observations and is left unchanged:
recipe(class ~ x, data = imbalanced_data) |>
step_downsample(class, under_ratio = 2) |>
prep() |>
bake(new_data = NULL) |>
count(class)
#> # A tibble: 3 × 2
#>   class     n
#>   <fct> <int>
#> 1 a        40
#> 2 b        40
#> 3 c        20
With under_ratio = 3, the target is
floor(20 * 3) = 60. Classes "a" and
"b" are both downsampled to 60. Class "c"
already has fewer than 60 observations and is left unchanged:
recipe(class ~ x, data = imbalanced_data) |>
step_downsample(class, under_ratio = 3) |>
prep() |>
bake(new_data = NULL) |>
count(class)
#> # A tibble: 3 × 2
#>   class     n
#>   <fct> <int>
#> 1 a        60
#> 2 b        60
#> 3 c        20
Choosing a ratio
Choosing the right ratio depends on your data and the model you are using. The default value of 1 gives a perfectly balanced data set, which is the most common choice. However, there are cases where a partial balance is preferable:
Preserving more majority class data: A perfectly balanced data set from under-sampling discards most of the majority class data. Using under_ratio > 1 retains more data at the cost of a less balanced class distribution.
Limiting over-sampling: With a very large majority class, upsampling to perfect balance can generate a very large amount of synthetic data. Using over_ratio < 1 limits the amount of synthetic data generated.
In practice, over_ratio and under_ratio are
often treated as tunable hyperparameters and selected by
cross-validation. See the dials package for
the over_ratio() and under_ratio() parameter
objects used in tuning.
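As a rough illustration of the dials parameter objects, the sketch below builds a small grid of candidate values for each ratio. The ranges and the number of levels are assumptions chosen for this example, not recommendations. In a full tuning workflow, the corresponding step argument would typically be marked with tune() and a grid like this passed to the tuning function.

```r
library(dials)

# Parameter objects for the two ratios; the ranges are illustrative only
over_param <- over_ratio(range = c(0.5, 1))
under_param <- under_ratio(range = c(1, 3))

# Three evenly spaced candidate values for each parameter
over_grid <- grid_regular(over_param, levels = 3)
under_grid <- grid_regular(under_param, levels = 3)

over_grid
under_grid
```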
