Tutorial

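Every encoder is used in the same way: instantiate it, fit it on the training data, and then transform the training and test data. The basic usage is shown below.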

import shirokumas as sk

# Choose one of the available encoders.
encoder = sk.AggregateEncoder(...)
encoder = sk.CountEncoder(...)
encoder = sk.NullEncoder(...)
encoder = sk.OneHotEncoder(...)
encoder = sk.MultiLabelBinarizer(...)
encoder = sk.OrdinalEncoder(...)
encoder = sk.TargetEncoder(...)

# Load the training and test data.
train_x, train_y, test_x = ...

# Fit on the training data, then transform both data sets with the fitted encoder.
encoder.fit(train_x, train_y)
encoded_train_x = encoder.transform(train_x)
encoded_test_x = encoder.transform(test_x)

The following sections explain how to use each encoder.

OrdinalEncoder

This implements an encoding method called Ordinal/Label Encoding. Specifically, it maps each class of a categorical variable to a unique integer.

Prepare a sample DataFrame. It contains a categorical column named “fruits”.

>>> import polars as pl
>>> train_df = pl.DataFrame({"fruits": ["apple", "banana", "cherry"]})

Instantiate the encoder.

>>> encoder = sk.OrdinalEncoder()

Then fit and transform the DataFrame.

>>> encoder.fit_transform(train_df)
shape: (3, 1)
┌────────┐
│ fruits │
│ ---    │
│ i64    │
╞════════╡
│ 1      │
│ 2      │
│ 3      │
└────────┘

Since the mapping is saved in the instance, test data can also be transformed.

>>> test_df = pl.DataFrame({"fruits": ["cherry", "banana", "apple"]})
>>> encoder.transform(test_df)
shape: (3, 1)
┌────────┐
│ fruits │
│ ---    │
│ i64    │
╞════════╡
│ 3      │
│ 2      │
│ 1      │
└────────┘

To supply explicit mappings, pass the mappings option.

>>> encoder = sk.OrdinalEncoder(mappings={
...     "fruits": {
...         "apple": 10,
...         "banana": 20,
...         "cherry": 30,
...     }
... })
>>> encoder.fit_transform(train_df)
shape: (3, 1)
┌────────┐
│ fruits │
│ ---    │
│ i64    │
╞════════╡
│ 10     │
│ 20     │
│ 30     │
└────────┘

By default, unknown values are replaced with -1 and missing values with -2.

>>> test_df = pl.DataFrame({"fruits": ["unseen", None, "apple"]})
>>> encoder.transform(test_df)
shape: (3, 1)
┌────────┐
│ fruits │
│ ---    │
│ i64    │
╞════════╡
│ -1     │
│ -2     │
│ 10     │
└────────┘
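The behaviour shown above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper names fit_ordinal and transform_ordinal are hypothetical, and numbering categories from 1 in order of first appearance is an assumption that happens to match the sample output.

```python
def fit_ordinal(values):
    """Assign 1, 2, 3, ... to categories in order of first appearance (assumed order)."""
    mapping = {}
    for value in values:
        if value is not None and value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def transform_ordinal(values, mapping, unknown=-1, missing=-2):
    """Look up each value; None counts as missing, unseen values as unknown."""
    return [
        missing if value is None else mapping.get(value, unknown)
        for value in values
    ]


# Mapping learned from the data.
mapping = fit_ordinal(["apple", "banana", "cherry"])
print(mapping)  # {'apple': 1, 'banana': 2, 'cherry': 3}

# Explicit mapping, as in the mappings example above.
explicit = {"apple": 10, "banana": 20, "cherry": 30}
encoded = transform_ordinal(["unseen", None, "apple"], explicit)
print(encoded)  # [-1, -2, 10]
```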

To transform only specific columns, pass the cols option.

>>> train_df = pl.DataFrame({
...     "fruits": ["apple", "banana", "cherry"],
...     "vegetables": ["avocados", "broccoli", "carrots"],
... })
>>> encoder = sk.OrdinalEncoder(cols=["vegetables"])
>>> encoder.fit_transform(train_df)
shape: (3, 1)
┌────────────┐
│ vegetables │
│ ---        │
│ i64        │
╞════════════╡
│ 1          │
│ 2          │
│ 3          │
└────────────┘

OneHotEncoder

This implements an encoding method called One-hot Encoding. It creates one column per unique value, and each column holds a true/false indicator of whether the row belongs to that class.

>>> train_df = pl.DataFrame({"fruits": ["apple", "banana", "cherry"]})
>>> encoder = sk.OneHotEncoder()
>>> encoder.fit_transform(train_df)
shape: (3, 3)
┌──────────────┬───────────────┬───────────────┐
│ fruits_apple ┆ fruits_banana ┆ fruits_cherry │
│ ---          ┆ ---           ┆ ---           │
│ bool         ┆ bool          ┆ bool          │
╞══════════════╪═══════════════╪═══════════════╡
│ true         ┆ false         ┆ false         │
│ false        ┆ true          ┆ false         │
│ false        ┆ false         ┆ true          │
└──────────────┴───────────────┴───────────────┘
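As a rough plain-Python sketch of the same idea (not the library's code), one-hot encoding compares every value against each known category. The function names are hypothetical, and the column order (first appearance) is an assumption.

```python
def fit_categories(values):
    """Collect the distinct categories in order of first appearance."""
    categories = []
    for value in values:
        if value not in categories:
            categories.append(value)
    return categories


def transform_one_hot(values, categories, prefix):
    """Build one boolean column per category, named '<prefix>_<category>'."""
    return {
        f"{prefix}_{category}": [value == category for value in values]
        for category in categories
    }


fruits = ["apple", "banana", "cherry"]
one_hot = transform_one_hot(fruits, fit_categories(fruits), prefix="fruits")
for name, column in one_hot.items():
    print(name, column)
# fruits_apple [True, False, False]
# fruits_banana [False, True, False]
# fruits_cherry [False, False, True]
```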

MultiLabelBinarizer

This implements an encoding method called Multi-hot Encoding. This encoding method is designed for scenarios where each instance can belong to multiple classes. In this method, multiple columns can have a true value for a single instance.

>>> train_df = pl.DataFrame({"fruits": [
...     ["apple"],
...     ["banana"],
...     ["apple", "banana"],
... ]})
>>> encoder = sk.MultiLabelBinarizer()
>>> encoder.fit_transform(train_df)
shape: (3, 2)
┌──────────────┬───────────────┐
│ fruits_apple ┆ fruits_banana │
│ ---          ┆ ---           │
│ bool         ┆ bool          │
╞══════════════╪═══════════════╡
│ true         ┆ false         │
│ false        ┆ true          │
│ true         ┆ true          │
└──────────────┴───────────────┘
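The difference from one-hot encoding is that each row holds a list of labels, so the check becomes a membership test. A minimal sketch in plain Python, with hypothetical helper names and an assumed column order of first appearance:

```python
def fit_labels(rows):
    """Collect the distinct labels across all rows, in order of first appearance."""
    labels = []
    for row in rows:
        for label in row:
            if label not in labels:
                labels.append(label)
    return labels


def transform_multi_hot(rows, labels, prefix):
    """One boolean column per label: True when the row's list contains it."""
    return {
        f"{prefix}_{label}": [label in row for row in rows]
        for label in labels
    }


rows = [["apple"], ["banana"], ["apple", "banana"]]
multi_hot = transform_multi_hot(rows, fit_labels(rows), prefix="fruits")
for name, column in multi_hot.items():
    print(name, column)
# fruits_apple [True, False, True]
# fruits_banana [False, True, True]
```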

CountEncoder

This implements an encoding method called Count/Frequency Encoding. It replaces each class of a categorical variable with the number of times that class occurs in the training data.

>>> train_df = pl.DataFrame({"fruits": [
...     "apple",
...     "apple",
...     "banana",
...     "banana",
...     "banana",
...     "cherry",
... ]})
>>> encoder = sk.CountEncoder()
>>> encoder.fit_transform(train_df)
shape: (6, 1)
┌────────┐
│ fruits │
│ ---    │
│ i64    │
╞════════╡
│ 2      │
│ 2      │
│ 3      │
│ 3      │
│ 3      │
│ 1      │
└────────┘
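The counting step can be illustrated with the standard library alone. This sketch only reproduces the arithmetic of the example above and makes no claim about how the library computes it.

```python
from collections import Counter

fruits = ["apple", "apple", "banana", "banana", "banana", "cherry"]

# Count how often each category occurs in the training data.
counts = Counter(fruits)

# Replace every value with the count of its category.
encoded = [counts[fruit] for fruit in fruits]
print(encoded)  # [2, 2, 3, 3, 3, 1]
```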

NullEncoder

This implements an encoding method that creates a boolean feature indicating whether each value in a column is missing.

>>> train_df = pl.DataFrame({"fruits": ["apple", None, "cherry"]})
>>> encoder = sk.NullEncoder()
>>> encoder.fit_transform(train_df)
shape: (3, 1)
┌────────┐
│ fruits │
│ ---    │
│ bool   │
╞════════╡
│ false  │
│ true   │
│ false  │
└────────┘

TargetEncoder

This implements an encoding method called Target/Likelihood (Mean) Encoding. It is a supervised method that uses the objective (target) variable for feature extraction.

Therefore, you need to prepare the objective variable in addition to the explanatory variables.

>>> train_x = pl.DataFrame(
...     {
...         "fruits": ["apple", "banana", "banana", "apple"],
...     }
... )
>>> train_y = pl.Series(
...     name="target",
...     values=[1, 0, 1, 1],
... )

In addition, out-of-fold (OOF) feature extraction is needed to reduce data leakage during cross-validation. Use a subclass of scikit-learn’s BaseCrossValidator, such as KFold, to define how the data is split.

>>> from sklearn.model_selection import KFold
>>> folds = KFold(n_splits=4, shuffle=False)

In the following example, the four training rows are split into four folds. For each fold in turn, the other three rows are used to compute the target statistics, and the held-out row is replaced with the computed value.

>>> encoder = sk.TargetEncoder(folds=folds)
>>> encoder.fit(train_x, train_y)
TargetEncoder(folds=KFold(n_splits=4, random_state=None, shuffle=False))
>>> encoder.transform(train_x)
shape: (4, 1)
┌────────┐
│ fruits │
│ ---    │
│ f64    │
╞════════╡
│ 1.0    │
│ 1.0    │
│ 0.0    │
│ 1.0    │
└────────┘
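To make the out-of-fold procedure concrete, here is a plain-Python sketch that reproduces the four values above. It is an illustration under stated assumptions, not the library's implementation: kfold_indices is a hypothetical stand-in for an unshuffled KFold, and falling back to the fold's mean for categories absent from the training part is an assumption.

```python
def kfold_indices(n_rows, n_splits):
    """Yield (train_idx, valid_idx) for contiguous, unshuffled folds."""
    sizes = [
        n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in sizes:
        valid_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, valid_idx
        start += size


def category_means(xs, ys):
    """Mean of the target for each category."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    return {x: sums[x] / counts[x] for x in sums}


def oof_target_encode(xs, ys, n_splits):
    """Encode each row using statistics computed without that row's fold."""
    encoded = [None] * len(xs)
    for train_idx, valid_idx in kfold_indices(len(xs), n_splits):
        fold_y = [ys[i] for i in train_idx]
        means = category_means([xs[i] for i in train_idx], fold_y)
        fallback = sum(fold_y) / len(fold_y)  # assumed fallback for unseen categories
        for i in valid_idx:
            encoded[i] = means.get(xs[i], fallback)
    return encoded


fruits = ["apple", "banana", "banana", "apple"]
target = [1, 0, 1, 1]
oof = oof_target_encode(fruits, target, n_splits=4)
print(oof)  # [1.0, 1.0, 0.0, 1.0]
```

For example, the third row ("banana") is encoded from the other three rows, where the only remaining "banana" has target 0, hence 0.0.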

By default, unknown values are replaced with the global mean of the target. In the following, cherry is replaced with 3 / 4 = 0.75 because it does not appear in the training data.

>>> test_x = pl.DataFrame(
...     {
...         "fruits": ["apple", "banana", "cherry"],
...     }
... )
>>> encoder.transform(test_x)
shape: (3, 1)
┌────────┐
│ fruits │
│ ---    │
│ f64    │
╞════════╡
│ 1.0    │
│ 0.5    │
│ 0.75   │
└────────┘
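The test-time values above can be checked by hand. The sketch below computes per-category means over the whole training set and falls back to the global mean for unseen categories; it mirrors the numbers in the example and is not taken from the library's source.

```python
train_fruits = ["apple", "banana", "banana", "apple"]
train_target = [1, 0, 1, 1]

# Per-category sums and counts over the full training data.
sums, counts = {}, {}
for fruit, y in zip(train_fruits, train_target):
    sums[fruit] = sums.get(fruit, 0.0) + y
    counts[fruit] = counts.get(fruit, 0) + 1
means = {fruit: sums[fruit] / counts[fruit] for fruit in sums}

# Global mean of the target, used for categories never seen in training.
global_mean = sum(train_target) / len(train_target)

test_fruits = ["apple", "banana", "cherry"]
encoded = [means.get(fruit, global_mean) for fruit in test_fruits]
print(encoded)  # [1.0, 0.5, 0.75]
```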

Smoothing is available to reduce overfitting. Shirokumas implements two types of smoothing: Empirical Bayesian and M-probability Estimate [1].

To use Empirical Bayesian smoothing, set the smoothing_method argument to "eb". The parameters "k" and "f" control the smoothing; if they are not specified, the defaults are 20 and 10, respectively. k and f are equivalent to min_samples_leaf and smoothing in TargetEncoder of category_encoders [2].

>>> encoder = sk.TargetEncoder(
...     folds=folds,
...     smoothing_method="eb",
...     smoothing_params={
...         "k": 1,
...         "f": 1,
...     },
... )
>>> encoder.fit_transform(train_x, train_y)
shape: (4, 1)
┌──────────┐
│ fruits   │
│ ---      │
│ f64      │
╞══════════╡
│ 0.833333 │
│ 1.0      │
│ 0.333333 │
│ 0.833333 │
└──────────┘
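Since k and f are described as equivalent to min_samples_leaf and smoothing in category_encoders, the values above can be reproduced with that library's sigmoid-weighted blend of the category mean and the prior. The sketch below assumes Shirokumas uses the same formula; the function name eb_smooth is hypothetical.

```python
import math


def eb_smooth(category_mean, n, prior, k, f):
    """Blend the category mean with the prior; the weight grows with the count n."""
    weight = 1.0 / (1.0 + math.exp(-(n - k) / f))
    return weight * category_mean + (1.0 - weight) * prior


# Row 0 ("apple") held out: the fold sees banana=0, banana=1, apple=1.
row0 = eb_smooth(category_mean=1.0, n=1, prior=2 / 3, k=1, f=1)

# Row 1 ("banana") held out: the fold sees apple=1, banana=1, apple=1.
row1 = eb_smooth(category_mean=1.0, n=1, prior=1.0, k=1, f=1)

# Row 2 ("banana") held out: the fold sees apple=1, banana=0, apple=1.
row2 = eb_smooth(category_mean=0.0, n=1, prior=2 / 3, k=1, f=1)

print(round(row0, 6), round(row1, 6), round(row2, 6))  # 0.833333 1.0 0.333333
```

With n equal to k the weight is exactly 0.5, so each encoded value is the midpoint between the category mean and the prior of its fold.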

To use M-probability Estimate smoothing, set the smoothing_method argument to "m-estimate". The parameter "m" controls the smoothing; if it is not specified, the default is 1.0. m is equivalent to m in MEstimateEncoder of category_encoders [3].

>>> encoder = sk.TargetEncoder(
...     folds=folds,
...     smoothing_method="m-estimate",
...     smoothing_params={
...         "m": 1.0,
...     },
... )
>>> encoder.fit_transform(train_x, train_y)
shape: (4, 1)
┌──────────┐
│ fruits   │
│ ---      │
│ f64      │
╞══════════╡
│ 0.833333 │
│ 1.0      │
│ 0.333333 │
│ 0.833333 │
└──────────┘
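The m-estimate adds m pseudo-observations at the prior to each category. The sketch below uses the formula from MEstimateEncoder in category_encoders, on the assumption that Shirokumas matches it; the function name m_estimate is hypothetical.

```python
def m_estimate(target_sum, n, prior, m):
    """(sum of targets + prior * m) / (count + m)."""
    return (target_sum + prior * m) / (n + m)


# Row 0 ("apple") held out: one remaining apple with target 1, prior 2/3.
row0 = m_estimate(target_sum=1, n=1, prior=2 / 3, m=1.0)

# Row 1 ("banana") held out: one remaining banana with target 1, prior 1.
row1 = m_estimate(target_sum=1, n=1, prior=1.0, m=1.0)

# Row 2 ("banana") held out: one remaining banana with target 0, prior 2/3.
row2 = m_estimate(target_sum=0, n=1, prior=2 / 3, m=1.0)

print(round(row0, 6), round(row1, 6), round(row2, 6))  # 0.833333 1.0 0.333333
```

The results coincide with the Empirical Bayesian example only because each category has a single remaining row and k = f = m = 1; in general the two methods give different values.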

AggregateEncoder

This implements an encoding method that creates aggregate features. It resembles Target Encoding, but it is an unsupervised feature extraction method.

>>> train_df = pl.DataFrame(
...     {
...         "fruits": ["apple", "apple", "banana", "banana", "cherry"],
...         "price": [100, 200, 300, 400, 500],
...     }
... )

The encoder requires the cols and agg_exprs arguments. cols specifies the columns to group by. agg_exprs specifies the expressions to aggregate; it is a dictionary whose keys are used as the suffixes of the output column names.

>>> encoder = sk.AggregateEncoder(
...     cols=[
...         "fruits",
...     ],
...     agg_exprs={
...         "mean": pl.col("price").mean(),
...         "max": pl.col("price").max(),
...     },
... )

The mean and maximum values for each category are encoded.

>>> encoder.fit_transform(train_df)
shape: (5, 2)
┌─────────────┬────────────┐
│ fruits_mean ┆ fruits_max │
│ ---         ┆ ---        │
│ f64         ┆ i64        │
╞═════════════╪════════════╡
│ 150.0       ┆ 200        │
│ 150.0       ┆ 200        │
│ 350.0       ┆ 400        │
│ 350.0       ┆ 400        │
│ 500.0       ┆ 500        │
└─────────────┴────────────┘
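Conceptually, this is a group-by followed by broadcasting each group's statistic back to its rows. A plain-Python sketch of that idea, which only reproduces the numbers above and is not the library's implementation:

```python
from collections import defaultdict

fruits = ["apple", "apple", "banana", "banana", "cherry"]
prices = [100, 200, 300, 400, 500]

# Group the prices by category.
groups = defaultdict(list)
for fruit, price in zip(fruits, prices):
    groups[fruit].append(price)

# Broadcast each group's statistic back to every row of that group.
fruits_mean = [sum(groups[fruit]) / len(groups[fruit]) for fruit in fruits]
fruits_max = [max(groups[fruit]) for fruit in fruits]

print(fruits_mean)  # [150.0, 150.0, 350.0, 350.0, 500.0]
print(fruits_max)   # [200, 200, 400, 400, 500]
```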