class clearbox_preprocessor.preprocessor.Preprocessor(data: LazyFrame | DataFrame | DataFrame, cat_labels_threshold: float = 0.02, get_discarded_info: bool = False, excluded_col: List = [], time: str | None = None, missing_values_threshold: float = 0.999, n_bins: int = 0, scaling: Literal['none', 'normalize', 'standardize', 'quantile'] = 'none', num_fill_null: Literal['interpolate', 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'] = 'mean', unseen_labels='ignore', target_columns=None)[source]

Bases: object

A class for preprocessing datasets based on polars, including feature selection, handling missing values, scaling, and time-series feature extraction.

Parameters:
  • data (pl.LazyFrame or pl.DataFrame or pd.DataFrame) – The dataset to be processed. It can be a Polars LazyFrame, Polars DataFrame, or Pandas DataFrame.

  • cat_labels_threshold (float, optional, default=0.02) –

    A float between 0 and 1 that sets the minimum relative frequency a label must reach to be kept as a separate category. If a label accounts for less than cat_labels_threshold * 100% of the values in a categorical column, it is grouped into a generic "other" category.

    For instance, if cat_labels_threshold=0.02 and a label accounts for less than 2% of the values in its column, that label is converted to "other".

  • get_discarded_info (bool, optional, default=False) – If set to True, the preprocessor exposes the method get_discarded_features_reason, which reports which columns were discarded and why. Enabling this option may significantly slow down processing. The list of discarded columns is available even when get_discarded_info=False, so set this flag to True only if you need to know why a column was discarded or, for columns containing a single unique value, what that value was.

  • excluded_col (List, optional, default=[]) – A list of column names to be excluded from processing. These columns will be returned in the final DataFrame without being modified.

  • time (str, optional, default=None) – The name of the time column used to sort the DataFrame when the data is a time series.

  • scaling (str, default="none") –

    The method used to scale numerical features:

    • "none" : No scaling is applied.

    • "normalize" : Normalizes numerical features to the [0, 1] range.

    • "standardize" : Standardizes numerical features to have a mean of 0 and a standard deviation of 1.

    • "quantile" : Transforms numerical features using quantile information.

    • "kbins" : Converts continuous numerical data into discrete bins. The number of bins is set by the n_bins parameter.

  • num_fill_null (FillNullStrategy or str, default="mean") –

    Strategy or value used to fill null values in numerical features:

    • "mean" : Fills null values with the mean of the column.

    • "interpolate" : Fills null values using interpolation.

    • "forward" : Fills null values using the previous non-null value.

    • "backward" : Fills null values using the next non-null value.

    • "min" : Fills null values with the minimum value of the column.

    • "max" : Fills null values with the maximum value of the column.

    • "zero" : Fills null values with zeros.

    • "one" : Fills null values with ones.

    • value : Fills null values with the specified value.

  • n_bins (int, default=0) – Number of bins used to discretize numerical features. If n_bins is greater than 0 and scaling == "kbins", numerical features are discretized into that many bins using quantile-based binning.

  • unseen_labels (str, default="ignore") –

    • "ignore" : If new data contains labels that were not seen during fit, their one-hot encoding contains 0 in every column.

    • "error" : Raises an error if new data contains labels that were not seen during fit.

  • target_columns (str, default=None)
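
The numerical options described above (scaling, num_fill_null, n_bins) can be illustrated with a minimal pure-Python sketch. It is not the library's implementation, which is based on Polars. The helper names normalize, standardize, fill_null and quantile_bins are hypothetical, and the use of the population standard deviation and of inclusive quantile cut points are assumptions made only for illustration. The "interpolate" strategy is left out for brevity.

```python
import bisect
import statistics


def normalize(xs):
    """Rescale values to the [0, 1] range (scaling="normalize")."""
    lo, hi = min(xs), max(xs)
    span = hi - lo
    # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
    return [(x - lo) / span if span else 0.0 for x in xs]


def standardize(xs):
    """Shift to mean 0 and scale to standard deviation 1 (scaling="standardize")."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # assumption: population standard deviation
    return [(x - mean) / std if std else 0.0 for x in xs]


def fill_null(xs, strategy="mean"):
    """Fill None entries following a subset of the num_fill_null strategies."""
    present = [x for x in xs if x is not None]
    constant = {
        "mean": statistics.fmean,
        "min": min,
        "max": max,
        "zero": lambda _: 0.0,
        "one": lambda _: 1.0,
    }
    if strategy in constant:
        fill = constant[strategy](present)
        return [fill if x is None else x for x in xs]
    if strategy == "forward":
        out, last = [], None
        for x in xs:
            last = x if x is not None else last
            out.append(last)  # leading nulls stay None: nothing precedes them
        return out
    if strategy == "backward":
        # Backward fill is forward fill applied to the reversed sequence.
        return fill_null(xs[::-1], "forward")[::-1]
    raise ValueError(f"unsupported strategy: {strategy!r}")


def quantile_bins(xs, n_bins):
    """Assign each value a bin index in [0, n_bins) using quantile cut points."""
    edges = statistics.quantiles(xs, n=n_bins, method="inclusive")
    return [bisect.bisect_right(edges, x) for x in xs]


column = [1.0, None, 3.0, None]
print(fill_null(column, "mean"))
print(fill_null(column, "forward"))
print(normalize([2.0, 4.0, 6.0, 8.0]))
print(quantile_bins([1, 2, 3, 4, 5, 6, 7, 8], 4))
```

Note that forward and backward filling cannot fill a null that has no neighbour on the relevant side, so such entries remain None in this sketch.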

numerical_features

Names of the numerical features in the dataset.

Type:

Tuple[str]

categorical_features

Names of the categorical features in the dataset.

Type:

Tuple[str]

temporal_features

Names of the temporal features in the dataset.

Type:

Tuple[str]

discarded_features

Features that were discarded during preprocessing, along with the reason they were discarded, if available.

Type:

Union[List[str], Dict[str, str]]

single_value_columns

Dictionary storing columns with only one unique value, along with the unique value.

Type:

Dict[str, str]

Raises:

ValueError – If cat_labels_threshold is not between 0 and 1.
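
How cat_labels_threshold and unseen_labels shape categorical columns can be sketched in plain Python. The helpers group_rare_labels and one_hot are hypothetical names, not part of the library, and the range check only mirrors the ValueError documented above (whether the bounds are inclusive is an assumption). The actual implementation may differ.

```python
from collections import Counter


def group_rare_labels(values, cat_labels_threshold=0.02, other="other"):
    """Replace labels whose relative frequency is below the threshold."""
    # Assumption: both endpoints 0 and 1 are accepted.
    if not 0 <= cat_labels_threshold <= 1:
        raise ValueError("cat_labels_threshold must be between 0 and 1")
    counts = Counter(values)
    total = len(values)
    rare = {label for label, n in counts.items() if n / total < cat_labels_threshold}
    return [other if v in rare else v for v in values]


def one_hot(value, categories, unseen_labels="ignore"):
    """One-hot encode a label against the categories seen during fit."""
    if value not in categories:
        if unseen_labels == "error":
            raise ValueError(f"label {value!r} was not seen during fit")
        return [0] * len(categories)  # "ignore": every column is 0
    return [int(value == c) for c in categories]


colors = ["red"] * 60 + ["blue"] * 39 + ["teal"]
grouped = group_rare_labels(colors)  # "teal" is 1% of the column, below 2%
categories = sorted(set(grouped))    # ['blue', 'other', 'red']

print(Counter(grouped))
print(one_hot("blue", categories))
print(one_hot("green", categories))  # unseen label, all zeros
```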

Notes

The constructor transforms Pandas DataFrames into Polars LazyFrames for more efficient processing.

extract_ts_features(data: LazyFrame | DataFrame, y: Series | Series | None = None, time: str | None = None, column_id: str | None = None) DataFrame[source]

Extract relevant time-series features from the provided data.

Parameters:
  • data (pl.LazyFrame or pd.DataFrame) – The input dataset containing the time-series data. It can be a Polars LazyFrame or a Pandas DataFrame.

  • y (pl.Series or pd.Series) – The label series associated with the data. It can be a Polars Series or a Pandas Series.

  • time (str, optional) – The name of the time column used to sort the data. If not provided, the method will try to use self.time if available.

  • column_id (str, optional) – The name of the ID column, if present in the data. This is used to distinguish different time-series within the same dataset.

Returns:

A DataFrame containing the extracted and filtered relevant time-series features.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the provided data is not a Polars LazyFrame, a Polars DataFrame, or a Pandas DataFrame.

  • ValueError – If the provided label series is not a Polars Series or a Pandas Series.

  • ValueError – If the time column name is not provided and self.time is not available.

Notes

  • The function uses extract_relevant_features from the tsfresh library to extract features from the time-series data.

  • The method stores the filtered features in self.features_filtered for further use.

get_categorical_features() Tuple[str][source]

Return the list of categorical features.

get_features_sizes() Tuple[List[int], List[int]][source]

Gets the sizes of ordinal and categorical features after transformation.

Returns:

Tuple: Sizes of ordinal and categorical features.

get_numerical_features() Tuple[str][source]

Return the list of numerical features.

inverse_transform(data: LazyFrame | DataFrame | DataFrame) DataFrame[source]

Reverse the transformations applied during the preprocessor.transform(data) phase.

This method performs the inverse transformations on numerical and categorical features to restore the original dataset format.

Parameters:

data (pl.LazyFrame or pl.DataFrame or pd.DataFrame) – The input dataset, as a Polars LazyFrame, Polars DataFrame, or Pandas DataFrame. The format must match the dataset type provided when the Preprocessor was initialized.

Returns:

A Polars DataFrame with all transformations reversed, including:

  • Restored numerical features (inverse normalization, standardization, or quantile transformation).

  • Reconstructed categorical features from one-hot encoding.

Return type:

pl.DataFrame

Raises:

SystemExit – If the provided data type does not match the originally initialized dataset type.

Notes

  • If data_was_pd is True, the method expects and processes a Pandas DataFrame.

  • If data_was_pd is False, it expects and processes a Polars DataFrame or LazyFrame.

  • The numerical features are reversed based on the stored transformation method (normalize, standardize, quantile).

  • One-hot encoded categorical columns are reconstructed into their original categorical format.
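
The two reversals listed in these notes can be sketched as follows. This is an illustration in plain Python rather than the library's code: the helper names, and the idea that the fitted mean, standard deviation and category order are stored for later use, are assumptions.

```python
import math
import statistics


def standardize(xs):
    """Return standardized values together with the fitted parameters."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # assumption: population standard deviation
    return [(x - mean) / std for x in xs], mean, std


def inverse_standardize(zs, mean, std):
    """Undo standardization using the parameters stored at fit time."""
    return [z * std + mean for z in zs]


def decode_one_hot(row, categories):
    """Map a one-hot row back to its label; an all-zero row has no label."""
    return categories[row.index(1)] if 1 in row else None


values = [2.0, 4.0, 6.0, 8.0]
scaled, mean, std = standardize(values)
restored = inverse_standardize(scaled, mean, std)
print(restored)  # equal to `values` up to floating-point rounding

categories = ["blue", "other", "red"]
print(decode_one_hot([0, 0, 1], categories))
print(decode_one_hot([0, 0, 0], categories))
```

An all-zero row, such as the one produced for an unseen label under unseen_labels="ignore", carries no information about the original label, so this sketch returns None for it.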

Example:

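from clearbox_preprocessor.preprocessor import Preprocessor

# real_data is the dataset to fit on (Polars LazyFrame/DataFrame or Pandas DataFrame)
preprocessor = Preprocessor(real_data, scaling="standardize")
transformed_data = preprocessor.transform(real_data)

# Reverse the transformations
original_data = preprocessor.inverse_transform(transformed_data)

transform(data: LazyFrame | DataFrame | DataFrame) DataFrame | DataFrame[source]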

Transform the input dataset by processing numerical, temporal, and categorical columns. This includes filling null values, scaling or discretizing numerical features, and encoding categorical features.

Parameters:

data (pl.LazyFrame or pl.DataFrame or pd.DataFrame) – The input dataset to be transformed. It can be a Polars LazyFrame, Polars DataFrame, or a Pandas DataFrame.

Returns:

The transformed dataset, returned as a Polars DataFrame or a Pandas DataFrame, depending on the input data type.

Return type:

pl.DataFrame or pd.DataFrame

Raises:

SystemExit – If the input data type does not match the data type used when the Preprocessor was initialized.

Notes

  • The method identifies and processes numerical, temporal, and categorical features separately.

  • Categorical features are filled with the most frequent value and then one-hot encoded.

  • Numerical features can be normalized, standardized, or discretized based on the specified parameters.

  • Temporal features are filled using interpolation and reordered to the beginning of the dataset.
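
Two of the filling rules in these notes, most-frequent-value filling for categorical columns and interpolation for gaps, can be sketched as below. Linear interpolation over evenly spaced positions is an assumption, since the notes only say "interpolation". Temporal values are represented as plain numbers, and the helper names are hypothetical.

```python
from collections import Counter


def fill_with_mode(values):
    """Fill None entries with the most frequent non-null value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def interpolate(xs):
    """Linearly interpolate None entries that lie between two known values."""
    out = list(xs)
    known = [i for i, x in enumerate(xs) if x is not None]
    for a, b in zip(known, known[1:]):
        step = (xs[b] - xs[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = xs[a] + step * (i - a)
    return out  # leading and trailing gaps are left as None


print(fill_with_mode(["a", None, "b", "a"]))
print(interpolate([0.0, None, None, 3.0, None]))
```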

Example:

from clearbox_preprocessor.preprocessor import Preprocessor

preprocessor = Preprocessor(real_data, scaling="standardize")
transformed_data = preprocessor.transform(real_data)