How to Use

This guide provides detailed examples of how to use the MultivariateImputer and TimeSeriesImputer. DataFiller targets a pragmatic middle ground for imputation: it does not aim to match the absolute performance of large deep learning models on complex masking patterns, but it is simple to fit, easy to adapt, and flexible to integrate into existing workflows. It is also significantly faster than scikit-learn’s IterativeImputer, which makes it well suited to fast iteration and production use.

Multivariate Imputer

The MultivariateImputer is the core of the library. It imputes missing values in a 2D NumPy array, a pandas DataFrame, or an eager Polars DataFrame. It automatically handles mixed numerical, boolean, and categorical/string columns: non-numerical features are one-hot encoded internally so they can help impute other columns, and the output is returned in the original schema.

Polars DataFrames

Install the optional integration with pip install "datafiller[polars]". Polars inputs return Polars outputs, preserve column order and supported dtypes, and treat both null and floating-point NaN as missing. rows_to_impute contains integer row positions and cols_to_impute contains column names. Numeric, Boolean, String, Categorical, and Enum columns are supported. LazyFrames must be collected before imputation.

import polars as pl
from datafiller import MultivariateImputer

df = pl.DataFrame({
    "value": [1.0, None, 3.0, 4.0],
    "group": ["a", "a", None, "b"],
})
df_imputed = MultivariateImputer()(df)

Titanic Mixed-Feature Example

MultivariateImputer handles categorical, string, and boolean columns by one-hot encoding them internally. The Titanic dataset provides a compact mixed-type example: categorical columns (such as sex or embarked) serve as predictors for other features, while their own missing values are imputed with a classifier.

from datafiller.datasets import load_titanic
from datafiller import MultivariateImputer, ExtremeLearningMachine

df = load_titanic()
df.head(15)

imputer = MultivariateImputer(regressor=ExtremeLearningMachine())
df_imputed = imputer(df)
df_imputed.head(15)

Original Titanic (titanic.csv)

Imputed Titanic (titanic_imputed.csv)

Parameters

The main initialization parameters are regressor and classifier, which set the models for numeric and categorical columns, plus scoring, rng, min_samples_train, fallback, and verbose, which control feature selection, reproducibility, training thresholds, last-resort filling, and logging, respectively. Call parameters include rows_to_impute and cols_to_impute to target subsets and n_nearest_features to limit the features used per imputation; setting n_nearest_features is recommended to reduce computation time. For a complete list and full descriptions, see the API Reference.

Training-data selection follows a three-step path for each missingness pattern: complete rows, then optimask, then fallback. First, rows fully observed on the pattern’s features are used when at least min_samples_train (default 20) of them exist. Otherwise, optimask finds the largest NaN-free rectangle, preferring ones that keep at least min_samples_train rows. Finally, the rare cells that still cannot get a model are filled with the column mean (or the most frequent category for categoricals) under the default fallback="simple"; pass fallback=None to leave them as NaN. See Algorithm for details.
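As a rough illustration of the first and last steps of that path, the sketch below checks whether enough complete rows exist for a pattern and otherwise computes the column-mean fill. It is not DataFiller's implementation: the helper names select_training_rows and simple_fallback are hypothetical, the sketch assumes training rows must also have the target observed, and the intermediate optimask step is omitted entirely.

```python
import numpy as np


def select_training_rows(X, feature_cols, target_col, min_samples_train=20):
    """Return indices of rows observed on the features and the target.

    Returns None when fewer than min_samples_train such rows exist
    (hypothetical helper, not part of DataFiller).
    """
    cols = list(feature_cols) + [target_col]
    complete = ~np.isnan(X[:, cols]).any(axis=1)
    rows = np.flatnonzero(complete)
    return rows if rows.size >= min_samples_train else None


def simple_fallback(X, target_col):
    """Column mean over observed entries, used as a last-resort numeric fill."""
    return np.nanmean(X[:, target_col])


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:10, 0] = np.nan   # column 0 is missing on the first 10 rows
X[5:45, 2] = np.nan  # column 2 is observed on only 10 rows

# Pattern A: impute column 0 from column 1. Forty complete rows, enough to train.
rows_dense = select_training_rows(X, feature_cols=[1], target_col=0)

# Pattern B: impute column 0 from columns 1 and 2. Only five complete rows,
# so this sketch gives up and falls straight through to the simple fill.
rows_sparse = select_training_rows(X, feature_cols=[1, 2], target_col=0)
fill_value = simple_fallback(X, target_col=0)
```

In the library, a pattern like B would first go through optimask before reaching the fallback; the sketch only shows why the min_samples_train threshold matters.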

GPU Acceleration

Both imputers accept an optional device parameter (e.g. device="cuda") that solves the default ridge models as batched GPU operations instead of a per-pattern Python loop. This is most beneficial when many columns are imputed on medium-to-large matrices, where it can be an order of magnitude faster; imputed values match the CPU path up to float32 rounding.

GPU support relies on PyTorch, which is an optional dependency and is only imported when a device is requested:

pip install "datafiller[gpu]"

from datafiller import MultivariateImputer

# X is assumed to be an existing 2D NumPy array with NaN marking missing values.
imputer = MultivariateImputer(device="cuda")
X_imputed = imputer(X)

The default PyPI build of PyTorch ships CUDA support on Linux; on Windows, install a CUDA-enabled build by following pytorch.org/get-started. With the default device=None the pure NumPy/Numba CPU path runs and PyTorch is never imported. Categorical targets, custom regressors, and patterns with fewer than min_samples_train complete rows transparently fall back to the CPU implementation.
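To make the batching idea concrete, here is a minimal NumPy sketch that solves several ridge problems either one at a time in a Python loop or all at once as a stacked linear solve. It is an illustration under simplifying assumptions, not the library's GPU code: every pattern is given the same number of training rows and features, there is no intercept, and the function names and the alpha penalty are hypothetical.

```python
import numpy as np


def ridge_loop(X_batch, y_batch, alpha=1.0):
    """Solve one ridge problem per pattern in a Python loop."""
    eye = np.eye(X_batch.shape[2])
    return np.stack([
        np.linalg.solve(X.T @ X + alpha * eye, X.T @ y)
        for X, y in zip(X_batch, y_batch)
    ])


def ridge_batched(X_batch, y_batch, alpha=1.0):
    """Solve all ridge problems with a single batched linear solve."""
    Xt = np.swapaxes(X_batch, 1, 2)                        # (batch, features, rows)
    gram = Xt @ X_batch + alpha * np.eye(X_batch.shape[2])  # (batch, features, features)
    rhs = Xt @ y_batch[..., None]                          # (batch, features, 1)
    return np.linalg.solve(gram, rhs)[..., 0]              # (batch, features)


rng = np.random.default_rng(0)
X_batch = rng.normal(size=(8, 30, 4))  # 8 patterns, 30 training rows, 4 features
y_batch = rng.normal(size=(8, 30))

coef_loop = ridge_loop(X_batch, y_batch)
coef_batched = ridge_batched(X_batch, y_batch)
```

Both functions return the same coefficients; the batched form simply hands the whole stack to one call, which is the kind of operation a GPU backend can execute efficiently.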

Time Series Imputer

The TimeSeriesImputer is a wrapper around the MultivariateImputer that is specifically designed for time series data. It can infer a regular DatetimeIndex frequency, reinsert missing timestamp rows inside the observed range, and add low-cost calendar features that remain observed through contiguous gaps.
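The standard-library sketch below illustrates the two ideas in this paragraph: rebuilding a regular grid so that absent timestamps reappear as missing rows, and deriving calendar features that are always available because they depend only on the timestamp. It is not how TimeSeriesImputer is implemented. The helper names are hypothetical, the step is inferred as the most common gap (an assumption), and the hour and day-of-week features are only examples of what a calendar feature might be.

```python
from collections import Counter
from datetime import datetime, timedelta


def infer_step(timestamps):
    """Return the most common gap between consecutive sorted timestamps."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return Counter(gaps).most_common(1)[0][0]


def regularize(series):
    """Rebuild the regular grid between the first and last observed timestamps.

    Timestamps absent from `series` are reinserted with a value of None.
    """
    timestamps = sorted(series)
    step = infer_step(timestamps)
    grid = []
    current = timestamps[0]
    while current <= timestamps[-1]:
        grid.append((current, series.get(current)))
        current += step
    return grid


def calendar_features(timestamp):
    """Hypothetical calendar features; they depend only on the timestamp."""
    return {"hour": timestamp.hour, "day_of_week": timestamp.weekday()}


start = datetime(2024, 1, 1)
# Hourly readings with the 02:00 and 03:00 rows missing entirely.
series = {start + timedelta(hours=h): float(h) for h in (0, 1, 4, 5)}

grid = regularize(series)
features = [calendar_features(timestamp) for timestamp, _ in grid]
```

The reinserted rows have no value, but their calendar features are fully observed, which is what lets such features act as predictors across a contiguous gap.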

With Polars, pass an eager DataFrame and configure its Date or Datetime column through time_column. The timestamp column remains in the returned Polars DataFrame. Row selection accepts positions on the regularized output grid or timestamp values; column selection uses names and cannot target the timestamp column.

import polars as pl
from datafiller import TimeSeriesImputer

df = pl.DataFrame({
    "timestamp": pl.datetime_range(
        start=pl.datetime(2024, 1, 1),
        end=pl.datetime(2024, 1, 2),
        interval="1h",
        eager=True,
    ),
    "value": [float(i) for i in range(25)],
}).with_columns(
    pl.when(pl.int_range(pl.len()) == 12).then(None).otherwise(pl.col("value")).alias("value")
)

df_imputed = TimeSeriesImputer(time_column="timestamp", lags=[1, -1])(df)

PEMS-BAY Example

This example loads the PEMS-BAY dataset, punches a large contiguous hole in one sensor’s time series, adds 5% missing-at-random values to other sensors, and imputes the missing values using autoregressive lags and leads.

import numpy as np
import matplotlib.pyplot as plt
from datafiller import TimeSeriesImputer
from datafiller.datasets import add_mar, load_pems_bay

df = load_pems_bay()
rng = np.random.default_rng(0)
target_col = rng.choice(df.columns)
ground_truth = df[target_col].copy()

df_missing = df.copy()
n_rows = len(df_missing)
hole_length = int(n_rows * 0.2)
start = n_rows // 2 - hole_length // 2
end = start + hole_length
df_missing.loc[df_missing.index[start:end], target_col] = np.nan

other_cols = df_missing.columns.drop(target_col)
df_missing.loc[:, other_cols] = add_mar(df_missing[other_cols], nan_ratio=0.05, rng=0)

ts_imputer = TimeSeriesImputer(lags=[1, 2, 3, -1, -2, -3], rng=0)
df_imputed = ts_imputer(df_missing, cols_to_impute=[target_col], n_nearest_features=75)

Parameters

Initialization parameters include lags for autoregressive features (positive integers create lags such as t-1, negative integers create leads such as t+1), regressor for the numeric model, interpolate_gaps_less_than to pre-fill short gaps, and add_time_features to include deterministic calendar/trend predictors. The shared controls scoring, rng, min_samples_train, fallback, and verbose are also available, and the training-data selection path described above applies unchanged. Call parameters include rows_to_impute and cols_to_impute to target subsets, n_nearest_features to limit the features used for imputation (recommended to reduce computation time), and before/after to restrict the time window. For a complete list and full descriptions, see the API Reference.
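The sign convention for lags can be illustrated with a small NumPy sketch: a positive entry pulls in the value from earlier time steps, and a negative entry pulls in the value from later ones. The shift helper below is hypothetical and only mirrors the convention described above; it does not claim to reproduce how TimeSeriesImputer builds its features internally.

```python
import numpy as np


def shift(values, lag):
    """Shift a 1D float series by `lag` steps.

    lag > 0 yields the value at t-lag (a lag); lag < 0 yields the value at
    t+|lag| (a lead). Positions that fall outside the series become NaN.
    """
    out = np.full(values.shape, np.nan)
    if lag > 0:
        out[lag:] = values[:-lag]
    elif lag < 0:
        out[:lag] = values[-lag:]
    else:
        out[:] = values
    return out


values = np.array([10.0, 11.0, np.nan, 13.0, 14.0])
lags = [1, -1]

# One feature column per entry in `lags`: the t-1 value and the t+1 value.
features = np.column_stack([shift(values, k) for k in lags])
```

At index 2, where the series is missing, the lag column holds 11.0 and the lead column holds 13.0, so the gap is bracketed by observed neighbors on both sides. This is why mixing positive and negative entries, as in lags=[1, -1], can help with isolated gaps.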