How to Use

This guide walks through examples of using the MultivariateImputer and TimeSeriesImputer. DataFiller targets a pragmatic middle ground for imputation: it does not aim to match the absolute performance of large deep learning models on complex masking patterns, but it is simple to fit, easy to adapt, and straightforward to integrate into existing workflows. It is also significantly faster than scikit-learn’s IterativeImputer, which makes it well suited to fast iteration and production use.

Multivariate Imputer

The MultivariateImputer is the core of the library. It imputes missing values in a 2D NumPy array or pandas DataFrame and automatically handles mixed numerical, boolean, and categorical/string columns: non-numerical features are one-hot encoded internally so they can help impute other columns, and the result is returned in the original schema.
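
The encode-then-restore idea can be illustrated with plain pandas. This is a minimal sketch of the general technique, not DataFiller's internal code: the toy DataFrame, the column names, and the hard-coded scores standing in for a classifier's output are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy mixed-type data (hypothetical): one numeric and one categorical column
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "sex": ["male", "female", None, "female"],
})

# One-hot encode the categorical column into 0/1 indicator columns
dummies = pd.get_dummies(df["sex"], prefix="sex", dtype=float)
# A missing label must stay missing in every indicator column
dummies.loc[df["sex"].isna(), :] = np.nan
encoded = pd.concat([df[["age"]], dummies], axis=1)

# Stand-in for a classifier: hard-coded class scores for the missing row
filled = encoded.copy()
filled.loc[2, ["sex_female", "sex_male"]] = [0.2, 0.8]

# Restore the original schema: pick the highest-scoring class per row
indicator_cols = ["sex_female", "sex_male"]
labels = filled[indicator_cols].idxmax(axis=1).str[len("sex_"):]
print(labels.tolist())
```

The numeric encoding is what lets a categorical column act as a predictor for other columns, while the final step maps the indicators back to a single label column.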

Titanic Mixed-Feature Example

The Titanic dataset provides a compact mixed-type example. Categorical columns such as sex or embarked are one-hot encoded internally and used as predictors for other features, while their own missing labels are imputed with a classifier.

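from datafiller.datasets import load_titanic
from datafiller import MultivariateImputer, ExtremeLearningMachine

# Load the Titanic dataset and preview the first rows
df = load_titanic()
df.head(15)

# Use an ExtremeLearningMachine as the model for numeric columns
imputer = MultivariateImputer(regressor=ExtremeLearningMachine())

# Calling the imputer on the DataFrame returns the imputed data
df_imputed = imputer(df)
df_imputed.head(15)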
Original Titanic (titanic.csv)
Imputed Titanic (titanic_imputed.csv)

Parameters

The main initialization parameters are regressor and classifier, which set the models for numeric and categorical columns, plus scoring, rng, min_samples_train, and verbose, which control feature selection, reproducibility, training thresholds, and logging respectively. Call parameters include rows_to_impute and cols_to_impute to target subsets of the data, and n_nearest_features to limit the number of features used per imputation; setting n_nearest_features is recommended to reduce computation time. For a complete list and full descriptions, see the API Reference.
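
To give an intuition for why limiting the number of features reduces computation, the sketch below keeps only the n columns most related to a target column. It is an illustration under stated assumptions: the helper top_n_features is hypothetical, and ranking by absolute Pearson correlation is an assumption made here, not a description of the rule DataFiller applies through its scoring parameter. The sketch also assumes complete data, whereas a real imputer has to score features in the presence of missing values.

```python
import numpy as np

def top_n_features(X, target_idx, n):
    """Hypothetical helper: indices of the n columns whose absolute
    correlation with the target column is highest."""
    corr = np.corrcoef(X, rowvar=False)[target_idx]
    scores = np.abs(corr)
    scores[target_idx] = -np.inf  # never select the target itself
    return np.argsort(scores)[::-1][:n]

rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base,                                 # column 0: target
    base + 0.1 * rng.normal(size=200),    # column 1: strongly related
    rng.normal(size=200),                 # column 2: unrelated noise
    -base + 0.5 * rng.normal(size=200),   # column 3: negatively related
])

selected = top_n_features(X, target_idx=0, n=2)
print(selected)
```

With fewer predictors per column, each fitted model sees a narrower design matrix, which is where the time saving comes from.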

Time Series Imputer

The TimeSeriesImputer is a wrapper around the MultivariateImputer that is specifically designed for time series data.

PEMS-BAY Example

This example loads the PEMS-BAY dataset, punches a large contiguous hole (20% of the rows, centered in the series) in one randomly chosen sensor’s time series, adds 5% missing-at-random values to the other sensors, and imputes the hole using autoregressive lags and leads.

import numpy as np
import matplotlib.pyplot as plt
from datafiller import TimeSeriesImputer
from datafiller.datasets import add_mar, load_pems_bay

# Pick one sensor at random and keep its original values for later comparison
df = load_pems_bay()
rng = np.random.default_rng(0)
target_col = rng.choice(df.columns)
ground_truth = df[target_col].copy()

# Punch a contiguous hole covering 20% of the rows, centered in the series
df_missing = df.copy()
n_rows = len(df_missing)
hole_length = int(n_rows * 0.2)
start = n_rows // 2 - hole_length // 2
end = start + hole_length
df_missing.loc[df_missing.index[start:end], target_col] = np.nan

# Add 5% missing-at-random values to every other sensor
other_cols = df_missing.columns.drop(target_col)
np.random.seed(0)
df_missing.loc[:, other_cols] = add_mar(df_missing[other_cols], nan_ratio=0.05)

# Impute the target sensor using three lags (t-1..t-3) and three leads (t+1..t+3)
ts_imputer = TimeSeriesImputer(lags=[1, 2, 3, -1, -2, -3], rng=0)
df_imputed = ts_imputer(df_missing, cols_to_impute=[target_col], n_nearest_features=75)
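
One way to check the quality of such an imputation is to compare the imputed values with the held-out ground truth on the masked positions only. The sketch below uses small synthetic series so that it runs on its own; the helper name masked_mae and the toy values are assumptions for illustration and are not part of DataFiller.

```python
import numpy as np
import pandas as pd

def masked_mae(truth, imputed, mask):
    """Mean absolute error restricted to positions where mask is True."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    return float(np.mean(np.abs(truth[mask] - imputed[mask])))

# Toy data: positions 1 and 2 were masked and then imputed
truth = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])
imputed = pd.Series([10.0, 13.0, 13.0, 16.0, 18.0])
hole = np.array([False, True, True, False, False])

print(masked_mae(truth, imputed, hole))  # 1.0
```

In the PEMS-BAY example, the mask would be True on positions start to end, truth would be ground_truth, and imputed would be df_imputed[target_col], assuming the imputer returns a DataFrame with the same columns and index as its input.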

Parameters

Initialization parameters include lags for autoregressive features (positive integers create lags such as t-1, negative integers create leads such as t+1), regressor for the numeric model, interpolate_gaps_less_than to pre-fill short gaps, and the shared controls scoring, rng, min_samples_train, and verbose. Call parameters include rows_to_impute and cols_to_impute to target subsets of the data, n_nearest_features to limit the features used for imputation (recommended to reduce computation time), and before/after to restrict the time window. For a complete list and full descriptions, see the API Reference.
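
The sign convention for lags can be illustrated with pandas shift. This is a sketch of the convention only, not of how TimeSeriesImputer builds its features internally; the helper add_lag_features and the column naming scheme are hypothetical.

```python
import pandas as pd

def add_lag_features(series, lags):
    """Hypothetical helper: a positive k adds the value at t-k (a lag),
    a negative k adds the value at t+|k| (a lead)."""
    frame = pd.DataFrame({series.name: series})
    for k in lags:
        suffix = f"t-{k}" if k > 0 else f"t+{-k}"
        # shift(k) moves values forward in time by k steps (backward if k < 0)
        frame[f"{series.name}_{suffix}"] = series.shift(k)
    return frame

s = pd.Series([1.0, 2.0, 3.0, 4.0], name="speed")
features = add_lag_features(s, lags=[1, -1])
print(features)
```

Each row then carries the neighboring past and future observations as extra columns, so a missing value at time t can be predicted from both sides of the gap. Rows at the edges of the series have no neighbor on one side and receive NaN in the corresponding column.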