How to Use
This guide provides detailed examples of how to use the MultivariateImputer and TimeSeriesImputer. DataFiller targets a pragmatic middle ground for imputation: it does not aim to match the absolute performance of large deep learning models on complex masking patterns, but it is simple to fit, easy to adapt, and flexible enough to integrate into existing workflows. It is also significantly faster than scikit-learn’s IterativeImputer, which makes it well suited to fast iteration and production use.
Multivariate Imputer
The MultivariateImputer is the core of the library, designed to impute missing values in a 2D NumPy array or pandas DataFrame.
It automatically handles mixed numerical, boolean, and categorical/string columns: non-numerical features are one-hot encoded internally so they can help impute other columns, and the result is returned in the original schema.
Titanic Mixed-Feature Example
The Titanic dataset provides a compact mixed-type example. Categorical columns (such as sex or embarked) are one-hot encoded internally and used as predictors for other features, while their own missing labels are imputed with a classifier.
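from datafiller.datasets import load_titanic
from datafiller import MultivariateImputer, ExtremeLearningMachine
# Load the mixed-type Titanic data and preview the rows that contain gaps
df = load_titanic()
df.head(15)
# Use an ExtremeLearningMachine as the model for numerical columns
imputer = MultivariateImputer(regressor=ExtremeLearningMachine())
# Calling the imputer on the DataFrame returns the imputed copy
df_imputed = imputer(df)
df_imputed.head(15)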
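The one-hot round trip described above can be pictured with plain pandas. The following is a minimal sketch, not DataFiller's implementation: the helper name one_hot_with_missing, the toy embarked column, and the hard-coded class scores (standing in for a classifier's output) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def one_hot_with_missing(col: pd.Series) -> pd.DataFrame:
    """One-hot encode a column, marking rows with a missing label as all-NaN.

    Hypothetical helper for illustration; not part of DataFiller.
    """
    dummies = pd.get_dummies(col, dtype=float)
    # get_dummies encodes a missing label as all zeros; make the gap explicit
    dummies.loc[col.isna(), :] = np.nan
    return dummies


embarked = pd.Series(["S", "C", None, "S", "Q"], name="embarked")
encoded = one_hot_with_missing(embarked)

# Stand-in for a classifier: made-up class scores for the missing row
filled = encoded.copy()
filled.loc[2, ["C", "Q", "S"]] = [0.1, 0.2, 0.7]

# Decode back to the original schema by taking the highest-scoring class
decoded = filled.idxmax(axis=1).rename("embarked")
```

Observed rows decode to their original label, and the missing row takes the class with the highest score, so the output has the same single-column shape as the input.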
Parameters
The main initialization parameters are regressor and classifier, which set the models used for numerical and categorical columns, plus scoring, rng, min_samples_train, and verbose, which control feature selection, reproducibility, training thresholds, and logging, respectively. Call parameters include rows_to_impute and cols_to_impute to target subsets of the data, and n_nearest_features to limit the number of features used per imputation; setting n_nearest_features is recommended to reduce computation time. For a complete list and full descriptions, see the API Reference.
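To give some intuition for why capping the number of features reduces computation, the sketch below picks the k columns most associated with a target column and discards the rest. It assumes absolute Pearson correlation as the score, which is a stand-in chosen for illustration: DataFiller's own scoring may differ, and the function nearest_features and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def nearest_features(df: pd.DataFrame, target: str, k: int) -> list[str]:
    """Return the k columns with the highest absolute correlation to `target`.

    Illustrative stand-in for feature selection; not DataFiller's scoring.
    """
    # DataFrame.corr ignores NaNs pairwise, so gaps do not break the score
    scores = df.corr()[target].abs().drop(target)
    return scores.nlargest(k).index.tolist()


rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "target": x,
        "close": x + 0.1 * rng.normal(size=200),  # strongly related
        "mid": x + rng.normal(size=200),          # moderately related
        "far": rng.normal(size=200),              # unrelated noise
    }
)
toy.loc[::10, "close"] = np.nan  # a few missing values in a predictor

selected = nearest_features(toy, "target", k=2)
```

A model fitted on the selected columns alone sees two predictors instead of three here; on wide datasets the same idea keeps the per-column model small.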
Time Series Imputer
The TimeSeriesImputer is a wrapper around the MultivariateImputer that is specifically designed for time series data.
PEMS-BAY Example
This example loads the PEMS-BAY dataset, punches a large contiguous hole in one sensor’s time series, adds 5% missing-at-random values to other sensors, and imputes the missing values using autoregressive lags and leads.
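import numpy as np
import matplotlib.pyplot as plt
from datafiller import TimeSeriesImputer
from datafiller.datasets import add_mar, load_pems_bay
df = load_pems_bay()
# Pick one sensor at random and keep its original values for later comparison
rng = np.random.default_rng(0)
target_col = rng.choice(df.columns)
ground_truth = df[target_col].copy()
df_missing = df.copy()
# Punch a contiguous hole covering 20% of the rows, centered in the series
n_rows = len(df_missing)
hole_length = int(n_rows * 0.2)
start = n_rows // 2 - hole_length // 2
end = start + hole_length
df_missing.loc[df_missing.index[start:end], target_col] = np.nan
# Add 5% missing-at-random values to every other sensor
other_cols = df_missing.columns.drop(target_col)
# Seed NumPy's global RNG before masking (assumed to make add_mar reproducible)
np.random.seed(0)
df_missing.loc[:, other_cols] = add_mar(df_missing[other_cols], nan_ratio=0.05)
# Use three lags (t-1..t-3) and three leads (t+1..t+3) as autoregressive features
ts_imputer = TimeSeriesImputer(lags=[1, 2, 3, -1, -2, -3], rng=0)
# Impute only the target sensor, using at most 75 features per imputation
df_imputed = ts_imputer(df_missing, cols_to_impute=[target_col], n_nearest_features=75)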
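The example keeps ground_truth so that the imputed hole can be compared with the original values. One way to do that is to compute error metrics only over the masked positions. The sketch below is a minimal illustration on toy data: the helper masked_errors is hypothetical (not part of DataFiller), and a constant fill stands in for the imputer.

```python
import numpy as np
import pandas as pd


def masked_errors(truth: pd.Series, imputed: pd.Series, mask: pd.Series) -> dict:
    """Return MAE and RMSE restricted to the positions where `mask` is True.

    Hypothetical helper for illustration; not part of DataFiller.
    """
    diff = (imputed[mask] - truth[mask]).to_numpy(dtype=float)
    return {
        "mae": float(np.mean(np.abs(diff))),
        "rmse": float(np.sqrt(np.mean(diff**2))),
    }


# Toy series with a contiguous hole at positions 4..6
truth = pd.Series(np.arange(10.0))
observed = truth.copy()
observed.iloc[4:7] = np.nan
mask = observed.isna()

# Stand-in for an imputer: fill the hole with a constant
imputed = observed.fillna(5.0)

scores = masked_errors(truth, imputed, mask)
```

In the PEMS-BAY example, the analogous mask would cover rows start:end of target_col, assuming df_imputed keeps the same index and columns as df_missing.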
Parameters
Initialization parameters include lags for autoregressive features (positive integers create lags such as t-1; negative integers create leads such as t+1), regressor for the numeric model, interpolate_gaps_less_than to pre-fill short gaps, and the shared controls scoring, rng, min_samples_train, and verbose. Call parameters include rows_to_impute and cols_to_impute to target subsets of the data, n_nearest_features to limit the number of features used for imputation (recommended to reduce computation time), and before/after to restrict the time window. For a complete list and full descriptions, see the API Reference.
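The sign convention for lags can be illustrated with pandas' shift. This is a sketch of the idea only: whether TimeSeriesImputer builds its features this way is an assumption, and the helper add_lag_features and its column naming are hypothetical.

```python
import pandas as pd


def add_lag_features(series: pd.Series, lags: list[int]) -> pd.DataFrame:
    """Return the series plus one shifted copy per entry in `lags`.

    A positive k yields the value at t-k (a lag); a negative k yields the
    value at t+|k| (a lead). Hypothetical helper; not part of DataFiller.
    """
    features = {series.name: series}
    for k in lags:
        label = f"{series.name}_t{-k:+d}"  # k=1 -> "..._t-1", k=-1 -> "..._t+1"
        features[label] = series.shift(k)
    return pd.DataFrame(features)


speed = pd.Series([10.0, 11.0, 12.0, 13.0], name="speed")
features = add_lag_features(speed, lags=[1, -1])
```

Each row now carries its neighbors from both directions, which is why leads can help inside a long gap: values after the hole inform the imputation as well as values before it. The shifted columns are NaN at the edges of the series, where no neighbor exists.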