API Reference

This page provides a detailed API reference for the classes and functions in the datafiller library.

Imputer Classes

MultivariateImputer

Bases: BaseEstimator, TransformerMixin

Imputes missing values in a 2D numpy array.

This class uses a model-based approach to fill in missing values, where each feature with missing values is predicted using other features in the dataset. It is designed to be efficient, using Numba for critical parts and finding optimal data subsets for model training. When a pandas or Polars DataFrame contains categorical, string, or boolean columns, they are one-hot encoded internally and imputed with a classifier before returning the original column layout.

Rows to impute are grouped by their pattern of observed features, and the training data for each pattern is selected along a fixed three-step path:

1. Complete rows: the rows fully observed on the pattern’s features are used directly when at least min_samples_train of them exist.
2. optimask: otherwise, the optimask algorithm searches for the largest NaN-free rectangular subset, trading feature columns for training rows and preferring rectangles that keep at least min_samples_train rows.
3. Fallback: cells whose pattern still cannot reach the threshold are filled by the fallback strategy (column mean or most frequent category by default, or left as NaN with fallback=None).
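The selection path above can be pictured with a small NumPy sketch. It is not the library's implementation: the helper name training_strategy is hypothetical, and the optimask search is only represented by a label rather than actually run.

```python
import numpy as np


def training_strategy(X, target_col, min_samples_train=20):
    """Label each missingness pattern of the rows that need `target_col` filled.

    Hypothetical helper for illustration only; it mirrors the first decision
    of the documented path (enough complete rows or not).
    """
    observed = ~np.isnan(X)
    needs_fill = ~observed[:, target_col]
    has_target = observed[:, target_col]  # training rows must observe the target

    # One entry per distinct pattern of observed features among rows to fill.
    patterns = np.unique(observed[needs_fill].astype(np.uint8), axis=0).astype(bool)

    plan = {}
    for pattern in patterns:
        features = np.flatnonzero(pattern)
        # Rows usable for training: target observed and every pattern feature observed.
        complete = has_target & observed[:, features].all(axis=1)
        key = tuple(int(f) for f in features)
        if int(complete.sum()) >= min_samples_train:
            plan[key] = "complete rows"
        else:
            plan[key] = "optimask, then fallback"
    return plan
```

Each key of the returned dictionary is the tuple of feature indices observed in one pattern, so rows sharing a pattern share one training set.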

Parameters:

regressor (RegressorMixin, optional) – A scikit-learn compatible regressor. It should be a lightweight model, as it is fitted many times. By default, a custom Ridge implementation is used.
classifier (ClassifierMixin, optional) – A scikit-learn compatible classifier used for categorical and string targets. Defaults to DecisionTreeClassifier(max_depth=4, random_state=rng).
verbose (int, optional) – The verbosity level. Defaults to 0.
min_samples_train (int, optional) – The minimum number of samples required to train a model. Patterns with fewer complete rows fall back to an optimask search that trades feature columns for training rows; cells whose patterns still cannot reach the threshold are handled by fallback. Defaults to None, which resolves to 20. This value was calibrated on real and synthetic datasets, where values of 10 to 20 consistently beat more permissive ones: fits on fewer samples are often worse than a plain column mean once missingness reaches about 25%.
fallback (str or None, optional) – What to do with cells no model could impute (their pattern never reached min_samples_train training rows). "simple" (default) fills them with the column mean (most frequent category for categorical columns). None leaves them as NaN.
rng (int, optional) – A seed for the random number generator. This is used for reproducible feature sampling when n_nearest_features is not None, and for the default categorical classifier when one is not provided. Defaults to None.
scoring (str or callable, optional) – The scoring function to use for feature selection. If "default", the default scoring function is used. If a callable, it must take two arguments as input: the data matrix X (np.ndarray of shape (n_samples, n_features)) and the columns to impute cols_to_impute (np.ndarray of shape (n_cols_to_impute,)), and return a score matrix of shape (n_cols_to_impute, n_features). Defaults to "default".
device (str, optional) – Device used to solve the default ridge models, e.g. "cuda" or "cuda:0". Requires the optional PyTorch dependency (pip install datafiller[gpu]). All missingness patterns of a column are then solved as batched tensor operations, which is considerably faster when many columns are imputed on large matrices. Categorical targets and patterns with fewer than min_samples_train complete rows still use the CPU implementation, and a custom regressor ignores device entirely (a UserWarning is emitted). If None (default), the pure NumPy/Numba CPU path is used and PyTorch is never imported.
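As an illustration of the callable form of scoring, here is a minimal sketch that respects the documented contract: it takes X and cols_to_impute and returns an array of shape (n_cols_to_impute, n_features). The function name abs_correlation_scoring is hypothetical, and two points are assumptions not stated on this page: that a higher score marks a more useful feature, and that a column should score zero against itself.

```python
import numpy as np


def abs_correlation_scoring(X, cols_to_impute):
    """Score features by absolute Pearson correlation on pairwise-complete rows.

    Assumption: higher scores mean "more useful"; the library's default
    scoring function may be defined differently.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    scores = np.zeros((len(cols_to_impute), n_features))
    for i, col in enumerate(cols_to_impute):
        for j in range(n_features):
            if j == col:
                continue  # assumption: a column does not score against itself
            both = ~np.isnan(X[:, col]) & ~np.isnan(X[:, j])
            if both.sum() < 3:
                continue  # too few shared observations to estimate a correlation
            a, b = X[both, col], X[both, j]
            if a.std() == 0 or b.std() == 0:
                continue  # correlation is undefined for constant columns
            scores[i, j] = abs(np.corrcoef(a, b)[0, 1])
    return scores
```

A function like this would be passed through the scoring parameter described above.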

imputation_features_

A dictionary mapping each imputed column to the features used to impute it. This attribute is only populated when n_nearest_features is not None. If the input is a pandas or Polars DataFrame, the keys and values will be column names. If the input is a NumPy array, they will be integer indices.

Type: dict or None

Examples

import numpy as np
from datafiller import MultivariateImputer

# Create a matrix with missing values
X = np.array([
    [1, 2, 3],
    [4, np.nan, 6],
    [7, 8, 9]
])

# Create an imputer and fill the missing values
imputer = MultivariateImputer()
X_imputed = imputer(X)

print(X_imputed)

Imputes missing values in the input data.

The method handles NumPy arrays and eager pandas or Polars DataFrames.

Parameters:

x – The input data matrix with missing values (NaNs). Can be a NumPy array or an eager pandas or Polars DataFrame.
rows_to_impute – The rows to impute. The interpretation of this argument depends on the type of x:
- If x is a NumPy array, this must be a list of integer indices.
- If x is a pandas DataFrame, this must be a list of index labels.
- If x is a Polars DataFrame, this must be a list of integer row positions.
If None, all rows are considered for imputation. Defaults to None.
cols_to_impute – The columns to impute. The interpretation of this argument depends on the type of x:
- If x is a NumPy array, this must be a list of integer indices.
- If x is a pandas or Polars DataFrame, this must be a list of column names.
If None, all columns are considered for imputation. Defaults to None.
n_nearest_features – The number of features to use for imputation. If it’s an int, it’s the absolute number of features. If it’s a float, it’s the fraction of features to use. If None, all features are used. Defaults to None.
normalize – Whether to normalize numeric columns before imputation, then transform imputed values back to the original scale. Defaults to True.

Returns:

The imputed data matrix. The return type will match the input type (NumPy array, pandas DataFrame, or Polars DataFrame).
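The normalize option can be pictured as a scale, impute, unscale round trip. The sketch below assumes z-score scaling computed on observed values only; this page does not say which normalization the library applies, and the helper names standardize and destandardize are hypothetical.

```python
import numpy as np


def standardize(X):
    """Z-score each column using only its observed (non-NaN) values."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero on constant columns
    return (X - mean) / std, mean, std


def destandardize(Z, mean, std):
    """Map standardized values back to the original scale."""
    return Z * std + mean


X = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
Z, mean, std = standardize(X)

# Stand-in for a model prediction made in the standardized space:
# a value of 0 there corresponds to the column mean.
Z_filled = np.where(np.isnan(Z), 0.0, Z)
X_filled = destandardize(Z_filled, mean, std)
```

Observed cells come back unchanged, and the filled cell lands on the original scale of its column.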

TimeSeriesImputer

class datafiller.TimeSeriesImputer(lags: Iterable[int] = (1,), regressor: RegressorMixin | None = None, classifier: ClassifierMixin | None = None, min_samples_train: int | None = None, fallback: str | None = 'simple', rng: int | None = None, verbose: int = 0, scoring: str | Callable = 'default', interpolate_gaps_less_than: int | None = None, add_time_features: bool = True, device: str | None = None, time_column: str | None = None)[source]

Bases: BaseEstimator, TransformerMixin

Imputes missing values in time series data.

This class wraps the MultivariateImputer to specifically handle time series data in pandas or Polars DataFrames. It automatically creates lagged and lead features along the time axis, then uses these new features to impute missing values.

Parameters:

lags (Iterable[int], optional) – An iterable of integers specifying the lags and leads to create as autoregressive features. Positive integers create lags (e.g., t-1), and negative integers create leads (e.g., t+1). Defaults to (1,).
regressor (RegressorMixin, optional) – A scikit-learn compatible regressor used for numeric targets. Defaults to FastRidge.
classifier (ClassifierMixin, optional) – A scikit-learn compatible classifier used for categorical or string targets. Defaults to DecisionTreeClassifier(max_depth=4).
min_samples_train (int, optional) – The minimum number of samples required to train a model. Defaults to None, which resolves to 20 (see MultivariateImputer).
fallback (str or None, optional) – What to do with cells no model could impute. "simple" (default) fills them with the column mean (mode for categoricals); None leaves them as NaN.
rng (int, optional) – A seed for the random number generator. This is used for reproducible feature sampling when n_nearest_features is not None. Defaults to None.
verbose (int, optional) – The verbosity level. Defaults to 0.
scoring (str or callable, optional) – The scoring function to use for feature selection. If "default", the default scoring function is used. If a callable, it must take two arguments (the data matrix and the columns to impute) and return a score matrix. Defaults to "default".
interpolate_gaps_less_than (int, optional) – The maximum length of gaps to interpolate linearly. If None, no linear interpolation is performed. Defaults to None.
device (str, optional) – Device used to solve the default ridge models, e.g. "cuda". Requires the optional PyTorch dependency (pip install datafiller[gpu]). If None (default), the pure NumPy/Numba CPU path is used. See MultivariateImputer.
add_time_features (bool, optional) – Whether to add deterministic time features before model-based imputation. These features are fully observed after reindexing, which helps fill contiguous missing timestamp blocks. Defaults to True.
time_column (str, optional) – Name of the Date or Datetime column that represents time for a Polars DataFrame. Required for Polars input; pandas input continues to use its DatetimeIndex. Defaults to None.
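The sign convention of lags can be sketched with pandas shift: a positive value looks back (t-1), a negative value looks ahead (t+1). The helper add_lag_features and the generated column names are hypothetical; the names the library gives to its lagged and lead columns are not documented on this page.

```python
import numpy as np
import pandas as pd


def add_lag_features(df, lags=(1,)):
    """Append shifted copies of every column: lag > 0 is a lag, lag < 0 is a lead."""
    out = df.copy()
    for lag in lags:
        kind = "lag" if lag > 0 else "lead"
        for col in df.columns:
            # shift(1) moves values down one row, so row t holds the value from t-1;
            # shift(-1) moves values up one row, so row t holds the value from t+1.
            out[f"{col}_{kind}{abs(lag)}"] = df[col].shift(lag)
    return out


index = pd.date_range("2020-01-01", periods=3, freq="D")
frame = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=index)
with_lags = add_lag_features(frame, lags=(1, -1))
```

The first row has no lag and the last row has no lead, so those cells are NaN and become additional values the underlying imputer has to cope with.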

imputation_features_

A dictionary mapping each imputed column to the features used to impute it. This attribute is only populated when n_nearest_features is not None. The keys and values are the column names, which will include the lagged/lead features created during the imputation process.

Type: dict or None

Examples

import pandas as pd
import numpy as np
from datafiller import TimeSeriesImputer

# Create a time series DataFrame with missing values
rng = pd.date_range('2020-01-01', periods=10, freq='D')
data = {'value': [1, 2, np.nan, 4, 5, 6, np.nan, 8, 9, 10]}
df = pd.DataFrame(data, index=rng)

# Create a time series imputer and fill missing values
ts_imputer = TimeSeriesImputer(lags=[1, -1])
df_imputed = ts_imputer(df)

print(df_imputed)

Imputes missing values in a time series DataFrame.

Parameters:

df – A pandas DataFrame with a DatetimeIndex, or an eager Polars DataFrame with the Date/Datetime column configured by time_column. If the time axis has no explicit frequency, a regular one is inferred from the timestamps and any missing timestamps inside the observed range are reinserted as rows to impute.
rows_to_impute – The rows to impute. Can be an iterable of integer positions, timestamp values, a pandas DatetimeIndex, or None. Polars integer positions refer to the regularized output grid. If None, all rows are considered. Defaults to None.
cols_to_impute – The indices or names of columns to impute. If None, all columns are considered. Defaults to None.
n_nearest_features – The number of features to use for imputation. If it’s an int, it’s the absolute number of features. If it’s a float, it’s the fraction of features to use. If None, all features are used. Defaults to None.
before – A timestamp-like object. If specified, only rows before this timestamp are imputed. Accepts any value for which pd.to_datetime(str(value)) succeeds. Defaults to None.
after – A timestamp-like object. If specified, only rows after this timestamp are imputed. Accepts any value for which pd.to_datetime(str(value)) succeeds. Defaults to None.
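The reinsertion of missing timestamps described for df can be sketched with a pandas reindex. This is only an illustration under a stated assumption: the step is taken as the smallest gap between consecutive timestamps, whereas the library's actual frequency inference is not described here. The helper name regularize is hypothetical.

```python
import numpy as np
import pandas as pd


def regularize(df):
    """Reindex a sorted DatetimeIndex onto a regular grid, adding NaN rows for gaps."""
    # Assumption: the regular step equals the smallest observed gap.
    step = df.index.to_series().diff().min()
    full_index = pd.date_range(df.index[0], df.index[-1], freq=step)
    return df.reindex(full_index)


index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"])
sparse = pd.DataFrame({"value": [1.0, 2.0, 4.0]}, index=index)
regular = regularize(sparse)
```

The row for 2020-01-03 now exists and holds NaN, which makes it a candidate for imputation like any other missing cell.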

Returns:

The imputed DataFrame, matching the input dataframe implementation.

Raises:

TypeError – If the input is unsupported or its time axis is invalid.
ValueError – If no regular frequency can be inferred from the index (e.g. unsorted or duplicated timestamps, or irregular gaps), or if the columns are not numeric.

Models

FastRidge

class datafiller.FastRidge(alpha: float = 0.01, fit_intercept: bool = True)[source]

Bases: object

A simplified Ridge regressor.

This implementation is designed for speed and assumes that the input data is well-behaved (e.g., no NaNs, correct dtypes). It is not a full-featured scikit-learn estimator but provides the necessary fit and predict methods for use within datafiller.

Parameters:

alpha (float) – The regularization strength. Defaults to 0.01.
fit_intercept (bool) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations. Defaults to True.
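To make the roles of alpha and fit_intercept concrete, here is the textbook closed-form ridge solution in NumPy. It is a sketch of the general technique, not FastRidge's actual code, whose internals are not described on this page; ridge_fit and ridge_predict are hypothetical names.

```python
import numpy as np


def ridge_fit(X, y, alpha=0.01, fit_intercept=True):
    """Solve (Xc^T Xc + alpha * I) w = Xc^T yc, centering the data when an intercept is fitted."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if fit_intercept:
        x_mean, y_mean = X.mean(axis=0), y.mean()
    else:
        x_mean, y_mean = np.zeros(X.shape[1]), 0.0
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef  # stays 0.0 when fit_intercept is False
    return coef, intercept


def ridge_predict(X, coef, intercept):
    """Apply the fitted linear model."""
    return np.asarray(X, dtype=float) @ coef + intercept


X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2.0 * X_train[:, 0] + 1.0
coef, intercept = ridge_fit(X_train, y_train, alpha=1e-8)
```

A larger alpha shrinks the coefficients toward zero; with the intercept fitted on centered data, the intercept itself is not penalized in this formulation.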

fit(X: ndarray, y: ndarray) → FastRidge[source]

Fits the Ridge regression model.

Parameters:

X (np.ndarray) – The training data.
y (np.ndarray) – The target values.

Returns:

The fitted regressor.

Return type:

self

predict(X: ndarray) → ndarray[source]

Makes predictions using the fitted model.

Parameters:

X (np.ndarray) – The data to predict on.

Returns:

The predicted values.

Return type:

np.ndarray

ExtremeLearningMachine

class datafiller.ExtremeLearningMachine(n_features: int = 100, alpha: float = 1.0, random_state: int = 0, min_samples_per_feature: int = 5)[source]

Bases: object

An Extreme Learning Machine (ELM) estimator.

This implementation uses a random projection, a ReLU activation, and a FastRidge regressor. It is designed for speed and assumes that the input data is well-behaved.

The random projection is sampled lazily for each input width and cached, so a single instance can be refit on data with varying numbers of input features (as happens inside the imputers) while staying reproducible. Weights and bias are fan-in scaled (1 / sqrt(n_input_features)) so the pre-activation variance, and therefore the effective strength of alpha, stays consistent across the different input widths the imputers refit this estimator with.

The hidden width is also capped by the number of training samples seen at fit time (see min_samples_per_feature): the imputers refit this estimator on a small, pattern-specific subset of rows for every distinct missingness pattern, and a hidden layer wider than that subset turns the internal ridge fit into a severely underdetermined (more parameters than samples) problem regardless of regularization.
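The lazy, per-width projection with fan-in scaling can be sketched as follows. The class name ProjectionCache, the Gaussian weight distribution, and the way the seed is combined with the input width are all assumptions made for illustration; only the caching per input width, the 1 / sqrt(n_input_features) scaling, and the ReLU activation come from the description above.

```python
import numpy as np


class ProjectionCache:
    """Hypothetical helper: one cached random projection per input width."""

    def __init__(self, n_hidden, random_state=0):
        self.n_hidden = n_hidden
        self.random_state = random_state
        self._cache = {}

    def get(self, n_inputs):
        """Return (weights, bias) for this input width, sampling them on first use."""
        if n_inputs not in self._cache:
            # Seeding with (random_state, n_inputs) keeps each width reproducible.
            rng = np.random.default_rng([self.random_state, n_inputs])
            scale = 1.0 / np.sqrt(n_inputs)  # fan-in scaling
            weights = rng.standard_normal((n_inputs, self.n_hidden)) * scale
            bias = rng.standard_normal(self.n_hidden) * scale
            self._cache[n_inputs] = (weights, bias)
        return self._cache[n_inputs]

    def transform(self, X):
        """Project X and apply a ReLU activation."""
        weights, bias = self.get(X.shape[1])
        return np.maximum(X @ weights + bias, 0.0)
```

The hidden features returned by transform would then be fed to a ridge regressor. Because the projection depends only on the seed and the input width, refitting on a different column subset and coming back to an earlier width reuses the same weights.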

Parameters:

n_features (int) – The maximum number of features in the random projection.
alpha (float) – The regularization strength for the FastRidge regressor.
random_state (int) – A seed for the random number generator for reproducibility.
min_samples_per_feature (int) – The minimum number of training samples required per hidden feature. At fit time, the hidden width is capped to n_samples // min_samples_per_feature (at least 1) so it never exceeds what the available data can support. Set to 0 to disable the cap and always use n_features. Defaults to 5.
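The cap described for min_samples_per_feature reduces to a few lines of integer arithmetic. The function name effective_hidden_width is hypothetical; the rule itself follows the parameter description above.

```python
def effective_hidden_width(n_features, n_samples, min_samples_per_feature=5):
    """Hidden width actually used at fit time, following the documented cap."""
    if min_samples_per_feature <= 0:
        return n_features  # cap disabled: always use n_features
    # At least one hidden feature, and never more than n_features.
    cap = max(1, n_samples // min_samples_per_feature)
    return min(n_features, cap)
```

With the defaults shown in the class signature (n_features=100, min_samples_per_feature=5), a pattern trained on 40 rows would use 8 hidden features, and the full 100 would only be reached from 500 rows upward.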

fit(X: ndarray, y: ndarray) → ExtremeLearningMachine[source]

Fits the ELM model.

Parameters:

X (np.ndarray) – The training data.
y (np.ndarray) – The target values.

Returns:

The fitted estimator.

Return type:

self

get_params(deep: bool = True) → dict[source]

Get parameters for this estimator.

Parameters:

deep (bool) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

Parameter names mapped to their values.

Return type:

dict

predict(X: ndarray) → ndarray[source]

Makes predictions using the fitted model.

Parameters:

X (np.ndarray) – The data to predict on.

Returns:

The predicted values.

Return type:

np.ndarray

set_params(**params) → ExtremeLearningMachine[source]

Set the parameters of this estimator.

Parameters:

**params – Estimator parameters.

Returns:

Estimator instance.

Return type:

self

Exceptions

All validation errors raised by the library derive from DataFillerError, so a single except DataFillerError catches every datafiller-specific error. The concrete classes also inherit from the matching builtin (ValueError or TypeError), so existing handlers keep working.

Exception classes

class datafiller.DataFillerError[source]

Bases: Exception

Base class for all errors raised by datafiller.

class datafiller.DataFillerValueError[source]

Bases: DataFillerError, ValueError

A datafiller error raised for invalid argument values.

class datafiller.DataFillerTypeError[source]

Bases: DataFillerError, TypeError

A datafiller error raised for arguments of an unsupported type.
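The effect of this hierarchy on error handling can be shown with local stand-in classes that mirror the documented bases. In real code the classes would be imported from datafiller rather than redefined; validate_fallback and error_name are hypothetical helpers that exist only to raise and catch the stand-ins.

```python
# Local stand-ins mirroring the documented bases (illustration only).
class DataFillerError(Exception):
    pass


class DataFillerValueError(DataFillerError, ValueError):
    pass


class DataFillerTypeError(DataFillerError, TypeError):
    pass


def validate_fallback(fallback):
    """Hypothetical validator used to trigger both error kinds."""
    if fallback is not None and not isinstance(fallback, str):
        raise DataFillerTypeError("fallback must be a str or None")
    if fallback not in (None, "simple"):
        raise DataFillerValueError(f"unknown fallback: {fallback!r}")


def error_name(value):
    """Catch every library-specific error with a single except clause."""
    try:
        validate_fallback(value)
    except DataFillerError as exc:
        return type(exc).__name__
    return "ok"


def caught_as_builtin(value):
    """A handler written against the builtin ValueError keeps working."""
    try:
        validate_fallback(value)
    except ValueError:
        return True
    return False
```

Because of the multiple inheritance, the same exception instance satisfies both except DataFillerError and except ValueError (or except TypeError), whichever a caller already uses.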

Low-Level Functions

optimask

Finds optimal rectangular subsets of a matrix.

This module provides the optimask function, a low-level utility for finding the largest rectangular subset of a matrix that contains no missing values. This is used to select the best rows and columns for training an imputation model.

datafiller._optimask.optimask(iy: ndarray, ix: ndarray, rows: ndarray, cols: ndarray, global_matrix_size: tuple[int, int], copy: bool = True, min_rows: int = 1) → tuple[ndarray, ndarray][source]

Finds the largest rectangular area of a matrix for training.

This is the main function of this module. It uses a Pareto-optimal sorting strategy to find the largest rectangle of non-NaN values, which can then be used to train a model for imputation.

Parameters:

iy – Row indices of NaNs.
ix – Column indices of NaNs.
rows – The rows to consider for the mask.
cols – The columns to consider for the mask.
global_matrix_size – The shape of the original matrix (m, n).
copy – If False, iy and ix are used as scratch space and overwritten, halving peak memory. Only pass False when the arrays are not reused afterwards.
min_rows – Prefer rectangles keeping at least this many rows over strictly larger ones that keep fewer; falls back to the unconstrained maximum-area rectangle when the constraint is infeasible. Callers that need min_rows training samples therefore get a usable rectangle whenever one exists on the Pareto front.

Returns:

A tuple containing the rows and columns to keep for training.
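The sketch below shows how the NaN coordinates iy and ix relate to a matrix, how a returned (rows, cols) pair could be sanity-checked, and how the min_rows preference orders candidate rectangles. It deliberately does not call optimask: the function lives in a private module, and the dtypes it expects are not documented here. All three helper names are hypothetical.

```python
import numpy as np


def nan_coordinates(X):
    """Row and column indices of every NaN cell, in the (iy, ix) layout."""
    iy, ix = np.nonzero(np.isnan(X))
    return iy, ix


def is_nan_free(X, rows, cols):
    """Check that the sub-matrix selected by rows x cols contains no NaN."""
    return not np.isnan(X[np.ix_(rows, cols)]).any()


def pick_rectangle(candidates, min_rows=1):
    """Choose among (n_rows, n_cols) candidates following the min_rows rule.

    Prefer the largest-area candidate with at least `min_rows` rows; if none
    qualifies, fall back to the largest-area candidate overall.
    """
    feasible = [c for c in candidates if c[0] >= min_rows]
    pool = feasible or candidates
    return max(pool, key=lambda c: c[0] * c[1])


X = np.arange(9, dtype=float).reshape(3, 3)
X[1, 1] = np.nan
iy, ix = nan_coordinates(X)
```

In this example, dropping row 1 or dropping column 1 both yield a NaN-free rectangle of six cells; which one is preferable depends on whether training rows or feature columns matter more, which is the trade-off min_rows controls.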