API Reference
This page provides a detailed API reference for the classes and functions in the datafiller library.
Imputer Classes
MultivariateImputer
- class datafiller.MultivariateImputer(*, regressor: RegressorMixin | None = None, classifier: ClassifierMixin | None = None, verbose: int = 0, min_samples_train: int | None = None, rng: int | None = None, scoring: str | callable = 'default')
Bases: BaseEstimator, TransformerMixin
Imputes missing values in a 2D NumPy array.
This class uses a model-based approach to fill in missing values, where each feature with missing values is predicted using other features in the dataset. It is designed to be efficient, using Numba for critical parts and finding optimal data subsets for model training. When a pandas DataFrame contains categorical, string, or boolean columns, they are one-hot encoded internally and imputed with a classifier before returning the original column layout.
- Parameters:
regressor (RegressorMixin, optional) – A scikit-learn compatible regressor. It should be a lightweight model, as it is fitted many times. By default, a custom Ridge implementation is used.
classifier (ClassifierMixin, optional) – A scikit-learn compatible classifier used for categorical and string targets. Defaults to DecisionTreeClassifier(max_depth=4, random_state=rng).
verbose (int, optional) – The verbosity level. Defaults to 0.
min_samples_train (int, optional) – The minimum number of samples required to train a model. If, after the imputation, some values are still missing, it is likely that no training set with at least min_samples_train samples could be found. Defaults to None, which means that a model will be trained if at least one sample is available.
rng (int, optional) – A seed for the random number generator. This is used for reproducible feature sampling when n_nearest_features is not None, and for the default categorical classifier when one is not provided. Defaults to None.
scoring (str or callable, optional) – The scoring function to use for feature selection. If ‘default’, the default scoring function is used. If a callable, it must take two arguments as input: the data matrix X (np.ndarray of shape (n_samples, n_features)) and the columns to impute cols_to_impute (np.ndarray of shape (n_cols_to_impute,)), and return a score matrix of shape (n_cols_to_impute, n_features). Defaults to ‘default’.
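A custom scoring callable only has to respect the shapes described above. The following is a minimal sketch, not the library's default scoring: the function name abs_corr_scoring is hypothetical, and it assumes that a higher score marks a more useful feature, which the reference does not state.

```python
import numpy as np


def abs_corr_scoring(X, cols_to_impute):
    """Score features by absolute correlation with each column to impute.

    X has shape (n_samples, n_features); the result has shape
    (n_cols_to_impute, n_features). Assumption: higher means more useful.
    """
    n_features = X.shape[1]
    scores = np.zeros((len(cols_to_impute), n_features))
    observed = ~np.isnan(X)
    for i, col in enumerate(cols_to_impute):
        for j in range(n_features):
            if j == col:
                continue  # a column gets score 0 against itself
            both = observed[:, col] & observed[:, j]
            if both.sum() < 3:
                continue  # too few shared observations to correlate
            a, b = X[both, col], X[both, j]
            if a.std() == 0 or b.std() == 0:
                continue  # correlation is undefined for constant data
            scores[i, j] = abs(np.corrcoef(a, b)[0, 1])
    return scores


X = np.array([
    [1.0, 2.0, 5.0],
    [2.0, np.nan, 3.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, np.nan],
    [5.0, 10.0, 1.0],
])
scores = abs_corr_scoring(X, np.array([1, 2]))
print(scores.shape)
```

Such a function would be passed as MultivariateImputer(scoring=abs_corr_scoring).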
- imputation_features_
A dictionary mapping each imputed column to the features used to impute it. This attribute is only populated when n_nearest_features is not None. If the input is a pandas DataFrame, the keys and values will be column names. If the input is a NumPy array, they will be integer indices.
- Type:
dict or None
Examples
    import numpy as np
    from datafiller import MultivariateImputer

    # Create a matrix with missing values
    X = np.array([
        [1, 2, 3],
        [4, np.nan, 6],
        [7, 8, 9]
    ])

    # Create an imputer and fill the missing values
    imputer = MultivariateImputer()
    X_imputed = imputer(X)
    print(X_imputed)
- __call__(x: ndarray | DataFrame, rows_to_impute: None | int | Iterable[int] | Iterable[str] = None, cols_to_impute: None | int | Iterable[int] | Iterable[str] = None, n_nearest_features: None | float | int = None, normalize: bool = True) -> ndarray | DataFrame
Imputes missing values in the input data.
The method can handle both NumPy arrays and pandas DataFrames.
- Parameters:
x – The input data matrix with missing values (NaNs). Can be a numpy array or a pandas DataFrame.
rows_to_impute – The rows to impute. The interpretation of this argument depends on the type of x: if x is a NumPy array, this must be a list of integer indices; if x is a pandas DataFrame, this must be a list of index labels. If None, all rows are considered for imputation. Defaults to None.
cols_to_impute – The columns to impute. The interpretation of this argument depends on the type of x: if x is a NumPy array, this must be a list of integer indices; if x is a pandas DataFrame, this must be a list of column labels. If None, all columns are considered for imputation. Defaults to None.
n_nearest_features – The number of features to use for imputation. If it’s an int, it’s the absolute number of features. If it’s a float, it’s the fraction of features to use. If None, all features are used. Defaults to None.
normalize – Whether to normalize numeric columns before imputation, then transform imputed values back to the original scale. Defaults to True.
- Returns:
The imputed data matrix. The return type will match the input type (NumPy array or pandas DataFrame).
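The normalize option can be pictured as a standardize, impute, invert round trip. The sketch below is an illustration under assumptions, not the library's code: the helper names normalize and denormalize are hypothetical, per-column mean and standard deviation are assumed as the scaling, and a constant 0 stands in for the model's prediction in normalized space.

```python
import numpy as np


def normalize(X):
    """Standardize each column, ignoring NaNs."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero
    return (X - mean) / std, mean, std


def denormalize(Z, mean, std):
    """Map standardized values back to the original scale."""
    return Z * std + mean


X = np.array([
    [1.0, 100.0],
    [2.0, np.nan],
    [3.0, 300.0],
])
Z, mean, std = normalize(X)

# Stand-in for a model prediction made in normalized space.
Z_filled = np.where(np.isnan(Z), 0.0, Z)

X_filled = denormalize(Z_filled, mean, std)
print(X_filled)
```

Because the prediction is made on standardized data, columns with very different magnitudes contribute on a comparable scale, and the filled value still comes back in the column's original units.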
TimeSeriesImputer
- class datafiller.TimeSeriesImputer(lags: Iterable[int] = (1,), regressor: RegressorMixin | None = None, classifier: ClassifierMixin | None = None, min_samples_train: int | None = None, rng: int | None = None, verbose: int = 0, scoring: str | callable = 'default', interpolate_gaps_less_than: int | None = None)
Bases: BaseEstimator, TransformerMixin
Imputes missing values in time series data.
This class wraps the MultivariateImputer to specifically handle time series data in pandas DataFrames. It automatically creates lagged and lead features based on the time series index, then uses these new features to impute missing values.
- Parameters:
lags (Iterable[int], optional) – An iterable of integers specifying the lags and leads to create as autoregressive features. Positive integers create lags (e.g., t-1), and negative integers create leads (e.g., t+1). Defaults to (1,).
regressor (RegressorMixin, optional) – A scikit-learn compatible regressor used for numeric targets. Defaults to FastRidge.
classifier (ClassifierMixin, optional) – A scikit-learn compatible classifier used for categorical or string targets. Defaults to DecisionTreeClassifier(max_depth=4).
min_samples_train (int, optional) – The minimum number of samples required to train a model. Defaults to None, which means that a model will be trained if at least one sample is available.
rng (int, optional) – A seed for the random number generator. This is used for reproducible feature sampling when n_nearest_features is not None. Defaults to None.
verbose (int, optional) – The verbosity level. Defaults to 0.
scoring (str or callable, optional) – The scoring function to use for feature selection. If ‘default’, the default scoring function is used. If a callable, it must take two arguments (the data matrix and the columns to impute) and return a score matrix. Defaults to ‘default’.
interpolate_gaps_less_than (int, optional) – The maximum length of gaps to interpolate linearly. If None, no linear interpolation is performed. Defaults to None.
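The sign convention for lags can be illustrated with pandas shift. This is a sketch of the idea only: the helper add_lag_features and the column naming scheme are hypothetical, and the features the class builds internally may be named or constructed differently.

```python
import numpy as np
import pandas as pd


def add_lag_features(df, lags=(1,)):
    """Append shifted copies of every column.

    A positive lag k yields the value at t-k; a negative lag -k yields
    the value at t+k (a lead).
    """
    out = df.copy()
    for col in df.columns:
        for lag in lags:
            suffix = f"lag{lag}" if lag > 0 else f"lead{-lag}"
            out[f"{col}_{suffix}"] = df[col].shift(lag)
    return out


index = pd.date_range("2020-01-01", periods=5, freq="D")
df = pd.DataFrame({"value": [1.0, 2.0, np.nan, 4.0, 5.0]}, index=index)
features = add_lag_features(df, lags=(1, -1))
print(features)
```

For the row with the missing value, the lag column holds the previous day's observation and the lead column holds the next day's, which is what gives a model something to predict from.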
- imputation_features_
A dictionary mapping each imputed column to the features used to impute it. This attribute is only populated when n_nearest_features is not None. The keys and values are the column names, which will include the lagged/lead features created during the imputation process.
- Type:
dict or None
Examples
    import pandas as pd
    import numpy as np
    from datafiller import TimeSeriesImputer

    # Create a time series DataFrame with missing values
    rng = pd.date_range('2020-01-01', periods=10, freq='D')
    data = {'value': [1, 2, np.nan, 4, 5, 6, np.nan, 8, 9, 10]}
    df = pd.DataFrame(data, index=rng)

    # Create a time series imputer and fill missing values
    ts_imputer = TimeSeriesImputer(lags=[1, -1])
    df_imputed = ts_imputer(df)
    print(df_imputed)
- __call__(df: DataFrame, rows_to_impute: None | int | Iterable[int] = None, cols_to_impute: None | int | str | Iterable[int | str] = None, n_nearest_features: None | float | int = None, before: object | None = None, after: object | None = None) -> DataFrame
Imputes missing values in a time series DataFrame.
- Parameters:
df – The input DataFrame with a DatetimeIndex and missing values (NaNs). The index must have a defined frequency.
rows_to_impute – The rows to impute. Can be an iterable of integer indices, a pandas DatetimeIndex, or None. If None, all rows are considered. Defaults to None.
cols_to_impute – The indices or names of columns to impute. If None, all columns are considered. Defaults to None.
n_nearest_features – The number of features to use for imputation. If it’s an int, it’s the absolute number of features. If it’s a float, it’s the fraction of features to use. If None, all features are used. Defaults to None.
before – A timestamp-like object. If specified, only rows before this timestamp are imputed. Can be anything that can be parsed by lambda x: pd.to_datetime(str(x)). Defaults to None.
after – A timestamp-like object. If specified, only rows after this timestamp are imputed. Can be anything that can be parsed by lambda x: pd.to_datetime(str(x)). Defaults to None.
- Returns:
The imputed DataFrame with the same columns as the original.
- Raises:
TypeError – If the input is not a pandas DataFrame or if the index is not a DatetimeIndex.
ValueError – If the DataFrame’s index does not have a frequency.
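The frequency requirement and the before/after parsing can be checked with plain pandas before calling the imputer. The sketch below is illustrative: it assumes a daily series, uses asfreq to regularize an irregular index, and treats "before" as a strict comparison, which the reference does not spell out.

```python
import numpy as np
import pandas as pd

# An irregular index: 2020-01-03 is absent, so no frequency can be set.
index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"])
df = pd.DataFrame({"value": [1.0, np.nan, 4.0]}, index=index)
print(df.index.freq)  # None

# Reindex to a regular daily grid; the absent day becomes a NaN row.
df_regular = df.asfreq("D")
print(df_regular.index.freq)

# The same parsing rule the reference gives for before/after.
parse = lambda x: pd.to_datetime(str(x))

# Assumption: "before" selects rows strictly earlier than the timestamp.
mask_before = df_regular.index < parse("2020-01-03")
print(df_regular[mask_before])
```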
Models
FastRidge
- class datafiller.FastRidge(alpha: float = 0.01, fit_intercept: bool = True)
Bases: object
A simplified Ridge regressor.
This implementation is designed for speed and assumes that the input data is well-behaved (e.g., no NaNs, correct dtypes). It is not a full-featured scikit-learn estimator but provides the fit and predict methods needed for use within datafiller.
- Parameters:
alpha (float) – The regularization strength. Defaults to 0.01.
fit_intercept (bool) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations. Defaults to True.
- fit(X: ndarray, y: ndarray) -> FastRidge
Fits the Ridge regression model.
- Parameters:
X (np.ndarray) – The training data.
y (np.ndarray) – The target values.
- Returns:
The fitted regressor.
- Return type:
self
- predict(X: ndarray) -> ndarray
Makes predictions using the fitted model.
- Parameters:
X (np.ndarray) – The data to predict on.
- Returns:
The predicted values.
- Return type:
np.ndarray
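For readers unfamiliar with ridge regression, a closed-form version fits in a few lines of NumPy. This is a generic textbook sketch, not the FastRidge source: the class name RidgeSketch is hypothetical, and it assumes the intercept is handled by centering and left unpenalized.

```python
import numpy as np


class RidgeSketch:
    """Closed-form ridge regression: solve (X'X + alpha*I) w = X'y."""

    def __init__(self, alpha=0.01, fit_intercept=True):
        self.alpha = alpha
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        if self.fit_intercept:
            # Centering removes the intercept from the penalized problem.
            x_mean, y_mean = X.mean(axis=0), y.mean()
        else:
            x_mean, y_mean = np.zeros(X.shape[1]), 0.0
        Xc, yc = X - x_mean, y - y_mean
        gram = Xc.T @ Xc + self.alpha * np.eye(X.shape[1])
        self.coef_ = np.linalg.solve(gram, Xc.T @ yc)
        self.intercept_ = y_mean - x_mean @ self.coef_
        return self

    def predict(self, X):
        return X @ self.coef_ + self.intercept_


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0
model = RidgeSketch(alpha=1e-8).fit(X, y)
prediction = model.predict(np.array([[4.0]]))
print(model.coef_, model.intercept_, prediction)
```

A single linear solve per fit is what makes this kind of model cheap enough to refit for every column with missing values.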
ExtremeLearningMachine
- class datafiller.ExtremeLearningMachine(n_features: int = 100, alpha: float = 1.0, random_state: int = 0)
Bases: object
An Extreme Learning Machine (ELM) estimator.
This implementation uses a random projection, a ReLU activation, and a FastRidge regressor. It is designed for speed and assumes that the input data is well-behaved.
- Parameters:
n_features (int) – The number of features in the random projection.
alpha (float) – The regularization strength for the FastRidge regressor.
random_state (int) – A seed for the random number generator for reproducibility.
- fit(X: ndarray, y: ndarray) -> ExtremeLearningMachine
Fits the ELM model.
- Parameters:
X (np.ndarray) – The training data.
y (np.ndarray) – The target values.
- Returns:
The fitted estimator.
- Return type:
self
- get_params(deep: bool = True) -> dict
Get parameters for this estimator.
- Parameters:
deep (bool) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
Parameter names mapped to their values.
- Return type:
dict
- predict(X: ndarray) -> ndarray
Makes predictions using the fitted model.
- Parameters:
X (np.ndarray) – The data to predict on.
- Returns:
The predicted values.
- Return type:
np.ndarray
- set_params(**params) -> ExtremeLearningMachine
Set the parameters of this estimator.
- Parameters:
**params – Estimator parameters.
- Returns:
Estimator instance.
- Return type:
self
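The three ingredients named in the class description (random projection, ReLU, ridge) can be combined as follows. This is a sketch under assumptions rather than the library's implementation: the class name ELMSketch is hypothetical, the projection weights are drawn from a standard normal distribution, and the ridge step has no intercept.

```python
import numpy as np


class ELMSketch:
    """Random projection + ReLU, followed by a ridge fit on the hidden layer."""

    def __init__(self, n_features=100, alpha=1.0, random_state=0):
        self.n_features = n_features
        self.alpha = alpha
        self.random_state = random_state

    def _hidden(self, X):
        # Fixed random projection followed by a ReLU activation.
        return np.maximum(X @ self.W_ + self.b_, 0.0)

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        # Assumption: standard normal weights; they are never trained.
        self.W_ = rng.standard_normal((X.shape[1], self.n_features))
        self.b_ = rng.standard_normal(self.n_features)
        H = self._hidden(X)
        gram = H.T @ H + self.alpha * np.eye(self.n_features)
        self.beta_ = np.linalg.solve(gram, H.T @ y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta_


X = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y = X[:, 0] ** 2
pred = ELMSketch(n_features=20, alpha=1e-3, random_state=0).fit(X, y).predict(X)
mse = np.mean((pred - y) ** 2)
print(mse)
```

Only the output weights are learned, so fitting costs one linear solve while the ReLU layer still lets the model capture nonlinear relationships.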
Low-Level Functions
optimask
Finds optimal rectangular subsets of a matrix.
This module provides the optimask function, a low-level utility for finding an optimal rectangular subset of a matrix that contains the fewest missing values. This is used to select the best rows and columns for training an imputation model.
- datafiller._optimask.optimask(iy: ndarray, ix: ndarray, rows: ndarray, cols: ndarray, global_matrix_size: tuple[int, int]) -> tuple[ndarray, ndarray]
Finds the largest rectangular area of a matrix for training.
This is the main function of this module. It uses a Pareto-optimal sorting strategy to find the largest rectangle of non-NaN values, which can then be used to train a model for imputation.
- Parameters:
iy – Row indices of NaNs.
ix – Column indices of NaNs.
rows – The rows to consider for the mask.
cols – The columns to consider for the mask.
global_matrix_size – The shape of the original matrix (m, n).
- Returns:
A tuple containing the rows and columns to keep for training.
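To make the inputs and the goal concrete, the sketch below derives iy and ix from a small matrix and finds a largest NaN-free rectangle by exhaustive search. The brute-force helper is hypothetical and is not the Pareto-based algorithm used by optimask; it is exponential in the number of columns and only meant to show what "rows and columns to keep" means.

```python
from itertools import combinations

import numpy as np


def brute_force_largest_rectangle(X):
    """Try every column subset and keep the rows with no NaN in it."""
    nan_mask = np.isnan(X)
    n_cols = X.shape[1]
    best_rows = np.array([], dtype=int)
    best_cols = np.array([], dtype=int)
    best_area = 0
    for k in range(1, n_cols + 1):
        for subset in combinations(range(n_cols), k):
            cols = np.array(subset)
            rows = np.flatnonzero(~nan_mask[:, cols].any(axis=1))
            area = len(rows) * len(cols)
            if area > best_area:
                best_rows, best_cols, best_area = rows, cols, area
    return best_rows, best_cols


X = np.array([
    [1.0, 2.0, np.nan],
    [4.0, 5.0, 6.0],
    [7.0, np.nan, 9.0],
    [1.0, 1.0, 1.0],
])

# Row and column indices of the NaNs, the form optimask takes as iy and ix.
iy, ix = np.nonzero(np.isnan(X))

rows, cols = brute_force_largest_rectangle(X)
submatrix = X[np.ix_(rows, cols)]
print(rows, cols)
```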