Problem
The scikit-learn Python library has several classes for imputing (predicting) missing values in arrays.
I have a Python program, written a little while ago, that uses the Imputer class in the sklearn.preprocessing package. I set the axis=1 parameter to force row-wise imputation instead of the default column-wise imputation.
For example, I wanted an array like this (nan = missing value) …
[[ 10. nan 20. 15.]
[ 200. 200. 200. nan]
[ nan nan 5000. 6000.]]
… to have its missing values predicted with the row-wise mean. The expected outcome is:
[[ 10. 15. 20. 15.]
[ 200. 200. 200. 200.]
[5500. 5500. 5000. 6000.]]
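As a quick sanity check, the expected values can be reproduced with plain NumPy. This is only a sketch of what "row-wise mean" means here, not how scikit-learn does it internally; the variable names are my own.

```python
import numpy as np

X = np.array([[10, np.nan, 20, 15],
              [200, 200, 200, np.nan],
              [np.nan, np.nan, 5000, 6000]])

# Mean of each row, ignoring missing values
row_means = np.nanmean(X, axis=1)

# Keep observed values; fill each nan with the mean of its own row
expected = np.where(np.isnan(X), row_means[:, np.newaxis], X)
print(expected)
```

For the first row, the mean of 10, 20 and 15 is 15, which is the value that fills the gap.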
Here’s the code, using sklearn.preprocessing.Imputer:
import numpy as np
from sklearn.preprocessing import Imputer
# Create simple test array
X = np.asarray([[10, np.nan, 20, 15],
                [200, 200, 200, np.nan],
                [np.nan, np.nan, 5000, 6000]])
# Create imputer object, replacing 'nan' with feature means by row
mean_imputer = Imputer(missing_values=np.nan, strategy='mean', axis=1)
# Train and apply imputer
imputed_X = mean_imputer.fit_transform(X)
Unfortunately, the Imputer class is now deprecated. The code above throws a warning:
DeprecationWarning: Class Imputer is deprecated;
Imputer was deprecated in version 0.20 and will be
removed in 0.22. Import impute.SimpleImputer
from sklearn instead.
Double-unfortunately, impute.SimpleImputer does not include an axis parameter, so I can no longer request a row-wise imputation.
scikit-learn’s GitHub Issue “Remove SimpleImputer’s axis parameter” https://github.com/scikit-learn/scikit-learn/issues/10636 suggests:
Future (and default) behavior is equivalent to axis=0 (impute along columns). Row-wise imputation can be performed with FunctionTransformer (e.g., FunctionTransformer(lambda X: Imputer().fit_transform(X.T).T)).
Eh.
Solution
Why not replace preprocessing.Imputer with impute.SimpleImputer as suggested, then directly transpose/untranspose the array while applying the imputer? Works for me.
import numpy as np
from sklearn.impute import SimpleImputer
# Create simple test array
X = np.asarray([[10, np.nan, 20, 15],
                [200, 200, 200, np.nan],
                [np.nan, np.nan, 5000, 6000]])
# Create imputer object, replacing 'nan' with feature means
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Train and apply imputer, transposing so the imputation runs row-wise
imputed_X = mean_imputer.fit_transform(X.T).T
# Review original data
print(f'Original:\n{X}')
# View imputed data
print(f'\nImputed:\n{imputed_X}')
Output:
Original:
[[ 10. nan 20. 15.]
[ 200. 200. 200. nan]
[ nan nan 5000. 6000.]]
Imputed:
[[ 10. 15. 20. 15.]
[ 200. 200. 200. 200.]
[5500. 5500. 5000. 6000.]]
Done!
Reference
scikit-learn’s Imputation of Missing Values with impute.SimpleImputer
scikit-learn’s Imputation of Missing Values using preprocessing.Imputer
Could you provide a detailed explanation of how to apply FunctionTransformer() to impute row-wise?
I have been trying to do it for 2 days and I am clueless.
PS: I am a beginner in ML and would be grateful for any help. Thanks in advance.
Hi Ashish,
I do not know what issues you are facing specifically. Scikit-Learn.org is a good place to start; it includes several examples: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
I hope you’ve found your solution.
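For anyone else stuck on the same question, here is a minimal sketch of the FunctionTransformer route, assuming a scikit-learn version that provides impute.SimpleImputer. The helper name impute_rows_with_mean is my own invention, and validate=False is passed explicitly so that input containing nan is handed to the function unchecked.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer


def impute_rows_with_mean(X):
    """Fill each nan with the mean of its own row (hypothetical helper)."""
    # SimpleImputer works column-wise, so transpose, impute, transpose back
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    return imputer.fit_transform(X.T).T


# Wrap the helper so it exposes the usual fit/transform interface
row_imputer = FunctionTransformer(impute_rows_with_mean, validate=False)

X = np.asarray([[10, np.nan, 20, 15],
                [200, 200, 200, np.nan],
                [np.nan, np.nan, 5000, 6000]])

imputed_X = row_imputer.fit_transform(X)
print(imputed_X)
```

One caveat I am aware of: with the mean strategy, SimpleImputer by default discards columns that are entirely missing, so after transposing, a row with no observed values at all would be dropped and the output shape would change.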
This works fine if fitting and transforming directly on the imputer, but not for imputing within a Pipeline
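In case it helps, one way to get row-wise imputation inside a Pipeline is to wrap the transpose trick in a FunctionTransformer step. This is a minimal sketch, not a tested recipe for every setup: the helper and step names are hypothetical, and StandardScaler is only a stand-in for whatever step comes next.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def impute_rows_with_mean(X):
    """Fill each nan with the mean of its own row (hypothetical helper)."""
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    return imputer.fit_transform(X.T).T


pipeline = Pipeline([
    # Row-wise imputation learns nothing at fit time, so a stateless
    # FunctionTransformer is enough for this step
    ('row_imputer', FunctionTransformer(impute_rows_with_mean, validate=False)),
    # Placeholder for any downstream step that cannot handle nan
    ('scaler', StandardScaler()),
])

X = np.asarray([[10, np.nan, 20, 15],
                [200, 200, 200, np.nan],
                [np.nan, np.nan, 5000, 6000]])

result = pipeline.fit_transform(X)
print(result)
```

Because each row is filled from its own values, the same function is applied unchanged to training data and to new data passed to transform or predict.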