Unit Tests with Pandas in Python
Using the built-in pandas functions for writing faster test cases.
Writing tests for end-to-end ML pipelines can be pretty useful in cases where:
certain assumptions are made about the data
certain assumptions are made about the result of a computation
Unit tests are a way to ensure that these assumptions still hold and that no side effects are introduced.
Pandas provides a few helpful functions that can make unit-testing easier. These can be found in the pandas.testing module.
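To make this concrete, here is a minimal sketch of what such a test might look like. The pipeline step add_total_column and the column names are hypothetical, made up purely for illustration; only assert_frame_equal comes from pandas.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def add_total_column(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline step: add a 'total' column without mutating the input."""
    out = df.copy()
    out["total"] = out["a"] + out["b"]
    return out


def test_add_total_column():
    raw = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    snapshot = raw.copy()  # kept to detect accidental in-place changes

    result = add_total_column(raw)

    expected = pd.DataFrame({"a": [1, 2], "b": [3, 4], "total": [4, 6]})
    # Assumption about the result of the computation
    assert_frame_equal(result, expected)
    # Assumption about side effects: the input frame is unchanged
    assert_frame_equal(raw, snapshot)


# A test runner such as pytest would normally discover and call this
test_add_total_column()
```

The second assertion is the one that guards against side effects: if the step modified its input in place, the comparison against the snapshot would fail.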
assert_frame_equal
As the name suggests, this function checks whether two DataFrames are equal. As we'll see, it accepts several arguments that control how strict the comparison is.
import pandas as pd
from pandas.testing import assert_frame_equal
# example df's
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
# testing for equality
assert_frame_equal(df1, df2)
# AssertionError
The call above raises an AssertionError because column b has a different dtype in each DataFrame: int64 in df1 and float64 in df2.
But in some cases this might be fine, and we only want to compare the values. For that, we can use the check_dtype parameter.
# let's not worry about the data types
assert_frame_equal(df1, df2, check_dtype=False)
# no-output, implying the assertion was correct
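Since a failed assertion stops the script, it can be handy to see both outcomes side by side. The sketch below wraps the call in a small helper, frames_match, which is a hypothetical name introduced here and not part of pandas.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right, **kwargs) -> bool:
    """Hypothetical helper: True if assert_frame_equal passes, False if it raises."""
    try:
        assert_frame_equal(left, right, **kwargs)
    except AssertionError:
        return False
    return True


df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Default comparison also checks dtypes, and column b differs
strict = frames_match(df1, df2)
# With check_dtype=False only the values are compared
relaxed = frames_match(df1, df2, check_dtype=False)

print(df1.dtypes["b"], df2.dtypes["b"])
print(strict, relaxed)
```

Inside a real test you would normally call assert_frame_equal directly, so that the failure message pandas produces shows up in the test report.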
What if we want equality to hold even if the values are not exactly the same, i.e., some degree of tolerance is acceptable? The atol (absolute tolerance) and rtol (relative tolerance) parameters are useful here.
import pandas as pd
from pandas.testing import assert_frame_equal
# example df's
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'b': [3.0000001, 4.0000001]})
# testing for equality
assert_frame_equal(df1, df2, check_dtype=False, check_exact = False, rtol=1e-2, atol=1e-2)
# no error, as it's within the desired tolerance
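It is also worth checking that the tolerance actually rejects values that are too far off. The sketch below uses made-up frames named expected, close and far: one differs by 0.001 and the other by 0.5, with the same tolerances as above.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"b": [3.0, 4.0]})
close = pd.DataFrame({"b": [3.001, 4.001]})  # each value is off by 0.001
far = pd.DataFrame({"b": [3.5, 4.5]})        # each value is off by 0.5

# Within tolerance: passes silently
assert_frame_equal(expected, close, check_exact=False, rtol=1e-2, atol=1e-2)

# Outside tolerance: the same call raises AssertionError
try:
    assert_frame_equal(expected, far, check_exact=False, rtol=1e-2, atol=1e-2)
    outside_tolerance_detected = False
except AssertionError:
    outside_tolerance_detected = True

print(outside_tolerance_detected)
```

Picking rtol and atol is a judgment call that depends on the scale of the data, so the values here are only placeholders.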
assert_series_equal
Similar to the dataframe counterpart, this function tests the equality of two series.
from pandas.testing import assert_series_equal
a = pd.Series([1, 2, 3, 4], index = ['a1', 'b1', 'c1', 'd1'])
b = pd.Series([1, 2, 3, 4], index = ['a2', 'b2', 'c2', 'd2'])
assert_series_equal(a, b)
# AssertionError
The above actually raises an AssertionError because, even though the values are the same, the indexes are different. The check_index parameter (new in v1.3) can help with this.
assert_series_equal(a, b, check_index = False)
# no output, assertion was correct
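If you are on a pandas version older than 1.3, where check_index is not available, one possible workaround is to drop both indexes before comparing. This is a sketch of that idea, not something the pandas docs prescribe.

```python
import pandas as pd
from pandas.testing import assert_series_equal

a = pd.Series([1, 2, 3, 4], index=["a1", "b1", "c1", "d1"])
b = pd.Series([1, 2, 3, 4], index=["a2", "b2", "c2", "d2"])

# reset_index(drop=True) replaces each index with a default 0..n-1 index,
# so only the values (and dtype) are left to compare
a_values_only = a.reset_index(drop=True)
b_values_only = b.reset_index(drop=True)

assert_series_equal(a_values_only, b_values_only)
```

Note that this compares values by position, so it only makes sense when the order of the two Series is expected to match.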
assert_index_equal
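This function asserts that two Index objects are equal. Its exact parameter controls whether the Index class, dtype and inferred type must be identical; the check_index_type and check_column_type parameters of assert_frame_equal are passed to this function as exact. With the default, exact='equiv', a RangeIndex can be substituted for an Index with an int64 dtype.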
from pandas.testing import assert_index_equal
a = pd.Index([1, 2, 3])
b = pd.Index([1, 2, 3])
c = pd.RangeIndex(start=1, stop=4, step=1)
assert_index_equal(a, b) # assertions are valid
assert_index_equal(a, c) # assertions are still valid
# indexes not exactly same, assertion fails
assert_index_equal(a, c, exact = True)
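To see the two modes of exact next to each other without the script stopping on the failing call, here is a small sketch. The helper indexes_match is a hypothetical name introduced for illustration.

```python
import pandas as pd
from pandas.testing import assert_index_equal


def indexes_match(left, right, **kwargs) -> bool:
    """Hypothetical helper: True if assert_index_equal passes, False if it raises."""
    try:
        assert_index_equal(left, right, **kwargs)
    except AssertionError:
        return False
    return True


a = pd.Index([1, 2, 3])
c = pd.RangeIndex(start=1, stop=4, step=1)

results = {
    # default mode: a RangeIndex may stand in for an int64 Index
    "equiv": indexes_match(a, c, exact="equiv"),
    # strict mode: the index classes must match as well
    "strict": indexes_match(a, c, exact=True),
}
print(results)
```

The values are identical in both cases; only the strict mode objects to the two indexes being of different classes.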
References
https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_frame_equal.html
https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_series_equal.html
https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_index_equal.html