Unit Tests with Pandas in Python

Using the built-in pandas functions to write test cases faster.

Jan 08, 2022

Source: https://dribbble.com/shots/10594099-Test-tube-CSS

Writing tests for end-to-end ML pipelines can be pretty useful in cases where:

certain assumptions are made about the data
certain assumptions are made about the result of a computation

Unit tests are a way to ensure that these assumptions still hold and that no unintended side effects are introduced.

Pandas provides a few helpful functions that can make unit-testing easier. These can be found in the pandas.testing module.
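As a minimal sketch of how such a check might look inside a test, assume a hypothetical pipeline step called add_total that adds a total column. The function name, column names, and values are illustrative assumptions, not part of any real pipeline. The test is a plain function that a runner such as pytest could collect.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline step: add a 'total' column without mutating the input."""
    out = df.copy()
    out["total"] = out["price"] * out["qty"]
    return out


def test_add_total():
    raw = pd.DataFrame({"price": [2.0, 3.0], "qty": [1, 4]})
    snapshot = raw.copy()  # keep a copy to detect side effects

    result = add_total(raw)

    expected = pd.DataFrame(
        {"price": [2.0, 3.0], "qty": [1, 4], "total": [2.0, 12.0]}
    )
    # assumption about the result of the computation
    assert_frame_equal(result, expected)
    # assumption about side effects: the input frame is unchanged
    assert_frame_equal(raw, snapshot)


test_add_total()
```

The second assertion is the part that guards against side effects: if add_total modified its input in place, the comparison against the snapshot would fail.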

assert_frame_equal

As the name suggests, this function checks whether two DataFrames are equal. As we’ll see, it takes several arguments that control how strict the comparison is.

import pandas as pd
from pandas.testing import assert_frame_equal

# example df's
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

# testing for equality
assert_frame_equal(df1, df2)
# AssertionError

The above raises an AssertionError because the dtype of column b differs between the two DataFrames: int64 in df1 and float64 in df2.

But in some cases, this might be fine, and we only want to compare the values. We can use the check_dtype parameter.

# let's not worry about the data types
assert_frame_equal(df1, df2, check_dtype=False)
# no output, meaning the assertion passed
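Column order is another thing that may or may not matter for a given test. The sketch below uses check_like, a parameter of assert_frame_equal that the examples above do not cover, to ignore the order of the index and columns. The frames are made up for illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = df1[["b", "a"]]  # same data, columns in a different order

# by default, column order is part of the comparison
try:
    assert_frame_equal(df1, df2)
    order_matters = False
except AssertionError:
    order_matters = True

# check_like=True ignores the order of index and columns
assert_frame_equal(df1, df2, check_like=True)
```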

What if we want equality to hold even if the values are not exactly the same, i.e., some degree of tolerance is acceptable?
The atol and rtol parameters (absolute and relative tolerance) are useful here. They apply when check_exact is False.

import pandas as pd
from pandas.testing import assert_frame_equal

# example df's
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'b': [3.0000001, 4.0000001]})

# testing for equality
assert_frame_equal(df1, df2, check_dtype=False, check_exact=False, rtol=1e-2, atol=1e-2)
# no error, as the difference is within the desired tolerance
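To see the tolerance actually deciding the outcome, here is a small sketch with a deliberately larger gap. The helper frames_match is a hypothetical wrapper written for this example, and the relative tolerance is set to zero so that only atol is in play.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"b": [3.0, 4.0]})
actual = pd.DataFrame({"b": [3.1, 4.0]})  # first value is off by 0.1


def frames_match(atol: float) -> bool:
    """Hypothetical helper: True if the frames are equal within atol."""
    try:
        assert_frame_equal(actual, expected, check_exact=False, rtol=0.0, atol=atol)
        return True
    except AssertionError:
        return False


loose = frames_match(atol=0.2)   # 0.1 is within 0.2
tight = frames_match(atol=1e-2)  # 0.1 exceeds 0.01
```

With these values, the loose check passes and the tight one fails, so the tolerance should be chosen to reflect how much numerical drift the pipeline is allowed to have.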

assert_series_equal

Similar to the dataframe counterpart, this function tests the equality of two series.

import pandas as pd
from pandas.testing import assert_series_equal

# same values, different index labels
a = pd.Series([1, 2, 3, 4], index=['a1', 'b1', 'c1', 'd1'])
b = pd.Series([1, 2, 3, 4], index=['a2', 'b2', 'c2', 'd2'])
assert_series_equal(a, b)
# AssertionError

The above raises an AssertionError because, even though the values are the same, the indexes are different.
The check_index parameter (new in v1.3) can help with this.

assert_series_equal(a, b, check_index=False)
# no output, meaning the assertion passed
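A related case comes up when comparing a DataFrame column against a hand-written expected Series: the column carries its name, the expected Series usually does not. The sketch below assumes that only the values matter and uses the check_names parameter of assert_series_equal, which is not discussed above, to skip the name comparison.

```python
import pandas as pd
from pandas.testing import assert_series_equal

df = pd.DataFrame({"a": [1, 2, 3]})
doubled = df["a"] * 2             # the result keeps the name "a"
expected = pd.Series([2, 4, 6])   # unnamed

# by default, the Series names are compared as well
try:
    assert_series_equal(doubled, expected)
    names_checked = False
except AssertionError:
    names_checked = True

# ignore the name and compare values, dtype, and index only
assert_series_equal(doubled, expected, check_names=False)
```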

assert_index_equal

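This function checks whether two Index objects are equal. Its exact parameter controls whether the Index class, dtype, and inferred type must match as well; assert_frame_equal forwards its check_column_type argument to this parameter when it compares column labels.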

import pandas as pd
from pandas.testing import assert_index_equal

a = pd.Index([1, 2, 3])
b = pd.Index([1, 2, 3])
c = pd.RangeIndex(start=1, stop=4, step=1)

assert_index_equal(a, b)  # passes: same class and values
# passes: with the default exact='equiv', a RangeIndex is accepted
# in place of an integer Index
assert_index_equal(a, c)

# the Index classes differ, so the strict check fails
assert_index_equal(a, c, exact=True)
# AssertionError
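The same distinction shows up one level higher, in assert_frame_equal, through check_column_type. The sketch below uses two made-up frames whose column labels are equal but stored in different Index classes.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame([[1, 2], [3, 4]])  # columns default to a RangeIndex
df2 = df1.copy()
df2.columns = pd.Index([0, 1])        # same labels, plain integer Index

# the default check_column_type='equiv' treats the two as interchangeable
assert_frame_equal(df1, df2)

# check_column_type=True requires the column Index classes to match
try:
    assert_frame_equal(df1, df2, check_column_type=True)
    strict_passed = True
except AssertionError:
    strict_passed = False
```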
