Beyond Pass or Fail: Diagnosing Flaky Test Cases


Background

As development teams rely more and more on CI/CD tools and workflows to deploy their applications, testing has become an integral part of the SDLC. By running automated tests on platforms like CircleCI or GitHub Actions, we can ensure our code meets requirements, performs reliably, and remains maintainable in the long run.

However, as codebases grow in size and complexity with the addition of new features, so does the number of test cases we write to keep our core business logic covered and verified. With that growth, most development teams eventually run into the problem of flaky test cases.

In this blog, we will walk through the various causes of flakiness within test cases and how we can go about resolving them. For context, the examples use Python 3 and Django.

How do we define flaky test cases?

Flaky test cases are typically defined as test cases that can pass or fail without any changes to the underlying code. These tests flake because of some form of non-determinism in the code that leads to mismatched results or undefined behavior.

Facing these flaky issues within CI pipelines is a growing problem for development teams of all sizes: it hinders productivity, erodes trust in the CI tools, and causes frustration all around. Flakiness also tends to surface more often in CI environments than locally, partly because of differences in available resources and partly because of under-the-hood dependencies that can conflict in ways a local setup never exposes.

Luckily, modern CI tools are well suited to identifying flakiness within our test suites as part of their offerings. But while they can help surface flaky test cases, the responsibility still falls on development teams to go in and address those issues promptly to keep the codebase healthy.

Diagnosing the problem

Test cases can fail for a variety of reasons, as non-determinism can creep in from the most seemingly random or innocent places in the code. That said, after dealing with flaky test cases over a long enough period, a few recurring patterns emerge that help us identify root causes quickly and be more mindful about how we write our tests.

Let’s dive deeper into the various causes of flaky test cases and what we can do to remedy them.

1. Asynchronous Functions

When dealing with test cases where the target functionality executes asynchronously, there may be scenarios where the expected data is not ready by the time we run our assertions, leading to flakiness. This scenario is one of the more common reasons why test cases flake.

Let’s take the example of a function that we invoke via Celery.

# tasks.py
from celery import shared_task
from .models import YourModel

@shared_task
def internal_task(model_id):
    model_instance = YourModel.objects.get(pk=model_id)
    # Perform some operations on the model instance
    model_instance.some_field = 'modified'
    model_instance.save()

When writing test cases, we generally progress from unit tests → integration tests → end-to-end tests. While the core functionality of our sample function is verified by unit tests, the integration tests are often the flakier ones because of the asynchronous behavior we introduce into the overall flow.

In other words, if we were to invoke this function from a function-based view in a Django application, we would likely do something along the lines of:

from django.http import HttpResponse
from .tasks import internal_task
from .models import YourModel

def my_view(request):
    # Assume you have a model instance
    model_instance = YourModel.objects.get(pk=1)

    # Call the Celery task asynchronously
    internal_task.delay(model_instance.id)

    return HttpResponse("Task has been queued for execution.")

When writing integration tests for this view, we might encounter flakiness if the Celery task has not finished executing before our assertions run.

This is where Mocking comes in handy!

Mocking allows developers to replace parts of the code with mock objects, which simulate the behavior of real objects. This is particularly useful when testing asynchronous functions because it enables us to isolate the function being tested from its dependencies. By mocking external dependencies such as database queries or API calls, we can focus solely on testing the function's logic without worrying about the behavior of other components.

# test_views.py
from django.test import TestCase, Client
from django.urls import reverse
from .models import YourModel
from unittest.mock import patch

class MyViewTest(TestCase):
    @patch('yourapp.tasks.internal_task.delay')
    def test_my_view(self, mock_internal_task_delay):
        # Create a model instance
        model_instance = YourModel.objects.create(id=1)

        # Call the view function
        client = Client()
        response = client.get(reverse('my_view'))

        # Check if the response is successful
        self.assertEqual(response.status_code, 200)

        # Check if the Celery task is called with the correct arguments
        mock_internal_task_delay.assert_called_once_with(model_instance.id)

By mocking the task, we can still write meaningful integration tests for the view while relying on the individual unit tests to verify the task's functionality.
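
For completeness, a unit test for the task itself can exercise the core logic synchronously: calling a @shared_task-decorated function directly runs its body in-process, so no broker or worker is needed. Here is a minimal sketch, assuming some_field has a default or allows blank values:

# test_tasks.py
from django.test import TestCase
from .models import YourModel
from .tasks import internal_task

class InternalTaskUnitTest(TestCase):
    def test_internal_task_updates_field(self):
        instance = YourModel.objects.create(id=1)

        # Calling the task function directly executes it synchronously
        internal_task(instance.id)

        instance.refresh_from_db()
        self.assertEqual(instance.some_field, 'modified')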

💡
Mocking is also very useful when dealing with external APIs, as we cannot base our tests on the reliability of a third-party source. By mocking the external call, we can test how our application should behave without worrying about what the external source actually sends back!
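
As a quick illustration, suppose we had a helper that fetches data from a third-party API via requests. The helper, endpoint, and test below are hypothetical, but the pattern of patching requests.get so the assertion never touches the network is the same for any external call:

# utils.py -- hypothetical helper that calls an external API
import requests

def get_exchange_rate(currency):
    response = requests.get(f"https://api.example.com/rates/{currency}")
    response.raise_for_status()
    return response.json()["rate"]

# test_utils.py
from unittest import TestCase
from unittest.mock import patch
from yourapp.utils import get_exchange_rate

class ExchangeRateTest(TestCase):
    @patch('yourapp.utils.requests.get')
    def test_get_exchange_rate(self, mock_get):
        # Simulate a successful third-party response without any network call
        mock_get.return_value.json.return_value = {"rate": 82.5}

        self.assertEqual(get_exchange_rate("INR"), 82.5)
        mock_get.assert_called_once_with("https://api.example.com/rates/INR")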

2. Improper Management of State

This kind of flaky test case is more subtle, as it can easily be missed during development.

When writing new test cases, we often only run the particular suite we are working on. However, once we push our code into the CI pipeline, all of the test cases run together. Tests that happen to execute before the newly pushed test can unintentionally alter the state of the system, leading to flaky behavior.

When running the test locally, we might think it is robust, when in fact a different test that ran before it was responsible for altering the system state we made assumptions about.

This modification of state most commonly stems from improper management of resources and shared state within test suites.

For example, let us take a look at the following test suite.

from django.test import TransactionTestCase
from django.utils import timezone

class TestFlakyTimeZone(TransactionTestCase):

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Alter global timezone without proper teardown
        timezone.activate("Asia/Kolkata")

    def test_something_with_timezone(self):
        # Test logic that relies on the altered timezone
        assert timezone.get_current_timezone_name() == "Asia/Kolkata"  # Example assertion

Seemingly harmless, but on closer inspection we find that timezone.activate() changes the active timezone beyond this test class, and there is no teardown to undo it. This introduces inadvertent flakiness in any downstream test cases that make assumptions about the timezone, which then fail depending on the ordering of the tests.

💡
Shuffling the order of test execution is a quick way to surface hidden order dependencies and isolate issues like this.
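
For example, if the suite uses pytest, the pytest-randomly plugin shuffles test order on every run and prints the seed it used, so a failing order can be pinned down and replayed; Django 4.0+ also ships a --shuffle flag for its own test runner. The commands below are a sketch assuming those tools are installed:

# With pytest-randomly installed, test order is shuffled on every run
pytest src/tests/

# Replay a specific ordering by reusing the seed printed in that run's header
pytest src/tests/ --randomly-seed=123456

# Or, with Django's built-in test runner (Django 4.0+)
python manage.py test --shuffle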

What we can do here to mitigate such scenarios is to ensure proper cleanup of resources within our test suite. In Python's built-in unittest module, the tearDown() or tearDownClass() methods are the place to handle that cleanup.

from django.test import TransactionTestCase
from django.utils import timezone

class TestFlakyTimeZone(TransactionTestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Alter the active timezone for this test class
        timezone.activate("Asia/Kolkata")

    def test_something_with_timezone(self):
        # Test logic that relies on the altered timezone
        assert timezone.get_current_timezone_name() == "Asia/Kolkata"  # Example assertion

    @classmethod
    def tearDownClass(cls):
        # Restore the default timezone so downstream tests are unaffected
        timezone.deactivate()
        super().tearDownClass()

This is just one example of where cleanup matters. In other scenarios, open files must be closed, connected signals disconnected, or other altered state reset to its original value so that downstream tests can run smoothly.
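
For cleanup that belongs to a single test, unittest's addCleanup() is handy because it registers the teardown right next to the setup and runs even if the test fails midway. A minimal sketch, assuming a hypothetical post_save receiver we want connected only for this test:

from django.db.models.signals import post_save
from django.test import TestCase
from .models import YourModel

def audit_receiver(sender, instance, **kwargs):
    ...  # hypothetical receiver under test

class SignalCleanupTest(TestCase):
    def test_receiver_runs_on_save(self):
        # Connect the receiver for this test only...
        post_save.connect(audit_receiver, sender=YourModel)
        # ...and guarantee it gets disconnected afterwards, even on failure
        self.addCleanup(post_save.disconnect, audit_receiver, sender=YourModel)

        YourModel.objects.create(id=1)
        # Assertions about the receiver's side effects would go here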

3. Parallelism and Transactions

To expedite the execution of our test suites, it's common practice to employ various tools that facilitate the distribution of tests to run concurrently. For example, when utilizing pytest, we frequently leverage pytest-xdist to distribute tests across multiple CPUs, significantly improving execution times.

While parallel test execution offers notable performance benefits, it can introduce nuanced challenges, particularly when dealing with database transactions. This is especially evident in the case of pytest-xdist, where tests are parallelized at the function level. Let's illustrate this with an example from Django's TransactionTestCase.

Django's TestCase wraps each test method in a database transaction that is rolled back when the test finishes, while TransactionTestCase instead resets the database by truncating tables after each test. Either way, the database itself is restored between tests. A common issue arises, however, when individual test cases attempt to manipulate shared data concurrently.

The crux of the problem lies in the potential for conflicts arising from simultaneous modifications to shared data. Such conflicts can lead to test flakiness, where the outcome of a test becomes unpredictable due to interference from other concurrently executing tests.

For example:

# test_example.py
from django.test import TransactionTestCase
from myapp.models import ExampleObject

class ExampleTestCase(TransactionTestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.example_object = ExampleObject()

    def test_operation_1(self):
        self.example_object.perform_operation_1()
        assert self.example_object.get_result() == expected_result_1

    def test_operation_2(self):
        self.example_object.perform_operation_2()
        assert self.example_object.get_result() == expected_result_2

Let’s visualize how the two test cases interact with the shared object:

In this example, a conflict can arise because the ExampleObject we define in setUpClass acts as a shared resource for every test method in the class, even though each test method gets its own database cleanup. When pytest-xdist runs the two test methods in parallel, both may act on that shared resource at the same time and run into a database transaction error or inconsistent state.

One approach to solving this would be to use the setUp() method instead.

# test_example.py
from django.test import TransactionTestCase
from myapp.models import ExampleObject

class ExampleTestCase(TransactionTestCase):
    def setUp(self):
        super().setUp()
        self.example_object = ExampleObject()

    def test_operation_1(self):
        self.example_object.perform_operation_1()
        assert self.example_object.get_result() == expected_result_1

    def test_operation_2(self):
        self.example_object.perform_operation_2()
        assert self.example_object.get_result() == expected_result_2

Let’s visualize how this change affects our test case!

It is important to note that this solution is not a silver bullet by any means, and there are a few things to keep in mind when solving concurrency-related issues for flaky test cases!

  1. If we use setUp(), we trade off execution time and memory, since a fresh ExampleObject is instantiated for every test method.

  2. There can be nuances in the code under test that require handling concurrency differently.

To mitigate such issues, it's essential to design test cases that isolate their data dependencies and avoid modifying shared resources whenever possible. This ensures test reliability and repeatability across different execution environments.

4. Improper Usage of Assertions

This category of flaky test case is more of a development miss than an environmental dependency or a systemic quirk of execution. These mistakes can slip in anywhere and are a classic “right-under-your-nose” error.

This is the missing semicolon equivalent for test cases after all!

When using assertions, it is important to remember the data types we are dealing with and use appropriate assert statements for varying kinds of expected values.

The most common example here is the use of assertListEqual when comparing two lists. If the order of elements in actual_value is not guaranteed, we can run into flaky scenarios where the elements do not line up with expected_value.

import unittest
from my_app.utils import foo

class TestListComparison(unittest.TestCase):
    def test_assert_list_equal(self):
        expected_value = [1, 2, 3]
        actual_value = foo()  # Returns [1, 2, 3] in no particular order
        self.assertListEqual(expected_value, actual_value)

In the above scenario, our test case will be flaky because assertListEqual compares elements in order. Since the function under test, foo(), returns the same elements in no particular order, assertCountEqual is the better fit: despite its name, it asserts that both sequences contain the same elements with the same counts, regardless of ordering.

import unittest
from my_app.utils import foo

class TestListComparison(unittest.TestCase):
    def test_assert_count_equal(self):
        expected_value = [1, 2, 3]
        actual_value = foo()  # Returns [1, 2, 3] in no particular order
        self.assertCountEqual(expected_value, actual_value)

5. Date-time Flakiness

DateTime flakiness in testing can occur for several reasons: fluctuations in system time, environment differences (such as the timezone of a CI runner), and asynchronous operations are the usual culprits behind the unpredictability.

Let’s take a look at a scenario where a test case depends on the current wall-clock time:

import datetime
from unittest import TestCase

def is_within_business_hours(current_time):
    """Checks if the current time falls within business hours (9am to 5pm)."""
    start_of_business = datetime.time(9, 0)
    end_of_business = datetime.time(17, 0)
    return start_of_business <= current_time <= end_of_business

class DatetimeFlakinessTestCase(TestCase):
    def test_business_hours(self):
        """Tests if the current time is within business hours."""
        current_time = datetime.datetime.now().time()
        assert is_within_business_hours(current_time)

        # This test only passes when it happens to run between 9:00 and 17:00,
        # because datetime.now() captures whatever moment the test executes at.

This test relies on datetime.now() to get the current time, so its outcome depends on when it runs: it passes during business hours and fails whenever the suite happens to run outside the 9:00–17:00 window, for example in a nightly CI build or on a runner in a different timezone.

This is a flaky test because its outcome depends on the exact moment it runs, not the functionality it tries to verify.

Although the above example is trivial and admittedly not how one should write such a test, the underlying pattern of flakiness is easy to see.

To address this, we can go back to the idea of mocking described earlier and patch datetime.datetime.now() using the unittest.mock module.

The updated test case would look something like this:

import datetime
from unittest import TestCase, mock

def is_within_business_hours(current_time):
    """Checks if the current time falls within business hours (9am to 5pm)."""
    start_of_business = datetime.time(9, 0)
    end_of_business = datetime.time(17, 0)
    return start_of_business <= current_time <= end_of_business

# Capture a real datetime before patching, so the mocked now() returns a genuine value
FIXED_NOW = datetime.datetime(2024, 4, 1, 10, 45)

class DatetimeFlakinessTestCase(TestCase):
    @mock.patch('datetime.datetime')
    def test_business_hours(self, mock_datetime):
        """Tests if the current time is within business hours (using a mocked clock)."""
        # Pin "now" to a fixed time within business hours
        mock_datetime.now.return_value = FIXED_NOW
        current_time = datetime.datetime.now().time()  # Now uses the mocked time
        assert is_within_business_hours(current_time)

In this modified test case, we mock datetime.datetime.now() so that it returns a fixed timestamp. By controlling the clock, we eliminate DateTime flakiness and ensure consistent test outcomes regardless of when or where the suite runs.

What actions can we take to tackle this?

Since flaky test cases can occur for seemingly any reason, there is no single cheat sheet that covers every root cause. We can be more proactive and mindful about writing resilient test cases and reviewing them thoroughly, but sometimes a flaky test requires diving into the nuances of our application’s core logic.

Some starting approaches we can take to tackle flakiness are:

Using tools specifically designed to address flaky test cases

It can be counterproductive to sit and rerun test case after test case waiting for something to happen. Thankfully, there are tools available that allow us to execute test cases locally in bulk so we can identify flaky test cases early and often.

For pytest, you can use pytest-flakefinder. This plugin multiplies your tests so they run many times in one session, without having to restart pytest.

A good rule of thumb when using these tools is to run them with different combinations of parameters to see which conditions cause breakage. For example, combining pytest-xdist distributed testing with pytest-flakefinder:

pytest src/tests/test.py --flake-finder -n4 --flake-runs=n --reuse-db
pytest src/tests/test.py --flake-finder -n0 --flake-runs=n --reuse-db
pytest src/tests/test.py --flake-finder --flake-runs=n --reuse-db
pytest src/tests/test.py -n4 --reuse-db
pytest src/tests/test.py --reuse-db

where n in --flake-runs=n is the number of times each test is repeated (values such as 1, 2, 5, 10, 50, or 500 work well), -n4 distributes the tests across four pytest-xdist workers, and -n0 disables distribution entirely.

This approach to testing can ensure that some of the more obvious causes of flakiness are identified early and should be adopted as a practice when raising PRs.

Reproducing CI executions locally

When you use a CI tool for automated tests, chances are you distribute test cases across multiple containers to speed up execution times.

As mentioned above, a test may behave perfectly in isolation, yet flakiness can appear when it inadvertently picks up a "test-on-test dependency" from whatever ran before it.

For larger codebases, there can be so many tests that identifying the “problem child” becomes tedious. In such cases, a good approach is to inspect which test cases ran before your alleged flaky test on the CI tool and replay the same execution parameters locally.
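
For example, if the CI logs list the node IDs that ran on the failing container, those can be replayed locally in the same order (pytest generally executes node IDs in the order they are passed; the test paths below are placeholders):

# Replay the suspected ordering from CI on a single worker
pytest "tests/test_signals.py::SignalCleanupTest::test_receiver_runs_on_save" \
       "tests/test_views.py::MyViewTest::test_my_view" \
       --reuse-db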

Re-runs, Re-runs, and Re-runs!

Believe it or not, sometimes a good old-fashioned re-run is all it takes. Having a retry mechanism for your test cases can reduce failing builds and improve productivity, and various tools such as pytest-rerunfailures can be used for this.

While this approach can paper over frequently failing tests, we should limit the number of retries so that flaky tests do not quietly live on in the system. The right balance varies, but a cap of no more than 3 retries has worked well for us as a starting point.
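
With pytest-rerunfailures, for instance, the retry cap can be set globally from the command line:

# Retry each failing test up to 3 times, pausing 1 second between attempts
pytest src/tests/ --reruns 3 --reruns-delay 1

The plugin also supports a per-test @pytest.mark.flaky(reruns=3) marker if only specific tests warrant retries.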

Conclusion

Flaky test cases are elusive. More often than not they can slip through the cracks of development and reviews, causing undue frustrations for everyone else down the line. As developers, we should adopt a proactive approach to resolving flakiness in test cases.

Sometimes it may seem easier to quarantine or skip the flakes as we find them, but letting them pile up in a bucket that no one revisits leads to an unstable platform whose core issues never get addressed.

It is important to note that there is never a one-size-fits-all solution, and more often than not the real root causes are nuanced and hint at deeper issues within the system. However, by being mindful of these caveats while writing tests, and by being proactive and prompt in resolving flaky test cases as they creep into our CI builds, we can ensure the reliability of our systems and restore trust in our tooling.