The Most Difficult Bug I Have Encountered

I was asked, recently, to recount the most difficult bug that I have ever encountered. I have worked on distributed systems for some time, including a multi-threaded, distributed system written in C++, which is used in diverse and demanding industrial environments. So, needless to say, I’ve encountered more than a few difficult and memorable bugs. These bugs regularly take days or weeks to reproduce, understand, and, ultimately, resolve.

Finding these bugs often involves a good intuition for the system as a whole, as much as anything else. Discovering a sequence of events that reproduces the bug is essential. Even then, reproducing the bug is often probabilistic in nature. Finding the bug and having confidence the bug is fixed is often also probabilistic. For example, if the bug was originally encountered once in a million trials, on average, then the bug might be declared fixed if the modified code has run 100 million trials without encountering the bug. Of course, without a full understanding, the bug may not actually be fixed, it may just be that the modified code has changed the dynamics to make encountering the bug even less likely.

Once a bug can be reproduced, even probabilistically, then you can have some hope of tracking it down. Even if you don’t know where to begin, just start changing some parameters of the system and see how it impacts the frequency of occurrence. For example, artificially modifying the dynamics by adding a sleep or a yield, or omitting parts of the code in order to bisect the problem, can lead to insights if it happens to increase or decrease the frequency of occurrence. This can easily result in a lot of dead ends, so persistence is of utmost importance.

This essay is a description of probably the most difficult bug that I have ever encountered, at least to date. The example has been significantly modified and greatly simplified in order to highlight just the core of the issue. The story begins by fixing an easily reproduced and well-understood bug.

The Scenario

The multi-threaded, distributed system that I was working on naturally had a lot of code to coordinate operations among various services. The system had a mechanism to subscribe to record-level changes in a database table. When a record was updated, other components in the system that were subscribed for changes to that record would be informed of the change. This mechanism had been in place for over a decade and was used extensively by every customer. Starting or stopping a subscription, however, was a very complicated distributed transaction.

In order to construct a similar scenario to use as an example, consider a database table for a stream processing application where each record represents an offset into a transaction log. As an individual transaction log is processed, the corresponding offset in the database is updated. Dependent systems can subscribe for changes to the offset for an individual transaction log. When the offset changes, the dependent subsystem will be informed of the new offset. Periodically, the current offset is also persisted to disk so the application can pick up where it left off in the event of failures.

The following Record class has an integer value that represent the current offset for a single transaction log. The value can be updated by any number of threads, but assume that modifying the value is thread-safe and synchronized at a higher level. The class also has an integer used as a bit mask to 1) indicate when the record has changed and needs to be persisted to disk (Flags::IsDirty) and 2) to indicate if another system has subscribed for changes to the record (Flags::Subscribe).

enum Flags : int { Dirty = 0x01, Subscribe = 0x10 };

class Record {
public:
  Record()
    : _value(0)
    , _flags(0)
  {}

  Record::Record(Record&& other)
    : _value(other._value)
    , _flags(other._flags)
  {}

  Record& Record::operator=(Record&& other) {
    _value = other._value;
    _flags = other._flags;
    return *this;
  }

  // Prevent copying
  Record& Record::operator=(const Record&) = delete;
  Record::Record(const Record&) = delete;

  void SetDirty() { _flags |= Flags::Dirty; }
  void SetClean() { _flags &= ~Flags::Dirty; }

  void Subscribe() { _flags |= Flags::Subscribe; }
  void Unsubscribe() { _flags &= ~Flags::Subscribe; }

  void SetValue(const int& value) {
    _value = value;
    SetDirty();
  }

  int GetValue() const { return _value; }

  bool IsSubscribed() const { return (_flags & Flags::Subscribe) != 0; }
  bool IsDirty() const { return (_flags & Flags::Dirty) != 0; }

private:
  int _value;
  int _flags;
};

The following Database class defines operations on a set of records. To simplify the locking for this example, the number of records in the database is fixed at construction and does not change. There are methods to Subscribe or Unsubscribe to changes for multiple database records. There is an Update method to update the current value for a set of records. When the current value of a record is updated and the Subscribe bit is set, the current value for the record is pushed onto a concurrent queue in order to notify subscribed services (which have not been represented here). The is also a Persist method to persist all of the dirty records to disk, which involves persisting the current value (left to the imagination of the reader) and clearing the Dirty bit.

class Database {
public:
  Database(std::vector<Record>&& records)
    : _records(std::move(records))
    , _updates()
  {}

  Database() = delete;

  void Subscribe(const std::vector<int>& index) {
    for (const auto i : index) {
      _records[i].Subscribe();
    }
  }

  void Unsubscribe(const std::vector<int>& index) {
    for (const auto i : index) {
      _records[i].Unsubscribe();
    }
  }

  void Update(const std::vector<std::pair<int, int>> updates) {
    for (const auto& i : updates) {
      auto& record = _records[i.first];
      record.SetValue(i.second);
      if (record.IsSubscribed()) {
        _updates.push(i);
      }
    }
  }

  void Persist() {
    for (auto& record : _records) {
      auto value = record.GetValue();
      // Persist the current value to disk.
      // Left to the imagination of the reader.
      record.SetClean();
    }
  }

  bool IsSubscribed(const int index) const {
    return _records[index].IsSubscribed();
  }

  bool IsDirty(const int index) const { return _records[index].IsDirty(); }

private:
  std::vector<Record> _records;
  concurrency::concurrent_queue<std::pair<int, int>> _updates;
};

Find a Bug

Historically, the number of record-level subscriptions added or removed at any one time was small; on the order of tens or hundreds. We began a development effort to scale the system by approximately an order of magnitude for the next release. This meant that the total number of records in the database, the rate at which records were updated, and the number of record-level subscriptions, would all need to scale by an order of magnitude. We were concerned that increasing the number of subscriptions would negatively impact the rate at which records could be updated and degrade a primary function of the system.

I developed a test to understand the scalability of the subscription mechanism. I constructed a table with one million records. While all one million records were being updated at a significant, benchmark rate, I would add subscriptions for 100,000 records and then confirm that each subscription received the expected updates. I noticed that every time I ran this test, a number of subscriptions were missing. On average, 100 out of the 100,000 subscriptions.

It took me some time to track this bug down, but, eventually, a pattern emerged. If the value of a record was updated during the distributed transaction to add the record-level subscription, then the subscription for that record would be missing once the distributed transaction completed. The likelihood of this occurring was a function of how frequently the value updated. My key observation was that if none of the records were updated, then all of the subscriptions would be established successfully.

The bug was the result of a feedback mechanism to remove subscriptions that were not required as updates for those subscriptions were processed by dependent services. Given the complexity of the distributed transaction to add or remove subscriptions, it was possible that a subscription would be removed while it was also being added, leading to this bug. This was a very latent bug since, by all appearances, the subscriptions had been established successfully, but the subscribing system would never be notified of changes for a subset of records.

Fix a Bug and Find a Bug

This bug was resolved by disabling the feedback mechanism to remove unnecessary subscriptions if a distributed transaction to add new subscriptions was in progress. The solution was relatively simple. I ran my test to confirm that I had fixed the bug. The test no longer failed every single time. Instead, it failed approximately one time out of every 60 trials. When it failed, approximately one percent of the subscriptions were missing. There was another bug.

It turned out there was an additional, periodic mechanism to remove unnecessary subscriptions every few minutes. A similar issue existed with this mechanism, but, of course, the bug only manifested when adding subscriptions happened to coincide with the periodicity. With insight from the first bug, I disabled the periodic removal of subscriptions and noticed that the bug went away. Fixing this bug was also relatively straightforward. Similar to the first bug, if a distributed transaction to add subscriptions was in progress, simply skip the periodic removal of subscriptions.

Fix a Bug and Find Another Bug

Reasonably confident that I had tracked down all of the bugs related to this distributed transaction at scale, I let my test run over the weekend. But, sure enough, when I looked at it on Monday, it had failed. The pattern of failure was again different. Only one subscription out of the 100,000 added was missing, seemingly at random, and this only occurred once in approximately 150 trials, meaning 149 times out of 150, all 100,000 subscriptions were established correctly.

Finding this bug took weeks and a lot of persistence. In the end, I think finding it was the result of a good intuition for the dynamics of the system as much as anything else. Similar to the first bug, I noticed that if I stopped updating records while adding or removing subscriptions, this issue did not present itself. The key to finding the bug was examining the variables that were mutated concurrently, which were more than I have presented in this simplified example.

There was an assumption by the original authours of the code that the _flags member variable of the Record class could be mutated in the SetDirty, SetClean, Subscribe, and Unsubscribe methods without additional thread synchronization, since it was an integer and writing to an integer is an atomic operation. However, if one thread modifies the Subscribe bit while at the same time another thread modifies the Dirty bit, one of these changes will be lost. The integer is written atomically, but only within the context of one thread and only relative to the last read of that variable. Basically, the last thread to modify the integer wins and the modifications from the other threads will be lost.

With an understanding of the issue, it was actually much easier to reproduce and it did not, like the first two bugs, require adding or removing a large number of subscriptions to increase the likelihood of occurrence. In fact, it only required one record in the database table. The following sample code will readily reproduce the bug in a matter of milliseconds. One task continuously subscribes and unsubscribes for a record while another concurrent task updates and persists the same record. After each operation, the test verifies that the state of the bit mask is correct. Since there are only two concurrent tasks, this can be done deterministically.

#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <vector>

using namespace std;
using namespace chrono;

int main() {
  Database database(vector<Record>(1));
  atomic<bool> running(true);

  // Subscribe task
  auto t1 = async([&]() {
    auto indexes = {0};

    while (running) {
      database.Subscribe(indexes);
      if (!database.IsSubscribed(0)) {
        throw exception("Subscription missing.");
      }

      database.Unsubscribe(indexes);
      if (database.IsSubscribed(0)) {
        throw exception("Subscription remains.");
      }
    }
  });

  // Update task
  auto t2 = async([&]() {
    while (running) {
      auto updates = {pair<int, int>(0, 0)};

      database.Update(updates);
      if (!database.IsDirty(0)) {
        throw exception("Record missing Dirty bit.");
      }

      database.Persist();
      if (database.IsDirty(0)) {
        throw exception("Record should not be Dirty.");
      }
    }
  });

  // Run for 60 seconds or until an exception is thrown
  std::async([&]() {
    const auto start = steady_clock::now();
    while (steady_clock::now() - start < seconds(60) &&
           t1.wait_for(seconds(1)) != future_status::ready &&
           t2.wait_for(seconds(1)) != future_status::ready) {
    }
    running = false;
  });

  try {
    t1.get();
    t2.get();
  } catch (const std::exception& e) {
    cout << "Test Failed : " << e.what() << endl;
    return -1;
  }

  cout << "Test Passed" << endl;
  return 0;
}

Once understood, fixing this bug was relatively straightforward. By making the _flags member variable a std::atomic type, all bit-wise operations will be atomic, performed as a read-modify-write operation. Here is the modified Record class.

enum Flags : int { Dirty = 0x01, Subscribe = 0x10 };

class Record {
public:
  Record()
    : _value(0)
    , _flags(0)
  {}

  Record::Record(Record&& other)
    : _value(other._value)
    , _flags(other._flags.load())
  {}

  Record& Record::operator=(Record&& other) {
    _value = other._value;
    _flags = other._flags.load();
    return *this;
  }

  // Prevent copying
  Record& Record::operator=(const Record&) = delete;
  Record::Record(const Record&) = delete;

  void SetDirty() { _flags |= Flags::Dirty; }
  void SetClean() { _flags &= ~Flags::Dirty; }

  void Subscribe() { _flags |= Flags::Subscribe; }
  void Unsubscribe() { _flags &= ~Flags::Subscribe; }

  void SetValue(const int& value) {
    _value = value;
    SetDirty();
  }

  int GetValue() const { return _value; }

  bool IsSubscribed() const { return (_flags & Flags::Subscribe) != 0; }
  bool IsDirty() const { return (_flags & Flags::Dirty) != 0; }

private:
  int _value;
  std::atomic<int> _flags;
};

I should note that I addressed this bug before std::atomic was part of the C++ standard by using platform-specific atomic methods (_InterlockedOr and _InterlockedAnd for Windows Server). Using std::atomic for the bit mask is even more straightforward, requires much less code, and, attractively, remains platform independent.

After making this final change, my test ran many millions of times without ever encountering an error. This gave me confidence that all of the bugs with this distributed transaction, at least exercised by this specific test, were finally resolved.

Of course, three years later, after the system had evolved to allow even greater concurrency, we uncovered yet another bug with similar symptoms. This bug was uncovered by this same test, which was part of our regression test suite. Distributed transactions can be very difficult to get right.

Conclusion

All three of these bugs involved the interaction of more than one component. In fact, none of these bugs would ever be encountered with a unit test or a functional test that only exercised one component in isolation. These types of bugs are inherent to multi-threaded programs and distributed systems. The bit mask was also used for more than just the subscription and persistence mechanisms, so discovering this bug also led to discovering the potential for equally latent bugs in a number of other components of the system.

Before the system was scaled up, the likelihood of an end user encountering any one of these bugs was very, very small. The chance of encountering the third bug, related to modifying bits in the bit field, was infinitesimally small. Yet the possibility existed. Because bugs like these are so latent, finding them in infrastructure systems used for real-time monitoring, process control, safety critical applications, billing, and the like, is of critical importance. The chance of bugs like these being noticed and reported is remote. By all appearances, the subscriptions were established correctly, as the Subscribe call completed without error. For a record that updates infrequently, no one may even notice that the subscription was missing. If the end user does notice and attempts to reproduce the situation that produced the bug, the chances of encountering the bug twice in a row in almost nil. So as designers and builders of these infrastructure systems, it must be our responsibility to test for and find these latent bugs.