A Tale of Failure on Multiple Levels

This blog is about making mistakes. I make a lot of them, but I try to learn from them and not make them again. Here’s a recent example.

A recent project I worked on was tested thoroughly. Pipelines were set up to run all unit and functional tests automatically on every push to the remote repository. Coverage was excellent. The testing process gave me a feeling of confidence that the project had a solid codebase. Even so, weeks into production, the occasional minor glitch required a bug fix.

The strange thing was, the tests were not failing. The real world had caused new situations that had not been anticipated in testing.

Luckily, none of these situations caused a major issue; the worst outcome was a confused email from the client about something noticed on the live application. The failure was related to three systems merging their data: the client’s original API (considered the source of truth), the application database, and a third-party API that provided additional metadata based on what the client’s API provided.

The responses from both APIs were parsed and the data inserted into a number of related tables. The complete response from each API was also saved as a JSON string in the primary table. Early in development, the saved JSON string had been decoded and used to retrieve a piece of data. As the program developed, the JSON data was, I believed, no longer used at all, with all requests targeting the parsed data in the related tables.

For some reason, the original (first) merge of JSON data from both APIs was retained but never updated on future merges, even though the data in the related tables was updated every time. That’s where the problem arose.

A single function, thoroughly tested, still utilized the JSON data! Every test continued to pass, even the ones covering this function. Why wouldn’t they? The JSON data in the primary table was the same data as in the related tables, the only difference being that it was never updated after the initial insert!

99.9% of the time, the strange, hidden glitch had no impact at all on the system. It only arose when the client added a new record that referenced an older record, something that didn’t occur until the application had been in production for weeks. The application dutifully attempted to access the data… from the JSON string, not from the related table. The result was that the output showed the older JSON data rather than the newer data.
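The glitch is easier to see in miniature. The following is only a sketch of the pattern, not the project’s actual code: the table names (records, record_details), the title field, and the three functions are all hypothetical, and an in-memory SQLite database stands in for the application database.

```python
import json
import sqlite3

# Hypothetical schema: a primary table holding the raw JSON snapshot,
# and a related table holding the parsed data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, raw_json TEXT)")
conn.execute(
    "CREATE TABLE record_details (record_id INTEGER PRIMARY KEY, title TEXT)"
)


def merge(record_id, api_response):
    """Merge an API response into the database.

    The parsed table is refreshed on every merge, but the raw JSON is
    written only on the first insert. That mismatch is the bug.
    """
    exists = conn.execute(
        "SELECT 1 FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    if exists is None:
        conn.execute(
            "INSERT INTO records (id, raw_json) VALUES (?, ?)",
            (record_id, json.dumps(api_response)),
        )
    conn.execute(
        "INSERT OR REPLACE INTO record_details (record_id, title) VALUES (?, ?)",
        (record_id, api_response["title"]),
    )


def title_from_json(record_id):
    """The one leftover function: reads from the saved JSON string."""
    row = conn.execute(
        "SELECT raw_json FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    return json.loads(row[0])["title"]


def title_from_table(record_id):
    """What every other function does: reads from the related table."""
    row = conn.execute(
        "SELECT title FROM record_details WHERE record_id = ?", (record_id,)
    ).fetchone()
    return row[0]


merge(1, {"title": "Original title"})
agree_after_first_merge = title_from_json(1) == title_from_table(1)

merge(1, {"title": "Updated title"})  # the source of truth changes later
stale = title_from_json(1)
fresh = title_from_table(1)
```

After the first merge the two readers agree, which is exactly the state every test exercised. Only after a second merge do they drift apart.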

All tests, functional and unit, continued to pass.

Debugging this situation took an hour or two. An initial look at the data (saved in the related table) showed that the correct information was there, but it was not the data that was being displayed. At some point a light bulb went on. The fix was a single line in one function to change the source of data… from the saved JSON string to the related table.

Design decisions are made during development, tested, vetted and passed. The JSON response had been saved as a type of redundancy, so it could be compared to the parsed data if a problem arose. In this case, one function remained that read the JSON data and worked perfectly, despite reading outdated information.

Part of the problem was the mocks. The application couldn’t poll the APIs every time a test was run, so a variety of different JSON responses from both APIs were saved in text files for use in mock model objects during testing. That information never changed, and so the situation that caused the glitch never arose.

The other problem was that there were two sources of data in the system: the original JSON response, and the parsed data in a series of related tables. The JSON string had been used early in the development process, before switching to the data in the related tables. When the change occurred, the saved JSON should have been immediately removed. If it had been, the function in question still might not have failed, but the issue would not have come up, either.

Luckily it was a minor issue, noticed by a diligent client, and it was corrected quickly. But I’d rather not face this issue again.

So, in future…

  • try to anticipate more edge cases during testing
  • do not retain two live sources of the same information
  • figure out a better way to mock data that accounts for changes over time
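As a first stab at that last point, here is a rough sketch of a mock that returns a different response on each call, so a test can merge the same record twice and check what comes out afterwards. It uses Mock and side_effect from Python’s standard unittest.mock; everything else (Store, sync, api_client.fetch, the title field) is a made-up stand-in, not the real application.

```python
from unittest.mock import Mock


class Store:
    """Hypothetical stand-in for the application database (parsed data only)."""

    def __init__(self):
        self.parsed = {}

    def merge(self, record_id, response):
        # Refresh the parsed data on every merge.
        self.parsed[record_id] = dict(response)

    def title(self, record_id):
        # Reads from the single live source of the data.
        return self.parsed[record_id]["title"]


def sync(store, api_client, record_id):
    """Fetch the record from the (mocked) API and merge it into the store."""
    store.merge(record_id, api_client.fetch(record_id))


def check_title_follows_api_changes():
    # side_effect with a list makes each call return the next item,
    # simulating the API's answer changing between two merges.
    api_client = Mock()
    api_client.fetch.side_effect = [
        {"title": "Original title"},
        {"title": "Updated title"},
    ]
    store = Store()

    sync(store, api_client, 1)
    first = store.title(1)

    sync(store, api_client, 1)
    second = store.title(1)
    return first, second


titles = check_title_follows_api_changes()
assert titles == ("Original title", "Updated title")
```

Had a function like the stale JSON reader been run through a test shaped like this, the second value would have come back as the original title and the assertion would have failed.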