Machine Learning - The High-Interest Credit Card of Technical Debt
This post is based on the paper “Machine Learning: The High-Interest Credit Card of Technical Debt” from Google, Inc.
The paper highlights four key technical-debt challenges from the perspective of an ML system.
Traditional methods of paying off technical debt include refactoring, increasing coverage of unit tests, deleting dead code, reducing dependencies, tightening APIs, and improving documentation. The goal of these activities is not to add new functionality, but to make it easier to add future improvements, be cheaper to maintain, and reduce the likelihood of bugs.
However, ML systems have all the complexity of normal code plus a larger system-level complexity on top of it.
Thus, refactoring ML libraries, adding better unit tests, and similar activity is time well spent but does not necessarily address debt at the system level.
Four key challenges are examined:
Complex Models Erode Boundaries
- Entanglement
- Changing anything (e.g. hyper-params, data, features etc.) changes everything.
- Shipping the first version of a machine learning system is easy, but making subsequent improvements is unexpectedly difficult.
- This consideration should be weighed carefully against deadline pressures for version 1.0 of any ML system.
- Hidden Feedback Loops
- Using features whose results may not fully surface for a period of time makes analyzing the effect of proposed changes extremely difficult, and adds cost to even simple improvements.
- We recommend looking carefully for hidden feedback loops and removing them whenever feasible.
- Undeclared Consumers
- The expense of undeclared consumers is drawn from the sudden tight coupling of model A to other parts of the stack. Changes to A will very likely impact these other parts, sometimes in ways that are unintended, poorly understood, or detrimental.
- Undeclared consumers may be difficult to detect unless the system is specifically designed to guard against this case. In the absence of barriers, engineers may naturally grab for the most convenient signal, especially when there are deadline pressures.
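One way to picture a barrier against undeclared consumers is a wrapper that only serves callers who have registered themselves. The sketch below is a minimal illustration under stated assumptions, not something taken from the paper: `GuardedModel`, `declare_consumer`, and the consumer names are all hypothetical, and a real system would more likely enforce this with access controls at the service layer.

```python
class UndeclaredConsumerError(PermissionError):
    """Raised when a caller that never registered tries to read predictions."""


class GuardedModel:
    """Wraps a prediction function and serves only declared consumers (hypothetical)."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._consumers = {}  # consumer name -> owning team

    def declare_consumer(self, name, owner):
        # Registration makes the dependency on this model visible and auditable.
        self._consumers[name] = owner

    def predict(self, consumer, features):
        if consumer not in self._consumers:
            raise UndeclaredConsumerError(f"{consumer!r} is not a declared consumer")
        return self._predict_fn(features)

    def consumers(self):
        # Who must be notified before the model changes.
        return dict(self._consumers)


# Toy "model A": the mean of its input features.
model_a = GuardedModel(lambda features: sum(features) / len(features))
model_a.declare_consumer("ranking-service", owner="ranking-team")
score = model_a.predict("ranking-service", [0.2, 0.4, 0.6])
```

With this in place, a script that grabs the signal without registering fails loudly instead of silently becoming coupled to model A.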
Data Dependencies cost more than Code Dependencies
- Unstable Data Dependencies
- Some input signals are unstable, meaning their behavior changes over time. This can happen implicitly, when the input signal comes from another machine learning model that itself updates over time, or from a data-dependent lookup table, such as for computing TF/IDF scores or semantic mappings.
- One common mitigation strategy for unstable data dependencies is to create a versioned copy of a given signal and use it until such a time as an updated version has been fully vetted.
- Versioning carries its own costs, however, such as potential staleness. And the requirement to maintain multiple versions of the same signal over time is a contributor to technical debt in its own right.
- Underutilized Data Dependencies
- Underutilized data dependencies include input features or signals that provide little incremental value in terms of accuracy. Underutilized dependencies are costly, since they make the system unnecessarily vulnerable to changes.
- Legacy Features: Keeping old features that may not be useful anymore
- Bundled Features: Bundling up multiple features, some of which may not be so useful.
- ε-Features: Keeping features that minimally improve accuracy.
- Static Analysis of Data Dependencies
- Keeping track of which system uses which data.
- Are there references in the current codebase, or production instances with older binaries, that use it?
- Correction Cascades
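- When a model a exists for problem A but a slightly different problem A′ needs solving, it is tempting to learn a model a′ that takes a as input and learns a small correction.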
- This can easily happen for closely related problems, such as calibrating outputs to slightly different test distributions.
- A correction cascade creates a situation where improving the accuracy of a actually leads to system-level detriments, because every model stacked on top of a depends on its current behavior.
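The versioned-copy mitigation for unstable data dependencies can be sketched as a small registry that freezes each published snapshot and only hands out versions that have been vetted. This is an illustrative sketch, assuming an in-memory store; `VersionedSignal`, its methods, and the `idf_table` example are hypothetical names, not an API from the paper.

```python
class VersionedSignal:
    """Stores frozen snapshots of a signal; consumers pin a vetted version (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._versions = {}   # version label -> frozen snapshot
        self._vetted = set()  # versions approved for production use

    def publish(self, version, table):
        if version in self._versions:
            raise ValueError(f"{self.name}@{version} already published")
        # Copy the table so later upstream edits cannot leak into the snapshot.
        self._versions[version] = dict(table)

    def mark_vetted(self, version):
        if version not in self._versions:
            raise KeyError(version)
        self._vetted.add(version)

    def get(self, version):
        if version not in self._vetted:
            raise LookupError(f"{self.name}@{version} has not been vetted")
        return self._versions[version]


idf = VersionedSignal("idf_table")
upstream = {"debt": 2.3, "model": 0.7}
idf.publish("v1", upstream)
idf.mark_vetted("v1")

upstream["model"] = 0.1      # the upstream signal keeps changing
idf.publish("v2", upstream)  # published, but not yet vetted

pinned = idf.get("v1")       # consumers keep reading the pinned copy
```

The staleness cost mentioned above shows up here directly: `pinned` still reports the old value for `"model"` until someone vets `v2` and moves consumers over.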
System Level Spaghetti
- Glue Code
- Using self-contained solutions often results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages.
- The glue code pattern implicitly embeds problem construction in supporting code instead of in principally designed components.
- Pipeline Jungles
- The system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output.
- Managing these pipelines, detecting errors and recovering from failures are all difficult and costly. Testing such pipelines often requires expensive end-to-end integration tests. All of this adds to technical debt of a system and makes further innovation more costly.
- When machine learning packages are developed in an ivory-tower setting, the resulting packages may appear to be more like black boxes to the teams that actually employ them in practice.
- Dead Experimental Codepaths
- It is tempting to run experiments with alternative algorithms or tweaks by implementing them as conditional branches within the main production code.
- Maintaining backward compatibility with experimental codepaths is a burden for making more substantive changes. Furthermore, obsolete experimental codepaths can interact with each other in unpredictable ways.
- Configuration Debt
- Any large system has a wide range of configurable options, including which features are used, how data is selected, a wide variety of algorithm-specific learning settings, potential pre- or post-processing, verification methods, etc.
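One possible way to keep a growing set of configurable options in check is to collect them in a single immutable object that validates itself on construction. The sketch below assumes a toy set of options; `TrainingConfig` and its fields are hypothetical and only stand in for the kinds of settings listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical configuration object; frozen so it cannot drift after creation."""

    features: tuple
    learning_rate: float = 0.1
    validation_fraction: float = 0.2

    def __post_init__(self):
        # Fail at construction time rather than deep inside a training run.
        if not self.features:
            raise ValueError("at least one feature is required")
        if len(set(self.features)) != len(self.features):
            raise ValueError("duplicate feature names")
        if not 0.0 < self.learning_rate <= 1.0:
            raise ValueError("learning_rate must be in (0, 1]")
        if not 0.0 < self.validation_fraction < 1.0:
            raise ValueError("validation_fraction must be in (0, 1)")


config = TrainingConfig(features=("query_length", "click_rate"))
```

A mistyped or inconsistent setting then surfaces as an immediate `ValueError` instead of a silently degraded model.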
Dealing with Changes in the External World
- Fixed Thresholds in Dynamic Systems
- It is often necessary to pick a decision threshold for a given model to perform some action. However, such thresholds are often set manually, so they can become invalid when the model is updated on new data.
- A useful mitigation strategy for this kind of problem appears in [8], in which thresholds are learned via simple evaluation on heldout validation data.
- Monitoring and Testing
- Unit testing of individual components and end-to-end tests of running systems are valuable, but in the face of a changing world such tests are not sufficient to provide evidence that a system is working as intended. Live monitoring of system behavior in real time is critical.
- Prediction Bias: In a system that is working as intended, it should usually be the case that the distribution of predicted labels is equal to the distribution of observed labels.
- Action Limits: In systems that are used to take actions in the real world, it can be useful to set and enforce action limits as a sanity check.
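Two of the ideas in this section, learning a threshold on held-out data and monitoring prediction bias, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from [8]: the function names, the precision-based selection rule, the toy data, and the use of mean predicted probability versus observed positive rate as a stand-in for comparing label distributions are all choices made for this example.

```python
def learn_threshold(scores, labels, target_precision):
    """Return the lowest threshold whose held-out precision meets the target, or None."""
    for threshold in sorted(set(scores)):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        if flagged and sum(flagged) / len(flagged) >= target_precision:
            return threshold
    return None


def prediction_bias(predicted_probs, labels):
    """Mean predicted probability minus observed positive rate (near 0 when healthy)."""
    return sum(predicted_probs) / len(predicted_probs) - sum(labels) / len(labels)


# Toy held-out validation data (assumed for illustration); labels are 0 or 1.
heldout_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
heldout_labels = [0, 0, 1, 1, 0, 1]

threshold = learn_threshold(heldout_scores, heldout_labels, target_precision=0.75)
bias = prediction_bias(heldout_scores, heldout_labels)
```

Re-running `learn_threshold` whenever the model is retrained replaces the manually set constant, and an alert on `abs(bias)` exceeding some tolerance is one simple form of live monitoring.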