How to Not Do Machine Learning


Machine learning provides a robust set of tools and frameworks for building useful inference and prediction systems on top of a complex dataset or a taxonomy of data sources. However, it is increasingly common, and fairly easy, to incur large maintenance costs and consume substantial production resources along the way. From a system design perspective, there are many risk factors at play.

Machine learning thrives on three principles: Data laundering, labor externalization and risk privatization.

These same principles are how a system design can fail, because the arguments attached to them rarely get discussed. That is to say, they sound bad, and stakeholders naturally steer away from negative sentiment.

So in the wild, we discover phenomena such as recycling of input signals causing runaway feedback cascades or black-box modeling. Assumptions get locked in. Calibrations become arbitrary. Monitoring can be exclusive in the sense that you can only see what you are looking for.

Model sensitivity is a property of the inputs and outputs of a specific pipeline. Consider a highly connected graph in which feedback and chaotic behavior are desired as a feature rather than a bug: it also illustrates that subtle alterations can trigger a sort of Butterfly Effect. The same idea holds for a system of differential equations, where outputs that are fed back in as inputs can dramatically change the results, depending on sensitivity to the initial conditions.
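As a rough illustration of that sensitivity, here is a minimal sketch that uses the logistic map as a stand-in for a pipeline whose output is fed back in as its next input. The map, the parameter r = 4.0, and the function name are assumptions chosen for illustration; nothing here models a real machine learning system.

```python
def logistic_orbit(x0: float, steps: int, r: float = 4.0) -> list[float]:
    """Iterate x -> r * x * (1 - x), feeding each output back in as the next input."""
    orbit = [x0]
    for _ in range(steps):
        x = orbit[-1]
        orbit.append(r * x * (1.0 - x))
    return orbit


# Two starting points that differ by one part in a billion.
baseline = logistic_orbit(0.2, steps=60)
perturbed = logistic_orbit(0.2 + 1e-9, steps=60)

# Gap between the two runs after one step and at its worst over the whole run.
gaps = [abs(a - b) for a, b in zip(baseline, perturbed)]
print("gap after 1 step:", gaps[1])
print("largest gap over 60 steps:", max(gaps))
```

The gap after one step is still tiny, while the largest gap over the run is of the same order as the values themselves. The point is only that a feedback loop can amplify a change far too small to notice at the input.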

As models are chained together, this effect grows, often scaling with the number of inputs and parameters involved. It is often the case that improving an individual component causes the entire system to buckle. Ensembles must be considered holistically rather than individually, so isolation and independence of variables are weak assumptions. However, analyzing this problem is computationally expensive, because a combinatorial explosion occurs as complexity grows. This is known as correction cascading.

Undeclared consumers or invisible debt can cause runaway feedback loops in the same fashion. Access restrictions and service-level agreements for complex data infrastructures are not data scientists' strong suit. By their nature, inference graphs are often tightly coupled, which increases the technical debt and the maintenance cost of modifying or removing an individual component.

The vast majority of open-source repositories or "production"-grade models rely on package smoothies and expensive data dependencies. The usual assumption is that the input signal is stable. In reality, calibrations falter and version changes can cause widespread failures.
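One crude way to test the assumption of a stable input signal is to compare each incoming batch against a reference sample. The sketch below is an assumption-laden illustration: the function name, the threshold of 3.0, and the choice of a simple z-score on the batch mean are all hypothetical, and a real pipeline would likely need a richer check.

```python
from statistics import mean, stdev


def mean_has_drifted(reference: list[float], batch: list[float],
                     z_threshold: float = 3.0) -> bool:
    """Flag a batch whose mean is far from the reference mean.

    "Far" means more than z_threshold standard errors, where the standard
    error is estimated from the reference sample. The reference needs at
    least two values and the batch at least one.
    """
    ref_mean = mean(reference)
    ref_std = stdev(reference)
    if ref_std == 0:
        # A constant reference: any change in the mean counts as drift.
        return mean(batch) != ref_mean
    standard_error = ref_std / len(batch) ** 0.5
    return abs(mean(batch) - ref_mean) / standard_error > z_threshold


reference = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical historical values
print(mean_has_drifted(reference, [3.1, 2.9, 3.0, 3.2]))    # similar batch
print(mean_has_drifted(reference, [9.0, 10.0, 11.0, 10.0]))  # shifted batch
```

A check like this only catches a shift in the mean. It says nothing about changes in shape, missing values, or a silently changed unit, which is part of why the assumption of stability is so expensive.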

From a holistic standpoint, consider the following:

  • Configuration
  • Data Curation
  • Feature Extraction
  • Data Validation
  • Machine Learning Resource Management
  • Analysis Toolkits
  • Process Management
  • Service Infrastructure
  • Monitoring

It is incredibly common to find workflows that do maybe one or two of these things.

In a sense, all of these components can feed back into each other viciously, to either the benefit or the disadvantage of their respective organizations. From the pragmatic perspective of artificial intelligence augmenting rather than automating human decision making, we can also consider the following precept from information security: humans are the weakest link in the supply chain.

And so, a direct feedback loop can result in the inability to scale a problem within its action or decision space. Simply put, some decisions should be automated and others should not, depending on their complexity. Using non-deterministic or random methods can mitigate and isolate aspects of data being poisoned by itself, but it is not an ample safeguard.
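As one hedged example of such a random method, the sketch below uses an epsilon-greedy choice: most of the time the top-scored item is served, but a small random fraction of decisions ignores the model and is flagged, so that those records can be isolated from data the model influenced. The function name, the scores, and the value of epsilon are hypothetical.

```python
import random


def choose_item(scores: dict[str, float], epsilon: float,
                rng: random.Random) -> tuple[str, bool]:
    """Return (item, explored).

    With probability epsilon the item is drawn uniformly at random and
    flagged as exploration; otherwise the top-scored item is returned.
    """
    if rng.random() < epsilon:
        # Sorting the keys keeps the draw reproducible for a given seed.
        return rng.choice(sorted(scores)), True
    return max(scores, key=scores.__getitem__), False


rng = random.Random(0)  # fixed seed so the run can be reproduced
scores = {"a": 0.9, "b": 0.5, "c": 0.1}  # hypothetical model scores
log = [choose_item(scores, epsilon=0.1, rng=rng) for _ in range(1000)]

# Only the flagged records were chosen independently of the model.
unbiased = [item for item, explored in log if explored]
print(len(unbiased), "of", len(log), "decisions were random")
```

The flagged records give a sample that the model did not select for itself. As noted above, this isolates part of the problem but does not remove the feedback loop.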

Most of the cost is always hidden. Behavior is always steered by the ultimate black-box model, which is humanity. Improving one feature can cause preferences to arise. Sudden changes can result in discontinuation of use due to poor reliability or confusing design patterns. The psychosocial and even neurobiological aspects of how we interface with these models must always be considered. The difference in human-computer interaction can be stated by analogy: it is the difference between a 25 ms and a 5 ms click response time.

There is virtually no anti-pattern except for the most egregious one, mathematics. Code is glued on. Packages are smoothied. Most practitioners are accustomed to minor adjustments and tweaking. There are virtually no automated workflows or versioning for the laity. Pipelines often adhere to organizational dogma rather than best practices or standardization. The very existence of data science departments as distinct from backend or frontend teams can make it difficult even to communicate this concern. Many of the polemic arguments can be distilled down to friction between research and engineering.

One must consider a machine learning model like an ecosystem. Detritus must be cleared. Many host their experiments on public-facing addresses and include them in the code as examples. For example, how would one characterize the boundary or abstraction of a data stream? Given that data is being sourced from a ridiculous number of trackers or telemetry inputs, it can be practically impossible to discern the quality of these experiments. It is like sourcing from the ocean to distill wine.

Data rot has smells. The data can be old. Too many languages can be at play, all of them promising state-of-the-art results or performance improvements. Prototypes and experiments smell from being too fresh. Configurations and environments can smell. Ask anyone how they install Python.

No model exists in isolation. The ultimate artificial intelligence is the social network. Where there are rules, there is a game. Where there is a game, there are players. Where there are players, there are cheaters. One must consider privacy and security while balancing the transparency and robustness of a model with regard to precisely how it works. Testing, monitoring, fixed thresholds, limits on actions, upstream production: these are all practical concerns, unless you want a million fake users mining your model for all it's worth.

We are indebted to the world, and we are always born in someone else.

Test your data. Curate. Clean. Preprocess.

Reproduce it. Randomize it. Parallelize.

Manage process. Implement graceful failures.
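A graceful failure can be as simple as refusing to pass along a broken prediction. The sketch below is a minimal illustration under stated assumptions: the model is any callable that returns a number, and the function name and fallback value are hypothetical.

```python
import math
from typing import Callable, Sequence


def safe_predict(model: Callable[[Sequence[float]], float],
                 features: Sequence[float],
                 fallback: float) -> tuple[float, bool]:
    """Return (value, ok).

    If the model raises, or returns NaN or infinity, the fallback value is
    returned with ok=False so the caller can log and count the failure.
    """
    try:
        value = float(model(features))
    except Exception:
        return fallback, False
    if not math.isfinite(value):
        return fallback, False
    return value, True


# Hypothetical models: one healthy, one that crashes.
print(safe_predict(lambda f: sum(f), [1.0, 2.0], fallback=0.0))
print(safe_predict(lambda f: 1 / 0, [1.0], fallback=0.5))
```

Returning the ok flag alongside the value matters as much as the fallback itself: a failure that is swallowed silently is one more thing monitoring cannot see.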

Ask yourself:

What is the computational or organizational complexity of this object?

What is the algebraic closure of its dependencies?

What is the precision with which new changes can be measured?

What is the connectivity of one model compared to other components?

What is the quality, efficiency and effectiveness of training and testing humans to use this?
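The question about the closure of dependencies can be made concrete if one reads it, as an assumption, to mean the transitive closure of a dependency graph: everything a component depends on, directly or indirectly. The sketch below uses a hypothetical graph and hypothetical component names.

```python
def transitive_dependencies(graph: dict[str, set[str]], root: str) -> set[str]:
    """Return every node reachable from root by following dependency edges.

    graph maps each component to the set of components it depends on
    directly. Cycles are handled because each node is visited only once.
    """
    seen: set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# Hypothetical pipeline: the model's direct dependencies hide a third one.
graph = {
    "model": {"features", "config"},
    "features": {"raw_events"},
    "raw_events": set(),
    "config": set(),
}
print(sorted(transitive_dependencies(graph, "model")))
```

The size of that set, compared with the short list of direct dependencies, is one rough measure of how much a component can be affected by changes it never sees.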