Part 3 in the Atlas series. Earlier: the database optimization case study and the forecasting system in production.
In November of last year I disabled the machine learning models in my live algorithmic trading system and replaced the signal generator with transparent technical analysis. The models had been running for months and the system had been making money. The change itself was trivial. The decision was the hardest one I have made in production engineering.
This is a note about that decision, what led to it, and the disciplined reintroduction of machine learning that came after.
Atlas trades short-dated SPY options through a managed broker stack. Decisions are made by a coordinated set of services running on a small fleet of OS processes, with market data flowing in through Interactive Brokers and Alpaca, and orders flowing out through the same. The system has been live with real capital since early 2026. It is also my primary income, which means the engineering bar is not academic. Whatever runs has to run every market day, and whatever fails has to fail in ways I can catch before the failure costs me.
The system originally ran an ensemble. An XGBoost classifier for short-horizon entries, a TensorFlow LSTM for sequence context, and a regime classifier for state. The models had been trained on years of tick-level data and they performed reasonably in backtest. In live trading they continued to perform, more or less, for a while.
The interesting thing about silent ML degradation is that you find it the way you find a slow leak in a tire. Not from the leak itself, but from a downstream inconvenience some weeks later.
In my case the inconvenience was small and procedural. I was upgrading a feature pipeline. Routine work, a clean refactor, the kind of change that gets shipped on a Saturday morning. While I was in the code I noticed that the XGBoost model at inference was being fed a feature vector that did not match what it had been trained on. It had been wrong for weeks. Not erroring. Not crashing. Just receiving an input it had no representation of, returning a number it had no business returning, and propagating that number through to a position-sizing calculation that quietly went on doing what it had been doing all along.
The LSTM was worse. It had been trained against forty-three features and was being served twenty-six in production. The forward pass still computed. The output still looked plausible. Nothing in the system said anything had broken.
Both models had been making decisions every market open for weeks. Both had been making money. Neither was doing the thing I thought it was doing.
Research ML lives next to its training data. You sanity-check inputs every time you reload a notebook. Production ML lives behind a pipeline that abstracts the contract away, and once the contract is abstract enough, drift becomes invisible.
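Here is the kind of check that would have caught both failures. A minimal sketch, assuming a schema file written next to the model artifact at training time; the class names and file layout are mine for illustration, not any particular library's API.

```python
# Sketch of a train/serve feature-contract guard. The schema file and
# names are hypothetical; the point is that the training-time feature
# list is pinned once and checked on every inference call.
import json

class FeatureContractError(RuntimeError):
    """Raised when served features do not match the training contract."""

class FeatureContract:
    def __init__(self, schema_path: str):
        # Schema is written once, at training time, next to the model artifact.
        with open(schema_path) as f:
            self.feature_names = json.load(f)["feature_names"]

    def validate(self, features: dict[str, float]) -> list[float]:
        served, expected = set(features), set(self.feature_names)
        if served != expected:
            raise FeatureContractError(
                f"missing={sorted(expected - served)} "
                f"unexpected={sorted(served - expected)}"
            )
        # Return values in training order so column position cannot drift.
        return [features[name] for name in self.feature_names]

contract = FeatureContract("models/entry_classifier.schema.json")

def predict(model, features: dict[str, float]) -> float:
    vector = contract.validate(features)  # fails loudly, not silently
    return model.predict([vector])[0]
```

The useful property is not the error message. It is that the contract lives in the serving path, so a pipeline refactor that changes the features cannot fail silently.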
The decision to disable both models was a one-line config change. The discipline was in making it.
A working trading system that is, in fact, making money is the worst possible backdrop for ripping out the part of it that you cannot prove is making the money. I had a perfectly reasonable narrative in front of me, supported by the P&L, that the ensemble was contributing. The narrative was wrong. The contribution was at best partial, and at worst accidental. But the narrative was supported, the system was running, and the easy decision was to fix the contract instead of removing the dependency.
Fixing the contract would have taken a week. Removing the dependency took an afternoon and exposed everything underneath.
I removed it. What was underneath was a level-based decision engine that I had originally written as a sanity check on the ML output, and that had, over months, accumulated a structure of its own. Multi-source level aggregation. Regime-aware confidence scoring. A small set of guards that decided whether to trade in the first place. I had not noticed how complete it had become because I had been thinking of it as scaffolding around the models, rather than as a signal generator in its own right.
After I disabled the ensemble, the level engine kept trading. The system kept making money. The dashboards got more readable, because every decision was now explicable in terms I could write on the back of a napkin. And the failure modes were no longer silent. If the engine made a bad call, I could read the call and tell you why.
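To make "back of a napkin" concrete, here is a sketch of the shape of such a decision. The names, regimes, and thresholds are illustrative, not Atlas's actual internals; the property that matters is that every decision object carries its own reasons in plain language.

```python
# Illustrative sketch of a transparent, level-based decision. The names
# (LevelDecision, the regime strings, the ATR threshold) are hypothetical.
from dataclasses import dataclass, field

@dataclass
class LevelDecision:
    action: str                        # "enter" or "skip"
    confidence: float                  # 0.0 .. 1.0
    reasons: list[str] = field(default_factory=list)

def decide(price: float, level: float, regime: str, atr: float) -> LevelDecision:
    d = LevelDecision(action="skip", confidence=0.0)
    distance = abs(price - level)
    if regime == "chop":
        d.reasons.append("regime=chop: entries disabled")
        return d
    if distance > 0.5 * atr:
        d.reasons.append(f"price {distance:.2f} from level, > 0.5 ATR: no setup")
        return d
    d.action = "enter"
    d.confidence = 1.0 - distance / (0.5 * atr)
    d.reasons.append(f"price within 0.5 ATR of level in regime={regime}")
    return d
```

When a decision like this is wrong, the reasons tell you which rule to fix. There is no training run between you and the bug.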
This is what transparency does for you in production. It is not that the system becomes better at picking the right trade. It is that when it picks the wrong one, you can find out, and you can fix it.
I do not think machine learning has no place in production trading. I think it has a specific place, and most of the writing about how to deploy it is wrong about which place that is.
The model I am bringing back is not the same ensemble. It is a single LightGBM classifier, trained on a labeled corpus of decision outcomes generated by the level engine itself, with deliberate guards on the integration. It is not a free-running signal source. It is an opinion that gets blended into the level engine’s confidence at a weight I set deliberately and that I can roll back to zero with a config change in five seconds.
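The blend itself is deliberately boring. A minimal sketch, assuming the level engine emits a confidence in [0, 1] and the classifier a calibrated probability; the names are illustrative:

```python
# Minimal sketch of the confidence blend. blend_weight lives in config;
# setting it to 0.0 reduces the system to the level engine alone.
def blended_confidence(level_conf: float, model_prob: float,
                       blend_weight: float) -> float:
    assert 0.0 <= blend_weight <= 1.0
    return (1.0 - blend_weight) * level_conf + blend_weight * model_prob
```

A linear blend degrades predictably: at zero the model contributes nothing, and the system is exactly the one that ran without it.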
The blend weight started at zero, ran in shadow mode against live decisions for weeks, was promoted to a small fraction of its target, and is now climbing on a schedule that I can pause at any point. Each step is gated on validation metrics that I defined before training. Per-regime calibration. Log-loss versus a naive baseline. Hyperparameter stability across folds. If any gate fails the blend goes back to zero automatically.
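Here is a sketch of how that gating can be wired, assuming each gate is a predicate over a validation report computed from the shadow-mode decisions. The gate names mirror the metrics above; the thresholds are placeholders, not the real ones.

```python
# Hypothetical promotion gates over a validation report. Any failure
# sends the blend weight back to zero automatically.
GATES = {
    "per_regime_calibration": lambda r: r["max_regime_calib_error"] < 0.05,
    "log_loss_vs_baseline":   lambda r: r["log_loss"] < r["baseline_log_loss"],
    "hyperparam_stability":   lambda r: r["fold_param_variance"] < 0.10,
}

def next_blend_weight(report: dict, current: float, step: float,
                      target: float) -> float:
    failed = [name for name, ok in GATES.items() if not ok(report)]
    if failed:
        return 0.0  # automatic rollback on any gate failure
    return min(current + step, target)  # otherwise climb on schedule
```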
There is nothing clever about this. It is what production ML deployment should look like and almost never does. The thing that makes it work is not the methodology. It is that I am my own customer, and the only person who suffers when I cut a corner is me.
Two lessons. Both are obvious once you have learned them and neither is obvious before.
The first is that production ML is not research ML with deployment glue. It is a different discipline. Research ML asks whether a model can learn a pattern from data. Production ML asks whether a model continues to receive the data it was trained on, in a system where the data pipeline is itself under active development by people who do not know the model’s exact requirements. The two disciplines have different failure modes, different testing regimes, and different review cadences. A research-good model deployed naively is not a production-good model. It is a stochastic black box wired into something that matters.
The second is that transparency is a load-bearing engineering property, not a soft preference. When the level engine makes a bad call, I read the call, write a test, fix the rule, and ship. When the XGBoost model made a bad call, I could not even tell it was making a bad call. The whole system depended on me trusting that the training generalized. Most of the time it did. The cost of the rest of the time was that I had no instrument to detect when it had not.
I would happily run a model I cannot read. I will not run a model I cannot read inside a system that has no other source of truth. The discipline is not anti-ML. It is anti-monolith.
There is a version of this work that exists at every large model lab. Customers ask AI infrastructure companies to embed engineers who can take a model that works in research and make it work in production. The engineers who do that work usefully are not the ones who know the most about the model. They are the ones who know the most about the difference between a system that works under observation and a system that works under load. They are the people who have, somewhere in their past, looked at a perfectly running production system and asked the uncomfortable question of whether it is doing the thing they think it is doing.
I have spent the last year asking that question of my own system. The answer was no, then I made it yes, and now I am building the discipline to add back the parts I disabled, without again mistaking what works for what is understood.
That is what the next year of Atlas is about.