[Paper Review] Deep learning with long short-term memory networks for financial market predictions


Jin Won Choi


Machine Learning


July 8, 2021


The sentence that surprised me most while reading Fischer and Krauss’ paper ‘Deep learning with long short-term memory networks for financial market predictions’ was as follows:

“To our knowledge, there has been no previous attempt to deploy LSTM networks on a large, liquid, and survivor bias free stock universe to assess its performance in large-scale financial market prediction tasks”.

‘LSTM’ stands for Long Short-Term Memory, a type of neural network that has found remarkable success in a wide range of applications, from speech recognition to video-game-playing agents. Fischer and Krauss do a masterful job of explaining the mathematical workings of LSTMs, but let me boil the essence down into plain language.

Consider what happens to the memory banks in our brains when we watch a superhero movie. Our memory doesn’t dwell on every detail in each scene, but rather on a few important pieces of information such as the health of our hero, and the insidious plans of the villain. With each new scene, we subconsciously make decisions on which new information to admit into our memory, and which to drop. Bystanders who escape disasters unscathed are deemed unimportant, and never enter our memory. We do notice punches thrown at our hero, and commit them to memory for a time, but quickly forget them once it’s clear they don’t have any lasting effects. If you pause for a moment during the movie to peer into your inner self, you’d notice that your memory has created emotions ranging from fear to excitement.

LSTMs mimic this memory mechanism by running through the data from beginning to end, like playing a movie frame by frame. They retain important information as a set of numerical variables called the ‘cell state’. LSTMs make three decisions as they ingest each new data ‘frame’: A) whether to admit new information into the cell state, B) whether to forget any of the existing cell state, and C) how the cell state manifests into outward behaviour. Training an LSTM entails optimizing each of these three decision-making processes.
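The three decisions above correspond to what the LSTM literature calls the input, forget, and output gates. A minimal single-step sketch in numpy, purely a toy illustration rather than the paper's implementation, might look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. The dicts W, U, b hold parameters for the
    input (i), forget (f), output (o) gates and the candidate (g)."""
    # A) input gate: how much new information to admit
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])
    # B) forget gate: how much of the existing cell state to keep
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])
    # candidate cell state extracted from the new "frame"
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])
    # C) output gate: how the cell state manifests as behaviour
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])
    c = f * c_prev + i * g   # updated memory (cell state)
    h = o * np.tanh(c)       # outward behaviour (hidden state)
    return h, c
```

Training adjusts the `W`, `U`, and `b` parameters so that each gate learns what to remember, what to forget, and what to express.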

These memory-mimicking mechanisms make LSTMs ideally suited to analyzing ‘time series’ data, which refers to data consisting of snapshots queued up in chronological order. Speech recognition, for instance, involves making sense of each syllable as it enters our ear. Many types of financial data, including price histories across asset classes, also constitute time series and appear well suited to LSTMs. These successes raise the question: why did academics wait until 2017, when Fischer and Krauss published their paper, to apply LSTMs to trading? I can take a good guess.

Neural networks are like car factories. They can produce impressive outputs using sophisticated machinery, but only if you feed them good raw material. You can’t make crash-safe cars with brittle steel. But most machine learning neophytes pay scant attention to refining and preprocessing data. My guess is that many PhD students have tried to apply LSTMs without those steps, and couldn’t produce results worthy of publishing.

Fischer and Krauss avoided this mistake. Instead of feeding the raw daily performance of S&P 500 securities into their models, they first z-scored the daily performances. Z-scoring involves subtracting the average of all security performances from each stock’s daily performance, and then dividing by the daily volatility. Without this step, the input data would alternate between periods of extreme price swings, such as in late 2008, and periods of calm, such as in 2017. What’s considered a normal price movement in one period would be seen as abnormal in another. In statistical parlance, we say the data is ‘non-stationary’.
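A sketch of this z-scoring step, assuming daily returns arranged as a (days × stocks) matrix and normalizing cross-sectionally each day (the paper's exact normalization details may differ):

```python
import numpy as np

def zscore_daily(returns):
    """Cross-sectionally z-score a (days x stocks) matrix of daily
    returns: for each day, subtract that day's mean across all stocks
    and divide by that day's cross-sectional standard deviation."""
    mu = returns.mean(axis=1, keepdims=True)     # daily average performance
    sigma = returns.std(axis=1, keepdims=True)   # daily volatility
    return (returns - mu) / sigma
```

After this transformation, every day's inputs have mean zero and unit spread, so a 2008-style panic and a 2017-style calm look statistically similar to the model.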

Statistical models have trouble digesting non-stationary data. Imagine listening to someone on the phone whose mouth strays randomly from the mic. It would be hard to understand that person as your ear would have to constantly adapt to different decibel norms. Machine learning algorithms, which lack common sense, have an even tougher time than we humans at adapting to such changes. Z-scoring is one remedy that forces the data to fall within a more consistent range.

The authors also made wise choices regarding their model targets. Instead of targeting raw performances, the authors assigned binary labels of 1 or 0 depending on whether a given stock outperformed the cross-sectional median during the same period, effectively splitting the targets equally between 1s and 0s. Many machine learning models perform better when the number of labels is balanced.
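The labeling rule can be sketched as follows (a toy version; note that with an odd number of stocks, the median stock itself receives a 0, so the split is only approximately even):

```python
import numpy as np

def median_labels(returns):
    """Label each stock 1 if its return beats that day's
    cross-sectional median, else 0, yielding a roughly
    50/50 split of labels for each day."""
    med = np.median(returns, axis=1, keepdims=True)
    return (returns > med).astype(int)
```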

We can verify the importance of data preprocessing by examining the performance of benchmark models. In addition to LSTM, the authors trained Random Forests, Dense Neural Networks (a.k.a. ‘multilayer perceptrons’), and Logistic Regression models. Every one of these models would have yielded investment strategies with positive excess returns, indicating that the data was set up to make it easy for any model to extract signals from it.

Random Forests in particular performed well enough to raise questions about the benefits of LSTM’s architecture. The strategy that used Random Forests yielded a Sortino ratio of 3.41 after transaction costs, versus the LSTM’s 3.85. Although the difference is large enough (in a statistical sense) to declare LSTMs the winner over Random Forests, the paper may not have given Random Forests a fair shake. The authors don’t appear to have tuned the Random Forests’ hyperparameters, and if such tuning had led to a 10% improvement in the Random Forests’ Sortino ratio, to 3.75, it would have thrown the LSTM’s superiority into doubt.
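For readers unfamiliar with the metric, here is a minimal Sortino ratio computation, assuming daily returns, a zero target rate, and 252 trading days per year (the paper's exact conventions may differ):

```python
import numpy as np

def sortino_ratio(returns, target=0.0, periods=252):
    """Annualized Sortino ratio: average excess return over a target,
    divided by downside deviation (root mean square of the returns
    that fall below the target)."""
    excess = returns - target
    downside = np.minimum(excess, 0.0)           # only below-target returns
    dd = np.sqrt(np.mean(downside ** 2))         # downside deviation
    return np.sqrt(periods) * excess.mean() / dd
```

Unlike the Sharpe ratio, this penalizes only downside volatility, which is why it is a popular summary statistic for long-short strategies with asymmetric return profiles.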

Whenever I read a paper that applies a novel machine learning structure, I ask myself whether that structure can extract insights from data in a way other structures can’t. LSTMs can do so, in theory, by mimicking human memory. But theories routinely dissipate upon being illuminated by data. Judging by the Sortino ratios alone, I’m not 100% convinced that LSTMs mined insight that Random Forests couldn’t. I would be more convinced if the authors showed that LSTMs and Random Forests picked different stocks, but that information is sadly missing.

Let’s set the model structures aside for now, and discuss the size of the Sortino ratios, which seem remarkably large. Can an academic paper really contain trading secrets with the potential to rival Renaissance Technologies? Alas, there do appear to be a couple of caveats.

The first caveat is alpha decay. The authors divided the LSTM backtest into three periods - the early period from 1993 to 2000, the moderation period from 2001 to 2009, and the deterioration period from 2010 to 2015. The strategy performed exceedingly well during the early period, and very well albeit less spectacularly during the moderation period. By the deterioration period, however, the strategy stopped producing profits after transaction costs.

The authors made a good case for the reasons behind the performance decline. LSTMs were not invented until 1997, so other traders could not have employed similar models before that time, and the technique was not widely used for some time afterwards. By the 2010s, however, LSTMs had become not only well known but widely used across multiple industries. The authors speculated that some traders adopted LSTMs or similar models, and arbitraged the alpha away.

The second caveat that thwarts us from realizing high Sortino ratios involves a clog in the strategy’s implementation mechanics. The LSTM model’s target, as a reminder, is the binary label indicating the next-day performance of the stock. If the input data spans until June 22nd at 4pm, the target embodies the stock’s performance from June 22nd at 4pm until June 23rd at 4pm. Replicating the paper’s strategy would involve collecting data at 4pm on June 22nd, running the data through the model to generate predictions, and then submitting buy orders before 4pm on the same day, which, in the absence of a time machine, is impossible.

We could work around this problem by submitting our orders on June 23rd at 9:30am, when the market opens. If each stock’s 4pm-to-4pm performance were identical to its 9:30am-to-4pm performance, we would achieve the results shown in the paper. But there are good reasons to think the performances will differ. Other traders running similar models may submit buy orders at 9:30am for the same stocks that we want to buy. Those orders could push prices up to the point where they squeeze all profit potential out of the trade. Our own investigation has found that 4pm-to-4pm performances often differ significantly from 9:30am-to-4pm performances.
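The gap between the two return definitions can be made concrete with a toy computation (hypothetical prices, not data from the paper):

```python
import numpy as np

def close_to_close(close):
    """4pm-to-4pm returns: the definition the paper's targets use."""
    return close[1:] / close[:-1] - 1.0

def open_to_close(open_, close):
    """9:30am-to-4pm returns: what is actually achievable when
    orders are submitted at the next day's open."""
    return close[1:] / open_[1:] - 1.0
```

Whenever a stock gaps up at the open, the achievable open-to-close return is smaller than the close-to-close return the backtest assumes, and that gap is exactly where competing model-driven orders erode the strategy's edge.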

An alternative workaround is to run the model before the market’s close, say at 3:30pm, and then to submit orders at 4pm. The authors implemented a variant of their strategy that does exactly this, and found that after-cost annual returns decreased by 20%, providing further reason to be mindful of competitors wielding similar models.

I would not, due to these caveats, trade real money using the strategy outlined in Fischer and Krauss’ paper. But I don’t view the strategy itself as the paper’s main contribution. The strategy is like the Wright brothers’ airplane: its purpose is not so much to carry investors today, but to show that a prototype using LSTMs, with proper data preprocessing, can be made to fly higher than traditional statistical models can.

The paper showed, for instance, that LSTM models didn’t rediscover well known factors. Quants often criticize machine learning models for tracking the momentum factor in particular, but LSTM results were actually negatively correlated with that factor. Neither did the results appear to piggyback off of other well known factors, including size and value. Rather, the LSTM models favoured underperforming stocks that hit particularly rough patches in the final weeks leading up to the trading dates. I appreciated how the authors illustrated this point, charting the average performance of the LSTM’s picks and contrasting it against the average performance of all stocks. But there are limits to how much such graphs can help us interpret LSTMs’ decision making process. A simple strategy that replicated short term reversals could only explain half of the LSTM’s decisions.
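A naive version of such a short-term reversal rule, purely for illustration (the window length and portfolio size here are hypothetical), might be:

```python
import numpy as np

def reversal_picks(trailing_returns, k):
    """A naive short-term reversal rule: buy the k stocks with
    the worst trailing returns, betting that recent losers
    bounce back. Returns the indices of the selected stocks."""
    return np.argsort(trailing_returns)[:k]
```

That a rule this simple can explain only half of the LSTM's picks suggests the model learned something beyond plain reversal, though the graphs alone can't tell us what.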

At ENJINE, our philosophy is to implement many individual strategies, and to use them as Lego blocks that form umbrella strategies. We plan on addressing the caveats identified earlier to create modified versions of the LSTM model, which may involve targeting different time horizons, or addressing non-stationarity through different remedies. We look forward to adding those models to our cache of Lego blocks.
