24 Strategies Walk Into a Financial Crisis
45.7-point separation between good and bad strategy composites
The Crisis

The 2008 Global Financial Crisis remains the gold standard for strategy stress testing. Between October 2007 and March 2009, the S&P 500 lost 56.8% of its value. Correlations spiked across asset classes. Strategies that appeared diversified turned out to share a single underlying bet on continued liquidity.

We trained 24 strategies on 2004-2007 data and ran them through the full validation pipeline against the crisis period. The set spans four complexity tiers, three quality categories, and includes strategies with documented failure modes ranging from p-hacking to survivorship bias to unnecessary ML complexity.

The Results
Two strategies passed cleanly. Risk parity scored 63.4 with zero PBO, confirming that diversification by risk contribution (rather than by asset count) holds up under stress. Mean-variance constrained scored 62.0, also with zero PBO; its Ledoit-Wolf covariance shrinkage and turnover limits kept returns stable even as the covariance structure shifted.
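Neither winner's implementation is given here, but the two ingredients named for them can be sketched. The following assumes scikit-learn's LedoitWolf estimator and a common multiplicative risk-parity iteration on synthetic data; it is an illustration, not the tested strategies themselves:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
returns = rng.normal(0.0003, 0.01, size=(1000, 4))  # synthetic daily returns

# Ledoit-Wolf shrinkage pulls the sample covariance toward a structured
# target, stabilising the estimate when the covariance structure shifts.
cov = LedoitWolf().fit(returns).covariance_

# Risk parity: start from inverse-volatility weights, then iterate until
# each asset's risk contribution w_i * (cov @ w)_i is (nearly) equal.
w = 1.0 / np.sqrt(np.diag(cov))
for _ in range(100):
    rc = w * (cov @ w)            # risk contribution per asset
    w *= (rc.mean() / rc) ** 0.5  # shrink overweight contributors
    w /= w.sum()

rc = w * (cov @ w)  # contributions are now near-equal across assets
```

The point of weighting by risk contribution rather than asset count is visible in the loop: an asset that dominates portfolio variance gets its weight cut until its contribution matches the others.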
At the other end, eight strategies hit critical failure. Vol-inflated Sharpe (PBO 0.59), survivorship bias (PBO 0.50), and hyperparam mined (PBO 0.36) all scored 20-22. The framework flagged them without needing to see the crisis. PBO alone was sufficient.
The borderline strategies landed in between: lucky momentum (34.7), regime timing (37.1), and overparameterized factor (29.5). These are strategies with genuine economic logic but structural weaknesses. The framework didn't force them into pass/fail; it flagged them for investigation.
Discrimination
Good strategies averaged 71.7. Bad strategies averaged 26.0. The 45.7-point gap is wide enough to make allocation decisions on. It's not a borderline finding. It's a clear, actionable signal.
Some genuine strategies scored poorly in the crisis. That's information, not a bug. Carry (28.3) and VRP (22.8) are academically validated strategies that have real crisis exposure. An allocator should know this before sizing a position. The framework surfaces it.
Forward Validation
We computed forward Sharpe ratios from the crisis through 2022 and compared them to composite scores.
At Spearman rho 0.422, the composite score carries real forward-looking information. Not perfect prediction, but a material edge over "trust the backtest."
Risk parity, the highest-scoring genuine strategy (63.4), delivered a 0.93 forward Sharpe with 54.4% cumulative return. Vol-inflated Sharpe, correctly flagged as bad (20.0), lost 94.5% of its capital. Survivorship bias (22.4) lost 53.7%.
The concordance rate of 67.8% means that for roughly two-thirds of all strategy pairs, the framework correctly predicted which one would outperform in the forward period.
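Both the rank correlation and the concordance rate fall out of the same (composite score, forward Sharpe) pairs. A minimal sketch using scipy, where only risk parity's 63.4 / 0.93 pair comes from the text and the remaining values are illustrative placeholders:

```python
from itertools import combinations
from scipy.stats import spearmanr

# (composite score, forward Sharpe); only the first pair is from the
# text above -- the remaining values are illustrative placeholders.
pairs = [
    (63.4, 0.93),   # risk parity
    (62.0, 0.55),
    (37.1, 0.10),
    (34.7, -0.05),
    (28.3, 0.20),
    (22.8, -0.30),
    (22.4, -0.60),
    (20.0, -1.40),
]

scores, fwd = zip(*pairs)
rho, _ = spearmanr(scores, fwd)  # rank correlation of score vs outcome

# Concordance: fraction of strategy pairs where the higher composite
# score also delivered the higher forward Sharpe.
hits = total = 0
for (s1, f1), (s2, f2) in combinations(pairs, 2):
    if s1 != s2 and f1 != f2:  # ties carry no ordering information
        total += 1
        hits += (s1 > s2) == (f1 > f2)
concordance = hits / total
```

With 8 strategies there are 28 pairs, so a single mis-ordered pair moves the concordance rate by about 3.6 points.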
What PBO Catches
The most interesting finding is how PBO works across strategy types. The p-hacked factor had a PBO of only 0.07 on GFC data, yet the framework still scored it 26.3 (fail). The composite caught it through other channels: a low L1 score and a poor crisis Sharpe.
Vol-inflated Sharpe hit PBO 0.59. Survivorship bias hit 0.50. Overparameterized factor reached 0.77. These aren't strategies that blow up spectacularly. They're strategies that quietly underperform, and PBO quantifies why.
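The mechanics behind those PBO numbers can be sketched. This is a simplified CSCV implementation in the spirit of Bailey et al., assuming a (T, K) return matrix over K parameter configurations; the framework's actual block count and performance metric may differ:

```python
import numpy as np
from itertools import combinations

def pbo_cscv(returns, n_blocks=8):
    """Probability of Backtest Overfitting via combinatorially symmetric
    cross-validation (simplified sketch). `returns` is a (T, K) matrix:
    T observations for K parameter configurations."""
    blocks = np.array_split(returns, n_blocks)
    splits = list(combinations(range(n_blocks), n_blocks // 2))
    below_median = 0
    for in_idx in splits:
        out_idx = [i for i in range(n_blocks) if i not in in_idx]
        is_ret = np.vstack([blocks[i] for i in in_idx])
        oos_ret = np.vstack([blocks[i] for i in out_idx])
        best = (is_ret.mean(0) / is_ret.std(0)).argmax()   # best IS Sharpe
        oos_sharpe = oos_ret.mean(0) / oos_ret.std(0)
        oos_rank = (oos_sharpe < oos_sharpe[best]).mean()  # relative OOS rank
        below_median += oos_rank < 0.5
    return below_median / len(splits)

rng = np.random.default_rng(1)
noise = rng.normal(0.0, 0.01, size=(1000, 20))  # 20 configs of pure noise
pbo_noise = pbo_cscv(noise)      # tends toward 0.5: in-sample winners
                                 # have no genuine out-of-sample edge
skilled = noise.copy()
skilled[:, 0] += 0.002           # config 0 has a real edge
pbo_skilled = pbo_cscv(skilled)  # near 0: the IS winner keeps winning OOS
```

This is why PBO quantifies quiet underperformance: a high value means the best-looking configuration in-sample routinely falls below the median out of sample.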
A Note on ML Memorization
Three strategies in the bad category (RF Overfit, LSTM Leakage, Complexity Theater) scored above 84. They "passed" validation because CSCV cannot detect memorization-based overfitting in tree and neural models. This is a known framework limitation. PBO measures parameter sensitivity, not model capacity. When a random forest memorizes the training set, its performance barely varies across parameter combinations, so PBO comes out at zero.
These strategies are excluded from the primary visualizations and discrimination calculations. They represent an honest boundary of what statistical validation can detect. Addressing ML memorization requires model-specific checks (generalization gap analysis, feature importance stability) that go beyond CPCV/PBO.
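The generalization-gap check mentioned above can be sketched in a few lines. This uses scikit-learn's RandomForestRegressor on synthetic noise; the data and thresholds are illustrative, not the framework's actual check:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Pure-noise target: there is nothing real to learn, so any in-sample
# fit is memorization by construction.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)  # high: the forest memorizes the noise
test_r2 = model.score(X_te, y_te)   # near zero or negative
gap = train_r2 - test_r2            # a large gap flags memorization
```

A parameter sweep over this model would show near-identical PBO everywhere, while the train/test gap exposes the problem directly.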
The Takeaway
A crisis is the most demanding test environment for any validation framework. Correlations shift. Liquidity vanishes. Strategies that looked independent become correlated.
Three properties of the framework held up under this stress. First, it maintained a 45.7-point discrimination gap. Second, it correctly identified genuine risk in genuine strategies (carry and VRP aren't bad, but they have crisis exposure). Third, forward validation confirmed that composite scores contain predictive information about post-crisis performance.