Tags: framework validation · GFC · crisis testing · PBO

24 Strategies Walk Into a Financial Crisis

45.7-point separation between good and bad strategy composites

Sigmentic Research · 12 min read

The 2008 Global Financial Crisis remains the gold standard for strategy stress testing. Between October 2007 and March 2009, the S&P 500 lost 56.8% of its value. Correlations spiked across asset classes. Strategies that appeared diversified turned out to share a single underlying bet on continued liquidity.

We trained 24 strategies on 2004-2007 data and ran them through the full validation pipeline against the crisis period. The set spans four complexity tiers, three quality categories, and includes strategies with documented failure modes ranging from p-hacking to survivorship bias to unnecessary ML complexity.


The Crisis

[Crisis timeline, slide 1 of 6: 2007-10-09, S&P 500 hits its all-time high of 1,565 (drawdown 0%); VIX at 16]

The Results

[Chart: composite scores by strategy, grouped good / borderline / bad]

* Three ML memorization strategies are excluded. They score high because CSCV cannot detect in-sample memorization by tree-based and neural models; they are a known framework limitation, not genuine passes.

Two strategies passed cleanly. Risk parity scored 63.4 with zero PBO, confirming that diversification by risk contribution (rather than by asset count) holds up under stress. Mean-variance constrained scored 62.0, also with zero PBO. Ledoit-Wolf shrinkage and turnover limits generated stable returns even as the covariance structure shifted.
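The mechanics behind that result can be sketched. What follows is a minimal, hypothetical construction, not the study's actual pipeline: inverse-volatility weights (a common simplification of full equal-risk-contribution risk parity) computed over a Ledoit-Wolf shrunk covariance.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def shrunk_inverse_vol_weights(returns: np.ndarray) -> np.ndarray:
    """Risk-based weights from a Ledoit-Wolf shrunk covariance.

    `returns` is a (T, N) matrix of asset returns. Inverse-volatility
    weighting is a stand-in for full equal-risk-contribution risk
    parity; the article's exact construction is not specified.
    """
    # Shrinkage pulls the sample covariance toward a scaled identity,
    # stabilizing estimates when the covariance structure shifts.
    cov = LedoitWolf().fit(returns).covariance_
    inv_vol = 1.0 / np.sqrt(np.diag(cov))
    return inv_vol / inv_vol.sum()

rng = np.random.default_rng(0)
w = shrunk_inverse_vol_weights(rng.normal(size=(500, 4)))
```

The shrinkage step is the part that matters in a crisis: it damps the extreme sample correlations that would otherwise concentrate the portfolio.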

At the other end, eight strategies hit critical failure. Vol-inflated Sharpe (PBO 0.59), survivorship bias (PBO 0.50), and hyperparam mined (PBO 0.36) all scored 20-22. The framework flagged them without needing to see the crisis. PBO alone was sufficient.

The borderline strategies landed exactly in between. Lucky momentum (34.7), regime timing (37.1), and overparameterized factor (29.5). These are strategies with genuine economic logic but structural weaknesses. The framework didn't force them into pass/fail. It flagged them for investigation.


Discrimination

FRAMEWORK DISCRIMINATION

- Good average: 71.7 (11 strategies)
- Bad average: 26.0 (7 strategies)
- Gap: 45.7 points
- Strategies scored: 21
- Spearman rho: 0.422
- Concordant pairs: 67.8% (173 of 255)

Good strategies averaged 71.7. Bad strategies averaged 26.0. The 45.7-point gap is wide enough to make allocation decisions on. It's not a borderline finding. It's a clear, actionable signal.

Some genuine strategies scored poorly in the crisis. That's information, not a bug. Carry (28.3) and VRP (22.8) are academically validated strategies that have real crisis exposure. An allocator should know this before sizing a position. The framework surfaces it.


Forward Validation

We computed forward Sharpe ratios from the crisis through 2022 and compared them to composite scores.

[Scatter plot: composite score vs forward Sharpe, points colored good / borderline / bad; Spearman rho = 0.422 (moderate forward correlation)]

ML memorization strategies are excluded (forward Sharpe > 8.0 would compress the scale), as are strategies with zero forward Sharpe (insufficient data).

At Spearman rho 0.422, the composite score carries real forward-looking information. Not perfect prediction, but a material edge over "trust the backtest."

Risk parity, the highest-scoring genuine strategy (63.4), delivered a 0.93 forward Sharpe with 54.4% cumulative return. Vol-inflated Sharpe, correctly flagged as bad (20.0), lost 94.5% of its capital. Survivorship bias (22.4) lost 53.7%.

The concordance rate of 67.8% means that for roughly two-thirds of all strategy pairs, the framework correctly predicted which one would outperform in the forward period.
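Both statistics fall out directly from the (composite score, forward Sharpe) pairs. A sketch with illustrative numbers, not the study's actual data:

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

def concordance(scores, fwd_sharpes):
    """Fraction of strategy pairs where the higher composite score
    also delivered the higher forward Sharpe (ties skipped)."""
    hits, total = 0, 0
    for i, j in combinations(range(len(scores)), 2):
        ds = scores[i] - scores[j]
        df = fwd_sharpes[i] - fwd_sharpes[j]
        if ds == 0 or df == 0:
            continue
        total += 1
        hits += (ds > 0) == (df > 0)
    return hits / total

# Illustrative values only (loosely echoing the article's figures)
scores = [63.4, 62.0, 34.7, 26.3, 22.4, 20.0]
sharpe = [0.93, 0.13, 0.20, -0.10, -0.40, -0.80]

rho, _ = spearmanr(scores, sharpe)
rate = concordance(scores, sharpe)
```

Concordance is the more allocator-friendly number: it answers "if I rank two strategies by composite score, how often is that ranking right out-of-sample?"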


What PBO Catches

The most interesting finding is how PBO behaves across strategy types. P-hacked factor had a PBO of only 0.07 on GFC data, yet the framework still scored it 26.3 (fail). The composite caught it through other channels: a low L1 score and a poor crisis Sharpe.

Vol-inflated Sharpe hit PBO 0.59. Survivorship bias hit 0.50. Overparameterized factor reached 0.77. These aren't strategies that blow up spectacularly. They're strategies that quietly underperform, and PBO quantifies why.
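For readers unfamiliar with PBO, the CSCV procedure behind it (Bailey et al.) can be sketched in a few lines. This is a simplified version that uses mean return in place of Sharpe as the selection metric, not the framework's exact implementation:

```python
from itertools import combinations

import numpy as np

def pbo_cscv(returns: np.ndarray, n_blocks: int = 8) -> float:
    """Probability of Backtest Overfitting via combinatorially
    symmetric cross-validation (CSCV).

    `returns` is (T, N): T periods, N parameter configurations.
    Time is split into `n_blocks` blocks; for every half/half split,
    pick the best config in-sample and record whether it lands in the
    bottom half of configs out-of-sample. PBO is the fraction of
    splits where the in-sample winner underperforms out-of-sample.
    """
    blocks = np.array_split(returns, n_blocks, axis=0)
    idx = range(n_blocks)
    splits = list(combinations(idx, n_blocks // 2))
    below_median = 0
    for in_idx in splits:
        out_idx = [i for i in idx if i not in in_idx]
        is_perf = np.vstack([blocks[i] for i in in_idx]).mean(axis=0)
        oos_perf = np.vstack([blocks[i] for i in out_idx]).mean(axis=0)
        best = is_perf.argmax()
        # Relative out-of-sample rank of the in-sample winner
        rank = (oos_perf < oos_perf[best]).mean()
        below_median += rank < 0.5
    return below_median / len(splits)

rng = np.random.default_rng(1)
noise = rng.normal(size=(400, 20))  # pure noise: PBO should hover near 0.5
pbo = pbo_cscv(noise)
```

This also makes the ML-memorization blind spot concrete: if every parameter combination memorizes the training set equally well, the in-sample winner is arbitrary and PBO collapses toward zero regardless of true skill.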


Compare Any Two

Strategy Comparator: Risk Parity (good) vs Mean Var Constrained (good)

Metric          Risk Parity    Mean Var Constrained
Composite       63.4           62.0
PBO             0.00           0.00
Crisis Sharpe   -0.04          -0.84
Fwd Sharpe      0.927          0.129
Fwd Return      54.4%          2.7%
Verdict         pass           pass

Across Crises

Cross-Event Comparison
Appears in all 3 events
Global Financial Crisis
22.8
PBO: 0.00Fwd: 0.00bad
COVID-19 Pandemic
22.8
PBO: 0.00Fwd: 0.00bad
Volmageddon
20.0
PBO: 0.91Fwd: 0.28bad

A Note on ML Memorization

Three strategies in the bad category (RF Overfit, LSTM Leakage, Complexity Theater) scored above 84. They "passed" validation because CSCV cannot detect memorization-based overfitting in tree and neural models. This is a known framework limitation. PBO measures parameter sensitivity, not model capacity. When a random forest memorizes the training set, it gets the same PBO across all parameter combinations: zero.

These strategies are excluded from the primary visualizations and discrimination calculations. They represent an honest boundary of what statistical validation can detect. Addressing ML memorization requires model-specific checks (generalization gap analysis, feature importance stability) that go beyond CPCV/PBO.
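The generalization-gap check mentioned above can be as simple as comparing in-sample and out-of-sample fit on a chronological split. A sketch (illustrative, not the framework's implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def generalization_gap(model, X, y, split: float = 0.7) -> float:
    """In-sample minus out-of-sample R^2. A large positive gap
    flags memorization that PBO/CSCV cannot see.

    Chronological split (no shuffling) to respect time ordering.
    """
    n = int(len(X) * split)
    model.fit(X[:n], y[:n])
    return model.score(X[:n], y[:n]) - model.score(X[n:], y[n:])

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 10))
y = rng.normal(size=600)  # pure-noise target: nothing real to learn
rf = RandomForestRegressor(n_estimators=50, random_state=0)
gap = generalization_gap(rf, X, y)  # memorization shows up as a large gap
```

An unconstrained random forest will fit the noise target well in-sample and near-randomly out-of-sample, which is exactly the signature PBO misses when it holds across all parameter combinations.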


The Takeaway

A crisis is the most demanding test environment for any validation framework. Correlations shift. Liquidity vanishes. Strategies that looked independent become correlated.

Three properties of the framework held up under this stress. First, it maintained a 45.7-point discrimination gap. Second, it correctly identified genuine risk in genuine strategies (carry and VRP aren't bad, but they have crisis exposure). Third, forward validation confirmed that composite scores contain predictive information about post-crisis performance.