The Pandemic Stress Test: 25 Strategies Through COVID-19
0.706 Spearman rho: strongest forward correlation across all five events
On March 16, 2020, the S&P 500 fell 12% in a single session. VIX hit 82.69. The crash was faster and more violent than 2008, but the recovery was also faster: the S&P reclaimed its February high by August.
This creates a specific challenge for validation frameworks. The crisis was sharp but brief, followed by one of the strongest bull runs in history. A strategy that performs terribly for three weeks but recovers in three months will look fine in annual return metrics. The validation pipeline doesn't measure calendar returns. It measures regime behavior.
We ran 25 strategies through the pipeline, training on 2016-2019 and evaluating against the pandemic period. This is the largest strategy set applied to any single event in the suite.
The Results
Mean-variance constrained was the top genuine scorer at 49.5 (a weak pass). Risk parity followed at 44.6, and TSMOM (time-series momentum) scored 39.8. Even genuine strategies struggled. When markets crash 34% in 23 trading days, the honest score is rarely high.
The bad strategies scattered across the lower half, and their probability of backtest overfitting (PBO) was high. P-hacked factor (35.2) had a PBO of 0.79. Vol-inflated Sharpe (20.3) hit 0.85. Regime overfit (30.0) reached 0.83. These strategies were specifically constructed to exploit data mining, and PBO confirmed it.
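For readers who have not met PBO before, here is a minimal sketch of one common way to estimate it, a simplified combinatorially symmetric cross-validation (CSCV). The article does not say how the pipeline computes PBO, so treat this as an illustration under assumptions: the function names, the per-period Sharpe as the performance metric, and the block count are all hypothetical choices, not the framework's method.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def sharpe(returns):
    """Per-period Sharpe ratio (not annualized); 0.0 when volatility is zero."""
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_config, n_blocks=4):
    """Estimate PBO from equal-length return series, one per configuration.

    The time axis is cut into n_blocks contiguous blocks. Every way of
    choosing half of the blocks as in-sample is tried; the rest is
    out-of-sample. PBO is the share of splits where the in-sample winner
    lands at or below the out-of-sample median rank.
    """
    if n_blocks < 2 or n_blocks % 2:
        raise ValueError("n_blocks must be an even number >= 2")
    n_configs = len(returns_by_config)
    block_len = len(returns_by_config[0]) // n_blocks
    if n_configs < 2 or block_len < 2:
        raise ValueError("need at least 2 configurations and 2 points per block")
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    logits = []
    for in_sample in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in in_sample for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in in_sample for t in blocks[b]]
        is_perf = [sharpe([r[t] for t in is_idx]) for r in returns_by_config]
        oos_perf = [sharpe([r[t] for t in oos_idx]) for r in returns_by_config]
        best = max(range(n_configs), key=lambda c: is_perf[c])
        # Out-of-sample rank of the in-sample winner: 1 = worst, n_configs = best.
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])
        w = rank / (n_configs + 1)
        logits.append(math.log(w / (1 - w)))
    return sum(1 for lam in logits if lam <= 0) / len(logits)


# Hypothetical toy data: config_a is consistently best, so nothing is overfit.
config_a = [0.02, 0.01] * 4
config_b = [0.01, -0.01] * 4
config_c = [-0.02, 0.01] * 4
print(pbo_cscv([config_a, config_b, config_c]))  # 0.0
```

A PBO near 0 means the configuration that looks best in-sample tends to stay near the top out-of-sample. Values like the 0.79 to 0.85 reported above mean the in-sample winner usually fell to the bottom half once it left the data it was selected on.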
Crisis Sharpe tells the real story of what happened during the crash itself. TSMOM recorded -1.41. Yield curve rotation hit -1.17. Regime timing reached -1.61. These are honest measurements of what happens to trend-following and macro rotation when markets fall off a cliff. The framework quantifies the pain rather than hiding it.
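As a rough illustration of what a crisis Sharpe measures, the sketch below computes an annualized Sharpe ratio over only the returns that fall inside a crisis window. The article does not give the pipeline's exact definition, so the zero risk-free rate, the 252-day annualization, and the function name `crisis_sharpe` are assumptions, and the return series is hypothetical.

```python
import math
from statistics import mean, stdev


def crisis_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio over a crisis window.

    Assumes a zero risk-free rate and simple daily returns that have
    already been restricted to the crisis dates.
    """
    if len(daily_returns) < 2:
        raise ValueError("need at least two returns")
    sd = stdev(daily_returns)
    if sd == 0:
        raise ValueError("zero volatility in window")
    return mean(daily_returns) / sd * math.sqrt(periods_per_year)


# Hypothetical daily returns for a strategy during a crash window.
window = [-0.03, 0.01, -0.05, 0.02, -0.04, -0.01]
print(crisis_sharpe(window) < 0)  # True: the strategy lost money in the window
```

Because the window is short, an annualized figure computed this way can be far more extreme than a full-sample Sharpe. That is one reason to read crisis Sharpe as a description of behavior under stress rather than as a return forecast.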
Discrimination
Good strategies averaged 67.0. Bad strategies averaged 27.7. That is a 39.3-point separation across 25 strategies, narrower than the gap in the Global Financial Crisis (GFC) study (45.7). This makes sense: the brief crisis duration gave bad strategies less time to reveal their flaws.
Borderline strategies earned their label. Lucky momentum scored 37.1 (PBO 0.53). Regime timing scored 25.9 (PBO 0.39). Overparameterized factor scored 20.0 (PBO 0.26). These strategies have real economic logic but structural issues that make them unreliable under stress. The framework ranked them between the genuine and the fraudulent, which is exactly right.
Forward Validation
Forward performance was measured from 2020 through 2022.
Spearman rho: 0.706. This is the strongest forward correlation across all five events in the case study suite. The composite score was a strong predictor of post-COVID performance.
For more than three-quarters of strategy pairs (77.3%), the framework's ranking matched the forward outcome. Mean-variance constrained (49.5) delivered a 63.3% cumulative return with a 0.60 forward Sharpe. Regime timing (25.9) lost 45.7%. Overparameterized factor (20.0) lost 25.7%.
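The rank correlation reported above can be sketched in a few lines. This is a plain-Python Spearman rho (Pearson correlation on average ranks), applied only to the three score and forward-return pairs quoted in this article. It will not reproduce 0.706, which covers all 25 strategies, and the article does not specify which forward metric the pipeline ranks on, so using cumulative return here is an assumption.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank variance is zero")
    return cov / math.sqrt(vx * vy)


# Scores and forward cumulative returns (%) quoted in the article:
# mean-variance constrained, regime timing, overparameterized factor.
scores = [49.5, 25.9, 20.0]
forward_returns = [63.3, -45.7, -25.7]
print(spearman_rho(scores, forward_returns))  # 0.5
```

On these three points the top scorer is also the top performer, but the two low scorers swap places, which pulls rho down to 0.5. With only three observations a single swap moves the statistic a lot; across 25 strategies the same kind of local disagreement costs much less.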
One notable outlier: vol-inflated Sharpe scored 20.3 (correctly flagged as bad) but posted a 0.43 forward Sharpe. This kind of disagreement between score and forward outcome is expected in a subset of cases. The 77.3% concordance rate means the remaining 22.7% of pairs, a little under 1 in 4, contradict the ranking. The question for an allocator isn't whether every single ranking is correct, but whether the signal is strong enough to improve decisions on average. At 0.706, it is.
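To make the pairwise concordance figure concrete, here is a minimal sketch of how such a rate could be computed: for every pair of strategies, check whether the higher-scored one also had the better forward outcome. The function names are hypothetical, and skipping tied pairs is an assumption, since the article does not describe how the pipeline handles them.

```python
from itertools import combinations


def pairwise_concordance(scores, outcomes):
    """Share of strategy pairs ordered the same way by score and by outcome.

    Pairs tied on either measure are skipped (an assumption).
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(scores)), 2):
        sign = (scores[i] - scores[j]) * (outcomes[i] - outcomes[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("no untied pairs to compare")
    return concordant / total


def implied_kendall_tau(concordance_rate):
    """Kendall tau implied by a concordance rate when there are no ties."""
    return 2 * concordance_rate - 1


# The three score / forward-return pairs quoted in the article.
scores = [49.5, 25.9, 20.0]
forward_returns = [63.3, -45.7, -25.7]
print(round(pairwise_concordance(scores, forward_returns), 3))  # 0.667
print(round(implied_kendall_tau(0.773), 3))  # 0.546
```

If the reported rate counts every pair, 25 strategies give 300 pairs, and 77.3% corresponds to roughly 232 of them ordered correctly. Under the same no-ties assumption, that rate implies a Kendall tau of about 0.55, a second rank statistic that points the same way as the 0.706 Spearman rho.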
The Conservatism Signal
The COVID results reveal something about the framework's temperament. Several genuine strategies scored below 35. The highest genuine score was 49.5. In a period where most strategies suffered, the framework refused to hand out high marks.
That's the right behavior from a validation system. It should be hardest to pass when markets are hardest to trade. A framework that gives strategies 80s during a pandemic is optimistic, not accurate.
The practical consequence: if your strategy scores 45+ on COVID-era data, it handled one of the most violent dislocations in modern markets. That number carries more information than a 75 scored against calm markets.
What the Numbers Mean for Capital Allocation
The strongest result in this study isn't the discrimination gap or the PBO catches. It's the 0.706 Spearman rho. It means that during a period of extreme dislocation followed by a historic recovery, the framework's assessment of strategy quality predicted which strategies would compound value and which would destroy it.
For an allocator evaluating a pool of strategies, a 0.706 rank correlation between validation scores and forward performance is operationally useful. It doesn't eliminate judgment. It gives judgment something to work with besides a backtest Sharpe ratio and a pitch deck.