I think the theory behind this is the Shannon-Hartley theorem [1]: basically, phone modes require a higher channel capacity (in bits per second) than CW.
The Nyquist theorem tells us that representing audio frequencies up to 2 kHz requires a sampling frequency of at least twice that, i.e. 4 kHz or 4000 samples per second. With 1 bit resolution, this means 4000 bps.
If we assume that a dit in CW lasts 40 ms (ca. 30 wpm), then a cycle of one dit and its pause lasts two dits, i.e. 80 ms.
A period of 80 ms corresponds to a frequency of 12.5 Hz. Since a series of dits is the highest frequency we must represent/recover, we need a sampling frequency of twice that, i.e. 25 Hz, which is 25 bps at 1 bit resolution.
So CW needs a channel capacity of 25 bps, while voice audio needs 4000 bps, i.e. 1:160 - for voice, you need a channel capacity 160 times larger compared to CW.
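The two bit-rate estimates above can be sketched in a few lines of Python (variable names are mine; the 1-bit resolution is of course a drastic simplification):

```python
# Voice: Nyquist for audio up to 2 kHz -> sample at 4 kHz; 1 bit per sample.
audio_max_hz = 2000
voice_bps = 2 * audio_max_hz * 1           # 4000 bps

# CW: a 40 ms dit (ca. 30 wpm); dit plus pause = 80 ms period.
dit_ms = 40
period_ms = 2 * dit_ms                     # 80 ms
highest_freq_hz = 1000 / period_ms         # 12.5 Hz
cw_bps = 2 * highest_freq_hz * 1           # Nyquist again -> 25 bps

print(voice_bps, cw_bps, voice_bps / cw_bps)  # -> 4000 25.0 160.0
```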
According to the Shannon-Hartley theorem, the channel capacity is linear in the bandwidth and, at low SNR, approximately linear in the signal-to-noise ratio as well (C = B log2(1 + S/N); see [1]).
If we assume a filter bandwidth of 2400 Hz for SSB and of 400 Hz for CW, the bandwidth available for SSB is six times that of CW.
However, the bit rate is 160 times as much.
Hence, the SNR for voice must be larger by a factor of 160/6 ≈ 26.7 compared to CW, all other things being equal.
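As a sanity check (my own sketch, using the bandwidths assumed above): under the low-SNR linear approximation the required SNR ratio is 160/6 ≈ 26.7, while the exact Shannon-Hartley formula gives a somewhat larger factor, because 4000 bps in 2400 Hz is not really a low-SNR operating point.

```python
import math

def required_snr(capacity_bps, bandwidth_hz):
    # Invert Shannon-Hartley, C = B * log2(1 + S/N), for S/N.
    return 2 ** (capacity_bps / bandwidth_hz) - 1

# Linear (low-SNR) approximation used in the text: SNR proportional to C / B.
linear_ratio = (4000 / 2400) / (25 / 400)   # = 160 / 6, approx. 26.7

# Exact formula, same assumptions: 2400 Hz for SSB, 400 Hz for CW.
exact_ratio = required_snr(4000, 2400) / required_snr(25, 400)

print(round(linear_ratio, 1), round(exact_ratio, 1))  # -> 26.7 49.1
```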
This would mean that, in an ideal world and with all my simplifications and assumptions, 5 W in CW are comparable (in terms of error rate, as a proxy for successful QSOs) to roughly 130 W in SSB.
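Converted to transmit power and decibels (again a back-of-the-envelope sketch, using the linear-approximation factor from above):

```python
import math

snr_factor = 160 / 6            # approx. 26.7, from the bit-rate and bandwidth ratios
cw_power_w = 5.0                # the QRP CW power assumed in the text

ssb_power_w = cw_power_w * snr_factor
advantage_db = 10 * math.log10(snr_factor)
print(round(ssb_power_w), round(advantage_db, 1))  # -> 133 14.3
```

In other words, the estimated CW advantage comes out at about 14 dB.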
Now, I am simplifying a bit regarding the bit rates: real QSOs include some degree of redundancy, and human operators can guess missing parts from character and word n-grams in natural language. Also, my reasoning might have a few flaws; I just tried to understand the theory from Wikipedia tonight (one of the fascinating side effects of our hobby).
Nonetheless, I think this calculated figure is a surprisingly good fit to the numbers reported by practitioners.
Any feedback and corrections are warmly welcome!
73 de Martin, DK3IT