Does OpenAI Lose Money on GPT-3 Inference?
I suspected that OpenAI was losing money on GPT-3 inference, but Ethan Caballero disagreed. We ended up doing some back-of-the-envelope calculations together. It turns out OpenAI is doing pretty well, and inference is surprisingly cheap even with very large models.
Throughput. It should actually be possible to run GPT-3 on a single Nvidia DGX A100 (640 GB). GPT-3's forward pass takes around 350B flop/token (about 2 flop per parameter for a 175B-parameter model; see the GPT-3 paper, Table D.1), and the DGX A100 datasheet lists about 2.5 Pflop/s of dense FP16 tensor compute across its 8 GPUs. Combining the two, a DGX should be able to produce on the order of 8k tokens/s.
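The throughput arithmetic can be sketched in a few lines. The per-GPU figure below is the A100's dense FP16 tensor-core peak from the datasheet; real utilization will be far lower, so this is an upper bound, not a measurement:

```python
# Rough GPT-3 throughput estimate on one DGX A100 (640 GB).
flop_per_token = 350e9       # forward pass, GPT-3 paper Table D.1
peak_flops_per_gpu = 312e12  # A100 dense FP16 tensor-core peak (datasheet)
gpus_per_dgx = 8

dgx_flops = peak_flops_per_gpu * gpus_per_dgx  # ~2.5 Pflop/s
tokens_per_s = dgx_flops / flop_per_token
print(f"{tokens_per_s:,.0f} tokens/s")         # on the order of 7-8k
```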
Energy. The DGX's max power consumption is around 6 kW. At 8k tokens/s, producing 1B tokens takes about 35 hours, which works out to roughly 210 kWh, or about $21 in electricity at $0.10/kWh. OpenAI charges $0.06 per 1k tokens, i.e. $60,000 in revenue per 1B tokens. The energy cost to serve GPT-3 is essentially negligible.
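As a sanity check, here is the energy-versus-revenue calculation in code. The $0.10/kWh electricity price is an assumption on my part (the $21 figure implies roughly that rate):

```python
# Energy cost vs. revenue for 1B tokens of GPT-3 inference.
power_kw = 6.0            # DGX A100 max power draw
tokens_per_s = 8_000      # throughput estimate from above
price_per_kwh = 0.10      # assumed electricity price

hours = 1e9 / tokens_per_s / 3600      # ~35 hours per 1B tokens
energy_kwh = power_kw * hours          # ~210 kWh
energy_cost = energy_kwh * price_per_kwh

revenue = 0.06 * 1e9 / 1_000           # $0.06 per 1k tokens
print(f"energy: ${energy_cost:.0f}, revenue: ${revenue:,.0f}")
```

Even if the real electricity rate were several times higher, energy would still be a rounding error next to revenue.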
Hardware. The hardware cost is more significant. The DGX A100 320 GB sold for around $200k (AnandTech), so let's assume the 640 GB version is around $400k. Even at that price, it takes only about 6.6B tokens, or roughly 10 days of running GPT-3 flat out, to recover the hardware cost.
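The break-even arithmetic, under the same assumed throughput and the assumed $400k price tag:

```python
# How long until revenue covers an assumed $400k DGX A100 (640 GB)?
hardware_cost = 400_000
revenue_per_token = 0.06 / 1_000  # $0.06 per 1k tokens
tokens_per_s = 8_000              # throughput estimate from above

breakeven_tokens = hardware_cost / revenue_per_token  # ~6.7B tokens
days = breakeven_tokens / tokens_per_s / 86_400       # ~10 days
print(f"{breakeven_tokens / 1e9:.1f}B tokens, {days:.1f} days")
```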
Caveats. We did the same calculation with V100s and the numbers came out roughly the same. These are, of course, very rough estimates: assuming peak flop/s is unrealistic, and actual throughput will be lower in practice. Datacenter overhead and the cost of provisioning for elastic demand are also not accounted for here.