Thoughts on Rich Sutton’s “Bitter Lesson”


I’ve been thinking on and off about Rich Sutton’s piece The Bitter Lesson. Here are three corners that I think Sutton is cutting too quickly:

1. Moore’s Law is way too slow to drive any real progress in AI

Of course we should aim for technology that scales with compute power (how could you possibly disagree with that?). But compute power alone scales too slowly. Moore’s Law scales as 2^(N/1.5) (compute power doubles every 1.5 years). Search spaces grow much faster: chess grows as 35^N, Go as 250^N, and ambiguity in natural language runs somewhere between 1 and 10 interpretations per word (an estimate by Piek Vossen), etc.
So yes, Moore’s Law helps, but not a lot. The real progress has to come from algorithms and representations. Of course those should scale with compute, but that’s icing on the cake, not the essence of the breakthroughs.
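To make the gap concrete, here is a minimal back-of-the-envelope sketch in Python, using the branching factors and the 1.5-year doubling period mentioned above (the per-ply framing is my own illustration, not Sutton’s or Vossen’s):

```python
import math

# How many years of Moore's Law doublings (one doubling per ~1.5 years)
# buy one extra ply of brute-force search?
DOUBLING_PERIOD_YEARS = 1.5  # assumption from the post: compute doubles every 1.5 years

for game, branching_factor in [("chess", 35), ("Go", 250)]:
    doublings_needed = math.log2(branching_factor)   # 2^d = branching factor
    years = doublings_needed * DOUBLING_PERIOD_YEARS
    print(f"{game}: ~{years:.1f} years of Moore's Law per extra ply")

# chess: ~7.7 years of Moore's Law per extra ply
# Go:    ~12.0 years of Moore's Law per extra ply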

2. Compute power is not the only bottleneck

Saying that our methods should scale with compute is not the same as saying that we should aim for technologies whose only bottleneck is compute power. If that were possible, it would be great: just wait for the chips to get faster. But even the methods that Sutton says scale with compute power (read: ML) don’t actually scale with compute power alone. Lack of training data, for example, is a major bottleneck for many ML techniques, no matter how fast compute power grows. So yes, of course things should scale with compute power, but there will always be other bottlenecks too, and Sutton pretends that for ML there are none.

3. Knowledge-based methods do scale with compute

Sutton asserts: “the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation”. That ignores the knowledge representation (KR) work of the past decade. One of the main reasons we can now compute with knowledge bases on the order of a billion edges is that memory got so cheap. So these methods do scale with compute. And yes, they have other bottlenecks (knowledge acquisition suffers from a size/quality trade-off), so more compute does not simply mean more power. But something similar applies to ML: other bottlenecks (e.g. data availability) stop it from scaling arbitrarily with more compute.
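To make the memory point concrete, a rough sketch of what a billion-edge knowledge graph costs in RAM (the 24-byte-per-edge encoding is an illustrative assumption of mine, not a figure from the post):

```python
# Rough memory estimate for an in-memory knowledge graph of one billion edges.
# Assumption (not from the post): each edge is a (subject, predicate, object)
# triple of 64-bit integer IDs, ignoring indexes and other overhead.
EDGES = 1_000_000_000
BYTES_PER_EDGE = 3 * 8  # three 64-bit IDs

total_gb = EDGES * BYTES_PER_EDGE / 1e9
print(f"~{total_gb:.0f} GB")  # ~24 GB: fits in the RAM of a single commodity server
```

Under that (simplified) assumption, the whole graph sits comfortably in the memory of one modern machine, which is exactly the kind of scaling-with-hardware that the quoted claim overlooks.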

So in short:

1. Of course good methods should scale with more compute power, but other bottlenecks will kick in with all methods, with ML as much as with KR.
2. Relying on Moore’s Law means a very long wait, because its growth curve is much too slow for the size of search spaces.
3. KR methods do scale with compute (in particular with memory).

Other interesting responses to The Bitter Lesson