A small reflection on showing something before naming it while teaching.

I’ve been thinking for a while about the importance, when teaching, of showing something first and only then putting a name on it. I believe I’ve finally understood some of its virtues through one interesting case.

I’ve noticed that students often know a lot of words colloquially, but those words don’t carry the same meaning in their minds as they do in mine. Two concrete examples: efficiency (of algorithms) and exponential growth. Colloquial use has washed out some of the meaning of both terms.

To teach backpropagation, I built the whole class around one point: it isn’t just an algorithm for computing derivatives using the chain rule, it is one that does so efficiently.

Put that way, of course, the claim might not mean much to a student.

What I did in the middle of the class, after showing a simple version using dynamic programming (which I didn’t name), was to write another one that computed the gradients “directly,” without the intermediate accumulations. Then I programmed a toy “resnet”: h(x) = f(x) + x, where f(x) was simply a sigmoid. I composed 25 of these resnets, and I asked the students to predict which version would be faster.
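
A minimal sketch of what the two versions might look like, assuming a scalar input and plain Python. This is not the code from the class; the function names, the call counter, and the choice of summing over both paths of each block (through f and through the skip connection) in the "direct" version are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, n):
    """Run n toy blocks h(a) = sigmoid(a) + a; return every activation."""
    acts = [x]
    for _ in range(n):
        a = acts[-1]
        acts.append(sigmoid(a) + a)
    return acts

def grad_accumulated(acts):
    """Backprop-style: walk the blocks in reverse, keeping one running product."""
    g = 1.0
    for a in reversed(acts[:-1]):
        g *= dsigmoid(a) + 1.0  # h'(a) = f'(a) + 1
    return g

def grad_direct(acts, k, counter):
    """Hypothetical 'direct' version: each block expands both of its paths
    and recomputes the upstream gradient for each one, with no reuse."""
    counter[0] += 1
    if k == 0:
        return 1.0
    a = acts[k - 1]  # input to block k
    through_f = dsigmoid(a) * grad_direct(acts, k - 1, counter)
    through_skip = 1.0 * grad_direct(acts, k - 1, counter)
    return through_f + through_skip

# Small depths only: the number of calls doubles with every extra block.
for n in (10, 11, 12):
    acts = forward(0.5, n)
    counter = [0]
    g_direct = grad_direct(acts, n, counter)
    print(n, grad_accumulated(acts), g_direct, counter[0])
```

Both functions return the same derivative; the difference is that the accumulated version does one multiplication per block, while this direct version makes two recursive calls per block.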

The simple version of backprop took 156 microseconds. I asked again for predictions about the second one, and ran it: 2.6 seconds. While it ran, the students who had bet on the direct version being faster got to talk through their mistake.

But I didn’t stop there. I asked them: what happens if, instead of 25 compositions, we have 26? I ran it, and it took 5.2 seconds: one extra composition had doubled the running time. At that point it was evident what was going on. I then compared this with GPT-3, which has ~200 compositions of resnets, and I could comment, now with evidence, on what I had meant when I said that backprop is an efficient way to compute gradients.
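
To make the comparison with ~200 compositions concrete, here is a rough back-of-the-envelope extrapolation from the two timings above. It assumes, as a simplification, that the simple version scales linearly with depth and that the direct version keeps doubling with every extra composition; the function names are made up for this sketch, and the results are estimates, not measurements.

```python
# Timings quoted in the post, both measured at 25 compositions (seconds).
ACCUMULATED_AT_25 = 156e-6
DIRECT_AT_25 = 2.6

def accumulated_seconds(n):
    # Assumption: cost grows linearly with the number of compositions.
    return ACCUMULATED_AT_25 * n / 25

def direct_seconds(n):
    # Assumption: cost doubles with each additional composition.
    return DIRECT_AT_25 * 2.0 ** (n - 25)

for n in (25, 26, 200):
    print(n, accumulated_seconds(n), direct_seconds(n))
```

Under these assumptions the simple version stays in the millisecond range at 200 compositions, while the estimate for the direct version exceeds 10^50 seconds.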