How Statistics Solved the German Tank Problem
Learn what the German tank problem is and how statistical estimators helped solve it and win World War II.
In the last article, we talked about what statistics and estimators are. But we didn’t talk about them for their own sake; we talked about them because they can help us come up with some amazing solutions to real-world problems. So get ready, because those problems and their solutions are our topic for today.
Do you remember our thought experiment from last time? We imagined being handed a bag containing a bunch of tiles that were all carved into the shapes of integers. (Weird, I know.) The first tile put in the bag was shaped like the number “1,” the second like the number “2,” and so on. Which means that the last tile put in the bag was shaped like the total number of tiles in the bag. Now, you didn’t put the tiles in the bag, so you have no idea how many are in there. But it’s your job to come up with an estimate of the total. And all you’re allowed to use to make this estimate are the integer values you got when you randomly pulled six tiles from the bag. Let’s say you pulled: 10, 23, 17, 9, 35, and 3. Can you come up with an estimator that uses this sample of numbers to estimate the total number of tiles in the bag?
Well, before we jump into answering that, we’d better take a second to remember what an estimator is. First of all, remember that the six numbers you pulled from the bag are called a sample of data. And all the numbers in the entire bag are called the population. An estimator is a rule that tells you how to use the numbers in a sample of data to estimate some property of the entire population. And that’s exactly what we need to do: come up with a rule to use our sample to estimate the total number of integers in the entire population of the bag. Since the tiles are numbered 1, 2, 3, and so on, that total is the same as the largest number in the bag, which is called the population maximum.
So, how does it work? What’s the best way to use your sample to estimate the size of the population? I’m going to let you in on a little secret. I already know that there are 42 tiles in the bag. How? Because I made the problem up! And I had to set this number in order to randomly come up with the six numbers in our random sample. Sorry to spoil the surprise, but knowing the actual number beforehand is going to allow us to check our progress along the way.
At the end of the last article, I suggested a number of possibilities for how we might best estimate the population maximum. The first was to use twice the maximum value of the sample. How would that work in our case? Well, the biggest integer in our six-number sample is 35. So twice that would be 2 x 35 = 70. Remember, the actual maximum value in the bag is 42, so twice the sample maximum, 70, gives us a number that’s way bigger than the real value. Okay, how about twice the mean value? Well, the mean value of the six-number sample (10, 23, 17, 9, 35, and 3) is just over 16. So twice the mean value would be about 32. That’s definitely better than 70, but it’s off by about 23%, which is still quite a bit. How about twice the median instead? The median of our sample is 13.5, so that gives us an estimate of 27, but that differs from the actual value by about 36%. Clearly, these estimators just aren’t good enough.
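If you'd like to check that arithmetic yourself, here is a minimal Python sketch of the three guesses. The names `sample`, `true_total`, `guesses`, and `percent_error` are my own choices for illustration; only the six tile values and the true total of 42 come from the article.

```python
from statistics import mean, median

# The six tiles pulled from the bag in the article.
sample = [10, 23, 17, 9, 35, 3]
# The true number of tiles, which the article reveals to be 42.
true_total = 42

def percent_error(estimate, truth):
    """Return how far the estimate is from the truth, as a percentage of the truth."""
    return abs(estimate - truth) / truth * 100

# The three candidate estimators discussed in the text.
guesses = {
    "twice the maximum": 2 * max(sample),
    "twice the mean": 2 * mean(sample),
    "twice the median": 2 * median(sample),
}

for name, estimate in guesses.items():
    error = percent_error(estimate, true_total)
    print(f"{name}: {estimate:.1f} (off by {error:.0f}%)")
```

Keep in mind that this compares the guesses on a single sample only, so it tells us how they did this time, not how they do on average.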
The Minimum-Variance Unbiased Estimator
We don’t have to keep on guessing forever in our effort to find the best estimator. Instead, let’s let math figure out the answer for us. The details are a little too complicated for us to go over right now, but after all the math is said and done, the gist is that we want to use what’s called the minimum-variance unbiased estimator (a.k.a., MVUE) to calculate the population maximum. And the good news is that it’s easy to do! To find the population maximum, we just need to know two numbers:
- The number of items in the sample, also called the sample size.
- The largest value in that sample, also called the sample maximum.
In our case, the total number of items in the sample is 6, and the sample maximum is 35. To come up with our estimate of the population maximum, we just need to plug these numbers into the formula that says:
estimated population maximum = sample maximum + (sample maximum / sample size) – 1
That means that since our sample maximum is 35 and our sample size is 6, our estimate of the population maximum is 35 + (35 / 6) – 1, which is just a hair under 40. That’s not too bad! Remember, the actual maximum value is 42, so our estimate of 40 is only about 5% less than the actual value! By the way, if you’re wondering what this equation means, you can think of it as saying that the population maximum is estimated to be equal to the sample maximum…plus a little bit more. And that little bit more is basically equal to the average gap between the numbers in the sample. Okay, now that we understand the method, let’s see how it can be applied to the real world.
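Here is one way the rule could be written as a small Python function. Treat it as a sketch: the function name `estimate_population_max` is hypothetical, and it assumes the items are numbered 1, 2, 3, and so on, and that the sample was drawn without putting anything back.

```python
def estimate_population_max(sample):
    """Estimate the largest number in the bag from a sample of its tiles.

    Uses the rule from the text:
    sample maximum + (sample maximum / sample size) - 1.
    """
    if not sample:
        raise ValueError("need at least one observation")
    sample_max = max(sample)
    sample_size = len(sample)
    return sample_max + sample_max / sample_size - 1

# The six tiles from the article.
tiles = [10, 23, 17, 9, 35, 3]
estimate = estimate_population_max(tiles)
print(round(estimate, 2))  # 39.83
```

Notice that only the largest value and the sample size matter; the other five tiles don't change the answer at all.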
Let’s take a trip back in time and look at what’s known as the “German tank problem” from World War II. At that time, for fairly obvious reasons, Allied forces wanted to know how many tanks the German military was producing. They tried using traditional espionage techniques like spying, decoding, and interrogation, but they kept coming up with absurdly high estimates. So they started looking for another way. And once they realized that each German tank they captured was marked with serial numbers saying “this is the [insert number here]th tank that has been produced this month,” they figured out how to come up with a very accurate estimate.
What did they use? None other than the exact same estimator that we used to estimate the total number of tiles in the bag: the minimum-variance unbiased estimator of the population maximum. Think about it for a second and you’ll see that these two problems are nearly identical: just switch total number of tanks for total number of tiles, and tank serial numbers for tile integer values. Pretty cool, right? And the statistical approach worked very well. Using traditional techniques (like spying and decoding encrypted messages), Allied intelligence estimated that the Germans were producing about 1,400 tanks per month. But estimating the population maximum using the serial numbers from the captured tanks yielded an estimate of only 256 per month. And after the war when all the official German documents were analyzed, it was found that the true value was 255 per month. That’s only 1 less than was estimated! Not too shabby. In truth, Allied intelligence got a little lucky to get such a precise estimate, but regardless, the technique is very effective.
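So how much of that is luck? One way to get a feel for it is to go back to our bag of 42 tiles and repeat the six-tile draw thousands of times on a computer. The sketch below is an illustration under my own assumptions, not anything the Allies ran: the function names, the 20,000 trials, and the fixed random seed are all arbitrary choices, and it compares the estimator against the "twice the mean" guess from earlier.

```python
import random
from statistics import mean, pstdev

def mvue(sample):
    """Sample maximum + (sample maximum / sample size) - 1."""
    largest = max(sample)
    return largest + largest / len(sample) - 1

def twice_mean(sample):
    """The simpler guess from earlier: double the sample average."""
    return 2 * sum(sample) / len(sample)

def simulate(estimator, population_size=42, sample_size=6,
             trials=20_000, seed=0):
    """Draw many samples without replacement and summarize the estimates.

    Returns (average estimate, spread of the estimates).
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same numbers
    tiles = range(1, population_size + 1)
    estimates = [estimator(rng.sample(tiles, sample_size))
                 for _ in range(trials)]
    return mean(estimates), pstdev(estimates)

for name, estimator in [("MVUE", mvue), ("twice the mean", twice_mean)]:
    average, spread = simulate(estimator)
    print(f"{name}: average {average:.2f}, spread {spread:.2f}")
```

On runs like this, the first estimator's average lands very close to the true 42, which is what "unbiased" means, and its estimates are noticeably less spread out than the ones from twice the mean. Any single estimate can still miss by several tiles, though, so landing within 1 of the true tank count did take a bit of good fortune.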