Python Tutorial: Statistical Thinking in Python II (Part 2)
Part 2 of our Statistical Thinking in Python II course by Justin Bois. Learn more about the course here: https://www.datacamp.com/courses/statistical-thinking-in-python-part-2
To "pull yourself up by your bootstraps" is a classic idiom meaning that you achieve a difficult task by yourself with no help at all. In statistical inference, you want to know what would happen if you could repeat your data acquisition an infinite number of times. This task is impossible, but can we use only the data we actually have to get close to the same result as an infinitude of experiments? The answer is yes! The technique to do it is aptly called bootstrapping. This chapter will introduce you to this extraordinarily powerful tool.
In the prequel to this course we computed summary statistics of measurements, including the mean, median, and standard deviation.
But remember, we need to think probabilistically. What if we acquired the data again? Would we get the same mean? The same median? The same standard deviation? Probably not.
In inference problems, it is rare that we are interested in the result from a single experiment or data acquisition. We want to say something more general. Michelson was not interested in what the measured speed of light was in the specific 100 measurements conducted in the summer of 1879. He wanted to know what the speed of light actually is. Statistically speaking, that means he wanted to know what speed of light he would observe if he did the experiment over and over again an infinite number of times.
Unfortunately, actually repeating the experiment lots and lots of times is just not possible. But, as hackers, we can simulate getting the data again. The idea is that we resample the data we have and recompute the summary statistic of interest, say the mean. To resample an array of measurements, we randomly select one entry and store it. Importantly, we replace the entry in the original array, or equivalently, we just don't delete it. This is called sampling with replacement. Then, we randomly select another one and store it. We do this $n$ times, where $n$ is the total number of measurements, five in this case. We then have a resampled array of data. Using this new resampled array, we compute the summary statistic and store the result.
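The resampling steps just described can be sketched by hand, one draw at a time. This is only an illustration under stated assumptions: the five values in `data` are made up (they are not from the course), the seed is arbitrary, and the sketch uses NumPy's `default_rng` generator rather than any particular function from the course.

```python
import numpy as np

# Arbitrary seed so the sketch is reproducible (an assumption, not from the course)
rng = np.random.default_rng(seed=42)

# Hypothetical five-measurement array, made up purely for illustration
data = np.array([23.3, 27.1, 24.3, 25.7, 26.0])
n = len(data)

# Draw n entries one at a time. The original array is never modified,
# so every entry stays available for each draw: sampling with replacement.
resampled = np.empty(n)
for i in range(n):
    index = rng.integers(0, n)  # random index in the half-open range [0, n)
    resampled[i] = data[index]

# Compute the summary statistic of interest on the resampled array
resampled_mean = np.mean(resampled)
print(resampled, resampled_mean)
```

Because entries are never removed, the same measurement can appear more than once in `resampled`, and some measurements may not appear at all.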
Resampling the speed of light data is as if we repeated Michelson's set of measurements. We do this over and over again to get a large number of summary statistics from resampled data sets. We can use these results to plot an ECDF, for example, to get a picture of the probability distribution describing the summary statistic.
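Repeating the resampling many times and building an ECDF from the results might look roughly like the sketch below. The helper names `bootstrap_replicates` and `ecdf` are hypothetical, the seed is arbitrary, and the 100 data points are synthetic stand-ins drawn from a Normal distribution, not Michelson's measurements.

```python
import numpy as np


def bootstrap_replicates(data, func, size, rng):
    """Return `size` values of the statistic `func`, each computed on a resampled copy of `data`."""
    replicates = np.empty(size)
    for i in range(size):
        # Generator.choice samples with replacement by default
        sample = rng.choice(data, size=len(data))
        replicates[i] = func(sample)
    return replicates


def ecdf(values):
    """Return x (sorted values) and y (cumulative fraction) for an ECDF."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


rng = np.random.default_rng(seed=0)  # arbitrary seed

# Synthetic stand-in for 100 measurements; NOT Michelson's data
data = rng.normal(loc=10.0, scale=2.0, size=100)

# 1000 resampled means, then the ECDF describing how the mean varies
reps = bootstrap_replicates(data, np.mean, size=1000, rng=rng)
x, y = ecdf(reps)
```

Plotting `y` against `x` with your plotting library of choice then gives a picture of the distribution of the mean.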
This process is an example of bootstrapping, which more generally is the use of resampled data to perform statistical inference. To make sure we have our terminology down, each resampled array is called a bootstrap sample. A bootstrap replicate is the value of the summary statistic computed from the bootstrap sample. The name makes sense; it's a simulated replica of the original data acquired by bootstrapping.
Let's look at how we can generate a bootstrap sample and compute a bootstrap replicate from it using Python. We will use Michelson's measurements of the speed of light.
First, we need a function to perform the resampling. The NumPy function np.random.choice() provides this functionality. Conveniently, like many of the other functions in the NumPy random module, it has a size keyword argument, which allows us to specify how many samples we want to take out of the array. Notice that, in the example shown on screen, it chose the number five three times; the function does not delete an entry when it samples it out of the array.
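A minimal sketch of this call follows. The list `values` and the seed are assumptions chosen for illustration, and the numbers actually drawn depend on the seed, so they need not match what is shown in the video.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, only for reproducibility

values = [1, 2, 3, 4, 5]  # illustrative input, not from the course

# size=5 asks for five draws; replace=True is the default,
# so the same entry can be chosen more than once
draws = np.random.choice(values, size=5)
print(draws)

# For contrast: with replace=False each entry can be drawn at most once,
# so five draws from five entries is just a shuffle of the input
shuffled = np.random.choice(values, size=5, replace=False)
print(shuffled)
```

Only the default behavior, with replacement, produces a bootstrap sample.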
Now, we can draw 100 samples out of the Michelson speed of light data. This is a bootstrap sample, since there were 100 data points and we are choosing 100 of them with replacement.
Now that we have a bootstrap sample, we can compute a bootstrap replicate. We can pick whatever summary statistic we like. We'll compute the mean, median, and standard deviation. It's as simple as treating the bootstrap sample as though it were a data set.
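In code, those two steps might look like the sketch below. Michelson's measurements are not reproduced in this transcript, so `measurements` is a placeholder array of 100 synthetic values; the variable names and the seed are likewise assumptions.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, only for reproducibility

# Placeholder for the 100 speed-of-light measurements;
# these are synthetic values, NOT Michelson's data
measurements = np.random.normal(loc=0.0, scale=1.0, size=100)

# Bootstrap sample: 100 draws, with replacement, from the 100 data points
bs_sample = np.random.choice(measurements, size=len(measurements))

# Bootstrap replicates: treat the bootstrap sample like any other data set
bs_mean = np.mean(bs_sample)
bs_median = np.median(bs_sample)
bs_std = np.std(bs_sample)
print(bs_mean, bs_median, bs_std)
```

With the real data loaded into `measurements`, the same three calls would give replicates of the mean, median, and standard deviation.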
Now it's time for you to do some bootstrap sampling yourself!