Python Tutorial: Data outliers and scaling
Want to learn more? Take the full course at https://learn.datacamp.com/courses/practicing-machine-learning-interview-questions-in-python at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
In the last lesson, we discussed data distributions and transformations. In this video, we'll cover two additional preprocessing steps: finding and handling outliers, and how and when to scale your data.
Outliers are defined as one or more observations that are distant from the rest of the observations in a given feature. When looking at a histogram of a feature, outliers tend to show up in the tails as you see in this image.
The interquartile range, or IQR, is defined as the difference between the values at the 3rd and 1st quartiles, which sit at the 75th and 25th percentiles, respectively, with the median between them at the 50th percentile. In general, points that lie more than 1.5 times the IQR below the 1st quartile or above the 3rd quartile should be suspected as possible outliers, which corresponds to the shaded regions seen here. Individual points carry less weight overall in a large dataset than the same data point in a smaller dataset. And a point that is only twice as large as your upper boundary is less concerning than one that is ten times as large.
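As a minimal sketch of the 1.5 times IQR rule described above, the snippet below flags suspected outliers in a small made-up array. The data and the variable names (values, lower_fence, upper_fence) are illustrative assumptions, not part of the course material.

```python
import numpy as np

# Hypothetical sample: nine ordinary values plus one extreme value.
values = np.array([10, 12, 12, 13, 14, 15, 15, 16, 18, 95], dtype=float)

# First and third quartiles (25th and 75th percentiles).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR below Q1 and 1.5 * IQR above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are suspected outliers, not proven ones.
is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]
print(outliers)
```

Here only the value 95 falls outside the fences; whether it should be removed still depends on the investigation the lesson recommends.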
Looking at a simple linear regression model of a dataset with and without outliers reveals just how influential the extreme points are for this particular data. The slope and intercept coefficients are vastly different between the two. A thorough investigation should be undertaken to justify whether or not to remove them. And it's entirely possible that these anomalies are crucial when designing an ML model whose purpose is to detect such anomalous behavior.
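To illustrate how much a single extreme point can move a fitted line, here is a small sketch using NumPy's polyfit on made-up data. The dataset is an assumption chosen for clarity (points lying exactly on y = 2x + 1, plus one anomalous observation); it is not the data shown in the video.

```python
import numpy as np

# Hypothetical clean data lying exactly on the line y = 2x + 1.
x_clean = np.arange(10, dtype=float)
y_clean = 2 * x_clean + 1

# Same data with one extreme observation appended at x = 10.
x_out = np.append(x_clean, 10.0)
y_out = np.append(y_clean, 100.0)

# Degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope_clean, intercept_clean = np.polyfit(x_clean, y_clean, 1)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"without outlier: slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with outlier:    slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

In this toy case one added point more than doubles the slope and pushes the intercept negative, which is the kind of shift that should prompt a closer look at the extreme observations.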
Some of the functions you'll encounter in the exercises are from the seaborn module: the boxplot function, with our target variable Loan Status supplied to y, gives boxplots conditioned on that variable, and distplot gives a histogram with a KDE. NumPy's abs function returns the absolute value. From the scipy module, stats.zscore calculates the z-score, and mstats.winsorize is a handy function that, given a list of limits, replaces outliers, in this example with the 5th percentile and 95th percentile data values. And, finally, NumPy's where function evaluates a condition given as the first argument and returns the values specified by the second argument where the condition is true, or those specified by the last argument where it is false.
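The sketch below shows how the non-plotting functions mentioned above might fit together. The array, the cutoff of 3 for the absolute z-score, and the choice to replace flagged values with the median are all assumptions made for illustration; the exercises may use different data and thresholds. The seaborn plotting calls are left out so the example runs without a display.

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

# Hypothetical feature: the integers 1..19 plus one extreme value.
values = np.array(list(range(1, 20)) + [500], dtype=float)

# z-score of every value; flag those with |z| above an assumed cutoff of 3.
z = stats.zscore(values)
z_outliers = values[np.abs(z) > 3]

# Winsorize 5% at each end: with 20 values, one value per tail is
# replaced by its nearest remaining neighbor.
winsorized = np.asarray(mstats.winsorize(values, limits=[0.05, 0.05]))

# np.where(condition, value_if_true, value_if_false):
# replace flagged values with the median, keep the rest unchanged.
median_replaced = np.where(np.abs(z) > 3, np.median(values), values)

print(z_outliers)
print(winsorized.min(), winsorized.max())
print(median_replaced[-1])
```

Winsorizing keeps every row but pulls the tails in, while the np.where approach swaps only the flagged values; which one is appropriate depends on why the outliers are there.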
This image shows two normal distributions that have different variances, where variance measures the average squared deviation from the mean in a distribution. In a machine learning framework, the high-variance feature will be chosen more often than a low-variance feature, making it seem more influential when it may not be. The solution to this problem is to scale your data when the dataset contains features whose ranges vary greatly.
Sometimes the terms for scaling, most notably normalization and standardization, are used interchangeably. But let's clarify their definitions to avoid any confusion. Standardizing your data, also known as z-score scaling, subtracts the mean from each value and divides by the standard deviation, giving the feature a mean of zero and a variance of one. Normalization, also seen as min-max normalization, subtracts the minimum from each value and divides by the range. This has the effect of scaling the feature to lie between zero and one. So both approaches scale the data; they just do so differently.
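The two definitions above can be written out by hand in a few lines. This is a minimal sketch on a made-up array; it uses the population standard deviation (NumPy's default), which is an assumption about which variant of the formula is intended.

```python
import numpy as np

# Hypothetical feature values.
values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardization (z-score): subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std()

# Min-max normalization: subtract the minimum, divide by the range.
normalized = (values - values.min()) / (values.max() - values.min())

print(standardized.mean(), standardized.std())
print(normalized)
```

The standardized values end up centered on zero with unit variance, while the normalized values run from 0 to 1 with the original spacing preserved.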
In the exercises, you'll use two functions from scikit-learn's preprocessing module. StandardScaler standardizes each feature to mean 0 and standard deviation 1, while MinMaxScaler normalizes the data to lie between 0 and 1.
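As a rough sketch of how these two scalers are typically called, the example below applies each to a small made-up matrix whose two columns sit on very different scales. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: rows are observations, columns are two features
# whose ranges differ by orders of magnitude.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# Each scaler learns its statistics per column, then transforms the data.
X_std = StandardScaler().fit_transform(X)
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

After either transform, both columns share a comparable scale, so neither dominates simply because of its units.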
Here is another multiple-choice question before heading over to the exercises. How should outliers be identified and properly dealt with? What effect does min-max normalization or z-score standardization have on data? Select the statement that is true. If the answer is not immediately apparent, pause this video to read through the possible answers and give yourself a moment to think about it. If you still aren't sure, consider re-watching this video lesson up to this point, paying particular attention to the definition of outliers, when outliers are helpful, and what each type of scaling does to the data, before revealing the answer on the next slide.
The correct answer is that, in certain contexts where the goal is to find fraud or cybersecurity events, for example, data anomalies are required in order to create a predictive ML model to detect them in the future.
These are the reasons why the other answers are incorrect; make sure you understand them.
To put everything we've covered so far into better perspective, these are the steps and the order in which they should be followed.
Now, it's time for some practice.