Python Tutorial: Exploratory data analysis
Want to learn more? Take the full course at https://learn.datacamp.com/courses/cleaning-data-in-python at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
In this video, I will show you how we can use exploratory data analysis to help identify data that need further investigation.
The most basic analysis we can do is count the unique values in our data.
We can use the info method to get the data type of each column. Here, I will show you the frequency counts for the non-numeric continent, country and fertility columns, and the numeric population column.
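As a minimal sketch of that first step, here is the info method on a small made-up DataFrame. The column names follow the ones mentioned in this video, but the rows and values are hypothetical and are not the course data.

```python
import pandas as pd

# Hypothetical stand-in for the course data; the values are invented.
df = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Africa"],
    "country": ["Sweden", "Norway", "Japan", "Kenya"],
    "fertility": ["1.9", "1.8", "missing", "4.4"],  # numbers stored as text
    "population": [9.5e6, 5.0e6, 1.27e8, 4.3e7],
})

# info prints the data type and the non-null count of each column.
df.info()

# The same type information is available programmatically.
fertility_is_numeric = pd.api.types.is_numeric_dtype(df["fertility"])
population_is_numeric = pd.api.types.is_numeric_dtype(df["population"])
print(fertility_is_numeric, population_is_numeric)
```

In this sketch, fertility is reported as a non-numeric column because its values are stored as text, while population is numeric.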
To perform a frequency count, we first select the column we want to count.
If the column name contains no spaces or special characters, and does not clash with a Python keyword or an existing DataFrame attribute or method, we can select the column directly by its name using dot notation.
It works the same way as subsetting using bracket notation.
Once we have the column selected, we can use the value_counts method on the selected column.
I like to use the dropna equals False parameter since it will also count the number of missing values if there are any.
The continent column does not have a missing value, so none will be reported.
value_counts returns the counts for each unique value of a column, sorted in descending order.
Note that even though we counted a column of the object dtype, the results of value_counts will have an integer dtype.
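The steps so far can be sketched as follows, assuming a small hypothetical continent column rather than the course data.

```python
import pandas as pd

# Hypothetical data; the continent values are invented for illustration.
df = pd.DataFrame(
    {"continent": ["Europe", "Asia", "Europe", "Africa", "Asia", "Europe"]}
)

# Select the column with dot notation, then count each unique value.
# dropna=False also counts missing values, if there are any.
continent_counts = df.continent.value_counts(dropna=False)
print(continent_counts)

# The counts are integers even though the column itself holds text.
counts_are_integers = pd.api.types.is_integer_dtype(continent_counts)
print(counts_are_integers)
```

Because this made-up column has no missing values, no missing-value row appears in the result.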
Another way we can select columns is using the bracket notation. Here is the same code and output as before, this time using the bracket notation to select a column.
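A short sketch of the bracket version, again on hypothetical data, which also checks that both notations give the same counts:

```python
import pandas as pd

# Hypothetical data; the continent values are invented for illustration.
df = pd.DataFrame(
    {"continent": ["Europe", "Asia", "Europe", "Africa", "Asia", "Europe"]}
)

# Bracket notation works for any column name, including names with spaces.
bracket_counts = df["continent"].value_counts(dropna=False)
print(bracket_counts)

# Dot notation selects the same column, so the counts are identical.
dot_counts = df.continent.value_counts(dropna=False)
same_result = bracket_counts.equals(dot_counts)
print(same_result)
```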
Now we will count the number of observations for each country in our data. Since there are too many countries to show at once, I am using the head method to only return the top 5 counts.
In this example, I am chaining methods together: I select the column, get the value counts just like before, and then call head on the result.
We expect each country to have only 1 observation, but Sweden has 2. This will require us to investigate this data point further.
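Here is a minimal sketch of that check, assuming a hypothetical country column in which Sweden appears twice. The other country names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: every country should appear once, but Sweden is repeated.
df = pd.DataFrame(
    {"country": ["Sweden", "Norway", "Japan", "Kenya", "Sweden", "Brazil", "Chile"]}
)

# Chain the methods: select the column, count the values, keep the top 5 counts.
top_counts = df["country"].value_counts(dropna=False).head()
print(top_counts)

# Any count above 1 flags a country that needs a closer look.
repeated = top_counts[top_counts > 1]
print(repeated)
```

Since value_counts sorts in descending order, any repeated country rises to the top of the output, which is why looking at only the first few rows is enough here.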
The fertility column is one we expected to be numeric, but it is stored as a string.
This is because the column contains the string value "missing", which is why the fertility column has the wrong data type. It also alerts us that we need to recode the "missing" string.
If your column has missing values, they will also be counted, provided you pass the dropna equals False parameter. Here you see that we have 42 missing values in the column.
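The sketch below illustrates this on a hypothetical fertility column that contains both the string "missing" and real missing values. The last step, pd.to_numeric with errors="coerce", is only one possible way to recode the string; it is an assumption for illustration and not necessarily the approach used in the course.

```python
import numpy as np
import pandas as pd

# Hypothetical data: numbers stored as text, one "missing" string, two real NaNs.
df = pd.DataFrame(
    {"fertility": ["1.9", "missing", np.nan, "2.1", np.nan, "1.9"]}
)

# With dropna=False the real missing values get their own row in the counts.
counts_with_na = df["fertility"].value_counts(dropna=False)
print(counts_with_na)

# By default, missing values are left out of the counts.
counts_without_na = df["fertility"].value_counts()
print(counts_without_na)

# One possible recoding (an assumption, not taken from the video):
# values that cannot be parsed as numbers, such as "missing", become NaN.
fertility_numeric = pd.to_numeric(df["fertility"], errors="coerce")
print(fertility_numeric)
```

After the conversion, the column is numeric and the former "missing" string is counted together with the other missing values.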
Another type of EDA we can do is calculate summary statistics on numeric columns.
This can help us spot outliers in our data.
There are many working definitions for outliers. One definition is a value that is considerably higher (or lower) than the rest of the data. You can consult the DataCamp statistics course for more detailed definitions of outliers.
Outliers are observations of interest we want to investigate further for data cleaning.
We can quickly calculate summary statistics on our data by using the describe method. Only the columns that have a numerical type will be returned.
describe returns the number of non-missing values, the mean, the standard deviation, the minimum, the 25th, 50th, and 75th percentiles of our data, where the 50th percentile is our median, and finally, the maximum value of our data.
A quick scan down the population results shows that the maximum population value is 2.3 billion people. Our data comes from 2012, and no country had a population that large then.
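As a sketch of this step, the example below calls describe on a small hypothetical DataFrame. Only the 2.3 billion maximum mirrors the value mentioned in the video; all other rows are invented.

```python
import pandas as pd

# Hypothetical data; only the 2.3 billion value mirrors the video.
df = pd.DataFrame({
    "country": ["Norway", "Sweden", "Kenya", "Japan", "Unknown"],
    "population": [5.0e6, 9.5e6, 4.3e7, 1.27e8, 2.3e9],
})

# By default, describe summarises only the numeric columns,
# so the text column "country" is left out.
summary = df.describe()
print(summary)

# The row labelled "50%" is the median, and "max" exposes the suspicious value.
median_population = summary.loc["50%", "population"]
max_population = summary.loc["max", "population"]
print(median_population, max_population)
```

Comparing the maximum against the median makes the outlier easy to spot: in this made-up data the largest value is far above the typical population.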
Now it's your turn to calculate descriptive statistics for exploratory data analysis to see what needs cleaning in your data.
#PythonTutorial #DataCamp #Cleaning #Data #Python #Exploratory #dataanalysis