Python Tutorial: Tools of the data engineer
Want to learn more? Take the full course at https://www.datacamp.com/courses/introduction-to-data-engineering at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
Hello again. Great job on the exercises! You should now have a good understanding of what it means to be a data engineer. The data engineer moves data from several sources, processes or cleans it, and finally loads it into an analytical database. They do this using several tools. This video gives an overview of how data engineers fulfill their tasks using these tools. We'll go into the details in the second chapter.
First, data engineers are expert users of database systems. Roughly speaking, a database is a computer system that holds large amounts of data. You might have heard of SQL or NoSQL databases. If not, there are some excellent courses on DataCamp on these subjects. Often, applications rely on databases to provide certain functionality. For example, in an online store, a database holds product data like prices or amount in stock.
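To make the online store example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are made up for illustration; a real store would more likely run on a server database such as MySQL or PostgreSQL.

```python
import sqlite3

# In-memory SQLite database standing in for the store's application database.
# The schema and rows below are hypothetical illustration data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, name TEXT, price REAL, in_stock INTEGER)"
)
conn.executemany(
    "INSERT INTO products (name, price, in_stock) VALUES (?, ?, ?)",
    [("keyboard", 49.0, 12), ("mouse", 19.5, 0), ("monitor", 199.0, 3)],
)
conn.commit()


def products_in_stock(connection):
    """Return (name, price) for every product with at least one item in stock."""
    query = "SELECT name, price FROM products WHERE in_stock > 0 ORDER BY name"
    return connection.execute(query).fetchall()


available = products_in_stock(conn)
print(available)
```

The application only asks the database a question in SQL; how the rows are stored and looked up is the database's job.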
On the other hand, other databases hold data specifically for analyses. You'll find out more about the difference in later chapters. For now, it's essential to understand that the data engineer's task begins and ends at databases.
Second, data engineers use tools that help them process data quickly. Processing might be necessary to clean or aggregate data, or to join data from different sources. Typically, huge amounts of data have to be processed. That is where parallel processing comes into play. Instead of processing the data on one computer, data engineers use clusters of machines. Often, these tools abstract away the underlying architecture and expose a simple API.
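The idea of splitting work across workers can be sketched on a single machine. In this rough illustration, a thread pool from Python's standard library stands in for a cluster of machines, and the cleaning rule and sample names are assumptions made up for the example. A real cluster framework distributes the chunks over many computers, but the split, process, combine pattern is the same.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def clean_chunk(chunk):
    """Hypothetical cleaning step: drop missing or blank values, normalize the rest."""
    return [s.strip().lower() for s in chunk if s is not None and s.strip()]


def clean_in_parallel(items, chunk_size=2, workers=3):
    """Clean each chunk on a separate worker, then combine the results in order."""
    chunks = chunked(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        cleaned_chunks = list(pool.map(clean_chunk, chunks))
    return [value for chunk in cleaned_chunks for value in chunk]


raw_names = ["  Keyboard", "MOUSE ", None, "monitor", "   ", "Laptop"]
cleaned = clean_in_parallel(raw_names)
print(cleaned)
```

Note that the caller only sees clean_in_parallel; the chunking and the worker pool are hidden behind a simple function, which is the kind of abstraction these tools offer.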
Third, scheduling tools help to make sure data moves from one place to another at the correct time, at a specific interval. Data engineers make sure these jobs run in a timely fashion and in the right order. Sometimes processing jobs need to run in a particular order to function correctly. For example, tables from two databases might need to be joined together after they are both cleaned. In the following diagram, the JoinProductOrder job needs to run after CleanProduct and CleanOrder have run.
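The ordering constraint between these three jobs can be sketched with the graphlib module from Python's standard library (Python 3.9 or later). This only shows how a valid run order follows from the dependencies; it is not how any particular scheduling tool is implemented.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
dependencies = {
    "CleanProduct": set(),
    "CleanOrder": set(),
    "JoinProductOrder": {"CleanProduct", "CleanOrder"},
}


def run_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

The two cleaning jobs do not depend on each other, so they may come out in either order, but JoinProductOrder always comes last.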
Luckily, all of these tools are so common that there is a lot of choice in deciding which ones to use. In this slide, I'll present a few examples of each kind of tool. Please keep in mind that this list is not exhaustive, and that some companies might choose to build their own tools in-house. Two examples of databases are MySQL and PostgreSQL. Examples of processing tools are Spark and Hive. Finally, for scheduling, we can use Apache Airflow or Oozie, or we can use the simple Unix scheduling tool cron.
To sum everything up, you can think of the data engineering pipeline through this diagram. The pipeline extracts data through connections with several databases, transforms it using a cluster computing framework like Spark, and loads it into an analytical database. Everything is scheduled to run in a specific order through a scheduling framework like Airflow.
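The extract, transform, and load steps can be sketched end to end in plain Python. Here three in-memory SQLite databases stand in for the two source databases and the analytical database, and the transform runs in ordinary Python rather than on a cluster framework. All table names, column names, and rows are hypothetical.

```python
import sqlite3


def extract(conn, query):
    """Extract: pull rows out of a source database."""
    return conn.execute(query).fetchall()


def transform(products, orders):
    """Transform: join orders to products and sum the revenue per product name."""
    product_by_id = {pid: (name, price) for pid, name, price in products}
    revenue = {}
    for product_id, quantity in orders:
        name, price = product_by_id[product_id]
        revenue[name] = revenue.get(name, 0.0) + price * quantity
    return sorted(revenue.items())


def load(conn, rows):
    """Load: write the transformed rows into the analytical database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revenue_per_product "
        "(name TEXT PRIMARY KEY, revenue REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO revenue_per_product VALUES (?, ?)", rows)
    conn.commit()


# Two hypothetical source databases.
product_db = sqlite3.connect(":memory:")
product_db.execute("CREATE TABLE products (id INTEGER, name TEXT, price REAL)")
product_db.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "keyboard", 50.0), (2, "mouse", 20.0)],
)

order_db = sqlite3.connect(":memory:")
order_db.execute("CREATE TABLE orders (product_id INTEGER, quantity INTEGER)")
order_db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 2), (2, 1), (1, 1)])

# Hypothetical analytical database that receives the result.
warehouse = sqlite3.connect(":memory:")

products = extract(product_db, "SELECT id, name, price FROM products")
orders = extract(order_db, "SELECT product_id, quantity FROM orders")
load(warehouse, transform(products, orders))

result = warehouse.execute(
    "SELECT name, revenue FROM revenue_per_product ORDER BY name"
).fetchall()
print(result)
```

In a production pipeline, a scheduler would trigger these three steps in order at a fixed interval instead of running them once from a script.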
A small side note here is that the sources can be external APIs or other file formats too. We'll see this in the exercises.
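As a small sketch of a file-based source, the snippet below reads order rows from CSV text with Python's built-in csv module. The column names and values are made up, and io.StringIO stands in for a real file on disk.

```python
import csv
import io

# Hypothetical CSV export of orders; StringIO stands in for an opened file.
csv_text = "product_id,quantity\n1,2\n2,1\n"


def extract_from_csv(file_obj):
    """Read (product_id, quantity) pairs from a CSV file with a header row."""
    reader = csv.DictReader(file_obj)
    return [(int(row["product_id"]), int(row["quantity"])) for row in reader]


csv_orders = extract_from_csv(io.StringIO(csv_text))
print(csv_orders)
```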
Enough talking, let's do some exercises!