Python Tutorial: Introduction to data ingestion with Singer
Want to learn more? Take the full course at https://learn.datacamp.com/courses/building-data-engineering-pipelines-in-python at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
Hey, welcome back! Now that we know a bit about major concepts in a data platform, we should learn about getting data into its data lake! We’ll be exploring Singer, an open-source library that can connect with many data sources.
We spoke of data pipelines in the previous lesson. They bring data from one place to another, like water and gas pipes. To get data into your data lake, at some point you need to ingest it. There are several ways to do so, but it is convenient if, within an organizational unit, the process is standardized.
That is the aim of Singer as well: to be the “open-source standard for writing scripts that move data”.
At its core, Singer is a specification that describes how data extraction scripts and data loading scripts should communicate using a standard JSON-based data format over stdout. JSON is similar to Python dictionaries. And stdout is a standardized “location” to which programs write their output.
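To make the comparison between JSON and Python dictionaries concrete, here is a minimal sketch using only the standard library. The sample values are made up for illustration and are not taken from the course.

```python
import json
import sys

# A JSON document is plain text; parsing it yields ordinary Python objects.
# The contents of this sample are hypothetical.
json_text = '{"name": "Ada", "age": 36, "active": true, "manager": null}'
parsed = json.loads(json_text)

# A JSON object becomes a dict; true/false/null become True/False/None.
print(type(parsed).__name__)
print(parsed["active"], parsed["manager"])

# stdout is where print() writes by default; sys.stdout.write() does the
# same thing explicitly. Here we emit the data as one line of JSON.
sys.stdout.write(json.dumps(parsed) + "\n")
```

Note that JSON is text while a dictionary is an in-memory object, so the two look alike but are converted back and forth with the “json” module.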
Because Singer is a specification, these extraction scripts, which are called “taps”, and the loading scripts, which are called “targets”, can be written in any programming language. And they can easily be mixed and matched to create small data pipelines that move data from one place to another.
Taps and targets communicate using 3 kinds of messages, SCHEMA, STATE, and RECORD, which are sent to and read from specific streams.
A stream is a named virtual location to which you send messages, which can then be picked up at a downstream location. We can use different streams to partition data based on the topic. For example, error messages would go to an error stream, and data from different database tables could go to different streams as well.
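As a rough sketch of how the three message kinds and streams fit together, the snippet below builds a few messages by hand and groups the records by stream, the way a target might. The field names follow the Singer specification as we understand it; the stream names and values are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical messages as a tap might emit them, one JSON object per line.
id_schema = {"properties": {"id": {"type": "integer"}}}
lines = [
    json.dumps({"type": "SCHEMA", "stream": "employees",
                "schema": id_schema, "key_properties": ["id"]}),
    json.dumps({"type": "SCHEMA", "stream": "departments",
                "schema": id_schema, "key_properties": ["id"]}),
    json.dumps({"type": "RECORD", "stream": "employees", "record": {"id": 1}}),
    json.dumps({"type": "RECORD", "stream": "departments", "record": {"id": 7}}),
    json.dumps({"type": "RECORD", "stream": "employees", "record": {"id": 2}}),
    json.dumps({"type": "STATE", "value": {"last_employee_id": 2}}),
]

# A toy consumer: collect records per stream and remember the latest state.
records_by_stream = defaultdict(list)
state = None
for line in lines:
    message = json.loads(line)
    if message["type"] == "RECORD":
        records_by_stream[message["stream"]].append(message["record"])
    elif message["type"] == "STATE":
        state = message["value"]

print(dict(records_by_stream))
print(state)
```

Each RECORD carries the name of its stream, so data from different tables stays separated even though all messages travel over the same stdout.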
Imagine you would need to pass this set of data to a process. With the Singer spec, you would first describe the data, by specifying its schema.
The schema should be given as a valid “JSON schema”, which is another specification that allows you to annotate and even validate structured data. You specify the data type of each property or field. You could also impose constraints, like stating that the age should be an integer value between 1 and 130, as we’ve done here, or that a phone number should be in a certain format. The last two keys in this JSON object are the “$id” and “$schema”. They allow you to uniquely identify this schema within your organization and tell others which version of JSON schema is being used. They’re optional, but highly recommended in production-grade code.
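A schema along these lines could look like the sketch below. The field names, the “$id” URL and the draft version are illustrative assumptions, not necessarily the ones shown in the video. The small helper checks only the age constraint by hand; validating against the full schema would normally be left to a JSON schema validator such as the third-party “jsonschema” package.

```python
import json

# Hypothetical schema: property names, "$id" and the draft version are
# assumptions made for this example.
json_schema = {
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 1, "maximum": 130},
        "has_children": {"type": "boolean"},
    },
    "$id": "http://example.com/schemas/person.json",
    "$schema": "http://json-schema.org/draft-07/schema#",
}


def age_is_valid(value):
    """Hand-rolled check that mirrors only the 'age' rule above."""
    rule = json_schema["properties"]["age"]
    is_integer = isinstance(value, int) and not isinstance(value, bool)
    return is_integer and rule["minimum"] <= value <= rule["maximum"]


print(json.dumps(json_schema, indent=2))
print(age_is_valid(42), age_is_valid(131))
```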
You can tell the Singer library to make a SCHEMA message out of this JSON schema. You would call Singer’s “write_schema” function, passing it the “json_schema” we defined earlier.
With “stream_name” you specify the name of the stream this message belongs to. This can be anything you want. Data that belongs together should be sent to the same stream.
The “key_properties” attribute should equal a list of strings that make up the primary key for records from this stream. If you’re processing data from motorized vehicles on a parking lot for example, this could be the license plate or a surrogate key. If there is no primary key, then specify an empty list.
As you can see, the “write_schema” call simply wraps the actual JSON schema into a new JSON message and adds a few attributes. The message would be printed on a single line, but we’ve added line breaks here to fit the screen.
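To show what that wrapping amounts to, here is a standard-library sketch that imitates the effect of “write_schema” without importing Singer. The helper names “build_schema_message” and “write_schema_sketch” are hypothetical, as are the schema and stream name. With the singer-python package installed, the real call would, to our understanding, be “singer.write_schema(stream_name=..., schema=..., key_properties=...)”.

```python
import json
import sys

# Hypothetical schema and stream name, used only for this illustration.
json_schema = {
    "properties": {
        "id": {"type": "integer"},
        "age": {"type": "integer", "minimum": 1, "maximum": 130},
    }
}


def build_schema_message(stream_name, schema, key_properties):
    """Wrap a JSON schema in a SCHEMA message (a stand-in, not Singer's code)."""
    return {
        "type": "SCHEMA",
        "stream": stream_name,
        "schema": schema,
        "key_properties": key_properties,
    }


def write_schema_sketch(stream_name, schema, key_properties, out=None):
    """Serialize the message as a single line and write it to stdout."""
    out = sys.stdout if out is None else out
    line = json.dumps(build_schema_message(stream_name, schema, key_properties))
    out.write(line + "\n")
    return line


line = write_schema_sketch("employees", json_schema, ["id"])
```

The schema itself is untouched; the message only adds the type, the stream name and the key properties around it.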
JSON is a common format, not just in Singer, but in many other places. Python provides the “json” module to work with JSON.
To get objects in your code serialized as JSON, you would call either “json.dumps()” or “json.dump()”. The former simply transforms the object to a string, whereas the latter writes that same string to a file.
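The difference between the two functions can be seen in a short sketch. An in-memory “io.StringIO” buffer stands in for a real file here so the example does not touch the disk; the message content is hypothetical.

```python
import io
import json

# Hypothetical object to serialize.
message = {"type": "RECORD", "stream": "employees", "record": {"id": 1}}

# json.dumps() returns the JSON text as a Python string.
as_text = json.dumps(message)

# json.dump() writes the same text to a file-like object instead.
buffer = io.StringIO()
json.dump(message, buffer)

print(as_text)
print(buffer.getvalue() == as_text)
```

With a real file you would pass the handle returned by “open(path, "w")” as the second argument to “json.dump()”.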
Okay, now that we understand what JSON schema is and how Singer wraps that into messages, let’s put this into practice.