RUVIDEO
Поделитесь видео 🙏

Python Tutorial: Introduction to tokenization

Want to learn more? Take the full course at https://learn.datacamp.com/courses/introduction-to-natural-language-processing-in-python at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

---
In this video, we'll learn more about string tokenization!

Tokenization is the process of transforming a string or document into smaller chunks, which we call tokens. This is usually one step in the process of preparing a text for natural language processing.

There are many different theories and rules regarding tokenization, and you can create your own tokenization rules using regular expresssions, but normally tokenization will do things like break out words or sentences, often separate punctuation or you can even just tokenize parts of a string like separating all hashtags in a Tweet.

One library that is commonly used for simple tokenization is nltk, the natural language toolkit library. Here is a short example of using the word_tokenize method to break down a string into tokens. We can see from the result that words are separated and punctuation are individual tokens as well.

Why bother with tokenization? Because it can help us with some simple text processing tasks like mapping part of speech, matching common words and perhaps removing unwanted tokens like common words or repeated words.

Here, we have a good example. The sentence is: I don't like Sam's shoes. When we tokenize it we can clearly see the negation in the not and we can see possession with the 's. These indicators can help us determine meaning from simple text.

Beyond just tokenizing words, NLTK has plenty of other tokenizers you can use, including these ones you'll be working with in this chapter.

The sent_tokenize function will split a document into individual sentences.

The regexp_tokenize uses regular expressions to tokenize the string, giving you more granular control over the process.

And the tweettokenizer does neat things like recognize hashtags, mentions and when you have too many punctuation symbols following a sentence. How convenient!!!

You'll be using more regex in this section as well, not only when you are tokenizing, but also figuring out how to parse tokens and text. Using the regex module's re.match and re.search are pretty essential tools for Python string processing. Learning when to use search versus match can be challenging, so let's take a look at how they are different.

When we use search and match with the same pattern and string with the pattern is at the beginning of the string, we see we find identical matches. That is the case with matching and searching abcde with the pattern abc.

When we use search for a pattern that appears later in the string we get a result, but we don't get the same result using match.

This is because match will try and match a string from the beginning until it cannot match any longer.

Search will go through the ENTIRE string to look for match options. If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use match.

Now it's your turn to try some tokenization!


#DataCamp #PythonTutorial #Natural #tokenization #Language #Processing #Python

Что делает видео по-настоящему запоминающимся? Наверное, та самая атмосфера, которая заставляет забыть о времени. Когда вы заходите на RUVIDEO, чтобы посмотреть онлайн «Python Tutorial: Introduction to tokenization», вы рассчитываете на нечто большее, чем просто загрузку плеера. И мы это понимаем. Контент такого уровня заслуживает того, чтобы его смотрели в HD 1080, без дрожания картинки и бесконечного буферизации.

Честно говоря, Rutube сегодня — это кладезь уникальных находок, которые часто теряются в общем шуме. Мы же вытаскиваем на поверхность самое интересное. Будь то динамичный экшн, глубокий разбор темы от любимого автора или просто уютное видео для настроения — всё это доступно здесь бесплатно и без лишних формальностей. Никаких «заполните анкету, чтобы продолжить». Только вы, ваш экран и качественный поток.

Если вас зацепило это видео, не забудьте взглянуть на похожие материалы в блоке справа. Мы откалибровали наши алгоритмы так, чтобы они подбирали контент не просто «по тегам», а по настроению и смыслу. Ведь в конечном итоге, онлайн-кинотеатр — это не склад файлов, а место, где каждый вечер можно найти свою историю. Приятного вам отдыха на RUVIDEO!

Видео взято из открытых источников Rutube. Если вы правообладатель, обратитесь к первоисточнику.