Intermediate Python 12. Multi-processing spider
LC training (13:43) - What we're doing here is saying: parse_us is going to be a list of urls, basically a list of three-character urls. Then we come down here and map over it, basically the same as the previous tutorial where the return was a list like 0, 2, 4, 6, 8 and so on. This time the mapping returns, for every entry in parse_us, the list of links found on that page. You might already be able to guess where this goes: parse_us, the starting value, is a list of urls, and what do we get back from get_links? A list of urls. So this could be a process that just recurses infinitely. But not yet, because data is a list of lists. So what we say now is: data = [url for url in url_list for url_list in data], written properly as url for url_list in data for url in url_list. How is that for a list comprehension? What is it doing? For each of the mini url lists in all of the data we have, and for every url in each of those mini lists, we put that url into one new list which is just those contents. So this is a way of taking a list of lists and pulling the data out of each inner list into a single flat list. Fantastic.
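The flattening trick from the clip can be seen on its own with made-up data (the urls here are placeholders, not real crawl output):

```python
# A list of lists, shaped like what Pool.map(get_links, parse_us) returns:
data = [['http://abc.com/a', 'http://abc.com/b'], [], ['http://xyz.com/']]

# Flatten it: for each url_list inside data, take each url inside that list.
flat = [url for url_list in data for url in url_list]

print(flat)  # ['http://abc.com/a', 'http://abc.com/b', 'http://xyz.com/']
```

Note the clause order reads left to right like nested for-loops: the outer `for url_list in data` comes first, then the inner `for url in url_list`.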
Summary & Code results -
concept 1. a spider might have multiple sub-tasks, but the main task of a spider is to go to a website, find all of the links on that website, then go to all of those links, and slowly spiderweb out onto the entire interwebs.
pip install beautifulsoup4 lxml
from multiprocessing import Pool
import bs4 as bs
import random
import requests
import string
# Beautiful Soup is a Python library for pulling data out of HTML and XML files.
# It works with your favorite parser to provide idiomatic ways of navigating, searching,
# and modifying the parse tree. It commonly saves programmers hours or days of work.
def random_starting_url():
    starting = ''.join(random.SystemRandom().choice(string.ascii_lowercase) for _ in range(3))
    url = ''.join(['http://', starting, '.com'])
    return url

# url = random_starting_url()
# print(url)
def handle_local_links(url, link):
    if link.startswith('/'):  # a link starting with '/' is a path below the site root
        return ''.join([url, link])  # join the base url and that path and return it
    else:
        return link  # otherwise, just return the original link value
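As an aside, handle_local_links only covers links that begin with '/'. The standard library's urllib.parse.urljoin also handles relative paths and absolute links, so it is a sturdier alternative; a quick sketch (not part of the tutorial code, and the urls are made up):

```python
from urllib.parse import urljoin

base = 'http://abc.com/en/'
print(urljoin(base, '/careers/'))       # root-relative: http://abc.com/careers/
print(urljoin(base, 'about/'))          # relative: http://abc.com/en/about/
print(urljoin(base, 'http://def.com'))  # absolute links pass through unchanged
```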
def get_links(url):
    try:
        resp = requests.get(url)
        soup = bs.BeautifulSoup(resp.text, 'lxml')  # pulls data out of HTML and XML
        body = soup.body
        links = [link.get('href') for link in body.find_all('a')]  # grab the 'href' of every 'a' tag
        links = [handle_local_links(url, link) for link in links]  # if it starts with '/', join it onto the base url
        links = [str(link.encode("ascii")) for link in links]  # return the link values in ascii form
        return links
    except TypeError as e:
        ...  # (handler body elided in these notes)
        return []
    except Exception as e:
        print(str(e))
        return []
def main():
    how_many = 50
    p = Pool(processes=how_many)
    parse_us = [random_starting_url() for _ in range(how_many)]  # seed with random three-letter urls
    data = p.map(get_links, parse_us)  # spider out: each page's links land in the list-of-lists data
    data = [url for url_list in data for url in url_list]  # flatten all the url_lists into one list
    p.close()
    p.join()
    with open('urls.txt', 'w') as f:  # write the result out to a separate file
        f.write(str(data))

if __name__ == '__main__':
    main()
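As noted in the clip, feeding the flattened result straight back in as the next parse_us could recurse forever. One way to bound it is a fixed number of rounds plus a visited set; a sketch, where the stub fake_get_links is hypothetical and the built-in map stands in for p.map:

```python
def fake_get_links(url):
    # stand-in for get_links: pretend every page links to two new pages
    return [url + '/a', url + '/b']

seen = set()
parse_us = ['http://abc.com']
for _ in range(3):  # crawl three levels deep instead of recursing forever
    data = list(map(fake_get_links, parse_us))           # p.map in the real spider
    flat = [url for url_list in data for url in url_list]
    parse_us = [url for url in flat if url not in seen]  # only visit new urls next round
    seen.update(parse_us)

print(len(seen))  # 2 + 4 + 8 = 14 urls discovered
```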
# look up https://pythonprogramming.net/introduction-scraping.../
# covered just before these intermediate lectures.
-----------------------------------------------------------------------
C:\Users\USER\Desktop\Intermediate-Tutorials - python intermediate12.py
HTTPConnectionPool(host='bxw.com', port=80): Max retries exceeded with url: / (Caused by
...
NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001EF099842B0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
'NoneType' object has no attribute 'find_all'
Likely got None for links, so we are throwing this
...
HTTPConnectionPool(host='iyi.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001F8F1590B00>: Failed to establish a new connection: [WinError 10060] The connection failed because the connected party did not respond, or the connection was dropped because the host failed to respond'))
-------------------------------------------------------------------------
["b'#main'", "b'https://www.linde.com/'", "b'http://boc.com/en/about-boc/'", "b'http://boc.com/en/careers/'", "b'https://www.boconline.co.uk/shop/LogonForm...'", "b'https://www.boconline.co.uk/.../store-finder/index.html'", "b'https://www.boconline.co.uk/shop/en/uk/customer-information'", "b'http://boc.com/en/'", "b'http://boc.com/en/'", "b'http://boc.com/en/'",
...
"b'https://www.instagram.com/ellenbrussdesign/?hl=en'", "b'https://www.etsy.com/shop/HermannAndSmalls?ref'", "b'https://goo.gl/maps/YDGd4wFyqTE2'", "b'https://ebd.com/accessibility/'"]
Original video: https://www.youtube.com/watch?v=N0ph2a6Vd7M&list=PLQVvvaa0QuDfju7ADVp5W1GF9jVhjbX-_&index=13