Web Crawling with BeautifulSoup

import requests
from bs4 import BeautifulSoup

# Selector settings for the crawl
TAGNAME = "a"
ATTR = "class"
VALUE = "target-class"
TARGET = "http://randomblog.com/page/"
TG_ATTR = "href"

def crawler(max_pages):
    page = 1
    # use <= so that crawler(5) really visits pages 1 through 5
    while page <= max_pages:
        url = TARGET + str(page) + "/"
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        # print the href of every <a> tag carrying the target class
        for link in soup.find_all(TAGNAME, {ATTR: VALUE}):
            href = link.get(TG_ATTR)
            print(href)
        page += 1

crawler(5)
This example of a web crawler uses the BeautifulSoup library, which you need to install manually: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup. In this particular case, the crawler prints the href value of a link only if the link carries the defined class. Read the documentation for more use cases: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
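To try the same class-based filtering without hitting a live site (or installing bs4), here is a minimal sketch using only Python's standard-library html.parser. The class name, sample HTML, and helper names are illustrative, not part of the original example:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags whose class matches target_class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # the class attribute may hold several space-separated names
        if self.target_class in (attrs.get("class") or "").split():
            href = attrs.get("href")
            if href:
                self.links.append(href)

html = ('<a class="target-class" href="/post/1">One</a>'
        '<a class="other" href="/post/2">Two</a>')
parser = LinkExtractor("target-class")
parser.feed(html)
print(parser.links)  # only the link with the matching class

The same idea is what soup.find_all(TAGNAME, {ATTR: VALUE}) does for you in the crawler above, with bs4 additionally handling malformed markup.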
