import requests
from bs4 import BeautifulSoup

# Entry data for bs4: the tag, attribute, and value to match,
# plus the base URL to crawl and the attribute to extract
TAGNAME = "a"
ATTR = "class"
VALUE = "target-class"
TARGET = "http://randomblog.com/page/"
TG_ATTR = "href"

def crawler(max_pages):
    page = 1
    while page < max_pages:
        url = TARGET + str(page) + "/"
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        # Print the href of every <a> tag that carries the target class
        for link in soup.find_all(TAGNAME, {ATTR: VALUE}):
            href = link.get(TG_ATTR)
            print(href)
        page += 1

crawler(5)
This example of a web crawler uses the BeautifulSoup library, which you need to install separately: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup. In this particular case, the crawler prints each link's href value, but only for links that carry the defined class. Read the documentation for more use cases: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
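As one of those further use cases, the same filtering can also be expressed as a CSS selector through BeautifulSoup's select() method instead of find_all(). The sketch below is a minimal rework of the crawler above under that approach; crawler_css is a made-up name, and the URL and class are the same placeholder values used in the example, not a real site.

import requests
from bs4 import BeautifulSoup

# Minimal sketch: same crawl loop, but filtering with a CSS selector.
# "http://randomblog.com/page/" and "target-class" are the placeholder
# values from the example above.
def crawler_css(max_pages):
    for page in range(1, max_pages):
        url = "http://randomblog.com/page/" + str(page) + "/"
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        # select() takes a CSS selector: <a> tags with class "target-class"
        for link in soup.select("a.target-class"):
            print(link.get("href"))

crawler_css(5)

A selector can be handier when the condition involves nested tags or several classes at once, since the whole filter fits into a single selector string.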