Welcome, folks! In this blog post we will build a web scraper
that fetches the Alexa ranking
of a website. The full source code of the script is shown below.
Get Started
To get started, install the following libraries with the pip
command. The script also imports requests, so install it as well if it is not already present:
pip install requests
pip install beautifulsoup4
pip install validators
After installing these libraries, create an app.py
file and paste in the following code.
app.py
#!/usr/bin/python3
import re
import sys

import requests
import validators
from bs4 import BeautifulSoup

# Expect exactly one argument: the domain to look up
if len(sys.argv) != 2:
    print("Please pass the domain (URL without http(s)://)")
    print("python {0} domain\nEx: python {0} theayurveda.org".format(sys.argv[0]))
    sys.exit(1)

alexa_base_url = 'https://alexa.com/siteinfo/'
# str.lower() returns a new string, so the result has to be assigned
site_name = sys.argv[1].lower()


def is_valid_domain(site_name):
    # validators.domain() returns a falsy object when validation fails
    return bool(validators.domain(site_name))


if not is_valid_domain(site_name):
    print("Not a valid domain format {0} Exiting...".format(site_name))
    print("A valid domain looks like 'theayurveda.org' or 'www.theayurveda.org'")
    sys.exit(1)

url_for_rank = alexa_base_url + site_name

# Request the rank page for the domain
page = requests.get(url_for_rank)
soup = BeautifulSoup(page.content, 'html.parser')

# Country ranks are read from the div with id='CountryRank'
country_ranks = soup.find_all('div', id='CountryRank')

# Global rank is the element with class='data' inside class='rank-global'
global_rank = soup.select('.rank-global .data')

# Display the global rank safely
try:
    match = re.search(r'[\d,]+', global_rank[0].text.strip())
    print("Global Rank: ", match.group())
except (IndexError, AttributeError):
    # IndexError: element not found; AttributeError: no number in the text
    print("No global rank found for ", site_name)

# Display the country rank(s)
try:
    ranks_list = country_ranks[0].text.strip().split("\n")
    print("Country Rank: ")
    for rank in ranks_list:
        if re.search(r'#\d+', rank):
            print("\t", rank)
except IndexError:
    print("No country rank was found for ", site_name)
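The script prints the global rank as text such as "12,345". If you would rather work with a number, the regular-expression step can be pulled into a small helper. This is only a sketch: `parse_rank` is a hypothetical name that is not part of app.py, and the sample strings are made up rather than real Alexa data.

```python
import re
from typing import Optional


def parse_rank(text: str) -> Optional[int]:
    """Return the first comma-grouped number found in text as an int, or None."""
    # Match a digit followed by any run of digits and thousands separators
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop the thousands separators before converting
    return int(match.group().replace(",", ""))


# Made-up sample strings, similar in shape to what the page element might hold
print(parse_rank("  # 12,345  "))
print(parse_rank("no rank here"))
```

Returning `None` instead of raising lets the caller decide how to report a missing rank, which would replace the try/except around `match.group()`.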
Now you can run the
app.py script by typing the command below:
python app.py codingshiksha.com
The website's domain (without http:// or https://)
is passed to the script as its only command-line argument.
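Since the script expects a bare domain, it can be convenient to accept a full URL as well and strip it down before validation. Here is a minimal sketch using only the standard library; `normalize_domain` and `parse_args` are hypothetical helpers and are not part of app.py.

```python
import argparse
from urllib.parse import urlparse


def normalize_domain(raw: str) -> str:
    """Reduce a URL or bare domain to a lowercase host name."""
    raw = raw.strip().lower()
    # urlparse only fills in the host when a scheme or leading // is present
    if "://" not in raw:
        raw = "//" + raw
    return urlparse(raw).hostname or ""


def parse_args(argv=None):
    """Parse the single positional domain argument."""
    parser = argparse.ArgumentParser(description="Look up the rank of a domain")
    parser.add_argument("domain", type=normalize_domain,
                        help="for example codingshiksha.com")
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try without a shell
args = parse_args(["https://CodingShiksha.com/"])
print(args.domain)
```

With argparse, a missing argument also produces a usage message automatically, so the manual `len(sys.argv)` check would no longer be needed.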
The script prints the website's Alexa global rank,
followed by its country ranks,
one line per country listed for the website.
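The country ranks come from splitting the text of the rank block into lines and keeping only the lines that contain a `#` followed by digits. That filtering step can be tried on its own, as sketched below. The function name `country_rank_lines` and the sample text are assumptions for illustration, not real Alexa output.

```python
import re
from typing import List


def country_rank_lines(block_text: str) -> List[str]:
    """Keep only the lines that contain a rank such as '#12'."""
    lines = (line.strip() for line in block_text.split("\n"))
    return [line for line in lines if re.search(r"#\d+", line)]


# Made-up text standing in for the contents of the country rank block
sample = "Country A #12\n\n  Country B #340  \nNo rank here"
for line in country_rank_lines(sample):
    print("\t", line)
```

Blank lines and lines without a rank are dropped, which is why the script can print the block without cleaning up the page's whitespace first.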