Welcome, folks! In this blog post we will build an email scraper app in Python using the beautifulsoup4 library. The full source code of the application is given below.
Get Started
To get started, install the following library using the pip command shown below:
pip install beautifulsoup4
After installing the library, create an app.py file and paste in the following code:
app.py
    ############################################################
    # Jacob L. Chrzanowski
    # BeautifulSoup Email Scraper
    ############################################################

    import code
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    #url = "https://www.rit.edu/its/about/staff"
    url = "https://pastebin.com/37wZg26w"
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), "html.parser")

    #emails = re.findall(r"[A-Za-z0-9._%+-]+(\@|\[at\]|\[\@\])[A-Za-z0-9.-]+(\.|\['dot'\]|\[.\])[A-Za-z]{2,4}", str(bsObj))
    bsObj = str(bsObj)

    # Type 1: plain addresses such as name@domain.com
    emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", bsObj)
    visited_emails = set()
    for email in emails:
        #print(email)
        if email not in visited_emails:
            print('found email type1: ' + str(email))
            visited_emails.add(email)

    # Type 2: "@" written as [at]
    emails = re.findall(r"[A-Za-z0-9._%+-]+\[at\][A-Za-z0-9.-]+\.[A-Za-z]{2,4}", bsObj)
    visited_emails = set()
    for email in emails:
        #print(email)
        if email not in visited_emails:
            print('found email type2: ' + str(email))
            visited_emails.add(email)

    # Type 3: "." before the top-level domain written as [dot]
    emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\[dot\][A-Za-z]{2,4}", bsObj)
    visited_emails = set()
    for email in emails:
        #print(email)
        if email not in visited_emails:
            print('found email type3: ' + str(email))
            visited_emails.add(email)

    # Type 4: both [at] and [dot]
    emails = re.findall(r"[A-Za-z0-9._%+-]+\[at\][A-Za-z0-9.-]+\[dot\][A-Za-z]{2,4}", bsObj)
    visited_emails = set()
    for email in emails:
        #print(email)
        if email not in visited_emails:
            print('found email type4: ' + str(email))
            visited_emails.add(email)

    # links = bsObj.findAll("link")
    # for link in links:
    #     link = str(link)
    #     if r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}' in link:
    #         print(link)
    #         # reg = re.search(r'[\w\.-]+@[\w\.-]+', link)
    #         # reg.group(0)

    # code.interact(local=locals())

    # print(bsObj.prettify()[0:100])
    # print('\n\n\n')

    # #print([a["href"] for a in bsObj.select("a[href^=mailto:]")])

    # line = "should we use regex more often? let me know at 321dsasdsa@dasdsa.com"
    # match = re.search(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}', line)
    # match.group(0)

    quit()
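The four search loops in app.py differ only in the regular expression and the printed label, so they can be folded into one function. The following is a minimal sketch of that idea, not the author's code: the names `PATTERNS` and `find_emails` are made up for illustration, and the function works on any string, so you can try it without fetching a page.

```python
import re

# Building blocks copied from the regular expressions used in app.py.
LOCAL = r"[A-Za-z0-9._%+-]+"
DOMAIN = r"[A-Za-z0-9.-]+"
TLD = r"[A-Za-z]{2,4}"

# One pattern per "type" that app.py prints (hypothetical table, for illustration).
PATTERNS = {
    "type1": LOCAL + r"@" + DOMAIN + r"\." + TLD,
    "type2": LOCAL + r"\[at\]" + DOMAIN + r"\." + TLD,
    "type3": LOCAL + r"@" + DOMAIN + r"\[dot\]" + TLD,
    "type4": LOCAL + r"\[at\]" + DOMAIN + r"\[dot\]" + TLD,
}


def find_emails(text):
    """Return a dict mapping each type label to its unique matches,
    kept in order of first appearance."""
    results = {}
    for label, pattern in PATTERNS.items():
        unique = []
        for match in re.findall(pattern, text):
            if match not in unique:
                unique.append(match)
        results[label] = unique
    return results


if __name__ == "__main__":
    sample = (
        "alice@example.com alice@example.com "
        "bob[at]example.org carol@example[dot]net dave[at]example[dot]io"
    )
    for label, found in find_emails(sample).items():
        for email in found:
            print("found email " + label + ": " + email)
```

To use it with the scraper, you would pass the page text (the `bsObj` string in app.py) to `find_emails` instead of the sample string.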
Now execute the Python script app.py by typing the command below:
python app.py
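The script prints obfuscated addresses exactly as they appear on the page, and it resets its set of visited emails for every pattern, so the same address written two ways is reported twice. If you want one clean list, a possible follow-up step is sketched below. The helpers `normalize_email` and `unique_normalized` are hypothetical and not part of the original script, and the sketch assumes the only obfuscations are the literal `[at]` and `[dot]` markers the script searches for.

```python
def normalize_email(raw):
    """Replace the literal '[at]' and '[dot]' markers with '@' and '.'."""
    return raw.replace("[at]", "@").replace("[dot]", ".")


def unique_normalized(addresses):
    """Normalize every address and drop duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for address in addresses:
        normalized = normalize_email(address)
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique


if __name__ == "__main__":
    found = ["dave[at]example[dot]io", "dave@example.io", "bob[at]example.org"]
    for email in unique_normalized(found):
        print(email)
```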