Welcome, folks! In this blog post we will download PDF files from a URL using the beautifulsoup4 and requests libraries in Python. The full source code of the application is given below.
Get Started
To get started, install the following libraries:
pip install requests
pip install bs4
After installing these libraries, create an app.py file inside your Python project and paste the following code into it:
app.py
```python
# Import libraries
import requests
from bs4 import BeautifulSoup

# URL from which pdfs to be downloaded
url = "https://nanonets.com/blog/deep-learning-ocr/"

# Requests URL and get response object
response = requests.get(url)

# Parse text obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on webpage
links = soup.find_all('a')

i = 0

# From all links check for pdf link and
# if present download file
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        print("Downloading file: ", i)

        # Get response object for link
        response = requests.get(link.get('href'))

        # Write content in pdf file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
```
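One caveat: the script above assumes every `href` is already an absolute URL. On many pages, PDF links are relative (for example `/files/paper.pdf`), and `requests.get` would fail on those. A more robust sketch resolves each href against the page URL with `urllib.parse.urljoin` from the standard library. The helper below is my own assumption, not part of the original script:

```python
from urllib.parse import urljoin, urlparse

def resolve_pdf_links(page_url, hrefs):
    """Resolve each href against the page URL and keep only PDF links."""
    pdfs = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # turns relative paths into full URLs
        if urlparse(absolute).path.lower().endswith(".pdf"):
            pdfs.append(absolute)
    return pdfs

# hrefs as they might come out of soup.find_all('a') in the script above:
hrefs = ["/docs/a.pdf", "https://example.com/b.pdf", "/about"]
print(resolve_pdf_links("https://example.com/blog/", hrefs))
# → ['https://example.com/docs/a.pdf', 'https://example.com/b.pdf']
```

Checking the URL's path (rather than the raw string) also avoids false matches on query strings that merely contain the text ".pdf".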
This Python script contains the URL from which we will download the PDF files. Now execute the script by running the command below:
python app.py
After executing the script, you can see that it has downloaded all three PDF files from the URL and stored them in the root directory.
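As a quick sanity check, you can list the PDF files that landed in the directory. This small helper is my own sketch, not part of the tutorial:

```python
from pathlib import Path

def list_pdfs(directory="."):
    """Return the sorted names of all .pdf files in the given directory."""
    return sorted(p.name for p in Path(directory).glob("*.pdf"))

# After running app.py you would expect names like pdf1.pdf, pdf2.pdf, ...
print(list_pdfs())
```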