Python 3 Script to Scrape All PDF Files From Website URL Using BeautifulSoup4 and PyPDF2 Full Project For Beginners


Welcome, folks. In this blog post we will scrape all the PDF files linked from a website URL using the beautifulsoup4 and PyPDF2 libraries in Python 3. The full source code of the application is shown below.




Get Started




To get started, install the required libraries (pip install requests beautifulsoup4 "PyPDF2<3"), then create a new Python file and copy and paste the following code into it.



import requests
from bs4 import BeautifulSoup
import io
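# PdfFileReader is the legacy class name; it was removed in PyPDF2 3.0,
# so this script needs a PyPDF2 version below 3.0
from PyPDF2 import PdfFileReader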

# Set this to the address of the page you want to scrape
url = ""
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")

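# Collect the PDF links in a set so duplicates are dropped
list_of_pdf = set()
# Take the first <p> tag on the page and every <a> tag inside it
l = soup.find('p')
p = l.find_all('a')

for link in p:
    # Drop the last 5 characters of the href (assumed to be ".html") and append ".pdf"
    pdf_link = link.get('href')[:-5] + ".pdf"
    list_of_pdf.add(pdf_link)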

def info(pdf_path):
    # Download the PDF and read it from memory instead of saving it to disk
    response = requests.get(pdf_path)
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information

# Print the metadata of every PDF link that was collected
for i in list_of_pdf:
    info(i)
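
The [:-5] slice in the link loop only gives a sensible result when every href ends in .html and is already a full URL. If the page you are scraping has relative links or links that are already PDFs, a slightly more defensive conversion might look like the sketch below. The helper name to_pdf_url and the example.com addresses are hypothetical and not part of the original script; the helper uses only the Python standard library.

```python
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit


def to_pdf_url(href, base_url):
    """Hypothetical helper: resolve href against base_url and swap a
    .html/.htm extension for .pdf. Returns None if the link is unusable."""
    if not href:
        return None
    # Turn relative links such as "/docs/guide.html" into absolute URLs
    parts = urlsplit(urljoin(base_url, href))
    root, ext = posixpath.splitext(parts.path)
    ext = ext.lower()
    if ext == ".pdf":
        new_path = parts.path          # already a PDF link, keep the path
    elif ext in (".html", ".htm"):
        new_path = root + ".pdf"       # same idea as the [:-5] + ".pdf" slice
    else:
        return None                    # mailto: links, images, and so on
    # Rebuild the URL without any #fragment
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, ""))


print(to_pdf_url("/docs/guide.html", "https://example.com/index.html"))
# prints https://example.com/docs/guide.pdf
```

Inside the loop you could then call to_pdf_url(link.get('href'), url) and add the result to list_of_pdf only when it is not None. Whether a .pdf version of each page actually exists still depends on the website.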

