Welcome folks! In this blog post we will extract all the images from a PDF document
in Python using the PyMuPDF library (which is imported as fitz) together with Pillow.
The full source code of the application is given below.
Get Started
To get started, install the following libraries using the pip
command as shown below:
pip install pillow
pip install PyMuPDF
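A quick way to confirm the install worked is to check that both modules can be found before running anything else. This is a small sketch using only the standard library; the helper name missing_modules is made up for this post.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # PyMuPDF is imported as "fitz" and Pillow as "PIL"
    missing = missing_modules(["fitz", "PIL"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All set")
```

If anything is reported as missing, run the pip commands again inside the same Python environment you use to run the script.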
After installing these libraries in your Python project, create an app.py
file and copy and paste the following code into it:
app.py
import io

import fitz  # PyMuPDF
from PIL import Image

# path of the PDF file you want to extract images from
file = "###inputfilepath##.pdf"

# open the file
pdf_file = fitz.open(file)

# iterate over the PDF pages
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    # older PyMuPDF releases call this getImageList()
    image_list = page.get_images()

    # print the number of images found on this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images on page {page_index + 1}")
    else:
        print("[!] No images found on page", page_index + 1)

    for image_index, img in enumerate(image_list, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes (extractImage() in older releases)
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it into PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(f"image{page_index + 1}_{image_index}.{image_ext}")
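Since the extracted bytes are already an encoded image file (that is why there is an "ext" field), you can probably skip Pillow and write them straight to disk. A minimal sketch, assuming base_image is the dictionary from the code above with its "image" and "ext" keys; save_raw_image and out_dir are names made up for this example.

```python
from pathlib import Path


def save_raw_image(base_image, page_number, image_index, out_dir="."):
    """Write the already-encoded image bytes to disk and return the path."""
    file_name = f"image{page_number}_{image_index}.{base_image['ext']}"
    out_path = Path(out_dir) / file_name
    # create the output folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(base_image["image"])
    return out_path
```

Inside the inner loop you would call it with the page number and image index, for example save_raw_image(base_image, page_index + 1, image_index). This keeps the original bytes untouched instead of re-encoding them.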
In the above snippet, replace the placeholder input path
with the path of the PDF file you want to extract images from.
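If you would rather not edit the script every time, the path can come from the command line instead. A sketch using argparse from the standard library; the argument name pdf_path is just an illustration and not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Read the PDF path from the command line (or from `argv` if given)."""
    parser = argparse.ArgumentParser(description="Extract images from a PDF")
    parser.add_argument("pdf_path", help="path of the PDF file to read")
    return parser.parse_args(argv)

# In app.py you could then write:
#   file = parse_args().pdf_path
```

With that change the script would be started as python app.py followed by the path of your PDF file.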
Now run the script with the command below, and it will extract all the images
present in the PDF document and save them in the current directory:
python app.py
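One thing to watch for: an image reused on several pages (a logo, say) is listed on each of those pages, so the script saves it once per page. Below is a sketch for skipping the repeats, assuming the first item of each image entry is the XREF as in the code above; unique_images is a made-up helper name and the sample data is fake.

```python
def unique_images(pages):
    """Yield (page_number, image_index, xref) once per distinct XREF.

    `pages` is an iterable of image lists, one per page, in page order.
    """
    seen = set()
    for page_number, image_list in enumerate(pages, start=1):
        for image_index, img in enumerate(image_list, start=1):
            xref = img[0]  # first item of each entry is the XREF
            if xref in seen:
                continue  # already extracted from an earlier page
            seen.add(xref)
            yield page_number, image_index, xref


# Fake image lists standing in for what each page reports
pages = [[(5, "logo")], [(5, "logo"), (9, "chart")]]
print(list(unique_images(pages)))  # [(1, 1, 5), (2, 2, 9)]
```

In the real script you would pass it each page's image list and extract only the XREFs it yields.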