Python 3 PDFMiner Library Example to Extract or Read Text Content From PDF File Full Tutorial For Beginners


 

Welcome folks, today in this blog post we will look at how to extract text content from a PDF file in Python using the pdfminer library. The full source code of the application is given below.

 

 

 

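Get Started

To get started, install the pdfminer library in your Python project by running the pip command shown below. If that package does not work with your Python 3 setup, the maintained fork is published on PyPI as pdfminer.six and is imported under the same pdfminer name.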

 

pip install pdfminer
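Before moving on, it can help to confirm that the install actually worked. The snippet below is a minimal sketch of one way to check, using only the standard library; the helper name is_installed is made up for this example and is not part of pdfminer.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found.

    Hypothetical helper for this tutorial, not part of pdfminer.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("pdfminer"):
        print("pdfminer is available")
    else:
        print("pdfminer is not installed; run the pip command above first")
```

If the second message is printed, check that pip installed the package into the same Python environment you are running the script with.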

 

After installing the library, create an app.py file and paste in the following code

 

app.py

 

 

 

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    """Yield the text of each page in the PDF, one page at a time."""
    with open(pdf_path, 'rb') as fh:

        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):

            # use a fresh in-memory buffer and converter for every page
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()

            converter = TextConverter(resource_manager,
                                      fake_file_handle)

            page_interpreter = PDFPageInterpreter(resource_manager,
                                                  converter)

            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()

            # close open handles before handing the text back
            converter.close()
            fake_file_handle.close()

            yield text


def extract_text(pdf_path):
    """Print the text of every page, followed by a blank line."""
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()


# Driver code
if __name__ == '__main__':
    # extract_text prints the pages itself and returns nothing,
    # so it is called directly instead of being wrapped in print()
    extract_text('###pathofpdffile###')

 


The code above imports the pdfminer classes it needs, opens the PDF file at the given path, and prints the text of each page in turn. Replace ###pathofpdffile### with the path to your own PDF file. After this, run the Python application with the command below

 

python app.py

 

 

 

 

 

 

As you can see, the text is extracted from the PDF file and printed to the console one page at a time. This is how you can read the text content of a PDF file in Python.
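Printing is fine for a quick look, but often you want the text as one string or saved to a file. The sketch below shows one possible way to do that with the page texts that extract_text_by_page yields. The helper names join_pages and save_pages, and the choice to drop blank pages, are assumptions made for this example and not part of pdfminer. The helpers accept any iterable of strings, so the sample here uses a plain list as a stand-in for the generator.

```python
from typing import Iterable


def join_pages(pages: Iterable[str], separator: str = "\n\n") -> str:
    """Combine per-page text into one string, skipping pages that are blank."""
    cleaned = (page.strip() for page in pages)
    return separator.join(page for page in cleaned if page)


def save_pages(pages: Iterable[str], out_path: str) -> int:
    """Write the combined text to out_path and return the number of characters written."""
    text = join_pages(pages)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
    return len(text)


if __name__ == "__main__":
    # Stand-in for extract_text_by_page('your.pdf'); any iterable of page strings works
    sample_pages = ["First page text\n", "   ", "Second page text\n"]
    print(join_pages(sample_pages))
```

In app.py you could then call save_pages(extract_text_by_page(pdf_path), 'output.txt'), where output.txt is just an example file name.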

 
