Python pdf extract text

1/8/2023

Following the theme of my last post, I'm going to use another PDF focused on Indonesia's current energy situation: the Indonesia Energy Outlook 2019 report published by the Secretariat General of the National Energy Council. If you open the link to the PDF you will find a long report with many pages and figures. For the purpose of this post, I am only going to focus on extracting the text from the Executive Summary on pages xii and xiii. With the PDF and text identified, let's move on to using Python to extract the Executive Summary.

Note: The following code explanation is designed for the Google Colab environment. The library we will use to extract the PDF text is called PyPDF2. PyPDF2 can do much more than just extract text, and if you are curious about its other capabilities, you can read about them in its documentation.

This library isn't pre-installed in the Google Colab environment, so we will have to install it before importing PyPDF2 into our code.

    !pip install PyPDF2
    import PyPDF2

Before we move to the next step, make sure you have loaded the PDF document into the file repository on the left of the Colab environment. We must open the PDF as a file object before we can start using PyPDF2 on it.

    pdf_file_obj = open("/content/content-indonesia-energy-outlook-2019-english-version.pdf","rb")

Now we need to convert pdf_file_obj into a PyPDF2 object so that we can use the library to search through the Indonesia Energy Outlook and extract our text of interest.

    # Converting the object into a PDF Reader Object
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
    # If you want to find out the number of pages in the PDF use this command
    print(pdf_reader.numPages)

We know from looking at the original PDF that we are interested in pages 12 and 13, where the Executive Summary resides. We can pull out an individual page using the following method.

    # How to create a page object
    page_obj = pdf_reader.getPage(12)

If you print page_obj you will get something quite unreadable to the human eye. Now let's pull all the text from pages 12 and 13 and combine them to get the executive summary.

    # Getting Executive Summary
    page_obj1 = pdf_reader.getPage(12)
    page_obj2 = pdf_reader.getPage(13)
    executive_summary = page_obj1.extractText() + page_obj2.extractText()

You'll notice that the text has many instances of "\n" within it when you print it out. We can remove these with a simple one-liner.

    clean_text = executive_summary.replace("\n","")

Our Python Code: Making our Word document

We need one more library now so that we can create our Word document. The library we will be using is called python-docx. First, we install and import it into our environment.

    !pip install python-docx
    from docx import Document

With the necessary library installed, we must first create an empty document object and then build it up with the following steps.

    # Create Document object
    document = Document()
    # Add a heading to our word document
    document.add_heading('Executive Summary', 0)
    # Create a paragraph by feeding our document the extracted text
    p = document.add_paragraph(clean_text)

All we need to do now is save our document by calling document.save() with a .docx file name, and it will appear in our file repository on the left side of the Colab environment. The document won't look perfect and there will likely be a few minor cleanups to do, but you should have all the text from the executive summary.

For a simple test file, the same reading step looks like this:

    PdfReader = PyPDF2.PdfFileReader(pdfFileObj)

Output: A Simple PDF File This is a small demonstration.

In the above code, we have done the following things line by line:

Step 1: At the top of the code, we imported the PyPDF2 module.

Step 2: Open the PDF file using the open() method. This will create a file object for the PDF. We also passed one more argument, "rb", which means read binary. I am assuming the test.pdf file is stored in the same directory as the main program.

Step 3: The PdfFileReader function is used to read the data from that file object. It also accepts a few more arguments.

Having read the PDF file, we can now access some of its properties to get data.
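The extraction steps above rely on PdfFileReader, getPage, extractText and numPages. As far as I know, PyPDF2 3.0 and later no longer accept those names and expect PdfReader, reader.pages, page.extract_text() and len(reader.pages) instead. Below is a minimal sketch of the same page-combining step written against the newer names. The helper name extract_pages and the file name report.pdf are my own, purely for illustration, and the helper takes the reader as an argument so it works with any object that exposes pages in that shape.

```python
def extract_pages(reader, page_indices):
    """Concatenate the text of the given 0-based pages from a reader.

    `reader` is any object exposing a `pages` sequence whose items have an
    `extract_text()` method (assumption: PyPDF2.PdfReader in recent releases).
    """
    parts = []
    for index in page_indices:
        if not 0 <= index < len(reader.pages):
            raise IndexError(f"page index {index} is out of range")
        # Fall back to an empty string so the join below never fails.
        parts.append(reader.pages[index].extract_text() or "")
    return "".join(parts)


# Hypothetical usage, assuming PyPDF2 >= 3.0 and a local file "report.pdf":
#     from PyPDF2 import PdfReader
#     with open("report.pdf", "rb") as pdf_file:
#         executive_summary = extract_pages(PdfReader(pdf_file), [12, 13])
```

Passing the indices as a list keeps the two-page case from this post and a longer page range on the same code path.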

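One thing to watch with the newline cleanup: replacing "\n" with an empty string can fuse two words together if a line break sits directly between them. Here is a small sketch of a gentler variant that turns every run of whitespace into a single space. The function name normalize_whitespace and the sample string are made up for this example and are not from the report.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Replace newlines with spaces and collapse repeated whitespace."""
    # \s+ matches any run of spaces, tabs, and newlines.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample standing in for extracted PDF text.
sample = "Indonesia's energy\ndemand continues \n to grow.\n"
print(normalize_whitespace(sample))
# prints: Indonesia's energy demand continues to grow.
```

If you use this, you would pass its result to document.add_paragraph() in place of clean_text.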