Extract Text from PDF in Python: A Complete Guide with Practical Code Samples

Precisely Extract Text from PDF using Python

PDF files are everywhere—from contracts and research papers to eBooks and invoices. While they preserve formatting perfectly, extracting text from PDFs can be challenging, especially with large or complex documents. Manual copying is not only slow but often inaccurate.

Whether you’re a developer automating workflows, a data analyst processing content, or simply someone needing quick text extraction, programmatic methods can save you valuable time and effort.

In this comprehensive guide, you’ll learn how to extract text from PDF files in Python using Spire.PDF for Python — a powerful and easy-to-use PDF processing library. We’ll cover extracting all text, targeting specific pages or areas, ignoring hidden text, and capturing layout details such as text position and size.

Table of Contents

Why Extract Text from PDF Files

Text extraction from PDFs is essential for many use cases, including:

  • Automating data entry and document processing
  • Enabling full-text search and indexing
  • Performing data analysis on reports and surveys
  • Extracting content for machine learning and NLP
  • Converting PDFs to other editable formats

Install Spire.PDF for Python: Powerful PDF Parser Library

Spire.PDF for Python is a comprehensive and easy-to-use PDF processing library that simplifies all your PDF manipulation needs. It offers advanced text extraction capabilities that work seamlessly with both simple and complex PDF documents.

Installation

The library can be installed easily via pip. Open your terminal and run the following command:

pip install spire.pdf

Need help with the installation? Follow this step-by-step guide: How to Install Spire.PDF for Python on Windows

Extract Text from PDF (Basic Example)

If you just want to quickly read all the text from a PDF, this simple example shows how to do it. It iterates over each page, extracts the full text using PdfTextExtractor, and saves it to a text file with spacing and line breaks preserved.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Prepare a variable to hold the extracted text
all_text = ""

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()
# Extract all text including whitespaces
extractOptions.IsExtractAllText = True

# Loop through all pages and extract text
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    textExtractor = PdfTextExtractor(page)
    text = textExtractor.ExtractText(extractOptions)
    # Append text from each page
    all_text += text + "\n"

# Write all extracted text to a file
with open('output/TextOfAllPages.txt', 'w', encoding='utf-8') as file:
    file.write(all_text)

Advanced Text Extraction Features

For greater control over what and how text is extracted, Spire.PDF for Python offers advanced options. You can selectively extract content from specific pages or regions, or even with layout details, such as text position and size, to better suit your specific data processing needs.

Retrieve Text from Selected Pages

Instead of processing an entire PDF, you can target specific pages for text extraction. This is especially useful for large documents where only certain sections are relevant for your task.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Create a PdfTextExtractOptions object and enable full text extraction
extractOptions = PdfTextExtractOptions()
# Extract all text including whitespaces
extractOptions.IsExtractAllText = True

# Get a specific page (e.g., page 2)
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Extract text from the page
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a file using UTF-8 encoding
with open('output/TextOfPage.txt', 'w', encoding='utf-8') as file:
    file.write(text)

Retrieve Text from Selected Pages Output

Get Text from Defined Area

When dealing with structured documents like forms or invoices, extracting text from a specific region can be more efficient. You can define a rectangular area and extract only the text within that boundary on the page.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Get a specific page (e.g., page 2)
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Define the rectangular area to extract text from
# RectangleF(left, top, width, height)
extractOptions.ExtractArea = RectangleF(0.0, 100.0, 890.0, 80.0)

# Extract text from the specified area, keeping white spaces
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a file using UTF-8 encoding
with open('output/TextOfRectangle.txt', 'w', encoding='utf-8') as file:
    file.write(text)

Retrieve Text from Defined Area Output

Ignore Hidden Text During Extraction

Some PDFs contain hidden or invisible text, often used for accessibility or OCR layers. You can choose to ignore such content during extraction to focus only on what is actually visible to users.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()
# Ignore hidden text during extraction
extractOptions.IsShowHiddenText = False

# Get a specific page (e.g., page 2)
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Extract text from the page
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a file using UTF-8 encoding
with open('output/ExcludeHiddenText.txt', 'w', encoding='utf-8') as file:
    file.write(text)

Retrieve Text with Position (Coordinates) and Size Information

For layout-sensitive applications—such as converting PDF content into editable formats or reconstructing page structure—you can extract text along with its position and size. This provides precise control over how content is interpreted and used.

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Loop through all pages of the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]

    # Create a PdfTextFinder object for the current page
    finder = PdfTextFinder(page)

    # Find all text fragments on the page
    fragments = finder.FindAllText()

    print(f"Page {i + 1}:")

    # Loop through all text fragments
    for fragment in fragments:
        # Extract text content from the current text fragment
        text = fragment.Text

        # Get bounding rectangles with position and size
        rects = fragment.Bounds

        print(f'Text: "{text}"')

        # Iterate through all rectangles
        for rect in rects:
            # Print the position and size information of the current rectangle
            print(f"Position: ({rect.X}, {rect.Y}), Size: ({rect.Width} x {rect.Height})")

        print()

Conclusion

Extracting text from PDF files in Python becomes efficient and flexible with Spire.PDF for Python. Whether you need to process entire documents or extract text from specific pages or regions, Spire.PDF provides a robust set of tools to meet your needs. By automating text extraction, you can streamline workflows, power intelligent search systems, or prepare data for analysis and machine learning.

FAQs

Q1: Can text be extracted from password-protected PDFs?

A1: Yes, Spire.PDF for Python can open and extract text from secured files by providing the correct password when loading the PDF document.

Q2: Is batch text extraction from multiple PDFs supported?

A2: Yes, you can programmatically iterate through a directory of PDF files and apply text extraction to each file efficiently using Spire.PDF for Python.

Q3: Is it possible to extract images or tables from PDFs?

A3: While this guide focuses on text extraction, Spire.PDF for Python also supports image extraction and table extraction.

Q4: Can text be extracted from scanned (image-based) PDFs?

A4: Extracting text from scanned PDFs requires OCR (Optical Character Recognition). Spire.PDF for Python does not include built-in OCR, but you can combine it with an OCR library like Spire.OCR for image-to-text conversion.

Get a Free License

To fully experience the capabilities of Spire.PDF for Python without any evaluation limitations, you can request a free 30-day trial license.