Reading PowerPoint Files in Python: Extract Text, Images & More

Python Read PowerPoint Documents

PowerPoint (PPT & PPTX) files are rich with diverse content, including text, images, tables, charts, shapes, and metadata. Extracting these elements programmatically can unlock a wide range of use cases, from automating repetitive tasks to performing in-depth data analysis or migrating content across platforms.

In this tutorial, we'll explore how to read PowerPoint documents in Python using Spire.Presentation for Python, a powerful library for processing PowerPoint files.

Table of Contents:

1. Python Library to Read PowerPoint Files

To work with PowerPoint files in Python, we'll use Spire.Presentation for Python. This feature-rich library enables developers to create, edit, and read content from PowerPoint presentations efficiently. It allows for the extraction of text, images, tables, SmartArt, and metadata with minimal coding effort.

Before we begin, install the library using pip:

pip install spire.presentation

Now, let's dive into different ways to extract content from PowerPoint files.

2. Extracting Text from Slides in Python

PowerPoint slides contain text in various forms—shapes, tables, SmartArt, and more. We'll explore how to extract text from each of these elements.

2.1 Extract Text from Shapes

Most text in PowerPoint slides resides within shapes (text boxes, labels, etc.). Here’s how to extract text from shapes:

Steps-by-Step Guide

  • Initialize the Presentation object and load your PowerPoint file.
  • Iterate through each slide and its shapes.
  • Check if a shape is an IAutoShape (a standard text container).
  • Extract text from each paragraph in the shape.

Code Example

from spire.presentation import *
from spire.presentation.common import *

# Create an object of Presentation class
presentation = Presentation()

# Load a PowerPoint presentation
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Create a list
text = []

# Loop through the slides in the document
for slide_index, slide in enumerate(presentation.Slides):
    # Add slide marker
    text.append(f"====slide {slide_index + 1}====")

    # Loop through the shapes in the slide
    for shape in slide.Shapes:
        # Check if the shape is an IAutoShape object
        if isinstance(shape, IAutoShape):
            # Loop through the paragraphs in the shape
            for paragraph in shape.TextFrame.Paragraphs:
                # Get the paragraph text and append it to the list
                text.append(paragraph.Text)

# Write the text to a txt file
with open("output/ExtractAllText.txt", "w", encoding='utf-8') as f:
    for s in text:
        f.write(s + "\n")

# Dispose resources
presentation.Dispose()

Output:

A text file containing all extracted text from shapes in the presentation.

2.2 Extract Text from Tables

Tables in PowerPoint store structured data. Extracting this data requires iterating through each cell to maintain the table’s structure.

Step-by-Step Guide

  • Initialize the Presentation object and load your PowerPoint file.
  • Iterate through each slide to access its shapes.
  • Identify table shapes (ITable objects).
  • Loop through rows and cells to extract text.

Code Example

from spire.presentation import *
from spire.presentation.common import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint file
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Create a list for tables
tables = []

# Loop through the slides
for slide in presentation.Slides:
    # Loop through the shapes in the slide
    for shape in slide.Shapes:
        # Check whether the shape is a table
        if isinstance(shape, ITable):
            tableData = ""

            # Loop through the rows in the table
            for row in shape.TableRows:
                rowData = ""

                # Loop through the cells in the row
                for i in range(row.Count):
                    # Get the cell value
                    cellValue = row[i].TextFrame.Text
                    # Add cell value with spaces for better readability
                    rowData += (cellValue + "  |  " if i < row.Count - 1 else cellValue)
                tableData += (rowData + "\n")
            tables.append(tableData)

# Write the tables to text files
for idx, table in enumerate(tables, start=1):
    fileName = f"output/Table-{idx}.txt"
    with open(fileName, "w", encoding='utf-8') as f:
        f.write(table)

# Dispose resources
presentation.Dispose()

Output:

A text file containing structured table data from the presentation.

2.3 Extract Text from SmartArt

SmartArt is a unique feature in PowerPoint used for creating diagrams. Extracting text from SmartArt involves accessing its nodes and retrieving the text from each node.

Step-by-Step Guide

  • Load the PowerPoint file into a Presentation object.
  • Iterate through each slide and its shapes.
  • Identify ISmartArt shapes in slides.
  • Loop through each node in the SmartArt.
  • Extract and save the text from each node.

Code Example

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint file
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Iterate through each slide in the presentation
for slide_index, slide in enumerate(presentation.Slides):

    # Create a list to store the extracted text for the current slide
    extracted_text = []

    # Loop through the shapes on the slide and find the SmartArt shapes
    for shape in slide.Shapes:
        if isinstance(shape, ISmartArt):
            smartArt = shape
            
            # Extract text from the SmartArt nodes and append to the list
            for node in smartArt.Nodes:
                extracted_text.append(node.TextFrame.Text)

    # Write the extracted text to a separate text file for each slide
    if extracted_text:  # Only create a file if there's text extracted
        file_name = f"output/SmartArt-from-slide-{slide_index + 1}.txt"
        with open(file_name, "w", encoding="utf-8") as text_file:
            for text in extracted_text:
                text_file.write(text + "\n")

# Dispose resources
presentation.Dispose()

Output:

A text file containing node text of a SmartArt.

You might also be interested in: Read Speaker Notes in PowerPoint in Python

3. Saving Images from Slides in Python

In addition to text, slides often contain images that may be important for your analysis. This section will show you how to save images from the slides.

Step-by-Step Guide

  • Initialize the Presentation object and load your PowerPoint file.
  • Access the Images collection in the presentation.
  • Iterate through each image and save it in a desired format (e.g., PNG).

Code Example

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint document
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Get the images in the document
images = presentation.Images

# Iterate through the images in the document
for i, image in enumerate(images):

    # Save a certain image in the specified path
    ImageName = "Output/Images_"+str(i)+".png"
    image.Image.Save(ImageName)

# Dispose resources
presentation.Dispose()

Output:

PNG files of all images extracted from the PowerPoint.

4. Accessing Metadata (Document Properties) in Python

Extracting metadata provides insights into the presentation, such as its title, author, and keywords. This section will guide you on how to access and save this metadata.

Step-by-Step Guide

  • Create and load your PowerPoint file into a Presentation object.
  • Access the DocumentProperty object.
  • Extract properties like Title , Author , and Keywords .

Code Example

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation object
presentation = Presentation()

# Load a PowerPoint document
presentation.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pptx")

# Get the DocumentProperty object
documentProperty = presentation.DocumentProperty

# Prepare the content for the text file
properties = [
    f"Title: {documentProperty.Title}",
    f"Subject: {documentProperty.Subject}",
    f"Author: {documentProperty.Author}",
    f"Manager: {documentProperty.Manager}",
    f"Company: {documentProperty.Company}",
    f"Category: {documentProperty.Category}",
    f"Keywords: {documentProperty.Keywords}",
    f"Comments: {documentProperty.Comments}",
]

# Write the properties to a text file
with open("output/DocumentProperties.txt", "w", encoding="utf-8") as text_file:
    for line in properties:
        text_file.write(line + "\n")

# Dispose resources
presentation.Dispose()

Output:

A text file containing PowerPoint document properties.

You might also be interested in: Add Document Properties to a PowerPoint File in Python

5. Conclusion

With Spire.Presentation for Python, you can effortlessly read and extract various elements from PowerPoint files—such as text, images, tables, and metadata. This powerful library streamlines automation tasks, content analysis, and data migration, allowing for efficient management of PowerPoint files. Whether you're developing an analytics tool, automating document processing, or managing presentation content, Spire.Presentation offers a robust and seamless solution for programmatically handling PowerPoint files.

6. FAQs

Q1. Can Spire.Presentation handle password-protected PowerPoint files?

Yes, Spire.Presentation can open and process password-protected PowerPoint files. To access an encrypted file, use the LoadFromFile() method with the password parameter:

presentation.LoadFromFile("encrypted.pptx", "yourpassword")

Q2. How can I read comments from PowerPoint slides?

You can read comments from PowerPoint slides using the Spire.Presentation library. Here’s how:

from spire.presentation import Presentation

presentation = Presentation()
presentation.LoadFromFile("Input.pptx")

with open("PowerPoint_Comments.txt", "w", encoding="utf-8") as file:
    for slide_idx, slide in enumerate(presentation.Slides):
        if len(slide.Comments) > 0:
            for comment_idx, comment in enumerate(slide.Comments):
                file.write(f"Comment {comment_idx + 1} from Slide {slide_idx + 1}: {comment.Text}\n")

Q3. Does Spire.Presentation preserve formatting when extracting text?

Basic text extraction retrieves raw text content. For formatted text (fonts, colors), you would need to access additional properties like TextRange.LatinFont and TextRange.Fill .

Q4. Are there any limitations on file size when reading PowerPoint files in Python?

While Spire.Presentation can handle most standard presentations, extremely large files (hundreds of MB) may require optimization for better performance.

Q5. Can I create or modify PowerPoint documents using Spire.Presentation?

Yes, you can create PowerPoint documents and modify existing ones using Spire.Presentation. The library provides a range of features that allow you to add new slides, insert text, images, tables, and shapes, as well as edit existing content.

Get a Free License

To fully experience the capabilities of Spire.Presentation for Python without any evaluation limitations, you can request a free 30-day trial license.