Extract Hyperlinks from PDF in Python



Extracting hyperlinks from a PDF in Python can be done with several libraries, such as PyPDF2, PDFMiner, and pdfx.

  • PyPDF2 : A pure-Python library that acts as a PDF toolkit and allows us to read and manipulate PDF files.

  • PDFMiner : A tool for extracting information from PDF documents; it focuses entirely on getting and analyzing text data.

  • pdfx : A module used to extract metadata, plain text and URLs from a given PDF.

Using PyPDF2

PyPDF2 is mainly capable of extracting data, merging PDFs, and splitting and rotating pages. This approach involves reading the PDF file, converting it into text, and then extracting the URLs from that text with a regular expression.

Install PyPDF2

To use the PyPDF2 library, we first have to install it with the command below.

pip install PyPDF2
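To confirm the installation, the package can be imported and its version printed (an optional sanity check; it assumes the installed release exposes __version__, which recent PyPDF2 releases do).

# Optional sanity check: import the package and print its version
import PyPDF2
print(PyPDF2.__version__)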

Reading the PDF file

The below code opens the PDF file in binary mode ('rb') to create a file object, which is then passed to PyPDF2.PdfReader to create a pdfReader object that gives access to the content inside the PDF.

pdfReader.pages holds the pages of the PDF, and the extract_text() method extracts the text of each page as we iterate over them.

import PyPDF2

file = "Enter PDF File Name"
pdfFileObject = open(file, 'rb')
pdfReader = PyPDF2.PdfReader(pdfFileObject)

# Iterate through all pages and print their text
for pageObject in pdfReader.pages:
    pdf_text = pageObject.extract_text()
    print(pdf_text)

pdfFileObject.close()

Regular Expression to find URL

A regular expression (regex) is used to search for specific patterns in text, such as URLs. In the below code, the findall() method searches the text extracted from each PDF page and returns a list of the URLs it contains: the pattern r"(https?://\S+)" matches strings that start with http:// or https:// and continue up to the next whitespace.

# Import modules
import PyPDF2
import re

# Enter file name
file = "Enter PDF File Name"

# Open the PDF file
pdfFileObject = open(file, 'rb')

pdfReader = PyPDF2.PdfReader(pdfFileObject)

# Regular expression (get URLs from a string)
def Find(string):

    # findall() returns every substring that starts with http:// or https://
    regex = r"(https?://\S+)"
    url = re.findall(regex, string)
    return url

# Iterate through all pages
for pageObject in pdfReader.pages:

    # Extract text from the page
    pdf_text = pageObject.extract_text()

    # Print all URLs found on the page
    print(Find(pdf_text))

# Close the PDF
pdfFileObject.close()

Sample input file: Test.pdf


Output

['http://www.education.gov.yk.ca/']
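The regular expression used in Find() works on any string, not just PDF text; for example, with a made-up sentence:

import re

# Same pattern as in the script above
regex = r"(https?://\S+)"

# Made-up sample text, purely for illustration
sample = "Visit https://www.example.com or http://docs.example.org/page for details."
print(re.findall(regex, sample))
# ['https://www.example.com', 'http://docs.example.org/page']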

Using pdfx

The pdfx module is used specifically to extract URLs, metadata and plain text from a given PDF file. This approach makes extracting the URLs simpler compared to PyPDF2.

Install pdfx

Install it with pip using the command below.

pip install pdfx

Example

In the below code, pdfx.PDFx() reads the given PDF file and the get_references_as_dict() method returns a dictionary containing the URLs found in the PDF file.

# Import Module
import pdfx 
 
# Read PDF File
pdf = pdfx.PDFx("File Name") 
 
# Get the URLs found in the PDF as a dictionary
print(pdf.get_references_as_dict())

Output

{'url':['http://www.education.gov.yk.ca/']}
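The same PDFx object also exposes the document's metadata and plain text mentioned earlier; a short sketch, assuming the get_metadata() and get_text() methods documented for pdfx and a placeholder file name:

import pdfx

# Read the PDF file (placeholder name)
pdf = pdfx.PDFx("File Name")

# Document metadata as a dictionary (method name as documented for pdfx)
print(pdf.get_metadata())

# Plain text of the whole document
print(pdf.get_text())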

Using PDFMiner

Compared to the PyPDF2 tool, PDFMiner is a more powerful and complex library. It allows detailed extraction of text, hyperlinks and the structure of a PDF file, reading the PDF by converting each page into a tree of layout elements.

Install PDFMiner

To use the PDFMiner library, you first need to install it with the command below.

pip install pdfminer.six

Example

The below example reads the PDF page by page using PDFMiner's low-level API and extracts the hyperlinks from each page's link annotations. PDFPage.create_pages() yields the pages of the given PDF file, resolve1() resolves the annotation objects, and each annotation of subtype Link carries a URI action that holds the hyperlink target.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

file = "Enter PDF File Name"

with open(file, 'rb') as pdfFileObject:
    parser = PDFParser(pdfFileObject)
    document = PDFDocument(parser)

    # Iterate through PDF pages
    for page in PDFPage.create_pages(document):
        # Hyperlinks live in the page's /Annots entry, not in the text layout
        if not page.annots:
            continue
        for annotation in resolve1(page.annots):
            annotation = resolve1(annotation)
            subtype = annotation.get('Subtype')
            # Link annotations carry their target in a URI action
            if getattr(subtype, 'name', None) == 'Link' and 'A' in annotation:
                action = resolve1(annotation['A'])
                uri = resolve1(action.get('URI'))
                if uri:
                    # Extracting hyperlinks
                    url = uri.decode('utf-8') if isinstance(uri, bytes) else uri
                    print(f"Found hyperlink: {url}")

Output

Found hyperlink: http://www.education.gov.yk.ca/
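If only the URLs that appear as visible text are needed, PDFMiner's high-level extract_text() function can be combined with the same regular expression used in the PyPDF2 section; a minimal sketch with a placeholder file name:

import re
from pdfminer.high_level import extract_text

file = "Enter PDF File Name"

# Extract the whole document as plain text
text = extract_text(file)

# Same pattern as in the PyPDF2 section
print(re.findall(r"(https?://\S+)", text))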
