
Extract Hyperlinks from PDF in Python
Hyperlinks can be extracted from a PDF in Python with several libraries, such as PyPDF2, PDFMiner and pdfx.
- PyPDF2: A third-party Python PDF toolkit that lets us read and manipulate PDF files.
- PDFMiner: A tool for extracting information from PDF documents; it focuses on getting and analyzing text data.
- pdfx: A module for extracting metadata, plain text and URLs from a given PDF.
Using PyPDF2
PyPDF2 is mainly used for extracting data, merging PDFs, and splitting and rotating pages. This approach reads the PDF file, converts each page to text, and then extracts the URLs from that text with a regular expression.
Install PyPDF2
To use the PyPDF2 library, install it with the command below.
pip install PyPDF2
Reading the PDF file
The code below opens the PDF file in binary mode ('rb') and passes the file object to PyPDF2.PdfReader, which creates a pdfReader object for interacting with the content of the PDF.
pdfReader.pages holds the pages of the PDF, and the extract_text() method extracts the text of each page as we iterate over them. In PyPDF2 versions before 3.0 these were named PdfFileReader, numPages, getPage() and extractText(); the old names no longer work in PyPDF2 3.0 and later.
import PyPDF2

file = "Enter PDF File Name"

# Open the PDF in binary mode; the with block closes it automatically
with open(file, 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfReader(pdfFileObject)
    # Extract and print the text of each page
    for pageObject in pdfReader.pages:
        pdf_text = pageObject.extract_text()
        print(pdf_text)
Regular Expression to find URL
Regular expressions (regex) search text for specific patterns such as URLs. In the code below, the findall() method searches the text extracted from each PDF page and returns a list of the URLs it finds. The pattern r"(https?://\S+)" matches strings that start with http:// or https://.
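Before applying the pattern to PDF text, it can help to try it on a plain string. The sketch below is only an illustration: find_urls, the sample sentence and the example.com address are hypothetical, and stripping trailing punctuation and duplicates is an extra step that the pattern itself does not perform.

```python
import re

# Same pattern as in the text: anything that starts with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")

def find_urls(text):
    """Return the URLs in text, in order, without duplicates."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # \S+ also swallows punctuation that directly follows a URL,
        # so trim the most common trailing characters.
        cleaned = match.rstrip(".,;:!?)]}'\"")
        if cleaned and cleaned not in urls:
            urls.append(cleaned)
    return urls

# Hypothetical sample text (example.com is a placeholder address)
sample = ("See http://www.education.gov.yk.ca/ and "
          "(https://example.com/docs). Again: https://example.com/docs")
print(find_urls(sample))
```

The same function could be called on the text returned for each PDF page instead of the sample string.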
# Import modules
import PyPDF2
import re

# Enter file name
file = "Enter PDF File Name"

# Regular expression (get URLs from a string)
def Find(string):
    # findall() returns every substring that starts with http:// or https://
    regex = r"(https?://\S+)"
    return re.findall(regex, string)

# Open the file; the with block closes it when done
with open(file, 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfReader(pdfFileObject)
    # Iterate through all pages
    for pageObject in pdfReader.pages:
        # Extract text from page
        pdf_text = pageObject.extract_text()
        # Print all URLs
        print(Find(pdf_text))
Input file: Test.pdf
Output
['http://www.education.gov.yk.ca/']
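The regular expression only finds URLs that appear in the page text. A link attached to a word or an image is usually stored as an annotation on the page instead. The following is a minimal sketch of reading those annotations, assuming a PyPDF2 version that provides PdfReader (2.0 or later). The names link_uris and extract_link_uris are hypothetical helpers, and /Annots, /A and /URI are the standard PDF dictionary keys for link annotations.

```python
def link_uris(page):
    """Return the URI targets of the link annotations on one page."""
    uris = []
    if "/Annots" not in page:
        return uris
    for ref in page["/Annots"]:
        # Annotations are usually stored as indirect references
        annot = ref.get_object()
        if "/A" in annot and "/URI" in annot["/A"]:
            uris.append(str(annot["/A"]["/URI"]))
    return uris

def extract_link_uris(path):
    """Collect link annotation URIs from every page of the PDF at path."""
    # Imported here so that link_uris() can be used without PyPDF2 installed
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    return [uri for page in reader.pages for uri in link_uris(page)]
```

Calling extract_link_uris("Test.pdf") would return a list of URI strings. Whether it finds more or fewer links than the regex approach depends on how the PDF was produced.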
Using pdfx
The pdfx module is designed specifically to extract URLs, metadata and plain text from a given PDF file. This makes extracting URLs simpler than with PyPDF2.
Install pdfx
Install pdfx with pip:
pip install pdfx
Example
In the code below, pdfx.PDFx() reads the given PDF file and the get_references_as_dict() method returns a dictionary containing the URLs found in the PDF file.
# Import module
import pdfx

# Read PDF file
pdf = pdfx.PDFx("File Name")

# Get dictionary of URLs
print(pdf.get_references_as_dict())
Output
{'url':['http://www.education.gov.yk.ca/']}
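The result is a dictionary that maps a reference type to a list of references, so it usually needs one more step before the links can be used individually. A small sketch, using the dictionary from the output above as sample data; flatten_references is a hypothetical helper name, and with a real file the dictionary would come from pdf.get_references_as_dict().

```python
def flatten_references(refs_by_type):
    """Turn {type: [ref, ...]} into a flat list of (type, ref) pairs."""
    pairs = []
    for ref_type, refs in refs_by_type.items():
        for ref in refs:
            pairs.append((ref_type, ref))
    return pairs

# Sample data copied from the output shown above
references = {'url': ['http://www.education.gov.yk.ca/']}

for ref_type, ref in flatten_references(references):
    print(f"{ref_type}: {ref}")
```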
Using PDFMiner
Compared to PyPDF2, PDFMiner is a more powerful and more complex library. It allows detailed extraction of the text, hyperlinks and structure of a PDF file. It parses the PDF into a tree of objects that can be examined page by page.
Install PDFMiner
To use the PDFMiner library, first install it with the command below:
pip install pdfminer.six
Example
The example below reads a PDF page by page with PDFMiner and extracts its hyperlinks. PDFPage.create_pages() yields each page of the document, and the link annotations stored on a page (page.annots) carry the target address in the URI entry of their action dictionary.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

file = "Enter PDF File Name"

with open(file, 'rb') as fp:
    document = PDFDocument(PDFParser(fp))
    # Iterate through PDF pages
    for page in PDFPage.create_pages(document):
        # Annotations may be indirect references, so resolve them first
        for annot in resolve1(page.annots) or []:
            annot = resolve1(annot)
            action = resolve1(annot.get("A"))
            if action and "URI" in action:
                # Extracting hyperlinks (PDFMiner returns the URI as bytes)
                uri = resolve1(action["URI"])
                if isinstance(uri, bytes):
                    uri = uri.decode("utf-8", errors="replace")
                print(f"Found hyperlink: {uri}")
Output
Found hyperlink: http://www.education.gov.yk.ca/
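Links found in a PDF are not always web addresses; a URI can also be something like a mailto: address. If only web pages are wanted, the results can be filtered with the standard library. This is a minimal sketch: is_web_url is a hypothetical helper, and the list below is made-up sample data apart from the URL from the output above.

```python
from urllib.parse import urlparse

def is_web_url(uri):
    """Return True for http:// or https:// URIs that name a host."""
    parts = urlparse(uri)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

# Hypothetical results from any of the extraction methods above
found = [
    "http://www.education.gov.yk.ca/",
    "mailto:someone@example.com",
    "https://",
]

web_links = [uri for uri in found if is_web_url(uri)]
print(web_links)
```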