PDF Accessibility Solutions

This repository provides two complementary solutions for PDF accessibility:

PDF-to-PDF Remediation: Processes PDFs and maintains the PDF format while improving accessibility.
PDF-to-HTML Remediation: Converts PDFs to accessible HTML format.

Both solutions leverage AWS services and generative AI to improve content accessibility according to WCAG 2.1 Level AA standards.

Index	Description
Architecture Overview	High level overview illustrating component interactions
Automated One Click Deployment	How to deploy the project
Testing Your PDF Accessibility Solution	User guide for the working solution
PDF-to-PDF Remediation Solution	PDF format preservation solution details
PDF-to-HTML Remediation Solution	HTML conversion solution details
Monitoring	System monitoring and observability
Troubleshooting	Common issues and solutions
Contributing	How to contribute to the project

Architecture Overview

The following architecture diagram illustrates the various AWS components utilized to deliver the solution.

When to Use This Tool

PDF-to-PDF Remediation - Suitable For:

Scanned documents requiring OCR and accessibility tagging
Simple PDFs with text and images but no complex forms
Documents without existing accessibility tags that need comprehensive remediation
PDFs requiring batch processing for large-scale accessibility improvements
Documents where PDF format must be preserved (e.g., for archival or legal requirements)

PDF-to-PDF Remediation - Not Suitable For:

PDFs with fully-tagged interactive forms - existing form accessibility tags may be affected during processing
Documents already meeting accessibility standards - processing may introduce unnecessary changes
PDFs requiring pixel-perfect layout preservation - some layout adjustments may occur
Documents with complex embedded multimedia - advanced interactive elements may not be fully preserved

PDF-to-HTML Remediation - Suitable For:

Documents intended for web display where HTML format is acceptable
Content requiring maximum accessibility through semantic HTML structure
PDFs with complex layouts that benefit from HTML's responsive design capabilities
Documents needing extensive alt text generation for images

PDF-to-HTML Remediation - Not Suitable For:

Documents requiring PDF format for legal, archival, or distribution purposes
PDFs with interactive forms that must remain functional in PDF format
Documents with precise layout requirements that cannot be adapted to HTML

Known Limitations

File Size Considerations

PDF-to-PDF Processing:

Simple PDFs: File size may increase up to 25% due to merge pipeline overhead (metadata, structure)
Complex PDFs with forms: Output files may be 20-50% larger than input due to accessibility enhancements
Large PDFs (>100MB): Processing time increases; consider splitting into smaller documents
Compression: Automatic compression is applied during merge operations to minimize size impact
Note: Smaller PDFs experience proportionally larger overhead; larger documents benefit more from compression
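As a rough planning aid, the upper bounds quoted above can be turned into a worst-case storage estimate. This is a minimal sketch: the function name `max_output_bytes` is hypothetical, and the percentages are the upper bounds from this section, not measured values.

```shell
# Worst-case output size in bytes for a given input size and overhead percentage.
# Integer arithmetic only; the function name is illustrative, not part of this repository.
max_output_bytes() {
  input_bytes=$1
  overhead_pct=$2
  echo $(( input_bytes + input_bytes * overhead_pct / 100 ))
}

# Example: a 4 MiB simple PDF (up to 25% growth) and a 4 MiB form-heavy PDF (up to 50% growth).
max_output_bytes 4194304 25
max_output_bytes 4194304 50
```

Actual growth depends on the document, so treat the result as a ceiling for capacity planning rather than a prediction.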

Form Field Accessibility

PDF-to-PDF Processing:

Tagged forms: Existing form field accessibility tags are preserved during processing
Untagged forms: Interactive form elements will receive accessibility tags
Partially-tagged forms: Existing tags are maintained while missing tags are added
Form validation: All preserved form field properties are verified after processing

Processing Limitations

Both Solutions:

Processing time: Complex documents may take several minutes to process
Concurrent processing: PDF-to-PDF supports up to 100 concurrent chunks via Step Functions
Manual review required: Some accessibility issues (e.g., reading order, complex tables) may need human verification
Language support: Primary support for English; other languages may have varying results

Accessibility Standards

Important Notes:

These tools significantly improve accessibility but do not guarantee 100% WCAG 2.1 Level AA compliance
Manual review and testing with assistive technologies are recommended for critical documents
Complex tables, charts, and diagrams may require manual accessibility improvements
Color contrast issues in images may need manual correction

For detailed information about limitations and workarounds, see LIMITATIONS.md.

For guidance on processing multiple PDFs efficiently, see BATCH_PROCESSING.md.

Automated One Click Deployment

We provide a unified deployment script that deploys either or both solutions with a single command. You choose your preferred solution during deployment.

Prerequisites

Common Requirements:

AWS Account with appropriate permissions to create and manage AWS resources
- See IAM Permissions Guide for detailed permission requirements
AWS CloudShell access (AWS CLI is pre-installed and configured automatically)
- Sign in to the AWS Management Console
- In the top navigation bar, click the CloudShell icon (terminal symbol) next to the search bar
- Wait for CloudShell to initialize (this may take a few moments on first use)

Solution-Specific Requirements:

PDF-to-PDF:
- Adobe API Access: An enterprise-level contract or a trial account (for testing) for Adobe's API is required.
  - See Adobe PDF Services API to obtain API credentials.
PDF-to-HTML: Amazon Bedrock Data Automation service access
- Ensure you have access to create a Bedrock Data Automation project - usually present by default

One-Click Deployment

Step 1: Open AWS CloudShell and Clone the Repository

git clone https://github.com/ASUCICREPO/PDF_Accessibility.git
cd PDF_Accessibility

Step 2: Run the Unified Deployment Script

chmod +x deploy.sh
./deploy.sh

Step 3: Follow the Interactive Prompts

The script will guide you through:

Solution Selection: Choose PDF-to-PDF or PDF-to-HTML remediation
Solution-Specific Setup:
- PDF-to-PDF: Enter Adobe API credentials (stored securely in AWS Secrets Manager)
- PDF-to-HTML: Automatic creation of Bedrock Data Automation project
Automated Deployment: Real-time monitoring of the deployment progress
Optional UI Deployment: After successful deployment of your chosen solution(s), you'll have the option to deploy a user interface as well

Step 4: Test Your Deployment

After successful deployment, the script provides specific testing instructions for your chosen solution.

Testing Your PDF Accessibility Solution

PDF-to-PDF Solution Testing

Navigate to Your S3 Bucket
- In the AWS S3 Console, find the bucket starting with pdfaccessibility-
- This bucket was automatically created during deployment
Create the Input Folder
- Create a folder named pdf/ in the root of the bucket
- This is where you'll upload PDFs for processing
Upload Your PDF Files
- Upload any PDF file(s) to the pdf/ folder
- Bulk Processing: You can upload multiple PDFs to the pdf/ folder for batch remediation
- The process automatically triggers when files are uploaded
Monitor Processing
- Temporary Files: A temp/ folder will be created containing intermediate processing files
- Final Results: A result/ folder will be created with your accessibility-compliant PDF files
- Use the CloudWatch dashboard to monitor processing progress
Download Results
- Navigate to the result/ folder to access your remediated PDFs
- Files keep their original names, with a "COMPLIANT" prefix added once the accessibility improvements have been applied
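
The console steps above can also be run from CloudShell with the AWS CLI. The following is a minimal sketch, assuming exactly one bucket in the account matches the `pdfaccessibility-` prefix; the function names and the example file name `report.pdf` are hypothetical and not part of this repository.

```shell
# Look up the deployment bucket by its name prefix (assumes exactly one match).
find_pdf_bucket() {
  aws s3api list-buckets \
    --query "Buckets[?starts_with(Name, 'pdfaccessibility-')].Name | [0]" \
    --output text
}

# Build the S3 URI under the pdf/ input folder for a local file.
input_uri() {
  printf 's3://%s/pdf/%s\n' "$1" "$(basename "$2")"
}

# Upload one PDF; the upload itself triggers processing.
upload_pdf() {
  aws s3 cp "$2" "$(input_uri "$1" "$2")"
}

# List remediated files once processing has finished.
list_results() {
  aws s3 ls "s3://$1/result/"
}
```

A typical session would be `bucket=$(find_pdf_bucket)`, then `upload_pdf "$bucket" ./report.pdf`, and later `list_results "$bucket"`. Because S3 folders are key prefixes, uploading to `pdf/` this way makes the folder appear without creating it first.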

PDF-to-HTML Solution Testing

Navigate to Your S3 Bucket
- In the AWS S3 Console, find the bucket starting with pdf2html-bucket-
- This bucket was automatically created during deployment
Upload Your PDF Files
- Navigate to the uploads/ folder (created automatically during deployment)
- Bulk Processing: You can upload multiple PDFs to the uploads/ folder for batch remediation
- The process automatically triggers when files are uploaded
Monitor Processing
- Two folders will be created automatically:
  - output/: Contains temporary processing data and intermediate files
  - remediated/: Contains the final remediated results
Access Your Results
- Navigate to the remediated/ folder
- Download the zip file named final_{your-filename}.zip
Explore the Remediated Content
The downloaded zip file contains:
- remediated.html: Final accessibility-compliant HTML version
- result.html: Original HTML conversion (before remediation)
- images/ folder: Extracted images with generated alt text
- remediation_report.html: Detailed report of accessibility improvements made
- usage_data.json: Processing metrics and usage statistics
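
After extracting the archive, a quick check can confirm that the files listed above are present. This sketch assumes the files sit at the top level of the extraction directory; the function name `check_remediated_dir` is hypothetical. A missing images/ folder is reported but not treated as an error, on the assumption that documents without images may not produce one.

```shell
# Check that an extracted final_*.zip directory holds the files listed above.
# Prints each missing entry and returns non-zero if a required file is absent.
check_remediated_dir() {
  dir=$1
  missing=0
  for name in remediated.html result.html remediation_report.html usage_data.json; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  # Assumption: images/ may legitimately be absent for image-free documents.
  if [ ! -d "$dir/images" ]; then
    echo "note: no images/ folder found"
  fi
  return "$missing"
}
```

For example, with a hypothetical download named `final_report.zip`: `unzip final_report.zip -d out && check_remediated_dir out`.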

Advanced Usage

Redeployment

After initial deployment, you can redeploy using the CodeBuild project created during deployment:

aws codebuild start-build --project-name YOUR-PROJECT-NAME --source-version main

Or simply re-run the deployment script and choose the solution you want to redeploy.
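
If you do not remember the CodeBuild project name for the `start-build` command above, the AWS CLI can list it and report on the build afterwards. This is a minimal sketch; the function names are hypothetical, and the project name to pass in is whatever your deployment created.

```shell
# List CodeBuild project names, one per line, to find the value for YOUR-PROJECT-NAME.
list_build_projects() {
  aws codebuild list-projects --query 'projects[]' --output text | tr '\t' '\n'
}

# Start a build from the main branch and print the new build ID.
redeploy() {
  aws codebuild start-build --project-name "$1" --source-version main \
    --query 'build.id' --output text
}

# Report the status (for example IN_PROGRESS, SUCCEEDED, FAILED) of a build ID.
build_status() {
  aws codebuild batch-get-builds --ids "$1" \
    --query 'builds[0].buildStatus' --output text
}
```

A typical sequence is `list_build_projects`, then `id=$(redeploy YOUR-PROJECT-NAME)`, then `build_status "$id"` until the build leaves the IN_PROGRESS state.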

PDF-to-PDF Remediation Solution

Overview

This solution processes PDFs while maintaining the original PDF format. It uses AWS CDK to build infrastructure that splits PDFs into chunks, processes them via AWS Step Functions, and merges the results using ECS tasks.

Architecture

S3 Bucket: Stores input and processed PDFs
Lambda Functions: PDF splitting, merging, and accessibility checking
Step Functions: Orchestrates the processing workflow
ECS Fargate: Runs containerized processing tasks
CloudWatch Dashboard: Monitors progress and performance

Manual Deployment

For detailed manual deployment instructions, see our Manual Deployment Guide.

PDF-to-HTML Remediation Solution

Overview

This solution converts PDF documents to accessible HTML format while preserving layout and visual appearance. It uses Amazon Bedrock Data Automation for PDF parsing and a serverless Lambda architecture.

Architecture

S3 Bucket: Stores input PDFs and remediated HTML files
Lambda Function: Processes PDFs using a containerized accessibility utility
ECR Repository: Hosts the Docker image for Lambda
Bedrock Data Automation: Provides PDF parsing and extraction capabilities

Monitoring

PDF-to-PDF Solution

CloudWatch Dashboard: Automatically created during deployment
Step Functions Console: Monitor workflow executions
ECS Console: Track container task status

PDF-to-HTML Solution

Lambda Logs: /aws/lambda/Pdf2HtmlPipeline
S3 Events: Monitor file processing status
CloudWatch Metrics: Track function performance
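
The Lambda log group listed above can be followed from CloudShell. The sketch below assumes AWS CLI v2 (which provides `aws logs tail`) and uses the log group name exactly as given in this section; the function names are hypothetical.

```shell
# Log group named in the Monitoring section above.
PDF2HTML_LOG_GROUP="/aws/lambda/Pdf2HtmlPipeline"

# Follow new log events, starting from a look-back window (default: last 10 minutes).
tail_pdf2html_logs() {
  aws logs tail "$PDF2HTML_LOG_GROUP" --since "${1:-10m}" --follow
}

# Show only messages containing a search term (for example ERROR) from the last hour.
search_pdf2html_logs() {
  aws logs filter-log-events \
    --log-group-name "$PDF2HTML_LOG_GROUP" \
    --start-time "$(( ($(date +%s) - 3600) * 1000 ))" \
    --filter-pattern "$1" \
    --query 'events[].message' --output text
}
```

Run `tail_pdf2html_logs` right after uploading a PDF to watch processing, or `search_pdf2html_logs ERROR` to look for failures after the fact.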

Troubleshooting

Common Issues

AWS Credentials

Ensure AWS CLI is configured with appropriate permissions
Verify access to required AWS services (S3, Lambda, ECS, Bedrock)

Service Limits

Check AWS service quotas if deployment fails
Request additional Elastic IPs if needed: EC2 Service Quotas

Build Failures

Check CodeBuild console for detailed error messages
Verify all prerequisites are met
Ensure Docker is available for PDF-to-HTML deployments

Solution-Specific Troubleshooting

PDF-to-PDF Issues

Verify Adobe API credentials are correct and active
Check CloudWatch logs for Lambda functions and ECS tasks
Ensure NOVA_PRO Bedrock model access is granted

PDF-to-HTML Issues

Verify Bedrock Data Automation permissions
Check Lambda function logs in CloudWatch
Ensure Docker image was pushed to ECR successfully

Getting Help

Check build logs in CodeBuild console
Review CloudWatch logs for runtime issues
Verify all prerequisites are met
For deployment issues, refer to: CDK GitHub Issue
For additional troubleshooting: Troubleshooting Guide
Contact support: ai-cic@amazon.com

Contributing

Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.

Acknowledgments

The PDF-to-HTML remediation functionality in this project is adapted from AWS Labs' Content Accessibility Utility on AWS. This version includes updates and enhancements tailored for integration within the PDF Accessibility backend.

Support

For questions, issues, or support:

Email: ai-cic@amazon.com
Issues: GitHub Issues

Built by Arizona State University's AI Cloud Innovation Center (AI CIC)
Powered by AWS
