Experience the PDF Remediation solution developed at ASU AI Cloud Innovation Center. This innovative tool remediates PDF documents to meet WCAG 2.1 Level AA standards with tagging, metadata cleanup, and AI-powered alt-text generation, promoting digital accessibility for everyone.

cmackdev/PDF_Accessibility

 
 

PDF Accessibility Solutions

This repository provides two complementary solutions for PDF accessibility:

  1. PDF-to-PDF Remediation: Processes PDFs and maintains the PDF format while improving accessibility.
  2. PDF-to-HTML Remediation: Converts PDFs to accessible HTML format.

Both solutions leverage AWS services and generative AI to improve content accessibility according to WCAG 2.1 Level AA standards.

Table of Contents

| Index | Description |
| --- | --- |
| Architecture Overview | High-level overview illustrating component interactions |
| Automated One Click Deployment | How to deploy the project |
| Testing Your PDF Accessibility Solution | User guide for the working solution |
| PDF-to-PDF Remediation Solution | PDF format preservation solution details |
| PDF-to-HTML Remediation Solution | HTML conversion solution details |
| Monitoring | System monitoring and observability |
| Troubleshooting | Common issues and solutions |
| Contributing | How to contribute to the project |

Architecture Overview

The following architecture diagram illustrates the various AWS components utilized to deliver the solution.

Architecture Diagram

When to Use This Tool

PDF-to-PDF Remediation - Suitable For:

  • Scanned documents requiring OCR and accessibility tagging
  • Simple PDFs with text and images but no complex forms
  • Documents without existing accessibility tags that need comprehensive remediation
  • PDFs requiring batch processing for large-scale accessibility improvements
  • Documents where PDF format must be preserved (e.g., for archival or legal requirements)

PDF-to-PDF Remediation - Not Suitable For:

  • PDFs with fully-tagged interactive forms - existing form accessibility tags may be affected during processing
  • Documents already meeting accessibility standards - processing may introduce unnecessary changes
  • PDFs requiring pixel-perfect layout preservation - some layout adjustments may occur
  • Documents with complex embedded multimedia - advanced interactive elements may not be fully preserved

PDF-to-HTML Remediation - Suitable For:

  • Documents intended for web display where HTML format is acceptable
  • Content requiring maximum accessibility through semantic HTML structure
  • PDFs with complex layouts that benefit from HTML's responsive design capabilities
  • Documents needing extensive alt text generation for images

PDF-to-HTML Remediation - Not Suitable For:

  • Documents requiring PDF format for legal, archival, or distribution purposes
  • PDFs with interactive forms that must remain functional in PDF format
  • Documents with precise layout requirements that cannot be adapted to HTML

Known Limitations

File Size Considerations

PDF-to-PDF Processing:

  • Simple PDFs: File size may increase by up to 25% due to merge pipeline overhead (metadata, structure)
  • Complex PDFs with forms: Output files may be 20-50% larger than input due to accessibility enhancements
  • Large PDFs (>100MB): Processing time increases; consider splitting into smaller documents
  • Compression: Automatic compression is applied during merge operations to minimize size impact
  • Note: Smaller PDFs experience proportionally larger overhead; larger documents benefit more from compression
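
The size guidance above can be turned into a quick pre-flight check. The following is a minimal sketch that only encodes the percentages and the 100MB threshold from this list; the function names are illustrative and are not part of this repository, and real output sizes will vary with content and compression.

```python
from typing import Tuple

MB = 1024 * 1024

# Figures copied from the list above; rough guidance, not guarantees.
SIMPLE_GROWTH_PCT = (0, 25)       # simple PDFs: up to 25% larger
FORMS_GROWTH_PCT = (20, 50)       # complex PDFs with forms: 20-50% larger
LARGE_PDF_THRESHOLD = 100 * MB    # above this, consider splitting


def estimate_output_size(input_bytes: int, has_forms: bool = False) -> Tuple[int, int]:
    """Return a (low, high) estimate of the remediated file size in bytes."""
    if input_bytes < 0:
        raise ValueError("input_bytes must be non-negative")
    low_pct, high_pct = FORMS_GROWTH_PCT if has_forms else SIMPLE_GROWTH_PCT
    # Integer arithmetic avoids floating-point rounding surprises.
    return (input_bytes * (100 + low_pct) // 100,
            input_bytes * (100 + high_pct) // 100)


def should_split(input_bytes: int) -> bool:
    """True when the input exceeds the 100MB guidance above."""
    return input_bytes > LARGE_PDF_THRESHOLD
```

For example, `estimate_output_size(10 * MB, has_forms=True)` gives a range of 12MB to 15MB, which can help when sizing buckets or download limits.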

Form Field Accessibility

PDF-to-PDF Processing:

  • Tagged forms: Existing form field accessibility tags are preserved during processing
  • Untagged forms: Interactive form elements will receive accessibility tags
  • Partially-tagged forms: Existing tags are maintained while missing tags are added
  • Form validation: All preserved form field properties are verified after processing

Processing Limitations

Both Solutions:

  • Processing time: Complex documents may take several minutes to process
  • Concurrent processing: PDF-to-PDF supports up to 100 concurrent chunks via Step Functions
  • Manual review required: Some accessibility issues (e.g., reading order, complex tables) may need human verification
  • Language support: Primary support for English; other languages may have varying results

Accessibility Standards

Important Notes:

  • These tools significantly improve accessibility but do not guarantee 100% WCAG 2.1 Level AA compliance
  • Manual review and testing with assistive technologies are recommended for critical documents
  • Complex tables, charts, and diagrams may require manual accessibility improvements
  • Color contrast issues in images may need manual correction

For detailed information about limitations and workarounds, see LIMITATIONS.md.

For guidance on processing multiple PDFs efficiently, see BATCH_PROCESSING.md.

Automated One Click Deployment

We provide a unified deployment script that lets you deploy either or both solutions with a single command. You choose your preferred solution during deployment.

Prerequisites

Common Requirements:

  1. AWS Account with appropriate permissions to create and manage AWS resources
  2. AWS CloudShell access (AWS CLI is pre-installed and configured automatically)
    • Sign in to the AWS Management Console
    • In the top navigation bar, click the CloudShell icon (terminal symbol) next to the search bar
    • Wait for CloudShell to initialize (this may take a few moments on first use)

Solution-Specific Requirements:

  • PDF-to-PDF:
    • Adobe API Access - An enterprise-level contract or a trial account (for testing) for Adobe's API is required.
  • PDF-to-HTML: Amazon Bedrock Data Automation service access
    • Ensure you have access to create a Bedrock Data Automation project - usually present by default

One-Click Deployment

Step 1: Open AWS CloudShell and Clone the Repository

git clone https://github.com/ASUCICREPO/PDF_Accessibility.git
cd PDF_Accessibility

Step 2: Run the Unified Deployment Script

chmod +x deploy.sh
./deploy.sh

Step 3: Follow the Interactive Prompts

The script will guide you through:

  1. Solution Selection: Choose between PDF-to-PDF and PDF-to-HTML remediation
  2. Solution-Specific Setup:
    • PDF-to-PDF: Enter Adobe API credentials (stored securely in AWS Secrets Manager)
    • PDF-to-HTML: Automatic creation of Bedrock Data Automation project
  3. Automated Deployment: Real-time monitoring of the deployment progress
  4. Optional UI Deployment: After successful deployment of your chosen solution(s), you'll have the option to deploy a user interface as well

Step 4: Test Your Deployment

After successful deployment, the script provides specific testing instructions for your chosen solution.

Testing Your PDF Accessibility Solution

PDF-to-PDF Solution Testing

  1. Navigate to Your S3 Bucket

    • In the AWS S3 Console, find the bucket starting with pdfaccessibility-
    • This bucket was automatically created during deployment
  2. Create the Input Folder

    • Create a folder named pdf/ in the root of the bucket
    • This is where you'll upload PDFs for processing
  3. Upload Your PDF Files

    • Upload any PDF file(s) to the pdf/ folder
    • Bulk Processing: You can upload multiple PDFs to the pdf/ folder for batch remediation
    • The process automatically triggers when files are uploaded
  4. Monitor Processing

    • Temporary Files: A temp/ folder will be created containing intermediate processing files
    • Final Results: A result/ folder will be created with your accessibility-compliant PDF files
    • Use the CloudWatch dashboard to monitor processing progress
  5. Download Results

    • Navigate to the result/ folder to access your remediated PDFs
    • Remediated files keep their original names, with a "COMPLIANT" prefix added
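
The console steps above can also be scripted. Below is a minimal sketch, assuming the third-party boto3 SDK is installed and credentials are configured. The bucket prefix and the pdf/ and result/ folders come from the steps above; the helper names and `report.pdf` are placeholders, not part of this repository.

```python
from pathlib import Path


def find_bucket(s3, prefix="pdfaccessibility-"):
    """Return the name of the single bucket whose name starts with `prefix`."""
    names = [b["Name"] for b in s3.list_buckets()["Buckets"]]
    matches = [n for n in names if n.startswith(prefix)]
    if len(matches) != 1:
        raise RuntimeError(f"expected one bucket starting with {prefix!r}, found {matches}")
    return matches[0]


def upload_pdfs(s3, bucket, paths, folder="pdf/"):
    """Upload each local PDF under `folder`; return the object keys used."""
    keys = []
    for path in map(Path, paths):
        if path.suffix.lower() != ".pdf":
            raise ValueError(f"not a PDF: {path}")
        key = f"{folder}{path.name}"
        s3.upload_file(str(path), bucket, key)  # the upload triggers processing
        keys.append(key)
    return keys


def list_results(s3, bucket, folder="result/"):
    """List the object keys currently under the result folder."""
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=folder)
    return [obj["Key"] for page in pages for obj in page.get("Contents", [])]


def main():
    import boto3  # third-party AWS SDK for Python

    s3 = boto3.client("s3")
    bucket = find_bucket(s3)
    print(upload_pdfs(s3, bucket, ["report.pdf"]))  # placeholder file name
    print(list_results(s3, bucket))  # empty until processing has finished
```

Call `main()` from a shell with AWS access. Passing the client in as an argument keeps the helpers easy to test with a stub.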

PDF-to-HTML Solution Testing

  1. Navigate to Your S3 Bucket

    • In the AWS S3 Console, find the bucket starting with pdf2html-bucket-
    • This bucket was automatically created during deployment
  2. Upload Your PDF Files

    • Navigate to the uploads/ folder (created automatically during deployment) and upload your PDF file(s)
    • Bulk Processing: You can upload multiple PDFs to the uploads/ folder for batch remediation
    • The process automatically triggers when files are uploaded
  3. Monitor Processing

    • Two folders will be created automatically:
      • output/: Contains temporary processing data and intermediate files
      • remediated/: Contains the final remediated results
  4. Access Your Results

    • Navigate to the remediated/ folder
    • Download the zip file named final_{your-filename}.zip
  5. Explore the Remediated Content

    The downloaded zip file contains:

    • remediated.html: Final accessibility-compliant HTML version
    • result.html: Original HTML conversion (before remediation)
    • images/ folder: Extracted images with generated alt text
    • remediation_report.html: Detailed report of accessibility improvements made
    • usage_data.json: Processing metrics and usage statistics
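
To confirm that a downloaded archive contains the items listed above, a small check along these lines may help. This is a sketch using only the Python standard library. It matches entries by base name because the archive's internal folder layout is an assumption, and `missing_entries` is an illustrative name.

```python
import io
import zipfile
from pathlib import PurePosixPath

# File names taken from the list above.
EXPECTED_FILES = (
    "remediated.html",
    "result.html",
    "remediation_report.html",
    "usage_data.json",
)


def missing_entries(zip_source):
    """Return the expected entries that are absent from the results zip.

    `zip_source` may be a file path or a binary file object.
    """
    with zipfile.ZipFile(zip_source) as zf:
        paths = [PurePosixPath(name) for name in zf.namelist()]
    basenames = {p.name for p in paths}
    missing = [f for f in EXPECTED_FILES if f not in basenames]
    # Accept an images/ folder at any depth inside the archive.
    if not any("images" in p.parts for p in paths):
        missing.append("images/")
    return missing
```

An empty list means every expected item was found. For example, `missing_entries("final_report.zip")` checks a local download, where the file name is a placeholder.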

Advanced Usage

Redeployment

After initial deployment, you can redeploy using the created CodeBuild project:

aws codebuild start-build --project-name YOUR-PROJECT-NAME --source-version main

Or simply re-run the deployment script and choose the solution you want to redeploy.
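
If you prefer to trigger the redeployment from Python, the CLI command above maps onto the boto3 CodeBuild client roughly as follows. This is a sketch, assuming boto3 is installed; `YOUR-PROJECT-NAME` remains a placeholder for the project created during deployment, and the helper names are illustrative.

```python
def redeploy(codebuild, project_name, source_version="main"):
    """Start a CodeBuild build and return its build ID."""
    response = codebuild.start_build(
        projectName=project_name,
        sourceVersion=source_version,
    )
    return response["build"]["id"]


def build_status(codebuild, build_id):
    """Return the current status string, e.g. IN_PROGRESS or SUCCEEDED."""
    builds = codebuild.batch_get_builds(ids=[build_id])["builds"]
    return builds[0]["buildStatus"]


def main():
    import boto3  # third-party AWS SDK for Python

    codebuild = boto3.client("codebuild")
    build_id = redeploy(codebuild, "YOUR-PROJECT-NAME")  # placeholder name
    print(build_id, build_status(codebuild, build_id))
```

Call `main()` after replacing the placeholder project name.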

PDF-to-PDF Remediation Solution

Overview

This solution processes PDFs while maintaining the original PDF format. It uses AWS CDK to build infrastructure that splits PDFs into chunks, processes them via AWS Step Functions, and merges the results using ECS tasks.

Architecture

  • S3 Bucket: Stores input and processed PDFs
  • Lambda Functions: PDF splitting, merging, and accessibility checking
  • Step Functions: Orchestrates the processing workflow
  • ECS Fargate: Runs containerized processing tasks
  • CloudWatch Dashboard: Monitors progress and performance
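
To make the split, process, and merge flow more concrete, here is a purely illustrative sketch of how a document might be divided into page-range chunks. It is not the repository's implementation. `pages_per_chunk` is a hypothetical parameter; only the limit of 100 concurrent chunks comes from this README.

```python
import math


def plan_chunks(total_pages, pages_per_chunk):
    """Return 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or pages_per_chunk < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk >= 1")
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


def min_rounds(num_chunks, max_concurrency=100):
    """Lower bound on sequential processing rounds under the concurrency limit."""
    return math.ceil(num_chunks / max_concurrency)
```

For instance, `plan_chunks(25, 10)` yields three ranges, the last of which covers pages 21 to 25. The processed chunks would then be merged back in page order.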

Manual Deployment

For detailed manual deployment instructions, see our Manual Deployment Guide.

PDF-to-HTML Remediation Solution

Overview

This solution converts PDF documents to accessible HTML format while preserving layout and visual appearance. It leverages Amazon Bedrock Data Automation for PDF parsing and uses a serverless Lambda architecture.

Architecture

  • S3 Bucket: Stores input PDFs and remediated HTML files
  • Lambda Function: Processes PDFs using a containerized accessibility utility
  • ECR Repository: Hosts the Docker image for Lambda
  • Bedrock Data Automation: Provides PDF parsing and extraction capabilities

Monitoring

PDF-to-PDF Solution

  • CloudWatch Dashboard: Automatically created during deployment
  • Step Functions Console: Monitor workflow executions
  • ECS Console: Track container task status

PDF-to-HTML Solution

  • Lambda Logs: /aws/lambda/Pdf2HtmlPipeline
  • S3 Events: Monitor file processing status
  • CloudWatch Metrics: Track function performance
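
The Lambda log group listed above can also be queried from a script. The following is a minimal sketch, assuming boto3 is installed. The log group name comes from this README, while the "ERROR" filter pattern and the helper name are assumptions about what is worth searching for.

```python
import time

LOG_GROUP = "/aws/lambda/Pdf2HtmlPipeline"  # from the list above


def recent_errors(logs, minutes=30, pattern="ERROR", log_group=LOG_GROUP, now=None):
    """Return log messages from the last `minutes` that match `pattern`."""
    end_ms = int((time.time() if now is None else now) * 1000)
    start_ms = end_ms - minutes * 60 * 1000
    pages = logs.get_paginator("filter_log_events").paginate(
        logGroupName=log_group,
        startTime=start_ms,
        endTime=end_ms,
        filterPattern=pattern,
    )
    return [event["message"] for page in pages for event in page.get("events", [])]


def main():
    import boto3  # third-party AWS SDK for Python

    for message in recent_errors(boto3.client("logs")):
        print(message.rstrip())
```

Call `main()` to print matching messages from the last 30 minutes.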

Troubleshooting

Common Issues

AWS Credentials

  • Ensure AWS CLI is configured with appropriate permissions
  • Verify access to required AWS services (S3, Lambda, ECS, Bedrock)
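
One way to check both points quickly is to make a cheap read-only call against each service and see which ones fail. This is a sketch, assuming boto3 is installed. The choice of probe calls is an assumption, and passing them does not prove that every permission the deployment needs is present.

```python
def probe_access(checks):
    """Run each zero-argument callable; map its name to None or an error string."""
    results = {}
    for name, call in checks.items():
        try:
            call()
            results[name] = None
        except Exception as exc:  # report every failure instead of stopping
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


def main():
    import boto3  # third-party AWS SDK for Python

    # Clients are created inside the lambdas so that configuration errors
    # (for example, a missing region) are reported per service.
    checks = {
        "identity": lambda: boto3.client("sts").get_caller_identity(),
        "s3": lambda: boto3.client("s3").list_buckets(),
        "lambda": lambda: boto3.client("lambda").list_functions(MaxItems=1),
        "ecs": lambda: boto3.client("ecs").list_clusters(maxResults=1),
        "bedrock": lambda: boto3.client("bedrock").list_foundation_models(),
    }
    for name, error in probe_access(checks).items():
        print(f"{name}: {'ok' if error is None else error}")
```

Call `main()` from the environment you plan to deploy from, such as AWS CloudShell.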

Service Limits

  • Check AWS service quotas if deployment fails
  • Request additional Elastic IPs if needed: EC2 Service Quotas

Build Failures

  • Check CodeBuild console for detailed error messages
  • Verify all prerequisites are met
  • Ensure Docker is available for PDF-to-HTML deployments

Solution-Specific Troubleshooting

PDF-to-PDF Issues

  • Verify Adobe API credentials are correct and active
  • Check CloudWatch logs for Lambda functions and ECS tasks
  • Ensure NOVA_PRO Bedrock model access is granted

PDF-to-HTML Issues

  • Verify Bedrock Data Automation permissions
  • Check Lambda function logs in CloudWatch
  • Ensure Docker image was pushed to ECR successfully

Getting Help

Contributing

Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.

Acknowledgments

The PDF-to-HTML remediation functionality in this project is adapted from AWS Labs' Content Accessibility Utility on AWS. This version includes updates and enhancements tailored for integration within the PDF Accessibility backend.


Support

For questions, issues, or support:


Built by Arizona State University's AI Cloud Innovation Center (AI CIC)
Powered by AWS
