This repository provides two complementary solutions for PDF accessibility:
- PDF-to-PDF Remediation: Processes PDFs and maintains the PDF format while improving accessibility.
- PDF-to-HTML Remediation: Converts PDFs to accessible HTML format.
Both solutions leverage AWS services and generative AI to improve content accessibility according to WCAG 2.1 Level AA standards.
| Index | Description |
|---|---|
| Architecture Overview | High-level overview illustrating component interactions |
| Automated One Click Deployment | How to deploy the project |
| Testing Your PDF Accessibility Solution | User guide for the working solution |
| PDF-to-PDF Remediation Solution | PDF format preservation solution details |
| PDF-to-HTML Remediation Solution | HTML conversion solution details |
| Monitoring | System monitoring and observability |
| Troubleshooting | Common issues and solutions |
| Contributing | How to contribute to the project |
The following architecture diagram illustrates the various AWS components utilized to deliver the solution.
PDF-to-PDF is a good fit for:
- Scanned documents requiring OCR and accessibility tagging
- Simple PDFs with text and images but no complex forms
- Documents without existing accessibility tags that need comprehensive remediation
- PDFs requiring batch processing for large-scale accessibility improvements
- Documents where PDF format must be preserved (e.g., for archival or legal requirements)
Use PDF-to-PDF with caution for:
- PDFs with fully tagged interactive forms - existing form accessibility tags may be affected during processing
- Documents already meeting accessibility standards - processing may introduce unnecessary changes
- PDFs requiring pixel-perfect layout preservation - some layout adjustments may occur
- Documents with complex embedded multimedia - advanced interactive elements may not be fully preserved
PDF-to-HTML is a good fit for:
- Documents intended for web display where HTML format is acceptable
- Content requiring maximum accessibility through semantic HTML structure
- PDFs with complex layouts that benefit from HTML's responsive design capabilities
- Documents needing extensive alt text generation for images
PDF-to-HTML is not a good fit for:
- Documents requiring PDF format for legal, archival, or distribution purposes
- PDFs with interactive forms that must remain functional in PDF format
- Documents with precise layout requirements that cannot be adapted to HTML
PDF-to-PDF Processing (file size):
- Simple PDFs: File size may increase by up to 25% due to merge pipeline overhead (metadata, structure)
- Complex PDFs with forms: Output files may be 20-50% larger than input due to accessibility enhancements
- Large PDFs (>100MB): Processing time increases; consider splitting into smaller documents
- Compression: Automatic compression is applied during merge operations to minimize size impact
- Note: Smaller PDFs experience proportionally larger overhead; larger documents benefit more from compression
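As a rough planning aid, the percentages above can be turned into byte estimates. The sketch below is only arithmetic on the quoted figures, not something the pipeline provides: the function names are illustrative, and the ">100MB" threshold is interpreted as 100 * 1024 * 1024 bytes, which is an assumption.

```shell
#!/bin/sh
# Rough output-size bounds derived from the percentages quoted above.
# Planning estimates only; actual results depend on the document.

# Upper bound for a simple PDF: input size plus 25%.
simple_pdf_max_bytes() {
  echo $(( $1 * 125 / 100 ))
}

# Lower and upper bounds for a complex PDF with forms: input plus 20% to 50%.
complex_pdf_range_bytes() {
  echo "$(( $1 * 120 / 100 )) $(( $1 * 150 / 100 ))"
}

# Succeeds when the input exceeds 100MB (assumed here to mean 104857600 bytes),
# the point at which splitting the document is suggested.
is_large_pdf() {
  [ "$1" -gt $(( 100 * 1024 * 1024 )) ]
}
```

For a local file, the input size can be supplied with something like `simple_pdf_max_bytes "$(wc -c < report.pdf)"`, where `report.pdf` is a placeholder name.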
PDF-to-PDF Processing (form handling):
- Tagged forms: Existing form field accessibility tags are preserved during processing
- Untagged forms: Interactive form elements will receive accessibility tags
- Partially-tagged forms: Existing tags are maintained while missing tags are added
- Form validation: All preserved form field properties are verified after processing
Both Solutions:
- Processing time: Complex documents may take several minutes to process
- Concurrent processing: PDF-to-PDF supports up to 100 concurrent chunks via Step Functions
- Manual review required: Some accessibility issues (e.g., reading order, complex tables) may need human verification
- Language support: Primary support for English; other languages may have varying results
Important Notes:
- These tools significantly improve accessibility but do not guarantee 100% WCAG 2.1 Level AA compliance
- Manual review and testing with assistive technologies is recommended for critical documents
- Complex tables, charts, and diagrams may require manual accessibility improvements
- Color contrast issues in images may need manual correction
For detailed information about limitations and workarounds, see LIMITATIONS.md.
For guidance on processing multiple PDFs efficiently, see BATCH_PROCESSING.md.
We provide a unified deployment script that lets you deploy either or both solutions with a single command. You choose your preferred solution during deployment.
Common Requirements:
- AWS Account with appropriate permissions to create and manage AWS resources
  - See IAM Permissions Guide for detailed permission requirements
- AWS CloudShell access (AWS CLI is pre-installed and configured automatically)
  - Sign in to the AWS Management Console
  - In the top navigation bar, click the CloudShell icon (terminal symbol) next to the search bar
  - Wait for CloudShell to initialize (this may take a few moments on first use)
Solution-Specific Requirements:
- PDF-to-PDF:
  - Adobe API Access: An enterprise-level contract or a trial account (for testing) for Adobe's API is required.
  - Sign up for the Adobe PDF Services API to obtain API credentials.
- PDF-to-HTML: Amazon Bedrock Data Automation service access
  - Ensure you have access to create a Bedrock Data Automation project (usually available by default)
Step 1: Open AWS CloudShell and Clone the Repository
git clone https://github.com/ASUCICREPO/PDF_Accessibility.git
cd PDF_Accessibility
Step 2: Run the Unified Deployment Script
chmod +x deploy.sh
./deploy.sh
Step 3: Follow the Interactive Prompts
The script will guide you through:
- Solution Selection: Choose between PDF-to-PDF or PDF-to-HTML remediation
- Solution-Specific Setup:
  - PDF-to-PDF: Enter Adobe API credentials (stored securely in AWS Secrets Manager)
  - PDF-to-HTML: Automatic creation of Bedrock Data Automation project
- Automated Deployment: Real-time monitoring of the deployment progress
- Optional UI Deployment: After successful deployment of your chosen solution(s), you'll have the option to deploy a user interface as well
Step 4: Test Your Deployment
After successful deployment, the script provides specific testing instructions for your chosen solution.
PDF-to-PDF:
- Navigate to Your S3 Bucket
  - In the AWS S3 Console, find the bucket starting with `pdfaccessibility-`
  - This bucket was automatically created during deployment
- Create the Input Folder
  - Create a folder named `pdf/` in the root of the bucket
  - This is where you'll upload PDFs for processing
- Upload Your PDF Files
  - Upload any PDF file(s) to the `pdf/` folder
  - Bulk Processing: You can upload multiple PDFs to the bucket for batch remediation
  - The process automatically triggers when files are uploaded
- Monitor Processing
  - Temporary Files: A `temp/` folder will be created containing intermediate processing files
  - Final Results: A `result/` folder will be created with your accessibility-compliant PDF files
  - Use the CloudWatch dashboard to monitor processing progress
- Download Results
  - Navigate to the `result/` folder to access your remediated PDFs
  - Files keep their original names, with a "COMPLIANT" prefix added once accessibility improvements have been applied
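The console steps for the PDF-to-PDF bucket can also be run from CloudShell with the AWS CLI. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes your credentials may list buckets and read and write objects. Only the `pdfaccessibility-` prefix and the `pdf/` and `result/` folder names come from this guide.

```shell
#!/bin/sh
# Sketch of the PDF-to-PDF test flow using the AWS CLI.
# Function names are illustrative helpers, not part of the project.

# Print the first bucket whose name starts with "pdfaccessibility-".
# Prints "None" when no such bucket exists.
find_pdf_bucket() {
  aws s3api list-buckets \
    --query "Buckets[?starts_with(Name, 'pdfaccessibility-')].Name | [0]" \
    --output text
}

# Build the S3 URI for a local PDF: the pdf/ folder in the bucket root.
pdf_upload_uri() {
  echo "s3://$1/pdf/$(basename "$2")"
}

# Upload one PDF; the upload itself triggers processing.
upload_pdf() {
  aws s3 cp "$2" "$(pdf_upload_uri "$1" "$2")"
}

# Download everything under result/ into a local directory (default: ./result).
download_results() {
  aws s3 cp "s3://$1/result/" "${2:-./result}" --recursive
}
```

A typical session would be `BUCKET=$(find_pdf_bucket)`, then `upload_pdf "$BUCKET" report.pdf`, and, once processing has finished, `download_results "$BUCKET"`. Here `report.pdf` is a placeholder; downloading the whole `result/` folder avoids guessing the exact output file name.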
PDF-to-HTML:
- Navigate to Your S3 Bucket
  - In the AWS S3 Console, find the bucket starting with `pdf2html-bucket-`
  - This bucket was automatically created during deployment
- Upload Your PDF Files
  - Navigate to the `uploads/` folder (created automatically during deployment)
  - Bulk Processing: You can upload multiple PDFs to the bucket for batch remediation
  - The process automatically triggers when files are uploaded
- Monitor Processing
  - Two folders will be created automatically:
    - `output/`: Contains temporary processing data and intermediate files
    - `remediated/`: Contains the final remediated results
- Access Your Results
  - Navigate to the `remediated/` folder
  - Download the zip file named `final_{your-filename}.zip`
- Explore the Remediated Content
  - The downloaded zip file contains:
    - `remediated.html`: Final accessibility-compliant HTML version
    - `result.html`: Original HTML conversion (before remediation)
    - `images/` folder: Extracted images with generated alt text
    - `remediation_report.html`: Detailed report of accessibility improvements made
    - `usage_data.json`: Processing metrics and usage statistics
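After extracting a downloaded zip file, it can be useful to confirm that the entries listed above are actually present. The helper below is a hypothetical sketch, not part of the project. It assumes the entries sit at the top level of the extracted directory, and that an `images/` folder is always produced, which may not hold for PDFs without images.

```shell
#!/bin/sh
# Print each documented bundle entry that is missing from an extracted
# directory, one name per line. Prints nothing when all entries are present.
missing_bundle_entries() {
  dir=$1
  for f in remediated.html result.html remediation_report.html usage_data.json; do
    [ -f "$dir/$f" ] || echo "$f"
  done
  # images/ is expected to be a directory rather than a file.
  [ -d "$dir/images" ] || echo "images/"
}
```

For example, after `unzip final_report.zip -d report_out` (a placeholder file name), running `missing_bundle_entries report_out` prints the names of any entries that were not found.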
Redeployment
After initial deployment, you can redeploy using the created CodeBuild project:
aws codebuild start-build --project-name YOUR-PROJECT-NAME --source-version main
Or simply re-run the deployment script and choose the solution you want to redeploy.
This solution processes PDFs while maintaining the original PDF format. It uses AWS CDK to build infrastructure that splits PDFs into chunks, processes them via AWS Step Functions, and merges the results using ECS tasks.
- S3 Bucket: Stores input and processed PDFs
- Lambda Functions: PDF splitting, merging, and accessibility checking
- Step Functions: Orchestrates the processing workflow
- ECS Fargate: Runs containerized processing tasks
- CloudWatch Dashboard: Monitors progress and performance
For detailed manual deployment instructions, see our Manual Deployment Guide.
This solution converts PDF documents to accessible HTML format while preserving layout and visual appearance. It leverages Amazon Bedrock Data Automation for PDF parsing and uses a serverless Lambda architecture.
- S3 Bucket: Stores input PDFs and remediated HTML files
- Lambda Function: Processes PDFs using containerized accessibility utility
- ECR Repository: Hosts the Docker image for Lambda
- Bedrock Data Automation: Provides PDF parsing and extraction capabilities
- CloudWatch Dashboard: Automatically created during deployment
- Step Functions Console: Monitor workflow executions
- ECS Console: Track container task status
- Lambda Logs: `/aws/lambda/Pdf2HtmlPipeline`
- S3 Events: Monitor file processing status
- CloudWatch Metrics: Track function performance
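The PDF-to-HTML Lambda logs can also be searched from the command line. The sketch below uses the log group named above with `aws logs filter-log-events`. The function names are illustrative, and the `ERROR` filter pattern assumes that failing invocations log lines containing that word, which has not been verified against this project.

```shell
#!/bin/sh
# Log group name taken from the Lambda Logs entry above.
PDF2HTML_LOG_GROUP="/aws/lambda/Pdf2HtmlPipeline"

# Print the epoch timestamp in milliseconds for N minutes ago,
# the format expected by --start-time.
minutes_ago_ms() {
  echo $(( ( $(date +%s) - $1 * 60 ) * 1000 ))
}

# Print log messages containing "ERROR" from the last N minutes (default 30).
recent_pdf2html_errors() {
  aws logs filter-log-events \
    --log-group-name "$PDF2HTML_LOG_GROUP" \
    --start-time "$(minutes_ago_ms "${1:-30}")" \
    --filter-pattern "ERROR" \
    --query 'events[].message' \
    --output text
}
```

Calling `recent_pdf2html_errors 60` would search the last hour. An empty result means only that no matching lines were found, not that processing succeeded.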
AWS Credentials
- Ensure AWS CLI is configured with appropriate permissions
- Verify access to required AWS services (S3, Lambda, ECS, Bedrock)
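A quick way to check the two points above is to confirm that the required tools are on the PATH and that the AWS CLI can resolve an identity. This is a minimal sketch with hypothetical function names. It shows which principal is in use but does not prove that the principal has every permission the deployment needs.

```shell
#!/bin/sh
# Print each named command that is not available on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print the ARN of the identity the AWS CLI is currently using.
# Fails with an error message if no valid credentials are configured.
check_aws_identity() {
  aws sts get-caller-identity --query Arn --output text
}
```

For example, `missing_tools aws git` prints nothing in a correctly provisioned CloudShell session, and `check_aws_identity` shows the user or role that the deployment will run as.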
Service Limits
- Check AWS service quotas if deployment fails
- Request additional Elastic IPs if needed: EC2 Service Quotas
Build Failures
- Check CodeBuild console for detailed error messages
- Verify all prerequisites are met
- Ensure Docker is available for PDF-to-HTML deployments
PDF-to-PDF Issues
- Verify Adobe API credentials are correct and active
- Check CloudWatch logs for Lambda functions and ECS tasks
- Ensure NOVA_PRO Bedrock model access is granted
PDF-to-HTML Issues
- Verify Bedrock Data Automation permissions
- Check Lambda function logs in CloudWatch
- Ensure Docker image was pushed to ECR successfully
- Check build logs in CodeBuild console
- Review CloudWatch logs for runtime issues
- Verify all prerequisites are met
- For deployment issues, refer to: CDK GitHub Issue
- For additional troubleshooting: Troubleshooting Guide
- Contact support: ai-cic@amazon.com
Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.
The PDF-to-HTML remediation functionality in this project is adapted from AWS Labs' Content Accessibility Utility on AWS. This version includes updates and enhancements tailored for integration within the PDF Accessibility backend.
For questions, issues, or support:
- Email: ai-cic@amazon.com
- Issues: GitHub Issues
Built by Arizona State University's AI Cloud Innovation Center (AI CIC)
Powered by AWS
