A high-performance CUDA-accelerated image processing library with a web-based interface for real-time filter comparison and performance analysis.
- Three Image Filters: Gaussian Blur, Box Blur, and Sobel Edge Detection
- Multiple Optimization Levels: Compare naive vs optimized implementations
- Web Interface: Drag-and-drop image processing with real-time performance metrics
- REST API: FastAPI backend with comprehensive endpoints
- Performance Metrics: Execution time, memory bandwidth, and FPS calculations
- Full Color Support: Grayscale, RGB, and RGBA image processing
Smooth blur using weighted averaging with a bell curve distribution.
- Level 1 (Naive): Global memory only, direct reads
- Level 2 (Texture Memory): Hardware texture caching, constant memory for kernel weights, vectorized access
- Performance: 23.24× speedup (Level 2 vs Level 1) on RTX 4050
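
As a rough CPU-side illustration of the bell-curve weighting described above, the sketch below computes normalized 1D Gaussian weights for a given sigma and radius. The name `gaussian_weights` is hypothetical and is not part of this library's API; the CUDA kernels may compute or store their weights differently.

```python
import math
from typing import List


def gaussian_weights(sigma: float, radius: int) -> List[float]:
    """Return 2 * radius + 1 Gaussian weights that sum to 1 (hypothetical helper)."""
    if sigma <= 0 or radius < 0:
        raise ValueError("sigma must be positive and radius must be non-negative")
    # Unnormalized bell curve sampled at integer offsets -radius..radius.
    raw = [math.exp(-(i * i) / (2.0 * sigma * sigma))
           for i in range(-radius, radius + 1)]
    total = sum(raw)
    # Normalize so the blur preserves overall image brightness.
    return [w / total for w in raw]


# Same parameters as the benchmark configuration (sigma=2.0, radius=3).
weights = gaussian_weights(sigma=2.0, radius=3)
```

The centre weight is the largest and the weights fall off symmetrically, which is what makes nearby pixels contribute more than distant ones.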
Simple average blur: every pixel within the kernel radius has equal weight.
- Level 1 (Naive): Global memory only
- Level 2 (Shared Memory): Tile-based with cooperative loading and halo regions
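
The tile-plus-halo idea can be mimicked on the CPU. The sketch below is not the CUDA kernel from this project; it is a pure-Python analogy in which each tile first stages its pixels plus a `radius`-wide halo into a local buffer (standing in for shared memory) and then averages only from that buffer. Clamp-to-edge border handling, the tile size, and all function names are assumptions made for illustration.

```python
from typing import List

Image = List[List[float]]


def clamp(v: int, lo: int, hi: int) -> int:
    return max(lo, min(v, hi))


def box_blur_naive(img: Image, radius: int) -> Image:
    """Every output pixel reads its whole window straight from the image."""
    h, w = len(img), len(img[0])
    n = (2 * radius + 1) ** 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    s += img[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
            out[y][x] = s / n
    return out


def box_blur_tiled(img: Image, radius: int, tile: int = 4) -> Image:
    """Stage each tile plus its halo once, then blur from the staged copy."""
    h, w = len(img), len(img[0])
    n = (2 * radius + 1) ** 2
    span = tile + 2 * radius  # tile width plus halo on both sides
    out = [[0.0] * w for _ in range(h)]
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            # Stand-in for the cooperative load into shared memory.
            local = [[img[clamp(ty - radius + ly, 0, h - 1)]
                         [clamp(tx - radius + lx, 0, w - 1)]
                      for lx in range(span)] for ly in range(span)]
            for ly in range(min(tile, h - ty)):
                for lx in range(min(tile, w - tx)):
                    s = 0.0
                    for dy in range(2 * radius + 1):
                        for dx in range(2 * radius + 1):
                            s += local[ly + dy][lx + dx]
                    out[ty + ly][tx + lx] = s / n
    return out


demo = [[float((x * 7 + y * 13) % 17) for x in range(10)] for y in range(9)]
naive_result = box_blur_naive(demo, radius=2)
tiled_result = box_blur_tiled(demo, radius=2, tile=4)
```

Both versions produce the same output; the tiled one simply reads each source pixel from the image far fewer times, which is where a shared-memory kernel would be expected to gain its speed.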
Detects edges using gradient magnitude calculation with 3×3 Sobel kernels.
- Level 1 (Naive): Global memory only
- Level 2 (Shared Memory): Tile-based with pre-computed grayscale conversion
- Performance: 34.74× speedup (Level 2 vs Level 1) on RTX 4050
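
For reference, gradient magnitude with the standard 3×3 Sobel kernels can be sketched on the CPU as follows. This is only an illustration for checking results by hand: the luma weights in `to_gray`, the clamp-to-edge border handling, and the unscaled output are assumptions, and the function names are hypothetical rather than taken from this library.

```python
import math
from typing import List, Tuple

# Standard 3x3 Sobel kernels for horizontal (GX) and vertical (GY) gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def to_gray(rgb: Tuple[float, float, float]) -> float:
    """Assumed luma conversion (ITU-R BT.601 weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def sobel_magnitude(gray: List[List[float]]) -> List[List[float]]:
    """Return sqrt(gx^2 + gy^2) per pixel, without scaling to 8 bits."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for j in range(3):
                for i in range(3):
                    yy = max(0, min(y + j - 1, h - 1))  # clamp to edge
                    xx = max(0, min(x + i - 1, w - 1))
                    gx += GX[j][i] * gray[yy][xx]
                    gy += GY[j][i] * gray[yy][xx]
            out[y][x] = math.hypot(gx, gy)
    return out


# A 4x4 image with a vertical step edge between columns 1 and 2.
step = [[0.0, 0.0, 100.0, 100.0] for _ in range(4)]
edges = sobel_magnitude(step)
```

Pixels next to the step get a large magnitude while pixels in flat regions get zero, which is the behaviour the filter relies on to highlight edges.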
gpu_image_processing/
├── cuda_lib/                   # Core CUDA library
│   ├── include/
│   │   └── image_filters.h     # Public API header
│   ├── src/
│   │   └── image_filters.cu    # CUDA kernel implementations
│   └── CMakeLists.txt
├── backend/                    # Python FastAPI server
│   ├── app.py                  # REST API endpoints
│   ├── cuda_bindings/          # pybind11 Python bindings
│   │   ├── bindings.cpp        # C++ ↔ Python bridge
│   │   └── CMakeLists.txt
│   ├── profiling/              # Nsight Compute integration
│   │   └── ncu_profiler.py
│   ├── requirements.txt
│   └── venv/                   # Python virtual environment
├── frontend/                   # Web interface
│   ├── index.html              # Main page
│   ├── js/
│   │   └── app.js              # Frontend logic
│   └── css/
│       └── styles.css          # Styling
├── tests/                      # Test programs
│   ├── test_gaussian_blur.cu
│   ├── test_box_blur.cu
│   ├── test_comparison.cu
│   └── test_real_image.cu
├── build/                      # Compiled binaries
├── CMakeLists.txt              # Main build config
├── start_servers.sh            # Start backend + frontend
└── stop_servers.sh             # Stop servers
- CUDA Toolkit (11.0+)
- CMake (3.18+)
- Python (3.8+)
- NVIDIA GPU with compute capability 7.0+
# Configure and build
mkdir -p build && cd build
cmake ..
make -j$(nproc)
# This creates: build/cuda_lib/libgpu_image_filters.a

cd backend/cuda_bindings
# Configure and build
cmake .
make
# This creates: backend/gpu_filters.cpython-*.so

cd backend
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# Test Python module
cd backend
source venv/bin/activate
python3 -c "import gpu_filters; print('GPU filters loaded successfully!')"

# Start both backend and frontend servers
./start_servers.sh

This will:
- Build CUDA bindings if needed
- Start FastAPI backend on http://localhost:8000
- Start frontend web server on http://localhost:8080
Backend:
cd backend
source venv/bin/activate
python app.py

Frontend:
cd frontend
python3 -m http.server 8080

Then open http://localhost:8080 in your browser.
The web interface provides:
- Drag & Drop Upload: Easy image input
- Filter Selection: Choose from Gaussian, Box, or Sobel
- Optimization Level Comparison: Process with multiple levels simultaneously
- Performance Metrics: Real-time execution time, bandwidth, and FPS
- Side-by-Side Comparison: Visual comparison of results
- Interactive Charts: Performance visualization with Chart.js
API information and status
Health check with GPU availability status
List all available filters and their parameters
Process image with selected filter and optimization level
Request:
{
  "image": "data:image/jpeg;base64,/9j/4AAQSkZJRg...",
  "filter": "gaussian",
  "level": 2,
  "sigma": 2.0,
  "radius": 3
}

Response:
{
  "processed_image": "data:image/png;base64,iVBORw0KGgo...",
  "metrics": {
    "time_ms": 0.293,
    "bandwidth_gbps": 39.99,
    "fps": 3415.67
  },
  "info": {
    "filter": "gaussian",
    "level": "texture_memory",
    "width": 1024,
    "height": 1023,
    "channels": 3
  }
}

Process image with ALL optimization levels for comparison
Upload image file and get base64 encoded version
API Documentation: Visit http://localhost:8000/docs for interactive Swagger UI
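
A request body like the one shown above could be assembled from Python with only the standard library, roughly as sketched here. The field names come from the example request; the helper names (`to_data_url`, `build_request`, `post_json`) are hypothetical, and the endpoint URL is deliberately left as a parameter because the exact path should be taken from the Swagger UI. Whether `sigma` and `radius` are required for every filter is not confirmed here.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")


def build_request(image_bytes: bytes, filter_name: str = "gaussian",
                  level: int = 2, sigma: float = 2.0,
                  radius: int = 3) -> Dict[str, Any]:
    """Build a payload with the same fields as the example request."""
    return {
        "image": to_data_url(image_bytes),
        "filter": filter_name,
        "level": level,
        "sigma": sigma,
        "radius": radius,
    }


def post_json(url: str, payload: Dict[str, Any], timeout: float = 30.0) -> Dict[str, Any]:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, assuming the backend is running (process_url is a placeholder for
# the processing endpoint listed in the Swagger UI):
#   with open("path/to/image.jpg", "rb") as f:
#       payload = build_request(f.read())
#   result = post_json(process_url, payload)
#   print(result["metrics"]["time_ms"])
```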
Gaussian Blur (3239×2146 RGB, σ=2.0, radius=3):
| Level | Time | Speedup |
|---|---|---|
| Level 1 (Naive) | 22.157 ms | 1× |
| Level 2 (Texture) | 0.953 ms | 23.24× |
Real kernel execution times from CUDA events (without profiling overhead)
Box Blur (3239×2146 RGB, radius=5):
| Level | Time | Speedup |
|---|---|---|
| Level 1 (Naive) | 12.311 ms | 1× |
| Level 2 (Shared) | 2.766 ms | 4.45× |
Real kernel execution times from CUDA events (without profiling overhead)
Sobel Edge Detection (3239×2146 RGB):
| Level | Time | Speedup |
|---|---|---|
| Level 1 (Naive) | 18.339 ms | 1× |
| Level 2 (Shared) | 0.528 ms | 34.74× |
Real kernel execution times from CUDA events (without profiling overhead)
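
The derived figures (speedup, FPS, bandwidth) follow from the kernel time by simple arithmetic. The sketch below shows one plausible way to compute them; it assumes 8-bit samples and counts one read plus one write per sample for bandwidth, which may not match how the library counts bytes. All function names are hypothetical.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized kernel is than the baseline."""
    return baseline_ms / optimized_ms


def fps(time_ms: float) -> float:
    """Frames per second if every frame took time_ms of kernel time."""
    return 1000.0 / time_ms


def bandwidth_gbps(width: int, height: int, channels: int, time_ms: float,
                   bytes_per_sample: int = 1, passes: int = 2) -> float:
    """Effective bandwidth in GB/s.

    Assumption: each sample is read once and written once (passes=2).
    """
    total_bytes = width * height * channels * bytes_per_sample * passes
    return total_bytes / (time_ms * 1e-3) / 1e9


# Box blur timings from the table above (Level 1 vs Level 2).
box_speedup = speedup(12.311, 2.766)
```

Small differences in the last digit between a ratio recomputed this way and a reported speedup are expected when the reported value was derived from unrounded timings.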
- Global Memory: High latency (~400 cycles), high bandwidth
- Shared Memory: Low latency (~5 cycles), limited size (48KB per SM)
- Constant Memory: Read-only, cached, optimal for kernel weights
- Texture Memory: Hardware caching with automatic boundary handling
- Separable Convolution: 2D filter → two 1D passes (per-pixel work drops from n² to 2n multiply-adds for an n×n kernel)
- Cooperative Loading: Threads work together to load shared memory tiles
- Halo/Apron Pixels: Border threads load extra pixels for neighbor access
- Thread Synchronization: `__syncthreads()` for shared memory consistency
- Texture Objects: Hardware-accelerated 2D spatial caching
- CUDA Events: Precise GPU timing without CPU overhead
- Memory Bandwidth: Bytes transferred / execution time
- Nsight Compute: Optional detailed profiling metrics
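
The separable-convolution point in the list above can be checked numerically: a 2D kernel formed as the outer product of a 1D kernel with itself gives the same result as a horizontal 1D pass followed by a vertical 1D pass. The sketch below is a CPU illustration only, with clamp-to-edge borders and hypothetical function names; it does not reproduce the CUDA kernels.

```python
from typing import List

Image = List[List[float]]


def clamp(v: int, lo: int, hi: int) -> int:
    return max(lo, min(v, hi))


def convolve_2d(img: Image, k: List[float]) -> Image:
    """Full 2D pass with kernel k[dy] * k[dx]: n * n multiply-adds per pixel."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(
                k[dy + r] * k[dx + r]
                * img[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1))
    return out


def convolve_separable(img: Image, k: List[float]) -> Image:
    """Horizontal then vertical 1D pass: 2 * n multiply-adds per pixel."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    tmp = [[sum(k[dx + r] * img[y][clamp(x + dx, 0, w - 1)]
                for dx in range(-r, r + 1))
            for x in range(w)] for y in range(h)]
    return [[sum(k[dy + r] * tmp[clamp(y + dy, 0, h - 1)][x]
                 for dy in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


kernel = [0.25, 0.5, 0.25]  # small symmetric kernel that sums to 1
image = [[float((x * 7 + y * 13) % 17) for x in range(6)] for y in range(5)]
full = convolve_2d(image, kernel)
sep = convolve_separable(image, kernel)
max_diff = max(abs(a - b) for ra, rb in zip(full, sep) for a, b in zip(ra, rb))
```

For a radius of 3 (kernel width 7), that is 49 multiply-adds per pixel for the 2D pass versus 14 for the two 1D passes.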
cd build
# Test Gaussian blur
./test_gaussian_blur
# Test Box blur
./test_box_blur
# Test comparison (multiple filters)
./test_comparison
# Test with real images
./test_real_image path/to/image.jpg

cd backend
source venv/bin/activate
python test_client.py path/to/image.jpg

- CUDA: GPU-accelerated image processing kernels
- C++/pybind11: Python bindings for CUDA library
- FastAPI: Modern Python web framework
- HTML/CSS/JavaScript: Frontend web interface
- Chart.js: Performance visualization
- CMake: Build system
- STB Image: Image I/O library
- Additional Filters: Canny Edge Detection, Bilateral Filter, Median Filter
- Level 3 & 4 Optimizations: Advanced techniques for all filters
- Batch Processing: Process multiple images simultaneously
This project is provided as-is for educational and research purposes.
Built with CUDA 🚀 | Performance-focused ⚡ | Web-enabled 🌐

