Winner solution of mobile AI (CVPRW 2021).
FrostNet: Towards Quantization-Aware Network Architecture Search
Quantization Aware Training
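The core trick behind quantization-aware training is a "fake quantization" step: in the forward pass the weight is snapped to the INT8 grid, while the backward pass treats the rounding as identity (the straight-through estimator) so the underlying float weight keeps receiving gradients. A minimal pure-Python sketch of that step (illustrative only; the function names are hypothetical, and real frameworks implement this as a differentiable module such as `FakeQuantize` in `torch.ao.quantization`):

```python
def fake_quant(w, scale):
    """Quantize-dequantize: return w snapped to the INT8 grid."""
    q = max(-128, min(127, round(w / scale)))  # INT8 code in [-128, 127]
    return q * scale                           # dequantized value

def qat_step(w, grad, lr, scale):
    """One SGD step under fake quantization.

    The forward pass would use fake_quant(w); the gradient flows
    straight through to the underlying float "shadow" weight.
    """
    w_q = fake_quant(w, scale)   # value the quantized model actually uses
    return w - lr * grad, w_q    # update the float weight, not w_q

# Example: a weight of 0.103 with scale 0.01 snaps to 0.10 in the
# forward pass, while the float copy is updated normally.
w_next, w_q = qat_step(w=0.103, grad=0.5, lr=0.02, scale=0.01)
```

Because only the float shadow weight is updated, training can nudge it across quantization-grid boundaries even when individual gradient steps are smaller than the grid spacing.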
A record and summary of common problems encountered when deploying on-device models, together with their solutions, in the hope of helping others.
Translation API using Meta's NLLB-200 model with 200+ languages
Hardware-aware optimization of CNN inference using INT8 quantization in PyTorch. Includes benchmarking, profiling, and visualization of accuracy, latency, and model size for edge deployment.
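The arithmetic underlying INT8 post-training quantization is simple: pick a scale that maps the tensor's value range onto the signed 8-bit range [-128, 127], round each value to that grid, and multiply back by the scale to dequantize. A minimal pure-Python sketch of the symmetric per-tensor variant (function names are hypothetical; production toolkits such as `torch.ao.quantization` also support per-channel scales and zero-points):

```python
def int8_quantize(values):
    """Map floats to INT8 codes with a shared symmetric scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-128, min(127, round(v / scale))) for v in values]
    return codes, scale

def int8_dequantize(codes, scale):
    """Recover approximate floats from INT8 codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.0, 1.27]
codes, scale = int8_quantize(weights)   # scale ~= 0.01
recovered = int8_dequantize(codes, scale)
```

The gap between `weights` and `recovered` is the quantization error that accuracy benchmarks for INT8 deployments measure, and why calibration (choosing a good `scale`) matters so much for edge inference.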
High-performance LLM inference platform built on vLLM continuous batching: 12.3K+ req/sec throughput at 42 ms P50 / 178 ms P99 latency, INT8/INT4 quantization (70% memory savings), tensor parallelism across 4 GPUs, and comprehensive monitoring, serving 1500+ concurrent users.