An open-source framework for training large multimodal models.
Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)
A curated list of Multimodal Related Research.
[CVPR 2024 & TPAMI 2025] UniRepLKNet
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
A Comparative Framework for Multimodal Recommender Systems
ICCV 2023-2025 Papers: Discover cutting-edge research from ICCV 2023-25, the leading computer vision conference. Stay updated on the latest in computer vision and deep learning, with code included. ⭐ Star the repository to support visual intelligence development!
[CVPR2023 Highlight] GRES: Generalized Referring Expression Segmentation
Multimodal model for text and tabular data with HuggingFace transformers as building block for text data
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
[ICCV 2023 & TPAMI 2025] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
ICASSP 2023-2024 Papers: A complete collection of influential and exciting research papers from the ICASSP 2023-24 conferences. Explore the latest advancements in acoustics, speech and signal processing. Code included. Star the repository to support the advancement of audio and signal processing!
Official PyTorch implementation of "OmniNet: A unified architecture for multi-modal multi-task learning" | Authors: Subhojeet Pramanik, Priyanka Agrawal, Aman Hussain
Multi-modality pre-training
Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥PyTorch ecosystem. ⭐ Star to support our work!
Multi-modal learning toolkit based on PaddlePaddle and PyTorch, supporting multiple applications such as multi-modal classification, cross-modal retrieval, and image captioning.
[CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language Understanding and Generation.
[IEEE Transactions on Medical Imaging/TMI 2023] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
An open-source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multimodal AI that uses just a decoder to generate both text and images
[NeurIPS 2023] This repository includes the official implementation of our paper "An Inverse Scaling Law for CLIP Training"