[CLIP-VIT-L + Qwen] Multimodal LLM Source Code Walkthrough - DataSet

Reference repo: WatchTower-Liu/VLM-learning; url: VLLM-BASE
Recap
For the language-model part of the architecture (MQwen.py), see Multimodal LLM Source Code Walkthrough - 1, 2, 3, and 4.
For the vision-model part (visual/CLIP-VIT.py), see Multimodal LLM Source Code Walkthrough - 5.
For the trainer (trainer.py), see Multimodal LLM Source Code Walkthrough - 6.
For the multimodal fusion part (MultiModal.py), see Multimodal LLM Source Code Walkthrough - MultiModal.
A note before reading: the multimodal architecture covered in this series comes from the GitHub project WatchTower-Liu/VLM-learning. It rewrites the forward-pass code of the Qwen model and uses an intermediate projection layer to map visual features and text into the same vector space; the projection-layer design follows LLAVA.
This installment covers the dataset part of the architecture, which prepares image and text data so they can be used for the image captioning task.
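To set expectations for the code below: each sample the dataset yields is a tuple of (image tensor, caption token ids, answer length), and the collate function stacks a list of such samples into one batch dict of "images", "input_ids", and "labels". The following sketch is purely illustrative; the shapes and dummy token ids are assumptions for demonstration, not values taken from the repo.

import torch

# One dataset item in the form that data_collate below unpacks:
#   img     - transformed image tensor, here assumed to be 3 x 224 x 224
#   caption - token ids of "prompt + answer"
#   L       - number of trailing tokens in caption that belong to the answer
img = torch.zeros(3, 224, 224)
caption = [101, 102, 103, 104, 105]  # dummy token ids
L = 2                                # the last two tokens are the answer to supervise
item = (img, caption, L)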
Source Code Walkthrough
Full code
import torch
import json
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from transformers import CLIPProcessor, SiglipProcessor
from PIL import Image
import numpy as np
from tqdm import tqdm
from qwen.qwen_generation_utils import make_context
def readJson(filePath):
    with open(filePath, 'r', encoding="utf-8") as f:
        data = json.load(f)
    return data
def data_collate(example, tokenizer, black_token_length):
    images = []
    captions = []
    labels = []
    # Longest caption in the batch, plus one slot for the eod token appended below.
    max_length = np.max([len(e[1]) for e in example]) + 1
    for e in example:
        img, caption, L = e
        # The appended eod token is also part of the supervised answer.
        L = L + 1
        caption = caption + [tokenizer.eod_id]
        images.append(img)
        # Labels: -100 everywhere except the last L caption tokens (answer plus eod);
        # positions covered by the image features and the prompt are masked,
        # and the tail is padded to max_length with -100.
        caption_labels = [-100] * (black_token_length + (len(caption) - L) - 1) + caption[-L:] + [-100] * (max_length - len(caption))
        captions.append(torch.tensor(caption + [tokenizer.eod_id] * (max_length - len(caption))))
        labels.append(torch.tensor(caption_labels))

    labels = torch.stack(labels, dim=0).long()
    captions = torch.stack(captions, dim=0).long()
    images = torch.stack(images, dim=0).to(torch.float16)

    return {"images": images, "input_ids": captions, "labels": labels}
class ImageCaptionDataset(Dataset):
    def __init__(self, tokenizer, image_map_file, captions_file, Vconfig, return_caption_num=1, max_train_data_item=None):
        super().__init__()
        self.tokenizer = tokenizer
        self.return_caption_num = return_caption_num
        self.max_train_data_item = max_train_data_item

        mean = [0.485, 0.456, 0.406]  # RGB
        std = [0.229, 0.224, 0.225]   # RGB
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean, std),
            transforms.Resize([224, 224])
        ])

        self.image_map = readJson(image_map_file)
        self.captions = readJson(captions_file)