[CLIP-VIT-L + Qwen] Multimodal LLM Source Code Walkthrough - DataSet

Reference repo: WatchTower-Liu/VLM-learning; url: VLLM-BASE
Recap
For the language-model part of the architecture (MQwen.py), see Multimodal LLM Source Code Walkthrough - 1, 2, 3, and 4.
For the vision-model part (visual/CLIP-VIT.py), see Multimodal LLM Source Code Walkthrough - 5.
For the trainer (trainer.py), see Multimodal LLM Source Code Walkthrough - 6.
For the multimodal fusion part (MultiModal.py), see Multimodal LLM Source Code Walkthrough - MultiModal.
A note before reading: the multimodal architecture covered in this series comes from the GitHub project WatchTower-Liu/VLM-learning. It rewrites the forward-pass code of the Qwen model and uses an intermediate projection layer to map visual features and text into the same vector space; the projection-layer design follows LLAVA.
This installment covers the dataset part of the architecture, which prepares image and text data so they can be used for the image captioning task.
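To set expectations for the code below: each sample the dataset yields is a tuple of (image tensor, caption token ids, answer length), and the collate function stacks a list of such samples into one batch dict of "images", "input_ids", and "labels". The following sketch is purely illustrative; the shapes and dummy token ids are assumptions for demonstration, not values taken from the repo.

import torch

# One dataset item in the form that data_collate below unpacks:
#   img     - transformed image tensor, here assumed to be 3 x 224 x 224
#   caption - token ids of "prompt + answer"
#   L       - number of trailing tokens in caption that belong to the answer
img = torch.zeros(3, 224, 224)
caption = [101, 102, 103, 104, 105]  # dummy token ids
L = 2                                # the last two tokens are the answer to supervise
item = (img, caption, L)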
Source Code Walkthrough
Full code
import torch
import json
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from transformers import CLIPProcessor, SiglipProcessor
from PIL import Image
import numpy as np
from tqdm import tqdm
from qwen.qwen_generation_utils import make_context
def readJson(filePath):
    with open(filePath, 'r', encoding="utf-8") as f:
        data = json.load(f)
    return data
def data_collate(example, tokenizer, black_token_length):
    images = []
    captions = []
    labels = []
    # Longest caption in the batch, plus one slot for the eod token appended below.
    max_length = np.max([len(e[1]) for e in example]) + 1
    for e in example:
        img, caption, L = e
        # The appended eod token is also part of the supervised answer.
        L = L + 1
        caption = caption + [tokenizer.eod_id]
        images.append(img)
        # Labels: -100 everywhere except the last L caption tokens (answer plus eod);
        # positions covered by the image features and the prompt are masked,
        # and the tail is padded to max_length with -100.
        caption_labels = [-100] * (black_token_length + (len(caption) - L) - 1) + caption[-L:] + [-100] * (max_length - len(caption))
        captions.append(torch.tensor(caption + [tokenizer.eod_id] * (max_length - len(caption))))
        labels.append(torch.tensor(caption_labels))

    labels = torch.stack(labels, dim=0).long()
    captions = torch.stack(captions, dim=0).long()
    images = torch.stack(images, dim=0).to(torch.float16)

    return {"images": images, "input_ids": captions, "labels": labels}
class ImageCaptionDataset(Dataset):
    def __init__(self, tokenizer, image_map_file, captions_file, Vconfig, return_caption_num=1, max_train_data_item=None):
        super().__init__()
        self.tokenizer = tokenizer
        self.return_caption_num = return_caption_num
        self.max_train_data_item = max_train_data_item

        mean = [0.485, 0.456, 0.406]  # RGB
        std = [0.229, 0.224, 0.225]   # RGB
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean, std),
            transforms.Resize([224, 224])
        ])

        self.image_map = readJson(image_map_file)
        self.captions = readJson(captions_file)