Download the sentence-transformers model (here, all-MiniLM-L6-v2) from the Hugging Face website
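If you prefer to script the download rather than fetching the files from the website by hand, a minimal sketch using huggingface_hub (the local_dir below is an assumption, chosen to match the path used in step 2):
from huggingface_hub import snapshot_download

# Download the whole model repo into a local directory
# (local_dir is an assumption matching the path used in step 2)
snapshot_download(repo_id='sentence-transformers/all-MiniLM-L6-v2',
                  local_dir='D:/Model/sentence-transformers/all-MiniLM-L6-v2')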
1. Import the required libraries
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch
import torch.nn.functional as F
2. Load the pretrained model
# Local path to the downloaded all-MiniLM-L6-v2 model
path = 'D:/Model/sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)
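If the files are not available locally, the same classes can also load the model directly by its Hub id (the weights are downloaded on first use):
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')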
3. Define mean pooling
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    # Expand the mask so padding tokens are zeroed out before averaging
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (torch.sum(token_embeddings * input_mask_expanded, 1)
            / torch.clamp(input_mask_expanded.sum(1), min=1e-9))
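As a quick sanity check of the pooling logic, here is a toy example with made-up numbers: one sentence, two token positions, the second marked as padding, so only the first token's embedding should survive the average:
# Toy input: batch of 1 sentence, 2 token positions, embedding dim 2
toy_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0]]])
toy_mask = torch.tensor([[1, 0]])  # second position is padding
# mean_pooling only reads model_output[0], so a 1-tuple is enough here
mean_pooling((toy_embeddings,), toy_mask)  # expected: tensor([[1., 2.]])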
4. Embed the sentences
# Sample review sentences (the text was preprocessed upstream)
sentences = ['loved thisand know really bought wanted see pictures myselfIm lucky enough someone could justify buying present',
             'issue pages stickers restuck really used configurations made regular pages rather taking pieces robot back',
             'stickers dont stick well first time placing',
             'Great fun grandson loves robots',
             'would suggest younger kids son 3']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings1 = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings1)
# Normalize embeddings
sentence_embeddings2 = F.normalize(sentence_embeddings1, p=2, dim=1)
print("Sentence embeddings:")
print(sentence_embeddings2)
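For comparison, the sentence-transformers library wraps the tokenization, mean pooling, and normalization steps into one call; a minimal sketch of the equivalent (assuming the sentence-transformers package is installed):
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer(path)
# encode() tokenizes, mean-pools, and (with this flag) L2-normalizes
st_embeddings = st_model.encode(sentences, normalize_embeddings=True)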
5. Output
Running the code above prints the raw mean-pooled embeddings and then the L2-normalized embeddings, one 384-dimensional vector per sentence.
6. Define similarity between sentences
def compute_sim_score(v1, v2):
    # Cosine similarity: dot product divided by the product of the norms
    return v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
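Since the embeddings are torch tensors, PyTorch's built-in cosine similarity gives the same score without going through numpy; a one-line alternative:
# Equivalent torch-only version (dim=0 because v1 and v2 are 1-D vectors)
def compute_sim_score_torch(v1, v2):
    return F.cosine_similarity(v1, v2, dim=0)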
7. Compute sentence similarity
#'issue pages stickers restuck really used configurations made regular pages rather taking pieces robot back'
#'stickers dont stick well first time placing'
compute_sim_score(sentence_embeddings1[1], sentence_embeddings1[2])
# result: tensor(0.5126)
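Because sentence_embeddings2 was L2-normalized in step 4, cosine similarity between those vectors reduces to a plain dot product, which should reproduce the same score:
# On normalized embeddings the dot product alone is the cosine similarity
sentence_embeddings2[1].dot(sentence_embeddings2[2])  # should also give ≈ 0.5126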
8. Check the shape of the embeddings
sentence_embeddings1.shape
# torch.Size([5, 384])
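With the normalized embeddings, all pairwise similarities can also be computed at once as a matrix product, giving a 5×5 matrix for these five sentences:
# Entry [i, j] is the cosine similarity between sentence i and sentence j
similarity_matrix = sentence_embeddings2 @ sentence_embeddings2.T
print(similarity_matrix.shape)  # torch.Size([5, 5])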
Outlook and summary:
Next, try embedding real users' review sentences about the product.
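As a starting point, a minimal sketch for embedding a longer list of review sentences in batches (the reviews list and batch size here are placeholders, not real data):
reviews = ['example review one', 'example review two']  # placeholder data
batch_size = 32  # placeholder batch size
all_embeddings = []
for i in range(0, len(reviews), batch_size):
    enc = tokenizer(reviews[i:i + batch_size], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = model(**enc)
    all_embeddings.append(F.normalize(mean_pooling(out, enc['attention_mask']), p=2, dim=1))
review_embeddings = torch.cat(all_embeddings, dim=0)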