NLP: Pointwise Mutual Information (PMI), a Measure of the Association Between Two Variables

Yale曼陀罗

Article link: https://blog.csdn.net/weixin_42782150/article/details/127068069

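This article introduces pointwise mutual information (PMI), a measure of the association between two words. Computing PMI shows whether two words occur independently of each other. The article shows how to compute PMI with Python's nltk library, explains what PMI values mean through examples, and shows how to write a custom function for the calculation.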

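Introduction

In natural language processing, we often want to know whether two words are related, for example whether certain words tend to appear together and whether their co-occurrence carries information.

For example, in news reports the words New and York often appear together and denote the place name New York, so when New appears, York is likely to appear as well. Pointwise Mutual Information (PMI) can be used to measure how strongly New and York are associated.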

1. Basic Concept of PMI

Pointwise Mutual Information (PMI): in data mining and information retrieval, PMI is often used to measure the association between two items, such as two words or two sentences.

PMI is defined as:

$PMI(x, y) = \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$

If two words occur independently (that is, $x$ and $y$ are independent), then $p(x, y) = p(x)\,p(y)$ and the PMI is 0:

$PMI(x, y) = \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x)\,p(y)}{p(x)\,p(y)} = \log 1 = 0$

If the two words are not independent, that is, the occurrence of one raises the probability of the other, then the PMI is greater than 0:

$p(x, y) > p(x)\,p(y) \;\Rightarrow\; PMI(x, y) = \log\frac{p(x,y)}{p(x)\,p(y)} > 0$

In short, the less the co-occurrence of two words is due to chance, the higher the PMI, and the more likely the pair carries some information.
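The cases above can be checked numerically. The following is a minimal sketch; the probabilities are hypothetical values chosen only for illustration, and the helper name `pmi` is not taken from any library. It also includes a third case, a joint probability below the product of the marginals, which gives a negative PMI.

```python
import math

def pmi(p_xy, p_x, p_y, base=2):
    """PMI from a joint probability and two marginals (all must be > 0)."""
    return math.log(p_xy / (p_x * p_y), base)

# Hypothetical probabilities, for illustration only.
p_x, p_y = 0.2, 0.1

independent = pmi(p_x * p_y, p_x, p_y)  # joint equals the product of marginals
attracted = pmi(0.08, p_x, p_y)         # joint above the product
repelled = pmi(0.005, p_x, p_y)         # joint below the product

# The conditional form log(p(x|y) / p(x)) gives the same value as the joint form.
p_x_given_y = 0.08 / p_y
conditional_form = math.log(p_x_given_y / p_x, 2)

print(independent, attracted, repelled, conditional_form)
```

The base of the logarithm only rescales the result; base 2 is used here so that the values are in bits.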

A co-occurrence matrix (a table of word co-occurrence counts) is commonly used to record how often two words appear in the same document of a corpus, for example:

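[Image: word co-occurrence count table]
Notes:

Each number in the table is the number of times the word on the left and the word at the top appear together;
Assume the article contains only 19 words in total.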

Next, take the PMI of the words information and data as an example:

Probability that information and data occur together:
$P(x=\text{information}, y=\text{data})=\frac{6}{19}=0.32$
Probability that information occurs:
$P(x=\text{information})=\frac{6 + 4 + 1}{19}=\frac{11}{19}=0.58$
Probability that data occurs:
$P(y=\text{data})=\frac{6 + 1}{19}=\frac{7}{19}=0.37$
PMI of information and data:
$\begin{aligned} PMI(x=\text{information}, y=\text{data}) &=\log_2\frac{P(x=\text{information}, y=\text{data})}{P(x=\text{information}) \times P(y=\text{data})}\\ &=\log_2\frac{0.32}{0.58 \times 0.37}\\ &=\log_2 1.49\\ &\approx 0.57 \end{aligned}$
The result is greater than 0, which indicates that the occurrences of information and data are not independent.
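The arithmetic of this worked example can be reproduced in a few lines. This is only a sketch: the counts are the ones quoted in the text, and the variable names are chosen here for readability.

```python
import math

# Counts quoted in the worked example above.
total_words = 19
count_information_data = 6     # information and data together
count_information = 6 + 4 + 1  # all occurrences of information
count_data = 6 + 1             # all occurrences of data

p_xy = count_information_data / total_words
p_x = count_information / total_words
p_y = count_data / total_words

pmi_value = math.log2(p_xy / (p_x * p_y))
print(round(p_xy, 2), round(p_x, 2), round(p_y, 2))
print(round(pmi_value, 2))
```

Working from the unrounded fractions, the ratio is 114/77, and its base-2 logarithm rounds to the same 0.57 as in the text.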

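The other intermediate results are shown in the table below:

[Image: table of intermediate PMI results]
The table also reveals a problem: some cells require $\log_2 0 = -\infty$, that is, negative infinity. To avoid -inf, people usually compute PPMI (Positive PMI) instead: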

$PMI(w,c) = \log_2\frac{p(w,c)}{p(w)\,p(c)} \;\Rightarrow\; PPMI(w,c) = \max\left(\log_2\frac{p(w,c)}{p(w)\,p(c)},\, 0\right)$
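A minimal sketch of PPMI for a single word pair follows. The function name `ppmi` and the probabilities are hypothetical. Because Python's `math.log2(0)` raises a `ValueError` rather than returning negative infinity, the zero case is handled explicitly here.

```python
import math

def ppmi(p_wc, p_w, p_c):
    """Positive PMI in bits; a zero joint probability maps to 0 instead of -inf."""
    if p_wc == 0:
        return 0.0
    return max(math.log2(p_wc / (p_w * p_c)), 0.0)

print(ppmi(0.0, 0.3, 0.2))   # the pair never co-occurs
print(ppmi(0.03, 0.3, 0.2))  # below chance: the negative PMI is clipped to 0
print(ppmi(0.12, 0.3, 0.2))  # above chance: the PMI is kept
```

Clipping discards the difference between "never seen together" and "seen together less often than chance", which is the price paid for avoiding infinite values.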

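2. Computing the PMI of Two Words with Python nltk

We use the news category of the Brown corpus in Python nltk to compute the PMI of New and York and the PMI of New and The, and compare the two.

The code is as follows: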

import nltk
from nltk.corpus import brown
from nltk import WordNetLemmatizer
from math import log
wnl = WordNetLemmatizer()

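# nltk.download('brown')    # download the corpus
# nltk.download('wordnet')  # required by WordNetLemmatizer
# nltk.download('omw-1.4')

"""
Names:
_Fdist: frequency of each word;
_Sents: all sentences in the articles;
p(x): relative frequency of word x;
pxy(x,y): probability that words x and y appear in the same sentence;
pmi(x,y): Pointwise Mutual Information of words x and y
"""
_Fdist = nltk.FreqDist([wnl.lemmatize(w.lower()) for w in brown.words(categories='news')])
_Sents = [[wnl.lemmatize(j.lower()) for j in i] for i in brown.sents(categories='news')]

def p(x):
    # Note: len(_Fdist) is the number of distinct words, not the total token
    # count (_Fdist.N()); the outputs below were produced with this normalization.
    return _Fdist[x]/float(len(_Fdist))

def pxy(x,y):
    # Add 1 to the count so that pairs that never co-occur do not produce log(0).
    return (len(list(filter(lambda s: x in s and y in s, _Sents)))+1)/float((len(_Sents)))

def pmi(x,y):
    return log(pxy(x,y)/(p(x)*p(y)),2)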

# First compute the individual probabilities of new and york
p('new')
>>>
0.020265724857046755
p('york')
>>>
0.004372687521022536
# Then compute the probability that new and york appear in the same sentence
pxy('new','york')
>>>
0.011031797534068787
# Compute the PMI of new and york
pmi('new', 'york')
>>>
6.959890136179789

# Now look at the probability that new and the appear in the same sentence
pxy('new','the')
>>>
0.04023361453601557
# And the probability of the
p('the')
>>>
0.5369996636394214
# Compute the PMI of new and the
pmi('new','the')
>>>
1.8863664858873235

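These results show that, according to PMI, new and york appear together far more often than chance would predict, so their co-occurrence likely carries more information. By contrast, although new and the co-occur frequently, the also occurs very frequently on its own; PMI divides out the frequency of the, which shows that the occurrences of new and the are closer to independent events.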
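The same effect can be seen without downloading any corpus. The sketch below uses a hypothetical six-sentence toy corpus. Both probabilities are estimated as fractions of sentences, which is an assumption of this sketch rather than the normalization used in the code above.

```python
import math

# Hypothetical toy corpus: each sentence is a list of lowercased tokens.
sentences = [
    ["the", "new", "york", "office", "opened"],
    ["the", "mayor", "of", "new", "york", "spoke"],
    ["the", "new", "plan", "was", "approved"],
    ["the", "council", "met", "today"],
    ["the", "budget", "was", "cut"],
    ["the", "vote", "was", "close"],
]

def p(word):
    """Fraction of sentences that contain `word`."""
    return sum(word in s for s in sentences) / len(sentences)

def p_xy(x, y):
    """Fraction of sentences that contain both words."""
    return sum(x in s and y in s for s in sentences) / len(sentences)

def pmi(x, y):
    """Sentence-level PMI in bits (assumes the pair co-occurs at least once)."""
    return math.log2(p_xy(x, y) / (p(x) * p(y)))

print(pmi("new", "york"))  # york appears only alongside new
print(pmi("new", "the"))   # the appears in every sentence
```

Because the appears in every sentence, knowing that it is present says nothing about new, and the PMI of that pair is zero, while new and york score higher.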

Reference: Natural Language Processing: Pointwise Mutual Information

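NLTK, the Python natural language processing library, provides PMI computation in its Collocations module. The module contains two classes, BigramCollocationFinder and TrigramCollocationFinder, which identify two-word and three-word phrases respectively.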
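To make clear what a bigram PMI score measures, here is a plain-Python sketch that scores adjacent word pairs from raw counts. It is not NLTK's implementation: the function name `bigram_pmi`, the sample sentence, and the estimate log2(c(w1,w2) * N / (c(w1) * c(w2))) are assumptions of this sketch, and NLTK's exact normalization may differ in detail.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Score each adjacent word pair as log2(c(w1,w2) * N / (c(w1) * c(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {
        (w1, w2): math.log2(count * n / (unigrams[w1] * unigrams[w2]))
        for (w1, w2), count in bigrams.items()
    }
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample text, for illustration only.
tokens = "new york is big and new york is busy and the city is new".split()
ranked = bigram_pmi(tokens)
for pair, score in ranked[:3]:
    print(pair, round(score, 2))
```

Pairs made of words that each occur only once receive the highest scores, a known weakness of PMI on small samples. NLTK's finder offers `apply_freq_filter` to drop rare pairs before scoring.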

Reference: Computing PMI (Pointwise Mutual Information) with Python in Jupyter Notebook

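"""
Implements the following:

Read data from txt, xls, and xlsx files (for Excel files, the data is stored in a single column)

Preprocess the text: tokenization, lowercasing of English, English stemming, stop-word removal

Compute the PMI of every adjacent word pair (bigram) in the text

Save the PMI values to a CSV file
"""
import re
import csv
import jieba
import nltk  # needed by english() below
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder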

def chinese(text):
    """
    Process Chinese text and write the computed PMI scores, in CSV format,
    to the output file hard-coded below.
    """
    # Keep only Chinese characters, then segment with jieba.
    content = ''.join(re.findall(r'[\u4e00-\u9fa5]+', text))
    words = jieba.cut(content)
    # Drop single-character tokens.
    words = [w for w in words if len(w)>1]
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    
    with open('/Users/Documents/NLP_data_projects/data/pmi.txt',
             'a+', encoding='utf-8', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1','word2','pmi_score'))
        for row in finder.score_ngrams(bigram_measures.pmi):
            # row is ((word1, word2), score)
            data = (*row[0],row[1])
            try:
                writer.writerow(data)
            except Exception:
                pass
def english(text):
    """
    Process English text and write the computed PMI scores to 'english_pmi.txt'.
    """
    stopwordss = set(stopwords.words('english'))
    stemmer = nltk.stem.snowball.SnowballStemmer('english')
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(text)
    # Drop numbers, lowercase, stem, then remove stop words.
    words = [w for w in words if not w.isnumeric()]
    words = [w.lower() for w in words]
    words = [stemmer.stem(w) for w in words]
    words = [w for w in words if w not in stopwordss]
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    with open('/Users/Documents/NLP_data_projects/data/english_pmi.txt',
             'a+', encoding='utf-8', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1','word2','pmi_score'))
        for row in finder.score_ngrams(bigram_measures.pmi):
            # row is ((word1, word2), score)
            data = (*row[0],row[1])
            try:
                writer.writerow(data)
            except Exception:
                pass

def pmi_score(file, lang, column='entity'):
    """
    Compute PMI:
    param file: the raw text data file
    param lang: language of the data, either 'chinese' or 'english'
    param column: for csv or Excel files, the name of the column that holds the text
    """
    # Read the data
    text = ''
    if file.endswith('.csv'):
        df = pd.read_csv(file)
        rows = df.iterrows()
        for row in rows:
            text += row[1][column]
    elif file.endswith(('.xlsx', '.xls')):
        df = pd.read_excel(file)
        rows = df.iterrows()
        for row in rows:
            text += row[1][column]
    else:
        # Assumes the text file is UTF-8 encoded.
        with open(file, encoding='utf-8') as f:
            text = f.read()
        
    # Compute PMI for this language by calling the function named by lang
    globals()[lang](text)

# Compute PMI
pmi_score(file='/Users/yangyang/Documents/NLP_data_projects/data/pmi.txt', lang='chinese')

Reference: PMI (Pointwise Mutual Information) Calculation

3. A Custom PMI Function Computed from a Word Co-occurrence Count Table

# Defined in Section 2.1.2
import numpy as np
M = np.array([[0, 2, 1, 1, 1, 1, 1, 2, 1, 3],
              [2, 0, 1, 1, 1, 0, 0, 1, 1, 2],
              [1, 1, 0, 1, 1, 0, 0, 0, 0, 1],
              [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
              [1, 1, 1, 1, 1, 0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0, 0, 1, 1, 0, 1],
              [1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
              [2, 1, 0, 0, 1, 1, 1, 0, 1, 2],
              [1, 1, 1, 0, 1, 0, 0, 1, 0, 1],
              [3, 2, 1, 1, 1, 1, 1, 2, 2, 0]])
def pmi(M, positive=True):
    col_totals = M.sum(axis=0)
    row_totals = M.sum(axis=1)
    total = col_totals.sum()
    expected = np.outer(row_totals, col_totals)/total
    M = M/expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):  # np.errstate() is a context manager for floating-point error handling
        M = np.log(M)
    M[np.isinf(M)] = 0.0  # treat log(0) as 0; np.isinf() tests element-wise for positive or negative infinity
    if positive:
        M[M<0] = 0.0
    return M

M_pmi = pmi(M)
np.set_printoptions(precision=2)  # np.set_printoptions() controls the display precision of NumPy floating-point output
print(M_pmi)
>>>
[[0.   0.25 0.   0.14 0.   0.37 0.37 0.37 0.14 0.29]
 [0.25 0.   0.33 0.51 0.04 0.   0.   0.04 0.51 0.25]
 [0.14 0.51 0.   1.1  0.63 0.   0.   0.   0.   0.14]
 [0.14 0.51 0.92 0.   0.63 0.   0.   0.   0.   0.14]
 [0.   0.33 0.73 0.92 0.45 0.   0.   0.   0.   0.  ]
 [0.37 0.   0.   0.   0.   0.   1.54 0.85 0.   0.37]
 [0.37 0.   0.   0.   0.   1.54 0.   0.85 0.   0.37]
 [0.25 0.   0.   0.   0.04 0.73 0.73 0.   0.51 0.25]
 [0.   0.33 0.73 0.   0.45 0.   0.   0.45 0.   0.  ]
 [0.21 0.17 0.   0.07 0.   0.29 0.29 0.29 0.76 0.  ]]
# np.linalg.svd(a, full_matrices=True, compute_uv=True): singular value decomposition
U, s, Vh = np.linalg.svd(M_pmi)   

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

words = ['我', '喜欢', '自然','语言','处理','爱','深度','学习','机器','。']
for i in range(len(words)):
    plt.text(U[i, 0], U[i, 1], words[i])
plt.xlim(-0.7, 0)
plt.ylim(-0.5, 0.8)
# plt.savefig('/Users/yangyang/Documents/NLP_learning/svg.pdf')
plt.show()