Abstract
Clustering-based unsupervised relation discovery has gradually become one of the important methods for open relation extraction (OpenRE).
However, high-dimensional vectors can encode complex linguistic information, so the derived clusters may fail to align explicitly with relational semantic classes.
In this work, we propose a relation-oriented clustering model and use it to identify novel relations in unlabeled data.
Specifically, to enable the model to learn to cluster relational data, our method leverages the readily available labeled data of pre-defined relations to learn a relation-oriented representation.
We minimize the distance between instances with the same relation by gathering the instances towards their corresponding relation centroids to form a cluster structure, so that the learned representation is cluster-friendly.
To reduce the clustering bias towards pre-defined classes, we optimize the model by minimizing a joint objective on both labeled and unlabeled data.
Experimental results show that our method reduces the error rate by 29.2% and 15.7% on two datasets, respectively, compared with current state-of-the-art (SOTA) methods.
1 Introduction
Relation extraction (RE), a crucial basic task in information extraction, is of great practical interest to various fields, including web search (Xiong et al., 2017), knowledge base completion (Bordes et al., 2013), and question answering (Yu et al., 2017).
However, conventional RE paradigms such as supervised and distantly supervised learning are generally designed for pre-defined relations and cannot deal with newly emerging relations in the real world.
Against this background, open relation extraction (OpenRE) has been widely studied for its use in extracting newly emerging relation types from open-domain corpora.
Approaches to handling open relations fall roughly into two groups. The first group is open information extraction (OpenIE) (Etzioni et al., 2008; Yates et al., 2007; Fader et al., 2011), which directly extracts relation phrases as representations of different relation types. However, if not properly canonicalized, the extracted relational facts can be redundant and ambiguous.
The second group is unsupervised relation discovery (Yao et al., 2011; Shinyama and Sekine, 2006; Simon et al., 2019). In this line of research, much attention has focused on unsupervised clustering-based RE methods, which cluster and recognize relations from high-dimensional representations (Elsahar et al., 2017). Recently, the self-supervised signals in pretrained language models have been further exploited for clustering optimization (Hu et al., 2020).
However, many studies show that high-dimensional embeddings can encode complex linguistic information such as morphological (Peters et al., 2018), local syntactic (Hewitt and Manning, 2019), and longer-range semantic information (Jawahar et al., 2019). As a result, clusters derived from such embeddings may not align explicitly with relational semantic classes (see Figure 1).
In this work, we propose a relation-oriented clustering method. To enable the model to learn to cluster relational data, pre-defined relations and their existing labeled instances are leveraged to optimize a non-linear mapping, which transforms high-dimensional entity pair representations into relation-oriented representations.
Specifically, we minimize the distance between instances with the same relation by gathering the instance representations towards their corresponding relation centroids to form the cluster structure, so that the learned representation is cluster-friendly.
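The idea of gathering instances towards their relation centroids can be illustrated with a small center-loss-style sketch. This is only an illustration under stated assumptions: the exact loss is not given in this section, and the function names, the toy two-dimensional representations, and the relation labels below are all hypothetical.

```python
from collections import defaultdict


def relation_centroids(reps, labels):
    """Return the mean representation (centroid) of each relation label."""
    groups = defaultdict(list)
    for vec, lab in zip(reps, labels):
        groups[lab].append(vec)
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }


def center_loss(reps, labels, centroids):
    """Average squared Euclidean distance from each instance to its own centroid."""
    total = 0.0
    for vec, lab in zip(reps, labels):
        total += sum((x - c) ** 2 for x, c in zip(vec, centroids[lab]))
    return total / len(reps)


# Toy labeled data (hypothetical): two instances per pre-defined relation.
reps = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
labels = ["founded", "founded", "ceo", "ceo"]

centroids = relation_centroids(reps, labels)
loss = center_loss(reps, labels, centroids)
```

Minimizing a quantity of this kind with respect to the representations pulls instances of the same relation together, which is what makes the resulting space easier to cluster.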
To reduce the clustering bias towards the pre-defined classes, we iteratively train the entity pair representations by optimizing a joint objective function on the labeled and unlabeled subsets of the data, improving both the supervised classification of the labeled data and the clustering of the unlabeled data.
In addition, the proposed method can be easily extended to incremental learning by classifying the pre-defined and novel relations with a unified classifier, which is often desirable in real-world applications. Our experimental results show that our method outperforms current state-of-the-art methods for OpenRE. Our code is publicly available on GitHub*.
To summarize, the main contributions of our work are as follows: (1) we propose a novel relation-oriented clustering method, RoCORE, that enables the model to learn to cluster relational data; (2) the proposed method achieves incremental learning of unlabeled novel relations, which is often desirable in real-world applications; (3) experimental results show that our method reduces the error rate by 29.2% and 15.7% on two real-world datasets, respectively, compared with current state-of-the-art OpenRE methods.
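Figure 1: Although instances S2 and S3 both express the founded relation while S1 expresses the CEO relation, the distance between S1 and S2 is still smaller than that between S2 and S3. This is because S1 and S2 may share more surface information (e.g., word overlap) or syntactic structure, so the derived clusters cannot explicitly align with relations.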
2 Related Work
Open Relation Extraction. To meet the need to extract newly emerging relation types, many efforts have been made to explore methods for open relation extraction (OpenRE).
The first line of research is open information extraction (Etzioni et al., 2008; Yates et al., 2007; Fader et al., 2011), in which relation phrases are extracted directly to represent different relation types.
However, using surface forms to represent relations results in a lack of generality, since many surface forms can express the same relation.
Recently, unsupervised clustering-based RE methods have attracted considerable attention. Elsahar et al. (2017) proposed to extract and cluster open relations by re-weighting word embeddings and using the types of named entities as additional features.
Hu et al. (2020) proposed to exploit weak, self-supervised signals in pretrained language models for adaptive clustering on contextualized relational features.
However, the self-supervised signals are sensitive to the initial representation (Gansbeke et al., 2020), and there is still no guarantee that the learned clusters will align with the relational semantic classes (Xing et al., 2002).
Wu et al. (2019) proposed learning relation similarity metrics from labeled data and then transferring the relational knowledge to identify novel relations in unlabeled data. In contrast, we propose a relation-oriented method that explicitly clusters data based on relational information.
Knowledge in High-Dimensional Vector. Pretrained static and contextual word representations can provide valuable prior knowledge for constructing relational representations (Soares et al., 2019; Elsahar et al., 2017).
Peters et al. (2018) showed that different neural architectures (e.g., LSTM, CNN, and Transformers) can hierarchically structure linguistic information that varies with network depth. Recently, many studies (Jawahar et al., 2019; Clark et al., 2019; Goldberg, 2019) have shown that such a hierarchy also exists in pretrained models like BERT.
These results suggest that high-dimensional embeddings, independent of model architecture, learn much about the structure of language.
Directly clustering these high-dimensional embeddings is therefore unlikely to produce clusters aligned with relations, which motivates us to extend current unsupervised clustering-based RE methods to learn representations tailored for clustering relational data.
3 Approach
In this work, we propose a relation-oriented clustering method, which takes advantage of the relational information in the existing labeled data to enable the model to learn to cluster relational data. To reduce the clustering bias towards the pre-defined classes, we iteratively train the entity pair representations by optimizing a joint objective function on the labeled and unlabeled subsets of the data, improving both the supervised classification of the labeled data and the clustering of the unlabeled data. The proposed method is shown in Figure 2.
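A joint objective over labeled and unlabeled data could look roughly like the sketch below. It assumes a unified classifier that outputs a probability distribution over pre-defined and novel classes, cross-entropy as the loss on both subsets, and a simple weighted sum. The `weight` parameter, the function names, and the toy probabilities are assumptions made for illustration and do not come from this section.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability that the classifier assigns to the target class."""
    return -math.log(probs[target])


def joint_objective(labeled, pseudo_labeled, weight=1.0):
    """Supervised loss on labeled data plus a weighted loss on pseudo-labeled data.

    Each item is a (class_probabilities, class_index) pair.
    `weight` is a hypothetical trade-off coefficient.
    """
    supervised = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    clustering = sum(cross_entropy(p, y) for p, y in pseudo_labeled) / len(pseudo_labeled)
    return supervised + weight * clustering


# Toy example: classes 0-1 are pre-defined relations, classes 2-3 are novel ones.
labeled = [([0.7, 0.1, 0.1, 0.1], 0), ([0.1, 0.6, 0.2, 0.1], 1)]
pseudo_labeled = [([0.1, 0.1, 0.5, 0.3], 2)]  # pseudo label from clustering

loss = joint_objective(labeled, pseudo_labeled)
```

Because both terms share the same representations, training on the sum keeps the model accurate on pre-defined relations while also fitting the cluster structure of the unlabeled data.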
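Figure 2: Overview of our RoCORE method. In the first step, we encode the labeled and unlabeled instances into entity pair representations. In the second step, the entity pair representations are transformed into relation-oriented representations by gathering them towards their relation centroids. Finally, based on the pseudo labels generated by clustering the unlabeled data, we optimize the entity pair representations and the classifier by minimizing the joint objective function, reducing the clustering bias towards pre-defined classes. These three steps are performed iteratively to gradually improve the model's performance on novel relations.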
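The pseudo-labeling step can be sketched with a plain k-means pass over the unlabeled representations. This section does not name the clustering algorithm, so k-means is an assumption here, as are the deterministic initialization, the function names, and the toy points.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pseudo_labels(points, k, iterations=10):
    """Run plain k-means and return (pseudo_labels, centroids).

    Initialization uses the first k points so the result is deterministic;
    a real system would likely use a better seeding strategy.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the index of its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy relation-oriented representations of unlabeled instances (hypothetical).
unlabeled = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
pseudo_labels, centroids = kmeans_pseudo_labels(unlabeled, k=2)
```

In an iterative scheme like the one in Figure 2, the cluster indices would serve as training targets for the novel-relation classes, and clustering would be repeated after the representations are updated.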