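Python Programming Exercise: Replacing Specified Characters in a Sentence Template
Goal: given a sentence template (containing the placeholders to replace), the candidate replacements, and the spans of the placeholders, generate N samples by random replacement and save them to a JSON file, one object per line.
Python script: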
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import json
import random

from loguru import logger


def get_data(syntax, data_map, slot_name_span, json_file_path, threshold=10):
    """
    Generate randomly filled samples from a sentence template.

    :param syntax: template sentence containing the placeholders to replace
    :param data_map: dict mapping each slot name to its list of candidate entities
    :param slot_name_span: list of (slot_name, start, end) placeholder spans in syntax,
        end exclusive, sorted left to right and non-overlapping
    :param json_file_path: output path; one JSON object is written per line
    :param threshold: number of samples to generate
    :return: None
    """
    with open(json_file_path, 'w', encoding='utf-8') as outfile:
        for i in range(threshold):
            text = syntax  # strings are immutable, so no copy is needed
            # offset: net change in text length caused by earlier replacements
            offset, label_list = 0, []
            for slot_name, s_index, e_index in slot_name_span:
                entity = random.choice(data_map[slot_name])
                # Shift the original span by the accumulated offset, then substitute.
                new_start = s_index + offset
                text = text[:new_start] + entity + text[e_index + offset:]
                label_list.append([new_start, new_start + len(entity), slot_name, entity])
                offset += len(entity) - (e_index - s_index)
            res = {"text": text, "label_list": label_list}
            logger.info("Sample {0}: {1}.".format(i + 1, res))
            json.dump(res, outfile, ensure_ascii=False)
            outfile.write("\n")


if __name__ == "__main__":
    get_data(syntax="time我想喝tea",
             data_map={"time": ["今天", "明天", "后天"],
                       "tea": ["茶百道", "喜茶", "奈雪的茶", "一点点", "瑞幸咖啡"]},
             slot_name_span=[("time", 0, 4), ("tea", 7, 10)],
             json_file_path="./data_demo.json")
Results:
(1) Running the script
(2) Contents of the JSON file
{"text": "今天我想喝瑞幸咖啡", "label_list": [[0, 2, "time", "今天"], [5, 9, "tea", "瑞幸咖啡"]]}
{"text": "今天我想喝茶百道", "label_list": [[0, 2, "time", "今天"], [5, 8, "tea", "茶百道"]]}
{"text": "后天我想喝喜茶", "label_list": [[0, 2, "time", "后天"], [5, 7, "tea", "喜茶"]]}
{"text": "明天我想喝茶百道", "label_list": [[0, 2, "time", "明天"], [5, 8, "tea", "茶百道"]]}
{"text": "后天我想喝喜茶", "label_list": [[0, 2, "time", "后天"], [5, 7, "tea", "喜茶"]]}
{"text": "今天我想喝茶百道", "label_list": [[0, 2, "time", "今天"], [5, 8, "tea", "茶百道"]]}
{"text": "明天我想喝喜茶", "label_list": [[0, 2, "time", "明天"], [5, 7, "tea", "喜茶"]]}
{"text": "今天我想喝奈雪的茶", "label_list": [[0, 2, "time", "今天"], [5, 9, "tea", "奈雪的茶"]]}
{"text": "今天我想喝瑞幸咖啡", "label_list": [[0, 2, "time", "今天"], [5, 9, "tea", "瑞幸咖啡"]]}
{"text": "今天我想喝茶百道", "label_list": [[0, 2, "time", "今天"], [5, 8, "tea", "茶百道"]]}
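- Use cases: data augmentation, data processing, and data cleaning.
- Detail: the tricky part is that each replacement changes the length of the text, so the index positions of the entities that follow it shift.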
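The index shift described in the last bullet can be looked at separately from the file I/O. The sketch below restates the same bookkeeping deterministically, with the random choice replaced by a fixed mapping. `fill_template` and its `choices` argument are hypothetical names introduced for illustration, and the spans are assumed to be sorted left to right and non-overlapping.

```python
def fill_template(syntax, spans, choices):
    """Replace each (slot, start, end) span in syntax with choices[slot].

    Returns the new text and a list of [start, end, slot, entity] labels
    whose indices refer to the new text.
    """
    text, offset, labels = syntax, 0, []
    for slot, start, end in spans:
        entity = choices[slot]
        new_start = start + offset  # shift by the length changes made so far
        text = text[:new_start] + entity + text[end + offset:]
        labels.append([new_start, new_start + len(entity), slot, entity])
        offset += len(entity) - (end - start)
    return text, labels


if __name__ == "__main__":
    text, labels = fill_template(
        "time我想喝tea",
        [("time", 0, 4), ("tea", 7, 10)],
        {"time": "今天", "tea": "喜茶"},
    )
    print(text)    # 今天我想喝喜茶
    print(labels)  # [[0, 2, 'time', '今天'], [5, 7, 'tea', '喜茶']]
```

Here the 4-character placeholder `time` becomes the 2-character `今天`, so the offset is -2 and the `tea` span that started at index 7 in the template lands at index 5 in the output.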