One Script for a Competition: SMP WEIBO 2016


spylyt



Article link: https://blog.csdn.net/spylyt/article/details/78512911


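## One Script for a Competition: SMP WEIBO 2016 ##

Preface: Profiling users accurately is a foundational problem in social network analysis. This post shares a few small ideas on extracting features from a Weibo user network. Criticism is welcome.

Data source: SMP WEIBO 2016

Task: analyze the relationships between users and the content they post, and cluster users with both unsupervised and supervised methods.

----------

Part 1: Filter by source, i.e. decide whether the content a user posted is spam.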

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from time import time
# IPython/Jupyter magic: render plots inline (notebook only)
%matplotlib inline

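Fields in the training data:

- uid: unique user identifier, made up of digits
- retweet count: number of retweets, numeric
- review count: number of comments, numeric
- source: where the post came from, text
- time: creation time, timestamp text (currently in two formats, yyyy-MM-dd HH:mm:ss and yyyy-MM-dd HH:mm)
- content: text of the post (may include @mentions, emoticons, and so on)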
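Because the time field comes in two formats, any code that uses it has to try both. The following is a minimal sketch using only the standard library. The helper name `parse_status_time` is hypothetical and does not appear in the original script.

```python
from datetime import datetime

# The two timestamp layouts named in the field description.
TIME_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M")

def parse_status_time(text):
    """Return a datetime for either supported format, or None if neither matches."""
    if not isinstance(text, str):
        # Missing fields (e.g. short rows) come through as None.
        return None
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
    return None

with_seconds = parse_status_time("2015-11-10 09:13:35")
without_seconds = parse_status_time("2016-01-07 13:14")
```

With a DataFrame that has a `time` column, the same helper can be applied row by row with `df['time'].map(parse_status_time)`.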

# Column names for the status files, in file order
STATUS_COLUMNS = ['uid', 'retweet', 'review', 'source', 'time', 'content']

def load_status(path):
    """Read a comma-separated status file into a DataFrame."""
    # Assumption: the files are UTF-8; change the encoding if they are not.
    with open(path, 'r', encoding='utf-8') as f:
        # maxsplit=5 keeps commas inside the content field from
        # splitting it into extra columns that would then be dropped
        status = [line.strip().split(',', 5) for line in f]
    df = pd.DataFrame(status).loc[:, :5]
    df.columns = STATUS_COLUMNS
    return df

tr_status = load_status('train/train/train_status.txt')
tr_status.to_csv('train_status.csv', index=False)
display(tr_status.head())

v_status = load_status('valid/valid_status.txt')
v_status.to_csv('valid_status.csv', index=False)
display(v_status.head())


|   | uid | review | source | time | content |
|---|-----|--------|--------|------|---------|
| 0 | 1103763581 | 0 | Arduino中文社区 | 2016-01-07 13:14 | 我用微博在 Arduino 中文社区上登录啦！ Arduino 中文社区 … |
| 1 | 1103763581 | 2 | 荣耀6 Plus | 2015-11-10 09:13:35 | 很长时间没有上微博看看了，估计都快被忘记了吧！无锡·新安 … |
| 2 | 1103763581 | 0 | 荣耀6 Plus | 2015-07-26 20:07:57 | # 农村现状 # 20 年前还是个小孩，一到瓜果成熟的季节，三五… |
| 3 | 1103763581 | 0 | 荣耀6 Plus | 2015-06-22 18:39:47 | 我分享了 @环球时报的文章社评：法国出租与专车司机冲突的启示 |
| 4 | 1103763581 | 6 | 荣耀6 Plus | 2015-06-10 07:37:22 | 好久没上微博了，不知道大家还记得我不？梁家巷显示地图 |


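|   | uid | retweet | review | source | time | content |
|---|-----|---------|--------|--------|------|---------|
| 0 | 1753249671 | 0 | 0 | iPhone客户端 | 2016-05-06 10:01 | 扑通扑通我的心跳！久久不能平 …… 深呼吸、深呼吸、深呼吸！ |
| 1 | 1753249671 | 0 | 0 | iPhone客户端 | 2016-04-15 01:19 | |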
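Since Part 1 is about filtering by source, one plausible first step is to check how often each source label appears. This is an assumption for illustration, not necessarily the approach the post goes on to take. The helper name `source_frequencies` and the sample list are hypothetical; the labels are copied from the output above.

```python
from collections import Counter

def source_frequencies(sources):
    """Count posts per source label, most frequent first."""
    # Skip missing values and empty strings before counting.
    cleaned = [s.strip() for s in sources if isinstance(s, str) and s.strip()]
    return Counter(cleaned).most_common()

# Hypothetical sample built from source labels seen in the tables above.
sample_sources = ["荣耀6 Plus", "荣耀6 Plus", "Arduino中文社区", "iPhone客户端", None]
for source, count in source_frequencies(sample_sources):
    print(source, count)
```

On a DataFrame, `df['source'].value_counts()` gives the same counts directly.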
