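This column records what the author has learned while studying deep learning with TensorFlow.
This section briefly introduces data preprocessing, mainly the conversion and handling of strings, CSV files, DataFrames, dictionaries, and tensors. Using the conversion of a text file of strings into a tensor as the running example, it walks through the process in detail.
The Jupyter notebook for this section has been uploaded to gitee for study and discussion: my gitee repository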
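To apply deep learning to real-world problems, we usually start by preprocessing raw data rather than with data already prepared in tensor format. We use Python's pandas package to preprocess the raw data and convert it into tensor format.
The data used below is taken from the HTTP DATASET CSIC 2010 dataset: https://www.tic.itefi.csic.es/dataset/. The dataset contains tens of thousands of automatically generated web requests and is mainly used for testing web attack protection systems.
1 Storing and loading DataFrame data
In this part we get to know the DataFrame. A DataFrame is a data structure in the pandas library that resembles a table or spreadsheet. It can be viewed as a two-dimensional structure in which data is organized in rows and columns. DataFrame offers rich functionality for cleaning, analyzing, and manipulating data.
Converting dictionary data to a DataFrame
# Build a DataFrame from a dictionary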
import pandas as pd
request_dict={'Method': 'POST',
'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp',
'Protocol': 'HTTP/1.1',
'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma': 'no-cache',
'Cache-control': 'no-cache',
'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language': 'en',
'Host': 'localhost',
'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
'Content-Type': 'application/x-www-form-urlencoded',
'Connection': 'close',
'Content-Length': '68',
'Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}
# A DataFrame can be viewed as a list of rows, with each row one element of the list,
# so the input is passed as a list: the dictionary is wrapped in []
df = pd.DataFrame([request_dict])
df
Result:
| | Method | URL | Protocol | User-Agent | Pragma | Cache-control | Accept | Accept-Encoding | Accept-Charset | Accept-Language | Host | Cookie | Content-Type | Connection | Content-Length | Body |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | POST | http://localhost:8080/tienda1/publico/anadir.jsp | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml+xml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=933185092E0B668B90676E0A2B0767AF | application/x-www-form-urlencoded | close | 68 | id=3&nombre=Vino+Rioja&precio=100&cantidad=55&... |
Here [request_dict] has the following form:
[{'Method': 'POST',
'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp',
'Protocol': 'HTTP/1.1',
'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma': 'no-cache',
'Cache-control': 'no-cache',
'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language': 'en',
'Host': 'localhost',
'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
'Content-Type': 'application/x-www-form-urlencoded',
'Connection': 'close',
'Content-Length': '68',
'Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}]
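As a small illustration of why the dictionary is wrapped in a list, the sketch below uses a hypothetical two-field `record` (not the full request above). Passing a dictionary of plain scalar values directly gives pandas no way to infer the rows, whereas wrapping it in a list yields exactly one row.

```python
import pandas as pd

# Hypothetical, shortened record used only for illustration
record = {"Method": "POST", "Protocol": "HTTP/1.1"}

try:
    pd.DataFrame(record)          # scalars only, no list: pandas raises ValueError
    raised = False
except ValueError:
    raised = True

one_row = pd.DataFrame([record])  # wrapped in a list: one row per dictionary
print(raised, one_row.shape)
```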
Exporting DataFrame data to a CSV file
# Export the DataFrame to a CSV file
import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # create the directory "./data/"
data_file = os.path.join('.', 'data', 'Traffic.csv')
df.to_csv(data_file, index=False)
The to_csv method saves the data in the DataFrame to a file named Traffic.csv. The argument index=False means the row index is not saved (by default, the row index is also written to the CSV file).
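To make the effect of the `index` argument concrete, here is a minimal sketch that writes a tiny, made-up DataFrame to in-memory buffers (`io.StringIO` is used here only to avoid touching the disk) and reads it back. With `index=True` the row labels come back as an extra unnamed column; with `index=False` only the data columns are written.

```python
import io
import pandas as pd

small_df = pd.DataFrame({"Method": ["GET", "POST"]})

# index=True writes the row labels as an extra, unnamed first column
with_index = io.StringIO()
small_df.to_csv(with_index, index=True)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns
without_index = io.StringIO()
small_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```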
Loading CSV file data into a DataFrame
# Load the CSV file into a DataFrame
data = pd.read_csv(data_file)
data
Result:
| | Method | URL | Protocol | User-Agent | Pragma | Cache-control | Accept | Accept-Encoding | Accept-Charset | Accept-Language | Host | Cookie | Content-Type | Connection | Content-Length | Body |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | POST | http://localhost:8080/tienda1/publico/anadir.jsp | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml+xml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=933185092E0B668B90676E0A2B0767AF | application/x-www-form-urlencoded | close | 68 | id=3&nombre=Vino+Rioja&precio=100&cantidad=55&... |
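2 Formatting strings into dictionaries
Of course, our data may well come from a txt file as a series of strings, in which case the strings need to be processed first.
# Data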
requests='''GET http://localhost:8080/tienda1/index.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4
Connection: close


GET http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5
Connection: close


POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF
Content-Type: application/x-www-form-urlencoded
Connection: close
Content-Length: 68

id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'''
# Split the data into individual requests, which are separated by two blank lines
request_list=requests.split("\n\n\n")
request_list
Result:
['GET http://localhost:8080/tienda1/index.jsp HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4\nConnection: close',
'GET http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5\nConnection: close',
'POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF\nContent-Type: application/x-www-form-urlencoded\nConnection: close\nContent-Length: 68\n\nid=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito']
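Real capture files may end with a trailing newline or contain more than two blank lines between requests. The sketch below is one possible, more tolerant way to split, assuming requests are always separated by at least two blank lines; `raw_text` is a made-up stand-in for the capture, not data from the dataset.

```python
import re

# Toy stand-in for the capture: two header-only requests and one with a body
raw_text = (
    "GET /a HTTP/1.1\nHost: x\n\n\n"
    "GET /b HTTP/1.1\nHost: x\n\n\n"
    "POST /c HTTP/1.1\nHost: x\n\nk=v\n"
)

# Three or more consecutive newlines separate requests; the single blank line
# between headers and body (two newlines) is left alone
chunks = [c for c in re.split(r"\n{3,}", raw_text.strip()) if c]
print(len(chunks))
```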
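Demo
The following demo uses the third request (the POST request) as an example to make the steps easier to follow. To go straight to the complete implementation, skip to the next part.
Split the request into a list of lines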
request=request_list[2]
lines = request.split("\n")
lines
Result:
['POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1',
'User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma: no-cache',
'Cache-control: no-cache',
'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding: x-gzip, x-deflate, gzip, deflate',
'Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language: en',
'Host: localhost:8080',
'Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF',
'Content-Type: application/x-www-form-urlencoded',
'Connection: close',
'Content-Length: 68',
'',
'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito']
Extract the method, URL, and protocol
method,url,protocol= lines[0].split(" ")
method,url,protocol
Result:
('POST', 'http://localhost:8080/tienda1/publico/anadir.jsp', 'HTTP/1.1')
Extract the headers
headers=lines[1:-2]
headers
Result:
['User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma: no-cache',
'Cache-control: no-cache',
'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding: x-gzip, x-deflate, gzip, deflate',
'Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language: en',
'Host: localhost:8080',
'Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF',
'Content-Type: application/x-www-form-urlencoded',
'Connection: close',
'Content-Length: 68']
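Add the headers to a dictionary
# Note: split(":") also splits at a colon inside the value, so "Host: localhost:8080" is stored as 'localhost'
headers_dict = {header.split(":")[0]: header.split(":")[1].strip() for header in headers}
headers_dict
Result: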
{'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma': 'no-cache',
'Cache-control': 'no-cache',
'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language': 'en',
'Host': 'localhost',
'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
'Content-Type': 'application/x-www-form-urlencoded',
'Connection': 'close',
'Content-Length': '68'}
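In the result above, `Host` is stored as `localhost` although the raw header is `Host: localhost:8080`, because the value itself contains a colon. If the port should be kept, one option is to split only at the first colon. The following is a minimal sketch using `str.partition`; `header_lines` and `parsed_headers` are names introduced here for illustration.

```python
header_lines = [
    "Host: localhost:8080",
    "Accept-Language: en",
]

parsed_headers = {}
for line in header_lines:
    # partition splits at the first ":" only, so later colons stay in the value
    name, _, value = line.partition(":")
    parsed_headers[name] = value.strip()

print(parsed_headers["Host"])
```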
Extract the request body
body=lines[-1]
body
Result:
'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'
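The body is a URL-encoded form string, so it can be broken down further into individual fields if needed. The sketch below uses the standard library's `urllib.parse.parse_qs`; the choice of `encoding="latin-1"` is an assumption based on the `%F1` byte, which decodes to ñ in Latin-1.

```python
from urllib.parse import parse_qs

body_text = "id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito"

# encoding="latin-1" is an assumption (see the note above); "+" decodes to a space
params = parse_qs(body_text, encoding="latin-1")
print(params["nombre"], params["B1"])
```

Each value is a list because a form field may appear more than once in a query string.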
Assemble the request into a dictionary
request_dict = {
'Method': method,
'URL': url,
'Protocol': protocol,
'User-Agent': headers_dict.get('User-Agent', ''),
'Pragma': headers_dict.get('Pragma', ''),
'Cache-control': headers_dict.get('Cache-control', ''),
'Accept': headers_dict.get('Accept', ''),
'Accept-Encoding': headers_dict.get('Accept-Encoding', ''),
'Accept-Charset': headers_dict.get('Accept-Charset', ''),
'Accept-Language': headers_dict.get('Accept-Language', ''),
'Host': headers_dict.get('Host', ''),
'Cookie': headers_dict.get('Cookie', ''),
'Content-Type': headers_dict.get('Content-Type', ''),
'Connection': headers_dict.get('Connection', ''),
'Content-Length': headers_dict.get('Content-Length', ''),
'Body':body
}
request_dict
Result:
{'Method': 'POST',
'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp',
'Protocol': 'HTTP/1.1',
'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma': 'no-cache',
'Cache-control': 'no-cache',
'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language': 'en',
'Host': 'localhost',
'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
'Content-Type': 'application/x-www-form-urlencoded',
'Connection': 'close',
'Content-Length': '68',
'Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}
Complete implementation
Processing multiple requests
requests='''GET http://localhost:8080/tienda1/index.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4
Connection: close


GET http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5
Connection: close


POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF
Content-Type: application/x-www-form-urlencoded
Connection: close
Content-Length: 68

id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'''
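request_list=requests.split("\n\n\n")
requests_list=[]
for request in request_list:
    # Split the request into a list of lines
    lines = request.split("\n")
    # Extract the method, URL, and protocol
    method,url,protocol= lines[0].split(" ")
    # Start assembling the request dictionary
    request_dict = {
        'Method': method,
        'URL': url,
        'Protocol': protocol,
    }
    if method=='GET':
        # A GET request has no body: every remaining line is a header
        headers=lines[1:]
    elif method=='POST':
        # The headers end before the blank line that precedes the body
        headers=lines[1:-2]
        # The request body is the last line
        body=lines[-1]
        request_dict.update({'Body' : body})
    # Add the headers to the dictionary
    # (split(":") keeps only the text before the second colon, so Host loses its port)
    headers_dict = {header.split(":")[0]: header.split(":")[1].strip() for header in headers}
    request_dict.update(headers_dict)
    requests_list.append(request_dict)
requests_list
Result: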
[{'Method': 'GET',
'URL': 'http://localhost:8080/tienda1/index.jsp',
'Protocol': 'HTTP/1.1',
'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma': 'no-cache',
'Cache-control': 'no-cache',
'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language': 'en',
'Host': 'localhost',
'Cookie': 'JSESSIONID=1F767F17239C9B670A39E9B10C3825F4',
'Connection': 'close'},
{'Method': 'GET',
'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito',
'Protocol': 'HTTP/1.1',
'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma': 'no-cache',
'Cache-control': 'no-cache',
'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language': 'en',
'Host': 'localhost',
'Cookie': 'JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5',
'Connection': 'close'},
{'Method': 'POST',
'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp',
'Protocol': 'HTTP/1.1',
'Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito',
'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
'Pragma': 'no-cache',
'Cache-control': 'no-cache',
'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
'Accept-Language': 'en',
'Host': 'localhost',
'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
'Content-Type': 'application/x-www-form-urlencoded',
'Connection': 'close',
'Content-Length': '68'}]
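The loop above decides where the body is by looking at the method. A possible alternative, sketched below under the assumption that headers and body are always separated by exactly one blank line, is to split each request at its first blank line so that the same code handles requests with and without a body. The function name `parse_request` and the `sample` string are hypothetical and not part of the original notebook; this variant also keeps the port in `Host`.

```python
def parse_request(raw_request):
    """Parse one raw HTTP request string into a flat dictionary."""
    # Everything before the first blank line is the head; the rest is the body
    head, _, body = raw_request.partition("\n\n")
    lines = head.split("\n")
    method, url, protocol = lines[0].split(" ")
    record = {"Method": method, "URL": url, "Protocol": protocol}
    for line in lines[1:]:
        # Split at the first ":" only, so values containing colons stay whole
        name, _, value = line.partition(":")
        record[name] = value.strip()
    if body:
        record["Body"] = body
    return record

sample = "POST http://localhost:8080/x.jsp HTTP/1.1\nHost: localhost:8080\nContent-Length: 3\n\na=1"
print(parse_request(sample))
```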
3 Converting the dictionaries to a DataFrame
We use the loc method to add the data. loc is the pandas accessor for locating and accessing data in a DataFrame by label.
import pandas as pd
# Initialize df with the expected columns
df = pd.DataFrame(columns=['Method', 'URL' , 'Protocol', 'User-Agent', 'Pragma', 'Cache-control', 'Accept', 'Accept-Encoding',
                           'Accept-Charset', 'Accept-Language', 'Host', 'Cookie', 'Content-Type', 'Connection',
                           'Content-Length', 'Body'])
# Use loc to append each dictionary as a new row of the DataFrame
for request_dict in requests_list:
    df.loc[len(df)] = request_dict
# The following line clears df
#df.drop(df.index, inplace=True)
df
Result:
| | Method | URL | Protocol | User-Agent | Pragma | Cache-control | Accept | Accept-Encoding | Accept-Charset | Accept-Language | Host | Cookie | Content-Type | Connection | Content-Length | Body |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GET | http://localhost:8080/tienda1/index.jsp | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml+xml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=1F767F17239C9B670A39E9B10C3825F4 | NaN | close | NaN | NaN |
1 | GET | http://localhost:8080/tienda1/publico/anadir.j... | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml+xml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5 | NaN | close | NaN | NaN |
2 | POST | http://localhost:8080/tienda1/publico/anadir.jsp | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml+xml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=933185092E0B668B90676E0A2B0767AF | application/x-www-form-urlencoded | close | 68 | id=3&nombre=Vino+Rioja&precio=100&cantidad=55&... |
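Appending rows one at a time works, but a list of dictionaries can also be turned into a DataFrame in a single call. The sketch below shows this alternative on two made-up, shortened records (`records` stands in for `requests_list`); `reindex` then fixes the column order, and keys missing from a record become NaN just as in the table above.

```python
import pandas as pd

# Hypothetical, shortened records standing in for requests_list
records = [
    {"Method": "GET", "Host": "localhost"},
    {"Method": "POST", "Body": "id=3", "Host": "localhost"},
]
columns = ["Method", "Host", "Body"]

# Build all rows in one call, then put the columns in the desired order
table = pd.DataFrame(records).reindex(columns=columns)
print(table)
```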
import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # create the directory "./data/"
data_file = os.path.join('.', 'data', 'Traffic.csv')
# Save the data as a CSV file
df.to_csv(data_file, index=False)
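4 Handling missing values by imputation
NaN values represent missing data. Missing values can be handled by imputation or by deletion: imputation fills in a substitute value for each missing entry, while deletion simply drops the missing entries. Here we use imputation.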
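The two strategies can be compared on a toy column. This is a minimal sketch with made-up values (the `lengths` DataFrame is not taken from the dataset): `fillna` with the column mean is one simple form of imputation, and `dropna` is the deletion approach.

```python
import pandas as pd

lengths = pd.DataFrame({"Content-Length": [68.0, None, 32.0]})

imputed = lengths.fillna(lengths.mean())  # missing value replaced by the column mean
deleted = lengths.dropna()                # rows with a missing value are dropped

print(imputed["Content-Length"].tolist())
print(len(deleted))
```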
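# Load the data from the CSV file
df = pd.read_csv(data_file)
# Select the numeric columns
numeric_cols = df.select_dtypes(include=['float64']).columns
# # Select the non-numeric columns
# non_numeric_cols = df.select_dtypes(exclude=['float64']).columns
# Fill missing values in the numeric columns with the column mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
# # One-hot encode only the Content-Type column
# df = pd.get_dummies(df , columns=['Content-Type'] , dummy_na=True)
# One-hot encode the non-numeric columns, with NaN as its own category
# (for this dataset, encoding the non-numeric columns this way is not meaningful; this part only demonstrates the operation)
df = pd.get_dummies(df , dummy_na=True)
df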
Result:
| | Content-Length | Method_GET | Method_POST | Method_nan | URL_http://localhost:8080/tienda1/index.jsp | URL_http://localhost:8080/tienda1/publico/anadir.jsp | URL_http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito | URL_nan | Protocol_HTTP/1.1 | Protocol_nan | ... | Cookie_JSESSIONID=1F767F17239C9B670A39E9B10C3825F4 | Cookie_JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5 | Cookie_JSESSIONID=933185092E0B668B90676E0A2B0767AF | Cookie_nan | Content-Type_application/x-www-form-urlencoded | Content-Type_nan | Connection_close | Connection_nan | Body_id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito | Body_nan |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 68.0 | True | False | False | True | False | False | False | True | False | ... | True | False | False | False | False | True | True | False | False | True |
1 | 68.0 | True | False | False | False | False | True | False | True | False | ... | False | True | False | False | False | True | True | False | False | True |
2 | 68.0 | False | True | False | False | True | False | False | True | False | ... | False | False | True | False | True | False | True | False | True | False |
3 rows × 36 columns
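The column names in this result follow a simple pattern: `get_dummies` creates one indicator column per distinct value, named `<column>_<value>`, and `dummy_na=True` adds a `<column>_nan` indicator for missing entries. A minimal sketch on a single made-up column (`types` is a name introduced for this example):

```python
import pandas as pd

types = pd.DataFrame(
    {"Content-Type": ["application/x-www-form-urlencoded", None, None]}
)

# One indicator column per distinct value, plus one for missing values
encoded = pd.get_dummies(types, dummy_na=True)
print(list(encoded.columns))
print(encoded["Content-Type_nan"].astype(int).tolist())
```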
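5 Converting the DataFrame to a tensor
Only a DataFrame whose columns are all numeric can be converted to a tensor. If the traffic above were used as a dataset to train an intrusion detection model, the approach above of one-hot encoding the non-numeric fields would not work: the model could not learn the features of the traffic from it.
As far as the author knows, one way to turn traffic into numeric data is to convert it into images and train a convolutional network on them. The author plans to continue studying intrusion detection in this direction.
Once the data is in tensor form, it can be manipulated with tensor functions.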
import tensorflow as tf
X = tf.constant(df.to_numpy(dtype=float))
X
Result:
<tf.Tensor: shape=(3, 36), dtype=float64, numpy=
array([[68., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1.,
0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0.,
1., 0., 0., 0., 0., 1., 1., 0., 0., 1.],
[68., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.,
0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0.,
0., 1., 0., 0., 0., 1., 1., 0., 0., 1.],
[68., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1.,
0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0.,
0., 0., 1., 0., 1., 0., 1., 0., 1., 0.]])>
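As a brief illustration of operating on the data with tensor functions, the sketch below applies two standard TensorFlow operations to a small stand-in tensor (the values in `features` are made up and much smaller than `X` above). The cast to float32 is shown because that is the default dtype Keras layers work with.

```python
import tensorflow as tf

# Small stand-in tensor; the values are made up for illustration
features = tf.constant([[68.0, 1.0, 0.0],
                        [68.0, 0.0, 1.0]], dtype=tf.float64)

row_sums = tf.reduce_sum(features, axis=1)  # one sum per row
as_float32 = tf.cast(features, tf.float32)  # convert to the usual training dtype

print(row_sums.numpy(), as_float32.dtype)
```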