[TensorFlow Deep Learning] Part 3: Preprocessing Traffic Data (Converting Between Strings, CSV, DataFrames, Dictionaries, and Tensors)

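This column records the author's notes from learning deep learning with TensorFlow.

This section briefly introduces data preprocessing, mainly the conversion between and handling of data formats such as strings, CSV files, DataFrames, dictionaries, and tensors. Using the formatting of a string file into a tensor as the running example, it walks through the process in detail.

The Jupyter notebook for this section has been uploaded to gitee for study and discussion: my gitee repository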


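To solve real-world problems with deep learning, we usually start by preprocessing raw data rather than with data already prepared in tensor format. We use Python's pandas package to preprocess the raw data before converting it into tensor format.

The data used below is taken from the HTTP DATASET CSIC 2010 dataset: https://www.tic.itefi.csic.es/dataset/. The dataset contains tens of thousands of automatically generated web requests and is mainly used to test web attack protection systems.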

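1 Storing and loading DataFrame data

In this part we need to get to know the DataFrame. A DataFrame is a data structure in the pandas library, similar to a table or spreadsheet. It can be viewed as a two-dimensional data structure in which data is organized in rows and columns. DataFrame offers rich functionality for cleaning, analyzing, and manipulating data.

Converting dictionary data to a DataFrame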

# Build a DataFrame from a single request dictionary
import pandas as pd
request_dict={'Method': 'POST',
 'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp',
 'Protocol': 'HTTP/1.1',
 'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
 'Pragma': 'no-cache',
 'Cache-control': 'no-cache',
 'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
 'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
 'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
 'Accept-Language': 'en',
 'Host': 'localhost',
 'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
 'Content-Type': 'application/x-www-form-urlencoded',
 'Connection': 'close',
 'Content-Length': '68',
 'Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}
# A DataFrame can be viewed as a list in which each row is one element, so the dictionary is wrapped in [] and passed as a one-element list
df = pd.DataFrame([request_dict])
df

Result:

	Method	URL	Protocol	User-Agent	Pragma	Cache-control	Accept	Accept-Encoding	Accept-Charset	Accept-Language	Host	Cookie	Content-Type	Connection	Content-Length	Body
0	POST	http://localhost:8080/tienda1/publico/anadir.jsp	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+xml...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=933185092E0B668B90676E0A2B0767AF	application/x-www-form-urlencoded	close	68	id=3&nombre=Vino+Rioja&precio=100&cantidad=55&...

Here [request_dict] has the following format:

[{'Method': 'POST',
  'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp',
  'Protocol': 'HTTP/1.1',
  'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
  'Pragma': 'no-cache',
  'Cache-control': 'no-cache',
  'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
  'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
  'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
  'Accept-Language': 'en',
  'Host': 'localhost',
  'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Connection': 'close',
  'Content-Length': '68',
  'Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}]
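As a minimal sketch of why the dictionary is wrapped in a list: a dictionary of plain scalar values gives pandas no way to know how many rows to build, while a one-element list makes it exactly one row. The shortened request_dict below is a made-up stand-in for the full dictionary above.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the full request dictionary
request_dict = {'Method': 'POST', 'Protocol': 'HTTP/1.1'}

# A dictionary of scalar values on its own is rejected with a ValueError
try:
    pd.DataFrame(request_dict)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Wrapped in a list, the dictionary becomes exactly one row
one_row = pd.DataFrame([request_dict])
print(one_row.shape)  # (1, 2)
```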

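Exporting DataFrame data to a CSV file

# Export the DataFrame to a CSV file
import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # Create the directory "./data/"
data_file = os.path.join('.', 'data', 'Traffic.csv')
# to_csv opens and closes the file itself, so no separate open() is needed
df.to_csv(data_file, index=False)

The to_csv method saves the data in the DataFrame to a file named Traffic.csv. The argument index=False means the row index is not saved (by default, index=True, and the row index is written to the CSV file as well).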
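The effect of the index argument can be seen in a small sketch. It relies on to_csv returning the CSV text as a string when no path is given; the one-row demo_df is a made-up example, not the traffic data.

```python
import io
import pandas as pd

# Hypothetical one-row frame (illustration only)
demo_df = pd.DataFrame([{'Method': 'POST', 'Content-Length': '68'}])

# With no path argument, to_csv returns the CSV text as a string
with_index = demo_df.to_csv(index=True).splitlines()
without_index = demo_df.to_csv(index=False).splitlines()

print(with_index[0])     # ,Method,Content-Length
print(without_index[0])  # Method,Content-Length

# Reading the index=True text back adds an unnamed first column
round_trip = pd.read_csv(io.StringIO(demo_df.to_csv(index=True)))
print(list(round_trip.columns))  # ['Unnamed: 0', 'Method', 'Content-Length']
```

Writing with index=False avoids that extra column when the file is read back later.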

Loading CSV file data into a DataFrame

# Load the CSV file into a DataFrame
data = pd.read_csv(data_file)
data

Result:

	Method	URL	Protocol	User-Agent	Pragma	Cache-control	Accept	Accept-Encoding	Accept-Charset	Accept-Language	Host	Cookie	Content-Type	Connection	Content-Length	Body
0	POST	http://localhost:8080/tienda1/publico/anadir.jsp	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+xml...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=933185092E0B668B90676E0A2B0767AF	application/x-www-form-urlencoded	close	68	id=3&nombre=Vino+Rioja&precio=100&cantidad=55&...
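One detail worth knowing when loading a CSV file is that read_csv infers column types, and a column with empty fields is read as float64 because the gaps become NaN. The sketch below uses made-up CSV text rather than Traffic.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text: in the first one, the GET row has no Content-Length
csv_with_gap = "Method,Content-Length\nGET,\nPOST,68\n"
csv_complete = "Method,Content-Length\nPOST,68\nPOST,12\n"

gap_df = pd.read_csv(io.StringIO(csv_with_gap))
full_df = pd.read_csv(io.StringIO(csv_complete))

print(gap_df['Content-Length'].dtype)   # float64 (the empty field is read as NaN)
print(full_df['Content-Length'].dtype)  # int64
```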

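2 Formatting strings into dictionaries

Of course, our data source may well be a txt file, that is, a series of strings. In that case we need to process the strings first.

# Data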
requests='''GET http://localhost:8080/tienda1/index.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4
Connection: close


GET http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5
Connection: close


POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF
Content-Type: application/x-www-form-urlencoded
Connection: close
Content-Length: 68

id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'''
# Split the data into individual requests (requests are separated by two blank lines)
request_list=requests.split("\n\n\n")
request_list

Result:

['GET http://localhost:8080/tienda1/index.jsp HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4\nConnection: close',
 'GET http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5\nConnection: close',
 'POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF\nContent-Type: application/x-www-form-urlencoded\nConnection: close\nContent-Length: 68\n\nid=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito']
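The split above assumes Unix line endings and exactly two blank lines between requests. A slightly more tolerant sketch follows, assuming the file might use Windows line endings or extra blank lines; split_requests is a hypothetical helper name, and the sample text is made up.

```python
import re

def split_requests(raw_text):
    """Split raw text into requests separated by two or more blank lines."""
    # Normalize Windows line endings so the separator is always "\n"
    text = raw_text.replace("\r\n", "\n").strip("\n")
    # Three or more consecutive newlines mean at least two blank lines;
    # the single blank line before a POST body ("\n\n") is left intact
    return re.split(r"\n{3,}", text)

sample = "GET /a HTTP/1.1\nHost: x\n\n\nPOST /b HTTP/1.1\nHost: x\n\nid=3\n"
parts = split_requests(sample)
print(len(parts))  # 2
print(parts[1])
```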

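Demo

The following demo uses the third request (the POST request) as an example to help readers follow along. To go straight to the full implementation, skip to the next part.

Splitting the request into a list of lines

request=request_list[2]
lines = request.split("\n")
lines

Result: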

['POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1',
 'User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
 'Pragma: no-cache',
 'Cache-control: no-cache',
 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
 'Accept-Encoding: x-gzip, x-deflate, gzip, deflate',
 'Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5',
 'Accept-Language: en',
 'Host: localhost:8080',
 'Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF',
 'Content-Type: application/x-www-form-urlencoded',
 'Connection: close',
 'Content-Length: 68',
 '',
 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito']

Getting the method, URL, and protocol

method,url,protocol= lines[0].split(" ")
method,url,protocol

Result:

('POST', 'http://localhost:8080/tienda1/publico/anadir.jsp', 'HTTP/1.1')
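Unpacking split(" ") into three names raises a ValueError if the request line has more than two spaces. As a hedged sketch, assuming a URL could contain a raw space (the second sample line is made up), the line can be split at the first and last space only; parse_request_line is a hypothetical helper name.

```python
def parse_request_line(line):
    """Split 'METHOD URL PROTOCOL' using the first and last space only."""
    method, rest = line.split(" ", 1)
    url, protocol = rest.rsplit(" ", 1)
    return method, url, protocol

print(parse_request_line("POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1"))
# ('POST', 'http://localhost:8080/tienda1/publico/anadir.jsp', 'HTTP/1.1')
print(parse_request_line("GET http://localhost:8080/a b.jsp HTTP/1.1"))
# ('GET', 'http://localhost:8080/a b.jsp', 'HTTP/1.1')
```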

Getting the headers

# Skip the request line, and drop the trailing blank line and the body
headers=lines[1:-2]
headers

Result:

['User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
 'Pragma: no-cache',
 'Cache-control: no-cache',
 'Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
 'Accept-Encoding: x-gzip, x-deflate, gzip, deflate',
 'Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5',
 'Accept-Language: en',
 'Host: localhost:8080',
 'Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF',
 'Content-Type: application/x-www-form-urlencoded',
 'Connection: close',
 'Content-Length: 68']

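Adding the headers to a dictionary

# Note: split(":")[1] keeps only the text between the first and second colon,
# so "Host: localhost:8080" is stored as 'localhost' (the port is dropped)
headers_dict = {header.split(":")[0]: header.split(":")[1].strip() for header in headers}
headers_dict

Result: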

{'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
 'Pragma': 'no-cache',
 'Cache-control': 'no-cache',
 'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
 'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
 'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
 'Accept-Language': 'en',
 'Host': 'localhost',
 'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
 'Content-Type': 'application/x-www-form-urlencoded',
 'Connection': 'close',
 'Content-Length': '68'}
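The dictionary above stores Host as 'localhost' because split(":") also cuts at the colon before the port. If the full value is wanted, one possible sketch is to split at the first colon only with str.partition; the two sample headers are taken from the request above.

```python
headers = [
    'Host: localhost:8080',
    'Accept-Language: en',
]

headers_dict = {}
for header in headers:
    # partition splits at the first colon only, so later colons stay in the value
    name, _, value = header.partition(":")
    headers_dict[name] = value.strip()

print(headers_dict)
# {'Host': 'localhost:8080', 'Accept-Language': 'en'}
```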

Getting the request body

body=lines[-1]
body

Result:

'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'
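The body is a URL-encoded form string, so it can be broken into fields if that is useful for later feature extraction. A minimal sketch with the standard library follows; it assumes the percent-escapes are Latin-1 bytes (so %F1 decodes to "ñ"), which is an assumption about this data rather than something stated by the dataset.

```python
from urllib.parse import parse_qs

body = 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'

# Assumption: percent-escapes are Latin-1 bytes (%F1 decodes to "ñ")
fields = parse_qs(body, encoding='latin-1')

print(fields['nombre'])  # ['Vino Rioja']
print(fields['B1'])      # ['Añadir al carrito']
```

parse_qs returns a list for every key because a form field may appear more than once.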

Assembling the request into a dictionary

request_dict = {
    'Method': method,
    'URL': url,
    'Protocol': protocol,
    'User-Agent': headers_dict.get('User-Agent', ''),
    'Pragma': headers_dict.get('Pragma', ''),
    'Cache-control': headers_dict.get('Cache-control', ''),
    'Accept': headers_dict.get('Accept', ''),
    'Accept-Encoding': headers_dict.get('Accept-Encoding', ''),
    'Accept-Charset': headers_dict.get('Accept-Charset', ''),
    'Accept-Language': headers_dict.get('Accept-Language', ''),
    'Host': headers_dict.get('Host', ''),
    'Cookie': headers_dict.get('Cookie', ''),
    'Content-Type': headers_dict.get('Content-Type', ''),
    'Connection': headers_dict.get('Connection', ''),
    'Content-Length': headers_dict.get('Content-Length', ''),
    'Body':body
}
request_dict

Result:

{'Method': 'POST',
 'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp',
 'Protocol': 'HTTP/1.1',
 'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
 'Pragma': 'no-cache',
 'Cache-control': 'no-cache',
 'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
 'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
 'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
 'Accept-Language': 'en',
 'Host': 'localhost',
 'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
 'Content-Type': 'application/x-www-form-urlencoded',
 'Connection': 'close',
 'Content-Length': '68',
 'Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}

Full implementation

Processing multiple requests

requests='''GET http://localhost:8080/tienda1/index.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4
Connection: close


GET http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5
Connection: close


POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF
Content-Type: application/x-www-form-urlencoded
Connection: close
Content-Length: 68

id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'''
request_list=requests.split("\n\n\n")
requests_list=[]
for request in request_list:
    # Split the request into a list of lines
    lines = request.split("\n")
    # Get the method, URL, and protocol
    method,url,protocol= lines[0].split(" ")
    # Start the request dictionary
    request_dict = {
        'Method': method,
        'URL': url,
        'Protocol': protocol,
    }
    if method=='GET':
        # Get the headers (a GET request has no body)
        headers=lines[1:]
    elif method=='POST':
        # Get the headers (skip the blank line and the body)
        headers=lines[1:-2]
        # Get the request body
        body=lines[-1]
        request_dict.update({'Body' : body})

    # Add the headers to the dictionary
    # Note: split(":")[1] drops the port, so "Host: localhost:8080" is stored as 'localhost'
    headers_dict = {header.split(":")[0]: header.split(":")[1].strip() for header in headers}

    request_dict.update(headers_dict)
    requests_list.append(request_dict)
requests_list

Result:

[{'Method': 'GET',
  'URL': 'http://localhost:8080/tienda1/index.jsp',
  'Protocol': 'HTTP/1.1',
  'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
  'Pragma': 'no-cache',
  'Cache-control': 'no-cache',
  'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
  'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
  'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
  'Accept-Language': 'en',
  'Host': 'localhost',
  'Cookie': 'JSESSIONID=1F767F17239C9B670A39E9B10C3825F4',
  'Connection': 'close'},
 {'Method': 'GET',
  'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito',
  'Protocol': 'HTTP/1.1',
  'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
  'Pragma': 'no-cache',
  'Cache-control': 'no-cache',
  'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
  'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
  'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
  'Accept-Language': 'en',
  'Host': 'localhost',
  'Cookie': 'JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5',
  'Connection': 'close'},
 {'Method': 'POST',
  'URL': 'http://localhost:8080/tienda1/publico/anadir.jsp',
  'Protocol': 'HTTP/1.1',
  'Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito',
  'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)',
  'Pragma': 'no-cache',
  'Cache-control': 'no-cache',
  'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
  'Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate',
  'Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5',
  'Accept-Language': 'en',
  'Host': 'localhost',
  'Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Connection': 'close',
  'Content-Length': '68'}]
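The loop above branches on GET and POST only. As an alternative sketch, not the implementation used in this article, the header block and the body can be separated at the first blank line, which works for any method; parse_request is a hypothetical helper name, the sample requests are shortened, and unlike the loop above this version keeps the port in Host.

```python
def parse_request(request):
    """Parse one raw HTTP request string into a flat dictionary."""
    # Everything before the first blank line is the request line plus headers;
    # anything after it is the body (empty for requests without one)
    head, _, body = request.partition("\n\n")
    lines = head.split("\n")
    method, url, protocol = lines[0].split(" ")
    parsed = {'Method': method, 'URL': url, 'Protocol': protocol}
    for header in lines[1:]:
        if not header:
            continue  # ignore stray empty lines
        # Split at the first colon only, so "localhost:8080" stays whole
        name, _, value = header.partition(":")
        parsed[name] = value.strip()
    if body:
        parsed['Body'] = body
    return parsed

get_request = ("GET http://localhost:8080/tienda1/index.jsp HTTP/1.1\n"
               "Host: localhost:8080\nConnection: close")
post_request = ("POST http://localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1\n"
                "Host: localhost:8080\nContent-Length: 4\n\nid=3")

print(parse_request(get_request))
print(parse_request(post_request)['Body'])  # id=3
```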

3 Converting dictionaries to a DataFrame

We use loc to handle the data. loc is the pandas indexer for locating and accessing data in a DataFrame by label.

import pandas as pd
# Initialize df
df = pd.DataFrame(columns=['Method', 'URL' , 'Protocol', 'User-Agent', 'Pragma', 'Cache-control', 'Accept', 'Accept-Encoding',
                           'Accept-Charset', 'Accept-Language', 'Host', 'Cookie', 'Content-Type', 'Connection',
                           'Content-Length', 'Body'])
# Use loc to append each new row to the DataFrame
for request_dict in requests_list:
    df.loc[len(df)] = request_dict
# The following line clears df
#df.drop(df.index, inplace=True)
df

Result:

	Method	URL	Protocol	User-Agent	Pragma	Cache-control	Accept	Accept-Encoding	Accept-Charset	Accept-Language	Host	Cookie	Content-Type	Connection	Content-Length	Body
0	GET	http://localhost:8080/tienda1/index.jsp	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+xml...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=1F767F17239C9B670A39E9B10C3825F4	NaN	close	NaN	NaN
1	GET	http://localhost:8080/tienda1/publico/anadir.j...	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+xml...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5	NaN	close	NaN	NaN
2	POST	http://localhost:8080/tienda1/publico/anadir.jsp	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+xml...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=933185092E0B668B90676E0A2B0767AF	application/x-www-form-urlencoded	close	68	id=3&nombre=Vino+Rioja&precio=100&cantidad=55&...
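Appending rows one at a time with loc works, but the same table can also be built in a single call. The sketch below passes the list of dictionaries straight to the DataFrame constructor; the shortened requests_list and column list are made-up stand-ins for the real ones.

```python
import pandas as pd

# Hypothetical, shortened stand-in for requests_list
requests_list = [
    {'Method': 'GET', 'URL': 'http://localhost:8080/a.jsp'},
    {'Method': 'POST', 'URL': 'http://localhost:8080/b.jsp', 'Body': 'id=3'},
]
columns = ['Method', 'URL', 'Content-Type', 'Body']

# The columns argument fixes the column order; keys a row lacks become NaN
df_all = pd.DataFrame(requests_list, columns=columns)

print(df_all.shape)                            # (2, 4)
print(df_all['Body'].isna().tolist())          # [True, False]
print(df_all['Content-Type'].isna().tolist())  # [True, True]
```

Building the frame in one call is usually faster for large lists, since each loc assignment enlarges the DataFrame again.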

Exporting the DataFrame data to a CSV file

import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # Create the directory "./data/"
data_file = os.path.join('.', 'data', 'Traffic.csv')
# Save the data as a CSV file; to_csv opens and closes the file itself
df.to_csv(data_file, index=False)

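4 Handling missing values by imputation

A NaN value represents a missing value. Missing values can be handled by imputation or by deletion: imputation fills a missing value with a substitute value, while deletion simply discards the missing values. Here we consider imputation.

# Load the data from the CSV file
df = pd.read_csv(data_file)
# Select the numeric columns
numeric_cols = df.select_dtypes(include=['float64']).columns
# # Select the non-numeric columns
# non_numeric_cols = df.select_dtypes(exclude=['float64']).columns

# Fill missing values in the numeric columns with the column mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# # Encode only the Content-Type column
# df = pd.get_dummies(df , columns=['Content-Type'] , dummy_na=True)
# One-hot encode the non-numeric columns, treating NaN as its own category
# (for this dataset, encoding the non-numeric columns this way is not meaningful; this part only demonstrates the operation)
df = pd.get_dummies(df , dummy_na=True)

df

Result: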

	Content-Length	Method_GET	Method_POST	Method_nan	URL_http://localhost:8080/tienda1/index.jsp	URL_http://localhost:8080/tienda1/publico/anadir.jsp	URL_http://localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito	URL_nan	Protocol_HTTP/1.1	Protocol_nan	...	Cookie_JSESSIONID=1F767F17239C9B670A39E9B10C3825F4	Cookie_JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5	Cookie_JSESSIONID=933185092E0B668B90676E0A2B0767AF	Cookie_nan	Content-Type_application/x-www-form-urlencoded	Content-Type_nan	Connection_close	Connection_nan	Body_id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito	Body_nan
0	68.0	True	False	False	True	False	False	False	True	False	...	True	False	False	False	False	True	True	False	False	True
1	68.0	True	False	False	False	False	True	False	True	False	...	False	True	False	False	False	True	True	False	False	True
2	68.0	False	True	False	False	True	False	False	True	False	...	False	False	True	False	True	False	True	False	True	False

3 rows × 36 columns
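To make the difference between imputation and deletion concrete, here is a small sketch on a made-up miniature of the traffic table (the demo frame is hypothetical, not loaded from Traffic.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the traffic table
demo = pd.DataFrame({
    'Method': ['GET', 'GET', 'POST'],
    'Content-Length': [np.nan, np.nan, 68.0],
})

# Imputation: replace each missing value with the column mean (mean skips NaN)
imputed = demo.copy()
imputed['Content-Length'] = imputed['Content-Length'].fillna(imputed['Content-Length'].mean())
print(imputed['Content-Length'].tolist())  # [68.0, 68.0, 68.0]

# Deletion: drop every row that contains a missing value
deleted = demo.dropna()
print(len(deleted))  # 1
```

With only one observed Content-Length, the mean is that single value, which is why every row in the result above shows 68.0.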

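5 Converting a DataFrame to a tensor

Only a DataFrame with numeric data can be converted to tensor format. If the traffic above were used as a dataset for training intrusion detection, the approach above for turning non-numeric fields into numeric ones would not work: the model cannot learn the features of the traffic from it.
As for ways to turn traffic into numeric data, as far as the author knows, traffic can be converted into images and trained with a convolutional network. The author plans to study intrusion detection in that direction later.
Once the data is in tensor format, it can be manipulated with tensor functions.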

import tensorflow as tf
X = tf.constant(df.to_numpy(dtype=float))
X

Result:

<tf.Tensor: shape=(3, 36), dtype=float64, numpy=
array([[68.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,
         0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,
         1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.],
       [68.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,
         0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,
         0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.],
       [68.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,
         0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,
         0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.]])>
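The requirement that the DataFrame be numeric can be sketched without TensorFlow by looking at the NumPy conversion step alone. The miniature demo frame is made up, and float32 is chosen here only as an assumption about what a later model would expect.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: one numeric, one boolean, one string column
demo = pd.DataFrame({
    'Content-Length': [68.0, 12.0],
    'Method_POST': [True, False],
    'URL': ['http://localhost:8080/a.jsp', 'http://localhost:8080/b.jsp'],
})

# Strings cannot be cast to float, so converting the whole frame fails
try:
    demo.to_numpy(dtype=float)
    failed = False
except (ValueError, TypeError):
    failed = True
print(failed)  # True

# Keep only numeric and boolean columns, then cast to float32
features = demo.select_dtypes(include=['number', 'bool']).to_numpy(dtype=np.float32)
print(features.shape, features.dtype)  # (2, 2) float32
```

The resulting array can then be passed to tf.constant in the same way as in the example above.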