08 Python之Pandas库选择查看、选择

Pandas查看和选择数据

Pandas是基于Numpy构建的,让以Numpy为中心的应用变得更加简单。平台获取的数据主要是以Pandas中DataFrame的形式。除此之外,Pandas还包括一维数组Series以及三维的Pannel。

下面将进行详细介绍:

Series:一维数组,与numpy中的一维array类似。二者与Python基本的数据结构List也很相近,其区别是:List中的元素可以使不同的数据类型,而Array和Series中则只允许存储相同的数据类型,这样可以更有效的使用内存,提高运算效率

DataFrame:二维的表格型数据结构。很多功能与R中data.frame类似。可以将DataFrame理解为Series的容器。以下的内容主要以DataFrame为主。

# 首先导入库
import pandas as pd
import numpy as np

1.Series

由一组数据(各种numpy数据类型),以及一组与之相关的标签数据(即索引)组成。仅由一组数据即可产生最简单的Series,可以通过传递一个list对象来创建一个Series,Pandas默认创建整型索引。

# 创建一个Series:
s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
# 获取Series的索引:
s.index
RangeIndex(start=0, stop=6, step=1)

2.DataFrame

DataFrame是一个表格型的数据结构,它含有一组有序的列,每一列的数据结构都是相同的,而不同的列之间则可以是不同的数据结构(数值、字符、布尔值等)。或者以数据库进行类比,DataFrame中的每一行是一个记录,名称为Index的一个元素,而每一列则为一个字段,是这个记录的一个属性。DataFrame既有行索引也有列索引,可以被看做由Series组成的字典(共用同一个索引)。

2.1 创建一个DataFrame

dates = pd.date_range("20130101",periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list("ABCD"))
df
ABCD
2013-01-010.5130201.541941-0.787835-0.666850
2013-01-020.4740041.059586-0.6528230.343018
2013-01-031.2303730.7250930.371367-0.215019
2013-01-040.4648430.056014-1.1493051.216529
2013-01-05-0.9503100.388610-0.7792161.453014
2013-01-06-1.081557-0.6878381.702892-1.268365

2.2 股票数据的获取

1.通过API接口获取

(1)网络爬虫概述

网络爬虫(又被称为网页蜘蛛,网络机器人,在FOAF社区中间,更经常的称为网页追逐者),是一种按照一定的规则,自动地抓取万维网信息的程序或者脚本。

(2)金融数据模块获取股票数据方法:DataReader()

(3)DataFrame存储为CSV文件方法:dataframe.to_csv()

import numpy as np
import pandas as pd
import pandas_datareader.data as web
import datetime

# 爬取数据
df_csvsave = web.DataReader("AAPL","yahoo",datetime.datetime(2019,4,1),datetime.date.today())

# 保存为csv数据
df_csvsave.to_csv(".\\data.csv",columns=df_csvsave.columns,index=True)

# 展示数据
df_csvsave
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006
2019-04-04196.369995193.139999194.789993195.69000219114300195.690002
2019-04-05197.100006195.929993196.449997197.00000018526600197.000000
2019-04-08200.229996196.339996196.419998200.10000625881700200.100006
2019-04-09202.850006199.229996200.320007199.50000035768200199.500000
2019-04-10200.740005198.179993198.679993200.61999521695300200.619995
2019-04-11201.000000198.440002200.850006198.94999720900800198.949997
2019-04-12200.139999196.210007199.199997198.86999527744300198.869995

2.通过CSV文件获取

(1)获取CSV格式的股票数据

(2)CSV文件数据转换为DataFrame方法:pd.read_csv()

df = pd.read_csv("data.csv",index_col=0,encoding="gb2312")
df
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006
2019-04-04196.369995193.139999194.789993195.69000219114300195.690002
2019-04-05197.100006195.929993196.449997197.00000018526600197.000000
2019-04-08200.229996196.339996196.419998200.10000625881700200.100006
2019-04-09202.850006199.229996200.320007199.50000035768200199.500000
2019-04-10200.740005198.179993198.679993200.61999521695300200.619995
2019-04-11201.000000198.440002200.850006198.94999720900800198.949997
2019-04-12200.139999196.210007199.199997198.86999527744300198.869995

2.2 查看数据

查看前几条数据:

# 默认是查询5条
df.head()
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006
2019-04-04196.369995193.139999194.789993195.69000219114300195.690002
2019-04-05197.100006195.929993196.449997197.00000018526600197.000000

查看后几条数据:

# 默认查询后5条
df.tail()
HighLowOpenCloseVolumeAdj Close
Date
2019-04-08200.229996196.339996196.419998200.10000625881700200.100006
2019-04-09202.850006199.229996200.320007199.50000035768200199.500000
2019-04-10200.740005198.179993198.679993200.61999521695300200.619995
2019-04-11201.000000198.440002200.850006198.94999720900800198.949997
2019-04-12200.139999196.210007199.199997198.86999527744300198.869995

查看DataFrame的索引

df.index
Index(['2019-04-01', '2019-04-02', '2019-04-03', '2019-04-04', '2019-04-05',
       '2019-04-08', '2019-04-09', '2019-04-10', '2019-04-11', '2019-04-12'],
      dtype='object', name='Date')

查看DataFrame的列名

df.columns
Index(['High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close'], dtype='object')

查看DataFrame的值

df.values
array([[1.91679993e+02, 1.88380005e+02, 1.91639999e+02, 1.91240005e+02,
        2.78620000e+07, 1.91240005e+02],
       [1.94460007e+02, 1.91050003e+02, 1.91089996e+02, 1.94020004e+02,
        2.27657000e+07, 1.94020004e+02],
       [1.96500000e+02, 1.93149994e+02, 1.93250000e+02, 1.95350006e+02,
        2.32718000e+07, 1.95350006e+02],
       [1.96369995e+02, 1.93139999e+02, 1.94789993e+02, 1.95690002e+02,
        1.91143000e+07, 1.95690002e+02],
       [1.97100006e+02, 1.95929993e+02, 1.96449997e+02, 1.97000000e+02,
        1.85266000e+07, 1.97000000e+02],
       [2.00229996e+02, 1.96339996e+02, 1.96419998e+02, 2.00100006e+02,
        2.58817000e+07, 2.00100006e+02],
       [2.02850006e+02, 1.99229996e+02, 2.00320007e+02, 1.99500000e+02,
        3.57682000e+07, 1.99500000e+02],
       [2.00740005e+02, 1.98179993e+02, 1.98679993e+02, 2.00619995e+02,
        2.16953000e+07, 2.00619995e+02],
       [2.01000000e+02, 1.98440002e+02, 2.00850006e+02, 1.98949997e+02,
        2.09008000e+07, 1.98949997e+02],
       [2.00139999e+02, 1.96210007e+02, 1.99199997e+02, 1.98869995e+02,
        2.77443000e+07, 1.98869995e+02]])

使用describe()函数对于数据的快速统计汇总

df.describe()
HighLowOpenCloseVolumeAdj Close
count10.00000010.00000010.00000010.0000001.000000e+0110.000000
mean198.107001195.004999196.268999197.1340012.435307e+07197.134001
std3.4586313.5003513.5183173.0291315.169550e+063.029131
min191.679993188.380005191.089996191.2400051.852660e+07191.240005
25%196.402496193.142498193.634998195.4350052.109942e+07195.435005
50%198.620003196.070000196.434998197.9349982.301875e+07197.934998
75%200.612503197.719994199.069996199.3624992.727865e+07199.362499
max202.850006199.229996200.850006200.6199953.576820e+07200.619995

对数据的转置:

df.T
Date2019-04-012019-04-022019-04-032019-04-042019-04-052019-04-082019-04-092019-04-102019-04-112019-04-12
High1.916800e+021.944600e+021.965000e+021.963700e+021.971000e+022.002300e+022.028500e+022.007400e+022.010000e+022.001400e+02
Low1.883800e+021.910500e+021.931500e+021.931400e+021.959300e+021.963400e+021.992300e+021.981800e+021.984400e+021.962100e+02
Open1.916400e+021.910900e+021.932500e+021.947900e+021.964500e+021.964200e+022.003200e+021.986800e+022.008500e+021.992000e+02
Close1.912400e+021.940200e+021.953500e+021.956900e+021.970000e+022.001000e+021.995000e+022.006200e+021.989500e+021.988700e+02
Volume2.786200e+072.276570e+072.327180e+071.911430e+071.852660e+072.588170e+073.576820e+072.169530e+072.090080e+072.774430e+07
Adj Close1.912400e+021.940200e+021.953500e+021.956900e+021.970000e+022.001000e+021.995000e+022.006200e+021.989500e+021.988700e+02

按列对DataFrame进行排序

df.sort_values("Open",ascending=False)
HighLowOpenCloseVolumeAdj Close
Date
2019-04-11201.000000198.440002200.850006198.94999720900800198.949997
2019-04-09202.850006199.229996200.320007199.50000035768200199.500000
2019-04-12200.139999196.210007199.199997198.86999527744300198.869995
2019-04-10200.740005198.179993198.679993200.61999521695300200.619995
2019-04-05197.100006195.929993196.449997197.00000018526600197.000000
2019-04-08200.229996196.339996196.419998200.10000625881700200.100006
2019-04-04196.369995193.139999194.789993195.69000219114300195.690002
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004

2.3 选择数据

2.3.1 通过下标选取数据:

选择一列数据:

df["Open"]
Date
2019-04-01    191.639999
2019-04-02    191.089996
2019-04-03    193.250000
2019-04-04    194.789993
2019-04-05    196.449997
2019-04-08    196.419998
2019-04-09    200.320007
2019-04-10    198.679993
2019-04-11    200.850006
2019-04-12    199.199997
Name: Open, dtype: float64
# 返回DataFrame
df[["Open"]]
Open
Date
2019-04-01191.639999
2019-04-02191.089996
2019-04-03193.250000
2019-04-04194.789993
2019-04-05196.449997
2019-04-08196.419998
2019-04-09200.320007
2019-04-10198.679993
2019-04-11200.850006
2019-04-12199.199997

选择多列

df[["Open","High"]]
OpenHigh
Date
2019-04-01191.639999191.679993
2019-04-02191.089996194.460007
2019-04-03193.250000196.500000
2019-04-04194.789993196.369995
2019-04-05196.449997197.100006
2019-04-08196.419998200.229996
2019-04-09200.320007202.850006
2019-04-10198.679993200.740005
2019-04-11200.850006201.000000
2019-04-12199.199997200.139999

选择多行

df[0:3]
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006

2.3.2 使用标签选取数据

df.loc[行标签,列标签]

df.loc[“a”:“b”] #选取ab两行数据

df.loc[:,“Open”] #选取Open列的数据

df.loc的第一个参数是行标签,第二个参数为列标签(可选参数,默认为所有列标签),两个参数既可以是列表也可以是单个字符,如果两个参数都为列表则返回的是DataFrame,否则,则为Series

df.loc["2019-04-01","Open"]
191.63999938964844
df.loc["2019-04-01":"2019-04-03"]
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006
df.loc[:,"Open"]
Date
2019-04-01    191.639999
2019-04-02    191.089996
2019-04-03    193.250000
2019-04-04    194.789993
2019-04-05    196.449997
2019-04-08    196.419998
2019-04-09    200.320007
2019-04-10    198.679993
2019-04-11    200.850006
2019-04-12    199.199997
Name: Open, dtype: float64

2.3.3 使用位置选取数据

df.iloc[行位置,列位置]

df.iloc[1,1] #选取第二行,第二列的值,返回的为单个值

df.iloc[[0,2],:] #选取第一行及第三行的数据

df.iloc[0:2,:] #选取第一行到第三行(不包含)的数据

df.iloc[:,1] #选取所有记录的第二列的值,返回的为一个Series

df.iloc[1,:] #选取第一行数据,返回的为一个Series

df
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006
2019-04-04196.369995193.139999194.789993195.69000219114300195.690002
2019-04-05197.100006195.929993196.449997197.00000018526600197.000000
2019-04-08200.229996196.339996196.419998200.10000625881700200.100006
2019-04-09202.850006199.229996200.320007199.50000035768200199.500000
2019-04-10200.740005198.179993198.679993200.61999521695300200.619995
2019-04-11201.000000198.440002200.850006198.94999720900800198.949997
2019-04-12200.139999196.210007199.199997198.86999527744300198.869995
# 选取第二行,第二列的值,返回的为单个值
df.iloc[1,1]
191.0500030517578
# 选取第一行即第三行的数据
df.iloc[[0,2],:]
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006
# 选取第一行到第三行(不包含)的数据
df.iloc[0:2,:]
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004
# 选取所有记录的第一列的值,返回的为一个Series
df.iloc[:,1]
Date
2019-04-01    188.380005
2019-04-02    191.050003
2019-04-03    193.149994
2019-04-04    193.139999
2019-04-05    195.929993
2019-04-08    196.339996
2019-04-09    199.229996
2019-04-10    198.179993
2019-04-11    198.440002
2019-04-12    196.210007
Name: Low, dtype: float64
# 选取第一行数据。返回的为一个Series
df.iloc[1,:]
High         1.944600e+02
Low          1.910500e+02
Open         1.910900e+02
Close        1.940200e+02
Volume       2.276570e+07
Adj Close    1.940200e+02
Name: 2019-04-02, dtype: float64

2.3.4 通过逻辑指针进行数据切片

df[逻辑条件]

df[df.one >= 2] #单个逻辑条件

df[(df.one >=1 )&(df.one<3)] #多个逻辑条件组合

df
HighLowOpenCloseVolumeAdj Close
Date
2019-04-01191.679993188.380005191.639999191.24000527862000191.240005
2019-04-02194.460007191.050003191.089996194.02000422765700194.020004
2019-04-03196.500000193.149994193.250000195.35000623271800195.350006
2019-04-04196.369995193.139999194.789993195.69000219114300195.690002
2019-04-05197.100006195.929993196.449997197.00000018526600197.000000
2019-04-08200.229996196.339996196.419998200.10000625881700200.100006
2019-04-09202.850006199.229996200.320007199.50000035768200199.500000
2019-04-10200.740005198.179993198.679993200.61999521695300200.619995
2019-04-11201.000000198.440002200.850006198.94999720900800198.949997
2019-04-12200.139999196.210007199.199997198.86999527744300198.869995
df[df.Open > 195]
HighLowOpenCloseVolumeAdj Close
Date
2019-04-05197.100006195.929993196.449997197.00000018526600197.000000
2019-04-08200.229996196.339996196.419998200.10000625881700200.100006
2019-04-09202.850006199.229996200.320007199.50000035768200199.500000
2019-04-10200.740005198.179993198.679993200.61999521695300200.619995
2019-04-11201.000000198.440002200.850006198.94999720900800198.949997
2019-04-12200.139999196.210007199.199997198.86999527744300198.869995
df[(df.Open > 195)&(df.Close < 199)]
HighLowOpenCloseVolumeAdj Close
Date
2019-04-05197.100006195.929993196.449997197.00000018526600197.000000
2019-04-11201.000000198.440002200.850006198.94999720900800198.949997
2019-04-12200.139999196.210007199.199997198.86999527744300198.869995

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

mind_programmonkey

你的鼓励是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值