Essential knowledge for data science, data analytics, AI, and mathematical modelling — master index (continuously updated; if a link does not open, that part is not written yet): https://blog.csdn.net/grd_java/article/details/140174015
Table of contents
一、What kind of data does pandas handle?
# I want to start using pandas
import pandas as pd
'''
To load the pandas package and start working with it, import the package.
The community agreed alias for pandas is pd, so loading pandas as pd is assumed standard practice for all of the pandas documentation.
'''
1. pandas tabular data representation
'''
I want to store passenger data of the Titanic.
For a number of passengers, I know the name (characters), age (integers) and sex (male/female) data.
'''
'''
To manually store data in a table, create a DataFrame.
When using a Python dictionary of lists, the dictionary keys will be used as column headers and
the values in each list as columns of the DataFrame.
'''
df = pd.DataFrame(
{
"Name": [
"Braund, Mr. Owen Harris",
"Allen, Mr. William Henry",
"Bonnell, Miss. Elizabeth",
],
"Age": [22, 35, 58],
"Sex": ["male", "male", "female"],
}
)
print("df:\n",df)
- A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.
- It is similar to a spreadsheet, a SQL table or the data.frame in R.
- The table has 3 columns, each of them with a column label. The column labels are respectively Name, Age and Sex.
- The column Name consists of textual data with each value a string, the column Age are numbers and the column Sex is textual data.
In spreadsheet software, the table representation of our data would look very similar:
2. Each column in a DataFrame is a Series
# I'm just interested in working with the data in the column Age
'''
When selecting a single column of a pandas DataFrame, the result is a pandas Series.
To select the column, use the column label in between square brackets [].
'''
print("df['Age']:\n",df["Age"])
Note: If you are familiar with Python dictionaries, the selection of a single column is very similar to the selection of dictionary values based on the key.
# You can create a Series from scratch as well:
# A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels, e.g. ages[0].
ages = pd.Series([22, 35, 58], name="Age")
print("ages:\n",ages)
print("ages[0]:\n",ages[0])
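The dictionary analogy mentioned in the note above can be made concrete with a tiny sketch (the two-row frame here is made up for illustration):

```python
import pandas as pd

# a plain dictionary and a DataFrame built from the same data
data = {"Name": ["Braund", "Allen"], "Age": [22, 35]}
df = pd.DataFrame(data)

# selecting by key looks the same in both cases
print(data["Age"])         # plain Python list: [22, 35]
print(df["Age"].tolist())  # same values, but df["Age"] itself is a Series
```

The syntax is identical; the difference is the returned type — a list for the dictionary, a Series for the DataFrame.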
3. Do something with a DataFrame or Series
# I want to know the maximum Age of the passengers
# We can do this on the DataFrame by selecting the Age column and applying max():
print("df['Age'].max():\n",df["Age"].max())
# Or to the Series:
print("ages.max():\n",ages.max())
'''
As illustrated by the max() method, you can do things with a DataFrame or Series.
pandas provides a lot of functionalities, each of them a method you can apply to a DataFrame or Series.
As methods are functions, do not forget to use parentheses ().
'''
# I'm interested in some basic statistics of the numerical data of my data table
'''
The describe() method provides a quick overview of the numerical data in a DataFrame.
As the Name and Sex columns are textual data, these are by default not taken into account by the describe() method.
'''
df.describe()
Many pandas operations return a DataFrame or a Series. The describe() method is an example of a pandas operation returning a pandas Series or a pandas DataFrame.
Note: This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns and rows. Apart from the representation, also the data manipulations and calculations you would do in spreadsheet software are supported by pandas. Continue reading the next tutorials to get started!
二、How do I read and write tabular data?
Test CSV data (from the official documentation): https://github.com/pandas-dev/pandas/raw/main/doc/data/titanic.csv
- Copy the data
- Save it into a file named titanic.csv
# I want to analyze the Titanic passenger data, available as a CSV file.
'''
pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame.
pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …),
each of them with the prefix read_* (where * stands for the format, e.g. csv or excel).
'''
titanic = pd.read_csv("data/titanic.csv")
'''
Make sure to always have a check on the data after reading in the data.
When displaying a DataFrame, the first and last 5 rows will be shown by default:
'''
titanic
# I want to see the first 8 rows of a pandas DataFrame.
# To see the first N rows of a DataFrame, use the head() method with the required number of rows (in this case 8) as argument.
titanic.head(8)
Note: Interested in the last N rows instead? pandas also provides a tail() method. For example, titanic.tail(10) will return the last 10 rows of the DataFrame.
'''
A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes attribute:
'''
# For each of the columns, the used data type is enlisted.
# The data types in this DataFrame are integers (int64), floats (float64) and strings (object).
titanic.dtypes
Note: When asking for the dtypes, no brackets are used! dtypes is an attribute of a DataFrame and Series. Attributes of a DataFrame or Series do not need brackets. Attributes represent a characteristic of a DataFrame/Series, whereas methods (which require brackets) do something with the DataFrame/Series as introduced in the first tutorial.
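The attribute-versus-method distinction is easy to see in a minimal sketch (the two-row frame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"], "Age": [22, 35]})

print(df.dtypes)      # attribute: no parentheses
print(df.describe())  # method: parentheses required
# df.dtypes() would raise a TypeError, because the attribute is a Series, not a function
```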
My colleague requested the Titanic data as a spreadsheet.
(Note: if this raises "No module named 'openpyxl'", run pip install openpyxl.)
'''
Whereas read_* functions are used to read data to pandas, the to_* methods are used to store data.
The to_excel() method stores the data as an excel file.
In the example here, the sheet_name is named passengers instead of the default Sheet1.
By setting index=False the row index labels are not saved in the spreadsheet.
'''
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
# The equivalent read function read_excel() will reload the data to a DataFrame:
titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
titanic.head()
I'm interested in a technical summary of a DataFrame
titanic.info()
The method info() provides technical information about a DataFrame, so let's explain the output in more detail:
- It is indeed a DataFrame.
- There are 891 entries, i.e. 891 rows.
- Each row has a row label (aka the index) with values ranging from 0 to 890.
- The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some columns do have missing values and less than 891 non-null values, e.g. Age, Cabin and Embarked.
- The columns Name, Sex, Cabin and Embarked consist of textual data (strings, aka object). The other columns are numerical data with some of them whole numbers (aka integer) and others are real numbers (aka float).
- The kind of data (characters, integers, …) in the different columns are summarized by listing the dtypes.
- The approximate amount of RAM used to hold the DataFrame is provided as well.
Additional notes from practice
1. If reading a CSV file fails because of its encoding, specify the encoding explicitly:
data = pd.read_csv('深圳罗湖二手房信息.csv', engine='python', encoding='GBK')
三、How do I select a subset of a DataFrame?
1. How do I select specific columns from a DataFrame?
I'm interested in the age of the Titanic passengers.
# To select a single column, use square brackets [] with the column name of the column of interest.
ages = titanic["Age"]
ages.head()
# Each column in a DataFrame is a Series.
# As a single column is selected, the returned object is a pandas Series.
# We can verify this by checking the type of the output:
type(titanic["Age"])
# And have a look at the shape of the output:
titanic["Age"].shape
DataFrame.shape is an attribute (remember the reading and writing tutorial: do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.
I'm interested in the age and sex of the Titanic passengers.
# The returned data type is a pandas DataFrame:
type(titanic[["Age", "Sex"]])
# The selection returned a DataFrame with 891 rows and 2 columns.
# Remember, a DataFrame is 2-dimensional with both a row and column dimension.
titanic[["Age", "Sex"]].shape
Note: The inner square brackets define a Python list with column names, whereas the outer brackets are used to select the data from a pandas DataFrame as seen in the previous example.
2. How do I filter specific rows from a DataFrame?
I'm interested in the passengers older than 35 years.
# To select rows based on a conditional expression, use a condition inside the selection brackets [].
# The condition inside the selection brackets titanic["Age"] > 35 checks for which rows the Age column has a value larger than 35:
above_35 = titanic[titanic["Age"] > 35]
above_35.head(10)
The output of the conditional expression (>, but also ==, !=, <, <=, … would work) is actually a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.
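The intermediate boolean Series can be inspected explicitly — a minimal sketch with a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 35, 58]})
mask = df["Age"] > 35        # a boolean Series, same length as the DataFrame
print(mask.tolist())         # [False, False, True]
print(df[mask])              # only the rows where the mask is True
```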
'''
We know from before that the original Titanic DataFrame consists of 891 rows.
Let's have a look at the number of rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35:
'''
above_35.shape
I'm interested in the Titanic passengers from cabin class 2 and 3.
'''
Similar to the conditional expression, the isin() conditional function returns a True for each row the values are in the provided list.
To filter the rows based on such a function, use the conditional function inside the selection brackets [].
In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which rows the Pclass column is either 2 or 3.
'''
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23.head()
# The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator:
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
class_23.head()
When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you can not use or/and but need to use the or operator | and the and operator &.
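It is worth seeing what actually happens if you use the keyword `or` anyway — a minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 2, 3]})
try:
    # the Python keyword `or` asks for a single truth value of a whole Series
    df[(df["Pclass"] == 2) or (df["Pclass"] == 3)]
except ValueError:
    print("use | instead of the `or` keyword")
# the element-wise operator works as expected:
print(len(df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]))  # 2
```

The keyword fails because a whole Series has no single True/False value, so pandas raises "The truth value of a Series is ambiguous"; `|` and `&` instead combine the two boolean Series element-wise.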
I want to work with passenger data for which the age is known.
'''
The notna() conditional function returns a True for each row the values are not a Null value.
As such, this can be combined with the selection brackets [] to filter the data table.
'''
age_no_na = titanic[titanic["Age"].notna()]
age_no_na.head()
# You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:
age_no_na.shape
3. How do I select specific rows and columns from a DataFrame?
I'm interested in the names of the passengers older than 35 years.
'''
In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore.
The loc/iloc operators are required in front of the selection brackets [].
When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
'''
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
'''
When using the column names, row labels or a condition expression, use the loc operator in front of the selection brackets [].
For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon.
Using a colon specifies you want to select all rows or columns.
'''
adult_names.head()
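The colon and the label slice can be sketched on a small made-up frame (note that, unlike positional slices, label slices include both endpoints):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 40], "Name": ["A", "B"], "Sex": ["m", "f"]})
# a colon before the comma selects all rows; a list after it selects columns
print(df.loc[:, ["Name", "Age"]])
# a label slice includes BOTH endpoints, so this returns the Age and Name columns
print(df.loc[:, "Age":"Name"])
```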
I'm interested in rows 10 till 25 and columns 3 to 5.
'''
Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore.
When specifically interested in certain rows and/or columns based on their position in the table,
use the iloc operator in front of the selection brackets [].
'''
titanic.iloc[9:25, 2:5]
# When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data.
# For example, to assign the name anonymous to the first 3 elements of the fourth column:
titanic.iloc[0:3, 3] = "anonymous"
titanic.head()
- When selecting subsets of data, square brackets [] are used.
- Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
- Select specific rows and/or columns using loc when using the row and column names.
- Select specific rows and/or columns using iloc when using the positions in the table.
- You can assign new values to a selection based on loc/iloc.
四、How do I create plots in pandas?
This section is just for a quick look — in practice you will usually plot with Matplotlib directly.
However, do download the data used in this section; later sections rely on it.
Download the file below, then run the code that follows to make sure it loads. matplotlib must be installed separately with pip install matplotlib.
https://github.com/pandas-dev/pandas/tree/main/doc/data/air_quality_no2.csv
import pandas as pd  # import pandas
import matplotlib.pyplot as plt  # import pyplot from matplotlib
air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
air_quality.head()
Note: The index_col and parse_dates parameters of the read_csv function are used to define the first (0th) column as index of the resulting DataFrame and to convert the dates in the column to Timestamp objects, respectively.
I want a quick visual check of the data.
# With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data.
air_quality.plot()
# plt.show()  # displays the plot; not required in a Jupyter environment, where it shows automatically
I want to plot only the columns of the data table with the data from Paris.
'''
To plot a specific column, use the selection method of the subset data tutorial in combination with the plot() method.
Hence, the plot() method works on both Series and DataFrame.
'''
air_quality["station_paris"].plot()
plt.show()
I want to visually compare the NO₂ values measured in London versus Paris.
air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
# plt.show()  # displays the plot; not required in a Jupyter environment, where it shows automatically
Which plot methods are available?
Note: In many development environments as well as IPython and Jupyter Notebook, use the TAB button to get an overview of the available methods, for example air_quality.plot. + TAB.
'''
Apart from the default line plot when using the plot function, a number of alternatives are available to plot data.
Let's use some standard Python to get an overview of the available plot methods:
'''
[
method_name for method_name in dir(air_quality.plot) if not method_name.startswith("_")
]
# One of the options is DataFrame.plot.box(), which refers to a boxplot. The box method is applicable on the air quality example data:
air_quality.plot.box()
I want each of the columns in a separate subplot.
'''
Separate subplots for each of the data columns are supported by the subplots argument of the plot functions.
The builtin options available in each of the pandas plot functions are worth reviewing.
'''
axs = air_quality.plot.area(figsize=(12, 4), subplots=True)
plt.show()
I want to further customize, extend or save the resulting plot.
fig, axs = plt.subplots(figsize=(12, 4))
air_quality.plot.area(ax=axs)
axs.set_ylabel("NO$_2$ concentration")
fig.savefig("no2_concentrations.png")
Each of the plot objects created by pandas is a Matplotlib object. As Matplotlib provides plenty of options to customize plots, making the link between pandas and Matplotlib explicit enables all the power of Matplotlib to the plot. This strategy is applied in the previous example:
# Create an empty Matplotlib Figure and Axes
fig, axs = plt.subplots(figsize=(12, 4))
# Use pandas to put the area plot on the prepared Figure/Axes
air_quality.plot.area(ax=axs)
# Do any Matplotlib customization you like
axs.set_ylabel("NO$_2$ concentration")
# Save the Figure/Axes using the existing Matplotlib method
fig.savefig("no2_concentrations.png")
plt.show()  # Display the plot
五、How do I create new columns derived from existing columns?
I want to express the NO₂ concentration of the station in London in mg/m³. (If we assume a temperature of 25 degrees Celsius and a pressure of 1013 hPa, the conversion factor is 1.882.)
air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
air_quality.head()
# To create a new column, use the [] brackets with the new column name at the left side of the assignment.
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
air_quality.head()
Note: The calculation of the values is done element-wise. This means all values in the given column are multiplied by the value 1.882 at once. You do not need to use a loop to iterate over each of the rows!
I want to check the ratio of the values in Paris versus Antwerp and save the result in a new column.
# The calculation is again element-wise, so the / is applied for the values in each row.
air_quality["ratio_paris_antwerp"] = (
air_quality["station_paris"] / air_quality["station_antwerp"]
)
'''
Also other mathematical operators (+, -, *, /, …) or logical operators (<, >, ==, …) work element-wise.
The latter was already used in the subset data tutorial to filter rows of a table using a conditional expression.
If you need more advanced logic, you can use arbitrary Python code via apply() (covered later, in the advanced topics).
'''
air_quality.head()
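When the logic is too irregular for plain operators, apply() runs an arbitrary Python function per element — a minimal sketch with a made-up Series and a hypothetical labelling rule:

```python
import pandas as pd

s = pd.Series([10, 40, 70])
# a hypothetical labelling rule that plain arithmetic operators cannot express
labels = s.apply(lambda v: "high" if v > 50 else "low")
print(labels.tolist())  # ['low', 'low', 'high']
```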
I want to rename the data columns to the corresponding station identifiers used by OpenAQ.
'''
The rename() function can be used for both row labels and column labels.
Provide a dictionary with the keys the current names and the values the new names to update the corresponding names.
'''
air_quality_renamed = air_quality.rename(
columns={
"station_antwerp": "BETR801",
"station_paris": "FR04014",
"station_london": "London Westminster",
}
)
air_quality_renamed.head()
'''
The mapping should not be restricted to fixed names only, but can be a mapping function as well.
For example, converting the column names to lowercase letters can be done using a function as well:
'''
air_quality_renamed = air_quality_renamed.rename(columns=str.lower)
air_quality_renamed.head()
六、How do I calculate summary statistics?
1. Aggregating statistics
What is the average age of the Titanic passengers?
titanic["Age"].mean()
Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.
What is the median age and ticket fare price of the Titanic passengers?
'''
The statistic applied to multiple columns of a DataFrame (the selection of two columns returns a DataFrame, see the subset data tutorial)
is calculated for each numeric column.
'''
titanic[["Age", "Fare"]].median()
'''
The aggregating statistic can be calculated for multiple columns at the same time. Remember the describe function from the first tutorial?
'''
titanic[["Age", "Fare"]].describe()
'''
Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the DataFrame.agg() method:
'''
titanic.agg(
{
"Age": ["min", "max", "median", "skew"],
"Fare": ["min", "max", "median", "mean"],
}
)
2. Aggregating statistics grouped by category
What is the average age for male versus female Titanic passengers?
'''
As our interest is the average age for each gender, a subselection on these two columns is made first: titanic[["Sex", "Age"]].
Next, the groupby() method is applied on the Sex column to make a group per category.
The average age for each gender is calculated and returned.
'''
titanic[["Sex", "Age"]].groupby("Sex").mean()
Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the Sex column) is a common pattern. The groupby method is used to support this type of operations. This fits in the more general split-apply-combine pattern:
- Split the data into groups
- Apply a function to each group independently
- Combine the results into a data structure
The apply and combine steps are typically done together in pandas.
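The three split-apply-combine steps can be sketched on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [20, 30, 40]})
# split by Sex, apply mean() to each group's Age, combine into one Series
result = df.groupby("Sex")["Age"].mean()
print(result)
```

Here the "male" group holds ages 20 and 40 (mean 30.0) and the "female" group holds 30 (mean 30.0); the combined result is one Series indexed by Sex.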
'''
In the previous example, we explicitly selected the 2 columns first.
If not, the mean method is applied to each column containing numerical data by passing numeric_only=True:
'''
titanic.groupby("Sex").mean(numeric_only=True)
'''
It does not make much sense to get the average value of the Pclass.
If we are only interested in the average age for each gender,
the selection of columns (square brackets [] as usual) is supported on the grouped data as well:
'''
titanic.groupby("Sex")["Age"].mean()
Note: The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels '1', '2' and '3'. Calculating statistics on these does not make much sense. Therefore, pandas provides a Categorical data type to handle this type of data. More information is provided in the user guide section on Categorical data.
What is the mean ticket fare price for each of the sex and cabin class combinations?
'''
Grouping can be done by multiple columns at the same time. Provide the column names as a list to the groupby() method.
'''
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
3. Counting the number of records by category
What is the number of passengers in each of the cabin classes?
# The value_counts() method counts the number of records for each category in a column.
titanic["Pclass"].value_counts()
'''
The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:
'''
titanic.groupby("Pclass")["Pclass"].count()
Both size and count can be used in combination with groupby. Whereas size includes NaN values and just provides the number of rows (size of the table), count excludes the missing values. In the value_counts method, use the dropna argument to include or exclude the NaN values.
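The size-versus-count distinction is easy to see on a made-up frame with one missing Age:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2], "Age": [22.0, np.nan, 35.0]})
g = df.groupby("Pclass")["Age"]
print(g.size().tolist())   # [2, 1] — size also counts the NaN row
print(g.count().tolist())  # [1, 1] — count skips the NaN
```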
七、Reshaping and reorganizing tables
0. cut()
- pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
- Bins values into discrete intervals.
- Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
- Parameters:
  - x: array-like. The input array to be binned; must be 1-dimensional.
  - bins: int, sequence of scalars, or IntervalIndex
    - int: defines the number of equal-width bins in the range of x. The range of x is extended by 0.1% on each side to include the minimum and maximum values of x.
    - sequence of scalars: defines the bin edges, allowing for non-uniform width. No extension of the range of x is done.
    - IntervalIndex: defines the exact bins to be used. Note that the IntervalIndex for bins must be non-overlapping.
  - right: bool, default True
    - Indicates whether bins include the rightmost edge or not. If right == True (the default), then bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.
  - labels: array or False, default None
    - Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex. If True, an error is raised. When ordered=False, labels must be provided.
  - retbins: bool, default False
    - Whether to return the bins or not. Useful when bins is provided as a scalar.
  - precision: int, default 3
    - The precision at which to store and display the bin labels.
  - include_lowest: bool, default False
    - Whether the first interval should be left-inclusive or not.
  - duplicates: {default 'raise', 'drop'}, optional
    - If the bin edges are not unique, raise ValueError or drop the non-unique edges.
  - ordered: bool, default True
    - Whether the labels are ordered or not. Applies to the returned types Categorical and Series (with Categorical dtype). If True, the resulting categorical will be ordered. If False, the resulting categorical will be unordered (labels must be provided).
- Returns:
  - out: Categorical, Series, or ndarray
    - An array-like object representing the respective bin for each value of x. The type depends on the value of labels.
    - None (default): returns a Series for Series x, or a Categorical for all other inputs. The values stored within are Interval dtype.
    - sequence of scalars: returns a Series for Series x, or a Categorical for all other inputs. The values stored within are whatever the type in the sequence is.
    - False: returns an ndarray of integers.
  - bins: numpy.ndarray or IntervalIndex
    - The computed or specified bins. Only returned when retbins=True. For scalar or sequence bins, this is an ndarray with the computed bins. If duplicates='drop', bins will drop non-unique bins. For an IntervalIndex bins, this is equal to bins.
The cut() function computes groupings for the values of the input array and is often used to transform continuous variables to discrete or categorical variables:
An integer bins will form equal-width bins. (In other words, the range is split into the given number of equal-width intervals, e.g. 1-100, 101-200, 201-300, …)
import numpy as np

ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
# split the data into 3 equal-width bins
pd.cut(ages, bins=3)
'''============================= output ============================='''
# each value is assigned to one of 3 intervals, since we specified bins=3
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60.0], (43.333, 60.0]]
Categories (3, interval[float64, right]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]
A list of ordered bin edges will assign an interval for each variable. (That is, we specify the intervals ourselves and pandas places the values into them.)
# place the ages into the three intervals (0,18], (18,35], (35,70]
pd.cut(ages, bins=[0, 18, 35, 70])
'''============================= output ============================='''
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, interval[int64, right]): [(0, 18] < (18, 35] < (35, 70]]
If the bins keyword is an IntervalIndex, then these will be used to bin the passed data. (This is yet another way of specifying the intervals.)
pd.cut(ages, bins=pd.IntervalIndex.from_breaks([0, 40, 70]))
'''============================= output ============================='''
[(0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (0, 40], (40, 70], (40, 70]]
Categories (2, interval[int64, right]): [(0, 40] < (40, 70]]
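The labels parameter described above can replace the interval notation with custom category names — a minimal sketch with a shortened age array (the category names are made up):

```python
import numpy as np
import pandas as pd

ages = np.array([10, 25, 59])
# labels= replaces (0, 18] etc. with custom names, one per interval
groups = pd.cut(ages, bins=[0, 18, 35, 70], labels=["child", "adult", "senior"])
print(list(groups))  # ['child', 'adult', 'senior']
```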
n. Common recipes from practice
- Fit a polynomial and use it
# compute the coefficients of a degree-5 polynomial fit
coefficients = np.polyfit(pd.to_numeric(filtered_a['竞价空间']).to_numpy(), pd.to_numeric(filtered_a['全网日前出清电价']).to_numpy(), 5)
ploy = np.poly1d(coefficients)  # build the degree-5 polynomial
# x values
line_x = filtered_a['竞价空间']
line_y = ploy(line_x)  # evaluate the fitted polynomial to get the y values
# scatter plot of the raw data
plt.scatter(filtered_a['竞价空间'], filtered_a['全网日前出清电价'])
# fitted degree-5 curve
plt.plot(line_x, line_y, color='red')
plt.show()
- Take the first 13 columns, drop the first row
titanic.iloc[0:, :13]  # first 13 columns
a = a.drop(0)  # drop the first row
- Convert the date column to datetime format to make slicing easier
a['日期'] = pd.to_datetime(a['日期'])  # convert the date column to datetime
# list all days covered by the data (daily resolution, to see the time span)
times = pd.period_range(a['日期'].min(), a['日期'].max())
times = times.to_timestamp().date  # keep only the date part
- Extract rows by date and insert them column by column into a new table; .reset_index(drop=True) avoids assignment failures caused by mismatched indexes
# save each day's 竞价空间 into temp
temp = pd.DataFrame()
for time in times:
    mask = (a['日期'] == str(time))
    # avoid assignment failure due to index mismatch
    temp[time] = a.loc[mask]['竞价空间'].reset_index(drop=True)
- Build a datetime.date and compute the Pearson correlation
import datetime
mine_need_time = datetime.date(2024, 7, 3)  # the date to predict: July 3rd
corr = temp.corr()  # Pearson correlation matrix
# only use data from before July 3rd
corr = corr[corr.keys() < mine_need_time]
# find days similar to July 3rd, i.e. with a Pearson coefficient > 0.97
corr = corr[corr[mine_need_time] > 0.97][mine_need_time]
- The difference between loc and iloc
# get the data to predict
mask = (a['日期'] == str(mine_need_time))
filtered_b = a.loc[mask]
# first 13 columns
titanic.iloc[0:, :13]
'''
1. Index type: iloc is position-based and selects data by integer position,
   while loc is label-based and selects data by label (strings, dates, etc.).
2. Use case: iloc fits when you access data by its position (the n-th row or column);
   loc fits when you know the labels (row or column names), including conditional access.
3. Slicing: an iloc slice excludes the end position (half-open interval),
   while a loc slice includes the end label (closed interval).
In pandas, both loc and iloc select data, but they use different index types:
iloc selects by position, loc selects by label.
Choose based on how you address your data: use iloc when you know positions, loc when you know labels.
'''
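The slicing difference between the two can be sketched on a small made-up frame with string row labels:

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])
print(df.iloc[0:2]["v"].tolist())     # [10, 20] — end position excluded
print(df.loc["a":"c"]["v"].tolist())  # [10, 20, 30] — end label included
```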
- Generate a time series
# pd.period_range(start='2017-06-01', end='2017-06-30') generates a period range; with freq='D' it steps day by day between start and end
pd.period_range(start='2017-06-01', end='2017-06-30', freq='D')
# extract the dates, keeping only the last two digits of the year (note: this shadows the built-in list)
list = ["%s-%s-%s" % (str(item.year)[-2:], item.month, item.day) for item in data.index]
# use a date sequence as the index
data = pd.DataFrame({'A_sale': np.random.rand(30) * 1000, 'B_sale': np.random.rand(30) * 200}, index=pd.period_range(start='20170601', periods=30, freq='D'))
- Percentage format
# products A and B differ in scale, so compare relative shares instead
data['A_per'] = data['A_sale'] / data['A_sale'].sum()
data['A_per%'] = data['A_per'].apply(lambda x: '%.2f%%' % (x * 100))  # format as a percentage string
- Random series
# 12 random numbers in [0, 500); rand itself draws from [0, 1)
np.random.rand(12) * 500
pd.Series(np.random.randn(50) * 100).sort_values()  # 50 normally distributed values (scaled by 100), sorted ascending
pd.Series(np.random.randn(50) * 500).sort_values(ascending=False)  # scaled by 500, sorted descending
- Subtract the previous day's value of column A, using a helper column with A shifted down one row
data['shift_1_A'] = data.shift(1)['A']  # shift A down by one row
data.fillna(0, inplace=True)  # fill missing values with 0
data['z_growth'] = data['A'] - data['shift_1_A']  # then subtract
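The shift-then-subtract recipe can be sketched end to end on a made-up three-row frame (pandas also offers Series.diff(), which computes the same difference in one call, leaving NaN rather than 0 in the first row):

```python
import pandas as pd

data = pd.DataFrame({"A": [100, 120, 90]})
data["shift_1_A"] = data["A"].shift(1)  # A moved down one row; the first value becomes NaN
data = data.fillna(0)
data["z_growth"] = data["A"] - data["shift_1_A"]
print(data["z_growth"].tolist())  # [100.0, 20.0, -30.0]
```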
- Scatter matrix
# pairwise scatter plots for every pair of variables in data
pd.plotting.scatter_matrix(data, figsize=(8, 8),
    color='k',
    marker='.',
    diagonal='hist',
    alpha=0.8,
    range_padding=0.1)
- Test for normality
'''
stats.kstest is a function in the SciPy library that performs the Kolmogorov-Smirnov test,
mainly used to check whether data follows a particular distribution, such as the normal or uniform distribution.
It is a non-parametric method: it compares the empirical cumulative distribution function of the data
with the theoretical distribution function, and reports the maximum difference (the D statistic) and a p-value.
'''
from scipy import stats
stats.kstest(data['value1'], 'norm', (u1, std1))