15 Python总结之数据分析与挖掘_p00000001-all.csv-CSDN博客

本文链接：https://blog.csdn.net/Mind_programmonkey/article/details/89384593

数据分析实例之2012美国联邦选举委员会

美国联邦选举委员会发布了有关政治竞选赞助方面的数据。其中包括赞助者的姓名、职业、雇主、地址以及出资额等信息。

这个数据集中一些特征：

contbr_employer:捐款雇主
cand_nm:候选人
contbr_occupation:捐款人职业
contb_receipt_amt:捐款金额

1 前期准备

1.1 使用pandas.read_csv载入数据

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
fec = pd.read_csv('./P00000001-ALL.csv') #加载数据
print(fec.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731 non-null object
cand_id              1001731 non-null object
cand_nm              1001731 non-null object
contbr_nm            1001731 non-null object
contbr_city          1001712 non-null object
contbr_st            1001727 non-null object
contbr_zip           1001620 non-null object
contbr_employer      988002 non-null object
contbr_occupation    993301 non-null object
contb_receipt_amt    1001731 non-null float64
contb_receipt_dt     1001731 non-null object
receipt_desc         14166 non-null object
memo_cd              92482 non-null object
memo_text            97770 non-null object
form_tp              1001731 non-null object
file_num             1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
None

1.2 观察样本记录

print(fec.iloc[123456])

cmte_id                             C00431445
cand_id                             P80003338
cand_nm                         Obama, Barack
contbr_nm                         ELLMAN, IRA
contbr_city                             TEMPE
contbr_st                                  AZ
contbr_zip                          852816719
contbr_employer      ARIZONA STATE UNIVERSITY
contbr_occupation                   PROFESSOR
contb_receipt_amt                          50
contb_receipt_dt                    01-DEC-11
receipt_desc                              NaN
memo_cd                                   NaN
memo_text                                 NaN
form_tp                                 SA17A
file_num                               772372
Name: 123456, dtype: object

1.3 使用unique获得所有不同的政治候选人名单

unique_cands = fec.cand_nm.unique() #使用unique获得所有不同的政治候选人名单
unique_cands

array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)

unique_cands[2]

'Obama, Barack'

1.4 使用相应的字典表示政党背景

parties = {'Bachmann, Michelle':'Republican',
           'Cain, Herman':'Republican',
           'Gingrich, Newt':'Republican',
           'Huntsman, Jon':'Republican',
           'John, Gary Earl':'Republican',
           'McCotter, Thaddeus G':'Republican',
           'Obama, Barack':'Democrat',
           'Paul, Ron':'Republican',
           'Pawlenty, Timothy':'Republican',
           'Perry, Rick':'Republican',
           "Roemer, Charles E. 'Buddy' III":'Republican',
           'Romney, Mitt':'Republican',
           'Santorum, Rick':'Republican'}  #表示政党背景的方式之一是使用相应的字典

1.5 在Series对象中使用map方法和上述的映射关系，从候选人姓名中计算出政党的数组

fec.cand_nm[123456:123461]

123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object

fec.cand_nm[123456:123461].map(parties)

123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object

# 将其作为一列加入
fec["party"] = fec.cand_nm.map(parties)
fec["party"].value_counts()

Democrat      589127
Republican    401114
Name: party, dtype: int64

1.6 数据准备要点

(fec.contb_receipt_amt > 0).value_counts()

True    991475
Name: contb_receipt_amt, dtype: int64

# 分析范围限制在正向贡献中
fec = fec[fec.contb_receipt_amt > 0]

# 为主要候选人准备一个仅对他们的竞选有贡献的子集
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])]

2 按职业和雇主的捐献统计

根据职业分析捐献是一个常见的统计分析。例如，律师（法律代理人）倾向于捐更多的前给民主党，而企业主则更倾向于资助共和党。

2.1 首先，获得按职业的捐献总数

fec.contbr_occupation.value_counts()[:10]#获得按职业的捐献总数

RETIRED         233990
NOT PROVIDED     57151
ATTORNEY         34286
HOMEMAKER        29931
PHYSICIAN        23432
ENGINEER         14334
TEACHER          13990
CONSULTANT       13273
PROFESSOR        12555
NOT EMPLOYED      9828
Name: contbr_occupation, dtype: int64

2.2 通过将一种工作匹配到另一种来清理工作种类

occ_mapping = {
    'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
    'INFORMATION REQUESTED':'NOT PROVIDED',
    'INFORMATION REQUESTED (BEST EFFORTS)':'NOT PROVIDED',
    'C.E.O':'CEO'
}
f = lambda x:occ_mapping.get(x,x) #如果没有映射，则返回x
fec.contbr_occupation = fec.contbr_occupation.map(f)
emp_mapping = {
    'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
    'INFORMATION REQUESTED':'NOT PROVIDED',
    'SELF':'SELF-EMPLOYED',
    'SELF EMPLOYED':'SELF-EMPLOYED',
}
f = lambda x:emp_mapping.get(x,x) #如果没有映射，则返回x
fec.contbr_employer = fec.contbr_employer.map(f)

2.3 使用pivot_table按党派和职业聚合数据，然后过滤出至少捐赠200万美元的子集

by_occupation = fec.pivot_table('contb_receipt_amt',
                                index = 'contbr_occupation',
                                columns = 'party',aggfunc = 'sum') #按照党派和职业聚合数据
over_2mm = by_occupation[by_occupation.sum(1) > 2000000] #过滤出至少捐赠200万美元的子集
over_2mm

party	Democrat	Republican
contbr_occupation
ATTORNEY	11141982.97	7.463819e+06
C.E.O.	1690.00	2.592983e+06
CEO	2074284.79	1.636258e+06
CONSULTANT	2459912.71	2.530255e+06
ENGINEER	951525.55	1.807054e+06
EXECUTIVE	1355161.05	4.126300e+06
HOMEMAKER	4248875.80	1.362706e+07
INVESTOR	884133.00	2.422579e+06
LAWYER	3160478.87	3.903243e+05
MANAGER	762883.22	1.439982e+06
NOT PROVIDED	4866973.96	2.056547e+07
OWNER	1001567.36	2.406537e+06
PHYSICIAN	3735124.94	3.585995e+06
PRESIDENT	1878509.95	4.714024e+06
PROFESSOR	2165071.08	2.967027e+05
REAL ESTATE	528902.09	1.624752e+06
RETIRED	25305116.38	2.342938e+07
SELF-EMPLOYED	672393.40	1.625303e+06

2.4 以条形图的方式进行数据可视化更为简单

import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure()
over_2mm.plot(kind="barh")

<matplotlib.axes._subplots.AxesSubplot at 0x28036e14da0>




<Figure size 432x288 with 0 Axes>

在这里插入图片描述

2.5 按候选人名称进行分组

def get_top_amounts(group,key,n = 5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.nlargest(n)

2.6 按职业和雇主进行聚合

grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts,'contbr_occupation',n = 7) #按照职业进行聚合

cand_nm        contbr_occupation
Obama, Barack  RETIRED              25305116.38
               ATTORNEY             11141982.97
               NOT PROVIDED          4866973.96
               HOMEMAKER             4248875.80
               PHYSICIAN             3735124.94
               LAWYER                3160478.87
               CONSULTANT            2459912.71
Romney, Mitt   RETIRED              11508473.59
               NOT PROVIDED         11396894.84
               HOMEMAKER             8147446.22
               ATTORNEY              5364718.82
               PRESIDENT             2491244.89
               EXECUTIVE             2300947.03
               C.E.O.                1968386.11
Name: contb_receipt_amt, dtype: float64

grouped.apply(get_top_amounts,"contbr_employer",n=7)

cand_nm        contbr_employer
Obama, Barack  RETIRED            22694358.85
               SELF-EMPLOYED      18626807.16
               NOT EMPLOYED        8586308.70
               NOT PROVIDED        5053480.37
               HOMEMAKER           2605408.54
               STUDENT              318831.45
               VOLUNTEER            257104.00
Romney, Mitt   NOT PROVIDED       12059527.24
               RETIRED            11506225.71
               HOMEMAKER           8147196.22
               SELF-EMPLOYED       7414115.22
               STUDENT              496490.94
               CREDIT SUISSE        281150.00
               MORGAN STANLEY       267266.00
Name: contb_receipt_amt, dtype: float64

3 捐赠金额分桶

3.1 使用cut函数将贡献者的数量按贡献大小离散化分桶

bins = np.array([0,1,10,100,1000,10000,
                 100000,1000000,10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins) #使用cut函数将贡献者的数量按贡献大小离散化分桶
labels.head()

411      (10, 100]
412    (100, 1000]
413    (100, 1000]
414      (10, 100]
415      (10, 100]
Name: contb_receipt_amt, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]

labels.tail()

701381      (10, 100]
701382    (100, 1000]
701383        (1, 10]
701384      (10, 100]
701385    (100, 1000]
Name: contb_receipt_amt, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]

3.2 将Obama和Romney的数据按名称和分类标签进行分组

grouped = fec_mrbo.groupby(['cand_nm',labels]) #将Obama和Romney的数据按名称和分类标签进行分组，以获得捐赠规模的直方图
print(grouped.size().unstack(0))

cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                       493.0          77.0
(1, 10]                    40070.0        3681.0
(10, 100]                 372280.0       31853.0
(100, 1000]               153991.0       43357.0
(1000, 10000]              22284.0       26186.0
(10000, 100000]                2.0           1.0
(100000, 1000000]              3.0           NaN
(1000000, 10000000]            4.0           NaN

3.3 对捐款数额进行求和并在桶内进行归一化，以便对按候选人划分的捐款总额百分比进行可视化

bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis = 1),axis = 0) #对捐款数额进行归一化
print(normed_sums)

cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                    0.805182      0.194818
(1, 10]                   0.918767      0.081233
(10, 100]                 0.910769      0.089231
(100, 1000]               0.710176      0.289824
(1000, 10000]             0.447326      0.552674
(10000, 100000]           0.823120      0.176880
(100000, 1000000]         1.000000           NaN
(1000000, 10000000]       1.000000           NaN

normed_sums[:-2].plot(kind = 'barh')

<matplotlib.axes._subplots.AxesSubplot at 0x2801c458358>

在这里插入图片描述

注：排除了两个最大的箱体，因为这些箱体不是由个人捐赠的。

这种分析可以通过多种方式进行改进和提高。例如，你可以通过捐助者姓名和邮政编码聚合捐款，以便为那些进行很多次小额捐赠的人进行调整，他们并不会进行大型捐赠。

4 按州进行捐赠统计

4.1 将数据按照候选人和州进行聚合

grouped = fec_mrbo.groupby(['cand_nm','contbr_st']) #按照候选人和州进行聚合
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]
print(totals[:10])

cand_nm    Obama, Barack  Romney, Mitt
contbr_st                             
AK             281840.15      86204.24
AL             543123.48     527303.51
AR             359247.28     105556.00
AZ            1506476.98    1888436.23
CA           23824984.24   11237636.60
CO            2132429.49    1506714.12
CT            2068291.26    3499475.45
DC            4373538.80    1025137.50
DE             336669.14      82712.00
FL            7318178.58    8338458.81

4.2 将每一行除以捐款总额，得到每个候选人按州的捐赠总额的相对百分比

percent = totals.div(totals.sum(1),axis = 0) #将每一行除以捐款总额，就可以得到每个候选人按州的捐赠总额的相对百分比
print(percent[:10])

cand_nm    Obama, Barack  Romney, Mitt
contbr_st                             
AK              0.765778      0.234222
AL              0.507390      0.492610
AR              0.772902      0.227098
AZ              0.443745      0.556255
CA              0.679498      0.320502
CO              0.585970      0.414030
CT              0.371476      0.628524
DC              0.810113      0.189887
DE              0.802776      0.197224
FL              0.467417      0.532583