[AI Project] Income Classification Prediction with Machine Learning: A Report

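This post uses machine learning to predict whether a person's income exceeds 50K. Random forest and decision tree models are used to assess feature importance, and accuracy is estimated with cross-validation.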


Problem

Use 14 features (age, workclass, …, native_country) to predict whether income exceeds 50K. This is a binary classification problem.

Training set

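32,561 samples, each with 14 features plus the label: 6 continuous features and 8 categorical features (the Income label is the ninth categorical column).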
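Feature descriptions:
Age: age;
Workclass: categorical, the type of employment: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked;
Fnlwgt: continuous value (the column is spelled fnlgwt in the CSV files);
Education: education level;
Education-num: time spent in education, stored as a numeric code (one per education level);
Marital_status: marital status;
Occupation: occupation;
Relationship: family relationship;
Race: race;
Sex: sex;
Capital_gain: capital gain;
Capital_loss: capital loss;
Hours/week: working hours per week;
Native country: country of origin;
Income: income. This is the label for the problem;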

Test set

16,281 samples, each with 14 features.

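That is, for the test set, predict from the 14 features (age and so on) whether income exceeds 50K, as a binary classification problem.

Notes

Some feature values are "?", which marks a missing value. These need to be handled first.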
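The post does not show how the "?" values were treated. One common approach, sketched below on a tiny inline stand-in for the CSV (the sample rows and the choice of mode imputation are assumptions for illustration, not the author's method), is to parse "?" as NaN on load and then fill each categorical column with its most frequent value:

```python
import io
import pandas as pd

# Tiny stand-in for train.csv; the real file is not reproduced here.
sample_csv = io.StringIO(
    "Age,Workclass,Occupation\n"
    "25,Private,Sales\n"
    "18,?,?\n"
    "44,Private,Sales\n"
    "38,Local-gov,Protective-serv\n"
)

# Parse "?" as NaN; skipinitialspace also handles " ?" written with a leading blank.
df = pd.read_csv(sample_csv, na_values="?", skipinitialspace=True)

# Count missing values per column before deciding how to treat them.
missing_counts = df.isna().sum()

# One simple option: fill each categorical column with its most frequent value.
filled = df.copy()
for col in ["Workclass", "Occupation"]:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(missing_counts)
print(filled)
```

Dropping the affected rows with `df.dropna()` is the other usual choice. Which one is preferable depends on how many rows carry a "?".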

Experiments

1. Loading the data

# Data Manipulation 
import numpy as np
import pandas as pd

# Visualization 
import matplotlib.pyplot as plt
import missingno   # missing-value visualization
import seaborn as sns
from pandas.plotting import scatter_matrix
from mpl_toolkits.mplot3d import Axes3D

# Feature Selection and Encoding
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Machine learning 
import sklearn.ensemble as ske
from sklearn import datasets, model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
#import tensorflow as tf

# Grid and Random Search
import scipy.stats as st
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Metrics
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc

# Managing Warnings 
import warnings
warnings.filterwarnings('ignore')

# Plot the Figures Inline
%matplotlib inline
# Read the files
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
# Show the first 5 rows
train_data.head()
   Age  Workclass         fnlgwt  Education  Education num  Marital Status      Occupation         Relationship   Race   Sex     Capital Gain  Capital Loss  Hours/Week  Native country  Income
0  39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40          United-States   <=50K
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13          United-States   <=50K
2  38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40          United-States   <=50K
3  53   Private           234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    0             0             40          United-States   <=50K
4  28   Private           338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Black  Female  0             0             40          Cuba            <=50K
# Show the first 5 rows
test_data.head()
   Age  Workclass  fnlgwt  Education     Education num  Marital Status      Occupation         Relationship  Race   Sex     Capital Gain  Capital Loss  Hours/Week  Native country
0  25   Private    226802  11th          7              Never-married       Machine-op-inspct  Own-child     Black  Male    0             0             40          United-States
1  38   Private    89814   HS-grad       9              Married-civ-spouse  Farming-fishing    Husband       White  Male    0             0             50          United-States
2  28   Local-gov  336951  Assoc-acdm    12             Married-civ-spouse  Protective-serv    Husband       White  Male    0             0             40          United-States
3  44   Private    160323  Some-college  10             Married-civ-spouse  Machine-op-inspct  Husband       Black  Male    7688          0             40          United-States
4  18   ?          103497  Some-college  10             Never-married       ?                  Own-child     White  Female  0             0             30          United-States
train_data.shape
(32561, 15)
test_data.shape
(16281, 14)
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
Age               32561 non-null int64
Workclass         32561 non-null object
fnlgwt            32561 non-null int64
Education         32561 non-null object
Education num     32561 non-null int64
Marital Status    32561 non-null object
Occupation        32561 non-null object
Relationship      32561 non-null object
Race              32561 non-null object
Sex               32561 non-null object
Capital Gain      32561 non-null int64
Capital Loss      32561 non-null int64
Hours/Week        32561 non-null int64
Native country    32561 non-null object
Income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
train_data.describe(include=['O'])
        Workclass  Education  Marital Status      Occupation      Relationship  Race   Sex    Native country  Income
count   32561      32561      32561               32561           32561         32561  32561  32561           32561
unique  9          16         7                   15              6             5      2      42              2
top     Private    HS-grad    Married-civ-spouse  Prof-specialty  Husband       White  Male   United-States   <=50K
freq    22696      10501      14976               4140            13193         27816  21790  29170           24720
test_data.describe(include=['O'])
        Workclass  Education  Marital Status      Occupation      Relationship  Race   Sex    Native country
count   16281      16281      16281               16281           16281         16281  16281  16281
unique  9          16         7                   15              6             5      2      41
top     Private    HS-grad    Married-civ-spouse  Prof-specialty  Husband       White  Male   United-States
freq    11210      5283       7403                2032            6523          13946  10860  14662
train_data.columns
Index(['Age', 'Workclass', 'fnlgwt', 'Education', 'Education num',
       'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital Gain', 'Capital Loss', 'Hours/Week', 'Native country',
       'Income'],
      dtype='object')
train_data.dtypes
Age                int64
Workclass         object
fnlgwt             int64
Education         object
Education num      int64
Marital Status    object
Occupation        object
Relationship      object
Race              object
Sex               object
Capital Gain       int64
Capital Loss       int64
Hours/Week         int64
Native country    object
Income            object
dtype: object

2. Visualization

train_data.loc[:,['Age','fnlgwt','Capital Gain','Capital Loss','Hours/Week']].plot(subplots=True,figsize=(15,10))
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001C5C361C898>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001C5C368E630>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001C5C36B2A58>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001C5C345CEB8>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001C5C348D358>],
      dtype=object)


data_int=train_data.loc[:,['Age','fnlgwt','Capital Loss','Hours/Week']]
f,ax=plt.subplots(figsize=(15,15))
sns.heatmap(data_int.corr(),annot=True, linewidths=.5, fmt= '.1f',ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x1c5c3604a20>


data_int.plot(kind='scatter',x='Hours/Week',y='fnlgwt',figsize=(15,8))
<matplotlib.axes._subplots.AxesSubplot at 0x1c5c3daab70>


data_int['Hours/Week'].plot(kind='hist',bins=50,figsize=(15,8))
plt.ylim(0,10000)
# Weekly working hours appear to cluster around 40
(0, 10000)


# Look at the relationship between income and working hours
data_1=train_data.loc[:,['Hours/Week','Income']]
data_1.boxplot(by='Income',figsize=(8,8))
plt.ylim(20,70)
(20, 70)


plt.figure(figsize=(15,8))
sns.stripplot(x='Hours/Week',y='Income',data=data_1,jitter=True)
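# Working hours for those earning <=50K seem fairly evenly spread over 0-60
# Those earning >50K seem to fall mainly within 30-50
# Working hours show some association with income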
<matplotlib.axes._subplots.AxesSubplot at 0x1c5c3e1eba8>


sns.pairplot(train_data.loc[:,['Age','fnlgwt','Capital Loss','Hours/Week','Income']],hue='Income',height=5)
<seaborn.axisgrid.PairGrid at 0x1c5c8ebf208>


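x=train_data.loc[:,['Age', 'fnlgwt', 'Capital Loss', 'Hours/Week']]
y=train_data.Income
# value_counts() sorts by frequency, so position 0 is the majority class (<=50K)
less_than_50k=(y.value_counts().iloc[0])/len(y)
more_than_50k=(y.value_counts().iloc[1])/len(y)
plt.figure(figsize=(10,8))
sns.countplot(x=y)
print('Share with annual income <=50K: %.2f' % less_than_50k)
print('Share with annual income >50K: %.2f' % more_than_50k)
Share with annual income <=50K: 0.76
Share with annual income >50K: 0.24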
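With about 76% of rows in one class, a model that always predicts "<=50K" already reaches roughly 76% accuracy, which is a useful reference point for any accuracy figure on this data. A minimal sketch using scikit-learn's `DummyClassifier`, with the class counts taken from the `describe()` output earlier in the post (24720 of 32561 rows are <=50K); the 0/1 label encoding and the placeholder feature matrix are assumptions for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts from the training-set describe() output: 24720 of 32561 rows are <=50K.
n_total, n_low = 32561, 24720
y = np.array([0] * n_low + [1] * (n_total - n_low))  # 0 = <=50K, 1 = >50K (assumed encoding)
X = np.zeros((n_total, 1))  # features are irrelevant for a constant predictor

# Always predict the most frequent class and measure its accuracy.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)

print(f"majority-class accuracy: {baseline_acc:.4f}")
```

Metrics such as precision, recall, or ROC AUC (already imported at the top of the notebook) are less affected by this imbalance than plain accuracy.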


train_data[['Education num','Education']].head(5)
# Each education level maps to one number
   Education num  Education
0  13             Bachelors
1  13             Bachelors
2  9              HS-grad
3  7              11th
4  13             Bachelors
# Most people have education codes 9, 10, or 13
# 9=HS-grad
# 10=Some-college
# 13=Bachelors
train_data['Education num'].value_counts().plot(kind='barh',figsize=(15,8),grid=False)
# print(data['education.num']==9)
<matplotlib.axes._subplots.AxesSubplot at 0x1c5cf5f8a20>


# Sex and income
sex_with_income=train_data[['Sex','Income']]
plt.figure(figsize=(8,8))
sex_with_income.Sex.value_counts().plot()
sex_with_income.Income.value_counts().plot()
plt.legend()
<matplotlib.legend.Legend at 0x1c5cf5f8e80>


Feature EDA

# Read the files
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
train_data.describe(include=['O']).columns
Index(['Workclass', 'Education', 'Marital Status', 'Occupation',
       'Relationship', 'Race', 'Sex', 'Native country'],
      dtype='object')
test_data.describe(include=['O']).columns
Index(['Workclass', 'Education', 'Marital Status', 'Occupation',
       'Relationship', 'Race', 'Sex', 'Native country'],
      dtype='object')
train_data[['Income','Workclass']].groupby(['Workclass'],as_index=False).mean()
   Workclass         Income
0  ?                 0.104031
1  Federal-gov       0.386458
2  Local-gov         0.294792
3  Never-worked      0.000000
4  Private           0.218673
5  Self-emp-inc      0.557348
6  Self-emp-not-inc  0.284927
7  State-gov         0.271957
8  Without-pay       0.000000
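The means above are fractions, so Income is evidently stored as 0/1 at this point, and the group mean is simply the share of rows with income >50K in each group. A small sketch on hypothetical toy rows (the data and the `rate` / `n` column names are made up for illustration) that also reports the group size, since a rate of 0.0 from a handful of rows says little:

```python
import pandas as pd

# Hypothetical toy rows; Income is assumed to be encoded as 0 (<=50K) / 1 (>50K).
toy = pd.DataFrame({
    "Workclass": ["Private", "Private", "Private", "Private",
                  "Self-emp-inc", "Self-emp-inc"],
    "Income":    [0, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 label = share of positives; count = number of rows behind that share.
rates = (
    toy.groupby("Workclass")["Income"]
       .agg(rate="mean", n="count")
       .sort_values("rate", ascending=False)
)

print(rates)
```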
train_data[['Income','Education']].groupby(['Education'],as_index=False).mean()
    Education     Income
0   10th          0.066452
1   11th          0.051064
2   12th          0.076212
3   1st-4th       0.035714
4   5th-6th       0.048048
5   7th-8th       0.061920
6   9th           0.052529
7   Assoc-acdm    0.248360
8   Assoc-voc     0.261216
9   Bachelors     0.414753
10  Doctorate     0.740920
11  HS-grad       0.159509
12  Masters       0.556587
13  Preschool     0.000000
14  Prof-school   0.734375
15  Some-college  0.190235
train_data[['Income','Marital Status']].groupby(['Marital Status'],as_index=False).mean()
# Marital status is judged to have low correlation here, so it can be dropped
   Marital Status         Income
0  Divorced               0.104209
1  Married-AF-spouse      0.434783
2  Married-civ-spouse     0.446848
3  Married-spouse-absent  0.081340
4  Never-married          0.045961
5  Separated              0.064390
6  Widowed                0.085599
train_data[['Income','Occupation']].groupby(['Occupation'],as_index=False).mean()
# Contains missing values, so drop it
    Occupation         Income
0   ?                  0.103635
1   Adm-clerical       0.134483
2   Armed-Forces       0.111111
3   Craft-repair       0.226641
4   Exec-managerial    0.484014
5   Farming-fishing    0.115694
6   Handlers-cleaners  0.062774
7   Machine-op-inspct  0.124875
8   Other-service      0.041578
9   Priv-house-serv    0.006711
10  Prof-specialty     0.449034
11  Protective-serv    0.325116
12  Sales              0.269315
13  Tech-support       0.304957
14  Transport-moving   0.200376
train_data[['Income','Relationship']].groupby(['Relationship'],as_index=False).mean()
#drop
   Relationship    Income
0  Husband         0.448571
1  Not-in-family   0.103070
2  Other-relative  0.037717
3  Own-child       0.013220
4  Unmarried       0.063262
5  Wife            0.475128
train_data[['Income','Race']].groupby(['Race'],as_index=False).mean()
# keep
   Race                Income
0  Amer-Indian-Eskimo  0.115756
1  Asian-Pac-Islander  0.265640
2  Black               0.123880
3  Other               0.092251
4  White               0.255860
train_data[['Income','Sex']].groupby(['Sex'],as_index=False).mean()
# keep for now
   Sex     Income
0  Female  0.109461
1  Male    0.305737
train_data[['Income','Native country']].groupby(['Native country'],as_index=False).mean()
#drop
    Native country              Income
0   ?                           0.250429
1   Cambodia                    0.368421
2   Canada                      0.322314
3   China                       0.266667
4   Columbia                    0.033898
5   Cuba                        0.263158
6   Dominican-Republic          0.028571
7   Ecuador                     0.142857
8   El-Salvador                 0.084906
9   England                     0.333333
10  France                      0.413793
11  Germany                     0.321168
12  Greece                      0.275862
13  Guatemala                   0.046875
14  Haiti                       0.090909
15  Holand-Netherlands          0.000000
16  Honduras                    0.076923
17  Hong                        0.300000
18  Hungary                     0.230769
19  India                       0.400000
20  Iran                        0.418605
21  Ireland                     0.208333
22  Italy                       0.342466
23  Jamaica                     0.123457
24  Japan                       0.387097
25  Laos                        0.111111
26  Mexico                      0.051322
27  Nicaragua                   0.058824
28  Outlying-US(Guam-USVI-etc)  0.000000
29  Peru                        0.064516
30  Philippines                 0.308081
31  Poland                      0.200000
32  Portugal                    0.108108
33  Puerto-Rico                 0.105263
34  Scotland                    0.250000
35  South                       0.200000
36  Taiwan                      0.392157
37  Thailand                    0.166667
38  Trinadad&Tobago             0.105263
39  United-States               0.245835
40  Vietnam                     0.074627
41  Yugoslavia                  0.375000
train_data.describe().columns
Index(['Age', 'fnlgwt', 'Education num', 'Capital Gain', 'Capital Loss',
       'Hours/Week', 'Income'],
      dtype='object')
g=sns.FacetGrid(train_data,col='Income')
g.map(plt.hist,'Age',bins=20)
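print('Observations:')
print("Those earning <=50K are mostly aged 20-40")
print('Fewer people earn >50K; they are mostly aged 30-50')
print('Age should be kept as a feature')
Observations:
Those earning <=50K are mostly aged 20-40
Fewer people earn >50K; they are mostly aged 30-50
Age should be kept as a feature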
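The histogram observation can also be checked numerically by binning Age and computing the >50K rate per band. A sketch on hypothetical toy rows (the data, the bin edges, and the `AgeBand` column name are assumptions for illustration; Income is assumed to be 0/1):

```python
import pandas as pd

# Hypothetical toy rows; Income is assumed to be encoded as 0 (<=50K) / 1 (>50K).
toy = pd.DataFrame({
    "Age":    [22, 25, 31, 38, 45, 47, 52, 63],
    "Income": [0,  0,  0,  1,  1,  0,  1,  0],
})

# Right-closed bins: (16, 30], (30, 50], (50, 90].
bins = [16, 30, 50, 90]
labels = ["17-30", "31-50", "51-90"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Share of >50K rows in each age band.
band_rates = toy.groupby("AgeBand", observed=True)["Income"].mean()

print(band_rates)
```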


# Categorical features to keep: sex, race, education
grid=sns.FacetGrid(train_data,col='Income',row='Education',height=3,aspect=2)
grid.map(plt.hist,'Age',alpha=0.5,bins=20)
grid.add_legend()
print('Observations:')
print('Three education levels: bachelors, HS-grad, some-college')
print('For some-college, those earning under 50K are mostly aged 20-30')
Observations:
Three education levels: bachelors, HS-grad, some-college
For some-college, those earning under 50K are mostly aged 20-30


grid=sns.FacetGrid(train_data,col='Income',row='Education',height=3,aspect=2)
grid.map(plt.hist,'fnlgwt',alpha=0.5,bins=20)
grid.add_legend()
print('Observations:')
print('Drop the fnlwgt feature')
Observations:
Drop the fnlwgt feature


train_data.columns
Index(['Age', 'Workclass', 'fnlgwt', 'Education', 'Education num',
       'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital Gain', 'Capital Loss', 'Hours/Week', 'Native country',
       'Income'],
      dtype='object')
grid=sns.FacetGrid(train_data,col='Income',row='Race',height=3,aspect=2)
grid.map(plt.hist,'Hours/Week',alpha=0.5,bins=20)
grid.add_legend()
# print('Observations:')
# print('Drop the hours.per.week feature')
<seaborn.axisgrid.FacetGrid at 0x1c5d1739278>


train_data.columns
Index(['Age', 'Workclass', 'fnlgwt', 'Education', 'Education num',
       'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital Gain', 'Capital Loss', 'Hours/Week', 'Native country',
       'Income'],
      dtype='object')
train_data
       Age  Workclass         fnlgwt  Education  Education num  Marital Status      Occupation         Relationship   Race   Sex     Capital Gain  Capital Loss  Hours/Week  Native country  Income
0      39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40          United-States   0
1      50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13          United-States   0
2      38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40          United-States   0
...
32560  52   Self-emp-inc      287927  HS-grad    9              Married-civ-spouse  Exec-managerial    Wife           White  Female  15024         0             40          United-States   1

32561 rows × 15 columns

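# Categorical features to keep: sex, race, education
grid=sns.FacetGrid(train_data,row='Race',height=3,aspect=5)
grid.map(sns.pointplot,'Education','Income','Sex',markers=["^", "o"], linestyles=["-", "--"])
grid.add_legend()
print("Women's income is consistently lower than men's")
print('Black doctorate holders have high income; Asian women with doctorates earn more than Asian men with doctorates')
print("Asian and American Indian (Amer-Indian-Eskimo) master's holders have high income; among Asians men are higher than women, among American Indians women are as high as men")
Women's income is consistently lower than men's
Black doctorate holders have high income; Asian women with doctorates earn more than Asian men with doctorates
Asian and American Indian (Amer-Indian-Eskimo) master's holders have high income; among Asians men are higher than women, among American Indians women are as high as men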


# Correlation between categorical and numerical features
# sex, race, education
grid = sns.FacetGrid(train_data, row='Education', col='Income', height=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Age', alpha=0.5, ci=None)
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x1c5d119ba20>


start

# Read the files
import pandas as pd

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
train_data.columns
Index(['Age', 'Workclass', 'fnlgwt', 'Education', 'Education num',
       'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital Gain', 'Capital Loss', 'Hours/Week', 'Native country',
       'Income'],
      dtype='object')
# Drop features from the training set
train_data=train_data.drop(['Workclass','fnlgwt','Education','Marital Status',
                'Occupation','Relationship','Capital Gain','Capital Loss','Native country','Hours/Week'],axis=1)
# Drop the same features from the test set
test_data = test_data.drop(['Workclass','fnlgwt','Education','Marital Status',
                'Occupation','Relationship','Capital Gain','Capital Loss','Native country','Hours/Week'],axis=1)
train_data.shape
(32561, 5)
test_data.shape
(16281, 4)
train_data.Race.unique()
array([1, 2, 3, 4, 5], dtype=int64)
train_data
       Age  Education num  Race  Sex  Income
0      39   13             1     1    0
1      50   13             1     1    0
...
32559  22   9              1     1    0
32560  52   9              1     0    1

32561 rows × 5 columns

test_data
       Age  Education num  Race  Sex
0      25   7              2     1
1      38   9              1     1
...
16280  35   13             1     1

16281 rows × 4 columns
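The tables above show Race, Sex, and Income already as integers, but the post does not include the step that produced them. One possible encoding is sketched below. The codes White=1, Black=2, Male=1, Female=0 and the 0/1 income labels are consistent with the rows shown in the post; the codes for the remaining Race values, and all variable names, are assumptions for illustration:

```python
import pandas as pd

# Hypothetical raw rows in the original string form.
raw = pd.DataFrame({
    "Race":   ["White", "Black", "White"],
    "Sex":    ["Male", "Male", "Female"],
    "Income": ["<=50K", "<=50K", ">50K"],
})

# White=1, Black=2 match the tables in the post; codes 3-5 are assumed.
race_codes = {"White": 1, "Black": 2, "Asian-Pac-Islander": 3,
              "Amer-Indian-Eskimo": 4, "Other": 5}
sex_codes = {"Female": 0, "Male": 1}
income_codes = {"<=50K": 0, ">50K": 1}

# Map each string column to its integer code; strip() guards against stray spaces.
encoded = raw.assign(
    Race=raw["Race"].map(race_codes),
    Sex=raw["Sex"].map(sex_codes),
    Income=raw["Income"].str.strip().map(income_codes),
)

print(encoded)
```

Integer codes impose an arbitrary order on Race. Tree-based models tolerate that, but one-hot encoding (`pd.get_dummies`) would be the safer choice for linear models.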

Model

target='Income'
x_columns=[x for x in train_data.columns if x not in [target]]
X_train=train_data[x_columns]
Y_train=train_data['Income']
from sklearn.model_selection import train_test_split
x_train,x_valid,y_train,y_valid=train_test_split(X_train,Y_train,test_size=0.1)
x_train.shape,y_train.shape,x_valid.shape,y_valid.shape
((29304, 4), (29304,), (3257, 4), (3257,))
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train,y_train)
RandomForestClassifier()
rf.feature_importances_
array([0.4581014 , 0.38324504, 0.03954043, 0.11911313])
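The importance array follows the column order of the training frame, so pairing it with the column names makes it readable. A sketch using the values printed above and the four columns that remain after the drop step (with a fitted model one would pass `rf.feature_importances_` and `x_train.columns` instead of the literals; the `ranked` name is made up here):

```python
import pandas as pd

# Column order after the drop step, and the importances printed in the post.
feature_names = ["Age", "Education num", "Race", "Sex"]
importances = [0.4581014, 0.38324504, 0.03954043, 0.11911313]

# Label each importance with its feature and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)

print(ranked)
```

In this run Age and Education num carry most of the weight, with Race contributing least.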
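from sklearn.model_selection import cross_val_score
# Note: cross_val_score clones rf and refits it on folds of the validation split,
# so this score does not evaluate the rf instance fitted on x_train above
scores=cross_val_score(rf,x_valid,y_valid)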
print(round(scores.mean()*100,2),'%')
77.1 %
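Two more conventional ways to estimate accuracy are to score the already-fitted model on the hold-out split, or to cross-validate on the training split. A sketch on synthetic stand-in data, not the census data: the `make_classification` settings, `n_estimators=50`, and the random seeds are all assumptions chosen for illustration, so the printed numbers are not comparable with the figures in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data: 4 features, roughly a 76/24 class split.
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.76, 0.24], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1,
                                          stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# (a) Hold-out accuracy of the model fitted on the training split.
holdout_acc = model.fit(X_tr, y_tr).score(X_va, y_va)

# (b) 5-fold cross-validation on the training split; each fold refits a clone.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)

print(f"hold-out accuracy: {holdout_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Passing `stratify=y` keeps the class ratio the same in both splits, which matters when one class is about three times larger than the other.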

from sklearn.tree import DecisionTreeClassifier
# Note: the name rf is reused here, so from this point on it refers to the decision tree
rf = DecisionTreeClassifier()
rf.fit(x_train,y_train)
DecisionTreeClassifier()
rf.feature_importances_
array([0.39165374, 0.43142822, 0.0535285 , 0.12338954])
from sklearn.model_selection import cross_val_score
scores=cross_val_score(rf,x_valid,y_valid)
print(round(scores.mean()*100,2),'%')
74.82 %

submit.csv

# rf is the decision tree fitted above (the name was reused), so these are its predictions
predict = rf.predict(test_data)
predict
array([0, 0, 0, ..., 1, 1, 1], dtype=int64)
import pandas as pd
df = pd.DataFrame({"income":predict})
df.to_csv("submit.csv",index=False)
df.head()
   income
0  0
1  0
2  0
3  0
4  0
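Before submitting, it can help to check that the file has one prediction per test row and contains only the expected labels. A small sketch with stand-in values (the `predict` array and `n_test_rows` here are hypothetical; in the notebook they would be the model output and `len(test_data)`):

```python
import numpy as np
import pandas as pd

predict = np.array([0, 0, 0, 1, 1])  # stand-in for rf.predict(test_data)
n_test_rows = 5                      # stand-in for len(test_data)

submission = pd.DataFrame({"income": predict})

# One prediction per test row, and only the labels 0 and 1.
assert len(submission) == n_test_rows
assert set(submission["income"].unique()) <= {0, 1}

# Render the CSV text without the index column (the same layout as submit.csv).
csv_text = submission.to_csv(index=False)
print(csv_text)
```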

Summary

This section used machine learning to predict the income class.