Using a random forest and tuning its hyperparameters in this manner, sweeping each one to find the peak, improved the result by one or two percentage points; the test scores within the Train set are now all above 4.0.
Since xgboost runs slowly on my machine, I am no longer using it.
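For reference, here is a minimal sketch of the kind of xgboost baseline that was skipped. XGBClassifier is the real scikit-learn-style wrapper from the xgboost package, but the parameter values below are placeholders, not anything actually tuned in this write-up:

from xgboost import XGBClassifier

# Placeholder hyperparameters; training cost, not accuracy, was the blocker here.
xgb = XGBClassifier(n_estimators=200,   # number of boosting rounds
                    max_depth=6,        # tree depth
                    learning_rate=0.1,  # shrinkage
                    n_jobs=-1)          # use all cores
xgb.fit(x_train, y_train)
print(xgb.score(x_test, y_test))  # mean accuracy, same interface as sklearn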
Next I added per-province population and per-capita GDP figures found online:
GDPAVG = {41336: 99, 41325: 99, 41367: 80, 41401: 662, 41415: 7327, 41324: 582, 41332: 404, 41335: 225, 41330: 119, 41380: 934, 41327: 309, 41345: 73, 41342: 93, 41326: 187, 41361: 255}
Population = {41336: 346, 41325: 204, 41367: 168, 41401: 168, 41415: 9, 41324: 79, 41332: 103, 41335: 157, 41330: 244, 41380: 25, 41327: 161, 41345: 327, 41342: 528, 41326: 256, 41361: 115}
df['GDPAVG'] = df['State'].map(GDPAVG)
df['Population'] = df['State'].map(Population)
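Note that Series.map leaves NaN wherever a State code is missing from the lookup dictionaries, so it is worth checking coverage before training. A small sanity check, assuming df is the full feature frame (the median fill is just one illustrative choice):

# State codes absent from the lookup tables become NaN after .map();
# surface them explicitly instead of letting the model see NaN.
missing = df.loc[df['GDPAVG'].isna(), 'State'].unique()
print(missing)  # should be empty if every State code is covered
df['GDPAVG'] = df['GDPAVG'].fillna(df['GDPAVG'].median())
df['Population'] = df['Population'].fillna(df['Population'].median())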
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Sweep min_samples_leaf while holding the previously tuned values fixed,
# recording the held-out score at each setting.
test = []
ranges = range(2, 20)
for i in ranges:
    rfc = RandomForestClassifier(n_estimators=230,
                                 max_depth=11,
                                 max_features=4,
                                 min_samples_split=10,
                                 random_state=10,
                                 min_samples_leaf=i)
    rfc.fit(x_train, y_train)
    score = rfc.score(x_test, y_test)  # mean accuracy on the held-out split
    test.append(score)

plt.plot(ranges, test, color="red", label="min_samples_leaf")
plt.legend()
plt.show()
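Eyeballing the curve works, but the peak can also be read off programmatically, and since each hyperparameter here is swept one at a time, scikit-learn's GridSearchCV is a more systematic alternative that searches them jointly. A sketch under the same x_train/y_train names; the parameter grid below is illustrative, not the values actually searched:

import numpy as np
from sklearn.model_selection import GridSearchCV

# Read the best leaf size straight from the sweep above.
best_leaf = ranges[int(np.argmax(test))]
print(best_leaf, max(test))

# Alternative: cross-validated joint search over several hyperparameters.
param_grid = {"min_samples_leaf": range(2, 20),  # illustrative grid
              "max_depth": [9, 11, 13]}
gs = GridSearchCV(RandomForestClassifier(n_estimators=230, random_state=10),
                  param_grid, cv=5, n_jobs=-1)
gs.fit(x_train, y_train)
print(gs.best_params_, gs.best_score_)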