Financial Risk Control Project

Latest recommended article published on 2024-11-08 15:01:21

ballhacker

Category: Machine Learning. Tags: Machine Learning

Article link: https://blog.csdn.net/ballhacker/article/details/107722074

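The goal is to predict whether a user will become overdue, based on the user's basic information and history. The data used is the PPDai (拍拍贷) dataset: https://www.kesci.com/home/competition/56cd5f02b89b5bd026cb39c9/content/1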

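The dataset provides three different types of data:

Master: the user's main information
Loginfo: login information
Userupdateinfo: information-update records

In this project we use only the Master information to predict whether a user will become overdue. The data contains a field named target (accessed below as data.target), which is the sample label.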

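import math
import os
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter magic: render plots inline (only valid inside a notebook)
%matplotlib inline

# Show floats with three decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Plot styling
plt.style.use('ggplot')
sns.set_palette('muted')
sns.set_style('darkgrid')

# Silence library warnings to keep notebook output readable
warnings.filterwarnings('ignore')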

# Read the Master data
data = pd.read_csv('data/Training/PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gb18030')
print(data.shape)

# Show the first few records
print(data.head())

# Ratio of positive to negative samples; the classes are clearly imbalanced
data.target.value_counts()
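
To put a number on the imbalance, the raw counts can be turned into class proportions and a majority-to-minority ratio. The following is only a sketch on a toy frame: the label values are made up, and only the column name target comes from the code above.

```python
import pandas as pd

# Toy labels standing in for the real `target` column (values are made up).
toy = pd.DataFrame({"target": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Absolute count per class.
counts = toy["target"].value_counts()

# Share of each class; normalize=True divides by the number of rows.
ratios = toy["target"].value_counts(normalize=True)

# Majority-to-minority ratio, a quick single-number view of the imbalance.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(ratios)
print(imbalance_ratio)
```

On the real data the same three lines would be applied to data["target"]. A large ratio suggests that plain accuracy will be a misleading metric for this task.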

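1. Data preprocessing

Missing values: the data contains many missing values, which need to be handled.
String cleaning: for example, merge "北京市" and "北京" into "北京", convert everything to lowercase, and so on.
Binarization: see the course material for the specific method.
Derived features: for example, whether the registered household location and the current city are the same.
One-hot encoding: apply one-hot encoding to categorical features.

Continuous features: handle them case by case.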
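
Several of the steps listed above (string cleaning, a derived feature, one-hot encoding, and one possible treatment of a continuous feature) can be sketched on a toy frame. This is not the post's implementation. The column names registered_city, current_city and income are hypothetical and do not come from the Master file, and the log transform is just one option for a skewed continuous feature.

```python
import numpy as np
import pandas as pd

# Toy frame; every column name here is hypothetical.
df = pd.DataFrame({
    "registered_city": ["北京市", "北京", "Shanghai ", "深圳市"],
    "current_city": ["北京", "上海", "shanghai", "深圳"],
    "income": [3000.0, 12000.0, 0.0, 45000.0],
})


def clean_city(s: pd.Series) -> pd.Series:
    """Strip whitespace, lowercase, and drop a trailing '市' suffix."""
    return (
        s.astype(str)
        .str.strip()
        .str.lower()
        .str.replace("市$", "", regex=True)
    )


# String cleaning: "北京市" and "北京" both become "北京".
for col in ["registered_city", "current_city"]:
    df[col] = clean_city(df[col])

# Derived feature: 1 if the registered city equals the current city.
df["same_city"] = (df["registered_city"] == df["current_city"]).astype(int)

# One-hot encoding of the categorical columns.
encoded = pd.get_dummies(
    df, columns=["registered_city", "current_city"], dtype=int
)

# Continuous feature: log1p compresses a long right tail and handles zeros.
encoded["income_log"] = np.log1p(encoded["income"])

print(encoded.head())
```

The derived feature is computed before encoding because the comparison needs the cleaned string values, which get_dummies replaces with indicator columns.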

from pandas.api.types import is_numeric_dtype
from pandas.api.types import is_string_dtype
from sklearn.preprocessing import OneHotEncoder

def process_missing(df,kind):
    
    """
    1) Delete sparse columns/rows
    2) fill NA
    """
    
    for index in range(df.shape[0]):
        
        if df.i
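
The listing above is cut off in the source, so the author's process_missing is not recoverable. As a separate illustration of the two steps named in its docstring (delete sparse columns/rows, then fill NA), here is a minimal sketch. The helper name drop_sparse_and_fill, the 0.5 thresholds, and the median / "unknown" fill values are all assumptions and are not taken from the post.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def drop_sparse_and_fill(df, col_thresh=0.5, row_thresh=0.5):
    """Hypothetical helper (not the author's process_missing).

    1) Drop columns, then rows, whose missing fraction exceeds the threshold.
    2) Fill remaining gaps: median for numeric columns, "unknown" otherwise.
    """
    out = df.copy()

    # Fraction of missing values per column; keep columns at or below the limit.
    col_missing = out.isna().mean(axis=0)
    out = out.loc[:, col_missing <= col_thresh]

    # Fraction of missing values per row, computed on the remaining columns.
    row_missing = out.isna().mean(axis=1)
    out = out.loc[row_missing <= row_thresh].copy()

    # Fill what is left, column by column.
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("unknown")
    return out


# Toy data with made-up column names.
toy = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan, 31],
    "city": ["北京", None, "上海", "深圳", None],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

result = drop_sparse_and_fill(toy)
print(result)
```

Working on column-wise and row-wise missing fractions avoids an explicit Python loop over rows, which matters on a table with tens of thousands of records.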
