Prerequisites
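The Python runtime is installed with Miniconda. For details, see the "Preparing the conda environment" section of the article "Using PyFlink: How to Efficiently Develop PyFlink Jobs in Zeppelin".
Note: jupyter, grpcio, and protobuf are packages that Zeppelin itself requires.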
pip install jupyter
pip install grpcio
pip install protobuf
pip install pandas
pip install matplotlib
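After installing, it can be useful to confirm that the packages are visible to the Python environment Zeppelin uses. The following is a minimal sketch, not part of Zeppelin: the helper name missing_packages is hypothetical, and the import names in REQUIRED (for example grpc for grpcio, google.protobuf for protobuf, jupyter_client for jupyter) are assumptions about how the pip packages map to modules.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names in module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised, for example, when a parent package such as "google" is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Assumed import names; pip names and import names can differ
REQUIRED = ["jupyter_client", "grpc", "google.protobuf", "pandas", "matplotlib"]
print(missing_packages(REQUIRED))
```

An empty list means every module was found. Anything printed should be installed with pip in the same conda environment.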
Basic usage
Set zeppelin.python.useIPython = true to make IPython the default interpreter.
Viewing help
Use ? or help():
%python.ipython
import sys
sys?
help(sys)
Magic functions
%python.ipython
%timeit range(2)
%alias parts echo first %s second %s
%parts A B
%conda list
Use ZeppelinContext
ZeppelinContext is a utility class that provides the following features:
- Dynamic forms
- Showing a DataFrame with the built-in visualization
z.input(name='my_name', defaultValue='hello')
import pandas as pd
df = pd.DataFrame({'name':['a','b','c'], 'count':[12,24,18]})
z.show(df)
Run shell command
%python
!pip install pandas
Data visualization
The examples run a batch ETL task on the Bank (https://archive.ics.uci.edu/ml/datasets/bank+marketing) data.
Download the data:
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
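The downloaded archive still has to be unpacked before pandas can read it. The following is a minimal sketch using only the standard library. The helper name extract_csv_files is hypothetical, and it assumes that the archive contains bank.csv and that /data/flink (the directory read later in this article) is writable.

```python
import zipfile
from pathlib import Path


def extract_csv_files(zip_path, dest_dir):
    """Extract every .csv member of zip_path into dest_dir and return the paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip documentation files and anything else that is not CSV
            if member.lower().endswith(".csv"):
                archive.extract(member, dest)
                extracted.append(dest / member)
    return extracted


# Usage (assumes bank.zip is in the current working directory):
# extract_csv_files("bank.zip", "/data/flink")
```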
Using matplotlib
%python
%matplotlib inline
import matplotlib.pyplot as plt
print("hello world")
data=[1,2,3,4]
plt.figure()
plt.plot(data)
Using pandas
%python
import pandas as pd
rates = pd.read_csv("/data/flink/bank.csv", sep=";")
z.show(rates)
SQL over Pandas DataFrames
This requires pandasql:
pip install -U pandasql
%python.sql
SELECT marital,count(1) as total FROM rates WHERE age < 40 group by marital
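For comparison, the same aggregation can be written directly in pandas. This is a sketch on made-up sample rows rather than the real bank.csv. The function name count_by_marital is hypothetical, and only the age and marital columns from the query are assumed.

```python
import pandas as pd


def count_by_marital(rates: pd.DataFrame, max_age: int = 40) -> pd.DataFrame:
    """Count rows per marital status for customers younger than max_age."""
    under = rates[rates["age"] < max_age]
    # size() counts rows per group, like count(1) in the SQL version
    return under.groupby("marital").size().reset_index(name="total")


# Made-up rows standing in for the bank data
sample = pd.DataFrame({
    "age": [30, 45, 25, 38, 52],
    "marital": ["married", "married", "single", "single", "divorced"],
})
result = count_by_marital(sample)
print(result)
```

Inside Zeppelin, the result could be passed to z.show() in the same way as any other DataFrame.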
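Fixing garbled Chinese characters
If Chinese text is rendered as garbled characters, fix it as follows:
- Find the directory that contains the matplotlibrc file, e.g. /opt/miniconda3/lib/python3.9/site-packages/matplotlib/mpl-data: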
import matplotlib as mpl
mpl.matplotlib_fname()
- Edit matplotlibrc:
font.family: sans-serif
font.sans-serif: SimHei,Simfang,SimKai,DejaVu Sans, Bitstream Vera Sans, sans-serif
axes.unicode_minus: False
- In the Python script, add the configuration that enables Chinese text; see the three mpl.rcParams settings in the code below
- Delete the cache:
~/.cache/matplotlib/
- Restart the program, e.g. Zeppelin:
zeppelin-0.10.0-bin-all/bin/zeppelin-daemon.sh restart
%python
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['KaiTi', 'SimHei', 'FangSong'] # Chinese fonts: prefer KaiTi, fall back to SimHei if KaiTi is not found
mpl.rcParams['font.size'] = 12 # font size
mpl.rcParams['axes.unicode_minus'] = False # render the minus sign correctly
weather = pd.read_csv('https://gitee.com/cloudcoder/data-visual/raw/master/data/london/london2018.csv')
weather.index=['一月','二月','三月','四月','五月','六月','七月','八月','九月','十月','十一月','十二月']
new_df = weather[['Tmax','Tmin','Rain','Sun']]
new_df.plot.pie(subplots=True, figsize=(32, 8), autopct='%1.1f%%', legend=False)
plt.show()
weather.plot(kind='bar', y=['Tmax', 'Tmin','Rain','Sun'], subplots=True, layout=(2,2), figsize=(10,5))
plt.show()