With the growth of the Internet and the spread of mobile devices, user data has grown explosively. A simple awk command is no longer enough to analyze logs at this scale; we need an environment with large-scale distributed processing power such as Hadoop. This article demonstrates with a small example based on the file show.log.
Counting and ranking entries in a single log file (file: show.log)
awk '{a[$1]+=1;}END{for(i in a){print a[i]" "i;}}' show.log | sort -n -r | head -n 10
awk '{print $1}' show.log | sort | uniq -c | sort -k1,1nr | head -10
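Both one-liners count how many times each distinct value of the first field appears and print the ten most frequent, count first and key second. Assuming for illustration that the first field of show.log is a user ID, either pipeline prints lines like (hypothetical values):
523 u10086
417 u10010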
Running the analysis on Hadoop
1. Upload the /opdir/show.log file to HDFS:
hadoop fs -put /opdir/show.log /log/show/20160416
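If the target directory does not exist yet, create it first so that show.log lands inside it as a file, and then verify the upload:
hadoop fs -mkdir -p /log/show/20160416
hadoop fs -ls /log/show/20160416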
2. Run the statistical analysis:
hadoop jar /usr/local/hadoop-2.7.2/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar -input /log/show/20160416/* -output /usr/ydcun/output20160416 -mapper "awk '{print \$1}'" -reducer "awk '{sum[\$1]++}END{for(key in sum) print key\"\t\"sum[key]}'"
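Note that a streaming job refuses to start if the output directory already exists, so remove it before re-running:
hadoop fs -rm -r /usr/ydcun/output20160416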
3. View the results:
hadoop fs -cat /usr/ydcun/output20160416/part-00000
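Each output line is a key (the first log field), a tab, and its count, for example (hypothetical values):
u10086	523
u10010	417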
Importing the analysis results into MySQL
Contents of load.sh:
#!/bin/bash
MYPATH="/home/centos1/Desktop"
HADOOPCMD="/usr/local/hadoop-2.7.2/bin/hadoop"
# Total PV across all keys, tagged with the date of this log
$HADOOPCMD fs -cat /usr/ydcun/output20160416/* | /usr/bin/awk '{sum+=$2}END{print sum"\t20160416"}' > $MYPATH/statpv.20160416
# Top 10 keys by PV, tagged with the same date
$HADOOPCMD fs -cat /usr/ydcun/output20160416/* | /usr/bin/awk '{sum[$1]=$2}END{for(key in sum) print key"\t"sum[key]"\t20160416"}' | sort -k2,2 -n -r | head -n 10 > $MYPATH/stattop.20160416
mysql -uroot -p123 -e "use test;
load data local infile '$MYPATH/statpv.20160416' into table statpv_00;
load data local infile '$MYPATH/stattop.20160416' into table stattop_00;"
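load.sh assumes the two target tables already exist and that LOAD DATA LOCAL INFILE is allowed (if the client refuses it, start mysql with --local-infile=1). A minimal sketch of schemas matching the loaded columns, with assumed column names, would be:
mysql -uroot -p123 -e "use test;
create table if not exists statpv_00 (pv bigint, day char(8));
create table if not exists stattop_00 (userid varchar(64), pv bigint, day char(8));"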
Reading the MySQL data with Python
Contents of show.py (Python 2; the commands module it uses was removed in Python 3):
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import commands
# Run the two queries through the mysql client and capture their output (-N suppresses column headers)
(status,output1) = commands.getstatusoutput('mysql -uhduser -phduser -Ne "use test;select * from statpv_00;"')
(status,output2) = commands.getstatusoutput('mysql -uhduser -phduser -Ne "use test;select userid,pv from stattop_00;"')
# CGI response: Content-type header, a blank line, then the HTML body
print "Content-type: text/html\r\n\r\n"
print output1 + "<br><br>" + output2.replace('\n','<br>')
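Before wiring it up as a CGI script, it can be sanity-checked from a shell; it should print the Content-type header followed by the HTML body:
chmod +x show.py
./show.py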
Starting the Python CGI server
1. Put show.py under the cgi-bin folder (CGIHTTPServer only executes scripts under cgi-bin or htbin) and give it execute permission.
2. Run python -m CGIHTTPServer (this command cannot be run as the root account).
3. Open http://10.9.110.63:8000/cgi-bin/show.py to view the result.
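A minimal start sequence, assuming show.py was placed at /home/centos1/Desktop/cgi-bin/show.py (the base path is taken from load.sh above; adjust it to your setup):
cd /home/centos1/Desktop
python -m CGIHTTPServer
By default the server listens on port 8000 and only executes scripts found under the cgi-bin (or htbin) subdirectory of the directory it was started in.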