Spark SQL 之 聚合函数 (Aggregation)
案例来自于世界杯足球赛相关数据
数据样式
创建 WorldCUP 的case class
case class WorldCup ( year: String,
country: String,
winner: String,
runnerup: String,
third: String,
fouth: String,
goalscored: Integer,
qualifiedTeam: Integer,
matchedPlayed: Integer,
attendance: Long) ;
案例一: 统计夺冠前三的球队
val fifa_file = "file:///d:/WorldCups.csv" ;
val spark = SparkSession. builder ( ) . config ( conf) . getOrCreate ( ) ;
import spark. implicits. _;
val fifa_ds = spark. read. format ( "csv" ) .
option ( "sep" , "," ) .
option ( "header" , true ) .
option ( "inferSchema" , true ) .
load ( fifa_file) . as[ WorldCup] ;
fifa_ds. printSchema ( ) ;
fifa_ds. persist ( ) ;
fifa_ds. groupBy ( "winner" ) . count ( ) . orderBy ( col ( "count" ) . desc) . show ( 3 ) ;
输出日志
+----------+-----+
| winner| count|
+----------+-----+
| Brazil| 5|
| Italy| 4|
| Germany FR| 3|
+----------+-----+
only showing top 3 rows
案例二: 统计夺冠前三的球队的首次和末次夺冠时间
fifa_ds. groupBy ( "winner" ) . agg (
count ( "winner" ) . as ( "count" ) ,
min ( col ( "year" ) ) , max ( columnName = "year" )
) . orderBy ( col ( "count" ) . desc) . show ( 3 ) ;
输出日志
+----------+-----+---------+---------+
| winner| count| min( year) | max( year) |
+----------+-----+---------+---------+
| Brazil| 5| 1958| 2002|
| Italy| 4| 1934| 2006|
| Germany FR| 3| 1954| 1990|
+----------+-----+---------+---------+
only showing top 3 rows
案例三: 统计夺冠前三的球队的所有夺冠年份用逗号分隔
fifa_ds. groupBy ( "winner" ) . agg (
collect_list ( "year" ) ,
count ( "winner" ) . as ( "count" )
) . orderBy ( col ( "count" ) . desc) . show ( 3 , false ) ;
输出日志
+----------+------------------------------+-----+
| winner | collect_list( year) | count|
+----------+------------------------------+-----+
| Brazil | [ 1958, 1962, 1970, 1994, 2002] | 5 |
| Italy | [ 1934, 1938, 1982, 2006] | 4 |
| Germany FR| [ 1954, 1974, 1990] | 3 |
+----------+------------------------------+-----+
only showing top 3 rows
案例四: 统计不同决赛场次的冠军数
fifa_ds. groupBy ( "qualifiedTeam" ) . pivot ( "winner" ) . count ( ) . show ( ) ;
输出日志
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+
| qualifiedTeam| Argentina| Brazil| England| France| Germany| Germany FR| Italy| Spain| Uruguay|
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+
| 13| null| null| null| null| null| null| null| null| 2|
| 16| 1| 3| 1| null| null| 2| 1| null| null|
| 15| null| null| null| null| null| null| 1| null| null|
| 24| 1| 1| null| null| null| 1| 1| null| null|
| 32| null| 1| null| 1| 1| null| 1| 1| null|
+-------------+---------+------+-------+------+-------+----------+-----+-----+-------+