Hive Installation

王小禾

Article link: https://blog.csdn.net/answer100answer/article/details/88946069

https://mp.weixin.qq.com/s/qtpPM5v27WGBcYyYeBTczQ

Environment

Hadoop 2.7.2
MySQL 5.7
JDK 8

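Concepts

About the Hive metastore
Reposted from: "Why does Hive need the Metastore enabled?"

1. Metadata:

Metadata is the meta-information about the databases, tables, and other objects created with Hive. It is stored in a relational database such as Derby or MySQL.

2. What the metastore does:

Clients connect to the metastore service, and the metastore in turn connects to MySQL to read and write metadata. With a metastore service in place, multiple clients can connect at the same time, and none of them needs to know the MySQL username and password; they only need to reach the metastore service.

3. The metastore can be started in three ways:

Default:
When no metastore is configured, every bin/hive session and every HiveServer2 instance starts an embedded metastore inside its own process. This wastes resources: open several terminals and you get several metastore servers.
Local metastore:
Described in the reposted article as the mode to configure when the metastore and the database holding the metadata (MySQL) are on the same machine. The metastore service only needs to be started once, which avoids wasting resources.
Remote metastore:
Described as the mode to configure when the metastore and the database holding the metadata (MySQL) are on different machines. Again, the metastore service only needs to be started once.
Note that this same-machine/different-machine split is a simplification. As the next section explains, what separates local from remote is whether the metastore runs in the same process as the Hive service.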

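Three metastore configuration modes

Because metadata is modified and updated constantly, it is a poor fit for HDFS and is normally kept in an RDBMS.

1. Embedded mode

The Hive service and the metastore service run in the same process, and so does Derby. Embedded mode stores the metadata in the embedded Derby database and needs no separate metastore service.
It is the default and simple to configure, but it is usually said to allow only one client connection at a time. (That wording is a little misleading. What actually happens is that every Hive service you start embeds its own metastore, and starting another one embeds another; embedded Derby lets only one process open a given metastore_db directory at a time, so sessions do not share one metastore, and running several wastes resources.) It is suitable for experiments, not for production.

2. Local mode: a MySQL installation replaces Derby as the metadata store

The embedded Derby database is no longer used; another database such as MySQL stores the metadata. The Hive service and the metastore service still run in the same process. MySQL is a separate process and can be on the same machine or on a remote one. (One setup I used before falls into this category: Hive was configured only on the gateway machine, together with the MySQL database, user, and password, while the cluster itself had no Hive configuration and ran no Hive services.)

This is a multi-user mode: it allows multiple user clients to connect to one database, and it is commonly used for shared Hive access inside a company. Every user must have access to MySQL, that is, every client user needs to know the MySQL username and password.

This installation uses local mode, so no metastore service needs to be started.

3. Remote mode: a remotely installed MySQL replaces Derby as the metadata store

The Hive service and the metastore run in different processes, possibly on different machines.
A remote metastore has to be started as a separate service, and every client is configured (in its configuration file) to connect to that service. The metastore runs as a standalone service. Clients connect through beeline and do not need to know the database password beforehand.
Merely connecting to a remote MySQL does not make it "remote mode"; remote or not refers to whether the metastore and the Hive service run in the same process.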
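
As a rough illustration of the distinction above, the mode can be read off two hive-site.xml settings: a non-empty hive.metastore.uris means remote mode; otherwise a Derby JDBC URL means embedded mode and any other JDBC URL means local mode. The helper name metastore_mode, its argument order, and the thrift://cluster-host2:9083 address are assumptions made up for this sketch.

```shell
# Hypothetical helper: classify the metastore mode from two hive-site.xml values.
#   $1 = value of hive.metastore.uris (empty if unset)
#   $2 = value of javax.jdo.option.ConnectionURL
metastore_mode() {
  uris="$1"
  jdbc_url="$2"
  if [ -n "$uris" ]; then
    # Clients talk to a standalone metastore service over Thrift
    echo "remote"
    return
  fi
  case "$jdbc_url" in
    jdbc:derby:*) echo "embedded" ;;  # in-process metastore backed by Derby
    *)            echo "local" ;;     # in-process metastore backed by an external RDBMS
  esac
}

metastore_mode "" "jdbc:mysql://localhost:3306/hive"   # prints: local
metastore_mode "thrift://cluster-host2:9083" ""        # prints: remote (assumed address)
```

The configuration used later in this article sets only the JDBC properties and no hive.metastore.uris, which is why it counts as local mode.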

Installation

This installation uses local mode.
Install host: originally hadoop2, since replaced by a single machine, cluster-host2.

1. MySQL

Create a hive user in MySQL (the root user also works):

GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%' IDENTIFIED BY 'Hobe199**' WITH GRANT OPTION;

flush privileges;

Verify that the hive user can connect and create a database:

mysql -uhive -p'Hobe199**'
# Create the hive database
create database hive;
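
The GRANT above gives the hive user every privilege on every database. A narrower variant, sketched here as an assumption rather than what was actually run for this article, creates the user first and limits its privileges to the hive database. The function name hive_grant_sql and the change-me password are hypothetical; the function only prints SQL.

```shell
# Hypothetical helper: print SQL that creates a metastore user limited to one database.
#   $1 = user name, $2 = database name, $3 = password
hive_grant_sql() {
  user="$1"
  db="$2"
  pass="$3"
  # Backticks are escaped so the shell does not treat them as command substitution
  cat <<EOF
CREATE DATABASE IF NOT EXISTS \`$db\`;
CREATE USER IF NOT EXISTS '$user'@'%' IDENTIFIED BY '$pass';
GRANT ALL PRIVILEGES ON \`$db\`.* TO '$user'@'%';
FLUSH PRIVILEGES;
EOF
}

# Print the statements only; nothing is sent to MySQL here
hive_grant_sql hive hive 'change-me'
```

To apply the statements, the output could be piped to a privileged client, for example hive_grant_sql hive hive 'change-me' | mysql -uroot -p.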

2. hive-metastore

Download Hive

Get Hive from the official archive: https://archive.apache.org/dist/hive/hive-1.2.1/

Reference: https://www.cnblogs.com/dxxblog/p/8193967.html

Unpack and set environment variables

Unpack:

[hadoop@hadoop2 bigdata]$ pwd
/home/hadoop/bigdata
[hadoop@hadoop2 bigdata]$ tar -zxvf apache-hive-1.2.1-bin.tar.gz

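Download the MySQL JDBC driver (Connector/J) and copy it into lib under the Hive home directory (the hive-current path used from here on is presumably a symlink to the unpacked directory):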

cp mysql-connector-java-5.1.47-bin.jar /home/hadoop/hive-current/lib/

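Hive configuration files (the log4j2 template names below are the ones shipped with Hive 2.x; Hive 1.2.1 ships hive-log4j.properties.template and hive-exec-log4j.properties.template instead):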

[hadoop@hadoop2 ~]$ cd /home/hadoop/hive-current/conf/
cp hive-env.sh.template hive-env.sh 
cp hive-default.xml.template hive-site.xml 
cp hive-log4j2.properties.template hive-log4j2.properties 
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties

In practice only one file needs to be written:
hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- start -->

        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://localhost:3306/hive</value>
        </property>

        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.jdbc.Driver</value>
        </property>

        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>hive</value>
        </property>

        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>Hobe199**</value>
        </property>
</configuration>

These are mainly the MySQL-related settings. In addition, the locations that reference ${system:java.io.tmpdir} need to be changed.
hive-site.xml

    <property>
      <name>hive.exec.local.scratchdir</name>
      <value>/home/hadoop/cluster-data/hive/${user.name}/scratchdir</value>
      <description>Local scratch space for Hive jobs</description>
    </property>
    <property>
      <name>hive.downloaded.resources.dir</name>
      <value>/home/hadoop/cluster-data/hive/${user.name}/resources/${hive.session.id}_resources</value>
      <description>Temporary local directory for added resources in the remote file system.</description>
    </property>
    <property>
      <name>hive.querylog.location</name>
      <value>/home/hadoop/cluster-data/querylog</value>
      <description>Location of Hive run time structured log file</description>
    </property>

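Start the metastore

First initialize the metastore schema: schematool -dbType mysql -initSchema
Then start the metastore service (in local mode, as noted earlier, the Hive CLI does not depend on this step):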

[hadoop@10 ~]$ nohup ./hive-current/bin/hive --service metastore -p 9083 &

[hadoop@hadoop2 bin]$ hive

Logging initialized using configuration in file:/home/hadoop/bigdata/apache-hive-1.2.1-bin/conf/hive-log4j.properties
Mon Apr 01 13:51:48 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

If the SSL warning appears, see https://blog.csdn.net/u012922838/article/details/73291524 for a fix.
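
The warning text itself names one fix: set useSSL=false explicitly on the JDBC URL. Inside hive-site.xml any & between URL parameters has to be written as &amp;. The sketch below prints such a property; the helper name connection_url_property and the extra characterEncoding parameter are illustrative assumptions, not part of the original configuration.

```shell
# Hypothetical helper: print a ConnectionURL property with SSL explicitly disabled.
#   $1 = MySQL host:port, $2 = database name
connection_url_property() {
  hostport="$1"
  db="$2"
  # "&" between URL parameters must be escaped as "&amp;" inside XML
  url="jdbc:mysql://${hostport}/${db}?useSSL=false&amp;characterEncoding=UTF-8"
  printf '<property>\n'
  printf '  <name>javax.jdo.option.ConnectionURL</name>\n'
  printf '  <value>%s</value>\n' "$url"
  printf '</property>\n'
}

# Print the fragment for the database used in this article
connection_url_property localhost:3306 hive
```

The printed fragment would replace the existing javax.jdo.option.ConnectionURL property in hive-site.xml.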

[hadoop@hadoop2 conf]$ hive

Logging initialized using configuration in file:/home/hadoop/bigdata/apache-hive-1.2.1-bin/conf/hive-log4j.properties
hive>show databases;
OK
default
Time taken: 0.269 seconds, Fetched: 1 row(s)
hive>

You will find that the hive database in MySQL (owned by the hive user) now contains many initialized tables.

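1. Create a partitioned table and load data

Create a database named hive_test1 and switch to it with use.
Create the partitioned table:

-- In strict mode (hive.mapred.mode=strict), a SELECT on a partitioned table must filter on the partition column in WHERE
-- Create the table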
create table IF NOT EXISTS employee_partition(
    id string,
    name string,
    age int,
    tel string
) 
PARTITIONED BY (
    city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Clear existing data
TRUNCATE TABLE employee_partition;
INSERT OVERWRITE TABLE employee_partition PARTITION (city_id="beijing") 
VALUES 
("1","wanghongbing",18,"130"),
("2","wangxiaojing",17,"150"),
("3","songweiguang",16,"135");

SELECT * from employee_partition where city_id="beijing";
-- Aggregation
SELECT count(*) from employee_partition where city_id="beijing" AND age>=17;

show partitions employee_partition;

Insert the data:

hive> INSERT OVERWRITE TABLE employee_partition PARTITION (city_id="beijing") 
    > VALUES 
    > ("1","wanghongbing",18,"130"),
    > ("2","wangxiaojing",17,"150"),
    > ("3","songweiguang",16,"135");
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20190821165406_ad3eb5da-7622-4974-ada0-405eef181159
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1565062438979_0005, Tracking URL = http://cluster-host1:8088/proxy/application_1565062438979_0005/
Kill Command = /home/hadoop/hadoop-current/bin/hadoop job  -kill job_1565062438979_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-08-21 16:54:17,409 Stage-1 map = 0%,  reduce = 0%
2019-08-21 16:54:22,678 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.5 sec
MapReduce Total cumulative CPU time: 3 seconds 500 msec
Ended Job = job_1565062438979_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/employee_partition/city_id=beijing/.hive-staging_hive_2019-08-21_16-54-06_555_1374265165073341927-1/-ext-10000
Loading data to table hive_test1.employee_partition partition (city_id=beijing)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.5 sec   HDFS Read: 4774 HDFS Write: 167 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 500 msec
OK
Time taken: 17.935 seconds

Query results:

# A plain SELECT * does not launch a MapReduce job
hive> SELECT * from employee_partition where city_id="beijing";
OK
1       wanghongbing    18      130     beijing
2       wangxx    17      150     beijing
3       songxx    16      135     beijing
Time taken: 0.511 seconds, Fetched: 3 row(s)

# The aggregation launches a MapReduce job
hive> SELECT count(*) from employee_partition where city_id="beijing" AND age>=17;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20190821165443_468f9aa2-5a02-467d-8492-85d0858a2073
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1565062438979_0006, Tracking URL = http://cluster-host1:8088/proxy/application_1565062438979_0006/
Kill Command = /home/hadoop/hadoop-current/bin/hadoop job  -kill job_1565062438979_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-08-21 16:54:51,033 Stage-1 map = 0%,  reduce = 0%
2019-08-21 16:55:04,578 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.34 sec
2019-08-21 16:55:18,000 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.96 sec
MapReduce Total cumulative CPU time: 6 seconds 960 msec
Ended Job = job_1565062438979_0006
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.96 sec   HDFS Read: 9317 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 960 msec
OK
2
Time taken: 35.767 seconds, Fetched: 1 row(s)

2. Load text data

Prepare a data file for the Hive table (fields separated by tabs):

# vim first_table.txt
1	李晨	50
2	成龙	60
3	王源	40
4	胡歌	50
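
Typing tabs in an editor is easy to get wrong, so as an alternative the same file can be generated with printf. This is only a sketch: the helper name write_first_table and the temporary output path are assumptions, and the file would still need to be placed at the path given to load data.

```shell
# Hypothetical helper: write the four sample rows as tab-separated text.
#   $1 = output path
write_first_table() {
  out="$1"
  # printf reuses the format string for each group of three arguments
  printf '%s\t%s\t%s\n' \
    1 李晨 50 \
    2 成龙 60 \
    3 王源 40 \
    4 胡歌 50 > "$out"
}

write_first_table "${TMPDIR:-/tmp}/first_table.txt"
# Sanity check: every line must have exactly 3 tab-separated fields
awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad + 0 }' "${TMPDIR:-/tmp}/first_table.txt" && echo "format ok"
```

Saved as UTF-8, the file is 48 bytes, which agrees with the totalSize=48 reported by the load step.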

Create a table:

hive> create table first_table(id int, name string, age int) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.4 seconds

Load the data into the table:

hive> load data local inpath '/home/hadoop/test/first_table.txt' into table first_table;
Loading data to table default.first_table
Table default.first_table stats: [numFiles=1, totalSize=48]
OK
Time taken: 0.914 seconds
hive> select id,name,age from first_table;
OK
1	李晨	50
2	成龙	60
3	王源	40
4	胡歌	50
Time taken: 0.332 seconds, Fetched: 4 row(s)

At this point the Hive environment is set up and working.

Hive external and internal (managed) tables: data import methods and differences

https://my.oschina.net/u/2500254/blog/1439297
https://mp.weixin.qq.com/s/qtpPM5v27WGBcYyYeBTczQ

Where Hive data lives in HDFS

Look at the metadata stored in MySQL:
use hive;

mysql> select * from DBS;
+-------+-----------------------+-------------------------------------------------------------+------------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                                             | NAME       | OWNER_NAME | OWNER_TYPE |
+-------+-----------------------+-------------------------------------------------------------+------------+------------+------------+
|     1 | Default Hive database | hdfs://cluster-host1:9000/user/hive/warehouse               | default    | public     | ROLE       |
|     2 | NULL                  | hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db | hive_test1 | hadoop     | USER       |
+-------+-----------------------+-------------------------------------------------------------+------------+------------+------------+
2 rows in set (0.00 sec)

There are two databases: the default one and the newly created hive_test1.

Now look at the table information:

mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+--------------------+---------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME           | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+--------+-----------+-------+--------------------+---------------+--------------------+--------------------+--------------------+
|      1 |  1566377621 |     2 |                0 | hadoop |         0 |     1 | employee_partition | MANAGED_TABLE | NULL               | NULL               |                    |
|      2 |  1566380354 |     2 |                0 | hadoop |         0 |     3 | first_table        | MANAGED_TABLE | NULL               | NULL               |                    |
+--------+-------------+-------+------------------+--------+-----------+-------+--------------------+---------------+--------------------+--------------------+--------------------+
2 rows in set (0.00 sec)

Find that path in HDFS:

[hadoop@cluster-host2 ~]$ hadoop fs -ls hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db  
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2019-08-21 16:54 hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/employee_partition
drwxr-xr-x   - hadoop supergroup          0 2019-08-21 17:42 hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/first_table

The table and partition data sits under this path:

[hadoop@cluster-host2 ~]$ hadoop fs -text hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/employee_partition/city_id=beijing/000000_0
1,wanghongbing,18,130
2,wangxx,17,150
3,songxx,16,135

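This shows directly that Hive's actual data is stored under a path in HDFS, while the metadata that maps tables to those paths is stored in MySQL. On the cluster side, the table data is safe as long as the data in HDFS is not lost; the table definitions depend on the MySQL metastore database.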
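
The directory layout seen above can be sketched as a small path-building function. It covers only the default layout for managed tables as observed in this article (tables with an explicit LOCATION, or external tables, can live elsewhere); the helper name partition_dir is hypothetical.

```shell
# Hypothetical helper: build the HDFS directory of one partition of a managed table.
#   $1 = warehouse root, $2 = database, $3 = table, $4 = partition spec (key=value)
partition_dir() {
  warehouse="$1"
  db="$2"
  table="$3"
  spec="$4"
  if [ "$db" = "default" ]; then
    # The default database maps to the warehouse root itself (see DB_LOCATION_URI in DBS)
    echo "${warehouse}/${table}/${spec}"
  else
    # Other databases get a <name>.db directory under the warehouse root
    echo "${warehouse}/${db}.db/${table}/${spec}"
  fi
}

partition_dir hdfs://cluster-host1:9000/user/hive/warehouse hive_test1 employee_partition city_id=beijing
# prints: hdfs://cluster-host1:9000/user/hive/warehouse/hive_test1.db/employee_partition/city_id=beijing
```

The printed directory is the one that holds the 000000_0 file read with hadoop fs -text above.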