
Hive Database: A Basic Introduction

Reading Time: 3 minutes

What is Hive?

Hive is a data warehouse infrastructure tool that processes structured data in Hadoop. It sits on top of Hadoop to summarize Big Data, and it makes querying and analysis easy.
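For example, counting the rows of a table uses exactly the syntax a SQL user already knows. A minimal sketch, assuming a hypothetical page_views table already exists:

    -- Plain SQL syntax, executed by Hive over data stored in Hadoop.
    SELECT COUNT(*) FROM page_views;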

Why use Hive?

1) Most data warehousing applications work with a SQL-based querying language, and Hive makes it easy to port SQL-based applications to Hadoop.

2) It delivers fast results even on very large datasets.

3) As data volume and variety grow, more machines can be added without a corresponding loss of performance.

Features of Hive

1) It accelerates queries by providing indexes, including bitmap indexes.

2) It stores metadata, which reduces the time needed to perform semantic checks during query execution.

3) It provides built-in functions to manipulate dates, strings, and other data types (a sketch follows this list).

4) It supports a variety of file formats, such as Avro, ORC, and Parquet.
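As a quick illustration of points 3 and 4, the sketch below creates an ORC-backed table and applies a few of Hive's built-in string and date functions. The table and column names are hypothetical:

    -- A table stored in the columnar ORC format (point 4).
    CREATE TABLE IF NOT EXISTS orders (
      order_id   BIGINT,
      customer   STRING,
      order_date STRING
    )
    STORED AS ORC;

    -- Built-in string and date functions (point 3).
    SELECT upper(customer)          AS customer,
           to_date(order_date)      AS order_day,
           date_add(order_date, 30) AS due_date
    FROM orders;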

Architecture of Hive

The major components of Hive are as follows:

1) Metastore : This component stores all the structural information for the various tables and partitions in the warehouse, including column and column-type information, the serializers and deserializers needed to read and write the data, and the corresponding HDFS locations where the data is stored (the DESCRIBE FORMATTED sketch after this list shows this information for a single table).

2) Driver : It acts as a controller that receives HiveQL statements, starts their execution by creating sessions, and monitors the life cycle and progress of the execution. It stores the metadata generated during the execution of a HiveQL statement, and it also serves as the collection point for the data or query results obtained after the Reduce operation.

3) Compiler : The component that parses the query, performs semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of table and partition metadata looked up from the metastore (the EXPLAIN sketch after this list prints such a plan).

In other words, the process can be described by the following flow:

Parser -> Semantic Analyser -> Logical Plan Generator -> Query Plan Generator.

4) Optimizer : Performs various transformations on the execution plan to produce an optimized DAG (directed acyclic graph).

5) Executor : After compilation and optimization, the executor runs the tasks according to the DAG. It interacts with Hadoop's job scheduler (the JobTracker in classic MapReduce) to schedule the tasks to be run, and it takes care of pipelining them by making sure that a task with dependencies is executed only after all of its prerequisites have run (the SET sketch after this list shows how to choose the engine it submits to).

6) CLI, UI and Thrift Server : Interfaces that users can employ to submit queries and retrieve results.
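To see the kind of information the metastore holds, you can ask Hive to print a table's full metadata. A minimal sketch; the table name orders is hypothetical:

    -- Prints the columns and types, the HDFS location, the SerDe, and
    -- other table properties that Hive looks up in the metastore.
    DESCRIBE FORMATTED orders;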
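Likewise, the plan produced by the compiler and optimizer can be inspected without running anything. EXPLAIN is standard HiveQL; the query itself is made up:

    -- Prints the stages of the optimized execution plan (the DAG)
    -- instead of executing the query.
    EXPLAIN
    SELECT customer, COUNT(*) AS cnt
    FROM orders
    GROUP BY customer;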
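Finally, both the optimizer and the executor can be steered through session settings. The two properties below are standard Hive configuration knobs, shown here with illustrative values:

    -- Enable the cost-based optimizer (Calcite).
    SET hive.cbo.enable=true;
    -- Choose the engine the executor submits tasks to
    -- (mr for classic MapReduce, tez for Apache Tez).
    SET hive.execution.engine=tez;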

How does Hive interact with Hadoop?

1) Execute Query : From the Hive interface (UI or command line), the query is sent to the driver for execution.

2) Check Syntax and Get Plan : The driver enlists the query compiler, which parses the query to check its syntax and to work out the query plan and the requirements of the query.

3) Get Metadata : The compiler sends a metadata request to the metastore, and the metastore sends the metadata back to the compiler in response.

4) Execute Plan : The compiler checks the requirements and sends the plan back to the driver. The driver then sends the execution plan to the execution engine.

5) Execute Job : An execution engine, such as MapReduce or Tez, runs the compiled query, while the resource manager (YARN) allocates resources for applications across the cluster. In classic MapReduce, the execution engine submits the job to the JobTracker, which assigns it to TaskTrackers running on the data nodes; under YARN, the ResourceManager and the NodeManagers play these roles. At this stage, the query executes as a MapReduce (or Tez) job.

6) Fetch Query Result : The execution engine receives the results from the data nodes and sends them to the driver, which returns them to the Hive interface, for example over a JDBC/ODBC connection.
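To tie the six steps together, here is a minimal end-to-end sketch. The table and data are hypothetical, but each statement travels through the driver, compiler, optimizer, and execution engine exactly as described above:

    -- A small table for the walkthrough (steps 1-4 apply to the CREATE).
    CREATE TABLE IF NOT EXISTS page_views (
      user_id BIGINT,
      url     STRING
    );

    -- The aggregation is compiled into a MapReduce/Tez job (step 5);
    -- the result travels back through the driver to the client (step 6).
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;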

Now that you know how Hive executes queries under the hood, I hope you will find working with it even more interesting.

Stay tuned for further blogs!!

References:

1. https://www.tutorialspoint.com/hive/hive_introduction.htm
2. https://en.wikipedia.org/wiki/Apache_Hive


