hiveql - Can you explain when and why mapreduce is invoked in hive

深蓝 · Answer 1 · 2021-10-23T18:43:57+0000

Take the simple hive query below:

Describe table;

This reads only metadata from the Hive metastore and is the simplest and fastest query in Hive.

select * from table;

This query only needs to read data from HDFS. So far, neither query requires any map or reduce phase.

select * from table where color in ('RED','WHITE','BLUE')

This query requires a map phase only; there is no reduce phase, because there is no aggregation function of any kind. Here we are filtering to collect the records that are RED, WHITE, or BLUE.
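To make the map-only idea concrete, here is a minimal Python sketch of a simplified model, not Hive's actual implementation. The sample rows, the `WANTED` set, and the function names are hypothetical. Each mapper filters its own input split independently, and since no rows need to be combined, nothing is handed to a reducer.

```python
# Hypothetical sample rows standing in for the contents of `table`.
ROWS = [
    {"id": 1, "color": "RED"},
    {"id": 2, "color": "GREEN"},
    {"id": 3, "color": "BLUE"},
    {"id": 4, "color": "WHITE"},
    {"id": 5, "color": "BLACK"},
]

WANTED = {"RED", "WHITE", "BLUE"}


def map_filter(split):
    """Map task: emit each row of one input split that passes the WHERE test."""
    for row in split:
        if row["color"] in WANTED:
            yield row


def run_map_only(rows, num_splits=2):
    """Run one independent mapper per split and concatenate their outputs."""
    splits = [rows[i::num_splits] for i in range(num_splits)]
    output = []
    for split in splits:
        output.extend(map_filter(split))
    return output


filtered = run_map_only(ROWS)
print(sorted(row["id"] for row in filtered))
```

Because no mapper needs to see another mapper's rows, the work can be spread over any number of splits without a shuffle or reduce step.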

select count(1) from table;

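This query needs a reduce phase to produce a single total. The map side has no filtering or keying to do, because we are counting all the records in the table; the mappers simply read the rows and pass their (partial) counts to one reducer. If we want to count per value of a column instead, the map phase has to assign a key to each record before the reduce phase. See below: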

Select color
, count(1) as color_count 
  from table  
  group by color;

This query has an aggregation function and a group by clause. We are counting how many records in the table have each color (RED, WHITE, or BLUE in this example). This counting requires both a map phase and a reduce phase.

Essentially we create a key-value pair for each record in the above job. We map each record to a key, in this case RED, WHITE, or BLUE, and give it a value of one, so the key:value pair is color:1. Then we sum the values for each color key. This is a map and reduce job.
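The color:1 idea can be sketched in a few lines of Python. This is a simplified model of the map, shuffle, and reduce steps, not Hive's real execution engine; the sample `COLORS` list and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical values of the color column, one per record.
COLORS = ["RED", "BLUE", "RED", "WHITE", "BLUE", "RED"]


def map_phase(records):
    """Emit a (color, 1) pair for every record."""
    for color in records:
        yield color, 1


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the ones for each key to get color_count."""
    return {key: sum(values) for key, values in grouped.items()}


color_counts = reduce_phase(shuffle(map_phase(COLORS)))
print(color_counts)
```

The reduce step only ever sees the values for one key at a time, which is why the group by work can be split across several reducers.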

Now take the same query and add an order by clause.

Select color
, count(1) as color_count 
  from table  
  group by color
  order by color_count desc;

This adds another reduce phase and forces the whole data set to be passed through a single reducer. This is necessary because we want to ensure that global ordering is maintained. Count(distinct color) also forces a single reducer and requires a map and a reduce phase.
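A rough Python sketch of why the ordering step needs one reducer, again as a simplified model rather than Hive's actual plan. The `partition` helper (using `zlib.crc32` as a stand-in for a hash partitioner), the two stage functions, and the sample data are all hypothetical.

```python
import zlib
from collections import Counter

# Hypothetical values of the color column, one per record.
COLORS = ["RED", "BLUE", "RED", "WHITE", "BLUE", "RED"]


def partition(key, num_reducers):
    """Deterministic stand-in for the partitioner that routes keys to reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def stage_one(records, num_reducers=2):
    """Group by stage: each reducer counts only the keys routed to it."""
    reducers = [Counter() for _ in range(num_reducers)]
    for color in records:
        reducers[partition(color, num_reducers)][color] += 1
    return reducers


def stage_two(reducer_outputs):
    """Order by stage: one reducer receives every (color, count) row and sorts them."""
    rows = [item for counts in reducer_outputs for item in counts.items()]
    return sorted(rows, key=lambda kv: kv[1], reverse=True)


ordered = stage_two(stage_one(COLORS))
print(ordered)
```

Each stage-one reducer holds only part of the result, so none of them can know the global order alone; collecting every row in one place in stage two is what guarantees it.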

As you add complexity to your Hive query, you similarly add the map and reduce jobs required to obtain the requested results.

If you want to find out how Hive will execute a query, you can put the explain keyword in front of your query.

 Explain select * from table;

This gives you an idea of how the query is executed under the hood. It shows the dependencies between stages, which aggregations (if any) result in reduce work, and which operators result in map work.
