The Optimal Hive MAP Creation through Hive ORC format

Read Time:2 Minute, 27 Second

Many times we are facing a state of affairs that we’ve got very small tables in hive however when we question these tables then it takes long time.

Right here I’m going to explain Map side join and its benefits over the normal join operation in Hive Map. However before understanding about this, we should first understand the idea of ‘Join’ and what takes place internally while we carry out the be part of in Hive.

`Join’ is a clause that combines the information of tables (or Data-Sets).

Count on that we’ve two tables A and B. whilst we perform Join operation on them, it’ll go back the facts which are the aggregate of all columns o f A and B.

Mapjoin is a little-regarded feature of Hive. It lets in a desk to be loaded into memory in order that a (very speedy) be part of will be finished absolutely within a mapper while not having to use a Map/lessen step. In case your queries regularly rely upon small table joins (e.g. city or country, etc.) you might see a completely widespread velocity-up from the use of mapjoins.

There are approaches to permit it. First is through using a hint, which seems like /*+ MAPJOIN(aliasname), MAPJOIN(anothertable) */. This C-style comment has to be positioned right now following the SELECT. It directs Hive to load aliasname (that is a desk or alias of the query) into memory.

SELECT /*+ MAPJOIN(c) */ * FROM orders o JOIN cities c ON (o.city_id = c.id);

Some other (higher, for my part) manner to show on mapjoins is to let Hive do it mechanically. Simply set hive.auto.convert.jointo authentic for your config, and Hive will automatically use mapjoins for any tables smaller than hive.mapjoin.smalltable.filesize.

Assume that we’ve two tables of which one of them is a small table. When we submit a map reduce task, a Map reduce local task undertaking may be created earlier than the authentic join Map Reduce task if you want to study information of the small table from HDFS and shop it into an in-memory hash table..

Growing Hive tables is a common experience to everybody that usesHadoop. It allows us to combine and merge datasets into precise, customized tables. And, there are many ways to do it.

We have a few recommended suggestions for Hive table creation that can boom your query speeds and optimize and decrease the garage space of your tables. And it’s less difficult than you would possibly assume.

For engineers and builders, this indicates you can decrease your block count and file sizes, and make your analysts and statistics scientists glad on the identical time.

In a nutshell, the stairs are: