Generating sequence number in Apache Hive
Sometimes when processing data with hive or impala, we need to generete a sequence number in lines. It can help us creating surrogate keys, create a sequence to order by instead ordering by string and gain speed and some other strategy. On this post, I want to share with you some tips on creating sequence number column on Hive or Impala table.
Starting the lab
For example, imagine we have a dataset containing the stackoverflow survey questions. Like the following text file.
And now, imagine you receive a task to put these questions in order and generate a sequence number to each line with Apache Hive.
Creating a column with row_number()
Using hive, you can do this job with the function row_number(). So, after creating the table and inserting data, when executing select we have a result like this:
To use the row_number() function and add a sequence number column you must inform a column to partition by over or/and order by over. In our example we can use the question column and create the sequence number based on this column.
select
row_number() over(order by question) id
,question
from stackoverflow_survey_raw;
And then, executing the query, we have as a result the questions ordered and a sequence column id created.
Putting data in order as in the file
But, if we want to create the sequence column based on the order the questions are in the file, we have a trick at our disposal. When using the row_number function, we can order by NULL.
select
row_number() over(order by null) id
,question
from stackoverflow_survey_raw;
Running this query on Apache Impala we receive the data in the sema sequence it is in the file. But executing with Apache Hive, we have as result the data from the file ordered in descending order.
Then, to put data in order like it is in the file using Apache Hive, we can use the previus query as subquery and order the id column descending.
select
row_number() over(order by s.id desc) id
,s.question
from (select
row_number() over(order by null) id
, * from stackoverflow_survey_raw) s;
Now, the result is perfect. The questions are ordered based in the sequence it is in the file. 🎉