Using Sqoop and Hive for data analysis lets you reuse your SQL skills to take advantage of the Hadoop ecosystem without having to learn a new language and a lot of new processes. This article assumes you already have the Hortonworks Data Platform (HDP) installed on your machine or virtual host. It will show you how to take data from a SQL store, import it into Hive, and then run a basic Hive query.
Step 1
First, we’ll import a test data set with plenty of records in it. Rather than a small 100-row file, it includes 100K records of generated user info and 50K records of generated order data. The data set was assembled using https://www.generatedata.com/ and made available as a public gist. It can be retrieved as follows:
git clone https://gist.github.com/4c5599b08483264e1e22.git ./data_sql
This makes a local copy of the SQL file and saves it in a directory called data_sql.
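If you want to confirm the clone worked, listing the directory should show the SQL file (the file name data.sql is the one used in the import step below):
ls ./data_sql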
Step 2
Next, we will import the SQL file we just copied into the local MySQL instance on the HDP host.
mysql < ./data_sql/data.sql
The data.sql file creates a new database schema called company_info. The company_info database contains two tables, user_info and orders. You can explore the data set by accessing the MySQL CLI.
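For example, a quick sanity check of the schema and row counts can be run from the shell (this sketch assumes the root user with no password, matching the Sqoop commands used later):
mysql -u root -e "SHOW TABLES; SELECT COUNT(*) FROM user_info; SELECT COUNT(*) FROM orders;" company_info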
Step 3
With the MySQL database seeded, we can begin the process of importing the data into Hive so we can actually do something useful with it. To import the data into Hive we will use Sqoop.
sqoop import --connect jdbc:mysql://localhost:3306/company_info --table user_info --username root --hive-import
sqoop import --connect jdbc:mysql://localhost:3306/company_info --table orders --username root --hive-import
These two commands read the data we just stored in MySQL and import it into Hive as the tables user_info and orders.
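Once the Sqoop jobs complete, a quick way to confirm the tables landed in Hive is to run a couple of non-interactive queries (this sketch assumes the tables were imported into Hive's default database, which is Sqoop's default behavior):
hive -e "SHOW TABLES;"
hive -e "SELECT COUNT(*) FROM user_info; SELECT COUNT(*) FROM orders;"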
Step 4
After the imports finish, we can boot up the Hive CLI by typing hive into the command line. From there we can execute a query. For something interesting involving a time series, we could run the following:
select user_id, avg(unix_timestamp(created_on) - unix_timestamp(lag_created_on)) / 86400
from (
  select user_id, created_on,
         lag(created_on) over (partition by user_id order by created_on) as lag_created_on
  from orders
) q
group by user_id;
This should return each user_id along with that user's average time, in days, between order purchases: the inner query uses lag to pair each order with the previous order by the same user, and the outer query converts the gap to seconds with unix_timestamp, averages it, and divides by 86400 seconds per day.
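As a small variation (not part of the original walkthrough), the same query can be sorted to surface the users who reorder most frequently; the alias avg_days_between_orders is just an illustrative name, and users with only a single order are filtered out since they have no gap to average:
-- rank users by how quickly they reorder, most frequent purchasers first
select user_id,
       avg(unix_timestamp(created_on) - unix_timestamp(lag_created_on)) / 86400 as avg_days_between_orders
from (
  select user_id, created_on,
         lag(created_on) over (partition by user_id order by created_on) as lag_created_on
  from orders
) q
group by user_id
having count(*) > 1
order by avg_days_between_orders asc
limit 10;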
Conclusion
This short article described a few simple steps to get started with Sqoop and Hive. More advanced functionality, such as append-mode imports and other query types, is covered in other articles.