Lab: hive
Goals
- Create an external table on top of data stored as CSV
- Create a managed ORC table
- Load data from the CSV table to the ORC table (with some transformations)
Create an external table
For this lab we will be using a very small dataset of NYC taxi drivers.
Using the official Hive Data Definition Langage:
- Using the HDFS CLI, take a look at the data used for this lab at
/education/ece/big-data/2020/fall/bda/resources/lab4/nyc_drivers/drivers.csv - Copy the
nyc_driversfolder to your user directory in HDFS:#export ece_group=gr1 hdfs dfs -mkdir -p "/user/$USER/labs" hdfs dfs -cp /education/ece/big-data/2020/fall/bda/resources/lab4/nyc_drivers "/user/$USER/labs/" - Open a Beeline session by typing
beeline - Create an external table targeting our data with this statement (to be completed, replace
YOUR_USERNAME):SET hivevar:username=YOUR_USERNAME; CREATE EXTERNAL TABLE hive_labs.${username}_nyc_drivers_ext ( driver_id INT, -- COMPLETE HERE ) ROW FORMAT SERDE -- COMPLETE HERE STORED AS TEXTFILE LOCATION -- COMPLETE HERE TBLPROPERTIES ('skip.header.line.count'='1'); - Check that the table is correctly created by selecting all the data in it. If you see only
NULLvalues, your schema is not correct.
Create a managed ORC table
Tip: to create a managed ORC table, you don’t have to specify a LOCATION nor a SERDE:
CREATE ...
STORED AS ORC;- Create a managed ORC table (not external) that must have the same schema as the external table created above (
${username}_nyc_drivers_ext) but with:- The
_extprefix removed from the name:${username}_nyc_drivers - The column
namedevided intofirst_nameandlast_name - The columne
locationrenamed asaddress(becauseLOCATIONis a Hive keyword) - The column
certifiedas aBOOLEAN
- The
- Check that your table was created using the HDFS CLI at
/warehouse/tablespace/managed/hive/hive_labs.db/$USER_nyc_drivers(should be empty)
Load data from the CSV table to the ORC table
Now we want to populate our ORC table from our CSV table. Using the Hive Data Manipulation Language:
- Write a statement to insert data to the ORC table by applying 2 transformations (check the available HiveQL string functions):
- Split
nameintofirst_nameandlast_name - Transform
certifiedfromSTRINGtoBOOLEAN - Rename
locationtoaddress
- Split
- Execute your query
- Check what the data looks like in the managed table using the HDFS CLI at
/warehouse/tablespace/managed/hive/hive_labs.db/$USER_nyc_drivers