Prerequisites
=============

* [Start](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive.html) an EMR cluster with Apache Hive installed.
* [Set up](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-dev-create-metastore-outside.html) a Hive Metastore on AWS RDS.

Load data in Hive
=================

Execute the SQL in the file [create-hive-tables.sql](https://github.com/qubole/quark/blob/master/docs/examples/create-hive-tables.sql) on your Apache Hive cluster.

Load data into Redshift
=======================

Run the following commands in psql to create the tables:

```sql
create table store_sales_cube(
    d_year INT,
    d_moy INT,
    d_qoy INT,
    d_date timestamp,
    cd_gender varchar(10),
    cd_marital_status varchar(10),
    cd_education_status varchar(50),
    s_store_id varchar(100),
    s_store_name varchar(100),
    grouping_id varchar(100),
    sum_wholesale_cost double precision,
    sum_list_price double precision,
    sum_sales_price double precision,
    sum_extended_price double precision,
    sum_coupon_amt double precision,
    sum_net_profit double precision);

create table store_sales_partition(
    ss_sold_date_sk int,
    ss_sold_time_sk int,
    ss_item_sk int,
    ss_customer_sk int,
    ss_cdemo_sk int,
    ss_hdemo_sk int,
    ss_addr_sk int,
    ss_store_sk int,
    ss_promo_sk int,
    ss_ticket_number int,
    ss_quantity int,
    ss_wholesale_cost float,
    ss_list_price float,
    ss_sales_price float,
    ss_ext_discount_amt float,
    ss_ext_sales_price float,
    ss_ext_wholesale_cost float,
    ss_ext_list_price float,
    ss_ext_tax float,
    ss_coupon_amt float,
    ss_net_paid float,
    ss_net_paid_inc_tax float,
    ss_net_profit float);
```

Load data into these tables using the following commands (replace the credentials with your own AWS access key and secret key):

```sql
copy store_sales_cube
from 's3://public-qubole/datasets/quark_example/store_sales_cube.csv'
credentials 'aws_access_key_id=****;aws_secret_access_key=****';

copy store_sales_partition
from 's3://public-qubole/datasets/quark-example/store_sales_partition/'
credentials 'aws_access_key_id=***;aws_secret_access_key=***'
csv;
```

Set up SQLLine
==============

Copy jline-xxx.jar, sqlline-xxx.jar, and the Quark JDBC driver jar into a single directory. For the purposes of this example, the directory is named `sqlline`:

```sh
mkdir sqlline
cd sqlline
wget http://central.maven.org/maven2/sqlline/sqlline/1.1.9/sqlline-1.1.9.jar
wget http://central.maven.org/maven2/jline/jline/2.13/jline-2.13.jar
```

Install the Quark JDBC driver jar
---------------------------------

```sh
mkdir quark
cd quark
git clone git@bitbucket.org:qubole/quark.git quark-src
cd quark-src
mvn install -DskipTests
```

Copy the jar `quark-jdbc/target/quark-jdbc-*.jar` to the `sqlline` directory.

Install Hive JDBC jars
----------------------

```sh
cd sqlline
wget https://amazon-odbc-jdbc-drivers.s3.amazonaws.com/public/AmazonHiveJDBC_1.0.4.1004.zip
unzip AmazonHiveJDBC_1.0.4.1004.zip
cd AmazonHiveJDBC_1.0.4.1004
```

For more information, refer to the [EMR documentation](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html).

Install Postgres JDBC jar
-------------------------

```sh
cd sqlline
wget https://jdbc.postgresql.org/download/postgresql-9.4-1206-jdbc42.jar
```

Install MySQL Jar
-----------------

Visit the MySQL JDBC [download page](http://dev.mysql.com/downloads/connector/j/). Download the jar and copy it to the `sqlline` directory.

Quark JDBC Jar Configuration
============================

Edit [quark-config.json](https://github.com/qubole/quark/blob/master/docs/examples/quark-config.json) and fill in credentials for AWS Redshift and Apache Hive in the `dataSources` array. Store the file in a location accessible by the Quark JDBC driver.
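The authoritative structure is the linked quark-config.json itself; the snippet below is only a rough sketch of what the `dataSources` entries tend to look like. The field names, URLs, and ports are assumptions and placeholders, and the `//` comments would have to be removed from real JSON. The linked example already describes the view and cube used in the sections below, so normally only the connection details need editing.

```
// Illustrative sketch only: field names and values here are assumptions.
// Copy the actual structure from docs/examples/quark-config.json in the Quark repo.
{
  "dataSources": [
    {
      "name": "HIVE",
      "type": "HIVE",
      "url": "jdbc:hive2://<emr-master-host>:10000/default",
      "username": "<hive-user>",
      "password": "<hive-password>"
    },
    {
      "name": "REDSHIFT",
      "type": "REDSHIFT",
      "url": "jdbc:postgresql://<redshift-endpoint>:5439/<database>",
      "username": "<redshift-user>",
      "password": "<redshift-password>"
    }
  ]
}
```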
Start up SQLLine
================

Start SQLLine with the jar directory on the extension class path, then connect through the Quark JDBC driver at the SQLLine prompt:

```
java -Djava.ext.dirs=/home/user/sqlline/ sqlline.SqlLine
!connect jdbc:quark:quark-config.json com.qubole.quark.jdbc.QuarkDriver
```

Views
=====

In the above example, AWS Redshift contains the view `store_sales_partition` for the table `tpcds_orc_500.store_sales`. It is defined by the following WHERE clause:

```sql
ss_sold_date_sk >= 2452640 and ss_customer_sk > 3 and ss_customer_sk < 20
```

Quark redirects the following query to the view in AWS Redshift, even though the table referenced in the SQL query, `hive.tpcds_orc_500.store_sales`, is an Apache Hive table:

```sql
select * from hive.tpcds_orc_500.store_sales
where ss_sold_date_sk >= 2452640 and ss_customer_sk > 3 and ss_customer_sk < 20;
```

Cubes
=====

The cube setup in this example is described in more detail in the [CUBE blog post](https://www.qubole.com/blog/product/cube-keyword-in-apache-hive/). Redshift contains the cube table for a star-schema join on the fact table `store_sales`.

Queries:

```sql
select d_year, d_qoy, sum(ss_sales_price) as sales
from hive.tpcds_orc_500.store_sales_2002_plus
join hive.tpcds_orc_500.date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_qoy;
```

The above query runs on AWS Redshift and returns the following rows:

```
+-------------+-------------+---------------------------+
|   d_year    |    d_qoy    |           sales           |
+-------------+-------------+---------------------------+
| 2002        | 1           | 1.43591514E9              |
| 2003        | 1           | 1.02997072E8              |
| 2002        | 2           | 1.36243994E9              |
| 2002        | 3           | 2.52206029E9              |
| 2002        | 4           | 4.13493248E9              |
+-------------+-------------+---------------------------+
```

The following query, however, cannot be answered by the cube in Redshift because `d_dom` is not one of the cube's dimensions, so it is executed on Hive:

```sql
select d_year, d_dom, sum(ss_sales_price)
from hive.tpcds_orc_500.store_sales_2002_plus
join hive.tpcds_orc_500.date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_dom;
```
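As a sanity check on when the cube can be used: the hypothetical variation below is not part of the original example, but it groups only by columns that are dimensions of `store_sales_cube` (`d_year`, `d_moy`) and aggregates a measure the cube stores (`sum_sales_price`), so Quark should be able to answer it from Redshift as well:

```sql
-- Hypothetical variation (not from the original example): d_year and d_moy are
-- both dimensions of store_sales_cube, and sum(ss_sales_price) maps to the
-- sum_sales_price measure, so this query should also be routed to the cube in
-- Redshift instead of scanning the Hive fact table.
select d_year, d_moy, sum(ss_sales_price) as sales
from hive.tpcds_orc_500.store_sales_2002_plus
join hive.tpcds_orc_500.date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_moy;
```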