Due to Athena's low concurrent-query execution limit, the usability of the data stored in S3 is low compared to the huge effort required to maintain it. Replacing the query engine with another, such as Snowflake or Google BigQuery, is an option, but migration takes significant time and effort. While the migration is in progress, it would be great to provide a more flexible alternative to Athena as the query engine. The idea comes from the product Athena is based on: Apache Presto.

 

Goal

This document demonstrates the following features:

  • Running SQL queries against an AWS S3 bucket on Presto, using the AWS Glue catalog as a Hive metastore
  • Presto Client REST API
    • Query submission
    • Query status check
    • Query result retrieval
  • UI support to visualize the data using Apache Superset

Every component runs in a Docker container orchestrated by docker-compose.
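As a sketch of that setup (the service names, image tags, port mappings, and mount path below are assumptions for illustration, not the exact compose file used in this demo), a minimal docker-compose.yml might look like:

```yaml
# Minimal sketch -- image tags, ports, and the config mount path are assumptions.
version: "3"
services:
  presto:
    image: ahanaio/prestodb-sandbox:latest
    ports:
      - "8080:8080"
    volumes:
      # Mount the Glue-backed hive catalog configuration described below.
      - ./etc/catalog/hive.properties:/opt/presto-server/etc/catalog/hive.properties
  superset:
    image: apache/superset:latest
    ports:
      - "8088:8088"
```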

 

IAM Role and Policy Requirements

The same permissions required to run Athena queries are needed:

  • S3 bucket permissions
  • AWS Glue catalog permissions

If running on an EC2 instance, add an IAM role, attach the policy to the role, and attach the role to the instance.
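As an illustration of such a policy (the bucket name my-data-bucket is a placeholder, and the action list is a minimal sketch that may need adjusting for your environment), it might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DataAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::my-data-bucket",
        "arn:aws:s3:::my-data-bucket/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition"
      ],
      "Resource": "*"
    }
  ]
}
```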

 

Installing Apache Presto

Installing Apache Presto is pretty simple. This guide document from prestodb describes a Dockerfile to launch a Presto container. Presto requires a metastore, and it supports many types of datastores for the metastore. To utilize the AWS Glue catalog as a Hive metastore, simply add the following lines to the etc/catalog/hive.properties file:

connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=ap-northeast-1
hive.parquet.use-column-names=true

 

  • This configuration assumes that the AWS Glue catalog resides in ap-northeast-1.
  • hive.parquet.use-column-names=true is a remedy for errors caused by mismatched column types found during query execution.

In this demo, the ahanaio/prestodb-sandbox:latest Docker image is used to launch the container.

 

Executing a sample test query using presto-cli

Download the presto-cli executable and add it to the bin/ directory. Executing the binary allows you to run Presto queries. To test whether the AWS Glue catalog is properly connected as a Hive metastore, simply execute the following query:

 

presto> SHOW CATALOGS;
 Catalog
---------
 hive
 jmx
 memory
 system
 tpcds
 tpch
(6 rows)

Query 20210501_100326_00063_6ksa5, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto>

 

 

The hive catalog holds the databases from the AWS Glue catalog. To list the databases:

presto> SHOW SCHEMAS FROM hive;
 Schema
--------------------
 my_table
 default
 information_schema
 sampledb
(4 rows)

Query 20210501_100434_00066_6ksa5, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:00 [4 rows, 55B] [12 rows/s, 168B/s]

presto>

 

 

To list the tables in a database:

presto> SHOW TABLES FROM hive.aa;
 Table
-----------------
 staging_channel
 staging_message
(2 rows)

Query 20210501_100505_00067_6ksa5, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [2 rows, 54B] [3 rows/s, 89B/s]

presto>

 

 

The USE statement works on catalogs and schemas as well.

presto> use hive.aa;
USE
presto:aa> use hive;
USE
presto:hive>

 

Submitting a Presto query using the Presto API

Presto provides a client REST API. Note that Presto requires the X-Presto-User header to identify the submitting user; the value used here is an arbitrary placeholder. To submit a query string:

curl -X POST -H "X-Presto-User: demo" http://presto:8080/v1/statement -d "SELECT * FROM my_table.user LIMIT 10"
{
  "id": "20210501_094004_00061_6ksa5",
  "infoUri": "http://ec2-15-165-33-20.ap-northeast-2.compute.amazonaws.com:8080/ui/query.html?20210501_094004_00061_6ksa5",
  "nextUri": "http://ec2-15-165-33-20.ap-northeast-2.compute.amazonaws.com:8080/v1/statement/queued/20210501_094004_00061_6ksa5/1?slug=x1d121af12afd45eca1687e968b671bd5",
  "stats": {
    "state": "QUEUED",
    "queued": true,
    "scheduled": false,
    "nodes": 0,
    "totalSplits": 0,
    "queuedSplits": 0,
    "runningSplits": 0,
    "completedSplits": 0,
    "cpuTimeMillis": 0,
    "wallTimeMillis": 0,
    "queuedTimeMillis": 0,
    "elapsedTimeMillis": 0,
    "processedRows": 0,
    "processedBytes": 0,
    "peakMemoryBytes": 0,
    "peakTotalMemoryBytes": 0,
    "peakTaskTotalMemoryBytes": 0,
    "spilledBytes": 0
  },
  "warnings": []
}

 

The value of "nextUri" holds the endpoint for checking the query execution status. While the query execution is not yet complete, the response from "nextUri" will keep the same shape, with the state value being either "QUEUED" or "RUNNING". When the query execution is complete, the response contains the result rows in a "data" field and no further "nextUri".

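The submit-then-poll flow above can be sketched as follows. This is a minimal illustration, not production code: the fetch_json parameter stands in for an HTTP GET that returns the parsed JSON body, so the nextUri-following logic can be shown without a live Presto server.

```python
import time

def follow_query(first_response, fetch_json, poll_interval=0.0):
    """Follow a Presto query to completion by walking 'nextUri' links.

    first_response: parsed JSON from POST /v1/statement.
    fetch_json: callable taking a URI and returning the parsed JSON body
                (a stand-in for an HTTP GET).
    Returns (final_state, rows), where rows accumulates every 'data' chunk.
    """
    rows = []
    response = first_response
    while True:
        # Result rows, when present, arrive in the 'data' field.
        rows.extend(response.get("data", []))
        next_uri = response.get("nextUri")
        if next_uri is None:
            # No nextUri means the query reached a terminal state.
            return response["stats"]["state"], rows
        time.sleep(poll_interval)
        response = fetch_json(next_uri)

# Usage with a stubbed fetcher simulating QUEUED -> RUNNING -> FINISHED:
pages = {
    "uri-1": {"stats": {"state": "RUNNING"}, "nextUri": "uri-2"},
    "uri-2": {"stats": {"state": "FINISHED"}, "data": [[1, "alice"], [2, "bob"]]},
}
first = {"stats": {"state": "QUEUED"}, "nextUri": "uri-1"}
state, rows = follow_query(first, pages.__getitem__)
```

In a real client, fetch_json would issue an HTTP GET against the returned URI (carrying the same X-Presto-User header), and a FAILED terminal state should additionally be checked for an "error" field in the response.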