How to create an Apache Iceberg table on Amazon S3 using CDK Custom Resource

Xan Huang
4 min read · Jun 19, 2023

Background

Infrastructure as Code (IaC) is a software engineering practice that allows the management and provisioning of infrastructure resources using code. When deploying on AWS, you have a few options available, including CloudFormation, CDK, and Terraform. This blog discusses the use of CDK (Python) to deploy a resource (an Apache Iceberg table on S3) that is not yet supported natively by CDK but can be created via a CDK Custom Resource.

What is a CDK Custom Resource?

CDK Custom Resource is a feature provided by the AWS Cloud Development Kit (CDK) that allows developers to define and deploy AWS resources not yet supported by CDK constructs. It enables the creation of custom resources by executing a Lambda function in response to AWS CloudFormation events, giving developers the flexibility to extend CDK’s capabilities and integrate with other AWS services or external systems seamlessly.
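
To make this concrete, here is a minimal, hedged sketch of the pattern. It is not part of the Iceberg example that follows; it simply writes an SSM parameter on create (assuming self is your stack or another construct), so you can see the general shape of an AwsCustomResource with an AwsSdkCall:

from aws_cdk.custom_resources import (
    AwsCustomResource, AwsCustomResourcePolicy, AwsSdkCall, PhysicalResourceId
)

# Minimal sketch: a custom resource whose "create" simply writes an SSM parameter.
AwsCustomResource(
    self, "ExampleParam",
    policy=AwsCustomResourcePolicy.from_sdk_calls(
        resources=AwsCustomResourcePolicy.ANY_RESOURCE
    ),
    on_create=AwsSdkCall(
        service="SSM",
        action="putParameter",
        parameters={"Name": "/example/param", "Value": "hello", "Type": "String"},
        physical_resource_id=PhysicalResourceId.of("ExampleParam"),
    ),
)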

Diving into the components of the Custom Resource

Apache Iceberg tables are currently (as of June 2023) not yet supported in AWS CDK. You can use the aws_glue library to create an EXTERNAL table, but Apache Iceberg requires the creation of an additional metadata.json file, which it uses to keep track of the table content. You could generate the metadata.json file yourself and upload it as part of an aws_glue.CfnTable creation, but that would be fairly hackish and you run the risk of it breaking in the future.

The recommended way at this point in time is to create the tables via an Athena query, and you can integrate the query execution into CDK via the use of CDK Custom Resource and an AWS SDK function call. Specifically, we will be looking at using Athena.startQueryExecution.
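
Under the hood, the custom resource ends up making the same API call you could make yourself. As a rough boto3 sketch (the bucket, database, table, and workgroup names below are placeholders, not values from this stack):

import boto3

athena = boto3.client("athena")

# Placeholder names for illustration only.
athena.start_query_execution(
    QueryString=(
        "CREATE TABLE my_table (id int, name string) "
        "LOCATION 's3://my-bucket/tables/my_table' "
        "TBLPROPERTIES ('table_type'='iceberg');"
    ),
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena_temp/"},
    WorkGroup="primary",
)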

There are 3 main components of the custom resource to take note of:

Role (or Policy) — We will need to define the permissions that the Lambda will assume when creating/destroying the custom resource. You can define it either as an IAM role, or as an AwsCustomResourcePolicy, which generates IAM policy statements based on the resources specified (a sketch of that alternative follows the snippet below). In this example, we create an IAM role and grant it full access to S3 and Athena. For your own usage, you should scope it down to least privilege instead.

role=iam.Role(
    scope=self,
    id=f'Cdk-CustomResourcesIcebergTable-{table_name}-LambdaRole',
    assumed_by=iam.ServicePrincipal('lambda.amazonaws.com'),
    managed_policies=[
        iam.ManagedPolicy.from_aws_managed_policy_name("AmazonS3FullAccess"),
        iam.ManagedPolicy.from_aws_managed_policy_name("AmazonAthenaFullAccess")
    ],
)
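
As an alternative to managing the role yourself, the AwsCustomResourcePolicy mentioned above can generate the policy from the SDK calls the resource makes. A minimal sketch (ANY_RESOURCE is used here only for brevity; scope it to specific ARNs in real use):

from aws_cdk.custom_resources import AwsCustomResourcePolicy

# Pass this as policy=... to AwsCustomResource instead of role=...
policy = AwsCustomResourcePolicy.from_sdk_calls(
    resources=AwsCustomResourcePolicy.ANY_RESOURCE
)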

on_create function — We will specify the AWS SDK action that we want CDK to execute when deploying the resource. In this case, we use the Athena service action, startQueryExecution, to execute a query that creates the Apache Iceberg table. We provide it with a set of parameters that specify:

  1. QueryExecutionContext (Database) — the name of the database in which to create the table.
  2. QueryString — the actual query string that contains the “CREATE TABLE” SQL statement.
  3. ResultConfiguration — we provide a child attribute, OutputLocation, to specify the S3 location where query results are stored. This is required when running queries in Athena, but we do not actually need the output for our purpose.
  4. WorkGroup — the Athena workgroup to execute this query.

on_create=AwsSdkCall(
    action='startQueryExecution',
    service='Athena',
    physical_resource_id=PhysicalResourceId.of(f'CustomResourceIceBergTable-{table_name}'),
    parameters={
        "QueryExecutionContext": {
            "Database": database_name,
        },
        "QueryString": "CREATE TABLE "+table_name+" ("+columns+") "+ IcebergTable.getPartitionedBy(partitioned_by) +" LOCATION 's3://"+S3_bucket+"/tables/"+table_name+"' TBLPROPERTIES ('table_type'='iceberg');",
        "ResultConfiguration": {
            "OutputLocation": "s3://"+S3_bucket+"/athena_temp/"
        },
        "WorkGroup": workgroup
    }
)

Note that the input parameters specified here follow the specification of the AWS JavaScript SDK for Athena even though we are using CDK Python, e.g. action='startQueryExecution' rather than boto3's start_query_execution.

on_delete function — This will perform the same Athena action as the “on_create” function above, but we will replace the QueryString with a “DROP TABLE” SQL statement instead.

on_delete=AwsSdkCall(
    action='startQueryExecution',
    service='Athena',
    physical_resource_id=PhysicalResourceId.of(f'CustomResourceIceBergTable-{table_name}'),
    parameters={
        "QueryExecutionContext": {
            "Database": database_name,
        },
        "QueryString": "DROP TABLE "+table_name,
        "ResultConfiguration": {
            "OutputLocation": "s3://"+S3_bucket+"/athena_temp/"
        },
        "WorkGroup": workgroup
    }
)

While not mentioned here, you can also create an on_update function call that gets triggered when the resource is updated.
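
For illustration, a hedged sketch of such an on_update call, assuming you want updates to run an ALTER TABLE statement held in a hypothetical alter_statement string (not part of the construct shown below):

on_update=AwsSdkCall(
    action='startQueryExecution',
    service='Athena',
    physical_resource_id=PhysicalResourceId.of(f'CustomResourceIceBergTable-{table_name}'),
    parameters={
        "QueryExecutionContext": {
            "Database": database_name,
        },
        # alter_statement is hypothetical, e.g. "ALTER TABLE transactions ADD COLUMNS (note string)"
        "QueryString": alter_statement,
        "ResultConfiguration": {
            "OutputLocation": "s3://"+S3_bucket+"/athena_temp/"
        },
        "WorkGroup": workgroup
    }
)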

Putting it all together

Combining the above code snippets, we have the following full Python code on a CDK Custom Resource that you can use to create an Apache Iceberg table on Amazon S3.

from aws_cdk import (
    aws_logs as logs,
    aws_iam as iam,
)

from aws_cdk.custom_resources import (
    AwsCustomResource,
    PhysicalResourceId,
    AwsSdkCall
)

from constructs import Construct

class IcebergTable(Construct):

    @staticmethod
    def getPartitionedBy(partitioned_by: str) -> str:
        # Returns a PARTITIONED BY clause, or an empty string if no partitioning is requested.
        if len(partitioned_by) > 0:
            return " PARTITIONED BY ("+partitioned_by+") "
        else:
            return ""

    def __init__(self, scope: Construct, database_name: str, table_name: str, columns: str, partitioned_by: str, S3_bucket: str, workgroup: str):
        super().__init__(scope, table_name)

        res = AwsCustomResource(
            scope=self,
            install_latest_aws_sdk=False,
            id=table_name,
            role=iam.Role(
                scope=self,
                id=f'Cdk-CustomResourcesIcebergTable-{table_name}-LambdaRole',
                assumed_by=iam.ServicePrincipal('lambda.amazonaws.com'),
                managed_policies=[
                    iam.ManagedPolicy.from_aws_managed_policy_name("AmazonS3FullAccess"),
                    iam.ManagedPolicy.from_aws_managed_policy_name("AmazonAthenaFullAccess")
                ],
            ),
            log_retention=logs.RetentionDays.INFINITE,
            on_create=AwsSdkCall(
                action='startQueryExecution',
                service='Athena',
                physical_resource_id=PhysicalResourceId.of(f'CustomResourceIceBergTable-{table_name}'),
                parameters={
                    "QueryExecutionContext": {
                        "Database": database_name,
                    },
                    "QueryString": "CREATE TABLE "+table_name+" ("+columns+") "+ IcebergTable.getPartitionedBy(partitioned_by) +" LOCATION 's3://"+S3_bucket+"/tables/"+table_name+"' TBLPROPERTIES ('table_type'='iceberg');",
                    "ResultConfiguration": {
                        "OutputLocation": "s3://"+S3_bucket+"/athena_temp/"
                    },
                    "WorkGroup": workgroup
                }
            ),
            on_delete=AwsSdkCall(
                action='startQueryExecution',
                service='Athena',
                physical_resource_id=PhysicalResourceId.of(f'CustomResourceIceBergTable-{table_name}'),
                parameters={
                    "QueryExecutionContext": {
                        "Database": database_name,
                    },
                    "QueryString": "DROP TABLE "+table_name,
                    "ResultConfiguration": {
                        "OutputLocation": "s3://"+S3_bucket+"/athena_temp/"
                    },
                    "WorkGroup": workgroup
                }
            ),
            resource_type='Custom::CustomResourcesIcebergTable'
        )

The IcebergTable construct takes in 6 parameters:

  1. database_name — name of the Glue catalog database
  2. table_name — name of the table to be created
  3. columns — comma-separated list of <name type> pairs
  4. partitioned_by — optional comma-separated list of columns to partition the data by. Leave blank if not required.
  5. S3_bucket — name of the S3 bucket to store the Iceberg table data
  6. workgroup — name of the Athena workgroup to use for query execution

To use it in our CDK stack, we will import it first before declaring an IcebergTable as follows:

from .CustomResources.IcebergTable import IcebergTable

tbl_transactions = IcebergTable(self,
    database_name='txn_database',
    table_name='transactions',
    columns='\
date date,\
ref string,\
deposit double,\
withdrawal double,\
bank string,\
account string,\
category string,\
subcategory string,\
stmttype string,\
amount double,\
txntype string',
    partitioned_by='month(date), account',
    S3_bucket=s3_cf_analysis.bucket_name,
    workgroup='primary'
)
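
One practical note: the CREATE TABLE query will fail if the Glue database does not exist yet at deployment time. If the database is created in the same stack, you may want to add an explicit dependency so the custom resource runs after it, for example (glue_database is a hypothetical construct name):

# Hypothetical: glue_database is the construct that creates the txn_database Glue database.
tbl_transactions.node.add_dependency(glue_database)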

Ending Note

This blog discussed how to use a CDK Custom Resource to create a resource that is not yet supported natively in CDK, a pattern you can reuse to extend the capabilities of CDK.
