How to quickly build a data lake using Amazon Web Services (AWS)
Intro — Data lake in AWS is more than just dumping data into S3
At some point in time, you might want to create a data lake and, like me, you might want/need to use AWS. As you may know, a data lake:
- facilitates data exploitation;
- opens new business opportunities by easily providing business insights (using analytical and/or visualization tools, for example);
- avoids hitting the live DB (and affecting your customers) while you explore the data;
- increases your product usage by exposing analytics services to applications (APIs);
- avoids data duplication costs, especially in a microservices strategy;
- provides a simpler, safer, more coherent and transparent way to access data by storing it in a centralized place where everyone (Sales, CX, customers, developers, etc.) knows where to find/search it ;)
Sounds great, right? However, as soon as you start, you will realize that creating a data lake in AWS is more than just dumping data into S3, because you should do it:
- in a way that the information is accessible to its intended users; in other words, preventing a data swamp;
- in a way that you always have access to the raw data (as a backup);
- in a secure way, by encrypting production data;
- in a compliant way, by restricting access to production data (not available to everyone inside the organization);
- in a cost-efficient way, by compressing the data; “Parquet (and ORC) are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON”;
- in a cost-efficient way, by moving data that is not accessed frequently to a cheaper AWS storage class like S3 Standard-IA (Infrequent Access) or Amazon S3 Glacier;
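The last point can be automated with S3 lifecycle rules. A minimal sketch of a lifecycle configuration (the rule ID and the 30/90-day thresholds are assumptions for illustration; tune them to your actual access patterns):

```json
{
  "Rules": [
    {
      "ID": "tier-down-cold-data",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

Applied to the bucket (via console, CLI or IaC), this moves objects to Standard-IA after 30 days and to Glacier after 90, with no changes to the ingestion flow.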
Real-time ingestion flow as a solution
There are “multiple data lake ingestion methods” but, if you have an event-driven architecture (it does not need to be based on microservices; it can be a monolithic application that publishes integration events), you might immediately jump to Amazon Kinesis Firehose. Even if you have historical data in a transactional database or on an in-house server, you might still want to build an app that queries this data and publishes it in the form of integration events that will somehow end up in Amazon Kinesis Firehose. I used this real-time ingestion flow as much as possible because:
- AWS Kinesis Firehose delivery streams handle conversion and compression and organize your bucket automatically;
- it gives more flexibility, i.e., using real-time integration events makes it easy to attach analytics to them. It also allows reusing those integration events for other purposes;
- reuse of best practices / faster time to market: if I set up AWS Kinesis Firehose with compression and conversion, I know that all data that passes through will be… compressed and converted :). Having this done by a background job (or some AWS service like Storage Gateway) for existing historical data is a more error-prone and probably more costly approach;
That being said, I just wanted to create a data lake in AWS that follows a real-time data ingestion flow and even if you don’t have an event-driven architecture you might still find this useful.
Create the data lake (Part I) — Architecture
To create such a flow we will:
- create a worker (or a data lake ingestion microservice, if you prefer) that sends records to S3/the data lake using a Kinesis data delivery stream. Typically, in a real-world product, those records are the integration events that your product produces;
- set up a Kinesis data delivery stream in a way that automatically converts and compresses the data and automatically stores it in S3 in an organized way ;)
To sum up, something like this:
- this is more than just a real-time data ingestion flow to create a data lake. It is the recommended flow for batch processing and real-time analytics using AWS services, where the data is stored in a data lake/S3 buckets by a Kinesis data delivery stream;
- the data lake ingestion microservice is just an abstraction layer, so the other microservices (or the monolithic app) do not need to know about Kinesis or how to publish an event. By centralizing the data lake creation in one place, you can easily improve it along the way by, for example, logging failed Kinesis requests, supporting more events or even switching cloud providers;
- AWS Glue, Amazon Redshift, as well as business insights tools like AWS QuickSight or Microsoft Power BI, are out of the scope of this post, but I just want to share that (1) you can fill Amazon Redshift directly from a Kinesis data delivery stream and (2) you don’t necessarily need a data warehouse like Amazon Redshift to create your first reports. “Data lakes are an increasingly popular way to store and analyze both structured and unstructured data. If you want to build your own custom Amazon S3 data lake, AWS Glue can make all your data immediately available for analytics without moving the data”;
Time to build it :)
Create the data lake (Part II) — Setup AWS Kinesis Firehose with S3 as a target destination
Login in AWS console and choose Firehose service (if you don’t have an AWS account, you can create one for free but be aware that Kinesis Firehose resources are not covered under the AWS Free Tier, and usage-based charges apply). Since we want to “continuously collect, transform, and load streaming data into Amazon S3”, select “Create delivery stream”.
Step 1: Name and source: Give the stream a name and select “Direct PUT or other sources”, as we “want to use the Firehose PutRecord() or PutRecordBatch() API to send source records to the delivery stream”;
Step 2: Process records: For demo purposes, I don’t need “Record transformation” (we will be publishing JSON-format events, which Kinesis Firehose knows how to convert), but I definitely want to enable “Record format conversion” to Apache Parquet, for example. As AWS says, “data in Apache parquet or Apache ORC format is typically more efficient to query than JSON”. Plus, converting to Apache Parquet ensures that “data is compressed using Snappy compression before it is delivered to S3”. Great, no? :)
As you can see, for Kinesis to know how to convert a JSON record to Apache Parquet, we need to “specify a schema for source records”. For demo purposes, I just created a table manually in AWS Glue:
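For reference, the demo schema is tiny. If you later want to script the Glue side instead of clicking through the console, the table definition boils down to a TableInput payload like the sketch below (the table name, column and bucket are placeholders for this demo):

```json
{
  "Name": "events",
  "StorageDescriptor": {
    "Columns": [
      { "Name": "id", "Type": "string" }
    ],
    "Location": "s3://my-datalake-bucket/"
  }
}
```

This is the same structure the Glue console builds for you; one string column is enough for the demo events we will publish later.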
Step 3: Choose destination: “Amazon S3 is the only available destination because we enabled Record format conversion in Step 2: Process records”, so just select an S3 destination bucket. I left the S3 prefix empty (default) but set the S3 error prefix to “/error”. Especially if the data is critical, you want to enable S3 backup as well:
Step 4: Configure settings: I left everything at the defaults except the S3 encryption part. You definitely want your production data encrypted by default in AWS S3, so be sure that you have this enabled. You will also need to create an IAM role, but AWS will pre-select one for you to make your life easier:
Now, just review and go ahead. It will take a few seconds, but the stream will eventually show as Active in the AWS console ;)
The only thing that is missing is restricting access to production data. There are multiple ways of doing this, but starting with the IAM principals of your organization might be enough and quicker. However, be aware that a more maintainable approach, using IAM roles for example, is preferable.
At this point, you have a Kinesis stream set up to store data:
- in an encrypted way;
- in a cost-effective format as “Parquet (and ORC) are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON.”;
- in an organized way, i.e., stored by date with appropriate prefixes for backup and error.
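On that last point: by default, Firehose delivers objects under a UTC “YYYY/MM/DD/HH/” key prefix. If a downstream job needs to compute where the records for a given hour landed, a small helper can reproduce that layout (a sketch; `base_prefix` stands for whatever S3 prefix you configured on the stream):

```python
from datetime import datetime, timezone

def firehose_key_prefix(ts: datetime, base_prefix: str = "") -> str:
    """Reproduce Kinesis Firehose's default UTC YYYY/MM/DD/HH/ S3 key prefix."""
    ts_utc = ts.astimezone(timezone.utc)
    return f"{base_prefix}{ts_utc:%Y/%m/%d/%H}/"

# Example: records delivered on 2019-03-07 at 14:05 UTC
print(firehose_key_prefix(datetime(2019, 3, 7, 14, 5, tzinfo=timezone.utc)))
# → 2019/03/07/14/
```

This date-based partitioning is also what makes later queries (Athena, Glue) cheap to scope to a time range.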
Time to fill the lake and query the data ;)
Create the data lake (Part III) — Fill (and query)
I’ll create a .NET Core console app using the dotnet new command to represent the data lake ingestion microservice. For demo purposes, I’ll generate a bulk of generic events and send them to the Kinesis stream that we just set up, using the AWS .NET SDK. In your product, you might consume those events from a message broker like Kafka or RabbitMQ and use a programming language/AWS SDK of your preference:
dotnet new console --name AWSFirehosePublisher
reference the AWS .NET SDK in AWSFirehosePublisher.csproj:
<PackageReference Include="AWSSDK.Extensions.NETCore.Setup" Version="3.3.100.1" />
<PackageReference Include="AWSSDK.KinesisFirehose" Version="3.3.100.10" />
and use PutRecordAsync in Program.cs to start publishing your awesome events :)
var data = "{\"id\": \"" + id + "\"}";
// ...
return _firehoseClient.PutRecordAsync(putRecordRequest);
Et voilà! The stream shows activity and the data is stored in S3!
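If .NET is not your thing, the publishing step looks much the same in other SDKs. A minimal Python/boto3 sketch (the stream name is a placeholder; Firehose concatenates record payloads as-is, so appending a newline keeps one JSON object per line in the delivered files):

```python
import json

def make_record(event: dict) -> dict:
    """Serialize an integration event into a Firehose record payload.

    The trailing newline keeps JSON objects separated in the S3 objects
    that Firehose delivers.
    """
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

def publish(firehose_client, stream_name: str, event: dict):
    # Equivalent of PutRecordAsync in the .NET SDK.
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record=make_record(event),
    )

# Usage (requires AWS credentials):
# import boto3
# client = boto3.client("firehose")
# publish(client, "my-delivery-stream", {"id": "42"})
```

For high volume, PutRecordBatch (put_record_batch in boto3) with the same record shape is cheaper per call.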
With S3 Select you can extract records from a single CSV, JSON or Parquet file using SQL expressions. S3 Select supports GZIP and BZIP2 compressed files and server-side encrypted files. To analyze data in S3 that needs more complex SQL expressions, see Amazon Athena ;)
As I just want to check that I can read the data and that the data is in the shape I expect, I’ll be using S3 Select. To use S3 Select, simply pick the file, go to the “Select from” tab and press “Show file preview”:
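For reference, S3 Select expressions are plain SQL against an S3Object alias; for the demo events, something like this (the id column comes from the demo Glue schema):

```sql
SELECT s.id FROM S3Object s LIMIT 5
```

The console preview runs an expression like this for you; the same query can be issued programmatically via the SelectObjectContent API.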
Next steps?
- Ideally, you want to set up those AWS services using Infrastructure as Code (IaC);
- Create a nice dashboard in AWS QuickSight and embed it in your app?
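As a starting point for the IaC route, here is a minimal CloudFormation sketch of the delivery stream (the bucket ARN and role ARN are placeholders; the format conversion and encryption settings from the console walkthrough above still need to be added):

```yaml
Resources:
  DataLakeDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      DeliveryStreamType: DirectPut
      ExtendedS3DestinationConfiguration:
        BucketARN: arn:aws:s3:::my-datalake-bucket            # placeholder
        RoleARN: arn:aws:iam::123456789012:role/firehose-role  # placeholder
        ErrorOutputPrefix: error/
```

Keeping the stream in a template makes the compression/conversion settings reviewable and reproducible across environments.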
I hope you find it helpful :) thanks for reading!