How to quickly build a data lake using Amazon Web Services (AWS)

Rui Pedrosa · Published in Nerd For Tech · 8 min read · Apr 30, 2019


Intro — Data lake in AWS is more than just dumping data into S3

At some point in time, you might want to create a data lake and, like me, you might want/need to use AWS. As you may know, a data lake:

  1. facilitates data exploration;
  2. opens new business opportunities by making it easy to provide business insights (using analytical and/or visualization tools, for example);
  3. avoids hitting the live DB and affecting your customers while we do it;
  4. increases your product usage by exposing analytics services to applications (APIs);
  5. avoids data duplication costs, especially in a microservices strategy;
  6. provides a simpler, safer, more coherent and transparent way to access data by storing it in a centralized place where everyone (Sales, CX, customers, developers, etc.) knows where to find/search it ;)

Sounds great, right? However, as soon as you start, you will realize that creating a data lake in AWS is more than just dumping data into S3, because you should do it right:

Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with Special Guests, Airbnb & Viber — STG312 — re:Invent 2017

To sum up, I hope you realize that building a data lake from scratch involves a considerable number of topics, such as data ingestion, data organization, security, costs, etc.

Real-time ingestion flow as a solution

There are “multiple data lake ingestion methods” but, if you have an event-driven architecture (it does not need to be based on microservices; it can be a monolithic application that publishes integration events), you might immediately jump to Amazon Kinesis Firehose. Even if you have historical data in a transactional database or on an in-house server, you might still want to build an app that queries this data and publishes it in the form of integration events that will somehow end up in Amazon Kinesis Firehose. I used this real-time ingestion flow as much as possible because:

  • AWS Kinesis Firehose delivery streams handle conversion and compression, and organize your bucket automatically;
  • it gives more flexibility, i.e., using real-time integration events makes it easy to attach analytics to them. It also allows you to reuse those integration events for other purposes;
  • reuse of best practices / faster time to market: if I set up AWS Kinesis Firehose with compression and conversion, I know that all data that passes through it will be… compressed and converted :). Having this done by a background job (or some AWS service like Storage Gateway) for existing historical data is a more error-prone approach and probably more costly;

That being said, I just wanted to create a data lake in AWS that follows a real-time data ingestion flow, and even if you don’t have an event-driven architecture you might still find this useful.

Create the data lake (Part I) — Architecture

To create such a flow, we will:

  • create a worker (or a data lake ingestion microservice if you prefer) that sends records to S3/the data lake using a Kinesis Data Firehose delivery stream. Typically, in a real-world product, those records are integration events that your product produces;
  • set up the Kinesis Data Firehose delivery stream in a way that automatically converts and compresses the data and stores it in S3 in an organized way ;)

To sum up, something like this:

Data lake: real-time data ingestion recommended flow for batch processing and real-time analytics

Time to build it :)

Create the data lake (Part II) — Set up AWS Kinesis Firehose with S3 as a target destination

Log in to the AWS console and choose the Kinesis Firehose service (if you don’t have an AWS account, you can create one for free, but be aware that Kinesis Firehose resources are not covered under the AWS Free Tier, and usage-based charges apply). Since we want to “continuously collect, transform, and load streaming data into Amazon S3”, select “Create delivery stream”.

Step 1: Name and source: Give the stream a name and select “Direct PUT or other sources”, as we “want to use the Firehose PutRecord() or PutRecordBatch() API to send source records to the delivery stream”;

Step 1: Name and source

Step 2: Process records: For demo purposes, I don’t need “Record transformation”, also because we will be publishing events in JSON format and Kinesis Firehose knows how to convert them, but I definitely want to enable “Record format conversion” to Apache Parquet, for example. As AWS says, “data in Apache parquet or Apache ORC format is typically more efficient to query than JSON”. Plus, converting to Apache Parquet ensures that “data is compressed using Snappy compression before it is delivered to S3”. Great, no? :)

Step 2: Process records

As you can see, for Kinesis Firehose to know how to convert a JSON record to Apache Parquet, we need to “specify a schema for source records”. For demo purposes, I just created a table manually in AWS Glue:

Step 2: Process records — Specify a schema for source records
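If you prefer to script this part instead of clicking through the console, roughly the same result can be achieved with the AWS CLI. This is just a minimal sketch: the database and table names (datalake_demo, events) and the single string column id are assumptions that match the demo event published later.

# Hypothetical names: "datalake_demo" database and "events" table with a single string column "id"
aws glue create-database --database-input '{"Name": "datalake_demo"}'
aws glue create-table --database-name datalake_demo --table-input '{
  "Name": "events",
  "StorageDescriptor": {
    "Columns": [{ "Name": "id", "Type": "string" }]
  }
}'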

Step 3: Choose destination: “Amazon S3 is the only available destination because we enabled Record format conversion in Step 2: Process records”, so just select an S3 destination bucket. I leave the S3 prefix empty (default) but I set the S3 error prefix to “/error”. Especially if the data is critical, you want to enable S3 backup as well:

Step 3: Choose destination

Step 4: Configure settings: I leave everything at the defaults except the S3 encryption part. You definitely want your production data encrypted by default in AWS S3, so be sure that you have this enabled. You will also need to create an IAM role, but AWS will pre-select one for you to make your life easier:

Step 4: Configure settings
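Independently of this Firehose setting, it is also worth enforcing default encryption on the destination bucket itself. A minimal CLI sketch, assuming a hypothetical bucket name my-datalake-bucket and SSE-S3 (AES256) as the algorithm:

# "my-datalake-bucket" is a placeholder for your data lake bucket
aws s3api put-bucket-encryption \
  --bucket my-datalake-bucket \
  --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'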

Now, just review and go ahead. It will take some seconds, but eventually the stream will be shown as Active in the AWS console ;)

The only thing that is missing is restricting access to production data. There are multiple ways of going about it, but starting with the IAM principals of your organization might be enough and quicker. However, be aware that a more maintainable approach, using IAM roles for example, is preferable.
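As a starting point, something like the read-only policy below can be created and attached to the principals that should be able to query the data lake. It is just a sketch: the policy name and my-datalake-bucket are placeholders for your own resources.

# Hypothetical read-only policy for the data lake bucket
aws iam create-policy \
  --policy-name DataLakeReadOnly \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-datalake-bucket",
        "arn:aws:s3:::my-datalake-bucket/*"
      ]
    }]
  }'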

At this point, you have a Kinesis stream set up to store data:

Time to fill the lake and query the data ;)

Create the data lake (Part III) — Fill (and query)

I’ll create a .NET Core console app using the dotnet new command to represent the data lake ingestion microservice. For demo purposes, I’ll generate a bulk of generic events and send them to the Kinesis stream that we just set up, using the AWS SDK for .NET. In your product, you might consume those events from a message broker like Kafka or RabbitMQ and use a programming language/AWS SDK of your preference:

dotnet new console --name AWSFirehosePublisher

reference the AWS SDK for .NET in AWSFirehosePublisher.csproj:

<PackageReference Include="AWSSDK.Extensions.NETCore.Setup" Version="3.3.100.1" />
<PackageReference Include="AWSSDK.KinesisFirehose" Version="3.3.100.10" />

and use PutRecordAsync in Program.cs to start publishing your awesome events :)

var data = "{\"id\": \"" + id + "\"}";
...
return _firehoseClient.PutRecordAsync(putRecordRequest);
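For reference, a fuller version of that snippet could look like the sketch below. It is a minimal example under a few assumptions: the delivery stream name (my-datalake-stream) and the region are placeholders, the 100-event loop is just for the demo, and credentials are resolved through the default AWS SDK chain (profile, environment variables, etc.).

using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Amazon;
using Amazon.KinesisFirehose;
using Amazon.KinesisFirehose.Model;

namespace AWSFirehosePublisher
{
    class Program
    {
        // Placeholder: use the delivery stream name you created in Part II
        private const string DeliveryStreamName = "my-datalake-stream";

        static async Task Main(string[] args)
        {
            // Region is an assumption; pick the one where your delivery stream lives
            using (var firehoseClient = new AmazonKinesisFirehoseClient(RegionEndpoint.USEast1))
            {
                for (var i = 0; i < 100; i++)
                {
                    // The JSON must match the Glue table schema from Step 2 (a single "id" column here)
                    var id = Guid.NewGuid().ToString();
                    var data = "{\"id\": \"" + id + "\"}";

                    var putRecordRequest = new PutRecordRequest
                    {
                        DeliveryStreamName = DeliveryStreamName,
                        Record = new Record { Data = new MemoryStream(Encoding.UTF8.GetBytes(data)) }
                    };

                    // Firehose will buffer, convert to Parquet, compress and deliver to S3
                    await firehoseClient.PutRecordAsync(putRecordRequest);
                }
            }
        }
    }
}

Keep in mind that Firehose buffers records before delivering them, so it can take a few minutes until the first Parquet files show up in the bucket.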

et voilà! The stream shows activity and the data is stored in S3!

Data stored in S3

With S3 Select you can extract records from a single CSV, JSON or Parquet file using SQL expressions. S3 Select supports GZIP and BZIP2 compressed files and server-side encrypted files. To analyze data in S3 that needs more complex SQL expressions, see Amazon Athena ;)

As I just want to check that I can read the data and that the data is in the shape I expect, I’ll be using S3 Select. To use S3 Select, simply pick the file, go to the “Select from” tab and press “Show file preview”:
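The same quick check can be scripted with S3 Select from the command line. This is only a sketch: the bucket name, the object key and the output file are placeholders, and the query assumes the single id column from the Glue schema:

aws s3api select-object-content \
  --bucket my-datalake-bucket \
  --key path/to/delivered-object.parquet \
  --expression "SELECT s.id FROM S3Object s LIMIT 5" \
  --expression-type SQL \
  --input-serialization '{"Parquet": {}}' \
  --output-serialization '{"JSON": {}}' \
  preview.json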

Next steps?

I hope you find it helpful :) thanks for reading!
