Scheduling data imports from AWS S3 to query in an AWS Athena SQL table

Aishani Pachauri
4 min read · Jul 1, 2023


This setup was quite helpful when data dumped in S3 was needed for analysis and we decided to query that data in Athena as a SQL-based table.

AWS Athena uses Presto with full standard SQL support, letting you query both relational and non-relational data. We will be using an AWS Glue crawler to automatically populate the table created in AWS Athena with data from an AWS S3 bucket.

[Diagram: the architectural flow we will be working on, from S3 through a Glue crawler to an Athena table]

(You can also create a table directly from an S3 bucket in AWS Athena if you don’t need to repopulate it again and again.)
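For reference, here is a minimal sketch of that one-off approach using boto3. The bucket names, column names, and the employees/ prefix of JSON files are all hypothetical, and it assumes the employee_db database already exists:

```python
import boto3

athena = boto3.client("athena")

# One-off alternative: define the table schema by hand instead of crawling.
# Bucket names, columns, and the output location are placeholders.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS employee_db.employees (
    id     int,
    name   string,
    salary double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-bucket/employees/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

The trade-off is that you maintain the schema by hand; the crawler approach below infers it for you and can refresh it on a schedule.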

As always, I am assuming you have interacted with Amazon Web Services Console before and have created an S3 bucket.

Start with AWS Glue Console: Creating a crawler

Go to the Athena console and click Launch query editor. Under Tables and views, select Create, then choose AWS Glue crawler.

[Screenshot: accessing the AWS Glue console to create a crawler from the Athena query editor]

This will take you to the AWS Glue crawler console, where you can set up a scan of data in an S3 bucket.

You can even schedule the crawler like a cron job if data is added to the bucket regularly; it will repopulate the table at the given interval by rescanning whatever data is new.

Go ahead, click on Crawlers and then Create crawler:

[Screenshot: creating a crawler in the AWS Glue console]

You now need to configure the crawler’s properties (a scripted equivalent follows this list):

  1. Add a name, and optionally a description and tags (it’s good practice to do so, for easy access and documentation).
  2. Select ‘Not yet’ under “Is your data already mapped to Glue tables?”. Next, add a data source by clicking ‘Add a data source’ and pointing it at your S3 bucket.
    a. Here, select the location of the S3 folder you want scanned; no worries if it has subfolders.
    b. The option for subsequent crawler runs depends entirely on whether you will use this crawler again. In our case, yes, since we are scheduling it.
  3. Add an IAM role. If you don’t have one, simply click ‘Create new IAM role’ to create one. Click Next.
  4. Create a database, or use an existing one like employee_db in our case (the one you saw in the AWS Athena query editor, above Tables and views).

5. Schedule the crawler by changing ‘On demand’ in the dropdown. You can find the various cron expressions in the AWS Schedule Expressions guide.

6. Review it all once and click Create crawler.
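If you prefer scripting these steps to clicking through the console, here is a minimal boto3 sketch of the same configuration. The crawler name, IAM role, and bucket path are placeholders, and the cron string follows the AWS schedule-expression format mentioned in step 5:

```python
import boto3

glue = boto3.client("glue")

# Steps 1-5 from the list above, scripted. All names are placeholders.
glue.create_crawler(
    Name="employee-s3-crawler",
    Role="AWSGlueServiceRole-demo",  # the IAM role from step 3
    DatabaseName="employee_db",      # the target database from step 4
    Description="Crawls employee JSON files dumped in S3",
    Targets={"S3Targets": [{"Path": "s3://my-data-bucket/employees/"}]},
    # Step 5: run every day at 02:00 UTC instead of 'On demand'.
    # Format: cron(minutes hours day-of-month month day-of-week year)
    Schedule="cron(0 2 * * ? *)",
)
```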

[Screenshot: the crawler created successfully in the AWS console, ready to try out]

Creating the table in Athena

Once created, you will see your crawler in the list with State ‘Ready’. Select the crawler and go to Actions. Select Run crawler from the dropdown and let it run.
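If you start the crawler programmatically instead, a small polling loop can tell you when it has finished; a sketch, reusing the hypothetical crawler name from above:

```python
import time

import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="employee-s3-crawler")

# Poll until the crawler returns to the READY state.
while True:
    state = glue.get_crawler(Name="employee-s3-crawler")["Crawler"]["State"]
    if state == "READY":
        break
    time.sleep(30)
print("Crawl finished")
```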

After the crawler has run with no errors, go back to your Athena query editor and switch the database to the one you specified in the crawler. You should see your table there.

Select that table and preview it by clicking on the kebab menu (the three vertical dots for options) beside the table name.

The crawler treats each file in the S3 bucket as a single row of data in the Athena table. You can try out an array of records in one file and see how that maps onto the Athena table structure.
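To see this one-file-one-row behaviour, you can upload a few single-object JSON files and recrawl; a sketch with hypothetical records and bucket name:

```python
import json

import boto3

s3 = boto3.client("s3")

# Each file holds exactly one JSON object, so each file surfaces
# as one row in the crawled Athena table. Names are placeholders.
records = [
    {"id": 1, "name": "Asha", "salary": 65000.0},
    {"id": 2, "name": "Ravi", "salary": 72000.0},
]
for record in records:
    s3.put_object(
        Bucket="my-data-bucket",
        Key=f"employees/employee-{record['id']}.json",
        Body=json.dumps(record),
    )
```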

[Screenshot: adding a location for query output in the Athena query editor settings]

Do not forget to add an output location for your AWS Athena queries.
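The same setting matters when you query through the API: start_query_execution needs an output location (or a workgroup that defines one). A minimal sketch, assuming the crawler produced a table named employees:

```python
import time

import boto3

athena = boto3.client("athena")

# OutputLocation plays the same role as the setting in the console.
query = athena.start_query_execution(
    QueryString="SELECT * FROM employees LIMIT 10",
    QueryExecutionContext={"Database": "employee_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = query["QueryExecutionId"]

# Wait for the query to finish, then fetch the first page of results.
while True:
    resp = athena.get_query_execution(QueryExecutionId=qid)
    status = resp["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
print(rows)
```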

What’s next?

You can always change the crawler settings to keep the same table schema while recrawling, by editing the crawler properties from the Actions dropdown under AWS Glue console -> Crawlers.

The same edit option even lets you define how the table schema should change if you add new properties to the JSON files saved in S3, adding new columns on the next recrawl.
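That behaviour is controlled by the crawler’s schema change policy, which you can also set through boto3; a sketch, reusing the hypothetical crawler name:

```python
import boto3

glue = boto3.client("glue")

# Decide what a recrawl may do to the existing table schema.
glue.update_crawler(
    Name="employee-s3-crawler",
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # add new columns on recrawl
        "DeleteBehavior": "LOG",                 # never drop existing columns
    },
)
```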

Further reading & error handling for things you might stumble upon while working on this ✨

Hope you liked it, connect with me here: Aishani Pachauri | LinkedIn

More bite-sized content: Aishanii — DEV Community👩‍💻

Check out more such blogs in the series: List: AWS Serverless👩‍💻 | Curated by Aishani Pachauri | Medium
