Overview
The AWS S3 data source template enables Validatar to discover and profile data files stored in Amazon S3 buckets. It uses Python scripts with the boto3 library to list objects, read file schemas, and calculate data quality metrics.
Platform: Amazon S3
Connection Category: Script
Template Category: Marketplace
What's Included
Default Parameters
| Parameter | Type | Description |
|---|---|---|
| bucket_name | String | S3 bucket name |
| prefix | String | Key prefix to scope discovery (e.g., data/warehouse/) |
| aws_access_key_id | Secret | AWS access key |
| aws_secret_access_key | Secret | AWS secret key |
| aws_region | String | AWS region (e.g., us-east-1) |
| file_format | Dropdown | Expected file format (CSV, Parquet, JSON) |
Data Type Mappings
Maps inferred types from file schemas (varies by file format — Parquet files have explicit types, CSV files use inference).
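The mapping logic can be sketched as follows. This is an illustration only: the target type names and the exact Parquet physical types handled are assumptions, not the template's actual mapping table.

```python
# Hypothetical mapping from Parquet physical types (which are explicit
# in the file footer) to generic target types. Illustrative names only.
PARQUET_TYPE_MAP = {
    "INT32": "Integer",
    "INT64": "Integer",
    "FLOAT": "Decimal",
    "DOUBLE": "Decimal",
    "BOOLEAN": "Boolean",
    "BYTE_ARRAY": "String",
}

def infer_csv_type(values):
    """Infer a coarse type for a CSV column by trying progressively
    stricter parses over sampled values (CSV carries no declared types)."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "String"
    try:
        for v in non_null:
            int(v)
        return "Integer"
    except ValueError:
        pass
    try:
        for v in non_null:
            float(v)
        return "Decimal"
    except ValueError:
        return "String"
```

Because CSV inference depends on the sampled rows, a column that looks numeric in the sample may still contain strings later in the file; Parquet avoids this ambiguity entirely.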
Metadata Ingestion
The ingestion script:
- Lists objects in the bucket matching the configured prefix and file format
- Maps S3 prefixes (folders) to schemas
- Maps each file or file group to a table
- Reads file headers/schemas to discover columns
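The steps above can be sketched as below. The key-to-schema/table convention follows the list above; the helper and function names are illustrative, not the template's actual script.

```python
def key_to_schema_and_table(key, prefix=""):
    """Map an S3 object key to (schema, table): the folder path under
    the prefix becomes the schema, the file stem becomes the table."""
    relative = key[len(prefix):] if key.startswith(prefix) else key
    parts = relative.split("/")
    schema = "/".join(parts[:-1]) or "(root)"
    table = parts[-1].rsplit(".", 1)[0]  # strip the file extension
    return schema, table

def list_data_files(bucket, prefix, extension):
    """List matching object keys with boto3. Uses the list_objects_v2
    paginator so buckets with more than 1,000 objects are handled.
    Requires AWS credentials; not runnable without them."""
    import boto3  # imported lazily so the pure helper above works without it
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(extension):
                keys.append(obj["Key"])
    return keys
```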
Profiling
The profiling script downloads sample data and calculates standard metrics (record count, null count, distinct count, etc.).
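A minimal, stdlib-only sketch of the metric calculation, assuming the sample has already been downloaded (the real script would fetch it first, e.g. with boto3's get_object). Function and metric names are illustrative:

```python
import csv
import io

def profile_columns(csv_text):
    """Compute per-column record count, null count, and distinct count
    over a CSV sample. Empty strings are treated as nulls."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    metrics = {}
    for column in (rows[0].keys() if rows else []):
        values = [r[column] for r in rows]
        metrics[column] = {
            "record_count": len(values),
            "null_count": sum(1 for v in values if v in ("", None)),
            "distinct_count": len({v for v in values if v not in ("", None)}),
        }
    return metrics
```

Note that metrics computed on a sample are estimates; distinct counts in particular can undercount relative to the full file.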
Installation
Customization
- IAM role authentication — Modify the script to use IAM roles instead of access keys
- Partitioned data — Handle Hive-style partitioning (year=2024/month=01/)
- Large files — Configure sampling for files too large to download fully
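For the large-file case, one way to sample without downloading the whole object is an S3 ranged GET. This is a sketch under assumed names; the helper functions are not part of the template. Because the default boto3 credential chain is used (no explicit keys), the same code also works with IAM role authentication:

```python
def sample_range_header(sample_bytes):
    """Build the HTTP Range header for reading only the first
    sample_bytes of an object, e.g. 'bytes=0-1048575' for 1 MiB."""
    return f"bytes=0-{sample_bytes - 1}"

def read_sample(bucket, key, sample_bytes=1_048_576):
    """Fetch only the leading bytes of a large object via a ranged
    get_object call. Requires boto3 and AWS credentials (IAM role
    or access keys resolved by the default credential chain)."""
    import boto3
    s3 = boto3.client("s3")
    resp = s3.get_object(Bucket=bucket, Key=key,
                         Range=sample_range_header(sample_bytes))
    return resp["Body"].read()
```

One caveat: a byte range can cut a CSV mid-row, so drop the final (possibly truncated) line of the sample before parsing.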