Overview
The AWS S3 data source template enables Validatar to discover and profile data files stored in Amazon S3 buckets. It uses Python scripts with the boto3 library to list objects, read file schemas, and calculate data quality metrics.
Platform: Amazon S3
Connection Category: Script
Template Category: Marketplace
What's Included
Default Parameters
| Parameter | Type | Description |
|---|---|---|
| bucket_name | String | S3 bucket name |
| prefix | String | Key prefix to scope discovery (e.g., data/warehouse/) |
| aws_access_key_id | Secret | AWS access key |
| aws_secret_access_key | Secret | AWS secret key |
| aws_region | String | AWS region (e.g., us-east-1) |
| file_format | Dropdown | Expected file format (CSV, Parquet, JSON) |
Data Type Mappings
Maps inferred types from file schemas (varies by file format — Parquet files have explicit types, CSV files use inference).
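The mapping logic can be sketched as follows. This is an illustration only: the target type names and the exact Parquet physical types handled are assumptions, not the template's actual mapping table.

```python
# Hypothetical mapping from Parquet physical types (which are explicit
# in the file footer) to generic target types. Illustrative names only.
PARQUET_TYPE_MAP = {
    "INT32": "Integer",
    "INT64": "Integer",
    "FLOAT": "Decimal",
    "DOUBLE": "Decimal",
    "BOOLEAN": "Boolean",
    "BYTE_ARRAY": "String",
}

def infer_csv_type(values):
    """Infer a coarse type for a CSV column by trying progressively
    stricter parses over sampled values (CSV carries no declared types)."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "String"
    try:
        for v in non_null:
            int(v)
        return "Integer"
    except ValueError:
        pass
    try:
        for v in non_null:
            float(v)
        return "Decimal"
    except ValueError:
        return "String"
```

Because CSV inference depends on the sampled rows, a column that looks numeric in the sample may still contain strings later in the file; Parquet avoids this ambiguity entirely.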
Metadata Ingestion
The ingestion script:
- Lists objects in the bucket matching the configured prefix and file format
- Maps S3 prefixes (folders) to schemas
- Maps each file or file group to a table
- Reads file headers/schemas to discover columns
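The steps above can be sketched as below. The key-to-schema/table convention follows the list above; the helper and function names are illustrative, not the template's actual script.

```python
def key_to_schema_and_table(key, prefix=""):
    """Map an S3 object key to (schema, table): the folder path under
    the prefix becomes the schema, the file stem becomes the table."""
    relative = key[len(prefix):] if key.startswith(prefix) else key
    parts = relative.split("/")
    schema = "/".join(parts[:-1]) or "(root)"
    table = parts[-1].rsplit(".", 1)[0]  # strip the file extension
    return schema, table

def list_data_files(bucket, prefix, extension):
    """List matching object keys with boto3. Uses the list_objects_v2
    paginator so buckets with more than 1,000 objects are handled.
    Requires AWS credentials; not runnable without them."""
    import boto3  # imported lazily so the pure helper above works without it
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(extension):
                keys.append(obj["Key"])
    return keys
```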
Profiling
The profiling script downloads sample data and calculates standard metrics (record count, null count, distinct count, etc.).
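A minimal, stdlib-only sketch of the metric calculation, assuming the sample has already been downloaded (the real script would fetch it first, e.g. with boto3's get_object). Function and metric names are illustrative:

```python
import csv
import io

def profile_columns(csv_text):
    """Compute per-column record count, null count, and distinct count
    over a CSV sample. Empty strings are treated as nulls."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    metrics = {}
    for column in (rows[0].keys() if rows else []):
        values = [r[column] for r in rows]
        metrics[column] = {
            "record_count": len(values),
            "null_count": sum(1 for v in values if v in ("", None)),
            "distinct_count": len({v for v in values if v not in ("", None)}),
        }
    return metrics
```

Note that metrics computed on a sample are estimates; distinct counts in particular can undercount relative to the full file.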
Installation
Customization
- IAM role authentication — Modify the script to use IAM roles instead of access keys
- Partitioned data — Handle Hive-style partitioning (year=2024/month=01/)
- Large files — Configure sampling for files too large to download fully
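For the large-file case, one way to sample without downloading the whole object is an S3 ranged GET. This is a sketch under assumed names; the helper functions are not part of the template. Because the default boto3 credential chain is used (no explicit keys), the same code also works with IAM role authentication:

```python
def sample_range_header(sample_bytes):
    """Build the HTTP Range header for reading only the first
    sample_bytes of an object, e.g. 'bytes=0-1048575' for 1 MiB."""
    return f"bytes=0-{sample_bytes - 1}"

def read_sample(bucket, key, sample_bytes=1_048_576):
    """Fetch only the leading bytes of a large object via a ranged
    get_object call. Requires boto3 and AWS credentials (IAM role
    or access keys resolved by the default credential chain)."""
    import boto3
    s3 = boto3.client("s3")
    resp = s3.get_object(Bucket=bucket, Key=key,
                         Range=sample_range_header(sample_bytes))
    return resp["Body"].read()
```

One caveat: a byte range can cut a CSV mid-row, so drop the final (possibly truncated) line of the sample before parsing.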