AWS S3 Data Source Template


Overview

The AWS S3 data source template enables Validatar to discover and profile data files stored in Amazon S3 buckets. It uses Python scripts with the boto3 library to list objects, read file schemas, and calculate data quality metrics.

Platform: Amazon S3
Connection Category: Script
Template Category: Marketplace

What's Included

Default Parameters

Parameter               Type      Description
bucket_name             String    S3 bucket name
prefix                  String    Key prefix to scope discovery (e.g., data/warehouse/)
aws_access_key_id       Secret    AWS access key
aws_secret_access_key   Secret    AWS secret key
aws_region              String    AWS region (e.g., us-east-1)
file_format             Dropdown  Expected file format (CSV, Parquet, JSON)
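The parameters above are read by the scripts before any S3 calls are made. A minimal sketch of that validation step, assuming hypothetical parameter-dictionary handling (the function name and normalization rules are illustrative, not the template's actual API):

```python
# Hypothetical parameter validation/normalization; names and rules
# are assumptions, not the template's documented behavior.
ALLOWED_FORMATS = {"CSV", "Parquet", "JSON"}

def validate_params(params: dict) -> dict:
    """Check required keys and normalize the prefix to end with '/'."""
    required = ["bucket_name", "aws_access_key_id",
                "aws_secret_access_key", "aws_region"]
    missing = [k for k in required if not params.get(k)]
    if missing:
        raise ValueError(f"Missing parameters: {missing}")
    fmt = params.get("file_format", "CSV")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"Unsupported file_format: {fmt}")
    prefix = params.get("prefix", "")
    # Normalize so prefix-scoped listing behaves like a folder path.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return {**params, "prefix": prefix, "file_format": fmt}
```

In the real script, the validated values would then feed `boto3.client("s3", ...)` with the access key, secret key, and region.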

Data Type Mappings

Maps types inferred from file schemas to Validatar data types. The mapping varies by file format: Parquet files carry explicit types in their metadata, while CSV types must be inferred from the data.
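A sketch of what such a format-dependent mapping might look like; the type names and fallbacks here are assumptions, not the template's actual mapping table:

```python
# Hypothetical mapping from Parquet physical types to generic types;
# the template's real mapping table may differ.
PARQUET_TO_GENERIC = {
    "INT32": "Integer", "INT64": "Integer",
    "FLOAT": "Decimal", "DOUBLE": "Decimal",
    "BOOLEAN": "Boolean", "BYTE_ARRAY": "String",
}

def map_type(file_format: str, raw_type: str) -> str:
    """Map an inferred file type to a generic column type."""
    if file_format == "Parquet":
        # Parquet declares types explicitly, so look them up directly.
        return PARQUET_TO_GENERIC.get(raw_type.upper(), "String")
    # CSV and JSON carry no declared column types, so columns start
    # as String until value-based inference upgrades them.
    return "String"
```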

Metadata Ingestion

The ingestion script:

  • Lists objects in the bucket matching the prefix and format
  • Maps S3 prefixes (folders) to schemas
  • Maps each file or file group to a table
  • Reads file headers/schemas to discover columns
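The prefix-to-schema and file-to-table steps above can be sketched as a pure grouping function (a simplification: it treats every file as its own table rather than detecting file groups, and the real script would feed it keys from a boto3 `list_objects_v2` paginator):

```python
from collections import defaultdict
import posixpath

def group_objects(keys, prefix=""):
    """Group S3 keys under `prefix` into schema -> table names.
    The folder path below the prefix becomes the schema; each file
    (minus its extension) becomes a table. Illustrative sketch only."""
    schemas = defaultdict(set)
    for key in keys:
        if not key.startswith(prefix) or key.endswith("/"):
            continue  # skip out-of-scope keys and folder placeholders
        rel = key[len(prefix):]
        folder, _, filename = rel.rpartition("/")
        schema = folder or "(root)"          # files directly under the prefix
        table = posixpath.splitext(filename)[0]
        schemas[schema].add(table)
    return {s: sorted(t) for s, t in schemas.items()}
```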

Profiling

The profiling script downloads sample data and calculates standard metrics (record count, null count, distinct count, etc.).
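The metric calculations can be illustrated with a small pure-Python sketch over one sampled column (the function name and null convention are assumptions; the real script operates on downloaded sample files):

```python
def profile_column(values):
    """Compute standard profiling metrics over a sampled column:
    record count, null count, and distinct count. Treats None and
    empty string as null, which is an assumption of this sketch."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "record_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }
```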

Installation

Customization

  • IAM role authentication — Modify the script to use IAM roles instead of access keys
  • Partitioned data — Handle Hive-style partitioning (year=2024/month=01/)
  • Large files — Configure sampling for files too large to download fully
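For the partitioned-data customization, Hive-style partition values embedded in object keys can be extracted with a small helper; a sketch under the assumption that partitions follow the `name=value/` convention shown above:

```python
def parse_partitions(key):
    """Extract Hive-style partition values (e.g. year=2024/month=01/)
    from an S3 key. Illustrative helper, not part of the template."""
    partitions = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            partitions[name] = value
    return partitions
```

For IAM role authentication, the typical change is to construct the boto3 client without explicit credentials, letting the default credential chain (instance profile or assumed role) supply them.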

Related Articles