Overview
The preceding articles in this section each cover one component of a data source template in depth. This article steps back to show how all those components work together — from the moment a template is configured to the point where users are running tests and viewing profiling results against live data sources.
If you haven't read the other articles yet, this one provides a useful map of the territory. If you've already read them, this is where the pieces connect.
The Data Source Lifecycle
A data source template enables a lifecycle that flows from platform configuration through to continuous data quality monitoring:
┌───────────────────────┐
│ Data Source Template  │
│ (Platform Blueprint)  │
└───────────┬───────────┘
            │ defines how to work with this platform
            ▼
┌───────────────────────┐
│ Connection            │
│ (Specific Instance)   │
└───────────┬───────────┘
            │ connects to a specific database/API/file system
            ▼
┌───────────────────────┐
│ Data Source           │
│ (Validatar Object)    │
└───────────┬───────────┘
            │ uses template scripts and definitions
            ▼
┌───────────────────────┐
│ Metadata Ingestion    │
│ (Schema Discovery)    │
└───────────┬───────────┘
            │ populates the data catalog
            ▼
┌───────────────────────┐
│ Data Profiling        │
│ (Statistical Analysis)│
└───────────┬───────────┘
            │ measures data characteristics
            ▼
┌───────────────────────────────────────┐
│ Test Generation & Execution           │
│ (Recommendations, Templates, Macros)  │
└───────────┬───────────────────────────┘
            │ validates data quality
            ▼
┌───────────────────────┐
│ Trust Scores &        │
│ Monitoring            │
└───────────────────────┘
Each stage depends on the one before it, and the template's components contribute at every step.
How Each Component Contributes
Data Types → Correct Metadata Classification
When metadata ingestion discovers columns, the data type mappings translate platform-specific types into Validatar's internal type system. This classification determines:
- Which profiling metrics apply to each column (numeric profiles for numeric columns, string profiles for string columns)
- What test recommendations are relevant
- How columns appear in the data explorer
If data types are wrong: Columns get classified incorrectly. Profiling skips applicable metrics or runs inappropriate ones. Test recommendations miss the mark.
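Conceptually, the mapping behaves like a lookup from platform type names to internal classifications. The sketch below illustrates the idea only — the internal type names and the fallback behavior are assumptions for this example, not Validatar's actual identifiers:

```python
# Hypothetical sketch of a platform-to-internal type mapping.
# The internal names ("Integer", "String", ...) and the "Unknown"
# fallback are illustrative, not Validatar's real type system.
TYPE_MAP = {
    "BIGINT": "Integer",
    "NUMERIC": "Decimal",
    "VARCHAR": "String",
    "TIMESTAMP": "DateTime",
}

def classify(platform_type: str) -> str:
    """Look up a platform type; fall back to a generic bucket if unmapped."""
    return TYPE_MAP.get(platform_type.upper(), "Unknown")

print(classify("varchar"))    # String
print(classify("GEOGRAPHY"))  # Unknown — unmapped types degrade profiling
```

An unmapped type is exactly the failure mode described above: the column lands in a generic bucket, and type-specific profiling and recommendations never apply to it.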
Metadata Ingestion Scripts → Catalog Population
The ingestion scripts (SQL or Python) discover the structure of each data source and populate Validatar's data catalog. This catalog is the foundation for everything downstream:
- Profile sets need to know which tables and columns exist before they can profile them
- Test recommendations analyze the catalog to suggest relevant tests
- Template tests use metadata to find matching structures and generate child tests
- Macro parameters (Schema, Table, Column dropdowns) are populated from the catalog
If ingestion is wrong: The catalog is incomplete or inaccurate. Profiling misses objects, test recommendations are irrelevant, and macro dropdowns show incorrect options.
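The shape of an ingestion result can be sketched as three result sets — schemas, tables, and columns — that downstream stages join on. Python templates return these as DataFrames; plain lists of dicts stand in here to keep the sketch dependency-free, and the field names are assumptions for illustration:

```python
# Hypothetical shape of a metadata ingestion result: three levels of
# structure that populate the catalog. Field names are illustrative;
# Python templates would return these as DataFrames.
def ingest_metadata():
    schemas = [{"schema_name": "sales"}]
    tables = [{"schema_name": "sales", "table_name": "orders"}]
    columns = [
        {"schema_name": "sales", "table_name": "orders",
         "column_name": "order_id", "data_type": "BIGINT"},
        {"schema_name": "sales", "table_name": "orders",
         "column_name": "status", "data_type": "VARCHAR"},
    ]
    return schemas, tables, columns

schemas, tables, columns = ingest_metadata()
print(len(schemas), len(tables), len(columns))  # 1 1 2
```

Note that every downstream consumer — profile sets, recommendations, macro dropdowns — sees only what appears in these result sets, which is why an incomplete ingestion script silently shrinks the rest of the lifecycle.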
Profile Definitions → Data Quality Metrics
The profiling configuration (SQL definitions or Python scripts) determines what data quality metrics are available. Profile sets on each data source select which metrics to run and how often.
Profile results serve multiple purposes:
- Data explorer — users browse current and historical profile values to understand their data
- Trust scores — profiling metrics feed into the quality scoring system
- Test data sets — tests can use profile results as dynamic thresholds or comparison baselines
- Anomaly detection — historical profile trends help identify unexpected changes
If profiling is misconfigured: Key metrics are missing, trust scores are incomplete, and users lack the statistical foundation for data quality decisions.
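To make the metrics concrete, here is the arithmetic behind three of the most common profile values, computed in Python over sample rows. In a SQL template the same metrics would be SQL expressions (roughly `COUNT(*)`, a null count, and `COUNT(DISTINCT col)`); the function name and result shape here are assumptions for the sketch:

```python
# Minimal sketch of basic profile metrics over in-memory sample rows.
# In a SQL template these would be profile definition expressions
# evaluated by the database, not Python.
rows = [{"status": "open"}, {"status": None},
        {"status": "open"}, {"status": "closed"}]

def profile_column(rows, col):
    values = [r[col] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "record_count": len(values),           # COUNT(*)
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),  # COUNT(DISTINCT col)
    }

print(profile_column(rows, "status"))
# {'record_count': 4, 'null_count': 1, 'distinct_count': 2}
```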
Macros → Reusable Test Patterns
Macros bridge the gap between template-level platform knowledge and day-to-day test creation. They encode SQL patterns as parameterized snippets with metadata-linked dropdowns, enabling users to create effective tests without writing SQL from scratch.
Macros depend on metadata ingestion to populate their parameter dropdowns. A macro with Schema, Table, and Column parameters is only as useful as the metadata catalog is complete.
If macros are poorly designed: Users fall back to writing custom SQL, losing the consistency and reusability benefits. The barrier to creating tests goes up.
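The mechanic behind a macro can be pictured as a stored SQL snippet whose placeholders are filled from the user's parameter choices. The placeholder syntax (`{{...}}`) and the macro body below are assumptions for this sketch, not Validatar's actual macro format:

```python
# Illustrative macro expansion: a parameterized SQL snippet is filled
# in from metadata-linked dropdown selections. The {{...}} placeholder
# syntax is an assumption for this sketch.
MACRO_BODY = (
    "SELECT COUNT(*) AS violation_count "
    "FROM {{schema}}.{{table}} WHERE {{column}} IS NULL"
)

def expand(body, params):
    for name, value in params.items():
        body = body.replace("{{" + name + "}}", value)
    return body

sql = expand(MACRO_BODY, {"schema": "sales", "table": "orders",
                          "column": "order_id"})
print(sql)
```

The dropdown values for `schema`, `table`, and `column` come from the metadata catalog — which is why, as noted above, a macro is only as useful as the catalog is complete.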
Parameters and Execution Scripts → Session Setup
Default parameters provide Python templates with the configuration they need to connect to external systems. Execution scripts ensure database sessions are in the right state before any query runs.
These are the "plumbing" that makes everything else work. They're consumed by ingestion scripts, profiling, and macro execution.
If parameters are missing or execution scripts are wrong: Ingestion fails to connect, profiling queries error out, and test execution produces unexpected results.
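The execution order is the important part: every pre-execution statement runs on the session before the actual query does. The sketch below uses SQLite's `PRAGMA` as a stand-in for platform session statements like `USE WAREHOUSE` or `ALTER SESSION` — the target platform and statement list are assumptions for illustration:

```python
import sqlite3

# Sketch of session setup ordering: pre-execution statements run first,
# then the real query executes on the prepared session. SQLite's PRAGMA
# stands in for statements like USE WAREHOUSE or ALTER SESSION.
PRE_EXECUTION = ["PRAGMA case_sensitive_like = ON;"]

conn = sqlite3.connect(":memory:")
for stmt in PRE_EXECUTION:
    conn.execute(stmt)  # session setup first, every time
result = conn.execute("SELECT 'ready'").fetchone()[0]
print(result)  # ready
```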
SQL Template vs. Python Template: Side by Side
| Component | SQL Template | Python Template |
|---|---|---|
| General | Same: name, version, category, delimiters | Same (delimiters less relevant) |
| Data Types | Maps SQL types to Validatar types | Maps script return types to Validatar types |
| Parameters | Rarely used (connection string handles config) | Essential — API keys, file paths, connection info |
| Execution Scripts | Session setup SQL (SET statements, USE commands) | Less common — scripts manage their own setup |
| Metadata Ingestion | Three SQL scripts (schema, table, column) | One Python script returning up to 3 DataFrames |
| Profiling | Individual profile definitions with SQL expressions | Profile scripts returning metric DataFrames |
| Macros | SQL snippets with metadata-linked parameters | Same — macros work identically on both |
The key insight: macros and the general configuration are the same regardless of template type. The differences are in how the template connects to data (parameters vs. connection strings), discovers metadata (SQL vs. Python), and calculates profiles (SQL expressions vs. Python scripts).
Common Scenarios
Setting Up a Snowflake Template
Snowflake is a SQL template with some platform-specific considerations:
- General — Category: Database. Delimiters: `"/"`. Connection type: Snowflake.
- Data Types — Map Snowflake types including `VARIANT`, `OBJECT`, `ARRAY`, and `NUMBER` variants.
- Execution Scripts — Use pre-execution to set warehouse, role, and session parameters: `USE WAREHOUSE VALIDATAR_WH; USE ROLE VALIDATAR_ROLE; ALTER SESSION SET TIMEZONE = 'UTC';`
- Metadata Ingestion — Query `INFORMATION_SCHEMA.SCHEMATA`, `INFORMATION_SCHEMA.TABLES`, and `INFORMATION_SCHEMA.COLUMNS`. Filter out the `INFORMATION_SCHEMA` schema.
- Profiling — Standard profile definitions work with Snowflake SQL. Platform-specific functions like `APPROX_PERCENTILE` can improve performance for large tables.
- Macros — Standard SQL macros. Use Snowflake-specific syntax where needed (e.g., `FLATTEN` for semi-structured data).
Creating a Python Template for an API Data Source
For a REST API with Swagger documentation:
- General — Category: Script. Connection type: Python Script.
- Parameters — Define `api_base_url` (String), `api_key` (Secret), `page_size` (Integer), `environment` (Dropdown: prod/staging/dev).
- Metadata Ingestion — Python script that reads the Swagger spec to discover endpoints (schemas), resources (tables), and fields (columns).
- Profiling — Python script that samples each endpoint and calculates record counts, null counts, and distinct counts.
- Data Types — Map JSON types (`string`, `integer`, `number`, `boolean`, `array`, `object`) to Validatar types.
- Macros — May be limited if the API doesn't support ad-hoc queries. Consider macros that construct API filter parameters.
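The ingestion step for this scenario amounts to walking the spec and mapping its levels onto the catalog's levels. The sketch below uses a minimal inline Swagger 2.0-style fragment — the spec content, and the decision to treat paths as "tables" and schema properties as "columns", are assumptions for this example:

```python
# Hypothetical sketch: deriving catalog entries from a Swagger-style
# spec. The inline spec is a toy example; a real script would fetch
# and parse the API's published document.
spec = {
    "paths": {
        "/customers": {"get": {}},
        "/orders": {"get": {}},
    },
    "definitions": {
        "Customer": {"properties": {"id": {"type": "integer"},
                                    "name": {"type": "string"}}},
    },
}

# Paths become catalog "tables"; each definition's properties become
# that resource's "columns".
resources = sorted(p.strip("/") for p in spec["paths"])
fields = {name: sorted(d["properties"])
          for name, d in spec["definitions"].items()}

print(resources)           # ['customers', 'orders']
print(fields["Customer"])  # ['id', 'name']
```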
Customizing an Existing Template
When the built-in template works for most purposes but needs adjustments:
- Export the existing template as a backup
- Modify the specific component that needs adjustment:
- Add new data type mappings for custom types
- Adjust ingestion scripts to include or exclude specific schemas
- Add custom macros for domain-specific testing patterns
- Add platform-specific profile definitions
- Test with a single data source before rolling out changes
- Consider whether the customization should be a template modification or a per-data-source override
Tip: If the customization only applies to one or a few data sources, use per-data-source overrides (on the Schema Metadata or Profile Sets pages) rather than modifying the template. Template changes affect all data sources using that template.
The Template as Living Infrastructure
Data source templates aren't set-and-forget configuration. They evolve with your data platform:
- New database versions may introduce new data types that need mappings
- Schema changes may require ingestion script adjustments
- New testing patterns become macros that benefit all users on the platform
- Performance tuning may improve profiling efficiency for large databases
- Platform features (like Snowflake's semi-structured data or Databricks' Unity Catalog) may warrant new profile definitions or macro patterns
Treat templates as shared infrastructure. Version them, test changes carefully, and export backups before modifying production templates. When you develop a template that works well, consider exporting it for use in other environments or publishing it to the Validatar Marketplace.
Related Articles
- What Is a Data Source Template?
- Creating and Managing Data Source Templates
- Data Types
- Default Parameters and Execution Scripts
- Metadata Ingestion Scripts — SQL Templates
- Metadata Ingestion Scripts — Python Templates
- Macros
- Profiling Configuration — SQL Templates
- Profiling Configuration — Python Templates