Replication of Your Metadata to a Data Processing Engine

Prev Next

Overview

Validatar stores all of its operational data — metadata catalogs, test definitions, execution results, profiling metrics, trust scores, and more — in its internal repository database (SQL Server or PostgreSQL). While this database powers the application, it isn't designed for ad-hoc analytics or cross-platform reporting.

Repository replication solves this by continuously syncing Validatar's internal data to a Snowflake data processing engine. Once configured, Validatar incrementally loads repository tables into a dedicated Snowflake schema, giving your team direct SQL access to all of Validatar's operational data. This enables custom reporting, integration with BI tools, cross-environment comparisons, and any analysis that benefits from having Validatar's data in a queryable warehouse.

Prerequisites

Before configuring repository replication, you need:

  • A Snowflake data processing engine already created in Validatar (see Creating Data Processing Engines)
  • The Snowflake Results Storage license feature enabled for your Validatar environment
  • Administrator access — only users with the Global Configuration Admin role can configure replication

Configuring Repository Replication

Navigate to Settings > Configuration > Repository Replication.

Repository Replication configuration page

Step 1: Select a Snowflake Data Processing Engine

Choose the Snowflake data processing engine that will receive the replicated data. The dropdown lists all verified Snowflake engines configured in your environment.

Step 2: Specify a Schema

Enter the Snowflake schema name where replicated tables will be created. The default is REPOSITORY.

Note: The replication schema must be different from the default schema of the selected data processing engine. Validatar creates and manages this schema automatically — you don't need to create it in Snowflake ahead of time.

Step 3: Set a Sync Schedule

Click Set Sync Schedule to configure how often replication runs.

Sync schedule configuration

You can schedule replication to run:

  • Daily — once per day or multiple times per day
  • Weekly — on specific days of the week
  • Monthly — on specific days of the month

For most organizations, a daily sync is sufficient. If your team needs near-real-time access to test results or execution history, consider scheduling multiple runs per day.

Step 4: Save

Click Save to persist the configuration. Replication will begin according to the schedule you set.

How Incremental Loading Works

Repository replication uses an incremental loading strategy to minimize the data transferred on each run.

The Watermark Approach

Each replicated table tracks a watermark — the timestamp of the last successful sync, stored as DateLastSynced per table. On each replication run:

  1. Validatar queries its internal database for rows where the sync column is greater than the watermark
  2. Only new and modified rows since the last sync are selected
  3. These rows are serialized, streamed to a Snowflake staging table, and merged into the target tables
  4. The watermark advances to the latest timestamp seen

The sync column is DateModified when available, falling back to DateCreated for tables that don't track modification timestamps.

The Staging and Merge Process

Replication uses a staging-then-merge pattern:

  1. Staging — Changed rows are serialized as JSON and uploaded to a transient staging table (RepositoryReplicationStaging) in Snowflake
  2. Merge — For each table, Validatar executes a MERGE INTO statement that:
    • Updates existing rows (matched on the primary key ID)
    • Inserts new rows that don't exist in the target
  3. Cleanup — Staging records are deleted after a successful merge

This approach is efficient (only changed data is transferred), idempotent (re-running the same batch produces the same result), and handles both inserts and updates in a single operation.

Full Re-Sync

If you need to force a complete reload of specific tables — for example, after a schema change or data correction — you can reset the sync dates for individual tables. This clears the watermark and causes the next replication run to pull all rows for those tables.

What Gets Replicated

Validatar replicates 55 tables from its internal repository, covering every major area of the platform. All table names in Snowflake use SNAKE_CASE naming convention.

Metadata Catalog

Snowflake Table What It Contains
META_SCHEMA Schemas discovered from data sources
META_TABLE Tables and views in each schema
META_TABLE_VERSION Historical versions of table metadata
META_COLUMN Columns within tables
META_COLUMN_VERSION Historical versions of column metadata
META_DATA_SOURCE Data source configurations
META_DATA_PERMISSION Permissions on metadata objects
META_LINEAGE_SOURCE Lineage source mappings
META_LINEAGE_TARGET Lineage target mappings

Profiling and Trust Scores

Snowflake Table What It Contains
META_DATA_LATEST_PROFILE Most recent profile results per object
META_DATA_PROFILE_SET Profile set configurations
DATA_PROFILE_DEF Profile definition metadata
META_TABLE_TRUST_SCORE Trust scores per table
TRUST_SCORE_IMPACT_RATING Trust score impact rating configurations

Tests and Executions

Snowflake Table What It Contains
TEST Test definitions
TEST_VERSION Historical versions of test configurations
TEST_DATA_SET Data sets associated with tests
TEST_DATA_SET_TYPE Data set type classifications
TEST_EXECUTION Individual test execution results
TEST_META_LINK Links between tests and metadata objects
TEST_OVERALL_RESULT_STATUS Aggregated test result statuses

Template Tests

Snowflake Table What It Contains
TEMPLATE_TEST_VERSION Template test version configurations
TEMPLATE_TEST_DATA_SET Template test data set definitions
TEMPLATE_TEST_CHILD_OBJECT Materialized child test object mappings
TEMPLATE_TEST_EXECUTION Template test execution results

Jobs and Scheduling

Snowflake Table What It Contains
BATCH Execution batch records
JOB Job definitions
JOB_STEP Steps within jobs
JOB_STEP_EXECUTION Individual job step execution results
JOB_SCHEDULE Job schedule associations
SCHEDULE Schedule definitions
SCRIPT_EXECUTION Script execution records

Projects and Permissions

Snowflake Table What It Contains
PROJECT Project definitions
PROJECT_DATA_SOURCE Data source assignments to projects
PROJECT_MEMBER User membership in projects
PROJECT_MEMBER_ROLE Role assignments for project members

Connections

Snowflake Table What It Contains
CONNECTION Connection definitions
CONNECTION_VERSION Connection version configurations

Custom Fields

Snowflake Table What It Contains
CUSTOM_FIELD Custom field definitions
CUSTOM_FIELD_SECTION Custom field section groupings
CUSTOM_FIELD_SECTION_FIELD Fields within sections
CUSTOM_FIELD_OBJECT_VALUE Custom field values assigned to objects
CUSTOM_FIELD_OPTION Dropdown options for custom fields

Users and Roles

Snowflake Table What It Contains
USER User accounts
USER_IDENTITY User identity records (SSO, local)
USER_ROLE Role assignments
USER_GROUP User group definitions
USER_GROUP_USER Group membership
ROLE Role definitions

Configuration and System

Snowflake Table What It Contains
CODE_DEF Code definition lookup values
COMMENT Comments on objects
EVENT System events
EVENT_TYPE Event type classifications
EXCEPTION_LOG Exception and error logs
FOLDER Folder hierarchy for organizing objects
QUALITY_DIMENSION Quality dimension definitions
REPORT Report configurations
REPORT_PERMISSION Report access permissions
SEVERITY_LEVEL Severity level definitions

Helper Views

After each replication run, Validatar automatically creates and refreshes helper views in the replication schema that simplify common reporting queries:

View Purpose
VW_DATASET_HELPER_TABLE_PROFILES Joins latest profile results with profile definitions for table-level profiles — one row per table per profile metric
VW_DATASET_HELPER_COLUMN_PROFILES Same structure for column-level profiles — one row per column per profile metric
VW_DATASET_TABLES A wide view of tables with pivoted custom field values and profile metrics joined in, ready for BI consumption
VW_DATASET_COLUMNS Same structure for columns — includes custom field pivots and column-level profile metrics

The VW_DATASET_TABLES and VW_DATASET_COLUMNS views are dynamically generated based on your custom field and profile definitions. When custom fields or profile definitions change, these views are automatically rebuilt on the next replication run.

Schema Evolution

Validatar handles schema changes automatically. On every replication run, the system compares the expected column structure of each Snowflake table against the actual table in Snowflake. If a mismatch is detected — a new column was added to the Validatar model, a column type changed, or a column was removed — the Snowflake table is dropped and recreated with the correct schema. The sync date for that table is reset, triggering a full reload on the next pass.

This means you don't need to manually manage the Snowflake schema as Validatar evolves. Upgrades that add new columns to internal tables are handled transparently.

Replication Status

After the initial replication completes, the Replication Status section on the configuration page displays per-table statistics:

  • Table Name — the Snowflake table
  • Row Count — total rows in the Snowflake table
  • Unchanged — rows that haven't changed since the last sync
  • Modified — rows updated since the last sync
  • Inserted — new rows since the last sync

You can also view a comparison between your Validatar repository and Snowflake to identify any unexpected mismatches. If a table shows a discrepancy, you can clear its sync date to force a full re-sync.

Best Practices

  • Schedule replication during off-peak hours — The initial full sync can be resource-intensive. Schedule it for a time when both your Validatar instance and Snowflake have available capacity.
  • Use a dedicated schema — Keep replicated data separate from other Snowflake objects. The default REPOSITORY schema is a good choice.
  • Monitor the Replication Status page — Check periodically for unexpected mismatches between Validatar and Snowflake row counts. Small differences immediately after a sync are normal (new data may have arrived between the sync and the comparison), but persistent large mismatches indicate an issue.
  • Grant read-only access in Snowflake — Create a Snowflake role with SELECT-only permissions on the replication schema for your reporting and BI users. Don't modify the replicated tables directly — Validatar manages them.
  • Use the helper views for reporting — The VW_DATASET_TABLES and VW_DATASET_COLUMNS views are designed for BI tools and dashboards. They pre-join the most commonly needed dimensions, saving you from writing complex queries against the raw tables.

How This Fits Into the Bigger Picture

Repository replication connects Validatar's internal data to your broader analytics ecosystem. With your metadata, test results, profiling metrics, and trust scores in Snowflake, you can:

  • Build custom dashboards in tools like Tableau, Power BI, or Looker
  • Run cross-environment comparisons by replicating multiple Validatar instances to the same Snowflake account
  • Create automated alerts based on test result trends or trust score changes
  • Feed Validatar data into data governance platforms for unified metadata management
  • Perform historical analysis of testing coverage, execution trends, and data quality improvements over time

For more on data processing engines, see What Is a Data Processing Engine? and Creating Data Processing Engines.