What
A data contract is a versioned, machine-readable specification that defines: schema (columns, types), freshness SLA (data must be updated within N minutes), and quality rules (no nulls in key fields, value ranges).
How
Contracts are typically YAML or JSON files stored alongside the code. A validation script checks the live database against the contract on every pipeline run and reports PASS or FAIL with details.
Why
Without contracts, downstream teams discover schema changes by breakage. Regulatory reports built on stale or schema-drifted data fail audits. Contracts make implicit expectations explicit and enforceable.
Where
Between every producer–consumer boundary: core banking → DWH, DWH → risk engine, data lake → ML models, API → downstream services.
Importance
Contracts are the primary mechanism for catching data incidents before they reach regulators or customers. They encode quality as code — not documentation.
What a Data Contract Contains
The components below use the Open Data Contract Standard (ODCS) as a reference.
📄 Schema Definition
Column names, data types, nullability, and primary key constraints. Any schema drift (column rename, type change) should break the contract and alert the producer team — not silently corrupt downstream reports.
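A drift check of this kind can be sketched as a comparison between the columns the contract expects and the columns the live table reports. This is a minimal illustration, not the demo's implementation: the `Column` class and `diff_schema` function are hypothetical names, and the "live" schema is hard-coded rather than read from a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    type: str
    nullable: bool


def diff_schema(expected: list[Column], actual: list[Column]) -> list[str]:
    """Return human-readable drift findings; an empty list means no drift."""
    findings = []
    actual_by_name = {c.name: c for c in actual}
    expected_names = {c.name for c in expected}
    for col in expected:
        live = actual_by_name.get(col.name)
        if live is None:
            # A rename shows up as a missing column plus an unexpected one.
            findings.append(f"missing column: {col.name}")
            continue
        if live.type != col.type:
            findings.append(f"type change on {col.name}: {col.type} -> {live.type}")
        if live.nullable and not col.nullable:
            findings.append(f"{col.name} became nullable")
    for col in actual:
        if col.name not in expected_names:
            findings.append(f"unexpected column: {col.name}")
    return findings


expected = [Column("acct_no", "VARCHAR", False), Column("bal_amt", "DECIMAL", False)]
# Assumed live schema for illustration: bal_amt was renamed to balance.
actual = [Column("acct_no", "VARCHAR", False), Column("balance", "DECIMAL", False)]
drift = diff_schema(expected, actual)
```

Any non-empty `drift` list would be treated as a contract failure and routed to the producer team.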
⏰ Freshness SLA
Maximum allowed age of the most recent record. For a core banking EOD snapshot, this might mean the snapshot must land within 60 minutes after midnight. For a real-time transaction feed, it might be 5 minutes. Stale data is an SLA breach and halts the pipeline.
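The freshness rule reduces to comparing the age of the newest record with the SLA. The sketch below assumes the latest record timestamp has already been fetched; `check_freshness` is a hypothetical helper, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(
    latest_record_at: datetime, sla_minutes: int, now: datetime
) -> tuple[bool, float]:
    """Return (within_sla, age_in_minutes) for the newest record."""
    age_minutes = (now - latest_record_at).total_seconds() / 60
    return age_minutes <= sla_minutes, age_minutes


# Illustrative timestamps: the newest record is 20 minutes old.
now = datetime(2024, 1, 2, 0, 30, tzinfo=timezone.utc)
latest = now - timedelta(minutes=20)

ok_feed, age = check_freshness(latest, sla_minutes=5, now=now)   # real-time feed SLA
ok_eod, _ = check_freshness(latest, sla_minutes=60, now=now)     # EOD snapshot SLA
```

Passing `now` in explicitly keeps the check deterministic in tests; a production run would likely use the current UTC time.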
📊 Quality Rules
Business-logic constraints: no null acct_no, bal_amt >= 0, txn_type IN ('CR','DR'), row count must be within 10% of yesterday. These rules run automatically and are versioned with the contract.
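The four example rules above could be evaluated roughly as follows. This is a sketch over in-memory rows, assuming each row is a dict keyed by column name; `check_quality` and the sample rows are hypothetical, and a real validator would more likely push these checks down into SQL.

```python
ACCEPTED_TXN_TYPES = {"CR", "DR"}


def check_quality(rows: list[dict], yesterday_count: int) -> list[str]:
    """Return one message per rule violation; an empty list means all rules pass."""
    failures = []
    for i, row in enumerate(rows):
        if row.get("acct_no") is None:
            failures.append(f"row {i}: acct_no is null")
        bal = row.get("bal_amt")
        if bal is not None and bal < 0:
            failures.append(f"row {i}: bal_amt {bal} is negative")
        if row.get("txn_type") not in ACCEPTED_TXN_TYPES:
            failures.append(f"row {i}: txn_type {row.get('txn_type')!r} not accepted")
    # Row count must stay within 10% of yesterday's count.
    if yesterday_count and abs(len(rows) - yesterday_count) / yesterday_count > 0.10:
        failures.append(
            f"row count {len(rows)} deviates more than 10% from {yesterday_count}"
        )
    return failures


# Illustrative rows: the second one violates three rules.
rows = [
    {"acct_no": "A1", "bal_amt": 100, "txn_type": "CR"},
    {"acct_no": None, "bal_amt": -5, "txn_type": "XX"},
]
failures = check_quality(rows, yesterday_count=2)
```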
👥 Roles & Ownership
Named data owner, steward, and support contact. When a contract fails at 2 AM, the on-call rotation knows exactly who owns that dataset. Prevents "not my problem" escalations.
🔗 Servers & Environments
Connection details per environment (dev, UAT, prod). The same contract validates the dataset in each environment, which surfaces cases where the data passes in dev but fails in prod because of environment-specific data.
🕐 Versioning
Contracts are versioned with semantic versioning. A breaking schema change increments the major version and requires downstream teams to acknowledge before the change deploys. Non-breaking additions increment the minor version.
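One way to automate the major/minor decision is to diff the old and new column maps. The sketch below assumes a simplified rule: removing a column or changing its type is breaking, and adding a column is not. `required_bump` and `bump` are hypothetical helpers, and a real policy may treat more changes (for example nullability) as breaking.

```python
def required_bump(old_cols: dict[str, str], new_cols: dict[str, str]) -> str:
    """Classify a schema change as 'major', 'minor', or 'none'.

    Both arguments map column name -> type.
    """
    removed_or_changed = any(
        name not in new_cols or new_cols[name] != col_type
        for name, col_type in old_cols.items()
    )
    if removed_or_changed:
        return "major"
    if set(new_cols) - set(old_cols):
        return "minor"
    return "none"


def bump(version: str, kind: str) -> str:
    """Apply a semantic-version bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    return version


old = {"acct_no": "VARCHAR", "bal_amt": "DECIMAL"}
added = {**old, "status": "VARCHAR"}                  # non-breaking addition
renamed = {"acct_no": "VARCHAR", "balance": "DECIMAL"}  # breaking rename

after_addition = bump("1.2.0", required_bump(old, added))
after_rename = bump("1.2.0", required_bump(old, renamed))
```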
🤔 The Three-System Inconsistency Problem
- Core Banking (PostgreSQL): Live. bal_amt reflects intraday transactions. Updated continuously.
- DW Snapshot (MySQL): ACCT_BALANCE_EOD is T-1, last night's balance. Intentionally stale by design.
- MIS Reporting (SQLite): Aggregated weekly report with a 7-day lag. Used for board dashboards.
- The Problem: A risk analyst querying all three sees three different "balances" for the same account. Without contracts, they don't know which is correct for their use case.
- The Solution: Data contracts declare the expected staleness of each system. Consumers choose the right system for their SLA. The AI agent knows to use live PostgreSQL for exposure queries, not the MIS layer.
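Once each system declares its staleness, choosing a source becomes a lookup. The sketch below is illustrative only: the `Source` class, `pick_source`, the source names, and the staleness figures (15 minutes, 1 day, 7 days) are assumptions chosen to mirror the three systems above, not values taken from the actual contracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    max_staleness_minutes: int  # assumed figure, declared by the source's contract


SOURCES = [
    Source("core_banking_postgresql", 15),
    Source("dw_snapshot_mysql", 24 * 60),
    Source("mis_reporting_sqlite", 7 * 24 * 60),
]


def pick_source(tolerated_staleness_minutes: int) -> Source:
    """Pick the stalest source that still meets the consumer's tolerance.

    Preferring the stalest acceptable source keeps load off the live system.
    """
    candidates = [
        s for s in SOURCES if s.max_staleness_minutes <= tolerated_staleness_minutes
    ]
    if not candidates:
        raise ValueError("no source meets the requested freshness")
    return max(candidates, key=lambda s: s.max_staleness_minutes)


exposure_source = pick_source(30)            # exposure query: needs near-live data
dashboard_source = pick_source(14 * 24 * 60)  # board dashboard: two weeks is fine
```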
name: savings_account
version: "1.2.0"
description: Core banking savings account master table
owner: retail_banking_team@bank.com
steward: data_platform@bank.com
servers:
  production:
    type: postgresql
    host: corebank-db-prod
    database: finacle
    table: ACCT_MASTER
freshness_sla_minutes: 15
schema:
  - {name: acct_no, type: VARCHAR, nullable: false, primary_key: true}
  - {name: cust_id, type: VARCHAR, nullable: false}
  - {name: acct_type, type: CHAR(3), nullable: false}
  - {name: bal_amt, type: DECIMAL, nullable: false}
  - {name: branch_code, type: VARCHAR, nullable: false}  # 40 nulls injected!
  - {name: open_date, type: DATE, nullable: false}
  - {name: status, type: VARCHAR, nullable: false}
quality_rules:
  - rule: no_nulls
    column: acct_no
  - rule: no_nulls
    column: cust_id
  - rule: value_range
    column: bal_amt
    min: 0
  - rule: accepted_values
    column: acct_type
    values: [SAV, CUR, LON]
  - rule: row_count_threshold
    min_rows: 900
    max_rows: 1200
contracts/ — YAML Contracts + validate_contracts.py
The demo ships two contracts: savings_account.yml and loan_position.yml.
Running python validate_contracts.py checks schema, freshness, and quality rules against the live databases.
It will FAIL on the 40 intentionally null branch_code values and the 3 negative balances —
exactly the "wow moment" where contract validation catches real problems before they reach a regulatory report.
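The kind of check that produces this failure can be sketched with an in-memory SQLite table standing in for the live database. This is not the demo's validate_contracts.py: the `CONTRACT` dict, the `validate` function, and the three toy rows are hypothetical, and the contract is written as a Python dict here to avoid depending on a YAML parser.

```python
import sqlite3

# Hypothetical, simplified stand-in for a parsed contract file.
CONTRACT = {
    "table": "ACCT_MASTER",
    "not_null": ["acct_no", "cust_id", "branch_code"],
    "min_value": {"bal_amt": 0},
}


def validate(conn: sqlite3.Connection, contract: dict) -> list[str]:
    """Return a list of failure messages; an empty list means PASS."""
    # Table and column names come from the trusted contract, not user input,
    # so they are interpolated directly; values still use bound parameters.
    table = contract["table"]
    failures = []
    for column in contract["not_null"]:
        (nulls,) = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
        ).fetchone()
        if nulls:
            failures.append(f"{column}: {nulls} null value(s)")
    for column, minimum in contract["min_value"].items():
        (low,) = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {column} < ?", (minimum,)
        ).fetchone()
        if low:
            failures.append(f"{column}: {low} value(s) below {minimum}")
    return failures


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ACCT_MASTER (acct_no TEXT, cust_id TEXT, bal_amt REAL, branch_code TEXT)"
)
conn.executemany(
    "INSERT INTO ACCT_MASTER VALUES (?, ?, ?, ?)",
    [
        ("A1", "C1", 250.0, "B01"),
        ("A2", "C2", -10.0, "B02"),  # toy negative balance
        ("A3", "C3", 75.5, None),    # toy null branch_code
    ],
)
failures = validate(conn, CONTRACT)
print("FAIL" if failures else "PASS", failures)
```

Returning the full list of failures, rather than stopping at the first one, matches the "PASS or FAIL with details" behaviour described earlier.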