Data Contracts

Explicit, machine-enforceable agreements between data producers and consumers — covering schema, quality, freshness, and SLA obligations.

Overview

What

A data contract is a versioned, machine-readable specification that defines: schema (columns, types), freshness SLA (data must be updated within N minutes), and quality rules (no nulls in key fields, value ranges).

How

Contracts are typically YAML or JSON files stored alongside the pipeline code. A validation script checks the live database against the contract on every pipeline run and reports PASS or FAIL with details.
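A minimal sketch of such a validation run, using only the Python standard library. The contract layout (table, columns, not_null) is a simplified, hypothetical format chosen for this sketch, and an in-memory SQLite table stands in for the live database; a real validator would connect to the actual source system.

```python
import json
import sqlite3

# Hypothetical contract, inlined as JSON so the sketch is self-contained.
CONTRACT_JSON = """
{
  "name": "savings_account",
  "table": "acct_master",
  "columns": ["acct_no", "cust_id", "bal_amt"],
  "not_null": ["acct_no", "cust_id"]
}
"""

def validate(conn, contract):
    """Return a list of (check_name, passed, detail) tuples."""
    results = []
    # Identifiers come from the contract file, which is assumed to be trusted.
    table = contract["table"]
    # Column names actually present in the table (SQLite-specific PRAGMA).
    actual = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    missing = [c for c in contract["columns"] if c not in actual]
    results.append(("schema", not missing, f"missing columns: {missing}"))
    for col in contract["not_null"]:
        if col in missing:
            continue  # already reported by the schema check
        (nulls,) = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()
        results.append((f"no_nulls:{col}", nulls == 0, f"{nulls} null value(s)"))
    return results

def report(results):
    """Print one PASS/FAIL line per check; return True only if all passed."""
    for name, ok, detail in results:
        print(f"{'PASS' if ok else 'FAIL'}  {name}: {detail}")
    return all(ok for _, ok, _ in results)

# Stand-in for the live database: two rows, one with a null cust_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct_master (acct_no TEXT, cust_id TEXT, bal_amt REAL)")
conn.executemany(
    "INSERT INTO acct_master VALUES (?, ?, ?)",
    [("A001", "C001", 100.0), ("A002", None, 250.0)],
)
contract = json.loads(CONTRACT_JSON)
results = validate(conn, contract)
passed = report(results)
```

In a pipeline, the boolean returned by report would typically decide the process exit code, so that a failed contract stops the run.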

Why

Without contracts, downstream teams discover schema changes only when something breaks. Regulatory reports built on stale or schema-drifted data fail audits. Contracts make implicit expectations explicit and enforceable.

Where

At every producer–consumer boundary: core banking → DWH, DWH → risk engine, data lake → ML models, API → downstream services.

Importance

Contracts are the primary mechanism for catching data incidents before they reach regulators or customers. They encode quality as code — not documentation.

Anatomy

What a Data Contract Contains

The breakdown below uses the Open Data Contract Standard (ODCS) as a reference.

📄 Schema Definition

Column names, data types, nullability, and primary key constraints. Any schema drift (column rename, type change) should break the contract and alert the producer team — not silently corrupt downstream reports.
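Schema drift detection can be sketched as a comparison between the column types the contract expects and the ones the database reports. The function name diff_schema and the {column: type} mapping are assumptions made for this illustration; a real check would also cover nullability and primary keys.

```python
def diff_schema(expected, actual):
    """Compare two {column: type} mappings and list every drift found."""
    drift = []
    for col, typ in expected.items():
        if col not in actual:
            drift.append(f"missing column: {col}")
        elif actual[col].upper() != typ.upper():
            drift.append(f"type change: {col} {typ} -> {actual[col]}")
    for col in actual:
        if col not in expected:
            drift.append(f"unexpected column: {col}")
    return drift

# Hypothetical example: the producer renamed bal_amt to balance.
expected = {"acct_no": "VARCHAR", "bal_amt": "DECIMAL"}
actual = {"acct_no": "VARCHAR", "balance": "DECIMAL"}
for line in diff_schema(expected, actual):
    print(line)
```

A rename shows up as one missing column plus one unexpected column, which is enough to fail the contract and alert the producer team.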

⏰ Freshness SLA

Maximum allowed age of the most recent record. For a core banking EOD snapshot, this might mean the data lands within 60 minutes after midnight. For a real-time transaction feed, it might be 5 minutes. Stale data counts as an SLA breach and halts the pipeline.
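A freshness check reduces to comparing the age of the newest record with the SLA. The sketch below assumes the newest record's timestamp has already been fetched (for example with a MAX query on a load-timestamp column) and is timezone-aware; check_freshness is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_record_at, sla_minutes, now=None):
    """Return (is_fresh, age) for the newest record against the SLA."""
    now = now or datetime.now(timezone.utc)
    age = now - latest_record_at
    return age <= timedelta(minutes=sla_minutes), age

# Fixed timestamps so the example is reproducible.
now = datetime(2024, 1, 1, 0, 20, tzinfo=timezone.utc)
latest = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)

fresh, age = check_freshness(latest, sla_minutes=15, now=now)
print("PASS" if fresh else "FAIL", f"newest record is {age} old")
```

With a 15-minute SLA the 20-minute-old record fails; with a 60-minute SLA the same record would pass.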

📊 Quality Rules

Business-logic constraints: no null acct_no, bal_amt >= 0, txn_type IN ('CR','DR'), row count within 10% of yesterday's count. These rules run automatically and are versioned with the contract.
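The four rules named above can be sketched in plain Python over rows held as dictionaries. The function name check_quality, the row layout, and the way yesterday's count is passed in are assumptions for this illustration; production checks would more likely run as SQL against the source table.

```python
def check_quality(rows, yesterday_count, tolerance=0.10):
    """Return a list of human-readable rule violations (empty means PASS)."""
    failures = []
    for i, row in enumerate(rows):
        if row.get("acct_no") is None:
            failures.append(f"row {i}: acct_no is null")
        bal = row.get("bal_amt")
        if bal is None or bal < 0:
            failures.append(f"row {i}: bal_amt {bal!r} violates bal_amt >= 0")
        if row.get("txn_type") not in ("CR", "DR"):
            failures.append(f"row {i}: txn_type {row.get('txn_type')!r} not in ('CR', 'DR')")
    # Day-over-day volume check: today's count within +/- tolerance of yesterday's.
    if abs(len(rows) - yesterday_count) > tolerance * yesterday_count:
        failures.append(
            f"row count {len(rows)} differs from yesterday's {yesterday_count} "
            f"by more than {tolerance:.0%}"
        )
    return failures

# Hypothetical sample: the second row breaks all three per-row rules.
rows = [
    {"acct_no": "A001", "bal_amt": 100.0, "txn_type": "CR"},
    {"acct_no": None, "bal_amt": -5.0, "txn_type": "XX"},
]
for failure in check_quality(rows, yesterday_count=2):
    print("FAIL", failure)
```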

👥 Roles & Ownership

Named data owner, steward, and support contact. When a contract fails at 2 AM, the on-call rotation knows exactly who owns that dataset. This prevents "not my problem" escalations.

🔗 Servers & Environments

Connection details per environment (dev, UAT, prod). The same contract validates the dataset in each environment, which exposes cases where a dataset passes in dev but fails in prod because of environment-specific data.
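One way to run the same contract against several environments is to look up the connection block by environment name and fail loudly when it is missing. In this sketch the production entry mirrors the savings_account example, while the dev entry and its host name are hypothetical, as is the helper resolve_server.

```python
CONTRACT = {
    "name": "savings_account",
    "servers": {
        "production": {
            "type": "postgresql",
            "host": "corebank-db-prod",
            "database": "finacle",
            "table": "ACCT_MASTER",
        },
        # Hypothetical dev entry, added only for illustration.
        "dev": {
            "type": "postgresql",
            "host": "corebank-db-dev",
            "database": "finacle",
            "table": "ACCT_MASTER",
        },
    },
}

def resolve_server(contract, env):
    """Return the connection block for env, or raise if the contract lacks one."""
    servers = contract["servers"]
    if env not in servers:
        raise KeyError(
            f"contract {contract['name']!r} has no server entry for {env!r}; "
            f"known environments: {sorted(servers)}"
        )
    return servers[env]

print(resolve_server(CONTRACT, "production")["host"])
```

Raising on an unknown environment avoids silently validating the wrong database when an environment name is misspelled.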

🕐 Versioning

Contracts are versioned with semantic versioning. A breaking schema change increments the major version and requires downstream teams to acknowledge before the change deploys. Non-breaking additions increment the minor version.
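The versioning rule can be sketched as a function that classifies a schema change and bumps the version accordingly. This is a simplified illustration: it treats any removed column or changed type as breaking and any added column as non-breaking, and the names required_bump and bump are hypothetical.

```python
def required_bump(old, new):
    """Classify a change between two {column: type} schemas."""
    breaking = any(col not in new or new[col] != typ for col, typ in old.items())
    if breaking:
        return "major"
    if any(col not in old for col in new):
        return "minor"
    return "none"

def bump(version, kind):
    """Apply a semantic-version bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    return version

old = {"acct_no": "VARCHAR", "bal_amt": "DECIMAL"}
added = {"acct_no": "VARCHAR", "bal_amt": "DECIMAL", "open_date": "DATE"}
renamed = {"acct_no": "VARCHAR", "balance": "DECIMAL"}

print(bump("1.2.0", required_bump(old, added)))
print(bump("1.2.0", required_bump(old, renamed)))
```

A "major" result is the point at which a deployment gate could require acknowledgement from downstream teams before the change ships.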

🤔 The Three-System Inconsistency Problem

Example: savings_account.yml
name: savings_account
version: "1.2.0"
description: Core banking savings account master table
owner: retail_banking_team@bank.com
steward: data_platform@bank.com

servers:
  production:
    type: postgresql
    host: corebank-db-prod
    database: finacle
    table: ACCT_MASTER

freshness_sla_minutes: 15

schema:
  - {name: acct_no,     type: VARCHAR,   nullable: false, primary_key: true}
  - {name: cust_id,     type: VARCHAR,   nullable: false}
  - {name: acct_type,   type: "CHAR(3)", nullable: false}
  - {name: bal_amt,     type: DECIMAL,   nullable: false}
  - {name: branch_code, type: VARCHAR,   nullable: false}   # demo data has 40 injected nulls
  - {name: open_date,   type: DATE,      nullable: false}
  - {name: status,      type: VARCHAR,   nullable: false}

quality_rules:
  - rule: no_nulls
    column: acct_no
  - rule: no_nulls
    column: cust_id
  - rule: value_range
    column: bal_amt
    min: 0
  - rule: accepted_values
    column: acct_type
    values: [SAV, CUR, LON]
  - rule: row_count_threshold
    min_rows: 900
    max_rows: 1200
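
The quality_rules list in the example is data-driven: each entry names a rule and its parameters. A rough sketch of how a validator might interpret those entries follows. The rules are copied from the example as a Python list (so no YAML parser is needed), the sample rows are synthetic, and count_violations is a hypothetical helper, not the demo's actual implementation.

```python
# Rules copied from the quality_rules section of savings_account.yml.
QUALITY_RULES = [
    {"rule": "no_nulls", "column": "acct_no"},
    {"rule": "no_nulls", "column": "cust_id"},
    {"rule": "value_range", "column": "bal_amt", "min": 0},
    {"rule": "accepted_values", "column": "acct_type", "values": ["SAV", "CUR", "LON"]},
    {"rule": "row_count_threshold", "min_rows": 900, "max_rows": 1200},
]

def count_violations(rule, rows):
    """Return how many violations one rule finds in rows (a list of dicts)."""
    kind = rule["rule"]
    if kind == "row_count_threshold":
        return 0 if rule["min_rows"] <= len(rows) <= rule["max_rows"] else 1
    values = [row[rule["column"]] for row in rows]
    if kind == "no_nulls":
        return sum(v is None for v in values)
    if kind == "value_range":
        # Nulls are left to the no_nulls rule; only real values are range-checked.
        return sum(v is not None and v < rule["min"] for v in values)
    if kind == "accepted_values":
        return sum(v not in rule["values"] for v in values)
    raise ValueError(f"unknown rule: {kind}")

# Synthetic data: 1000 rows, the first 3 with a negative balance.
rows = [
    {"acct_no": f"A{i:04d}", "cust_id": f"C{i:04d}",
     "bal_amt": -50 if i < 3 else 100, "acct_type": "SAV"}
    for i in range(1000)
]
violations = [count_violations(rule, rows) for rule in QUALITY_RULES]
for rule, n in zip(QUALITY_RULES, violations):
    target = rule.get("column", "table")
    print(f"{'PASS' if n == 0 else 'FAIL'}  {rule['rule']}({target}): {n} violation(s)")
```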

🔧 In the Demo

contracts/ — YAML Contracts + validate_contracts.py

The demo ships two contracts: savings_account.yml and loan_position.yml. Running python validate_contracts.py checks schema, freshness, and quality rules against the live databases. The run FAILs on the 40 intentionally null branch_code values and the 3 negative balances. This is the "wow moment": contract validation catches real problems before they reach a regulatory report.