Architecture — Banking Context Engineering Demo

1. High-Level Design

The system is organised as eight progressive layers. Each layer produces a persistent artifact that becomes an input to the next. Nothing flows backwards — the architecture is a directed acyclic pipeline from raw data to governed intelligence.

┌─────────────────────────────────────────────────────────────────────────────────┐ │ BANKING CONTEXT ENGINEERING STACK │ └─────────────────────────────────────────────────────────────────────────────────┘ LAYER 0 — LEGACY ESTATE (3 heterogeneous databases) ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ PostgreSQL :5433 │ │ MySQL :3306 │ │ SQLite (file) │ │ ACCT_MASTER │ │ ACCT_BALANCE_EOD │ │ REPORT_SUMMARY │ │ LOAN_HDR │ │ INTEREST_ACCR │ │ │ │ TXN_LOG │ │ │ │ │ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ │ │ │ └───────────────────────┼───────────────────────┘ ▼ LAYER 1 — DISCOVERY (DDL extraction + lineage) ┌───────────────────────────────────────────────────────────────┐ │ extract_ddl.py → 04_lineage/ddl/ + source_to_target.yml │ └──────────────────────────────────┬────────────────────────────┘ ▼ LAYER 2 — ENRICHMENT (business metadata overlay) ┌───────────────────────────────────────────────────────────────┐ │ business_metadata.yml → apply_enrichment.py → enriched_catalog.json │ └──────────────────────────────────┬────────────────────────────┘ ▼ LAYER 3 — QUALITY (profiling + contracts) ┌─────────────────────────┐ ┌──────────────────────────────────┐ │ Great Expectations │ │ ODCS Contracts (YAML) │ │ 12 rules per table │ │ savings_account.yml │ │ catches nulls + neg bal │ │ loan_position.yml │ └──────────┬──────────────┘ └─────────────────┬────────────────┘ └────────────────────────────────────┘ ▼ LAYER 4 — SEMANTIC MODEL (dbt staging + marts + ontology) ┌─────────────────────────┐ ┌──────────────────────────────────┐ │ dbt staging/marts │ │ RDFLib Ontology (banking.ttl) │ │ customer_360 │ │ Basel III risk weights │ │ portfolio_at_risk metric│ │ GDPR column annotations │ └──────────┬──────────────┘ └─────────────────┬────────────────┘ └────────────────────────────────────┘ ▼ LAYER 5 — KNOWLEDGE GRAPH (Neo4j) ┌───────────────────────────────────────────────────────────────┐ │ load_graph.py → Neo4j :7474 │ │ (Customer)─[:OWNS]→(SavingsAccount)─[:HELD_AT]→(Branch) │ │ (Customer)─[:OWNS]→(LoanAccount)─[:DISBURSED_AT]→(Branch) │ └──────────────────────────────────┬────────────────────────────┘ ▼ LAYER 6 — SEMANTIC CORE (aggregator bundle) ┌───────────────────────────────────────────────────────────────┐ │ build_semantic_core.py → semantic_core.json │ │ glossary + contracts + OSI hints + ontology TTL + graph counts│ └──────────────────────────────────┬────────────────────────────┘ ▼ LAYER 7 — CONTEXT FIBER (serving API) ┌───────────────────────────────────────────────────────────────┐ │ FastAPI :8000 /context /query /graph-rag /audit-log │ └──────────────────────────────────┬────────────────────────────┘ ▼ LAYER 8 — AI AGENT + GOVERNANCE ┌─────────────────────────┐ ┌──────────────────────────────────┐ │ agent.py (CLI) │ │ RBAC + masking + audit_log │ │ NL → GraphRAG → Cypher │ │ 5 roles, per-column masking │ │ → Neo4j → answer │ │ every query logged to JSONL │ └─────────────────────────┘ └──────────────────────────────────┘

2. Data Flow Diagram

The table below maps every artifact produced and consumed at each layer, showing exactly which files cross layer boundaries.

Step	Script / Tool	Inputs	Output Artifact	Consumed by
03 — Mock Data	`generate_data.py`	Faker library, .env	PostgreSQL tables, MySQL tables, SQLite file	All downstream layers
04 — DDL Extraction	`extract_ddl.py`	PostgreSQL, MySQL, SQLite	`04_lineage/ddl/*.sql`, `source_to_target.yml`	05 enrichment, OpenMetadata
05 — Enrichment	`apply_enrichment.py`	DDL files, `business_metadata.yml`	`enriched_catalog.json`	11 semantic core
06 — Ontology	`export_ontology.py`	`banking_ontology.py` (RDFLib)	`banking.ttl` (~120 triples)	07 graph loader, 11 semantic core
07 — Knowledge Graph	`load_graph.py`	PostgreSQL (live), `banking.ttl`	Neo4j nodes & relationships	12 API, 13 agent
08 — dbt	`dbt run`	PostgreSQL (raw tables)	Staging views, `customer_360` mart, metric YAMLs	11 semantic core
09 — Great Expectations	`run_checkpoints.py`	PostgreSQL (live)	Validation JSON, HTML data docs	Client walkthrough (visual evidence)
10 — Data Contracts	`validate_contracts.py`	PostgreSQL, contract YAMLs	PASS/FAIL/WARN console output	Client walkthrough (wow moment 2)
11 — Semantic Core	`build_semantic_core.py`	enriched_catalog.json, glossary.yml, domain_products.yml, contracts, osi_model.yml, banking.ttl, Neo4j counts	`semantic_core.json`	12 API, 13 agent
12 — Context Fiber API	`api.py` (FastAPI)	`semantic_core.json`, Neo4j, governance YAMLs	REST responses, Cypher results, audit events	13 agent, external clients
13 — AI Agent	`agent.py`	Ollama (LLM), Neo4j, `semantic_core.json`	NL answer + provenance + audit log	End user
14 — Observability	`drift_detect.py`	PostgreSQL, `baseline_stats.json`	DRIFT / OK per column	Ops / on-call

3. Component Inventory

🐘

PostgreSQL 15

Core banking — port 5433

Primary source of truth. Holds ACCT_MASTER, LOAN_HDR, TXN_LOG. Runs in Docker. Port 5433 (not default 5432) to avoid host conflicts.

1,000 accounts, 500 loans, 10,000 transactions
Intentional quality issues seeded at generation time

DockerRDBMS

🐬

MySQL 8

DW EOD snapshot — port 3306

Simulates a data warehouse end-of-day snapshot. Holds ACCT_BALANCE_EOD and INTEREST_ACCR. Values are T-1 (slightly stale vs PostgreSQL), modelling real-world 3-system inconsistency.

DockerRDBMS

📄

SQLite

MIS reporting — file-based

Lightweight file database holding REPORT_SUMMARY — pre-aggregated daily totals for the management reporting layer. Requires no container.

Embedded DB

🕸️

Neo4j 5

Knowledge graph — ports 7474 / 7687

Graph database holding 1,000 Customer nodes, 1,000 Account nodes, 500 Loan nodes, 10 Branch nodes. Relationships are ontology-aligned. Queried via Cypher.

Browser UI: localhost:7474
Bolt: neo4j://localhost:7687

DockerGraph DB

🐍

dbt Core

Semantic transformation layer

Staging models standardise raw columns; customer_360 mart computes balance_category, overdue_flag, total_exposure, Basel III RWA, and Capital Adequacy Ratio per customer.

PythonYAML

✅

Great Expectations

Data quality profiling

12 expectation rules per table. Catches the 40 null branch_codes and 3 negative balances seeded at generation time. Produces HTML data docs as visual evidence.

Python

📜

RDFLib Ontology

BankAccount class hierarchy

OWL-lite ontology in Python: BankAccount → SavingsAccount / CurrentAccount / LoanAccount → NPALoan. GDPR policy annotations on PII columns. Exported as RDF Turtle (~120 triples).

Python

⚡

FastAPI (Context Fiber)

Serving layer — port 8000

Three endpoints: /context/{id} (entity lookup), /query (NL → Cypher), /graph-rag (keyword context retrieval). Enforces RBAC masking and writes every call to audit_log.jsonl.

REST APIPython

🤖

Ollama (LLM)

llama3 / mistral — local inference

Local LLM server. Agent sends enriched prompt (question + GraphRAG context chunks) and receives a Cypher query. No cloud dependency. Switch models via OLLAMA_MODEL in .env.

Local LLM

4. Low-Level Design — Layer by Layer

Layer 0 — Legacy Estate (Mock Data Generation)

File: 03_mock_data/generate_data.py

The generator is idempotent: it truncates tables before reinserting, so running it twice leaves the database in the same state. It uses a fixed random seed for reproducibility.

generate_data.py │ ├── PostgreSQL (port 5433) │ ├── ACCT_MASTER 1,000 rows (800 SAV, 200 CUR) │ │ └── intentional: 40 null branch_code, 3 negative bal_amt │ ├── LOAN_HDR 500 rows (home / personal / auto) │ │ └── intentional: ~33 overdue_days > 90 (NPA threshold) │ └── TXN_LOG 10,000 rows (last 90 days) │ ├── MySQL (port 3306) │ ├── ACCT_BALANCE_EOD T-1 balances (±2% vs PostgreSQL) │ └── INTEREST_ACCR daily accruals │ └── SQLite (mis_reporting.db) └── REPORT_SUMMARY 30 daily rows

Layer 1 — Discovery & Lineage

Files: 04_lineage/extract_ddl.py, 04_lineage/source_to_target.yml

extract_ddl.py connects to each database via SQLAlchemy, reflects the schema using inspect(engine), and writes a CREATE TABLE DDL file per table. The source_to_target.yml is a hand-authored lineage manifest declaring the full pipeline: PostgreSQL → dbt staging → marts → Neo4j → API.

Layer 2 — Business Metadata Enrichment

Files: 05_enrichment/business_metadata.yml, 05_enrichment/apply_enrichment.py

The YAML assigns owners, domains, sensitivity classifications (PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED), and usage notes to each of the 6 datasets. apply_enrichment.py merges these properties with the DDL column list to produce enriched_catalog.json — a per-table, per-column metadata record.

Layer 3 — Data Quality

Great Expectations

Two expectation suites: acct_master_suite (11 rules) and loan_hdr_suite (10 rules). Key rules:

expect_column_values_to_not_be_null on branch_code → catches 40 nulls
expect_column_min_to_be_between(min_value=0) on bal_amt → catches 3 negatives
expect_column_values_to_be_in_set on acct_type → validates SAV/CUR/LON enum

ODCS Data Contracts

Three contracts in 10_contracts/. Each YAML specifies: schema (column names + types), freshness_sla_minutes, and a list of SQL-based quality rules. The validator executes each rule as a SQL COUNT(*) and compares to the threshold.

Wow Moment 2: The savings_account contract rule no_negative_balances FAILS because 3 accounts have bal_amt < 0. This catch happens before data ships to regulatory reporting.

Layer 4 — Semantic Model

dbt Project Structure

08_dbt_project/ ├── models/ │ ├── staging/ │ │ ├── stg_acct_master.sql # cast + rename + add acct_age_days │ │ ├── stg_loan_hdr.sql # add basel3_risk_weight, rwa_amt │ │ └── stg_txn_log.sql # cast txn_date, normalise channel │ ├── marts/ │ │ └── customer_360.sql # join stg_acct + stg_loan → exposure │ └── metrics/ │ └── portfolio_at_risk.yml # 5 dbt metrics definitions └── profiles.yml # points to postgres :5433

customer_360 Mart — computed fields

Column	Formula	Purpose
`balance_category`	CASE bal_amt: ≥500k PLATINUM, ≥100k GOLD, ≥10k SILVER, else BASIC	Customer segmentation
`overdue_flag`	overdue_days > 0	Loan health indicator
`total_exposure`	bal_amt + outstanding_amt	Credit risk view
`total_rwa`	SUM(rwa_amt) per customer	Basel III capital requirement
`estimated_car_pct`	bal_amt / total_rwa × 100	Capital adequacy ratio
`car_status`	≥8% ADEQUATE, ≥4.5% MINIMUM, else BREACH	RBI compliance flag

RDFLib Ontology (banking.ttl)

Class Hierarchy schema:FinancialProduct └── bank:BankAccount ├── bank:SavingsAccount # rdfs:comment "SAV accounts, GDPR-annotated" ├── bank:CurrentAccount └── bank:LoanAccount └── bank:NPALoan # subclass: overdue_days > 90 GDPR Annotations bank:cust_id gdpr:personalData "true"^^xsd:boolean bank:cust_id gdpr:lawfulBasis "contract" bank:bal_amt gdpr:sensitivity "financial"

Layer 5 — Knowledge Graph

File: 07_knowledge_graph/load_graph.py

Node Labels & Properties (:Customer {cust_id, name, city, ontologyClass}) (:SavingsAccount {acct_no, bal_amt, balance_category, branch_code, ontologyClass}) (:CurrentAccount {acct_no, bal_amt, branch_code, ontologyClass}) (:LoanAccount {loan_id, outstanding_amt, overdue_days, loan_type, rwa_amt, ontologyClass}) (:Branch {branch_code, city}) Relationships (Customer)-[:OWNS]->(SavingsAccount) (Customer)-[:OWNS]->(CurrentAccount) (Customer)-[:OWNS]->(LoanAccount) (SavingsAccount)-[:HELD_AT]->(Branch) (CurrentAccount)-[:HELD_AT]->(Branch) (LoanAccount)-[:DISBURSED_AT]->(Branch) Indexes CREATE INDEX ON :Customer(cust_id) CREATE INDEX ON :SavingsAccount(acct_no) CREATE INDEX ON :LoanAccount(loan_id)

Layer 6 — Semantic Core Aggregator

File: 11_semantic_core/build_semantic_core.py

This is the single integration point. It reads from all upstream artifacts and produces semantic_core.json — the sole input to the API and agent. Structure:

semantic_core.json schema: { "glossary": { termName → {definition, formula, regulatory_ref} }, "domain_products": [ {name, input_ports, output_ports, sla, lifecycle} ], "contracts": [ {dataset, rules, freshness_sla_minutes} ], "osi_fields": { tableName → [ {column, ai_context_hint, sensitivity} ] }, "ontology_excerpt": "...RDF Turtle snippet...", "graph_counts": { Customer: 1000, SavingsAccount: 800, ... }, "enriched_catalog": { tableName → {owner, domain, columns: [...]} } }

Layer 7 — Context Fiber API

File: 12_context_fiber/api.py

Endpoint Design

Endpoint	Method	Auth	What it does
`/context/{entity_id}`	GET	?role=	Neo4j entity lookup → semantic model slice → masked response per RBAC role
`/query`	POST	body.role	NL question → GraphRAG context → Ollama Cypher generation → sanitise → Neo4j execute → answer + provenance
`/graph-rag`	GET	?role=	Keyword-scored TF-IDF search over semantic_core.json → top-k context chunks
`/audit-log`	GET	role=auditor only	Last N lines from audit_log.jsonl
`/health`	GET	none	Liveness check

The /query endpoint sanitises generated Cypher before execution: it rejects any query containing CREATE, DELETE, SET, or MERGE — only read-only traversals are permitted.

Layer 8 — AI Agent & Governance

Files: 13_agent/agent.py, governance/

5. Database Schema

── PostgreSQL (core_banking) ────────────────────────────────────────── ACCT_MASTER acct_no VARCHAR(12) PK cust_id VARCHAR(10) FK → (conceptual customer) acct_type CHAR(3) CHECK IN ('SAV','CUR','LON') bal_amt NUMERIC(15,2) int_rate NUMERIC(5,3) branch_code VARCHAR(6) NULLABLE ← 40 intentional NULLs open_date DATE status VARCHAR(10) LOAN_HDR loan_id VARCHAR(12) PK cust_id VARCHAR(10) principal_amt NUMERIC(15,2) outstanding_amt NUMERIC(15,2) emi_amt NUMERIC(10,2) overdue_days INTEGER disbursement_date DATE loan_type VARCHAR(20) IN ('home_loan','personal_loan','auto_loan') TXN_LOG txn_id VARCHAR(14) PK acct_no VARCHAR(12) FK → ACCT_MASTER txn_type VARCHAR(10) IN ('CREDIT','DEBIT') amount NUMERIC(12,2) txn_date TIMESTAMP channel VARCHAR(10) IN ('branch','mobile','atm','internet') ── MySQL (dw_snapshot) ───────────────────────────────────────────────── ACCT_BALANCE_EOD acct_no VARCHAR(12) PK eod_balance NUMERIC(15,2) ← T-1 value, intentionally ≠ PostgreSQL report_date DATE branch_code VARCHAR(6) INTEREST_ACCR acct_no VARCHAR(12) accrued_interest NUMERIC(10,4) calc_date DATE ── SQLite (mis_reporting.db) ──────────────────────────────────────────── REPORT_SUMMARY report_date DATE PK total_accounts INTEGER total_deposits NUMERIC(18,2) portfolio_at_risk NUMERIC(5,4) avg_loan_balance NUMERIC(12,2)

6. API Design

Request / Response shapes

POST /query Request: { "question": "How many customers have NPA loans?", "role": "risk_officer" } Response: { "question": "...", "cypher": "MATCH (c:Customer)-[:OWNS]->(l:LoanAccount) WHERE l.overdue_days > 90 RETURN count(c) AS npa_customers", "answer": "42 customers have NPA loans (overdue > 90 days).", "provenance": { "nodes_traversed": ["Customer","LoanAccount"], "filters": ["overdue_days > 90"] }, "role": "risk_officer", "masked_columns": [] } GET /context/{entity_id}?role=data_analyst Response: { "entity_id": "ACCT0000001", "type": "SavingsAccount", "neo4j_data": { "acct_no": "ACCT0000001", "bal_amt": 245000.0, ... }, "semantic_slice": [ { "column": "bal_amt", "ai_context_hint": "current live balance...", "sensitivity": "CONFIDENTIAL" } ], "masked_fields": [], "role": "data_analyst" } GET /graph-rag?q=portfolio+at+risk&top_k=3 Response: { "query": "portfolio at risk", "top_k": 3, "chunks": [ { "source": "glossary.PAR", "score": 0.91, "text": "Portfolio at Risk: proportion of loan portfolio..." }, { "source": "osi.LOAN_HDR.overdue_days", "score": 0.74, "text": "overdue_days: AI hint — use to identify NPA loans..." }, ... ] }

7. Agent Prompt Pipeline

The agent never queries Neo4j directly. Every answer is mediated through the LLM's Cypher generation, ensuring every traversal is inspectable and logged.

User: "Which customers have both NPA loans and PLATINUM balance?" │ ▼ 1. GraphRAG retrieval → keyword-score semantic_core.json → top-3 context chunks: glossary.NPA → "overdue_days > 90" osi.ACCT_MASTER.bal_amt → "balance_category computed by customer_360" glossary.PLATINUM → "bal_amt >= 500000" │ ▼ 2. Prompt construction System: "You are a Cypher expert. Schema: [nodes + rels]. Context: [chunks]." User: "Generate Cypher for: Which customers have both NPA loans and PLATINUM balance?" │ ▼ 3. Ollama inference (llama3 / mistral, local) → raw LLM output containing Cypher block │ ▼ 4. Cypher extraction + sanitisation → parse ```cypher ... ``` block → reject if contains CREATE / DELETE / SET / MERGE → validate MATCH + RETURN present │ ▼ 5. Neo4j execution MATCH (c:Customer)-[:OWNS]->(l:LoanAccount) (c)-[:OWNS]->(a:SavingsAccount) WHERE l.overdue_days > 90 AND a.balance_category = 'PLATINUM' RETURN c.cust_id, c.name, l.overdue_days, a.bal_amt │ ▼ 6. Answer + provenance → format result rows as human-readable answer → record nodes traversed: [Customer, LoanAccount, SavingsAccount] → log to governance/audit_log.jsonl │ ▼ User: "3 customers match: CUST_0042 (overdue 112d, balance 620,000)..."

8. Governance Model

RBAC Roles & Column Masking

Role	Can see PII (cust_id, name)	Can see financials (bal_amt, outstanding_amt)	Audit log access	Typical user
`data_analyst`	✓	✓	✗	Analytics team
`risk_officer`	✓	✓	✗	Risk & credit team
`compliance_officer`	✗ (masked)	✓	✗	Compliance / Basel III
`auditor`	✓	✗ (masked)	✓ (read-only)	Internal audit
`dba`	✓	✓	✓	Database administrator

Audit Log Record

# governance/audit_log.jsonl — one JSON object per line { "timestamp": "2026-05-15T17:31:04Z", "role": "risk_officer", "endpoint": "/query", "question": "Which customers have NPA loans?", "cypher": "MATCH (c:Customer)-[:OWNS]->(l:LoanAccount) WHERE l.overdue_days > 90 ...", "result_rows": 42, "duration_ms": 143, "masked_cols": [] }

9. Architecture Decisions

Decision	Choice Made	Why	Trade-off
Graph database	Neo4j 5 (Docker)	Multi-hop relationship queries are natural; Cypher is readable for LLM generation	Heavier than SQLite; requires Docker; Bolt port must be open
LLM runtime	Ollama (local llama3/mistral)	Zero cloud dependency; demo runs offline; no API key management	Slower than cloud APIs; quality depends on local hardware (8 GB+ RAM needed)
Semantic aggregation	Single `semantic_core.json`	Decouples agent + API from each upstream layer; one file to cache/invalidate	Must be regenerated after any upstream change; grows large if catalog is big
GraphRAG implementation	TF-IDF keyword scoring over semantic_core.json	No embedding model needed; fast; deterministic; works offline	Semantic similarity blind spots; a vector store (pgvector, Chroma) would perform better
Data contract format	ODCS YAML + Python SQL validator	Human-readable; version-controllable; no external SaaS dependency	No built-in notification or CI integration out of the box
Ontology format	RDFLib (OWL-lite) → Turtle export	Python-native; no Protégé dependency; TTL is diffable in git	Not a full OWL reasoner; limited inferencing compared to Jena/OWL API
PostgreSQL port	5433 (non-standard)	Avoids conflict with any existing PostgreSQL on the host machine	All scripts must explicitly set port; easy to misconfigure
dbt profiles location	`--profiles-dir .` (local to project)	Self-contained demo; no mutation of `~/.dbt/profiles.yml`	Must always pass `--profiles-dir .` flag; non-standard dbt UX

Architecture Overview

Contents

1. High-Level Design

2. Data Flow Diagram

3. Component Inventory

4. Low-Level Design — Layer by Layer

Layer 0 — Legacy Estate (Mock Data Generation)

Layer 1 — Discovery & Lineage

Layer 2 — Business Metadata Enrichment

Layer 3 — Data Quality

Great Expectations

ODCS Data Contracts

Layer 4 — Semantic Model

dbt Project Structure

customer_360 Mart — computed fields

RDFLib Ontology (banking.ttl)

Layer 5 — Knowledge Graph

Layer 6 — Semantic Core Aggregator

Layer 7 — Context Fiber API

Endpoint Design

Layer 8 — AI Agent & Governance

5. Database Schema

6. API Design

Request / Response shapes

7. Agent Prompt Pipeline

8. Governance Model

RBAC Roles & Column Masking

Audit Log Record

9. Architecture Decisions