System Design

Architecture Overview

End-to-end data flow, component design, and technology decisions for the Banking Context Engineering demo — from legacy raw tables to an AI-queryable knowledge graph.

Contents

  1. High-Level Design — the eight-layer model
  2. Data Flow Diagram
  3. Component Inventory
  4. Low-Level Design — layer by layer
  5. Database Schema
  6. API Design
  7. Agent Prompt Pipeline
  8. Governance Model
  9. Architecture Decisions

1. High-Level Design

The system is organised as eight progressive layers. Each layer produces a persistent artifact that becomes an input to the next. Nothing flows backwards — the architecture is a directed acyclic pipeline from raw data to governed intelligence.

┌─────────────────────────────────────────────────────────────────────────────────┐ BANKING CONTEXT ENGINEERING STACK └─────────────────────────────────────────────────────────────────────────────────┘ LAYER 0 — LEGACY ESTATE (3 heterogeneous databases) ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ PostgreSQL :5433 │ │ MySQL :3306 │ │ SQLite (file) ACCT_MASTER │ │ ACCT_BALANCE_EOD │ │ REPORT_SUMMARY LOAN_HDR │ │ INTEREST_ACCR │ │ TXN_LOG │ │ │ │ └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘ │ │ │ └───────────────────────┼───────────────────────┘ LAYER 1 — DISCOVERY (DDL extraction + lineage) ┌───────────────────────────────────────────────────────────────┐ extract_ddl.py → 04_lineage/ddl/ + source_to_target.yml └──────────────────────────────────┬────────────────────────────┘ LAYER 2 — ENRICHMENT (business metadata overlay) ┌───────────────────────────────────────────────────────────────┐ business_metadata.yml → apply_enrichment.py → enriched_catalog.json └──────────────────────────────────┬────────────────────────────┘ LAYER 3 — QUALITY (profiling + contracts) ┌─────────────────────────┐ ┌──────────────────────────────────┐ Great Expectations │ │ ODCS Contracts (YAML) 12 rules per table │ │ savings_account.yml catches nulls + neg bal │ │ loan_position.yml └──────────┬──────────────┘ └─────────────────┬────────────────┘ └────────────────────────────────────┘ LAYER 4 — SEMANTIC MODEL (dbt staging + marts + ontology) ┌─────────────────────────┐ ┌──────────────────────────────────┐ dbt staging/marts │ │ RDFLib Ontology (banking.ttl) customer_360 │ │ Basel III risk weights portfolio_at_risk metric│ │ GDPR column annotations └──────────┬──────────────┘ └─────────────────┬────────────────┘ └────────────────────────────────────┘ LAYER 5 — KNOWLEDGE GRAPH (Neo4j) ┌───────────────────────────────────────────────────────────────┐ load_graph.py → Neo4j :7474 (Customer)─[:OWNS]→(SavingsAccount)─[:HELD_AT]→(Branch) (Customer)─[:OWNS]→(LoanAccount)─[:DISBURSED_AT]→(Branch) └──────────────────────────────────┬────────────────────────────┘ LAYER 6 — SEMANTIC CORE (aggregator bundle) ┌───────────────────────────────────────────────────────────────┐ build_semantic_core.py → semantic_core.json glossary + contracts + OSI hints + ontology TTL + graph counts └──────────────────────────────────┬────────────────────────────┘ LAYER 7 — CONTEXT FIBER (serving API) ┌───────────────────────────────────────────────────────────────┐ FastAPI :8000 /context /query /graph-rag /audit-log └──────────────────────────────────┬────────────────────────────┘ LAYER 8 — AI AGENT + GOVERNANCE ┌─────────────────────────┐ ┌──────────────────────────────────┐ agent.py (CLI) │ │ RBAC + masking + audit_log NL → GraphRAG → Cypher │ │ 5 roles, per-column masking → Neo4j → answer │ │ every query logged to JSONL └─────────────────────────┘ └──────────────────────────────────┘

2. Data Flow Diagram

The table below maps every artifact produced and consumed at each layer, showing exactly which files cross layer boundaries.

Step Script / Tool Inputs Output Artifact Consumed by
03 — Mock Data generate_data.py Faker library, .env PostgreSQL tables, MySQL tables, SQLite file All downstream layers
04 — DDL Extraction extract_ddl.py PostgreSQL, MySQL, SQLite 04_lineage/ddl/*.sql, source_to_target.yml 05 enrichment, OpenMetadata
05 — Enrichment apply_enrichment.py DDL files, business_metadata.yml enriched_catalog.json 11 semantic core
06 — Ontology export_ontology.py banking_ontology.py (RDFLib) banking.ttl (~120 triples) 07 graph loader, 11 semantic core
07 — Knowledge Graph load_graph.py PostgreSQL (live), banking.ttl Neo4j nodes & relationships 12 API, 13 agent
08 — dbt dbt run PostgreSQL (raw tables) Staging views, customer_360 mart, metric YAMLs 11 semantic core
09 — Great Expectations run_checkpoints.py PostgreSQL (live) Validation JSON, HTML data docs Client walkthrough (visual evidence)
10 — Data Contracts validate_contracts.py PostgreSQL, contract YAMLs PASS/FAIL/WARN console output Client walkthrough (wow moment 2)
11 — Semantic Core build_semantic_core.py enriched_catalog.json, glossary.yml, domain_products.yml, contracts, osi_model.yml, banking.ttl, Neo4j counts semantic_core.json 12 API, 13 agent
12 — Context Fiber API api.py (FastAPI) semantic_core.json, Neo4j, governance YAMLs REST responses, Cypher results, audit events 13 agent, external clients
13 — AI Agent agent.py Ollama (LLM), Neo4j, semantic_core.json NL answer + provenance + audit log End user
14 — Observability drift_detect.py PostgreSQL, baseline_stats.json DRIFT / OK per column Ops / on-call

3. Component Inventory

🐘
PostgreSQL 15
Core banking — port 5433
Primary source of truth. Holds ACCT_MASTER, LOAN_HDR, TXN_LOG. Runs in Docker. Port 5433 (not default 5432) to avoid host conflicts.
  • 1,000 accounts, 500 loans, 10,000 transactions
  • Intentional quality issues seeded at generation time
DockerRDBMS
🐬
MySQL 8
DW EOD snapshot — port 3306
Simulates a data warehouse end-of-day snapshot. Holds ACCT_BALANCE_EOD and INTEREST_ACCR. Values are T-1 (slightly stale vs PostgreSQL), modelling real-world 3-system inconsistency.
DockerRDBMS
📄
SQLite
MIS reporting — file-based
Lightweight file database holding REPORT_SUMMARY — pre-aggregated daily totals for the management reporting layer. Requires no container.
Embedded DB
🕸️
Neo4j 5
Knowledge graph — ports 7474 / 7687
Graph database holding 1,000 Customer nodes, 1,000 Account nodes, 500 Loan nodes, 10 Branch nodes. Relationships are ontology-aligned. Queried via Cypher.
  • Browser UI: localhost:7474
  • Bolt: neo4j://localhost:7687
DockerGraph DB
🐍
dbt Core
Semantic transformation layer
Staging models standardise raw columns; customer_360 mart computes balance_category, overdue_flag, total_exposure, Basel III RWA, and Capital Adequacy Ratio per customer.
PythonYAML
Great Expectations
Data quality profiling
12 expectation rules per table. Catches the 40 null branch_codes and 3 negative balances seeded at generation time. Produces HTML data docs as visual evidence.
Python
📜
RDFLib Ontology
BankAccount class hierarchy
OWL-lite ontology in Python: BankAccount → SavingsAccount / CurrentAccount / LoanAccount → NPALoan. GDPR policy annotations on PII columns. Exported as RDF Turtle (~120 triples).
Python
FastAPI (Context Fiber)
Serving layer — port 8000
Three endpoints: /context/{id} (entity lookup), /query (NL → Cypher), /graph-rag (keyword context retrieval). Enforces RBAC masking and writes every call to audit_log.jsonl.
REST APIPython
🤖
Ollama (LLM)
llama3 / mistral — local inference
Local LLM server. Agent sends enriched prompt (question + GraphRAG context chunks) and receives a Cypher query. No cloud dependency. Switch models via OLLAMA_MODEL in .env.
Local LLM

4. Low-Level Design — Layer by Layer

Layer 0 — Legacy Estate (Mock Data Generation)

File: 03_mock_data/generate_data.py

The generator is idempotent: it truncates tables before reinserting, so running it twice leaves the database in the same state. It uses a fixed random seed for reproducibility.

generate_data.py │ ├── PostgreSQL (port 5433) │ ├── ACCT_MASTER 1,000 rows (800 SAV, 200 CUR) │ │ └── intentional: 40 null branch_code, 3 negative bal_amt │ ├── LOAN_HDR 500 rows (home / personal / auto) │ │ └── intentional: ~33 overdue_days > 90 (NPA threshold) │ └── TXN_LOG 10,000 rows (last 90 days) │ ├── MySQL (port 3306) │ ├── ACCT_BALANCE_EOD T-1 balances (±2% vs PostgreSQL) │ └── INTEREST_ACCR daily accruals │ └── SQLite (mis_reporting.db) └── REPORT_SUMMARY 30 daily rows

Layer 1 — Discovery & Lineage

Files: 04_lineage/extract_ddl.py, 04_lineage/source_to_target.yml

extract_ddl.py connects to each database via SQLAlchemy, reflects the schema using inspect(engine), and writes a CREATE TABLE DDL file per table. The source_to_target.yml is a hand-authored lineage manifest declaring the full pipeline: PostgreSQL → dbt staging → marts → Neo4j → API.

Layer 2 — Business Metadata Enrichment

Files: 05_enrichment/business_metadata.yml, 05_enrichment/apply_enrichment.py

The YAML assigns owners, domains, sensitivity classifications (PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED), and usage notes to each of the 6 datasets. apply_enrichment.py merges these properties with the DDL column list to produce enriched_catalog.json — a per-table, per-column metadata record.

Layer 3 — Data Quality

Great Expectations

Two expectation suites: acct_master_suite (11 rules) and loan_hdr_suite (10 rules). Key rules:

ODCS Data Contracts

Three contracts in 10_contracts/. Each YAML specifies: schema (column names + types), freshness_sla_minutes, and a list of SQL-based quality rules. The validator executes each rule as a SQL COUNT(*) and compares to the threshold.

Wow Moment 2: The savings_account contract rule no_negative_balances FAILS because 3 accounts have bal_amt < 0. This catch happens before data ships to regulatory reporting.

Layer 4 — Semantic Model

dbt Project Structure

08_dbt_project/ ├── models/ │ ├── staging/ │ │ ├── stg_acct_master.sql # cast + rename + add acct_age_days │ │ ├── stg_loan_hdr.sql # add basel3_risk_weight, rwa_amt │ │ └── stg_txn_log.sql # cast txn_date, normalise channel │ ├── marts/ │ │ └── customer_360.sql # join stg_acct + stg_loan → exposure │ └── metrics/ │ └── portfolio_at_risk.yml # 5 dbt metrics definitions └── profiles.yml # points to postgres :5433

customer_360 Mart — computed fields

ColumnFormulaPurpose
balance_categoryCASE bal_amt: ≥500k PLATINUM, ≥100k GOLD, ≥10k SILVER, else BASICCustomer segmentation
overdue_flagoverdue_days > 0Loan health indicator
total_exposurebal_amt + outstanding_amtCredit risk view
total_rwaSUM(rwa_amt) per customerBasel III capital requirement
estimated_car_pctbal_amt / total_rwa × 100Capital adequacy ratio
car_status≥8% ADEQUATE, ≥4.5% MINIMUM, else BREACHRBI compliance flag

RDFLib Ontology (banking.ttl)

Class Hierarchy schema:FinancialProduct └── bank:BankAccount ├── bank:SavingsAccount # rdfs:comment "SAV accounts, GDPR-annotated" ├── bank:CurrentAccount └── bank:LoanAccount └── bank:NPALoan # subclass: overdue_days > 90 GDPR Annotations bank:cust_id gdpr:personalData "true"^^xsd:boolean bank:cust_id gdpr:lawfulBasis "contract" bank:bal_amt gdpr:sensitivity "financial"

Layer 5 — Knowledge Graph

File: 07_knowledge_graph/load_graph.py

Node Labels & Properties (:Customer {cust_id, name, city, ontologyClass}) (:SavingsAccount {acct_no, bal_amt, balance_category, branch_code, ontologyClass}) (:CurrentAccount {acct_no, bal_amt, branch_code, ontologyClass}) (:LoanAccount {loan_id, outstanding_amt, overdue_days, loan_type, rwa_amt, ontologyClass}) (:Branch {branch_code, city}) Relationships (Customer)-[:OWNS]->(SavingsAccount) (Customer)-[:OWNS]->(CurrentAccount) (Customer)-[:OWNS]->(LoanAccount) (SavingsAccount)-[:HELD_AT]->(Branch) (CurrentAccount)-[:HELD_AT]->(Branch) (LoanAccount)-[:DISBURSED_AT]->(Branch) Indexes CREATE INDEX ON :Customer(cust_id) CREATE INDEX ON :SavingsAccount(acct_no) CREATE INDEX ON :LoanAccount(loan_id)

Layer 6 — Semantic Core Aggregator

File: 11_semantic_core/build_semantic_core.py

This is the single integration point. It reads from all upstream artifacts and produces semantic_core.json — the sole input to the API and agent. Structure:

semantic_core.json schema: { "glossary": { termName → {definition, formula, regulatory_ref} }, "domain_products": [ {name, input_ports, output_ports, sla, lifecycle} ], "contracts": [ {dataset, rules, freshness_sla_minutes} ], "osi_fields": { tableName → [ {column, ai_context_hint, sensitivity} ] }, "ontology_excerpt": "...RDF Turtle snippet...", "graph_counts": { Customer: 1000, SavingsAccount: 800, ... }, "enriched_catalog": { tableName → {owner, domain, columns: [...]} } }

Layer 7 — Context Fiber API

File: 12_context_fiber/api.py

Endpoint Design

EndpointMethodAuthWhat it does
/context/{entity_id} GET ?role= Neo4j entity lookup → semantic model slice → masked response per RBAC role
/query POST body.role NL question → GraphRAG context → Ollama Cypher generation → sanitise → Neo4j execute → answer + provenance
/graph-rag GET ?role= Keyword-scored TF-IDF search over semantic_core.json → top-k context chunks
/audit-log GET role=auditor only Last N lines from audit_log.jsonl
/health GET none Liveness check
The /query endpoint sanitises generated Cypher before execution: it rejects any query containing CREATE, DELETE, SET, or MERGE — only read-only traversals are permitted.

Layer 8 — AI Agent & Governance

Files: 13_agent/agent.py, governance/

5. Database Schema

── PostgreSQL (core_banking) ────────────────────────────────────────── ACCT_MASTER acct_no VARCHAR(12) PK cust_id VARCHAR(10) FK → (conceptual customer) acct_type CHAR(3) CHECK IN ('SAV','CUR','LON') bal_amt NUMERIC(15,2) int_rate NUMERIC(5,3) branch_code VARCHAR(6) NULLABLE ← 40 intentional NULLs open_date DATE status VARCHAR(10) LOAN_HDR loan_id VARCHAR(12) PK cust_id VARCHAR(10) principal_amt NUMERIC(15,2) outstanding_amt NUMERIC(15,2) emi_amt NUMERIC(10,2) overdue_days INTEGER disbursement_date DATE loan_type VARCHAR(20) IN ('home_loan','personal_loan','auto_loan') TXN_LOG txn_id VARCHAR(14) PK acct_no VARCHAR(12) FK → ACCT_MASTER txn_type VARCHAR(10) IN ('CREDIT','DEBIT') amount NUMERIC(12,2) txn_date TIMESTAMP channel VARCHAR(10) IN ('branch','mobile','atm','internet') ── MySQL (dw_snapshot) ───────────────────────────────────────────────── ACCT_BALANCE_EOD acct_no VARCHAR(12) PK eod_balance NUMERIC(15,2) ← T-1 value, intentionally ≠ PostgreSQL report_date DATE branch_code VARCHAR(6) INTEREST_ACCR acct_no VARCHAR(12) accrued_interest NUMERIC(10,4) calc_date DATE ── SQLite (mis_reporting.db) ──────────────────────────────────────────── REPORT_SUMMARY report_date DATE PK total_accounts INTEGER total_deposits NUMERIC(18,2) portfolio_at_risk NUMERIC(5,4) avg_loan_balance NUMERIC(12,2)

6. API Design

Request / Response shapes

POST /query Request: { "question": "How many customers have NPA loans?", "role": "risk_officer" } Response: { "question": "...", "cypher": "MATCH (c:Customer)-[:OWNS]->(l:LoanAccount) WHERE l.overdue_days > 90 RETURN count(c) AS npa_customers", "answer": "42 customers have NPA loans (overdue > 90 days).", "provenance": { "nodes_traversed": ["Customer","LoanAccount"], "filters": ["overdue_days > 90"] }, "role": "risk_officer", "masked_columns": [] } GET /context/{entity_id}?role=data_analyst Response: { "entity_id": "ACCT0000001", "type": "SavingsAccount", "neo4j_data": { "acct_no": "ACCT0000001", "bal_amt": 245000.0, ... }, "semantic_slice": [ { "column": "bal_amt", "ai_context_hint": "current live balance...", "sensitivity": "CONFIDENTIAL" } ], "masked_fields": [], "role": "data_analyst" } GET /graph-rag?q=portfolio+at+risk&top_k=3 Response: { "query": "portfolio at risk", "top_k": 3, "chunks": [ { "source": "glossary.PAR", "score": 0.91, "text": "Portfolio at Risk: proportion of loan portfolio..." }, { "source": "osi.LOAN_HDR.overdue_days", "score": 0.74, "text": "overdue_days: AI hint — use to identify NPA loans..." }, ... ] }

7. Agent Prompt Pipeline

The agent never queries Neo4j directly. Every answer is mediated through the LLM's Cypher generation, ensuring every traversal is inspectable and logged.

User: "Which customers have both NPA loans and PLATINUM balance?" │ ▼ 1. GraphRAG retrieval → keyword-score semantic_core.json → top-3 context chunks: glossary.NPA → "overdue_days > 90" osi.ACCT_MASTER.bal_amt → "balance_category computed by customer_360" glossary.PLATINUM → "bal_amt >= 500000" │ ▼ 2. Prompt construction System: "You are a Cypher expert. Schema: [nodes + rels]. Context: [chunks]." User: "Generate Cypher for: Which customers have both NPA loans and PLATINUM balance?" │ ▼ 3. Ollama inference (llama3 / mistral, local) → raw LLM output containing Cypher block │ ▼ 4. Cypher extraction + sanitisation → parse ```cypher ... ``` block → reject if contains CREATE / DELETE / SET / MERGE → validate MATCH + RETURN present │ ▼ 5. Neo4j execution MATCH (c:Customer)-[:OWNS]->(l:LoanAccount) (c)-[:OWNS]->(a:SavingsAccount) WHERE l.overdue_days > 90 AND a.balance_category = 'PLATINUM' RETURN c.cust_id, c.name, l.overdue_days, a.bal_amt │ ▼ 6. Answer + provenance → format result rows as human-readable answer → record nodes traversed: [Customer, LoanAccount, SavingsAccount] → log to governance/audit_log.jsonl │ ▼ User: "3 customers match: CUST_0042 (overdue 112d, balance 620,000)..."

8. Governance Model

RBAC Roles & Column Masking

RoleCan see PII (cust_id, name)Can see financials (bal_amt, outstanding_amt)Audit log accessTypical user
data_analystAnalytics team
risk_officerRisk & credit team
compliance_officer✗ (masked)Compliance / Basel III
auditor✗ (masked)✓ (read-only)Internal audit
dbaDatabase administrator

Audit Log Record

# governance/audit_log.jsonl — one JSON object per line { "timestamp": "2026-05-15T17:31:04Z", "role": "risk_officer", "endpoint": "/query", "question": "Which customers have NPA loans?", "cypher": "MATCH (c:Customer)-[:OWNS]->(l:LoanAccount) WHERE l.overdue_days > 90 ...", "result_rows": 42, "duration_ms": 143, "masked_cols": [] }

9. Architecture Decisions

Decision Choice Made Why Trade-off
Graph database Neo4j 5 (Docker) Multi-hop relationship queries are natural; Cypher is readable for LLM generation Heavier than SQLite; requires Docker; Bolt port must be open
LLM runtime Ollama (local llama3/mistral) Zero cloud dependency; demo runs offline; no API key management Slower than cloud APIs; quality depends on local hardware (8 GB+ RAM needed)
Semantic aggregation Single semantic_core.json Decouples agent + API from each upstream layer; one file to cache/invalidate Must be regenerated after any upstream change; grows large if catalog is big
GraphRAG implementation TF-IDF keyword scoring over semantic_core.json No embedding model needed; fast; deterministic; works offline Semantic similarity blind spots; a vector store (pgvector, Chroma) would perform better
Data contract format ODCS YAML + Python SQL validator Human-readable; version-controllable; no external SaaS dependency No built-in notification or CI integration out of the box
Ontology format RDFLib (OWL-lite) → Turtle export Python-native; no Protégé dependency; TTL is diffable in git Not a full OWL reasoner; limited inferencing compared to Jena/OWL API
PostgreSQL port 5433 (non-standard) Avoids conflict with any existing PostgreSQL on the host machine All scripts must explicitly set port; easy to misconfigure
dbt profiles location --profiles-dir . (local to project) Self-contained demo; no mutation of ~/.dbt/profiles.yml Must always pass --profiles-dir . flag; non-standard dbt UX