System Design
Architecture Overview
End-to-end data flow, component design, and technology decisions for the Banking Context Engineering demo — from legacy raw tables to an AI-queryable knowledge graph.
1. High-Level Design
The system is organised as eight progressive layers. Each layer produces a persistent artifact that becomes an input to the next. Nothing flows backwards — the architecture is a directed acyclic pipeline from raw data to governed intelligence.
┌─────────────────────────────────────────────────────────────────────────────────┐
│ BANKING CONTEXT ENGINEERING STACK │
└─────────────────────────────────────────────────────────────────────────────────┘
LAYER 0 — LEGACY ESTATE (3 heterogeneous databases)
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ PostgreSQL :5433 │ │ MySQL :3306 │ │ SQLite (file) │
│ ACCT_MASTER │ │ ACCT_BALANCE_EOD │ │ REPORT_SUMMARY │
│ LOAN_HDR │ │ INTEREST_ACCR │ │ │
│ TXN_LOG │ │ │ │ │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└───────────────────────┼───────────────────────┘
▼
LAYER 1 — DISCOVERY (DDL extraction + lineage)
┌───────────────────────────────────────────────────────────────┐
│ extract_ddl.py → 04_lineage/ddl/ + source_to_target.yml │
└──────────────────────────────────┬────────────────────────────┘
▼
LAYER 2 — ENRICHMENT (business metadata overlay)
┌───────────────────────────────────────────────────────────────┐
│ business_metadata.yml → apply_enrichment.py → enriched_catalog.json │
└──────────────────────────────────┬────────────────────────────┘
▼
LAYER 3 — QUALITY (profiling + contracts)
┌─────────────────────────┐ ┌──────────────────────────────────┐
│ Great Expectations │ │ ODCS Contracts (YAML) │
│ 12 rules per table │ │ savings_account.yml │
│ catches nulls + neg bal │ │ loan_position.yml │
└──────────┬──────────────┘ └─────────────────┬────────────────┘
└────────────────────────────────────┘
▼
LAYER 4 — SEMANTIC MODEL (dbt staging + marts + ontology)
┌─────────────────────────┐ ┌──────────────────────────────────┐
│ dbt staging/marts │ │ RDFLib Ontology (banking.ttl) │
│ customer_360 │ │ Basel III risk weights │
│ portfolio_at_risk metric│ │ GDPR column annotations │
└──────────┬──────────────┘ └─────────────────┬────────────────┘
└────────────────────────────────────┘
▼
LAYER 5 — KNOWLEDGE GRAPH (Neo4j)
┌───────────────────────────────────────────────────────────────┐
│ load_graph.py → Neo4j :7474 │
│ (Customer)─[:OWNS]→(SavingsAccount)─[:HELD_AT]→(Branch) │
│ (Customer)─[:OWNS]→(LoanAccount)─[:DISBURSED_AT]→(Branch) │
└──────────────────────────────────┬────────────────────────────┘
▼
LAYER 6 — SEMANTIC CORE (aggregator bundle)
┌───────────────────────────────────────────────────────────────┐
│ build_semantic_core.py → semantic_core.json │
│ glossary + contracts + OSI hints + ontology TTL + graph counts│
└──────────────────────────────────┬────────────────────────────┘
▼
LAYER 7 — CONTEXT FIBER (serving API)
┌───────────────────────────────────────────────────────────────┐
│ FastAPI :8000 /context /query /graph-rag /audit-log │
└──────────────────────────────────┬────────────────────────────┘
▼
LAYER 8 — AI AGENT + GOVERNANCE
┌─────────────────────────┐ ┌──────────────────────────────────┐
│ agent.py (CLI) │ │ RBAC + masking + audit_log │
│ NL → GraphRAG → Cypher │ │ 5 roles, per-column masking │
│ → Neo4j → answer │ │ every query logged to JSONL │
└─────────────────────────┘ └──────────────────────────────────┘
2. Data Flow Diagram
The table below maps every artifact produced and consumed at each layer, showing exactly which files cross layer boundaries.
| Step |
Script / Tool |
Inputs |
Output Artifact |
Consumed by |
| 03 — Mock Data |
generate_data.py |
Faker library, .env |
PostgreSQL tables, MySQL tables, SQLite file |
All downstream layers |
| 04 — DDL Extraction |
extract_ddl.py |
PostgreSQL, MySQL, SQLite |
04_lineage/ddl/*.sql, source_to_target.yml |
05 enrichment, OpenMetadata |
| 05 — Enrichment |
apply_enrichment.py |
DDL files, business_metadata.yml |
enriched_catalog.json |
11 semantic core |
| 06 — Ontology |
export_ontology.py |
banking_ontology.py (RDFLib) |
banking.ttl (~120 triples) |
07 graph loader, 11 semantic core |
| 07 — Knowledge Graph |
load_graph.py |
PostgreSQL (live), banking.ttl |
Neo4j nodes & relationships |
12 API, 13 agent |
| 08 — dbt |
dbt run |
PostgreSQL (raw tables) |
Staging views, customer_360 mart, metric YAMLs |
11 semantic core |
| 09 — Great Expectations |
run_checkpoints.py |
PostgreSQL (live) |
Validation JSON, HTML data docs |
Client walkthrough (visual evidence) |
| 10 — Data Contracts |
validate_contracts.py |
PostgreSQL, contract YAMLs |
PASS/FAIL/WARN console output |
Client walkthrough (wow moment 2) |
| 11 — Semantic Core |
build_semantic_core.py |
enriched_catalog.json, glossary.yml, domain_products.yml, contracts, osi_model.yml, banking.ttl, Neo4j counts |
semantic_core.json |
12 API, 13 agent |
| 12 — Context Fiber API |
api.py (FastAPI) |
semantic_core.json, Neo4j, governance YAMLs |
REST responses, Cypher results, audit events |
13 agent, external clients |
| 13 — AI Agent |
agent.py |
Ollama (LLM), Neo4j, semantic_core.json |
NL answer + provenance + audit log |
End user |
| 14 — Observability |
drift_detect.py |
PostgreSQL, baseline_stats.json |
DRIFT / OK per column |
Ops / on-call |
3. Component Inventory
Primary source of truth. Holds ACCT_MASTER, LOAN_HDR, TXN_LOG. Runs in Docker. Port 5433 (not default 5432) to avoid host conflicts.
- 1,000 accounts, 500 loans, 10,000 transactions
- Intentional quality issues seeded at generation time
DockerRDBMS
Simulates a data warehouse end-of-day snapshot. Holds ACCT_BALANCE_EOD and INTEREST_ACCR. Values are T-1 (slightly stale vs PostgreSQL), modelling real-world 3-system inconsistency.
DockerRDBMS
Lightweight file database holding REPORT_SUMMARY — pre-aggregated daily totals for the management reporting layer. Requires no container.
Embedded DB
Graph database holding 1,000 Customer nodes, 1,000 Account nodes, 500 Loan nodes, 10 Branch nodes. Relationships are ontology-aligned. Queried via Cypher.
- Browser UI: localhost:7474
- Bolt: neo4j://localhost:7687
DockerGraph DB
Staging models standardise raw columns; customer_360 mart computes balance_category, overdue_flag, total_exposure, Basel III RWA, and Capital Adequacy Ratio per customer.
PythonYAML
12 expectation rules per table. Catches the 40 null branch_codes and 3 negative balances seeded at generation time. Produces HTML data docs as visual evidence.
Python
OWL-lite ontology in Python: BankAccount → SavingsAccount / CurrentAccount / LoanAccount → NPALoan. GDPR policy annotations on PII columns. Exported as RDF Turtle (~120 triples).
Python
Three endpoints: /context/{id} (entity lookup), /query (NL → Cypher), /graph-rag (keyword context retrieval). Enforces RBAC masking and writes every call to audit_log.jsonl.
REST APIPython
Local LLM server. Agent sends enriched prompt (question + GraphRAG context chunks) and receives a Cypher query. No cloud dependency. Switch models via OLLAMA_MODEL in .env.
Local LLM
4. Low-Level Design — Layer by Layer
Layer 0 — Legacy Estate (Mock Data Generation)
File: 03_mock_data/generate_data.py
The generator is idempotent: it truncates tables before reinserting, so running it twice leaves the database in the same state. It uses a fixed random seed for reproducibility.
generate_data.py
│
├── PostgreSQL (port 5433)
│ ├── ACCT_MASTER 1,000 rows (800 SAV, 200 CUR)
│ │ └── intentional: 40 null branch_code, 3 negative bal_amt
│ ├── LOAN_HDR 500 rows (home / personal / auto)
│ │ └── intentional: ~33 overdue_days > 90 (NPA threshold)
│ └── TXN_LOG 10,000 rows (last 90 days)
│
├── MySQL (port 3306)
│ ├── ACCT_BALANCE_EOD T-1 balances (±2% vs PostgreSQL)
│ └── INTEREST_ACCR daily accruals
│
└── SQLite (mis_reporting.db)
└── REPORT_SUMMARY 30 daily rows
Layer 1 — Discovery & Lineage
Files: 04_lineage/extract_ddl.py, 04_lineage/source_to_target.yml
extract_ddl.py connects to each database via SQLAlchemy, reflects the schema using inspect(engine), and writes a CREATE TABLE DDL file per table. The source_to_target.yml is a hand-authored lineage manifest declaring the full pipeline: PostgreSQL → dbt staging → marts → Neo4j → API.
Layer 2 — Business Metadata Enrichment
Files: 05_enrichment/business_metadata.yml, 05_enrichment/apply_enrichment.py
The YAML assigns owners, domains, sensitivity classifications (PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED), and usage notes to each of the 6 datasets. apply_enrichment.py merges these properties with the DDL column list to produce enriched_catalog.json — a per-table, per-column metadata record.
Layer 3 — Data Quality
Great Expectations
Two expectation suites: acct_master_suite (11 rules) and loan_hdr_suite (10 rules). Key rules:
expect_column_values_to_not_be_null on branch_code → catches 40 nulls
expect_column_min_to_be_between(min_value=0) on bal_amt → catches 3 negatives
expect_column_values_to_be_in_set on acct_type → validates SAV/CUR/LON enum
ODCS Data Contracts
Three contracts in 10_contracts/. Each YAML specifies: schema (column names + types), freshness_sla_minutes, and a list of SQL-based quality rules. The validator executes each rule as a SQL COUNT(*) and compares to the threshold.
Wow Moment 2: The savings_account contract rule no_negative_balances FAILS because 3 accounts have bal_amt < 0. This catch happens before data ships to regulatory reporting.
Layer 4 — Semantic Model
dbt Project Structure
08_dbt_project/
├── models/
│ ├── staging/
│ │ ├── stg_acct_master.sql # cast + rename + add acct_age_days
│ │ ├── stg_loan_hdr.sql # add basel3_risk_weight, rwa_amt
│ │ └── stg_txn_log.sql # cast txn_date, normalise channel
│ ├── marts/
│ │ └── customer_360.sql # join stg_acct + stg_loan → exposure
│ └── metrics/
│ └── portfolio_at_risk.yml # 5 dbt metrics definitions
└── profiles.yml # points to postgres :5433
customer_360 Mart — computed fields
| Column | Formula | Purpose |
balance_category | CASE bal_amt: ≥500k PLATINUM, ≥100k GOLD, ≥10k SILVER, else BASIC | Customer segmentation |
overdue_flag | overdue_days > 0 | Loan health indicator |
total_exposure | bal_amt + outstanding_amt | Credit risk view |
total_rwa | SUM(rwa_amt) per customer | Basel III capital requirement |
estimated_car_pct | bal_amt / total_rwa × 100 | Capital adequacy ratio |
car_status | ≥8% ADEQUATE, ≥4.5% MINIMUM, else BREACH | RBI compliance flag |
RDFLib Ontology (banking.ttl)
Class Hierarchy
schema:FinancialProduct
└── bank:BankAccount
├── bank:SavingsAccount # rdfs:comment "SAV accounts, GDPR-annotated"
├── bank:CurrentAccount
└── bank:LoanAccount
└── bank:NPALoan # subclass: overdue_days > 90
GDPR Annotations
bank:cust_id gdpr:personalData "true"^^xsd:boolean
bank:cust_id gdpr:lawfulBasis "contract"
bank:bal_amt gdpr:sensitivity "financial"
Layer 5 — Knowledge Graph
File: 07_knowledge_graph/load_graph.py
Node Labels & Properties
(:Customer {cust_id, name, city, ontologyClass})
(:SavingsAccount {acct_no, bal_amt, balance_category, branch_code, ontologyClass})
(:CurrentAccount {acct_no, bal_amt, branch_code, ontologyClass})
(:LoanAccount {loan_id, outstanding_amt, overdue_days, loan_type, rwa_amt, ontologyClass})
(:Branch {branch_code, city})
Relationships
(Customer)-[:OWNS]->(SavingsAccount)
(Customer)-[:OWNS]->(CurrentAccount)
(Customer)-[:OWNS]->(LoanAccount)
(SavingsAccount)-[:HELD_AT]->(Branch)
(CurrentAccount)-[:HELD_AT]->(Branch)
(LoanAccount)-[:DISBURSED_AT]->(Branch)
Indexes
CREATE INDEX ON :Customer(cust_id)
CREATE INDEX ON :SavingsAccount(acct_no)
CREATE INDEX ON :LoanAccount(loan_id)
Layer 6 — Semantic Core Aggregator
File: 11_semantic_core/build_semantic_core.py
This is the single integration point. It reads from all upstream artifacts and produces semantic_core.json — the sole input to the API and agent. Structure:
semantic_core.json schema:
{
"glossary": { termName → {definition, formula, regulatory_ref} },
"domain_products": [ {name, input_ports, output_ports, sla, lifecycle} ],
"contracts": [ {dataset, rules, freshness_sla_minutes} ],
"osi_fields": { tableName → [ {column, ai_context_hint, sensitivity} ] },
"ontology_excerpt": "...RDF Turtle snippet...",
"graph_counts": { Customer: 1000, SavingsAccount: 800, ... },
"enriched_catalog": { tableName → {owner, domain, columns: [...]} }
}
Layer 7 — Context Fiber API
File: 12_context_fiber/api.py
Endpoint Design
| Endpoint | Method | Auth | What it does |
/context/{entity_id} |
GET |
?role= |
Neo4j entity lookup → semantic model slice → masked response per RBAC role |
/query |
POST |
body.role |
NL question → GraphRAG context → Ollama Cypher generation → sanitise → Neo4j execute → answer + provenance |
/graph-rag |
GET |
?role= |
Keyword-scored TF-IDF search over semantic_core.json → top-k context chunks |
/audit-log |
GET |
role=auditor only |
Last N lines from audit_log.jsonl |
/health |
GET |
none |
Liveness check |
The /query endpoint sanitises generated Cypher before execution: it rejects any query containing CREATE, DELETE, SET, or MERGE — only read-only traversals are permitted.
Layer 8 — AI Agent & Governance
Files: 13_agent/agent.py, governance/
5. Database Schema
── PostgreSQL (core_banking) ──────────────────────────────────────────
ACCT_MASTER
acct_no VARCHAR(12) PK
cust_id VARCHAR(10) FK → (conceptual customer)
acct_type CHAR(3) CHECK IN ('SAV','CUR','LON')
bal_amt NUMERIC(15,2)
int_rate NUMERIC(5,3)
branch_code VARCHAR(6) NULLABLE ← 40 intentional NULLs
open_date DATE
status VARCHAR(10)
LOAN_HDR
loan_id VARCHAR(12) PK
cust_id VARCHAR(10)
principal_amt NUMERIC(15,2)
outstanding_amt NUMERIC(15,2)
emi_amt NUMERIC(10,2)
overdue_days INTEGER
disbursement_date DATE
loan_type VARCHAR(20) IN ('home_loan','personal_loan','auto_loan')
TXN_LOG
txn_id VARCHAR(14) PK
acct_no VARCHAR(12) FK → ACCT_MASTER
txn_type VARCHAR(10) IN ('CREDIT','DEBIT')
amount NUMERIC(12,2)
txn_date TIMESTAMP
channel VARCHAR(10) IN ('branch','mobile','atm','internet')
── MySQL (dw_snapshot) ─────────────────────────────────────────────────
ACCT_BALANCE_EOD
acct_no VARCHAR(12) PK
eod_balance NUMERIC(15,2) ← T-1 value, intentionally ≠ PostgreSQL
report_date DATE
branch_code VARCHAR(6)
INTEREST_ACCR
acct_no VARCHAR(12)
accrued_interest NUMERIC(10,4)
calc_date DATE
── SQLite (mis_reporting.db) ────────────────────────────────────────────
REPORT_SUMMARY
report_date DATE PK
total_accounts INTEGER
total_deposits NUMERIC(18,2)
portfolio_at_risk NUMERIC(5,4)
avg_loan_balance NUMERIC(12,2)
6. API Design
Request / Response shapes
POST /query
Request:
{
"question": "How many customers have NPA loans?",
"role": "risk_officer"
}
Response:
{
"question": "...",
"cypher": "MATCH (c:Customer)-[:OWNS]->(l:LoanAccount) WHERE l.overdue_days > 90 RETURN count(c) AS npa_customers",
"answer": "42 customers have NPA loans (overdue > 90 days).",
"provenance": { "nodes_traversed": ["Customer","LoanAccount"], "filters": ["overdue_days > 90"] },
"role": "risk_officer",
"masked_columns": []
}
GET /context/{entity_id}?role=data_analyst
Response:
{
"entity_id": "ACCT0000001",
"type": "SavingsAccount",
"neo4j_data": { "acct_no": "ACCT0000001", "bal_amt": 245000.0, ... },
"semantic_slice": [ { "column": "bal_amt", "ai_context_hint": "current live balance...", "sensitivity": "CONFIDENTIAL" } ],
"masked_fields": [],
"role": "data_analyst"
}
GET /graph-rag?q=portfolio+at+risk&top_k=3
Response:
{
"query": "portfolio at risk",
"top_k": 3,
"chunks": [
{ "source": "glossary.PAR", "score": 0.91, "text": "Portfolio at Risk: proportion of loan portfolio..." },
{ "source": "osi.LOAN_HDR.overdue_days", "score": 0.74, "text": "overdue_days: AI hint — use to identify NPA loans..." },
...
]
}
7. Agent Prompt Pipeline
The agent never queries Neo4j directly. Every answer is mediated through the LLM's Cypher generation, ensuring every traversal is inspectable and logged.
User: "Which customers have both NPA loans and PLATINUM balance?"
│
▼
1. GraphRAG retrieval
→ keyword-score semantic_core.json
→ top-3 context chunks:
glossary.NPA → "overdue_days > 90"
osi.ACCT_MASTER.bal_amt → "balance_category computed by customer_360"
glossary.PLATINUM → "bal_amt >= 500000"
│
▼
2. Prompt construction
System: "You are a Cypher expert. Schema: [nodes + rels]. Context: [chunks]."
User: "Generate Cypher for: Which customers have both NPA loans and PLATINUM balance?"
│
▼
3. Ollama inference (llama3 / mistral, local)
→ raw LLM output containing Cypher block
│
▼
4. Cypher extraction + sanitisation
→ parse ```cypher ... ``` block
→ reject if contains CREATE / DELETE / SET / MERGE
→ validate MATCH + RETURN present
│
▼
5. Neo4j execution
MATCH (c:Customer)-[:OWNS]->(l:LoanAccount)
(c)-[:OWNS]->(a:SavingsAccount)
WHERE l.overdue_days > 90 AND a.balance_category = 'PLATINUM'
RETURN c.cust_id, c.name, l.overdue_days, a.bal_amt
│
▼
6. Answer + provenance
→ format result rows as human-readable answer
→ record nodes traversed: [Customer, LoanAccount, SavingsAccount]
→ log to governance/audit_log.jsonl
│
▼
User: "3 customers match: CUST_0042 (overdue 112d, balance 620,000)..."
8. Governance Model
RBAC Roles & Column Masking
| Role | Can see PII (cust_id, name) | Can see financials (bal_amt, outstanding_amt) | Audit log access | Typical user |
data_analyst | ✓ | ✓ | ✗ | Analytics team |
risk_officer | ✓ | ✓ | ✗ | Risk & credit team |
compliance_officer | ✗ (masked) | ✓ | ✗ | Compliance / Basel III |
auditor | ✓ | ✗ (masked) | ✓ (read-only) | Internal audit |
dba | ✓ | ✓ | ✓ | Database administrator |
Audit Log Record
# governance/audit_log.jsonl — one JSON object per line
{
"timestamp": "2026-05-15T17:31:04Z",
"role": "risk_officer",
"endpoint": "/query",
"question": "Which customers have NPA loans?",
"cypher": "MATCH (c:Customer)-[:OWNS]->(l:LoanAccount) WHERE l.overdue_days > 90 ...",
"result_rows": 42,
"duration_ms": 143,
"masked_cols": []
}
9. Architecture Decisions
| Decision |
Choice Made |
Why |
Trade-off |
| Graph database |
Neo4j 5 (Docker) |
Multi-hop relationship queries are natural; Cypher is readable for LLM generation |
Heavier than SQLite; requires Docker; Bolt port must be open |
| LLM runtime |
Ollama (local llama3/mistral) |
Zero cloud dependency; demo runs offline; no API key management |
Slower than cloud APIs; quality depends on local hardware (8 GB+ RAM needed) |
| Semantic aggregation |
Single semantic_core.json |
Decouples agent + API from each upstream layer; one file to cache/invalidate |
Must be regenerated after any upstream change; grows large if catalog is big |
| GraphRAG implementation |
TF-IDF keyword scoring over semantic_core.json |
No embedding model needed; fast; deterministic; works offline |
Semantic similarity blind spots; a vector store (pgvector, Chroma) would perform better |
| Data contract format |
ODCS YAML + Python SQL validator |
Human-readable; version-controllable; no external SaaS dependency |
No built-in notification or CI integration out of the box |
| Ontology format |
RDFLib (OWL-lite) → Turtle export |
Python-native; no Protégé dependency; TTL is diffable in git |
Not a full OWL reasoner; limited inferencing compared to Jena/OWL API |
| PostgreSQL port |
5433 (non-standard) |
Avoids conflict with any existing PostgreSQL on the host machine |
All scripts must explicitly set port; easy to misconfigure |
| dbt profiles location |
--profiles-dir . (local to project) |
Self-contained demo; no mutation of ~/.dbt/profiles.yml |
Must always pass --profiles-dir . flag; non-standard dbt UX |