Data Contracts
The Core Concept
Instead of writing validation logic in Python, you declare it in a YAML file following the Open Data Contract Standard (ODCS). This separates your rules from your code, making them easier to manage, version, and share.
Example hdb_resale_simple.yaml (trimmed for readability):
kind: DataContract
apiVersion: v3.1.0
name: HDB Resale Flat Prices
schema:
  - name: hdb_resale_prices
    properties:
      # SQL check: regex-based format validation
      - name: month
        logicalType: string
        quality:
          - type: sql
            name: Month
            description: Based on ISO 8601, assumed to be in UTC +8 | YYYY-MM
            mustBe: 0
            query: |-
              SELECT COUNT(*)
              FROM "hdb_resale_prices"
              WHERE CAST(month AS TEXT) !~ '^[0-9]{4}-(0[1-9]|1[0-2])$';
            dimension: conformity
      # Library metric: null-value check
      - name: town
        quality:
          - type: library
            metric: nullValues
            mustBe: 0
            dimension: completeness
      # Library metric: valid-value list
      - name: flat_type
        quality:
          - type: library
            metric: invalidValues
            mustBe: 0
            dimension: conformity
            arguments:
              validValues:
                - 1 ROOM
                - 2 ROOM
                - 3 ROOM
                - 4 ROOM
                - 5 ROOM
                - EXECUTIVE
                - MULTI-GENERATION
      # SQL check: business rule
      - name: floor_area_sqm
        quality:
          - name: floor_area_must_be_less_than_200
            description: Validates that floor area must be less than 200
            type: sql
            dimension: consistency
            query: SELECT COUNT(*) FROM "hdb_resale_prices" WHERE floor_area_sqm >= 200
            mustBe: 0
      # SQL check: resale price cap
      - name: resale_price
        quality:
          - name: resale_price_must_not_exceed_2m
            description: Resale price must not be more than 2 million SGD
            type: sql
            dimension: conformity
            query: >-
              SELECT COUNT(*) FROM "hdb_resale_prices" WHERE resale_price > 2000000
            mustBe: 0
    # Table-level library metric
    quality:
      - type: library
        metric: rowCount
        mustBeBetween:
          - 0
          - 30000000
        dimension: completeness
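To make the `month` check above concrete, the same regex can be exercised in plain Python. This is only an illustrative sketch: `count_month_violations` is a hypothetical helper (not part of vowl), and Python's `re` module stands in for the `!~` regex operator used in the contract's SQL.

```python
import re

# Same pattern as the contract's `month` check: four digits, a hyphen, then 01-12.
MONTH_PATTERN = re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])")


def count_month_violations(values):
    """Count values that do NOT match YYYY-MM (hypothetical helper, not vowl API)."""
    return sum(1 for v in values if not MONTH_PATTERN.fullmatch(str(v)))


sample = ["2017-01", "2017-12", "2017-13", "17-01", "2017-1"]
violations = count_month_violations(sample)

# The contract declares `mustBe: 0`, so the check passes only when nothing violates.
check_passes = violations == 0
print(violations, check_passes)
```

The query in the contract follows the same shape: it counts offending rows, and the `mustBe: 0` operator turns that count into a pass or fail.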
Automatic Check References
When a contract is loaded, vowl automatically builds CheckReference objects for every executable check in the contract via Contract.get_check_references_by_schema().
This includes both user-authored checks in quality blocks and synthetic checks derived from column metadata. The generated references are grouped by schema, and the auto-generated ones run before explicit quality checks.
| Reference type | Trigger in contract | JSONPath stored in the reference |
|---|---|---|
| Table check | Entry under schema-level quality | $.schema[N].quality[M] |
| Column check | Entry under property-level quality | $.schema[N].properties[M].quality[K] |
| Library column metric | type: library under property-level quality | $.schema[N].properties[M].quality[K] |
| Library table metric | type: library under schema-level quality | $.schema[N].quality[M] |
| Declared column exists check | Property has a name | $.schema[N].properties[M] |
| Logical type check | logicalType present on a property | $.schema[N].properties[M].logicalType |
| Logical type options check | Supported key under logicalTypeOptions | $.schema[N].properties[M].logicalTypeOptions.<optionKey> |
| Required check | required: true | $.schema[N].properties[M].required |
| Unique check | unique: true | $.schema[N].properties[M].unique |
| Primary key check | primaryKey: true | $.schema[N].properties[M].primaryKey |
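The mapping in the table can be sketched as a walk over a parsed contract. The snippet below is an assumption-laden illustration, not vowl's implementation: `iter_check_paths` and the `kind` labels are hypothetical, the contract is a plain Python dict, and every `logicalTypeOptions` key is emitted even though vowl only acts on supported ones.

```python
def iter_check_paths(contract):
    """Yield (kind, jsonpath) pairs for a parsed contract dict.

    Hypothetical helper for illustration; it mirrors the table above,
    not the internals of vowl's CheckReference construction.
    """
    for n, schema in enumerate(contract.get("schema", [])):
        props = schema.get("properties", [])
        # Metadata-derived (auto-generated) references come first.
        for m, prop in enumerate(props):
            base = f"$.schema[{n}].properties[{m}]"
            if "name" in prop:
                yield ("declared_column_exists", base)
            if "logicalType" in prop:
                yield ("logical_type", f"{base}.logicalType")
            for key in prop.get("logicalTypeOptions", {}):
                yield ("logical_type_option", f"{base}.logicalTypeOptions.{key}")
            for flag in ("required", "unique", "primaryKey"):
                if prop.get(flag) is True:
                    yield (flag, f"{base}.{flag}")
        # Explicit quality entries: column-level, then table-level.
        for m, prop in enumerate(props):
            for k, _ in enumerate(prop.get("quality", [])):
                yield ("column_check", f"$.schema[{n}].properties[{m}].quality[{k}]")
        for m, _ in enumerate(schema.get("quality", [])):
            yield ("table_check", f"$.schema[{n}].quality[{m}]")


contract = {
    "schema": [
        {
            "name": "hdb_resale_prices",
            "properties": [
                {
                    "name": "town",
                    "required": True,
                    "quality": [{"type": "library", "metric": "nullValues", "mustBe": 0}],
                },
            ],
            "quality": [{"type": "library", "metric": "rowCount", "mustBeGreaterThan": 0}],
        }
    ]
}

paths = list(iter_check_paths(contract))
for kind, path in paths:
    print(kind, path)
```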
Auto-Generated Checks
| Generated from | What vowl validates |
|---|---|
| name | Column declared in the contract exists in the source table |
| logicalType | Values can be cast to the declared SQL type for integer, number, boolean, date, timestamp, and time |
| logicalTypeOptions.minLength | String length is at least the configured minimum |
| logicalTypeOptions.maxLength | String length does not exceed the configured maximum |
| logicalTypeOptions.pattern | String values match the configured regex pattern |
| logicalTypeOptions.minimum | Value is greater than or equal to the configured minimum |
| logicalTypeOptions.maximum | Value is less than or equal to the configured maximum |
| logicalTypeOptions.exclusiveMinimum | Value is strictly greater than the configured minimum |
| logicalTypeOptions.exclusiveMaximum | Value is strictly less than the configured maximum |
| logicalTypeOptions.multipleOf | Value is a multiple of the configured number |
| logicalTypeOptions.format | Value satisfies the declared format (see Format Checks below) |
| required: true | Column contains no NULL values |
| unique: true | Non-null values are unique |
| primaryKey: true | Values are both unique and non-null |
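The last three rows of the table can be read as violation-counting queries. The sketch below runs them against an in-memory SQLite table; the table `demo`, the column `id`, and the SQL text are all assumptions made for illustration and are not the queries vowl emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id TEXT)")
conn.executemany("INSERT INTO demo VALUES (?)", [("a",), ("b",), ("b",), (None,)])

# required: true -> count rows where the column is NULL.
required_sql = "SELECT COUNT(*) FROM demo WHERE id IS NULL"

# unique: true -> count non-NULL rows beyond the first occurrence of each value.
unique_sql = "SELECT COUNT(id) - COUNT(DISTINCT id) FROM demo"

# primaryKey: true -> both conditions at once, so the violations are summed.
primary_key_sql = f"SELECT ({required_sql}) + ({unique_sql})"

required_violations = conn.execute(required_sql).fetchone()[0]
unique_violations = conn.execute(unique_sql).fetchone()[0]
primary_key_violations = conn.execute(primary_key_sql).fetchone()[0]

print(required_violations, unique_violations, primary_key_violations)
```

Counting violations rather than returning a boolean keeps these checks in the same shape as the hand-written SQL checks, where a result of 0 means the rule holds.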
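In practice, a property like the following (the column name and maxLength value are illustrative):
- name: street_name
  logicalType: string
  logicalTypeOptions:
    maxLength: 100
  required: true
produces three generated check references: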
| Check path | Check type |
|---|---|
| $.schema[0].properties[...] | DeclaredColumnExistsCheckReference |
| $.schema[0].properties[...].logicalTypeOptions.maxLength | LogicalTypeOptionsCheckReference |
| $.schema[0].properties[...].required | RequiredCheckReference |
Note
Because string does not currently generate a SQL cast-based type check, the logicalType entry above contributes metadata for option checks rather than a standalone type-validation query. If you use integer, number, boolean, date, timestamp, or time, vowl also generates a logicalType SQL check automatically.
Format Checks
The logicalTypeOptions.format key validates that column values conform to a declared format. The check generated depends on the column's logicalType:
Integer formats
Validates that values fall within the range of a fixed-width integer type.
| format | Min | Max |
|---|---|---|
| i8 | -128 | 127 |
| i16 | -32,768 | 32,767 |
| i32 | -2,147,483,648 | 2,147,483,647 |
| i64 | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |
| u8 | 0 | 255 |
| u16 | 0 | 65,535 |
| u32 | 0 | 4,294,967,295 |
| u64 | 0 | 18,446,744,073,709,551,615 |
i128 and u128 are recognised but skipped because their ranges exceed what SQL engines can represent.
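The bounds in the table follow directly from the bit width, which the sketch below makes explicit. `integer_format_range` is a hypothetical helper written for this page, not a vowl function; returning `None` for the 128-bit formats is simply one way to model "recognised but skipped".

```python
def integer_format_range(fmt):
    """Return (min, max) for a format such as 'i8' or 'u32'.

    Hypothetical helper: 128-bit formats yield None to mirror the
    'recognised but skipped' behaviour described above.
    """
    kind, bits = fmt[0], int(fmt[1:])
    if kind not in ("i", "u") or bits not in (8, 16, 32, 64, 128):
        raise ValueError(f"unsupported integer format: {fmt!r}")
    if bits == 128:
        return None
    if kind == "i":
        # Signed two's complement: one bit is spent on the sign.
        return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    # Unsigned: all bits carry magnitude.
    return (0, 2 ** bits - 1)


for name in ("i8", "i16", "u32", "u64", "i128"):
    print(name, integer_format_range(name))
```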
String formats
Validates values against a built-in regex pattern.
| format | What it checks |
|---|---|
| uuid | UUID v1–v5 hex format (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) |
| email | Basic local@domain.tld structure |
| ipv4 | Dotted-decimal IPv4 address (0.0.0.0 – 255.255.255.255) |
| ipv6 | Full-form colon-separated IPv6 address |
| hostname | RFC-952 hostname with TLD |
| uri | URI with a valid scheme prefix (e.g. https:, s3:) |
password, byte, and binary are recognised but skipped because they cannot be validated against data.
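To give a feel for what a built-in pattern of this kind might look like, here are two regexes written for this page. They are approximations based on the descriptions in the table, not the patterns vowl ships, so treat the exact expressions as assumptions.

```python
import re

# UUID v1-v5: the version nibble is 1-5 and the variant nibble is 8, 9, a, or b.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}"
    r"-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}"
)

# Dotted-decimal IPv4: four octets, each in the range 0-255.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
IPV4_RE = re.compile(rf"{OCTET}(\.{OCTET}){{3}}")


def is_uuid(value):
    """True when the whole string is a v1-v5 UUID (illustrative check)."""
    return UUID_RE.fullmatch(value) is not None


def is_ipv4(value):
    """True when the whole string is a dotted-decimal IPv4 address."""
    return IPV4_RE.fullmatch(value) is not None


print(is_uuid("123e4567-e89b-12d3-a456-426614174000"))
print(is_ipv4("255.255.255.255"), is_ipv4("256.0.0.1"))
```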
Number formats
f32 and f64 are recognised but produce no check — they are metadata-only hints that SQL engines do not differentiate at query time.
Date, timestamp and time formats
For date, timestamp, and time logical types, format accepts a JDK DateTimeFormatter pattern (e.g. yyyy-MM-dd, yyyy-MM-dd HH:mm:ss). vowl converts the pattern to a regex and validates that string-cast values match.
Supported JDK tokens include yyyy, yy, MM, M, dd, d, HH, H, hh, h, mm, ss, SSS (fractional seconds), and timezone offsets (X/XX/XXX/Z). Literal characters such as -, :, T, and quoted sections ('T') are preserved. If a pattern contains tokens vowl cannot translate, the check is skipped with a warning.
- name: created_at
  logicalType: timestamp
  logicalTypeOptions:
    format: "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
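The pattern-to-regex conversion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `jdk_pattern_to_regex` and the `TOKEN_REGEX` table are hypothetical, the per-token regexes are one reasonable reading of the JDK tokens listed, and returning `None` stands in for "skipped with a warning".

```python
import re

# Assumed regex for each supported JDK token (not vowl's actual table).
TOKEN_REGEX = {
    "yyyy": r"\d{4}", "yy": r"\d{2}",
    "MM": r"(0[1-9]|1[0-2])", "M": r"(1[0-2]|0?[1-9])",
    "dd": r"(0[1-9]|[12]\d|3[01])", "d": r"(3[01]|[12]\d|0?[1-9])",
    "HH": r"([01]\d|2[0-3])", "H": r"(2[0-3]|1\d|0?\d)",
    "hh": r"(0[1-9]|1[0-2])", "h": r"(1[0-2]|0?[1-9])",
    "mm": r"[0-5]\d", "ss": r"[0-5]\d", "SSS": r"\d{3}",
    "X": r"(Z|[+-]\d{2}(\d{2})?)", "XX": r"(Z|[+-]\d{4})",
    "XXX": r"(Z|[+-]\d{2}:\d{2})", "Z": r"[+-]\d{4}",
}


def jdk_pattern_to_regex(pattern):
    """Translate a JDK DateTimeFormatter pattern into a regex string.

    Returns None when a token is not in TOKEN_REGEX. Quoted-section
    handling is simplified ('' becomes a literal single quote).
    """
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "'":  # quoted literal section such as 'T'
            end = pattern.find("'", i + 1)
            if end == -1:
                raise ValueError("unterminated quote in pattern")
            out.append(re.escape(pattern[i + 1:end] or "'"))
            i = end + 1
        elif ch.isalpha():  # a run of one repeated letter forms a token
            j = i
            while j < len(pattern) and pattern[j] == ch:
                j += 1
            token = pattern[i:j]
            if token not in TOKEN_REGEX:
                return None
            out.append(TOKEN_REGEX[token])
            i = j
        else:  # separators such as '-', ':' and '.' are kept literally
            out.append(re.escape(ch))
            i += 1
    return "^" + "".join(out) + "$"


regex = jdk_pattern_to_regex("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
print(regex)
print(bool(re.fullmatch(regex, "2024-03-15T08:30:00.123+08:00")))
```

Note that a regex of this kind validates shape only: a value such as 2024-02-31 would still match, because day-of-month limits are not checked per month.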
Library Metrics (type: library)
Instead of writing SQL by hand, you can declare common data quality metrics using type: library in your quality blocks. vowl auto-generates the appropriate SQL at runtime.
Column-Level Metrics
Under a property's quality:
| metric | What it checks | Arguments |
|---|---|---|
| nullValues | Count of NULL values in the column | - |
| missingValues | Count of values matching a configurable missing-values list | arguments.missingValues: list of sentinel values (use null for SQL NULL) |
| invalidValues | Count of values that fail valid-value or pattern criteria | arguments.validValues: allowed values list and/or arguments.pattern: regex |
| duplicateValues | Count of duplicate non-NULL values in the column | - |
Table-Level Metrics
Under a schema's quality:
| metric | What it checks | Arguments |
|---|---|---|
| rowCount | Total number of rows in the table | - |
| duplicateValues | Count of duplicate rows across specified columns | arguments.properties: list of column names to check |
All library metrics support unit: "percent" to return the result as a percentage of total rows instead of an absolute count. They also accept any of the standard check operators (mustBe, mustBeGreaterThan, etc.).
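As a rough illustration of how a metric value, the percent unit, and an operator fit together, the sketch below computes a `nullValues` result against an in-memory SQLite table. The table `flats`, the `evaluate` helper, the 0-100 percentage scale, and the inclusive bounds for `mustBeBetween` are assumptions made for this example, not documented vowl behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flats (town TEXT)")
conn.executemany(
    "INSERT INTO flats VALUES (?)",
    [("BEDOK",), (None,), ("YISHUN",), ("BEDOK",)],
)

# nullValues as an absolute count.
null_count = conn.execute("SELECT COUNT(*) FROM flats WHERE town IS NULL").fetchone()[0]

# The same metric as a percentage of total rows (assumed 0-100 scale).
total_rows = conn.execute("SELECT COUNT(*) FROM flats").fetchone()[0]
null_percent = 100.0 * null_count / total_rows if total_rows else 0.0


def evaluate(check, value):
    """Apply the check's operator to a metric value (hypothetical helper)."""
    if "mustBe" in check:
        return value == check["mustBe"]
    if "mustBeGreaterThan" in check:
        return value > check["mustBeGreaterThan"]
    if "mustBeLessThan" in check:
        return value < check["mustBeLessThan"]
    if "mustBeBetween" in check:
        low, high = check["mustBeBetween"]
        return low <= value <= high  # bounds assumed inclusive
    raise ValueError("no supported operator found in check")


print(null_count, null_percent)
print(evaluate({"mustBe": 0}, null_count))
print(evaluate({"mustBeLessThan": 30}, null_percent))
```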
Example
properties:
  - name: town
    quality:
      - type: library
        metric: nullValues
        mustBe: 0
        dimension: completeness
  - name: flat_type
    quality:
      - type: library
        metric: invalidValues
        mustBe: 0
        dimension: conformity
        arguments:
          validValues:
            - 3 ROOM
            - 4 ROOM
            - 5 ROOM
            - EXECUTIVE
quality:
  - type: library
    metric: rowCount
    mustBeGreaterThan: 0
    dimension: completeness
  - type: library
    metric: duplicateValues
    mustBe: 0
    dimension: uniqueness
    arguments:
      properties:
        - month
        - block
        - street_name