Tabular synthesis

Benchmark-certified at 95.69 %. Five engines. One sealed contract.

Five tabular engines — a flagship engine at benchmark-certified fidelity, plus a parametric engine for small data, a score-based diffusion engine for minority-class preservation, a relational engine for multi-table schemas, and a schema-only fast path — all driven from one sealed contract and all producing the same cryptographic evidence bundle. Verified offline by anyone with the open-source evidence CLI.

Benchmark-certified 95.69 % · 5 engines · 1 sealed contract · Cryptographic evidence bundle · Deterministic byte-for-byte

Anchor benchmark

Benchmark-certified at 95.69 % under an independent QA harness.

Canonical shared-split public benchmark (adult-income dataset, mixed-type columns). The flagship engine at its default configuration reaches 95.69 % overall QA score on the held-out test partition under a third-party QA harness. Per-column similarity, correlation preservation, and privacy metrics are emitted into the sealed utility report artefact on every run.

Reproducible against the same public benchmark using matched train / test splits. Full per-release benchmark certificates are published in the customer dashboard; run instructions available to enterprise customers under NDA.

Download a real 9-artefact bundle
Engine            | KS ↑   | Corr Δ ↓ | QA overall | Runtime
Flagship engine   | strong | strong   | 95.69 %    | ~6 min
Diffusion engine  | strong | strong   | —          | ~11 min
Relational engine | strong | strong   | —          | ~8 min
Parametric engine | strong | strong   | —          | < 1 s
Schema engine     | n/a    | n/a      | —          | < 1 s

Source: internal benchmark harness run with the same seed across engines, reproducible on a single host, with an independent QA pass. The schema engine is schema-only and does not produce distributional metrics by design. Full dataset + split citation on /verify.

Five engines — one contract

The tabular engine lineup.

Each engine targets a distinct regime. The platform routes to the optimal engine based on schema structure, column types, row count, and constraint requirements — or you pick one explicitly via the SDK. Every engine exposes the same sealed contract interface and produces the same 9-artefact BLAKE3 bundle.

Flagship engine

Flagship · Benchmark-certified
Flagship

Benchmark-certified at 95.69 % under an independent QA harness — the default for production datasets.

The flagship tabular engine handles heavy-tailed columns, high-cardinality categoricals, and mixed continuous-discrete joint distributions natively without manual pre-processing. The training, sampling, and aggregate-repair stages are proprietary; every run is driven by your sealed contract and the output is signed into the evidence bundle. Default for any dataset above 100 rows.

  • Production-grade handling of heavy tails, rare categoricals and mixed-type joints
  • Aggregate-repair stage that corrects marginal drift on roll-up tables
  • Deterministic: same seed + same contract → same bytes, every run
  • Every artefact seals into the cryptographic evidence chain
  • Same contract surface as the other four engines — swap without re-wiring

Best for

Mixed-type tabular, heavy-tailed financial / industrial, high-cardinality categoricals, 100 – 10 M rows

Parametric engine

Parametric · Small-data path

Classical parametric path for sub-100-row datasets where training a flow model would overfit.

The parametric engine is the small-data path for datasets where training a flow-matching model is statistically impractical. Per-column marginals are learned independently and stitched back together under a correlation structure estimated from the source. Zero hyperparameter search; runs in under a second on any dataset under 10 K rows. Same contract interface and evidence bundle as every other engine.

  • Per-column marginal preservation with stable-tail handling
  • Pairwise correlation structure reproduced across numeric columns
  • No training step — closed-form fit completes in milliseconds
  • Deterministic: same seed + contract → byte-identical samples
  • Same sealed evidence bundle as the flagship

Best for

Under 100 rows, fully-numeric surveys, quick baselines, audit comparison against the flagship
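The marginal-fit-plus-correlation stitching described above can be sketched as a classical Gaussian copula — a standard small-data technique used here purely for illustration. The `copula_synthesize` helper and its signature are hypothetical, not the SDK's API or the engine's actual algorithm:

```python
import numpy as np
from scipy import stats

def copula_synthesize(columns, n_rows, seed=0):
    # Hypothetical sketch of a classical parametric path: fit each marginal
    # empirically, couple the columns with a Gaussian copula estimated from
    # the source, then sample jointly. Not the vendor's actual algorithm.
    rng = np.random.default_rng(seed)
    names = list(columns)
    data = np.column_stack([columns[c] for c in names])
    n = data.shape[0]
    # Rank-transform each column to normal scores (empirical CDF -> probit).
    scores = stats.norm.ppf(stats.rankdata(data, axis=0) / (n + 1))
    corr = np.corrcoef(scores, rowvar=False)
    # Sample correlated normals, push back through each empirical quantile.
    z = rng.multivariate_normal(np.zeros(len(names)), corr, size=n_rows)
    u = stats.norm.cdf(z)
    return {c: np.quantile(data[:, i], u[:, i]) for i, c in enumerate(names)}
```

Because the whole fit is closed-form, a fixed seed reproduces the samples exactly — the same determinism property the engines advertise.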

Diffusion engine

Score-based · Minority-aware

Score-based denoising diffusion for tabular distributions where minority-class preservation matters most.

A score-based denoising diffusion model adapted for mixed-type tabular data. Continuous columns and categorical columns each travel their own learned forward process; the sampler runs deterministically under a contract seed. Particularly strong on datasets where minority-class preservation is critical — fraud, anomaly, imbalanced classification. Exposed through the same contract surface and the same evidence bundle.

  • Separate learned forward processes for continuous and categorical columns
  • Deterministic sampler under a sealed seed for byte-exact re-runs
  • Column-aware loss weighting balances mixed-type reconstruction
  • Strong behaviour on rare joint events other engines over-smooth
  • Same contract + same evidence bundle as every other engine

Best for

Minority-class-sensitive datasets, fraud / anomaly training sets, imbalanced classification

Relational engine

Relational · Multi-table

Cross-table synthesis that preserves referential integrity across parent–child schemas.

The relational engine synthesises multi-table datasets by walking the foreign-key graph in dependency order and conditioning each child table on the parent records already materialised. Referential integrity is guaranteed by construction — every child row's foreign key points to a parent row that was generated first. Cardinality distributions (one-to-many, many-to-many bridges) are reproduced from the source. Output passes the same constraint-report gate as single-table engines.

  • Foreign-key dependency order — parents first, children in reference order
  • Per-edge cardinality distribution reproduced from the source
  • Referential integrity is 100 % by construction
  • Same contract interface as the single-table engines
  • Same sealed evidence bundle on every run

Best for

Multi-table relational databases, parent-child schemas, bridge / junction tables
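The parents-first walk of the foreign-key graph can be illustrated with Python's standard-library topological sorter. The `fk_map` shape and `generation_order` helper are illustrative assumptions, not the engine's internals:

```python
from graphlib import TopologicalSorter

def generation_order(fk_map):
    # fk_map maps each table to the set of parent tables it references.
    # TopologicalSorter treats those parents as predecessors, so the static
    # order is parents-first; a circular reference raises CycleError.
    return list(TopologicalSorter(fk_map).static_order())

fk_map = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},  # bridge table, two parents
}
order = generation_order(fk_map)
```

Materialising tables in this order is what makes referential integrity hold by construction: every foreign key a child row needs already points at a generated parent.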

Schema engine

Schema-only · Smoke & demos

Schema-preserving fast path when you only need the shape — not the distribution.

The schema engine is the fast path: it preserves the foreign-key graph, table cardinalities, and column dtype structure without training per-table models. It generates statistically-plausible values from column-type priors in well under a second. Default for integration testing, UI demos, and any case where referential integrity matters but distributional fidelity does not.

  • Same foreign-key dependency traversal as the relational engine
  • Per-dtype priors — no per-table training step
  • Cardinality per edge estimated from schema metadata only
  • Deterministic seed → same skeleton every time
  • Same sealed evidence bundle as every other engine

Best for

Integration tests, UI demos, schema-level API testing, data-platform smoke runs

SDK

One call. Five engines. Verifiable output.

The SDK is a thin Python client: you hand it a source (S3 path, Snowflake table, CSV upload), a sealed contract, a seed, and the engine selector. The platform compiles the sealed contract, runs the engine, emits the 9-artefact bundle, and blocks your thread until either the quality gates pass or the fail-closed policy aborts.

  • Deterministic: same sealed contract + seed → byte-identical output
  • Fail-closed: DCR < 0.05 or MIA-AUC > 0.65 → job aborts before emitting
  • Offline verifiable: evidence verifier CLI re-runs the BLAKE3 chain locally
  • Engine-agnostic API: swap flagship → diffusion with one string
SDK reference
tabular_synthesize.py
import os

from radmah_sdk import RadMah

client = RadMah(api_key=os.environ["RADMAH_API_KEY"])

# Flagship engine (default) — 95.69 % benchmark-certified fidelity
job = client.synthesize(
    source="s3://acme-prod/raw/customers_2026q1.parquet",
    engine="flagship",              # alternates available in the docs
    rows=1_000_000,
    contract={
        "pk":           ["customer_id"],
        "constraints":  ["balance >= 0", "age BETWEEN 18 AND 120"],
        "privacy":      {"membership_risk_ceiling": "strict"},
    },
    seed=42,                        # sealed contract + seed = byte-identical re-run
)

job.wait()                          # blocks until quality gates pass or fail-closed
print(job.evidence.utility_report)  # {'ks_median': 0.97, 'corr_frobenius': 0.02, ...}

# Verify the 9-artefact BLAKE3 bundle offline
client.verify(job.evidence.bundle_path)
# → BundleVerified(hsfg_seal=..., chain_ok=True, artefact_count=9)

Measured fidelity

Quality metrics — emitted, not claimed.

Every synthetic dataset ships with quantitative fidelity and privacy measurements inside the evidence bundle. No subjective claims — the bundle proves the numbers, and the evidence verifier confirms the chain offline.

Distributional similarity

Strong per-column match between real and synthetic marginals

Per-column distributional similarity is measured between the real and synthetic marginals on every run. The flagship engine has been independently QA-certified at 95.69 % fidelity on a reference benchmark. The full set of metrics lands in the sealed utility report artefact — see /verify for a downloadable bundle.
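Per-column similarity scores of this kind are commonly computed as one minus the two-sample Kolmogorov–Smirnov statistic. This sketch assumes that convention — the `column_similarity` helper is hypothetical, and the platform's exact metric definition lives in the utility report:

```python
import numpy as np
from scipy.stats import ks_2samp

def column_similarity(real, synth):
    # 1 - KS statistic per column: 1.0 means indistinguishable marginals,
    # 0.0 means completely disjoint supports. A common convention, assumed
    # here; not necessarily the platform's exact metric.
    return {c: 1.0 - ks_2samp(real[c], synth[c]).statistic for c in real}
```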

Correlation preservation

Inter-column dependency structure preserved, with a fail-closed gate

The pairwise correlation matrix is compared between the real and synthetic slices to quantify how well inter-column dependency structure is preserved. Every run either passes the correlation gate and ships — or aborts with a fail-closed quality failure. Exact numbers land in the sealed utility report.
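A gate of this shape can be sketched as a Frobenius-norm check on the delta between the two correlation matrices. The threshold value and the `correlation_gate` helper are illustrative assumptions, not the platform's actual gate:

```python
import numpy as np

def correlation_gate(real, synth, max_frobenius=0.2):
    # Frobenius norm of the correlation-matrix delta between the real and
    # synthetic slices; the run ships only when the delta is under the
    # threshold (threshold value is an assumption for illustration).
    delta = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    score = float(np.linalg.norm(delta, "fro"))
    return score, score <= max_frobenius
```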

Privacy risk metrics

Membership-inference, attribute-inference, and disclosure metrics — all emitted

Nearest-neighbour distance distribution, membership-inference resistance, attribute-inference resistance, and disclosure-risk metrics are all measured on the synthetic output against the source and written into the privacy report. Zero PII is guaranteed by construction: synthetic rows are sampled from the learned joint, never copied from source records.
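The nearest-neighbour distance metric (often called distance-to-closest-record, DCR) can be sketched in a few lines. This brute-force broadcasting version is illustrative only — production implementations would use an index structure:

```python
import numpy as np

def distance_to_closest_record(synth, real):
    # Pairwise Euclidean distances via broadcasting, then the minimum per
    # synthetic row. A DCR of exactly 0 flags a verbatim copy of a source
    # record — the failure mode the privacy gate exists to catch.
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1)
    return d.min(axis=1)
```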

Sealed evidence bundle

Multi-artefact, cryptographically chained, verifiable offline

Every run emits a contract snapshot, determinism report, constraint report, utility report, privacy report, run telemetry, engine manifest, artefact index, and a chain seal. The open-source verifier CLI replays the cryptographic chain end-to-end, offline, no network. Third parties can independently confirm every metric.
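The chain-replay idea can be sketched with a standard-library hash. `hashlib.blake2b` stands in here for BLAKE3, which is not in the Python stdlib, and the artefact list and seal format are invented for illustration — the real bundle layout is defined by the verifier CLI:

```python
import hashlib

def seal_chain(artefacts):
    # Hash chain sketch (hashlib.blake2b stands in for BLAKE3): each
    # artefact's digest folds in the previous link, so modifying any
    # artefact changes every later link and the final seal.
    link = b""
    for name, payload in artefacts:
        link = hashlib.blake2b(link + name.encode() + payload).digest()
    return link.hex()

def verify_chain(artefacts, expected_seal):
    # Offline verification is just a replay: no network, no trusted party.
    return seal_chain(artefacts) == expected_seal
```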

Enterprise compliance

Audit-ready from day one.

Cryptographic evidence bundles, privacy risk metrics, and deterministic reproducibility give your compliance team everything a DPIA, ISO 27001 data-sharing control, or SOC 2 audit needs.

Cryptographic evidence bundles

Every synthetic generation produces a signed multi-artefact bundle covering the sealed contract, determinism proof, constraint and utility reports, privacy metrics, run telemetry, engine manifest, artefact index, and release seal. Cryptographic hashes chain every artefact; any modification breaks the chain and the evidence verifier refuses to certify.

GDPR Article 89 alignment

Synthetic data that does not relate to an identified or identifiable natural person falls outside the material scope of GDPR Articles 5–15. Our evidence bundles document the generation process and a full set of privacy-risk metrics — disclosure risk, membership-inference resistance, and attribute-inference resistance — supporting Article 89 research-exemption claims and DPIA submissions.

Audit-ready provenance chain

Evidence bundles provide the traceability, reproducibility, and integrity verification compliance auditors require. Every generation is deterministic — same sealed contract + same seed → byte-identical output on any host, any time. The audit trail is cryptographically immutable.

Zero PII in output by construction

Synthetic records are sampled from the learned joint distribution, never copied from source rows. Privacy reports measure disclosure risk, membership-inference resistance, and attribute-inference resistance for every run. When risk exceeds the enterprise-configurable thresholds, the job fails closed before emitting any synthetic data.

Cross-border-transfer enabler

Synthetic output enables cross-border development and analytics without transferring personal data. Teams in different jurisdictions work on statistically-faithful datasets while source data remains in its sovereign storage. Removes the need for Standard Contractual Clauses on the synthetic artefact.

Deterministic reproducibility

A sealed contract plus a seed produces byte-identical synthetic output on any host, at any time. A cryptographically-strong seed-reproducible RNG with cross-platform deterministic arithmetic guarantees consistency across clusters and cloud providers. The determinism report in the evidence bundle records the exact RNG state at every checkpoint.
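A seed-reproducible, cross-platform RNG check can be sketched with NumPy's counter-based Philox generator. Whether the platform actually uses Philox is an assumption — the pattern (seed in, digest out, compare digests across hosts) is what the determinism report formalises:

```python
import hashlib
import numpy as np

def sample_digest(seed, n):
    # Philox is a counter-based generator with identical output on every
    # platform; hashing the raw sample bytes gives a digest two hosts can
    # compare to prove byte-identical re-runs. Engine choice is an assumption.
    rng = np.random.Generator(np.random.Philox(seed))
    return hashlib.sha256(rng.random(n).tobytes()).hexdigest()
```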

Bring a CSV. Leave with an evidence bundle.

30-minute working session: you upload (or we mock) a representative dataset, we run it through the flagship and one alternate engine, and you keep the signed 9-artefact BLAKE3 bundle plus the utility report. All five engines are available on every plan — Free, Sovereign, and Enterprise — with tier-specific credit allocations.