erp-synth — technical deep dive

Quick start — 30 seconds

Three steps. No accounts, no API keys, no cloud setup.

Step 1 · Install

Editable, with charts

git clone https://github.com/scripts-and-tables/erp-synthetic-data-generator
cd erp-synthetic-data-generator
pip install -e ".[charts]"

Step 2 · Generate

One reproducible command

erp-synth \
  --seed 42 \
  --market us \
  --n-customers 1000 \
  --date-from 2015-01-01 \
  --date-till 2025-12-31

Step 3 · Analyze

Plug into pandas

import pandas as pd

lines   = pd.read_csv(
  "output_csv/sales_lines.csv")
headers = pd.read_csv(
  "output_csv/invoice_headers.csv")

The full CLI surface is documented in the CLI reference (section 6). To verify a freshly generated dataset, run python scripts/verify.py output_csv.
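
Once both frames are loaded, a typical first step is to join lines to headers and aggregate. The sketch below is illustrative only: the join key `invoice_id` and the date column `order_date` are assumed names that this page does not confirm, and the two small in-memory frames stand in for the generated CSV files.

```python
import pandas as pd


def monthly_revenue(headers: pd.DataFrame, lines: pd.DataFrame) -> pd.Series:
    """Sum line_total per calendar month of the invoice's order date.

    Assumed (not confirmed by the docs): both frames share an `invoice_id`
    key and the header's order date column is named `order_date`.
    """
    merged = lines.merge(
        headers[["invoice_id", "order_date"]],
        on="invoice_id",
        how="left",
        validate="many_to_one",  # each line may match at most one header
    )
    month = pd.to_datetime(merged["order_date"]).dt.to_period("M")
    return merged.groupby(month)["line_total"].sum()


# Tiny in-memory stand-ins so the sketch runs without the generated files.
demo_headers = pd.DataFrame(
    {"invoice_id": [1, 2], "order_date": ["2015-01-05", "2015-02-10"]}
)
demo_lines = pd.DataFrame(
    {"invoice_id": [1, 1, 2], "line_total": [10.0, 5.0, 7.5]}
)
demo_monthly = monthly_revenue(demo_headers, demo_lines)
print(demo_monthly)
```

With real output, pass the `headers` and `lines` frames from step 3 instead of the demo frames, after adjusting the two assumed column names to match the generated files.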

Schema — 4 dimensions + 5 facts, fully relational

Header / line invoice split mirrors AdventureWorks' SalesOrderHeader / SalesOrderDetail, with three dates per invoice (order, ship, due), full money decomposition (subtotal → discount → tax → freight → grand_total), and explicit gross-margin tracking on every line. Plus voice-of-customer extensions: marketing spend, support tickets, NPS surveys.

Figure: schema diagram showing items, promotions, customers, stores, invoice_headers and sales_lines, connected by foreign-key arrows.

File	Type	Sample rows	Key fields
`items.csv`	dim	98	`product_id`, `list_price`, `standard_cost`, `subcategory`
`customers.csv`	dim	1,000	`customer_id`, demographics, geography, `cohort`, `price_sensitivity`, `acquisition_channel`
`stores.csv`	dim	8	`store_id`, `latitude`, `longitude`, `store_type`
`promotions.csv`	dim	66	`promotion_id`, `discount_pct`, `category_scope`
`invoice_headers.csv`	fact	54,100	3 dates, full money decomp, `payment_method`, `is_return`
`sales_lines.csv`	fact	100,627	`quantity`, `unit_price`, `discount_pct`, `line_total`, `gross_margin`
`marketing_spend.csv`	fact	792	per-month-per-channel spend with holiday boost
`support_tickets.csv`	fact	284	`category`, `priority`, resolution time, `csat_score`
`nps_surveys.csv`	fact	2,765	quarterly survey, ~55% response, score 0–10
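
The money decomposition on invoice_headers (subtotal, discount, tax, freight, grand_total) lends itself to a simple reconciliation check. The following is a minimal sketch, not the logic in scripts/verify.py: it assumes tax is charged on the discounted subtotal, that freight is untaxed, and that amounts are rounded to cents. The function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def decompose_invoice(line_totals, discount, vat_rate, freight):
    """Return the header money fields for one invoice.

    Assumptions for illustration only: tax is charged on the discounted
    subtotal, freight is untaxed, and tax is rounded to cents.
    """
    subtotal = sum((Decimal(str(x)) for x in line_totals), Decimal("0"))
    discount = Decimal(str(discount))
    taxable = subtotal - discount
    tax = (taxable * Decimal(str(vat_rate))).quantize(CENT, ROUND_HALF_UP)
    freight = Decimal(str(freight))
    return {
        "subtotal": subtotal,
        "discount": discount,
        "tax": tax,
        "freight": freight,
        "grand_total": taxable + tax + freight,
    }


def reconciles(header):
    """Check that the stored grand_total equals the sum of its parts."""
    expected = (
        header["subtotal"] - header["discount"] + header["tax"] + header["freight"]
    )
    return header["grand_total"] == expected


invoice = decompose_invoice([19.99, 5.01], discount=2.50, vat_rate=0.08, freight=4.00)
print(invoice["grand_total"], reconciles(invoice))
```

Using `Decimal` rather than floats keeps the equality test exact, which matters when the same check runs over tens of thousands of invoices.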

The customer behavior model

Each customer is permanently assigned to one of six cohorts via random.Random(seed ^ customer_id), so for a given seed the same customer always gets the same cohort across runs. Daily buy probability is then modulated by seasonality, day-of-week, holiday bumps, and the customer's price sensitivity.

Cohort	Buy prob (Yr 1→4)	Lost / day	Refill basket	Price sens.
LOYAL_HEAVY (10%)	6% → 10%	0.0001	up to 5 refills	0.20
LOYAL_LIGHT (20%)	3% → 5%	0.0001	1–4 refills	0.40
GROWING (20%)	2% → 8%	0.0002	1–4 refills	0.55
DECLINING (20%)	6% → 1%	0.0003	smaller	0.70
CHURN_RISK (10%)	4% → 0.5%	0.0006	small	0.90
ONE_SHOT (20%)	spike → 0%	0.0008	small	0.85
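
The sticky assignment can be sketched as follows, using the cohort weights from the table. This is an illustration under the assumption that `customer_id` is an integer and that the draw is a single weighted choice; the function name `assign_cohort` is hypothetical.

```python
import random

# Cohort names and weights copied from the table above.
COHORT_WEIGHTS = {
    "LOYAL_HEAVY": 0.10,
    "LOYAL_LIGHT": 0.20,
    "GROWING": 0.20,
    "DECLINING": 0.20,
    "CHURN_RISK": 0.10,
    "ONE_SHOT": 0.20,
}


def assign_cohort(seed: int, customer_id: int) -> str:
    """Pick a cohort from a throwaway RNG keyed on (seed, customer_id)."""
    rng = random.Random(seed ^ customer_id)
    names = list(COHORT_WEIGHTS)
    weights = list(COHORT_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


# The same (seed, customer_id) pair always yields the same cohort.
print(assign_cohort(42, 7) == assign_cohort(42, 7))
```

Because the RNG is rebuilt from the key on every call, the result does not depend on how many other random draws happened earlier in the run. XOR is symmetric, so two different (seed, customer_id) pairs can map to the same stream; within one run the seed is fixed, so this does not cause collisions between customers.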

p_buy_day = clip(
    cohort.p_buy_by_year[year_idx]            # cohort schedule
    × month_factor[month]                     # Nov 1.6, Dec 1.8, Jan 0.7, ...
    × dow_factor[dow]                         # weekend 1.15
    × holiday_bump(date)                      # Black Friday 2.5, Christmas 1.8, Eid 2.0
    × (1 - 0.20 × price_sensitivity),
    0, 1
)

So a LOYAL_HEAVY customer on Black Friday in their fourth year buys with p ≈ 31%, while the same customer on a Tuesday in February buys with p ≈ 8%. The customer also has a sticky brand_affinity that filters their product pool toward one device family.
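
The formula above translates to Python roughly as follows. Only the factors quoted in the formula comments are used; every month and weekday not quoted there is assumed to have a factor of 1.0, and the holiday bump is passed in by the caller rather than looked up. Because of these assumptions, the values this sketch produces need not match the percentages quoted in the text.

```python
from datetime import date

# Values quoted in the formula comments above.
MONTH_FACTOR = {11: 1.6, 12: 1.8, 1: 0.7}
WEEKEND_FACTOR = 1.15
# Assumption: months and weekdays not quoted above default to 1.0.
DEFAULT_FACTOR = 1.0


def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def p_buy_day(p_cohort_year, day, price_sensitivity, holiday_bump=1.0):
    """Daily buy probability for one customer on one calendar day."""
    month = MONTH_FACTOR.get(day.month, DEFAULT_FACTOR)
    dow = WEEKEND_FACTOR if day.weekday() >= 5 else DEFAULT_FACTOR
    price = 1 - 0.20 * price_sensitivity
    return clip(p_cohort_year * month * dow * holiday_bump * price, 0.0, 1.0)


# LOYAL_HEAVY in year 4 (10% base, price sensitivity 0.20).
black_friday = p_buy_day(0.10, date(2018, 11, 23), 0.20, holiday_bump=2.5)
quiet_tuesday = p_buy_day(0.10, date(2018, 2, 13), 0.20)
print(black_friday, quiet_tuesday)
```

The clip matters for stacked multipliers: a high-probability cohort on a December weekend with a holiday bump would otherwise exceed 1.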

Theoretical foundations

The cohort + lost-decision-date structure is a discrete-time analogue of the buy-till-you-die customer-base models of Fader, Hardie & Lee (BG/NBD) [4] and their generalizations to non-contractual settings [5, 10] — every customer has a constant per-period probability of "dying" (becoming permanently lost) and an independent per-period probability of buying while alive. We replace their continuous-time gamma/beta mixtures with a discrete library of six cohort presets so the dataset is parameterizable by hand and inspectable line-by-line.
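
A discrete-time buy-till-you-die process of this kind can be sketched in a few lines. The code below is a simplified illustration, assuming a constant daily buy probability (the package varies it by year, season, and holiday) and checking the "lost" event before the purchase on each day; both function names are hypothetical.

```python
import random


def survival_probability(p_lost_per_day: float, days: int) -> float:
    """P(customer is still alive after `days` days) under a constant hazard."""
    return (1.0 - p_lost_per_day) ** days


def simulate_purchase_days(p_buy, p_lost, horizon_days, rng):
    """Return the day indices on which one simulated customer buys.

    Each day the customer is lost with probability p_lost; if still alive,
    they buy with probability p_buy. A lost customer never returns.
    """
    purchases = []
    for day in range(horizon_days):
        if rng.random() < p_lost:
            break  # permanently lost from this day on
        if rng.random() < p_buy:
            purchases.append(day)
    return purchases


# Four-year survival for the lowest and highest "Lost / day" rates in the cohort table.
print(survival_probability(0.0001, 4 * 365))  # LOYAL_HEAVY
print(survival_probability(0.0008, 4 * 365))  # ONE_SHOT
print(len(simulate_purchase_days(0.06, 0.0001, 365, random.Random(42))))
```

The constant daily hazard makes the lifetime geometric, which is the discrete counterpart of the exponential lifetime used in the continuous-time models.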

The per-customer × per-product price sensitivity and brand affinity modulation comes from the multi-stage discrete-choice formulation in RetailSynth [2], itself rooted in McFadden's conditional logit / mixed logit framework [6, 7]. McFadden shared the 2000 Nobel Prize in Economics "for his development of theory and methods for analyzing discrete choice" [8].
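
In a conditional logit, the probability of choosing product j is exp(V_j) divided by the sum of exp(V_k) over all products in the choice set. The sketch below shows that mechanism with a hypothetical linear utility; the `price_weight` and `affinity_bonus` parameters and the utility form are illustrative assumptions, not values or formulas taken from erp-synth or RetailSynth.

```python
import math


def choice_probabilities(utilities):
    """Conditional-logit (softmax) probabilities for a dict of utilities."""
    top = max(utilities.values())  # subtract the max for numerical stability
    exp_u = {k: math.exp(v - top) for k, v in utilities.items()}
    total = sum(exp_u.values())
    return {k: e / total for k, e in exp_u.items()}


def utility(price, price_sensitivity, same_family,
            price_weight=0.05, affinity_bonus=1.0):
    """Hypothetical utility: cheaper and on-brand products score higher."""
    bonus = affinity_bonus if same_family else 0.0
    return -price_weight * price_sensitivity * price + bonus


# Two hypothetical products at different price points, same device family.
prices = {"refill_small": 10.0, "refill_bulk": 60.0}

for sensitivity in (0.20, 0.90):
    utils = {name: utility(p, sensitivity, same_family=True)
             for name, p in prices.items()}
    print(sensitivity, choice_probabilities(utils))
```

Raising the price sensitivity widens the utility gap between the two products, so the cheaper one takes a larger share of the choice probability.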

The fidelity / utility / cohort-stickiness checks in scripts/verify.py follow the comprehensive evaluation framework for synthetic retail data proposed by Xia et al. [3].

Design choice → literature

Each architectural decision in erp-synth traces to a specific paper or canonical reference. The mapping is summarized below; full bibliography in section 7.

Design choice	Anchored in
Star-schema split into `invoice_headers` + `sales_lines`; three dates per invoice; rich `DimCustomer` demographics	AdventureWorks [1]
Customer-level latent variables (`price_sensitivity`, `brand_affinity`); discount-aware multi-stage purchase model	RetailSynth (Xia et al. 2023) [2]
Choice of fidelity / utility / cohort-stickiness checks in `verify.py`	Comprehensive Evaluation of Synthetic Retail Data (Xia et al. 2024) [3]
Six "buy-till-you-die" cohorts with sticky lost-decision date and decaying yearly buy probability	BG/NBD and Pareto/NBD models (Fader et al. 2005, 2009) [4, 5]
Per-customer × per-product price sensitivity → product choice; market-aware payment-method weighting	McFadden conditional / mixed logit (1974, 2000), Nobel lecture (2001) [6, 7, 8]
Reproducibility-first generator — single `--seed` propagates to every RNG	Faker [9] + numpy + Python `random`
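
One common way to make a single seed drive several independent random streams is to derive a child RNG per named stream. The sketch below uses only the standard library and is an assumption about the general technique, not a copy of rng_utils.py; the stream names are hypothetical. The analogous entry points for the other two libraries are `numpy.random.default_rng(seed)` and `Faker.seed(seed)`.

```python
import random


def child_rng(master_seed: int, stream: str) -> random.Random:
    """Derive a reproducible RNG for one named stream.

    Seeding with a string is deterministic across processes (it does not
    depend on PYTHONHASHSEED), so the same (seed, stream) pair always
    produces the same sequence.
    """
    return random.Random(f"{master_seed}:{stream}")


customers_rng = child_rng(42, "customers")
sales_rng = child_rng(42, "sales")
print(customers_rng.random(), sales_rng.random())
```

Separate streams mean that adding an extra draw in one module does not shift the random numbers seen by another, which keeps unrelated outputs stable between versions.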

Showcase charts

All charts produced by scripts/generate_charts.py from the shipped 1,000-customer × 11-year sample (seed 42). Run with any seed to regenerate.

Monthly revenue (11 years)

Recurring November / December spikes; customer-base ramp visible end-to-end.

Daily holiday spikes (2023)

Black Friday, Cyber Monday, Christmas Eve, July 4 — annotated.

Cohort retention

Six behavioral cohorts diverge cleanly on retention curves.

Gross margin distribution

Refills highest-margin (consumables), devices lowest. Mean drifts up over time as inflation outpaces frozen standard_cost.

Promotion activity

Discount as a share of revenue spikes to 15–20% during Black Friday weeks, against a ~0% baseline.

Returns share

Holds steady around the configured 3% rate. Earlier quarters noisier due to small N.

Logistic-growth signups

Slow start → ramp → plateau, replacing the uniform-random distribution most generators ship.
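
A logistic signup curve can be produced by inverse-CDF sampling, as in the sketch below. The `midpoint` and `steepness` values are illustrative assumptions and the function name is hypothetical; the package's own parameters are not stated on this page.

```python
import math
import random
from datetime import date, timedelta


def _logistic(x, midpoint, steepness):
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def sample_signup_date(start, end, rng, midpoint=0.5, steepness=8.0):
    """Draw one signup date so cumulative signups follow a logistic S-curve.

    Inverse-CDF sampling on a logistic curve truncated to [start, end].
    `midpoint` and `steepness` are illustrative, not the package's values.
    """
    lo = _logistic(0.0, midpoint, steepness)
    hi = _logistic(1.0, midpoint, steepness)
    y = lo + rng.random() * (hi - lo)  # uniform over the truncated CDF range
    x = midpoint + math.log(y / (1.0 - y)) / steepness  # fraction of window
    x = min(1.0, max(0.0, x))  # guard against float round-off
    return start + timedelta(days=round(x * (end - start).days))


rng = random.Random(42)
signups = [sample_signup_date(date(2015, 1, 1), date(2025, 12, 31), rng)
           for _ in range(2000)]
print(min(signups), max(signups))
```

Most draws land around the middle of the window, so the cumulative customer count starts slowly, ramps, and then flattens.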

Cohort distribution

Six cohorts at the configured weights. Sticky per (seed, customer_id).

Top SKUs

Premium devices and bulk industrial refills dominate revenue.

Architecture

The package is organized around small, single-responsibility modules under erp_synth/. All RNG state flows through a single seed.

erp_synth/
├── rng_utils.py    seeded RNGs (numpy + python random + Faker), threaded everywhere
├── markets.py      US / GCC / EU presets: locale, currency, VAT, holidays, dow factors
├── seasonality.py  combined_multiplier(date, market) → float
├── cohorts.py      6 sticky cohorts, deterministic per (seed, customer_id)
├── pricing.py      inflation-adjusted unit price + line-level promo discount
├── stores.py       store master generator (with lat/lon)
├── promotions.py   promotion master + precomputed (date, category) lookup
├── returns.py      negated-quantity return invoices linked via reference_invoice_id
├── marketing.py    per-month-per-channel spend with holiday boost
├── support.py      support tickets + NPS surveys gated by cohort propensity
├── items.py        product universe + sampler
├── customers.py    demographics, geography, cohort assignment, signup growth curve
└── sales.py        day-by-day per-customer (headers, lines) generation

scripts/
├── verify.py               schema + invariants + FK + reconciliation checks
├── generate_charts.py      reproducible chart pipeline (matplotlib)
├── generate_branding.py    hero, stats, schema, features, phone dashboards
└── generate_animations.py  the data_unfolds.gif animation

run.py                       CLI orchestrator → erp-synth console script
tests/                       111 tests across 12 files
.github/workflows/ci.yml     lint + tests + smoke + reproducibility check
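
A reproducibility check of the kind listed for CI can be approximated by hashing the output directory of two runs that share a seed. This is a minimal sketch, not the contents of ci.yml; the helper name `directory_digest` is hypothetical.

```python
import hashlib
from pathlib import Path


def directory_digest(out_dir) -> str:
    """Hash every CSV in `out_dir` (sorted by file name) into one hex digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(out_dir).glob("*.csv")):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

To use it, run erp-synth twice with the same --seed and two different --out-dir values, then compare `directory_digest` of the two directories; equal digests indicate byte-identical CSV output.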

Performance notes: Promotions are precomputed into a dict[(date_iso, category)] → (promo_id, pct) at startup so the per-line lookup inside the hot loop is O(1) instead of an O(N) pandas filter. Sales output is streamed (append per customer), keeping memory flat regardless of --n-customers. Items' listed_from_date is parsed once and cached. Holiday tables are lazy-cached per year per market.
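
The precomputed promotion lookup can be sketched as follows. The `start` and `end` field names and the rule for overlapping promotions are assumptions made for illustration; `promotion_id`, `discount_pct`, and `category_scope` are the key fields listed for promotions.csv.

```python
from datetime import date, timedelta


def build_promo_lookup(promotions):
    """Expand each promotion into one dict entry per (date_iso, category)."""
    lookup = {}
    for promo in promotions:
        day = promo["start"]
        while day <= promo["end"]:
            key = (day.isoformat(), promo["category_scope"])
            # Assumption: when promotions overlap, the first one listed wins.
            lookup.setdefault(key, (promo["promotion_id"], promo["discount_pct"]))
            day += timedelta(days=1)
    return lookup


promos = [
    {"promotion_id": "P1", "category_scope": "refills", "discount_pct": 0.20,
     "start": date(2023, 11, 20), "end": date(2023, 11, 27)},
]
lookup = build_promo_lookup(promos)

# Inside the hot loop, a line-level lookup is a single dict access.
print(lookup.get(("2023-11-24", "refills")))
print(lookup.get(("2023-12-24", "refills")))
```

The expansion costs one dict entry per promotion day and category, paid once at startup, in exchange for a constant-time lookup on every generated line.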

CLI reference

After installation (pip install -e .), the erp-synth console script is on PATH. It accepts the following flags:

Scale & dates
  --n-customers INT          number of customers (default 1000)
  --date-from YYYY-MM-DD     start of customer creation timeline
  --date-till YYYY-MM-DD     end of generation timeline

Market & realism
  --seed INT                 master RNG seed (default 42)
  --market {us,gcc,eu}       locale, currency, VAT, holidays (default us)
  --vat-rate FLOAT           override market VAT
  --currency STR             override market currency
  --annual-inflation FLOAT   override market inflation rate

Items
  --n-devices INT            (default 5)
  --n-accessories INT        (default 10)
  --n-spare-parts INT        (default 8)
  --n-refills INT            (default 74)
  --n-bulk-refills INT       (default 1)

Customer fields
  --p-first-name FLOAT       (default 0.95)
  --p-last-name FLOAT        (default 0.85)
  --p-email FLOAT            (default 0.70)
  --p-phone FLOAT            (default 0.80)
  --p-email-opt-in FLOAT     (default 0.60, only if email present)
  --p-sms-opt-in FLOAT       (default 0.90, only if phone present)
  --p-call-opt-in FLOAT      (default 0.75, only if phone present)

Stores & promotions
  --n-stores INT             (default 8)
  --n-promotions-per-year INT (default 6)

Returns
  --p-return FLOAT           (default 0.03)
  --enable-returns / --disable-returns

Output
  --out-dir PATH             (default output_csv)
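
For readers who want to wrap or mirror the CLI, a subset of the flags above could be declared with argparse along these lines. This is a sketch of one possible declaration, not the parser in run.py; the destination name `returns` and the assumption that returns are enabled by default are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented flags with their listed defaults."""
    parser = argparse.ArgumentParser(prog="erp-synth")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--market", choices=["us", "gcc", "eu"], default="us")
    parser.add_argument("--n-customers", type=int, default=1000)
    parser.add_argument("--p-return", type=float, default=0.03)
    parser.add_argument("--out-dir", default="output_csv")
    # Paired switches that write to one shared destination.
    parser.add_argument("--enable-returns", dest="returns", action="store_true")
    parser.add_argument("--disable-returns", dest="returns", action="store_false")
    parser.set_defaults(returns=True)  # assumption: returns are on by default
    return parser


args = build_parser().parse_args(["--seed", "7", "--market", "eu", "--disable-returns"])
print(args)
```

argparse converts dashes in flag names to underscores, so `--n-customers` is read back as `args.n_customers`.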

Bibliography

BibTeX for every entry: docs/REFERENCES.bib.

[1] Microsoft. AdventureWorks Sample Databases. SQL Server documentation.
[2] Xia, Y., Arian, A., Narayanamoorthy, S., & Mabry, J. (2023). RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation. arXiv:2312.14095.
[3] Xia, Y., Wang, C.-H., Mabry, J., & Cheng, G. (2024). Advancing Retail Data Science: Comprehensive Evaluation of Synthetic Data. arXiv:2406.13130.
[4] Fader, P. S., Hardie, B. G. S., & Lee, K. L. (2005). "Counting Your Customers" the Easy Way: An Alternative to the Pareto/NBD Model. Marketing Science, 24(2), 275–284.
[5] Fader, P. S., & Hardie, B. G. S. (2009). Probability Models for Customer-Base Analysis. Journal of Interactive Marketing, 23(1), 61–69.
[6] McFadden, D. (1974). Conditional Logit Analysis of Qualitative Choice Behavior. Frontiers in Econometrics.
[7] McFadden, D., & Train, K. (2000). Mixed MNL Models for Discrete Response. Journal of Applied Econometrics, 15(5), 447–470.
[8] McFadden, D. (2001). Economic Choices. Nobel Lecture.
[9] Faraglia, D. and others. Faker: Python package for fake data generation.
[10] Fader, P. S., Hardie, B. G. S., & Shang, J. (2010). Customer-Base Analysis in a Discrete-Time Noncontractual Setting. Marketing Science, 29(6), 1086–1108.