A hosted SaaS service is a non-starter

By design, this product is intended to run on-premises, including in air-gapped environments. A SaaS-hosted platform, where enterprises upload their data to train models on our servers, defeats the core purpose of the product. Its very reason for existence is to ensure sensitive enterprise data never leaves organizational boundaries—not when working with third parties, and not even for internal use by developers. As such, any suggestion to upload data externally is a non-starter. 

Enterprise data requirements

To succeed with wider adoption of synthetic data within an enterprise, use the following requirements as a guide when evaluating platforms.
| Data complexity | Requirement | Commonly used fallback |
| --- | --- | --- |
| Address the sprawl of data types | Be able to model and synthesize data across a wide range of types: not only traditional statistical variables like numerical and categorical fields, but also complex real-world data such as phone numbers, addresses, vehicle identification numbers (VINs), GPS coordinates, and many others. | Using Faker to generate data for certain data types. |
| Be able to scale to many tables | A synthetic data platform must be able to model hundreds of interconnected tables in order to be useful for enterprise environments. | Model table by table and connect them to maintain referential integrity. |
| Address hidden context | Much of the complexity comes from hidden context not captured in schemas or metadata. These represent hard constraints that the platform must recognize and enforce to generate valid data. | Pre- and post-processing scripts |
| Generate flexible formats | Be able to model and generate data with flexible formats. For example, if the address is split across multiple columns or if a datetime format is highly unusual. | Pre- and post-processing scripts |

Most generative modeling systems address these requirements using the fallback options listed in the third column, which prevents synthetic data initiatives from scaling at an enterprise level.
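
To make the cost of the table-by-table fallback concrete: when tables are modeled independently, referential integrity has to be verified (and repaired) after the fact. The following is a minimal sketch of such a check, not part of any platform's API. The function name `orphaned_rows`, the table contents, and the column names are all hypothetical and chosen only for illustration.

```python
def orphaned_rows(child_rows, parent_rows, foreign_key, primary_key):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_keys]


# Hypothetical synthetic tables generated independently of each other.
users = [{"user_id": 1}, {"user_id": 2}]
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 3},  # refers to a user that does not exist
]

orphans = orphaned_rows(orders, users, foreign_key="user_id", primary_key="user_id")
print(orphans)
```

With hundreds of interconnected tables, a check like this must be repeated for every parent-child relationship, which is one reason the fallback does not scale.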

Model flexibility

A platform must be able to model data across multiple modalities (multi-table, sequential, and single-table), but it must also offer a range of models that excel along different performance axes. Use cases vary widely in their requirements for speed, quality, privacy, and transparency, and no single model can optimize for all of these simultaneously. For each data modality, a platform therefore needs a plurality of modeling approaches. The following examples illustrate why this diversity is essential.
Creating synthetic data for clinical trials.
Imagine you have data from only a small number of clinical trials and need to augment it with additional synthetic samples. In this context, transparency is critical. Models built using classical statistical techniques often provide greater explainability and traceability than complex generative methods, making them better suited for regulated environments where auditability is essential.
Creating synthetic data for consumer surveys.
Consumer survey datasets are often extremely limited in size. With such scarcity, GAN-based approaches are prone to mode collapse and may fail to generate meaningful variation. Because collecting more survey data defeats the very purpose—deriving insights with fewer samples and supplementing them synthetically—you need an approach designed to operate reliably under severe data constraints.
Creating synthetic data when real data consists of only hundreds of samples.
In some scenarios, each data point is expensive to obtain—for example, when every observation requires an in-person session in a controlled laboratory environment. You may end up with only a few hundred records across many dimensions. This high-dimensional, low-sample-size problem requires a specialized synthesizer that can model complex relationships without overfitting or collapsing.
Different use cases demand varying levels of quality, transparency, speed, and privacy, and many involve extreme data constraints. No single model can excel across all these dimensions. A robust platform must therefore address them deliberately—offering specialized modeling techniques tailored to the specific requirements of each use case.
Model plurality: A platform must provide multiple modeling choices for each modality of data. These models must allow trade-offs between quality, speed, privacy, and transparency.
Specialized models: A platform must provide the ability to train models from as little data as possible and also enable training models for highly segmented data.

Computational requirements

Non-GPU-based algorithms. The platform must support algorithms that run on CPUs. GPU dependency increases cost and complexity, limiting the number of viable use cases.

Evaluation

  • ROI vs. Quality: Evaluation should focus on ROI metrics rather than only on synthetic data quality metrics. Synthetic data quality does not always correlate linearly with ROI. In some cases, higher “quality” does not translate into greater business value for a given use case.
  • Flexibility for Use Cases: A powerful synthetic data platform must integrate multiple layers to support diverse use cases, each with distinct quality requirements.
  • Performance Metrics: When benchmarking, both the model training time and the time required to generate synthetic data should be considered critical evaluation metrics.
  • Robust Evaluation Metrics: Evaluation should rely only on openly validated, peer-reviewed metrics from academia and industry. We created SDMetrics for this purpose, and it has become widely adopted across both domains.
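
The performance-metrics point above can be illustrated with a small timing harness. This is a sketch under stated assumptions, not the SDGym or SDMetrics API: `GaussianColumnSynthesizer` is a toy stand-in that fits an independent normal distribution per column, and `benchmark` is a hypothetical helper. The only idea it demonstrates is recording training time and sampling time as separate measurements.

```python
import random
import statistics
import time


class GaussianColumnSynthesizer:
    """Toy stand-in for a synthesizer: one independent normal per column."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.params = [(statistics.fmean(c), statistics.pstdev(c)) for c in columns]

    def sample(self, num_rows, seed=0):
        rng = random.Random(seed)
        return [
            tuple(rng.gauss(mu, sigma) for mu, sigma in self.params)
            for _ in range(num_rows)
        ]


def benchmark(synthesizer, real_rows, num_rows):
    """Time training and sampling separately and return both measurements."""
    start = time.perf_counter()
    synthesizer.fit(real_rows)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    synthetic_rows = synthesizer.sample(num_rows)
    sample_seconds = time.perf_counter() - start

    return {
        "fit_seconds": fit_seconds,
        "sample_seconds": sample_seconds,
        "synthetic_rows": synthetic_rows,
    }


# Hypothetical "real" data: 1,000 rows with two numerical columns.
rng = random.Random(42)
real_rows = [(rng.gauss(50, 10), rng.gauss(0, 1)) for _ in range(1000)]

report = benchmark(GaussianColumnSynthesizer(), real_rows, num_rows=500)
print(f"fit: {report['fit_seconds']:.4f}s, sample: {report['sample_seconds']:.4f}s")
```

Keeping the two timings separate matters because some use cases train once and sample many times, while others retrain frequently.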

Benchmarking

  • Benchmarking Tools: To enable consistent assessment, we created the SDGym library, which provides abstractions for evaluating synthetic data along two axes: data quality and performance time.
  • In this graphic, we show how different synthetic data generators perform along the time-quality axes. The line represents the Pareto front, and the generators on the line offer the best trade-offs between time and quality. Such evaluations are important because users may choose synthesizers based on their needs. For example, some use cases require rapid iteration.
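
The Pareto front described above can be computed directly from benchmark results. The following is a minimal sketch, assuming each result is summarized by a total time (lower is better) and a single quality score (higher is better). The `Result` type, the synthesizer names, and all numbers are hypothetical and are not measurements of any real synthesizer.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    seconds: float  # total time, lower is better
    quality: float  # quality score, higher is better


def pareto_front(results):
    """Return the results that no other result beats on both time and quality.

    Exact duplicates are reported once.
    """
    front = []
    best_quality = float("-inf")
    # Walk from fastest to slowest; on equal time, consider higher quality first.
    for r in sorted(results, key=lambda r: (r.seconds, -r.quality)):
        # A slower result earns a place only if it improves on quality.
        if r.quality > best_quality:
            front.append(r)
            best_quality = r.quality
    return front


# Hypothetical benchmark results, for illustration only.
results = [
    Result("synth_a", 5, 0.70),
    Result("synth_b", 60, 0.85),
    Result("synth_c", 90, 0.80),   # slower and worse than synth_b
    Result("synth_d", 600, 0.92),
    Result("synth_e", 700, 0.92),  # slower than synth_d, same quality
]

front = pareto_front(results)
print([r.name for r in front])
```

A user who needs rapid iteration would pick from the fast end of the front, while a user who can afford long training would pick from the high-quality end.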

Privacy

A synthetic data platform should support the sharing and portability of the synthesizer (model) itself, not just the generated data. As a result, privacy preservation must be evaluated not only on the output data but also on the synthesizer. Without the ability to securely share synthesizers, several important use cases are limited. Our launch of differentially private synthesizers demonstrates this capability.

Use case support