Tabular data inside enterprises is rarely as clean or structured as AI modeling techniques expect. This mismatch has become a fundamental bottleneck to broad adoption of both predictive and generative AI, hindering the development, training, and deployment of reliable models. In this article, we highlight some of the most common data complexities encountered in our enterprise implementations. We conclude with a practical questionnaire you can use to assess whether a software platform is truly prepared to ingest enterprise data and train an AI model effectively.
AI pre-condition expectations
Most vendors’ products expect data to come in a certain format or with certain properties. One of our customers described this with the helpful phrase “AI pre-condition expectations”:
“In most AI examples we see showcases of a ‘Hello World’ example. In these showcases we use data-sources that are quite in line with AI pre-condition expectations. The most data-sources that we see in our enterprise do not meet the AI pre-condition expectations. In these scenarios data-source knowledge is key so you need commitment from the team who owns the data-source, with risk of endless discussions on how to implement an AI solution.”
In most cases, the pre-condition expectation is single-table data that is either numerical or categorical and has no ID columns or other data types. In reality, no data consumed by enterprise applications actually looks like this.
Enterprise data is a complex web of tables, a sprawl of data types, business logic, and hidden context
As we push into the frontier of providing “synthetic data for testing software applications,” deploying our software across many enterprises and applications, we see how complex data can be. Here are the hidden complexities we run into most frequently. Luckily, we love a challenge.
Multiple tables. Enterprises routinely have dozens of tables (sometimes more than 50) that an application consumes. A synthetic data generator must be able to generate data for all these tables simultaneously while maintaining referential integrity and respecting cardinality and both inter-table and intra-table correlations. Our HSASynthesizer can model and create synthetic data for 30+ tables on a single machine within minutes while maintaining top performance.
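To make "referential integrity" and "cardinality" concrete, here is a minimal sketch of the two checks one might run on a generated parent/child pair of tables. The table names, column names, and helper functions are hypothetical and chosen only for illustration; this is not how HSASynthesizer works internally.

```python
from collections import Counter


def orphaned_rows(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def children_per_parent(child_rows, fk, parent_rows, pk):
    """Count child rows per parent, including parents with zero children."""
    counts = Counter(row[fk] for row in child_rows)
    return {row[pk]: counts.get(row[pk], 0) for row in parent_rows}


# Hypothetical two-table example: users (parent) and orders (child).
users = [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}]
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 1},
    {"order_id": 12, "user_id": 2},
    {"order_id": 13, "user_id": 9},  # orphan: there is no user with id 9
]

orphans = orphaned_rows(orders, "user_id", users, "user_id")
cardinality = children_per_parent(orders, "user_id", users, "user_id")
```

Comparing the `cardinality` distribution of the synthetic tables against the real ones is one simple way to see whether the number of children per parent has been respected.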
Complex data patterns. Enterprise data stores contain patterns such as composite keys, slowly changing dimensions, and polymorphic relationships. We now address these natively within SDV Enterprise without the need for pre- or postprocessing.
Many different data types. Often, specific data types must be synthesized, such as phone numbers, addresses, and URLs. Faker-based synthetic data techniques create random data that loses important context. For example, if you’re working with a real phone number database where 30% of the numbers are from New York, that proportion should be emulated within the synthetic data. Unlike fakers, we use contextual anonymization to support anonymized, context-preserving synthetic data generation.
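The phone number example can be illustrated with a deliberately simplified sketch. It assumes 10-digit numbers whose first three digits are the area code, keeps that area code, and randomizes the rest, so the regional proportions survive. The function names and sample data are hypothetical, and this is only a toy stand-in for the idea of preserving context, not a description of how contextual anonymization is implemented.

```python
import random
from collections import Counter


def anonymize_phone(phone, rng):
    """Keep the 3-digit area code; replace the other 7 digits with random ones."""
    area_code = phone[:3]
    subscriber = "".join(rng.choice("0123456789") for _ in range(7))
    return area_code + subscriber


def area_code_shares(phones):
    """Return the fraction of numbers that fall under each area code."""
    counts = Counter(phone[:3] for phone in phones)
    return {code: n / len(phones) for code, n in counts.items()}


# Hypothetical data: 3 of 10 numbers (30%) use 212, a New York area code.
real_phones = (
    ["2125550101", "2125550102", "2125550103"]
    + ["4155550104", "4155550105", "4155550106", "4155550107"]
    + ["3125550108", "3125550109", "3125550110"]
)

rng = random.Random(0)  # seeded so the sketch is reproducible
synthetic_phones = [anonymize_phone(p, rng) for p in real_phones]

real_shares = area_code_shares(real_phones)
synthetic_shares = area_code_shares(synthetic_phones)
```

A purely random generator would draw area codes uniformly (or from its own defaults), so the 30% share would generally not carry over.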
Embedded business logic. Particular businesses have their own rules that synthetic data must follow. For example, a bank may require any customer with a credit limit higher than $50,000 to also have a savings account with the bank. For a hotel, a checkout date within a transactions dataset should always occur after the associated check-in date. For an online retail dataset, a checkout transaction can only happen after something has been added to the cart. UIs, systems, and business processes rely on these rules being followed, and so synthetic data must hew to this logic.
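Rules like these can be written down as simple row-level predicates and used to audit a generated dataset. The sketch below encodes the bank and hotel examples; the column names, sample rows, and helper functions are assumptions made for illustration, not part of any product API.

```python
from datetime import date


def credit_limit_rule(customer):
    """A credit limit above $50,000 requires a savings account."""
    return customer["credit_limit"] <= 50_000 or customer["has_savings_account"]


def checkout_after_checkin_rule(stay):
    """The checkout date must fall strictly after the check-in date."""
    return stay["checkout_date"] > stay["checkin_date"]


def violations(rows, rule):
    """Return every row that fails the given rule."""
    return [row for row in rows if not rule(row)]


# Hypothetical synthetic rows to audit.
customers = [
    {"id": 1, "credit_limit": 75_000, "has_savings_account": True},
    {"id": 2, "credit_limit": 60_000, "has_savings_account": False},  # breaks the rule
    {"id": 3, "credit_limit": 10_000, "has_savings_account": False},
]
stays = [
    {"id": "a", "checkin_date": date(2024, 5, 1), "checkout_date": date(2024, 5, 3)},
    {"id": "b", "checkin_date": date(2024, 5, 4), "checkout_date": date(2024, 5, 2)},  # breaks the rule
]

bad_customers = violations(customers, credit_limit_rule)
bad_stays = violations(stays, checkout_after_checkin_rule)
```

Auditing after the fact only detects broken rules; a generator that understands the rules can avoid producing such rows in the first place.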
Hidden context. Over the past few years, we’ve encountered many enterprise datasets that contain relationships beyond simple correlations. Because these are not just PK-FK type relationships, they are also not captured in data schemas. These relationships are often well understood by application developers, who use their contextual knowledge to process the data effectively, yet this context is rarely stored or explicitly documented.
For example, in some multi-table datasets, we see patterns where one or more columns from a parent table are mirrored in a child table. This is often done to avoid the need for expensive or time-consuming JOIN operations, especially when working with very large datasets or column-oriented (OLAP) databases.
Consider an e-commerce dashboard that needs to frequently analyze order volume and revenue based on a user’s country of origin. In a fully normalized schema, the application would need to repeatedly join the orders and users tables to obtain this information. If such queries are slow or resource-intensive, one common optimization is to carry over the country column from the users table into the orders table, eliminating the need for a repeated JOIN and speeding up access for reporting or analysis. We refer to this as the CarryOverColumns pattern, in which one or more columns are intentionally duplicated across tables to support performance or usability goals.
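The sketch below shows the pattern on a tiny users/orders example, along with the consistency check that synthetic data would have to pass: every carried-over value in the child table must agree with its parent row. The helper names, columns, and sample rows are hypothetical and only illustrate the idea.

```python
def carry_over(child_rows, parent_rows, key, columns):
    """Copy the given parent columns into each child row (denormalization)."""
    parents = {row[key]: row for row in parent_rows}
    return [
        {**child, **{col: parents[child[key]][col] for col in columns}}
        for child in child_rows
    ]


def mismatched_rows(child_rows, parent_rows, key, columns):
    """Return child rows whose carried-over values disagree with the parent."""
    parents = {row[key]: row for row in parent_rows}
    return [
        child
        for child in child_rows
        if any(child[col] != parents[child[key]][col] for col in columns)
    ]


users = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "DE"}]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 20.0},
    {"order_id": 11, "user_id": 2, "amount": 35.0},
]

# The orders table now carries the country, so reporting needs no join.
orders_with_country = carry_over(orders, users, "user_id", ["country"])
revenue_by_country = {}
for order in orders_with_country:
    country = order["country"]
    revenue_by_country[country] = revenue_by_country.get(country, 0.0) + order["amount"]

# A synthetic order that ignores the pattern: user 1 is in the US, not FR.
synthetic_orders = [{"order_id": 10, "user_id": 1, "amount": 20.0, "country": "FR"}]
broken = mismatched_rows(synthetic_orders, users, "user_id", ["country"])
```

A generator that treats `country` in the orders table as an independent column can easily produce rows like the one in `synthetic_orders`, which is why this hidden context has to be made explicit.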
As another example, consider an insurance application that provides insurance only to people aged 18 and over, so the real data contains information only about people 18 years and older. However, the same company has a lot of software logic that creates responses for people younger than 18. To test this logic, the company needs synthetic data that goes beyond the real data. They may also want to switch this capability on and off for privacy reasons. We address all of these needs through specialized transformers.