Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV)

by Brenden Burgess

Real data is often expensive, messy, and restricted by privacy rules. Synthetic data offers a solution, and it is already widely used:

  • LLMs train on AI-generated text
  • Fraud systems simulate edge cases
  • Vision models pretrain on fake images

SDV (Synthetic Data Vault) is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.

In this tutorial, we will use SDV to generate synthetic data step by step.

We will first install the SDV library:
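pip install sdv

With SDV installed (the package name on PyPI is simply sdv), we can read the dataset from a local folder using SDV's CSV connector: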

from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '.' # If the data is in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

Then we import the necessary module and connect to our local folder containing the dataset files. This reads the CSV files from the specified folder and stores them as pandas DataFrames. In this case, we access the main dataset using data['data'].
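As a quick sanity check (plain pandas, nothing SDV-specific), it can help to confirm what the connector actually loaded before going further:

print(data.keys())      # file names discovered in the folder
print(salesDf.head())   # first few rows of the main dataset
print(salesDf.dtypes)   # column types as inferred by pandas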

from sdv.metadata import Metadata
metadata = Metadata.load_from_json('metadata.json')

We now import the metadata for our dataset. The metadata is stored in a JSON file and tells SDV how to interpret your data. It includes:

  • The table name
  • The primary key
  • The data type of each column (for example: categorical, numerical, datetime, etc.)
  • Optional column formats such as datetime patterns or ID patterns
  • Table relationships (for multi-table setups)

Here is a sample metadata format:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T(0-9){6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": ()
    }
  }
}

from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data)

Alternatively, we can let SDV automatically detect the metadata from the DataFrames. However, the results may not always be accurate or complete, so you may need to review it and update it if anything is off.
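Before training on auto-detected metadata, it is worth validating it and correcting anything SDV guessed wrong. A minimal sketch, assuming our table is keyed as 'data' and has a 'Date' column (adjust the names to your own dataset; update_column follows SDV's documented metadata API):

# Check the detected metadata for internal consistency
metadata.validate()

# Example correction: explicitly mark 'Date' as a datetime column
metadata.update_column(
    table_name='data',
    column_name='Date',
    sdtype='datetime',
    datetime_format='%d-%m-%Y'
)

# Save the reviewed metadata so it can be reloaded later
metadata.save_to_json('metadata_reviewed.json')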

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

With the metadata and the original dataset in place, we can now use SDV to train a model and generate synthetic data. The model learns the structure and patterns in your real dataset and uses that knowledge to create synthetic records.

You can control the number of rows to generate using the num_rows argument.
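Fitting can take a while on larger datasets, so it is often convenient to persist the trained synthesizer and reload it later instead of retraining. A short sketch using SDV's built-in save/load methods (the file name here is our own choice):

# Persist the fitted synthesizer to disk
synthesizer.save('sales_synthesizer.pkl')

# Later: reload it and sample more rows on demand
synthesizer = GaussianCopulaSynthesizer.load('sales_synthesizer.pkl')
more_synthetic_data = synthesizer.sample(num_rows=500)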

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)

The SDV library also provides tools to assess the quality of your synthetic data by comparing it to the original dataset. A good starting point is to generate a quality report.
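Beyond the overall score, the returned report object can be inspected in more detail. For example, using the methods documented for SDV's quality report:

# Overall quality score between 0.0 and 1.0 (higher is better)
print(quality_report.get_score())

# Per-property breakdown, e.g. how closely individual column shapes match
print(quality_report.get_details(property_name='Column Shapes'))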

You can also visualize how the synthetic data compares to the real data using SDV's built-in plotting tools. For example, import get_column_plot from sdv.evaluation.single_table to create comparison plots for specific columns:

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name="Sales",
    metadata=metadata
)
   
fig.show()

We can observe that the distribution of the 'Sales' column in the real and synthetic data is very similar. To explore further, we can use Matplotlib to create more detailed comparisons, such as visualizing average monthly sales trends across the two datasets.

import pandas as pd
import matplotlib.pyplot as plt

# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format="%d-%m-%Y")
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format="%d-%m-%Y")

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Merge the two series into a DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label="Actual Average Sales", marker="o")
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label="Synthetic Average Sales", marker="o")

plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)  # y-axis starts at 0
plt.tight_layout()
plt.show()

This plot also shows that the average monthly sales in the two datasets are very similar, with only minimal differences.

In this tutorial, we showed how to prepare your data and metadata for synthetic data generation using the SDV library. By training a model on your original dataset, SDV can create high-quality synthetic data that closely mirrors the real data's structure and distributions. We also explored how to evaluate and visualize the synthetic data, confirming that key metrics such as the sales distribution and monthly trends remain consistent. Synthetic data offers a powerful way to overcome privacy and availability challenges while still enabling robust data analysis and machine learning workflows.

