Synthetic Data Generation Techniques for Machine Learning

Published on November 7, 2023

Machine Learning
Synthetic Data
Data Augmentation

Unlock the potential of synthetic data for machine learning with our comprehensive guide. Dive into the most effective techniques to generate data when real-world datasets are scarce or privacy-constrained.

A representation of synthetic data points plotted in a virtual space

Generating Synthetic Data for Machine Learning Models

Machine learning models generally improve with more data. Yet collecting large real-world datasets is often hindered by privacy concerns, high costs, or sheer unavailability. Synthetic data generation offers a workaround: it produces artificial data that mimics the statistical properties of real data while reducing privacy and access problems. This guide explores several techniques for creating synthetic data for projects that lack robust datasets.

Method 1: Statistical Sampling

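Statistical sampling generates synthetic data by drawing values from known or assumed statistical distributions, such as the normal or binomial distribution. The distribution parameters (for example, the mean and standard deviation) come from domain knowledge or from summary statistics of existing data. The example below draws temperatures from a normal distribution and derives energy consumption from them through a linear relation plus noise.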

import numpy as np
import pandas as pd

# Assume we're working on a project related to energy consumption
size = 1000  # Size of data
temperature = np.random.normal(68, 2, size)  # normal distribution
energy_consumption = temperature * 2.5 + np.random.normal(0, 0.5, size)  # linear relation with some noise

# Creating DataFrame
data = pd.DataFrame({'Temperature(°F)': temperature, 'EnergyConsumption(KWh)': energy_consumption})
data.to_csv('synthetic_data_statistical.csv', index=False)

Statistical sampling is best suited for scenarios with well-characterized distributions, such as simulating test scores or modeling customer purchase behaviors from historical data. Here are the pros and cons of this approach:

Pros: Simple to implement; efficient in terms of computation.
Cons: Assumes known distributions, which might not always represent real-world data accurately; limited complexity in generated data.
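
When a small real sample is available, the distribution parameters do not have to be guessed: they can be estimated from the sample and then used for sampling. The sketch below illustrates this under stated assumptions. The array `observed_temps` is a hypothetical stand-in for real measurements, and the 30% rate for the binary `heater_on` feature is purely illustrative; neither comes from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded generator for reproducible draws

# Hypothetical stand-in for a small sample of real temperature readings (°F)
observed_temps = np.array([66.1, 67.4, 68.0, 68.9, 69.3, 70.2, 67.8, 68.5])

# Estimate the parameters of a normal distribution from the sample
mu = observed_temps.mean()
sigma = observed_temps.std(ddof=1)  # sample standard deviation

# Draw synthetic values from the fitted normal distribution
synthetic_temps = rng.normal(loc=mu, scale=sigma, size=1000)

# A binomial draw with n=1 models a yes/no feature, e.g. whether a heater was on
# (the 0.3 probability is an illustrative assumption)
heater_on = rng.binomial(n=1, p=0.3, size=1000)

print(f"fitted mean={mu:.2f}, std={sigma:.2f}")
print(f"synthetic mean={synthetic_temps.mean():.2f}, heater-on rate={heater_on.mean():.2f}")
```

Using a seeded `default_rng` generator keeps the synthetic dataset reproducible, which helps when comparing models trained on it.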

Method 2: Data Augmentation

Data augmentation involves making subtle modifications to existing data to generate new data points, a technique particularly useful when dealing with limited datasets.

import numpy as np
from sklearn.utils import shuffle

# Assume data is a DataFrame containing real-world data
def augment_data(data):
    augmented_data = data.copy()
    augmented_data['Temperature(°F)'] = data['Temperature(°F)'] + np.random.normal(0, 0.5, len(data))
    augmented_data = shuffle(augmented_data)
    return augmented_data

augmented_data = augment_data(data)
augmented_data.to_csv('augmented_data.csv', index=False)

Data augmentation excels in enhancing image or text datasets where minor tweaks can yield significant new variations, crucial for training deep learning models in image or text classification tasks. Below are the advantages and disadvantages of this method:

Pros: Effective in expanding a small dataset; preserves the original data distribution.
Cons: Might not introduce enough variability if the original dataset is not diverse; could reinforce existing biases in the data.
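
For image data, the same idea applies to pixel arrays. The following is a minimal sketch, assuming a grayscale image stored as a NumPy array with values in [0, 1]; the function name `augment_image` and the noise level are illustrative choices, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def augment_image(image, noise_std=0.02):
    """Return a randomly flipped, slightly noisy copy of an image in [0, 1]."""
    augmented = image.copy()
    if rng.random() < 0.5:
        augmented = augmented[:, ::-1]  # horizontal flip (reverse the width axis)
    augmented = augmented + rng.normal(0.0, noise_std, size=augmented.shape)
    return np.clip(augmented, 0.0, 1.0)  # keep pixel values in the valid range

# Hypothetical 8x8 grayscale image with values in [0, 1]
image = rng.random((8, 8))

# Produce four augmented variants of the same image
augmented_batch = np.stack([augment_image(image) for _ in range(4)])
print(augmented_batch.shape)  # prints (4, 8, 8)
```

Whether a flip preserves the label depends on the task: it is usually safe for photos of objects but not for digits or text.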

Method 3: Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) are a type of autoencoder designed to learn a probabilistic mapping between data space and a latent space. This allows them to generate new data points that have a high probability of fitting within the learned data distribution.

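import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class VAE(tf.keras.Model):
    def __init__(self, latent_dim, input_dim):
        super().__init__()
        self.latent_dim = latent_dim
        # The encoder outputs the mean and log-variance of the latent distribution
        self.encoder = tf.keras.Sequential([
            layers.Dense(10, activation='relu'),
            layers.Dense(latent_dim + latent_dim),
        ])
        # The decoder maps a latent vector back to data space.
        # The output layer is an assumption: one linear unit per input feature.
        self.decoder = tf.keras.Sequential([
            layers.Dense(10, activation='relu'),
            layers.Dense(input_dim),
        ])

    def call(self, x):
        mean, logvar = tf.split(self.encoder(x), num_or_size_splits=2, axis=1)
        z = self.reparameterize(mean, logvar)
        # The KL term pulls the latent distribution toward the N(0, I) prior,
        # which is the distribution sampled from when generating data below
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + logvar - tf.square(mean) - tf.exp(logvar), axis=1))
        self.add_loss(kl)
        return self.decoder(z)

    def reparameterize(self, mean, logvar):
        # tf.shape handles the unknown batch dimension during training
        eps = tf.random.normal(shape=tf.shape(mean))
        return eps * tf.exp(logvar * .5) + mean

# Assuming data is normalized; `data` is taken to be the DataFrame of real samples
x_train = data.to_numpy(dtype='float32')
latent_dim = 2
vae = VAE(latent_dim, input_dim=x_train.shape[1])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
vae.compile(optimizer, loss=tf.keras.losses.MeanSquaredError())
# Train the model to reconstruct its own input
vae.fit(x_train, x_train, epochs=50, batch_size=32)

# Generating synthetic data
z = np.random.normal(size=(1000, latent_dim))
synthetic_data_vae = vae.decoder.predict(z)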

VAEs are well suited to complex, high-dimensional data distributions, such as those of synthetic images or sounds. Here's a rundown of the pros and cons of this method:

Pros: Can generate complex, high-dimensional data; learns the underlying data distribution.
Cons: Requires a good amount of real data for training; might struggle if the data distribution is too complex.
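
The reparameterization step used in the VAE code can be examined in isolation, together with the KL divergence term that standard VAE training adds to the reconstruction loss. The NumPy sketch below assumes hypothetical encoder outputs (`mean` and `logvar`) for a batch of three inputs; the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def reparameterize(mean, logvar):
    """Sample z = mean + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(size=mean.shape)
    return mean + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mean, logvar):
    """Closed-form KL(N(mean, sigma^2) || N(0, I)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mean**2 - np.exp(logvar), axis=1)

# Hypothetical encoder outputs for a batch of 3 inputs and 2 latent dimensions
mean = np.array([[0.0, 0.0], [1.0, -1.0], [0.5, 0.0]])
logvar = np.array([[0.0, 0.0], [0.0, 0.0], [-1.0, 0.5]])

z = reparameterize(mean, logvar)
kl = kl_to_standard_normal(mean, logvar)

print(z.shape)
print(kl)  # the first entry is 0: that row already matches the N(0, I) prior
```

Writing the sample as `mean + sigma * eps` keeps the randomness in `eps`, so gradients can flow through `mean` and `logvar` during training.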

Method 4: Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) set the stage for a duel between two neural networks: a generator and a discriminator. The generator crafts synthetic data, and the discriminator judges its authenticity against real data. Through repeated iterations, the generator hones its ability to create data that closely mirrors the actual data distribution.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class GAN(tf.keras.Model):
    def __init__(self, latent_dim, data_dim):
        super().__init__()
        self.latent_dim = latent_dim
        # The generator maps latent noise to data space.
        # The output layer is an assumption: one linear unit per data feature.
        self.generator = tf.keras.Sequential([
            layers.Dense(10, activation='relu'),
            layers.Dense(data_dim),
        ])
        # The discriminator outputs the probability that a sample is generated
        self.discriminator = tf.keras.Sequential([
            layers.Dense(10, activation='relu'),
            layers.Dense(1, activation='sigmoid'),
        ])

    def compile(self, gen_optimizer, disc_optimizer, loss_fn):
        super().compile()
        self.gen_optimizer = gen_optimizer
        self.disc_optimizer = disc_optimizer
        self.loss_fn = loss_fn

    def train_step(self, real_data):
        # fit() may wrap array inputs in a tuple
        if isinstance(real_data, tuple):
            real_data = real_data[0]
        batch_size = tf.shape(real_data)[0]
        random_latent_vectors = tf.random.normal(shape=(batch_size, self.latent_dim))
        generated_data = self.generator(random_latent_vectors)
        combined_data = tf.concat([generated_data, real_data], axis=0)
        # Labels: 1 for generated samples, 0 for real samples
        labels = tf.concat(
            [tf.ones((batch_size, 1)), tf.zeros((batch_size, 1))], axis=0
        )
        # Add a little random noise to the labels
        labels += 0.05 * tf.random.uniform(tf.shape(labels))

        # Train the discriminator
        with tf.GradientTape() as tape:
            predictions = self.discriminator(combined_data)
            disc_loss = self.loss_fn(labels, predictions)
        grads = tape.gradient(disc_loss, self.discriminator.trainable_weights)
        self.disc_optimizer.apply_gradients(
            zip(grads, self.discriminator.trainable_weights)
        )

        # Train the generator with labels that claim its samples are real
        random_latent_vectors = tf.random.normal(shape=(batch_size, self.latent_dim))
        misleading_labels = tf.zeros((batch_size, 1))

        with tf.GradientTape() as tape:
            predictions = self.discriminator(self.generator(random_latent_vectors))
            gen_loss = self.loss_fn(misleading_labels, predictions)
        grads = tape.gradient(gen_loss, self.generator.trainable_weights)
        self.gen_optimizer.apply_gradients(zip(grads, self.generator.trainable_weights))

        return {"disc_loss": disc_loss, "gen_loss": gen_loss}

# Instantiate and compile the GAN
# Assumption: `data` is the normalized DataFrame of real samples
x_train = data.to_numpy(dtype='float32')
latent_dim = 2
gan = GAN(latent_dim, data_dim=x_train.shape[1])
gen_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
disc_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.BinaryCrossentropy()
gan.compile(gen_optimizer, disc_optimizer, loss_fn)

# Train the GAN
gan.fit(x_train, epochs=50, batch_size=32)

# Generate synthetic data
random_latent_vectors = np.random.normal(size=(1000, latent_dim))
synthetic_data_gan = gan.generator.predict(random_latent_vectors)

GANs shine in projects that need lifelike images, such as artistic creation or the development of realistic game environments. Here are the pros and cons of employing this method:

Pros: Capable of generating complex, realistic data; does not require an explicit model of the data distribution.
Cons: Training can be difficult and unstable; requires a good amount of real data for training.
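
The two competing objectives can be made concrete with a small numeric example. The sketch below follows the labeling convention of the `train_step` code above (1 for generated samples, 0 for real samples) and uses hypothetical discriminator outputs; the helper `binary_crossentropy` is written out in NumPy for illustration rather than taken from a library.

```python
import numpy as np

def binary_crossentropy(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    p = np.clip(predictions, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))))

# Hypothetical discriminator outputs: probability that each sample is generated
preds_on_generated = np.array([0.9, 0.8, 0.7])
preds_on_real = np.array([0.2, 0.1, 0.3])

# Discriminator objective: label generated samples 1 and real samples 0
disc_loss = binary_crossentropy(
    np.concatenate([np.ones(3), np.zeros(3)]),
    np.concatenate([preds_on_generated, preds_on_real]),
)

# Generator objective: the same generated samples scored against the
# "real" label 0, so the loss is low only when the discriminator is fooled
gen_loss = binary_crossentropy(np.zeros(3), preds_on_generated)

print(f"disc_loss={disc_loss:.3f}, gen_loss={gen_loss:.3f}")
```

Here the discriminator is doing well, so its loss is low and the generator's loss is high. As training progresses, improvements on one side raise the loss on the other, which is one reason GAN training can be unstable.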

Method 5: Language Model Learning

Large language models (LLMs) like GPT-3 can be fine-tuned on an existing dataset to generate tabular data. A fine-tuned model picks up the patterns and relationships between columns, which lets it produce new, synthetic rows that follow them.

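import openai
import pandas as pd

# Assume OpenAI API is set up and GPT-3 is fine-tuned on the dataset.
# Note: openai.Completion is the legacy interface (openai package versions before 1.0).
def generate_synthetic_data(prompt, num_entries, model):
    response = openai.Completion.create(
        model=model,  # ID of the fine-tuned model
        prompt=prompt,
        max_tokens=num_entries * 20,  # assuming an average of 20 tokens per entry
    )
    generated_text = response['choices'][0]['text'].strip()
    # Convert the generated text back into tabular data
    generated_lines = generated_text.split('\n')
    headers = generated_lines[0].split(',')
    # Keep only rows whose field count matches the header
    rows = [line.split(',') for line in generated_lines[1:]]
    synthetic_data = [row for row in rows if len(row) == len(headers)]
    return pd.DataFrame(synthetic_data, columns=headers)

# Sample usage:
fine_tuned_model = "..."  # placeholder: ID of your fine-tuned model
prompt = "Generate additional data entries for the energy consumption project based on the following pattern: ..."
# max_tokens must fit within the model's completion limit, so a large request
# like this one may need to be split into several smaller calls
synthetic_data_text = generate_synthetic_data(prompt, 1000, fine_tuned_model)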

Language model learning is well-suited for projects that require data to emulate intricate and nuanced patterns, such as crafting text for chatbots or scripting realistic dialogue for virtual assistants. Consider the following pros and cons of this method:

Pros: Can generate realistic, complex data; high flexibility in data generation.
Cons: Requires a substantial amount of training data; resource-intensive.
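
Two steps around the model call can be developed and tested without any API access: building a prompt from example rows, and validating the text that comes back. The sketch below is one possible approach under stated assumptions. The helper names `build_prompt` and `parse_csv_response`, the column names, and the sample response text are all hypothetical.

```python
import csv
import io

import pandas as pd

def build_prompt(examples, num_entries):
    """Format example rows as CSV and ask for more rows in the same format."""
    csv_examples = examples.to_csv(index=False).strip()
    return (
        f"Here are example rows in CSV format:\n{csv_examples}\n"
        f"Generate {num_entries} additional rows in the same CSV format, "
        "including the header row."
    )

def parse_csv_response(text, expected_columns):
    """Parse CSV text, dropping rows with a wrong field count or non-numeric values."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    header = [name.strip() for name in rows[0]]
    if header != expected_columns:
        raise ValueError(f"unexpected header: {header}")
    valid_rows = [row for row in rows[1:] if len(row) == len(header)]
    frame = pd.DataFrame(valid_rows, columns=header)
    # Non-numeric cells become NaN and their rows are dropped
    frame = frame.apply(pd.to_numeric, errors="coerce").dropna()
    return frame.reset_index(drop=True)

# Hypothetical example rows and column names
examples = pd.DataFrame({"temperature_f": [67.5, 70.1], "energy_kwh": [168.9, 175.2]})
prompt = build_prompt(examples, num_entries=3)

# Hypothetical model output containing two malformed rows
response_text = "temperature_f,energy_kwh\n68.2,170.4\n69.0\n66.7,abc\n71.3,178.1"
synthetic = parse_csv_response(response_text, list(examples.columns))
print(synthetic)
```

Generated text is not guaranteed to be well formed, so validating the header, the field count, and the value types before using the rows guards against silently corrupt training data.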

Synthetic data generation offers a range of options for producing data that closely resembles real data. Each technique comes with its own strengths and weaknesses. Your selection should align with your project's needs, the real data at hand, and the complexity you're prepared to tackle. With a suitable technique, synthetic data can supply the larger training sets that machine learning models need to become strong and dependable.