Unlocking New Possibilities: Hands-on Data Generation for Every Project
With Practical Examples of Data Generation in Action.
There are times when Data Projects or Data Science teams hit a roadblock.
No Real Data!
The experience of not having Real Data can be frustrating. Let us find a possible solution to this problem.
When fast-moving enterprises lack adequate data to fine-tune and test AI models, it impedes their AI implementations and, ultimately, the outcomes they can drive from their AI investments.
Most recently, I have delved into the topic of “Synthetic Data” and how it propels the way forward for AI Advancement.
Here is a link to learn more about Synthetic Data.
With an understanding of Synthetic Data, let us now focus on building a fundamental framework to generate data for testing features, applications, software, or even fine-tuning AI Models.
To achieve this, here are two approaches.
Randomized Data Generation Techniques-
For this example, we are using Python Faker Library.
Faker is a Python package that generates fake data, including names, addresses, emails, phone numbers, dates, and more.
It’s widely used for testing, populating databases, creating sample datasets, and anonymizing data.
Faker supports many locales (languages/countries) and can generate data in a variety of formats.
To install Faker, just run-
pip install faker
Generating Randomized Data
For this, I am creating some generic database tables, such as PERSON, SUBSCRIPTIONS, and ADDRESS.
Here are the steps -
Step 1: Database table for storing data
Here is a script to create these three tables that I will be using in this example -
Table 1: PERSON
This table has columns for PERSON_ID, First Name, Last Name, Eye Color, and Birth Date, with PERSON_ID as the PK.
create table PERSON (
    PERSON_ID int,
    F_NAME varchar(20),
    L_NAME varchar(20),
    EYE_COLOR enum('BLACK', 'BLUE', 'GREEN', 'BROWN'),
    BIRTH_DATE date,
    CONSTRAINT PK_PERSON PRIMARY KEY (PERSON_ID)
);
Table 2: SUBSCRIPTIONS
This table is a child table of the above PERSON table, with columns such as PERSON_ID (FK), SUBSCRIPTION_NAME, TYPE, PAYMENT_DATE, and PAYMENT_AMOUNT. The PERSON_ID, which is the PK of the PERSON table, is an FK in this table. The combination of PERSON_ID & SUBSCRIPTION_NAME is unique.
create table SUBSCRIPTIONS (
    PERSON_ID int,
    SUBSCRIPTION_NAME enum('Netflix', 'Amazon Prime', 'Spotify', 'Hulu', 'Disney+', 'Apple Music', 'YouTube Premium', 'HBO Max', 'New York Times', 'Audible'),
    TYPE enum('MONTHLY', 'YEARLY', 'WEEKLY', 'TRIAL'),
    PAYMENT_DATE date,
    PAYMENT_AMOUNT decimal(10, 2),
    CONSTRAINT PK_PERSON_SUBS PRIMARY KEY (PERSON_ID, SUBSCRIPTION_NAME),
    CONSTRAINT FK_PERSON_SUBS FOREIGN KEY (PERSON_ID) REFERENCES PERSON (PERSON_ID)
);

Note: PAYMENT_AMOUNT is declared as decimal(10, 2) rather than int so that the two-decimal payment amounts generated later are stored without truncation.
Table 3: ADDRESS
This table is also a child table of the above PERSON table, with a few typical address columns, including address lines 1, 2, 3, City, State, Country, and ZIP. The PERSON_ID, which is PK from PERSON TABLE, is an FK in this table. The combination of PERSON_ID & ADDRESS_TYPE is unique.
create table ADDRESS (
    PERSON_ID int,
    ADDRESS_TYPE enum('PRIMARY', 'BILLING', 'COMMUNICATION', 'SHIPPING', 'TEMPORARY'),
    LINE_1 varchar(200),
    LINE_2 varchar(200),
    LINE_3 varchar(200),
    CITY varchar(100),
    STATE varchar(20),
    COUNTRY varchar(200),
    ZIP varchar(20),
    CONSTRAINT PK_PERSON_ADDRESS PRIMARY KEY (PERSON_ID, ADDRESS_TYPE),
    CONSTRAINT FK_PERSON FOREIGN KEY (PERSON_ID) REFERENCES PERSON (PERSON_ID)
);
Step 2: Generate Synthetic Data using Python and FAKER Lib
Python Script to generate PERSON data
Here you go
import csv
import random

from faker import Faker

fake = Faker()

# Match the EYE_COLOR enum values defined in the PERSON table
EYE_COLORS = ['BLACK', 'BLUE', 'GREEN', 'BROWN']

with open('people1.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PERSON_ID', 'First Name', 'Last Name', 'Eye Color', 'Birthdate'])
    for i in range(1, 1001):
        first_name = fake.first_name()
        last_name = fake.last_name()
        eye_color = random.choice(EYE_COLORS)
        birthdate = fake.date_of_birth(minimum_age=18, maximum_age=90).strftime('%Y-%m-%d')
        writer.writerow([i, first_name, last_name, eye_color, birthdate])

print('people1.csv generated with 1000 fake people!')
Command to run the Python Script to generate people1.csv
python generate_people_csv.py
Command to load the csv into the PERSON TABLE
Here is a SQL command
LOAD DATA INFILE '/path/to/the/people1.csv'
INTO TABLE PERSON
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
Python Script to generate SUBSCRIPTION data
Here you go. This script ensures that each (PERSON_ID, SUBSCRIPTION_NAME) combination is unique, preserving the table's primary-key integrity.
import csv
import random

from faker import Faker

fake = Faker('en_US')

SUBSCRIPTION_NAMES = ['Netflix', 'Amazon Prime', 'Spotify', 'Hulu', 'Disney+',
                      'Apple Music', 'YouTube Premium', 'HBO Max',
                      'New York Times', 'Audible']
SUBSCRIPTION_TYPES = ['MONTHLY', 'YEARLY', 'WEEKLY', 'TRIAL']

# Read person IDs from people1.csv
person_ids = []
with open('people1.csv', newline='') as peoplefile:
    reader = csv.DictReader(peoplefile)
    for row in reader:
        person_ids.append(int(row['PERSON_ID']))

unique_combinations = set()
rows = []
while len(rows) < 1000:
    person_id = random.choice(person_ids)
    sub_name = random.choice(SUBSCRIPTION_NAMES)
    key = (person_id, sub_name)
    if key in unique_combinations:
        continue
    unique_combinations.add(key)
    sub_type = random.choice(SUBSCRIPTION_TYPES)
    payment_date = fake.date_between(start_date='-2y', end_date='today').strftime('%Y-%m-%d')
    payment_amount = round(random.uniform(5, 100), 2)
    rows.append([person_id, sub_name, sub_type, payment_date, payment_amount])

with open('subscriptions.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PERSON_ID', 'SUBSCRIPTION_NAME', 'TYPE', 'PAYMENT_DATE', 'PAYMENT_AMOUNT'])
    writer.writerows(rows)

print('subscriptions.csv generated with 1000 unique (PERSON_ID, SUBSCRIPTION_NAME) pairs!')
Command to run the Python Script to generate subscriptions.csv
python generate_subscriptions_csv.py
Command to load the csv into the SUBSCRIPTIONS TABLE
Here is a SQL command
LOAD DATA INFILE '/path/to/the/subscriptions.csv'
INTO TABLE SUBSCRIPTIONS
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
Python Script to generate ADDRESS data
Here you go
import csv
import random

from faker import Faker

fake = Faker('en_US')

ADDRESS_TYPES = ['BILLING', 'SHIPPING', 'PRIMARY', 'COMMUNICATION']

# Read person IDs from people1.csv
person_ids = []
with open('people1.csv', newline='') as peoplefile:
    reader = csv.DictReader(peoplefile)
    for row in reader:
        person_ids.append(int(row['PERSON_ID']))

unique_combinations = set()
rows = []
while len(rows) < 1000:
    person_id = random.choice(person_ids)
    address_type = random.choice(ADDRESS_TYPES)
    key = (person_id, address_type)
    if key in unique_combinations:
        continue
    unique_combinations.add(key)
    city = fake.city()
    line_1 = fake.building_number()
    line_2 = fake.street_name()
    line_3 = fake.secondary_address()
    state = fake.state_abbr()
    country = fake.country()
    zip_code = fake.zipcode()
    rows.append([person_id, address_type, line_1, line_2, line_3, city, state, country, zip_code])

with open('addresses.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PERSON_ID', 'ADDRESS_TYPE', 'LINE_1', 'LINE_2', 'LINE_3', 'CITY', 'STATE', 'COUNTRY', 'ZIP'])
    writer.writerows(rows)

print('addresses.csv generated with 1000 unique (PERSON_ID, ADDRESS_TYPE) pairs!')
Command to run the Python Script to generate addresses.csv
python generate_addresses_csv.py
Command to load the csv into the ADDRESS TABLE
Here is a SQL command
LOAD DATA INFILE '/path/to/the/addresses.csv'
INTO TABLE ADDRESS
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
These steps will create the required tables in MySQL, generate data using the Faker library of Python, and load the tables with the generated data.
Note: As an alternative, you can use a Python MySQL library (for example, mysql-connector-python) to perform all of these steps from a Python notebook instead of running the SQL in MySQL directly.
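As a sketch of that alternative, the helper below parses the generated people CSV into row tuples ready for executemany. The tiny demo file lets it run without a database; the connection details in the commented usage are hypothetical placeholders and assume the mysql-connector-python package is installed:

```python
import csv

# Column order matches the PERSON table DDL above
INSERT_SQL = ("INSERT INTO PERSON (PERSON_ID, F_NAME, L_NAME, EYE_COLOR, BIRTH_DATE) "
              "VALUES (%s, %s, %s, %s, %s)")

def person_rows(csv_path):
    """Parse the generated CSV into tuples matching the PERSON columns."""
    with open(csv_path, newline='') as f:
        return [(int(r['PERSON_ID']), r['First Name'], r['Last Name'],
                 r['Eye Color'].upper(), r['Birthdate'])
                for r in csv.DictReader(f)]

# Tiny demo file so the helper can be exercised without a database
with open('people_demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['PERSON_ID', 'First Name', 'Last Name', 'Eye Color', 'Birthdate'])
    writer.writerow([1, 'Ada', 'Lovelace', 'Brown', '1815-12-10'])

rows = person_rows('people_demo.csv')

# Usage against a live MySQL server (hypothetical credentials):
# import mysql.connector
# conn = mysql.connector.connect(host='localhost', user='user',
#                                password='secret', database='testdb')
# cur = conn.cursor()
# cur.executemany(INSERT_SQL, rows)
# conn.commit()
# conn.close()
```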
Synthetic Data Generation Techniques-
For this example, we will use YData-synthetic, an open-source Python library for generating synthetic data using machine learning models, including GANs (Generative Adversarial Networks), CTGAN, and other tabular data synthesizers. It is widely used for privacy-preserving data generation, data augmentation, and testing.
Note: There are several other Synthetic data providers, and the list of all providers is available in the article linked above.
Example of Synthetic Data Generation using the YData Python library

The sketch below follows the ydata-synthetic quickstart for tabular data: it trains a CTGAN-based synthesizer on a real dataset and then samples new rows from it. The file name and the column lists are placeholders to replace with your own. (For time-series data, the library also provides a TimeGAN-based synthesizer under ydata_synthetic.synthesizers.timeseries.)

import pandas as pd

from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Load your real customer data (replace with your actual data loading)
data = pd.read_csv('your_data.csv')

# Tell the synthesizer which columns are numerical and which are categorical
num_cols = ['zip']            # placeholder column names
cat_cols = ['name', 'city']   # placeholder column names

# Define the GAN model parameters
gan_args = ModelParameters(batch_size=128,
                           lr=2e-4,
                           betas=(0.5, 0.9),
                           noise_dim=32,
                           layers_dim=128)

# Define the training parameters
train_args = TrainParameters(epochs=100)

# Train the CTGAN-based synthesizer on the real data
synth = RegularSynthesizer(modelname='ctgan', model_parameters=gan_args)
synth.fit(data=data, train_arguments=train_args,
          num_cols=num_cols, cat_cols=cat_cols)

# Generate synthetic customer data in the original format
synthetic_data = synth.sample(1000)

# Save the synthetic data to a CSV file
synthetic_data.to_csv('synthetic_customer_data.csv', index=False)
Below is a general explanation of what the Python code in a typical ydata-synthetic workflow does:
1. Data Loading and Preprocessing
Start by loading your real dataset (e.g., a Pandas DataFrame) and, optionally, preprocess it (handle missing values, encode categories, mask sensitive PII information, etc.).
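As a minimal illustration of this step, typical preprocessing with pandas might look like the following (the three-row dataset and its columns are invented for the example):

```python
import pandas as pd

# Hypothetical raw dataset: a numeric column with a gap,
# a categorical column, and a sensitive PII column
df = pd.DataFrame({
    'age': [34, None, 51],
    'city': ['Dallas', 'Austin', 'Dallas'],
    'ssn': ['111-22-3333', '444-55-6666', '777-88-9999'],
})

df['age'] = df['age'].fillna(df['age'].median())  # handle missing values
df['city'] = df['city'].astype('category')        # encode the categorical column
df = df.drop(columns=['ssn'])                     # mask/drop sensitive PII

print(df.dtypes)
```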
2. Model Selection and Configuration
Select a synthesizer model (e.g., RegularSynthesizer, CTGANSynthesizer, GTVAESynthesizer) and configure its parameters.
3. Model Training
Now, fit the synthesizer to your real data. The model learns the statistical patterns and relationships in your dataset.
Note that model training is not performed in the first, randomized-data example; this learning step is what fundamentally distinguishes the two methods.
4. Synthetic Data Generation
Use the trained synthesizer to generate new, synthetic samples that resemble the real data.
5. (Optional) Evaluation
For evaluation, compare the synthetic data to the real data using built-in metrics or visualizations to assess quality and privacy.
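As a rough, library-agnostic version of such a check, you can compare simple statistics between the real and synthetic frames with pandas. The toy data below is invented purely for illustration:

```python
import pandas as pd

# Hypothetical real vs. synthetic samples: one numeric and one categorical column
real = pd.DataFrame({'amount': [10.0, 20.0, 30.0, 40.0],
                     'plan': ['MONTHLY', 'YEARLY', 'MONTHLY', 'TRIAL']})
synthetic = pd.DataFrame({'amount': [12.0, 19.0, 33.0, 38.0],
                          'plan': ['MONTHLY', 'MONTHLY', 'YEARLY', 'TRIAL']})

# Gap between the means of the numeric column
mean_gap = abs(real['amount'].mean() - synthetic['amount'].mean())

# Largest gap between category frequency distributions
real_freq = real['plan'].value_counts(normalize=True)
synth_freq = synthetic['plan'].value_counts(normalize=True)
freq_gap = (real_freq - synth_freq).abs().max()

print(f'mean gap: {mean_gap:.2f}, max frequency gap: {freq_gap:.2f}')
```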
Key Takeaways-
Both examples use Python libraries, but the real difference between the two approaches is this -
The YData example is a true synthetic data generator: it trains a GAN-based model (such as CTGAN or TimeGAN) with explicit model and training parameters, so it learns the statistical patterns of your real data. This highly controllable approach is well suited to producing data for fine-tuning existing AI models toward the outcomes your project needs.
Faker, by contrast, generates fake data from predefined templates, randomization, and word lists of names, addresses, and other fields. It does not use AI, GPT, or any neural network, but for testing software features or applications this randomized technique is quick and convenient.
I hope this gives you an idea of how synthetic data generators and randomized libraries can produce test data for AI implementations and beyond.