Unlocking New Possibilities: Hands-on Data Generation for Every Project
With Practical Examples of Data Generation in Action.
There are times when Data Projects or Data Science teams hit a roadblock.
No Real Data!
The experience of not having Real Data can be frustrating. Let us find a possible solution to this problem.
When fast-moving enterprises lack adequate data to fine-tune and test AI models, it impedes their AI implementations and, ultimately, the outcomes they can drive from their AI investments.
Most recently, I have delved into the topic of “Synthetic Data” and how it propels the way forward for AI Advancement.
Here is a link to learn more about Synthetic Data.
With an understanding of Synthetic Data, let us now focus on building a fundamental framework to generate data for testing features, applications, software, or even fine-tuning AI Models.
To achieve this, here are two approaches.
Randomized Data Generation Techniques-
For this example, we are using Python Faker Library.
Faker is a Python package that generates fake data, including names, addresses, emails, phone numbers, dates, and more.
It’s widely used for testing, populating databases, creating sample datasets, and anonymizing data.
Faker supports many locales (languages/countries) and can generate data in a variety of formats.
To install Faker, just run-
pip install faker
Generating Randomized Data
For this, I am creating some generic database tables, such as PERSON, SUBSCRIPTIONS, and ADDRESS.
Here are the steps -
Step 1: Database table for storing data
Here is a script to create these three tables that I will be using in this example -
Table 1: PERSON
This table has columns for PERSON_ID, First Name, Last Name, Eye Color, and Birth Date, with PERSON_ID as the PK.
create table PERSON (
    PERSON_ID int,
    F_NAME varchar(20),
    L_NAME varchar(20),
    EYE_COLOR enum('BLACK', 'BLUE', 'GREEN', 'BROWN'),
    BIRTH_DATE date,
    CONSTRAINT PK_PERSON PRIMARY KEY (PERSON_ID)
);
Table 2: SUBSCRIPTIONS
This table is a child table of the above PERSON table, with columns such as PERSON_ID (FK), SUBSCRIPTION_NAME, TYPE, PAYMENT_DATE, and PAYMENT_AMOUNT. The PERSON_ID, which is the PK of the PERSON table, is an FK in this table. The combination of PERSON_ID & SUBSCRIPTION_NAME is unique.
create table SUBSCRIPTIONS (
    PERSON_ID int,
    SUBSCRIPTION_NAME enum('Netflix', 'Amazon Prime', 'Spotify', 'Hulu', 'Disney+', 'Apple Music', 'YouTube Premium', 'HBO Max', 'New York Times', 'Audible'),
    TYPE enum('MONTHLY', 'YEARLY', 'WEEKLY', 'TRIAL'),
    PAYMENT_DATE date,
    PAYMENT_AMOUNT decimal(10, 2),
    CONSTRAINT PK_PERSON_SUBS PRIMARY KEY (PERSON_ID, SUBSCRIPTION_NAME),
    CONSTRAINT FK_PERSON_SUBS FOREIGN KEY (PERSON_ID) REFERENCES PERSON (PERSON_ID)
);

Note: PAYMENT_AMOUNT is declared as decimal(10, 2) rather than int so that the two-decimal payment amounts generated later are stored without truncation.
Table 3: ADDRESS
This table is also a child table of the above PERSON table, with a few typical address columns, including address lines 1, 2, 3, City, State, Country, and ZIP. The PERSON_ID, which is PK from PERSON TABLE, is an FK in this table. The combination of PERSON_ID & ADDRESS_TYPE is unique.
create table ADDRESS (
    PERSON_ID int,
    ADDRESS_TYPE enum('PRIMARY', 'BILLING', 'COMMUNICATION', 'SHIPPING', 'TEMPORARY'),
    LINE_1 varchar(200),
    LINE_2 varchar(200),
    LINE_3 varchar(200),
    CITY varchar(100),
    STATE varchar(20),
    COUNTRY varchar(200),
    ZIP varchar(20),
    CONSTRAINT PK_PERSON_ADDRESS PRIMARY KEY (PERSON_ID, ADDRESS_TYPE),
    CONSTRAINT FK_PERSON FOREIGN KEY (PERSON_ID) REFERENCES PERSON (PERSON_ID)
);
Step 2: Generate Synthetic Data using Python and FAKER Lib
Python Script to generate PERSON data
Here you go
import csv
import random

from faker import Faker

fake = Faker()

# Match the EYE_COLOR enum values defined in the PERSON table
EYE_COLORS = ['BLACK', 'BLUE', 'GREEN', 'BROWN']

with open('people1.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PERSON_ID', 'First Name', 'Last Name', 'Eye Color', 'Birthdate'])
    for i in range(1, 1001):
        first_name = fake.first_name()
        last_name = fake.last_name()
        eye_color = random.choice(EYE_COLORS)
        birthdate = fake.date_of_birth(minimum_age=18, maximum_age=90).strftime('%Y-%m-%d')
        writer.writerow([i, first_name, last_name, eye_color, birthdate])

print('people1.csv generated with 1000 fake people!')
Command to run the Python Script to generate people1.csv
python generate_people_csv.py
Command to load the csv into the PERSON TABLE
Here is a SQL command
LOAD DATA INFILE '/path/to/the/people1.csv'
INTO TABLE PERSON
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
Python Script to generate SUBSCRIPTION data
Here you go. This script ensures that each (PERSON_ID, SUBSCRIPTION_NAME) combination is unique, preserving the table's primary-key integrity.
import csv
import random

from faker import Faker

fake = Faker('en_US')

SUBSCRIPTION_NAMES = ['Netflix', 'Amazon Prime', 'Spotify', 'Hulu', 'Disney+',
                      'Apple Music', 'YouTube Premium', 'HBO Max',
                      'New York Times', 'Audible']
SUBSCRIPTION_TYPES = ['MONTHLY', 'YEARLY', 'WEEKLY', 'TRIAL']

# Read person IDs from people1.csv
person_ids = []
with open('people1.csv', newline='') as peoplefile:
    reader = csv.DictReader(peoplefile)
    for row in reader:
        person_ids.append(int(row['PERSON_ID']))

unique_combinations = set()
rows = []
while len(rows) < 1000:
    person_id = random.choice(person_ids)
    sub_name = random.choice(SUBSCRIPTION_NAMES)
    key = (person_id, sub_name)
    if key in unique_combinations:
        continue
    unique_combinations.add(key)
    sub_type = random.choice(SUBSCRIPTION_TYPES)
    payment_date = fake.date_between(start_date='-2y', end_date='today').strftime('%Y-%m-%d')
    payment_amount = round(random.uniform(5, 100), 2)
    rows.append([person_id, sub_name, sub_type, payment_date, payment_amount])

with open('subscriptions.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PERSON_ID', 'SUBSCRIPTION_NAME', 'TYPE', 'PAYMENT_DATE', 'PAYMENT_AMOUNT'])
    writer.writerows(rows)

print('subscriptions.csv generated with 1000 unique (PERSON_ID, SUBSCRIPTION_NAME) pairs!')
Command to run the Python Script to generate subscriptions.csv
python generate_subscriptions_csv.py
Command to load the csv into the SUBSCRIPTIONS TABLE
Here is a SQL command
LOAD DATA INFILE '/path/to/the/subscriptions.csv'
INTO TABLE SUBSCRIPTIONS
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
Python Script to generate ADDRESS data
Here you go
import csv
import random

from faker import Faker

fake = Faker('en_US')

ADDRESS_TYPES = ['BILLING', 'SHIPPING', 'PRIMARY', 'COMMUNICATION']

# Read person IDs from people1.csv
person_ids = []
with open('people1.csv', newline='') as peoplefile:
    reader = csv.DictReader(peoplefile)
    for row in reader:
        person_ids.append(int(row['PERSON_ID']))

unique_combinations = set()
rows = []
while len(rows) < 1000:
    person_id = random.choice(person_ids)
    address_type = random.choice(ADDRESS_TYPES)
    key = (person_id, address_type)
    if key in unique_combinations:
        continue
    unique_combinations.add(key)
    city = fake.city()
    line_1 = fake.building_number()
    line_2 = fake.street_name()
    line_3 = fake.secondary_address()
    state = fake.state_abbr()
    country = fake.country()
    zip_code = fake.zipcode()
    rows.append([person_id, address_type, line_1, line_2, line_3, city, state, country, zip_code])

with open('addresses.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PERSON_ID', 'ADDRESS_TYPE', 'LINE_1', 'LINE_2', 'LINE_3', 'CITY', 'STATE', 'COUNTRY', 'ZIP'])
    writer.writerows(rows)

print('addresses.csv generated with 1000 unique (PERSON_ID, ADDRESS_TYPE) pairs!')
Command to run the Python Script to generate addresses.csv
python generate_addresses_csv.py
Command to load the csv into the ADDRESS TABLE
Here is a SQL command
LOAD DATA INFILE '/path/to/the/addresses.csv'
INTO TABLE ADDRESS
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
These steps will create the required tables in MySQL, generate data using the Faker library of Python, and load the tables with the generated data.
Note: As an alternative, you can use a Python MySQL library (for example, mysql-connector-python) to perform all of these steps from a Python notebook instead of running the SQL in MySQL directly.
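As a sketch of that alternative, the helper below parses the generated people CSV into row tuples ready for executemany. The tiny demo file lets it run without a database; the connection details in the commented usage are hypothetical placeholders and assume the mysql-connector-python package is installed:

```python
import csv

# Column order matches the PERSON table DDL above
INSERT_SQL = ("INSERT INTO PERSON (PERSON_ID, F_NAME, L_NAME, EYE_COLOR, BIRTH_DATE) "
              "VALUES (%s, %s, %s, %s, %s)")

def person_rows(csv_path):
    """Parse the generated CSV into tuples matching the PERSON columns."""
    with open(csv_path, newline='') as f:
        return [(int(r['PERSON_ID']), r['First Name'], r['Last Name'],
                 r['Eye Color'].upper(), r['Birthdate'])
                for r in csv.DictReader(f)]

# Tiny demo file so the helper can be exercised without a database
with open('people_demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['PERSON_ID', 'First Name', 'Last Name', 'Eye Color', 'Birthdate'])
    writer.writerow([1, 'Ada', 'Lovelace', 'Brown', '1815-12-10'])

rows = person_rows('people_demo.csv')

# Usage against a live MySQL server (hypothetical credentials):
# import mysql.connector
# conn = mysql.connector.connect(host='localhost', user='user',
#                                password='secret', database='testdb')
# cur = conn.cursor()
# cur.executemany(INSERT_SQL, rows)
# conn.commit()
# conn.close()
```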
Synthetic Data Generation Techniques-
For this example, we will use YData-synthetic, an open-source Python library for generating synthetic data using machine learning models, including GANs (Generative Adversarial Networks), CTGAN, and other tabular data synthesizers. It is widely used for privacy-preserving data generation, data augmentation, and testing.
Note: There are several other Synthetic data providers, and the list of all providers is available in the article linked above.
Example of Synthetic Data Generation using the YData Python library

The sketch below follows the ydata-synthetic quickstart for tabular data: it trains a CTGAN-based synthesizer on a real dataset and then samples new rows from it. The file name and the column lists are placeholders to replace with your own. (For time-series data, the library also provides a TimeGAN-based synthesizer under ydata_synthetic.synthesizers.timeseries.)

import pandas as pd

from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Load your real customer data (replace with your actual data loading)
data = pd.read_csv('your_data.csv')

# Tell the synthesizer which columns are numerical and which are categorical
num_cols = ['zip']            # placeholder column names
cat_cols = ['name', 'city']   # placeholder column names

# Define the GAN model parameters
gan_args = ModelParameters(batch_size=128,
                           lr=2e-4,
                           betas=(0.5, 0.9),
                           noise_dim=32,
                           layers_dim=128)

# Define the training parameters
train_args = TrainParameters(epochs=100)

# Train the CTGAN-based synthesizer on the real data
synth = RegularSynthesizer(modelname='ctgan', model_parameters=gan_args)
synth.fit(data=data, train_arguments=train_args,
          num_cols=num_cols, cat_cols=cat_cols)

# Generate synthetic customer data in the original format
synthetic_data = synth.sample(1000)

# Save the synthetic data to a CSV file
synthetic_data.to_csv('synthetic_customer_data.csv', index=False)
Below is a general explanation of what the Python code in a typical ydata-synthetic workflow does:
1. Data Loading and Preprocessing
Start by loading your real dataset (e.g., a Pandas DataFrame) and, optionally, preprocess it (handle missing values, encode categories, mask sensitive PII information, etc.).
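As a minimal illustration of this step, typical preprocessing with pandas might look like the following (the three-row dataset and its columns are invented for the example):

```python
import pandas as pd

# Hypothetical raw dataset: a numeric column with a gap,
# a categorical column, and a sensitive PII column
df = pd.DataFrame({
    'age': [34, None, 51],
    'city': ['Dallas', 'Austin', 'Dallas'],
    'ssn': ['111-22-3333', '444-55-6666', '777-88-9999'],
})

df['age'] = df['age'].fillna(df['age'].median())  # handle missing values
df['city'] = df['city'].astype('category')        # encode the categorical column
df = df.drop(columns=['ssn'])                     # mask/drop sensitive PII

print(df.dtypes)
```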
2. Model Selection and Configuration
Select a synthesizer model (e.g., RegularSynthesizer, CTGANSynthesizer, GTVAESynthesizer) and configure its parameters.
3. Model Training
Now, fit the synthesizer to your real data. The model learns the statistical patterns and relationships in your dataset.
Note that model training is not performed in the first, randomized-data example; this learning step is what fundamentally distinguishes the two methods.
4. Synthetic Data Generation
Use the trained synthesizer to generate new, synthetic samples that resemble the real data.
5. (Optional) Evaluation
For evaluation, compare the synthetic data to the real data using built-in metrics or visualizations to assess quality and privacy.
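As a rough, library-agnostic version of such a check, you can compare simple statistics between the real and synthetic frames with pandas. The toy data below is invented purely for illustration:

```python
import pandas as pd

# Hypothetical real vs. synthetic samples: one numeric and one categorical column
real = pd.DataFrame({'amount': [10.0, 20.0, 30.0, 40.0],
                     'plan': ['MONTHLY', 'YEARLY', 'MONTHLY', 'TRIAL']})
synthetic = pd.DataFrame({'amount': [12.0, 19.0, 33.0, 38.0],
                          'plan': ['MONTHLY', 'MONTHLY', 'YEARLY', 'TRIAL']})

# Gap between the means of the numeric column
mean_gap = abs(real['amount'].mean() - synthetic['amount'].mean())

# Largest gap between category frequency distributions
real_freq = real['plan'].value_counts(normalize=True)
synth_freq = synthetic['plan'].value_counts(normalize=True)
freq_gap = (real_freq - synth_freq).abs().max()

print(f'mean gap: {mean_gap:.2f}, max frequency gap: {freq_gap:.2f}')
```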
Key Takeaways-
Both examples use Python libraries, but the real difference between the two approaches is this -
The YData example is a true synthetic data generator: it trains a GAN-based model (such as CTGAN or TimeGAN) with explicit model and training parameters, so it learns the statistical patterns of your real data. This highly controllable approach is well suited to producing data for fine-tuning existing AI models toward the outcomes your project needs.
Faker, by contrast, generates fake data from predefined templates, randomization, and word lists of names, addresses, and other fields. It does not use AI, GPT, or any neural network, but for testing software features or applications this randomized technique is quick and convenient.
I hope this gives you an idea of how synthetic data generators and randomized libraries can produce test data for AI implementations and beyond.