Generate realistic test data

As data professionals, we often need test data, whether for functional testing, to satisfy business logic criteria, or for non-functional testing, to satisfy performance requirements. We must also avoid storing sensitive or personal information in non-production systems, as doing so could breach data protection regulations such as the GDPR.

A common approach is to refresh test environments from production and thus load production data for testing. The problem with this approach is that production data may not exercise all of our business logic. For example, we could have a business rule that rewards customers born on February 29th, but we may not have any such customers. In that case, our production data would never trigger this particular business rule and we would never be able to validate it. The only way to test it is to generate test customers born on February 29th.
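To make the February 29th example concrete, here is a minimal sketch that lists the leap days in a range of years, which could then be used as birth dates for test customers. The function names is_leap_year and leap_day_birthdays are my own and hypothetical; they are not part of any tool mentioned in this post.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the given year is a leap year
# in the Gregorian calendar: divisible by 4, except centuries
# that are not divisible by 400.
is_leap_year() {
    year=$1
    [ $((year % 4)) -eq 0 ] && { [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; }
}

# Print every February 29th between two years (inclusive),
# one ISO formatted date per line.
leap_day_birthdays() {
    y=$1
    while [ "$y" -le "$2" ]; do
        if is_leap_year "$y"; then
            echo "$y-02-29"
        fi
        y=$((y + 1))
    done
}

# Example: candidate birth dates for customers born between 1980 and 2000.
leap_day_birthdays 1980 2000
```

Any of the printed dates would trigger the hypothetical birthday rule, whereas a randomly chosen production customer is unlikely to have one.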

There are a number of online tools available to generate mock-up data. My favourite is https://www.generatedata.com by Benjamin Keen because it is open source, free and can be self-hosted. The online version only generates 100 records at a time, a limit which can be raised after a donation. The self-hosted version does not have this limitation. Ben has done a fantastic job and I would urge you to donate on the author’s website.

Prerequisites

The data generator is a PHP/MySQL application and therefore requires MySQL and PHP to be installed on the machine, which can run either Windows or Linux. I will be using Ubuntu Linux 18.04 for this demonstration. You can learn how to install an Ubuntu virtual machine in Azure in my previous post.

There is no need for a separate machine. You can install AMP (Apache, MySQL, PHP) locally on a Windows laptop. See https://ampps.com for details. I have chosen a dedicated VM as it makes it easier for me. I would, however, love to see Data Generator as a Docker container.

First and foremost, if you have just installed Ubuntu, you need to refresh the package repositories. The commands in this post assume a root shell; prefix them with sudo otherwise:

apt-get update

Install Apache, PHP and MySQL

Install Apache

apt-get install apache2

Install MySQL

apt-get install mysql-server

Install PHP

apt-get install php php-mysql libapache2-mod-php

Restart Apache

systemctl restart apache2

Now we should have a working web server with PHP and MySQL support. Let’s test it:

wget localhost

This should return a 200 OK response and save the default page as index.html, which means Apache is responding to requests.

Configure MySQL

Connect to the MySQL server with the root user:

mysql -u root -p

Create a new database:

mysql> create database datagenerator;

Create a new user:

mysql> create user 'datagenerator'@'localhost' identified by 'SomeNewPassword';

Now, grant the new user privileges to the database:

mysql> grant all privileges on datagenerator.* to 'datagenerator'@'localhost';

Reload the privileges so that the changes take effect:

mysql> flush privileges;
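If you would rather script the MySQL steps above than type them at the prompt, a minimal sketch could look like this. The function name datagenerator_sql is hypothetical, and the sketch assumes that none of the values contain single quotes, because nothing is escaped.

```shell
#!/bin/sh
# Print the SQL for the steps above.
# Arguments: database name, user name, password.
# Assumption: the values contain no single quotes (no escaping is done).
datagenerator_sql() {
    db=$1
    user=$2
    password=$3
    cat <<EOF
CREATE DATABASE $db;
CREATE USER '$user'@'localhost' IDENTIFIED BY '$password';
GRANT ALL PRIVILEGES ON $db.* TO '$user'@'localhost';
FLUSH PRIVILEGES;
EOF
}

# Print the statements for review. To apply them, pipe the output
# into the client instead:
#   datagenerator_sql datagenerator datagenerator 'SomeNewPassword' | mysql -u root -p
datagenerator_sql datagenerator datagenerator 'SomeNewPassword'
```

Printing the statements first makes it easy to check them before anything is run against the server.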

Install data generator

The guide is available on the project’s GitHub page, but I will take you through it step by step.

Download the latest release (3.2.8 at the time of writing):

wget https://github.com/benkeen/generatedata/archive/3.2.8.zip

Unzip the downloaded package. First, we need to install unzip:

apt-get install unzip

Now unzip the package:

unzip 3.2.8.zip

By default, the Apache web server serves websites from /var/www/html. This is defined by the DocumentRoot directive in the Apache configuration. To see the configuration, you can open it in a text editor. I use Vim:

vi /etc/apache2/sites-enabled/000-default.conf

Look for the DocumentRoot directive.

To quit Vim, press Esc, then type :q and press Enter.
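As an alternative to opening the file in an editor, the directive can be read straight from the command line. This is a minimal sketch; the function name document_root is hypothetical, and it assumes the path is not quoted and contains no spaces.

```shell
#!/bin/sh
# Print the value of each DocumentRoot directive in an Apache
# configuration file. Commented-out lines are skipped because the
# first field on the line must be the directive name itself.
# Assumption: the path is unquoted and contains no spaces.
document_root() {
    awk 'tolower($1) == "documentroot" { print $2 }' "$1"
}

# The default site file on Ubuntu, as used in this post.
conf=/etc/apache2/sites-enabled/000-default.conf
if [ -r "$conf" ]; then
    document_root "$conf"
fi
```

On a default Ubuntu installation this should print the same folder that the next step copies the package into.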

Now, we have to copy the extracted package to the DocumentRoot folder:

cp -a generatedata-3.2.8/ /var/www/html/

Grant access to the cache folder as per documentation:

chmod 777 /var/www/html/generatedata-3.2.8/cache/
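The download, unzip, copy and permission steps above could also be collected into one small script, which is handy if you rebuild the VM often. This is only a sketch: the helper names archive_url and install_generatedata are hypothetical, while the version, URL and paths are the ones used in this post.

```shell
#!/bin/sh
# Version and DocumentRoot as used in this post.
VERSION=3.2.8
DOCROOT=/var/www/html

# Build the GitHub archive URL for a given release.
archive_url() {
    echo "https://github.com/benkeen/generatedata/archive/$1.zip"
}

# Download, unpack and copy a release into the DocumentRoot,
# then open up the cache folder as the documentation asks.
# Stops at the first step that fails.
install_generatedata() {
    version=$1
    docroot=$2
    wget -O "$version.zip" "$(archive_url "$version")" || return 1
    unzip -q "$version.zip" || return 1
    cp -a "generatedata-$version/" "$docroot/" || return 1
    chmod 777 "$docroot/generatedata-$version/cache/"
}

# Uncomment to run (root privileges are needed to write to the DocumentRoot):
# install_generatedata "$VERSION" "$DOCROOT"
echo "Would install from $(archive_url "$VERSION") into $DOCROOT/generatedata-$VERSION"
```

The actual install call is left commented out so that the script can be read and adjusted before it changes anything on the server.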

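Now, navigate to your server’s IP address or DNS name in a browser, under the /generatedata-3.2.8/ path we copied the files to, and follow the wizard:

Configure the MySQL connection using the database, user and password we created earlier in this post: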

On the next screens, you will configure the user account types and choose which plugins to install. After that, you will be able to start using your own data generator without any limitations. Well, the only limitation is the performance of your VM and how quickly it can generate data sets. In my example, on a VM with 2 CPUs (Standard D2s v3), generating an SQL INSERT statement for 10,000 records is instant. I have made mine accessible via the Internet for test purposes, but you can keep yours within the local network; there is no need to expose it.

Result

An example of test customer data:

We can also generate an INSERT statement:

And voila!

Thanks for reading!

This post was originally published on March 30, 2020.

Author

Posted by
Marcin Gminski
September 14, 2021
