Synthetic Graph Generation

This repository implements a tool for generating graphs with an arbitrary size, including node and edge tabular features.

Solution overview
Setup
- Requirements
Quick Start Guide
Advanced
Performance
- Results
Release notes
- Changelog
- Known issues
Reference
- Cite

Solution overview

Synthetic data generation has become pervasive with exploding amounts of data and growing demand to deploy machine learning models that leverage such data. There has been increasing interest in applying graph-based neural network models to graph datasets, though many public datasets are much smaller than those used in real-world applications. Synthetic graph generation is a common problem across multiple domains and applications, including generating large graphs with properties similar to an original graph and anonymizing data that cannot be shared. The Synthetic Graph Generation tool enables users to generate arbitrary graphs based on provided real data.

Synthetic Graph Generation architecture

The tool has the following architecture.

The module is composed of three parts: a structural generator, which fits the graph structure; a feature generator, which fits the feature distribution contained in the graph; and an aligner, which aligns the generated features with the generated graph structure.

Graph structural generator

The graph structural generator fits the graph structure and generates a corresponding graph containing the nodes and edges.

Feature generator

The feature generator fits the feature distribution contained in the graph and generates the corresponding features. Users can generate features associated with nodes, edges, or both.

Aligner

The aligner aligns the generated features taken from the feature generator with the graph structure generated by a graph structural generator.

Feature support matrix

This tool supports the following features:

Feature	Synthetic Graph Generation
Non-partite graph generation	Yes
N-partite graph generation	Yes
Undirected graph generation	Yes
Directed graph generation	Yes
Self-loops generation	Yes
Edge features generation	Yes
Node features generation	Yes

Features

Non-partite graph generation is a task to generate a graph that doesn't contain any explicit partites (disjoint and independent sets of nodes).
N-partite graph generation is a task to generate a graph that consists of an arbitrary number of partites.
Undirected graph generation is a task to generate a graph made up of a set of vertices connected by unordered edges.
Directed graph generation is a task to generate a graph made up of a set of vertices connected by directed edges.
Self-loops generation is a task to generate edges that connect a vertex to itself.
Edge features generation is a task to generate features associated with an edge.
Node features generation is a task to generate features associated with a node.
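
The distinctions above can be made concrete on a plain edge list. The following is a minimal sketch for illustration only; the helper names (`has_self_loops`, `canonical_undirected`, `respects_partites`) are hypothetical and are not part of SynGen.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[int, int]


def has_self_loops(edges: Iterable[Edge]) -> bool:
    """Return True if any edge connects a vertex to itself."""
    return any(src == dst for src, dst in edges)


def canonical_undirected(edges: Iterable[Edge]) -> Set[Edge]:
    """Treat edges as unordered pairs: (u, v) and (v, u) collapse into one."""
    return {(min(u, v), max(u, v)) for u, v in edges}


def respects_partites(edges: Iterable[Edge], partite_of: Dict[int, str]) -> bool:
    """Return True if every edge joins nodes from two different partites."""
    return all(partite_of[u] != partite_of[v] for u, v in edges)


# A small directed edge list with a reciprocal pair and one self-loop.
directed = [(0, 1), (1, 0), (2, 2), (0, 3)]
undirected = canonical_undirected(directed)

# A bipartite example: users {0, 1} connect only to products {2, 3}.
partites = {0: "user", 1: "user", 2: "product", 3: "product"}
bipartite_ok = respects_partites([(0, 2), (1, 3)], partites)
bipartite_broken = respects_partites([(0, 1)], partites)

print(has_self_loops(directed), len(directed), len(undirected))
print(bipartite_ok, bipartite_broken)
```

Read as directed, the list has four edges; read as undirected, the reciprocal pair collapses and three remain. An edge inside a single partite violates the partite constraint.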

Models

Structural graph generation

- RMAT
- Random (Erdos-Renyi)
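
To give an intuition for how the two structural models differ, here is a minimal NumPy sketch of the general R-MAT recursion and of uniform random edge sampling. It is not SynGen's implementation: the function names are hypothetical, the quadrant probabilities are illustrative values rather than values fitted to a dataset, and the random sampler is a simplification that allows duplicate edges and self-loops.

```python
import numpy as np


def rmat_edges(scale, n_edges, probs=(0.57, 0.19, 0.19, 0.05), seed=42):
    """Sample directed edges on 2**scale nodes with the R-MAT recursion.

    At each of `scale` levels, every edge picks one quadrant of the
    adjacency matrix with probabilities (a, b, c, d). The chosen quadrant
    contributes one bit to the source id and one bit to the destination id.
    """
    rng = np.random.default_rng(seed)
    src = np.zeros(n_edges, dtype=np.int64)
    dst = np.zeros(n_edges, dtype=np.int64)
    for _ in range(scale):
        quadrant = rng.choice(4, size=n_edges, p=probs)
        src = (src << 1) | (quadrant >> 1)  # quadrants c and d set the row bit
        dst = (dst << 1) | (quadrant & 1)   # quadrants b and d set the column bit
    return np.stack([src, dst], axis=1)


def random_edges(n_nodes, n_edges, seed=42):
    """Sample edges uniformly at random (duplicates and self-loops possible)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_nodes, size=(n_edges, 2))


n_nodes = 2 ** 10
rmat = rmat_edges(scale=10, n_edges=20_000)
uniform = random_edges(n_nodes=n_nodes, n_edges=20_000)

# Out-degree of the busiest node under each model.
rmat_max_degree = np.bincount(rmat[:, 0], minlength=n_nodes).max()
uniform_max_degree = np.bincount(uniform[:, 0], minlength=n_nodes).max()
print(rmat_max_degree, uniform_max_degree)
```

With skewed quadrant probabilities, R-MAT concentrates edges on a few hub nodes, so its maximum degree is far larger than under uniform sampling. This skew is what allows it to mimic the heavy-tailed degree distributions of many real graphs.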

Tabular features

- KDE
- Gaussian
- Uniform
- Random
- CTGAN (Conditional GAN)
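
As an illustration of what fitting and sampling a tabular generator involves, the sketch below implements a one-dimensional Gaussian KDE sampler in NumPy. It is a simplified stand-in, not SynGen's KDE generator: `fit_kde` and `sample_kde` are hypothetical names, and the rule-of-thumb bandwidth (`std * n ** (-1/5)`) is an assumption.

```python
import numpy as np


def fit_kde(column):
    """Store the data and a rule-of-thumb bandwidth for a 1-D Gaussian KDE."""
    column = np.asarray(column, dtype=float)
    bandwidth = column.std(ddof=1) * len(column) ** (-1 / 5)
    return column, bandwidth


def sample_kde(model, n_samples, seed=0):
    """Draw samples: pick a stored point at random, then add Gaussian noise."""
    column, bandwidth = model
    rng = np.random.default_rng(seed)
    centers = rng.choice(column, size=n_samples, replace=True)
    return centers + rng.normal(0.0, bandwidth, size=n_samples)


# A bimodal "real" column with modes near -3 and +3.
rng = np.random.default_rng(1)
real = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

synthetic = sample_kde(fit_kde(real), n_samples=1000)

# Fraction of synthetic values that fall in the gap between the two modes.
fraction_in_gap = float(np.mean(np.abs(synthetic) < 1))
print(fraction_in_gap)
```

Because every sample starts from an observed value, the KDE sampler keeps both modes and places few values in the gap between them. A single Gaussian fitted to the same column would put much of its mass there.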

Aligner

- XGBoost
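
The idea of alignment can be sketched without the actual model. The example below is an assumption-laden toy, not SynGen's aligner: it replaces XGBoost with a straight-line fit (`np.polyfit`), uses node degree as the only structural feature, and assigns generated feature values by matching ranks. The function name `align_by_rank` is hypothetical.

```python
import numpy as np


def align_by_rank(real_degree, real_feature, gen_degree, gen_feature):
    """Assign generated feature values to generated nodes.

    1. Fit a predictor feature ~ degree on the real graph
       (a straight line here, standing in for a gradient-boosted model).
    2. Predict a score for every generated node from its degree.
    3. Give the k-th smallest generated feature value to the node with
       the k-th smallest predicted score.
    """
    slope, intercept = np.polyfit(real_degree, real_feature, deg=1)
    predicted = slope * np.asarray(gen_degree, dtype=float) + intercept
    node_order = np.argsort(predicted, kind="stable")
    aligned = np.empty(len(gen_feature), dtype=float)
    aligned[node_order] = np.sort(gen_feature)
    return aligned


rng = np.random.default_rng(0)

# Real graph: the feature grows with node degree.
real_degree = rng.integers(1, 50, size=200)
real_feature = 2.0 * real_degree + rng.normal(0, 1, size=200)

# Generated structure and generated features come from independent generators.
gen_degree = rng.integers(1, 50, size=100)
gen_feature = rng.permutation(2.0 * rng.integers(1, 50, size=100))

aligned = align_by_rank(real_degree, real_feature, gen_degree, gen_feature)
correlation = float(np.corrcoef(gen_degree, aligned)[0, 1])
print(correlation)
```

The aligned features are the same values the feature generator produced. Only their assignment to nodes changes, so that the degree-to-feature relationship learned on the real graph reappears in the generated one.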

Setup

The following section lists the requirements you need to run the Synthetic Graph Generation tool.

Requirements

This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

NVIDIA Ampere Architecture, NVIDIA Volta or NVIDIA Turing based GPU
NVIDIA Docker
Custom Docker containers built for this tool. Refer to the steps in the Quick Start Guide.

For more information about how to get started with NGC containers, refer to the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:

If you are unable to set up the required environment or create your own container, refer to the versioned NVIDIA Container Support Matrix.

Quick Start Guide

Getting Started

To use the tool, perform the following steps. For the specifics concerning generation and training, refer to the Advanced section.

1. Clone the repository.

git clone https://github.com/NVIDIA/DeepLearningExamples

2. Go to the SyntheticGraphGeneration tool directory within the DeepLearningExamples repository:

cd DeepLearningExamples/Tools/DGLPyTorch/SyntheticGraphGeneration

3. Build the SyntheticGraphGeneration container.

bash docker_scripts/build_docker.sh

4. Download the datasets. (It is advisable to run this command inside the Docker interactive container to ensure the environment is set up; see step 6.1.)

bash scripts/get_datasets.sh

Note: This script requires manually downloading 4 datasets (tabformer, ieee, paysim, credit) and placing them in the ./data directory with the correct naming. The instructions for the manual download are printed during preprocessing. If the raw data is not present or the dataset is already preprocessed, the preprocessing is skipped.

5. Run the SyntheticGraphGeneration Jupyter notebook.

5.1. Run the Docker notebook container.

bash docker_scripts/run_docker_notebook.sh

5.2. Open Jupyter notebook.

http://localhost:9916/tree/demos

6. Run the SyntheticGraphGeneration CLI.

6.1. Run the Docker interactive container.

bash docker_scripts/run_docker_interactive.sh

6.2. Run Command Line Interface (CLI) command.

The tool contains 3 run commands: preprocess, synthesize, and pretrain.

For example, to synthesize a graph similar to the IEEE dataset, run the following commands:

Convert IEEE into the SynGen format:

syngen preprocess \
--dataset ieee \
--source-path /workspace/data/ieee-fraud/ \
--destination-path /workspace/data/ieee-preprocessed

Note: --source-path points to the location where the IEEE dataset is extracted, and --destination-path points to the location where the IEEE dataset in SynGen format is saved.

Prepare SynGen configuration manually or using:

syngen mimic-dataset \
--dataset-path /workspace/data/ieee-preprocessed \
--output-file /workspace/configurations/my_ieee_config.json \
--tab-gen kde \
--edge-scale 1 \
--node-scale 1

Note: In the above commands, the kde tabular generator will be used to generate all tabular features.
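
As a quick illustration of what the two scale flags imply, assuming they act as plain multipliers on the original node and edge counts (as the Parameters table describes), the sketch below computes the target sizes and the resulting edges-per-node ratio. The function name and the input counts are made up for the example.

```python
def scaled_sizes(n_nodes, n_edges, node_scale=1.0, edge_scale=1.0):
    """Return target (nodes, edges, edges per node) after scaling the original counts."""
    target_nodes = int(round(n_nodes * node_scale))
    target_edges = int(round(n_edges * edge_scale))
    return target_nodes, target_edges, target_edges / target_nodes


same_size = scaled_sizes(1_000, 5_000)
both_doubled = scaled_sizes(1_000, 5_000, node_scale=2, edge_scale=2)
edges_only = scaled_sizes(1_000, 5_000, node_scale=1, edge_scale=4)

print(same_size)     # (1000, 5000, 5.0)
print(both_doubled)  # (2000, 10000, 5.0)
print(edges_only)    # (1000, 20000, 20.0)
```

Scaling both flags by the same factor keeps the number of edges per node constant, while scaling only the edges makes the generated graph denser than the original.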

Generate the synthetic IEEE dataset:

syngen synthesize \
--config-path /workspace/configurations/my_ieee_config.json \
--save-path  /workspace/data/ieee-generated

Note: --save-path points to the location where the generated data in SynGen format is saved.

Following the above command, the pretrain command can be used to pre-train or fine-tune the given generated sample.

syngen pretrain \
--model gat_ec \
--hidden-dim 64 \
--out-dim 32 \
--n-layers 1 \
--n-heads 2 \
--weight-decay 0.0 \
--learning-rate 0.0005 \
--batch-size 256 \
--pretrain-epochs 5 \
--finetune-epochs 5 \
--data-path /workspace/data/ieee-preprocessed \
--edge-name user-product \
--pretraining-data-path /workspace/data/ieee-generated  \
--pretraining-edge-name user-product \
--task ec \
--target-col isFraud \
--num-classes 2 \
--log-interval 1

Note: The current set of tasks and models is provided solely as a use case example of how to use the generated synthetic data to pretrain/fine-tune on a downstream task, and would generally need extensions or modifications to accommodate very large graphs or arbitrary models.

For the complete CLI usage of the synthesize command run:

syngen synthesize --help

Similarly, for the pretrain, mimic-dataset, and preprocess commands, run:

syngen <COMMAND> --help
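
The preprocess, mimic-dataset, and synthesize steps above can also be chained from a script. The sketch below only reuses the commands and flags shown in this guide. The `run_pipeline` helper and its `dry_run` switch are hypothetical conveniences, and by default the commands are printed rather than executed.

```python
import shlex
import subprocess

DATA = "/workspace/data"
CONFIG = "/workspace/configurations/my_ieee_config.json"

# The three CLI steps from this guide, in order.
steps = [
    ["syngen", "preprocess",
     "--dataset", "ieee",
     "--source-path", f"{DATA}/ieee-fraud/",
     "--destination-path", f"{DATA}/ieee-preprocessed"],
    ["syngen", "mimic-dataset",
     "--dataset-path", f"{DATA}/ieee-preprocessed",
     "--output-file", CONFIG,
     "--tab-gen", "kde",
     "--edge-scale", "1",
     "--node-scale", "1"],
    ["syngen", "synthesize",
     "--config-path", CONFIG,
     "--save-path", f"{DATA}/ieee-generated"],
]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


run_pipeline(steps)
```

Inside the Docker interactive container, where the syngen CLI is available, call `run_pipeline(steps, dry_run=False)` to execute the steps.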

Advanced

Repository structure

.
├── demos            # Directory with all the Jupyter examples
├── docker_scripts   # Directory with Docker scripts
├── scripts          # Directory with datasets scripts
├── syngen              # Directory with Synthetic Graph Generation source code
│  ├── analyzer             # Directory with tools for getting graph visualisation and statistics
│  │   ├── graph                    # Directory with graph structure analyzer
│  │   └── tabular                  # Directory with tabular features analyzer
│  ├── benchmark            # Directory with pretraining tools
│  │   ├── data_loader              # Directory with pre-defined node and edge classification datasets
│  │   ├── models                   # Directory with GNN model definitions
│  │   └── tasks                    # Directory with set of tasks that are supported for training
│  ├── cli                  # Directory with all cli commands
│  ├── configuration        # Directory with SynGen formats
│  ├── generator            # Directory with all the generators
│  │   ├── graph                    # Directory with graph generators and graph
│  │   └── tabular                  # Directory with tabular generators
│  │       ├── data_transformer         # Directory with tabular data transformations used by generators
│  │       └── transforms               # Directory with tabular column transforms
│  ├── graph_aligner      # Directory with all the aligners
│  ├── preprocessing        # Directory with the preprocessings for the supported datasets
│  │   └── datasets                 # Directory with example dataset preprocessing scripts used to generate data
│  ├── synthesizer          # Directory with all the synthesizers
│  └── utils                # Directory with the utilities
│      └── types                    # Directory with common data types used in the tool

Important scripts and files

scripts/get_datasets.sh - Bash script that downloads and preprocesses the supported datasets
docker_scripts/build_docker.sh - Bash script that builds the Docker image
docker_scripts/run_docker_notebook.sh - Bash script that runs Jupyter notebook in the Docker container
docker_scripts/run_docker_interactive.sh - Bash script that runs the Docker container in interactive mode
syngen/synthesizer/configuration_graph_synthesizer.py - Python file with graph synthesizer

Parameters

For the synthesis process, refer to the parameters in the following table.

Scope	Parameter	Comment	Default value
preprocess	--dataset DATASET_NAME	Dataset to preprocess into SynGen format. Available datasets: [cora, epinions, ogbn_mag, ogbn_mag240m, ieee, tabformer]	Required
preprocess	-sp | --source-path SOURCE_PATH	Path to the downloaded raw dataset	Required
preprocess	-dp | --destination-path DESTINATION_PATH	Path to store the preprocessed dataset in SynGen format	SOURCE_PATH/syngen_preprocessed
preprocess	--cpu	Runs all operations on CPU
preprocess	--use-cache	Does nothing if the target preprocessed dataset exists
preprocess	--download	Downloads the dataset to the specified SOURCE_PATH
mimic-dataset	-dp | --dataset-path DATASET_PATH	Path to the dataset in SynGen format
mimic-dataset	-of | --output-file OUTPUT_FILE	Path to the generated SynGen Configuration
mimic-dataset	-tg | --tab-gen TABULAR_GENERATOR	Tabular generator used to generate all tabular features (you can always modify OUTPUT_FILE afterwards). Available options: [kde, random, gaussian, uniform, ctgan]	kde
mimic-dataset	-rsg | --random-struct-gen	Generates a random structure based on the Erdős–Rényi model instead of mimicking
mimic-dataset	-es | --edge-scale EDGE_SCALE	Multiplies the number of edges to generate by the provided number
mimic-dataset	-en | --node-scale NODE_SCALE	Multiplies the number of nodes to generate by the provided number
synthesize	-cp | --config-path CONFIG_PATH	Path to the SynGen Configuration file that describes how to generate a graph	Required
synthesize	-sp | --save-path SAVE_PATH	Save path to dump generated files	Current directory
synthesize	--verbose	Displays generation process progress
synthesize	--cpu	Runs all operations on CPU. [Attention] Alignment is not available on CPU
synthesize	--timer-path FILE_PATH	Saves generation process timings to the specified file	Required

For pretraining, refer to the command-line options (syngen pretrain --help), as the parameters depend on the model choice.

Define the synthesizer pipeline

In this example, we show how to define the synthesizer pipeline for the IEEE dataset. A full example can be found in ieee_notebook.

Prepare data

The preprocessing class is used to convert the IEEE dataset into SynGen format.

# Path of the preprocessed dataset, reused below as the generators' data source
preprocessed_path = '/workspace/data/ieee_preprocessed'
preprocessing = IEEEPreprocessing(source_path='/workspace/data/ieee-fraud', destination_path=preprocessed_path)
feature_spec = preprocessing.transform()

Prepare SynGen Configuration

SynGen Configuration is used to specify all generation details. We use the original dataset feature spec as a base for the configuration.

feature_spec_for_config = feature_spec.copy()

A tabular generator is used to generate tabular features.

feature_spec_for_config[MetaData.EDGES][0][MetaData.TABULAR_GENERATORS] = [
    {
        MetaData.TYPE: "kde",
        MetaData.FEATURES_LIST: -1, # copies all tabular features from the original dataset
        MetaData.DATA_SOURCE: {
            MetaData.TYPE: "configuration", 
            MetaData.PATH: preprocessed_path,
            MetaData.NAME: "user-product", 
        },
        MetaData.PARAMS: {}
    }
]

A structure generator is used to generate the graph structure.

feature_spec_for_config[MetaData.EDGES][0][MetaData.STRUCTURE_GENERATOR] = {
    MetaData.TYPE: "RMAT",
    MetaData.DATA_SOURCE: {
        MetaData.TYPE: "cfg",  # the equivalent of 'configuration'
        MetaData.PATH: preprocessed_path,
        MetaData.NAME: "user-product",
    },
    MetaData.PARAMS: {
        "seed": 42,
    }
}

After providing all related information, we create a SynGenConfiguration object. It fills out missing fields and validates provided data.

config = SynGenConfiguration(feature_spec_for_config)

Prepare synthesizer

The synthesizer is a class that combines all the generators and allows the user to run end-to-end fitting and generation.

synthesizer = ConfigurationGraphSynthesizer(configuration=config, save_path='/workspace/data/ieee_generated')

To start the fitting process, we use the fit method provided by the synthesizer. It automatically loads all required data from disk based on the information provided in config.

synthesizer.fit()

Generate graph

To run generation, we call the generate method provided by the synthesizer. We use return_data=False because we only want to store the generated data in the /workspace/data/ieee_generated folder. Otherwise, the tabular data is also loaded under the MetaData.FEATURES_DATA key for each node and edge type, and the structural data under the MetaData.STRUCTURE_DATA key for edges.

out_feature_spec = synthesizer.generate(return_data=False)

Getting the data

To download the datasets used as examples, use the get_datasets.sh script:

bash scripts/get_datasets.sh

Note: Certain datasets require a Kaggle API key and may therefore require a manual download. Refer to the links below.

Note: Each user is responsible for checking the content of datasets and the applicable licenses and determining if they are suitable for the intended use.

List of datasets

Supported datasets:

Performance

Our results were obtained by running the notebooks from the demos directory in the PyTorch NGC container on NVIDIA DGX1 V100 with 8x V100 32GB GPUs. All the notebooks are listed in the table below.

	scope	notebook	description
1.	basic_examples	e2e_cora_demo.ipynb	a complete process of generating a non-bipartite graph dataset with node features
2.	basic_examples	e2e_ieee_demo.ipynb	a complete process of generating a bipartite graph dataset with edge features
3.	basic_examples	e2e_epinions_demo.ipynb	a complete process of generating a heterogeneous bipartite graph dataset with edge features
4.	advanced_examples	big_graph_generation.ipynb	a complete process of mimicking and scaling the MAG240m dataset
5.	performance	struct_generator.ipynb	comparison of SynGen graph structure generators
6.	performance	tabular_generator.ipynb	comparison of SynGen tabular data generators

Scope refers to the directory in which a notebook is stored and the functionality it covers. The scopes are:

Basic - basic_examples - notebooks with examples of basic functionality
Advanced - advanced_examples - notebooks with examples of advanced functionality
Performance - performance - notebooks with the performance experiments

To achieve the same results, follow the steps in the Quick Start Guide.

Results

1. Quality of the content of generated dataset vs. original dataset:

The content quality comparison was conducted on the IEEE dataset (refer to List of datasets for more details) with the corresponding notebook e2e_ieee_demo.ipynb. We compared three modalities: the quality of the generated graph structure, the quality of the generated tabular data, and the quality of aligning the tabular data to the graph structure.

Graph structure quality
- Comparison of degree distribution for an original graph, properly generated and random (Erdős–Rényi)
- Comparison of basic graph statistics for an original graph, properly generated and random (Erdős–Rényi) (figure: img/graph_structure statistics.png)
Tabular data quality
- Comparison of two first components of a PCA of real and generated data
- Comparison of basic statistics between real and generated data

Generator	kl divergence	correlation correlation
GAN	0.912	0.018
Gaussian	0.065	-0.030
Random	0.617	0.026

Structure to tabular alignment quality
- Degree centrality for feature distribution
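
The exact metric definitions behind the table are not spelled out here. As a hedged illustration, the sketch below shows one common way to compute a histogram-based KL divergence for a single column, and one possible reading of "correlation correlation": the correlation between the pairwise column correlations of the real and generated tables. Both function names are hypothetical, and the notebook's own definitions may differ.

```python
import numpy as np


def kl_divergence(real, generated, bins=20, eps=1e-9):
    """KL(real || generated) between histograms built on shared bin edges."""
    edges = np.histogram_bin_edges(np.concatenate([real, generated]), bins=bins)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(generated, bins=edges)
    p = (p + eps) / (p + eps).sum()  # smooth empty bins, then normalize
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))


def correlation_of_correlations(real, generated):
    """Correlate the off-diagonal entries of the two column-correlation
    matrices (rows are samples, columns are features; needs 3+ columns)."""
    upper = np.triu_indices(real.shape[1], k=1)
    real_corr = np.corrcoef(real, rowvar=False)[upper]
    gen_corr = np.corrcoef(generated, rowvar=False)[upper]
    return float(np.corrcoef(real_corr, gen_corr)[0, 1])


rng = np.random.default_rng(0)
cov = [[1.0, 0.8, 0.2, 0.0],
       [0.8, 1.0, 0.3, 0.0],
       [0.2, 0.3, 1.0, 0.5],
       [0.0, 0.0, 0.5, 1.0]]
real = rng.multivariate_normal(np.zeros(4), cov, size=2000)
good = rng.multivariate_normal(np.zeros(4), cov, size=2000)  # same distribution
shifted = real[:, 0] + 3.0                                   # wrong marginal

kl_good = kl_divergence(real[:, 0], good[:, 0])
kl_shifted = kl_divergence(real[:, 0], shifted)
corr_score = correlation_of_correlations(real, good)
print(kl_good, kl_shifted, corr_score)
```

Under these definitions, a lower KL divergence means the generated marginal distribution is closer to the real one, and a correlation score closer to 1 means the dependencies between columns are better preserved.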

2. Performance (speed) of the synthetic dataset generation:

Performance of graph structure generation (edges/s)
Performance of categorical tabular data generation (samples/s)

Dataset (CPU/GPU)	KDE	Uniform	Gaussian	Random
ieee (CPU)	371296	897421	530683	440086
ieee (GPU)	592132	3621726	983408	6438646
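
To read the table as speedups, the GPU throughput can be divided by the CPU throughput for each generator. The snippet below only does arithmetic on the numbers reported above.

```python
# Throughput in samples/s, copied from the table above.
cpu = {"KDE": 371296, "Uniform": 897421, "Gaussian": 530683, "Random": 440086}
gpu = {"KDE": 592132, "Uniform": 3621726, "Gaussian": 983408, "Random": 6438646}

# GPU-over-CPU speedup per generator.
speedup = {name: gpu[name] / cpu[name] for name in cpu}

for name, factor in sorted(speedup.items(), key=lambda item: item[1]):
    print(f"{name:<9}{factor:5.1f}x")
```

On the IEEE dataset, this gives roughly 1.6x for KDE, 1.9x for Gaussian, 4.0x for Uniform, and 14.6x for Random.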

3. Synthetic dataset use-case specific quality factors:

Performance (batches/s) comparison between original vs. synthetic datasets

Dataset	Model	Synthetic	Original
ieee	gat	0.07173	0.07249

Release notes

Changelog

August 2023

Heterogeneous graph generation
Multi-GPU generation

January 2023

Initial release

Known issues

There are no known issues with this tool.

Reference

Cite

Cite the following paper if you find this code useful or use it in your own work:

@article{darabi2022framework,
  title={A Framework for Large Scale Synthetic Graph Dataset Generation},
  author={Darabi, Sajad and Bigaj, Piotr and Majchrowski, Dawid and Morkisz, Pawel and Fit-Florea, Alex},
  journal={arXiv preprint arXiv:2210.01944},
  year={2022}
}
