Synthetic Graph Generation

This repository implements a tool for generating graphs of arbitrary size, including node and edge tabular features.

Solution overview

Synthetic data generation has become pervasive with exploding amounts of data and growing demand to deploy machine learning models that leverage such data. There has been increasing interest in applying graph-based neural network models to graph datasets, though many public datasets are of a much smaller scale than those used in real-world applications. Synthetic graph generation is a common problem in multiple domains and serves various applications, including generating big graphs with properties similar to an original graph and anonymizing data that cannot be shared. The Synthetic Graph Generation tool enables users to generate arbitrary graphs based on provided real data.

Synthetic Graph Generation architecture

The tool has the following architecture.

Figure: Synthetic Graph Generation architecture

The module is composed of three parts: a structural generator, which fits the graph structure; a feature generator, which fits the feature distribution contained in the graph; and an aligner, which aligns the generated features with the generated graph structure.

Graph structural generator

The graph structural generator fits the graph structure and generates a corresponding graph containing the nodes and edges.

Feature generator

The feature generator fits the feature distribution contained in the graph and generates the corresponding features. Users can generate features associated with nodes, edges, or both.

Aligner

The aligner aligns the generated features taken from the feature generator with the graph structure generated by the graph structural generator.
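The three-part flow described above can be illustrated with a toy pipeline. This is only a sketch under stated assumptions: the function names are hypothetical, the structure is uniformly random, the single node feature is Gaussian, and the aligner is a simple degree-rank heuristic standing in for the tool's actual models.

```python
import random
from collections import Counter


def generate_structure(num_nodes, num_edges, rng):
    """Stand-in structural generator: uniformly random directed edges."""
    return [(rng.randrange(num_nodes), rng.randrange(num_nodes))
            for _ in range(num_edges)]


def generate_node_features(num_nodes, mean, std, rng):
    """Stand-in feature generator: one Gaussian feature value per node."""
    return [rng.gauss(mean, std) for _ in range(num_nodes)]


def align_by_degree(edges, features, num_nodes):
    """Stand-in aligner: assign larger feature values to higher-degree nodes."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    # Nodes in ascending degree order receive feature values in ascending order.
    nodes_by_degree = sorted(range(num_nodes), key=lambda n: degree[n])
    aligned = [0.0] * num_nodes
    for node, value in zip(nodes_by_degree, sorted(features)):
        aligned[node] = value
    return aligned


rng = random.Random(0)
edges = generate_structure(num_nodes=10, num_edges=30, rng=rng)
features = generate_node_features(num_nodes=10, mean=0.0, std=1.0, rng=rng)
node_features = align_by_degree(edges, features, num_nodes=10)
```

The point of the sketch is the separation of concerns: structure and features are generated independently, and only the last step ties feature values to specific nodes.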

Feature support matrix

This tool supports the following features:

| Feature | Synthetic Graph Generation |
|---|---|
| Non-partite graph generation | Yes |
| N-partite graph generation | Yes |
| Undirected graph generation | Yes |
| Directed graph generation | Yes |
| Self-loops generation | Yes |
| Edge features generation | Yes |
| Node features generation | Yes |

Features

  • Non-partite graph generation is a task to generate a graph that doesn't contain any explicit partites (disjoint and independent sets of nodes).

  • N-partite graph generation is a task to generate a graph that consists of an arbitrary number of partites.

  • Undirected graph generation is a task to generate a graph made up of a set of vertices connected by unordered edges.

  • Directed graph generation is a task to generate a graph made up of a set of vertices connected by directed edges.

  • Self-loops generation is a task to generate edges that connect a vertex to itself.

  • Edge features generation is a task to generate features associated with an edge.

  • Node features generation is a task to generate features associated with a node.
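The graph properties listed above can be made concrete on a small edge list. The helpers below are hypothetical illustrations, not part of the tool; the node types "user" and "product" are borrowed from the IEEE example used later in this document.

```python
def has_self_loops(edges):
    """True if any edge connects a vertex to itself."""
    return any(src == dst for src, dst in edges)


def to_undirected(edges):
    """Treat (u, v) and (v, u) as the same edge by storing each pair sorted."""
    return {tuple(sorted(edge)) for edge in edges}


def respects_partites(edges, partite_of):
    """True if no edge connects two nodes that belong to the same partite."""
    return all(partite_of[src] != partite_of[dst] for src, dst in edges)


# A directed edge list with one reciprocal pair and one self-loop.
directed_edges = [(0, 3), (3, 0), (1, 4), (2, 2)]
partite_of = {0: "user", 1: "user", 2: "user", 3: "product", 4: "product"}

print(has_self_loops(directed_edges))                 # True: (2, 2)
print(len(to_undirected(directed_edges)))             # 3: (0, 3) and (3, 0) merge
print(respects_partites(directed_edges, partite_of))  # False: the self-loop stays inside "user"
```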

Models

Structural graph generation

- RMAT
- Random (Erdos-Renyi)
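To give an intuition for the difference between the two structural models, here is a minimal pure-Python sketch. It is not the tool's implementation: the R-MAT quadrant probabilities are fixed by hand (0.57, 0.19, 0.19, 0.05 are commonly used Graph500 values) rather than fitted to an input graph, and the random generator simply draws endpoints uniformly, so duplicates and self-loops are possible.

```python
import random
from collections import Counter


def rmat_edge(scale, a, b, c, rng):
    """Sample one directed edge in a graph with 2**scale nodes.

    At each level one quadrant of the adjacency matrix is chosen with
    probability a (top-left), b (top-right), c (bottom-left) or
    d = 1 - a - b - c (bottom-right).
    """
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src_bit = 1 if r >= a + b else 0                          # bottom half: c or d
        dst_bit = 1 if (a <= r < a + b) or r >= a + b + c else 0  # right half: b or d
        src = (src << 1) | src_bit
        dst = (dst << 1) | dst_bit
    return src, dst


def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, seed=42):
    rng = random.Random(seed)
    return [rmat_edge(scale, a, b, c, rng) for _ in range(num_edges)]


def erdos_renyi_edges(num_nodes, num_edges, seed=42):
    """Simplified random model: both endpoints drawn uniformly at random."""
    rng = random.Random(seed)
    return [(rng.randrange(num_nodes), rng.randrange(num_nodes))
            for _ in range(num_edges)]


def max_out_degree(edges):
    return max(Counter(src for src, _ in edges).values())


skewed = rmat_edges(scale=6, num_edges=5000)
uniform = erdos_renyi_edges(num_nodes=64, num_edges=5000)
print(max_out_degree(skewed), max_out_degree(uniform))
```

With skewed quadrant probabilities, R-MAT concentrates edges on a few hub nodes, which is why it can mimic heavy-tailed degree distributions that a uniform random graph cannot.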

Tabular features

- KDE
- Gaussian
- Uniform
- Random
- CTGAN (Conditional GAN)
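As an illustration of how a KDE-style generator produces new values, the sketch below fits a one-dimensional Gaussian kernel density estimate and samples from it. This is a simplified stand-in, not the tool's KDE generator: it handles a single continuous column, and the bandwidth uses the normal-reference rule of thumb h = 1.06 σ n^(-1/5). The function names and the sample data are hypothetical.

```python
import random
import statistics


def fit_kde(values):
    """Return (data, bandwidth) for a Gaussian KDE over one continuous column."""
    n = len(values)
    std = statistics.stdev(values)
    bandwidth = 1.06 * std * n ** (-1 / 5)  # normal-reference rule of thumb
    return list(values), bandwidth


def sample_kde(model, num_samples, rng):
    """Draw samples: pick a random observed value, then add Gaussian kernel noise."""
    data, bandwidth = model
    return [rng.gauss(rng.choice(data), bandwidth) for _ in range(num_samples)]


rng = random.Random(0)
observed = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]  # a bimodal toy column
model = fit_kde(observed)
synthetic = sample_kde(model, 5000, rng)
print(statistics.mean(observed), statistics.mean(synthetic))
```

Unlike a single fitted Gaussian, the KDE keeps both modes of the toy column because every sample starts from an observed value.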

Aligner

- XGBoost

Setup

The following section lists the requirements you need to run the Synthetic Graph Generation tool.

Requirements

This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

For more information about how to get started with NGC containers, refer to the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:

If you are unable to set up the required environment or create your own container, refer to the versioned NVIDIA Container Support Matrix.

Quick Start Guide

Getting Started

To use the tool, perform the following steps. For the specifics concerning generation and training, refer to the Advanced section.

  1. Clone the repository.
git clone https://github.com/NVIDIA/DeepLearningExamples
  2. Go to the SyntheticGraphGeneration tool directory within the DeepLearningExamples repository:
cd DeepLearningExamples/Tools/DGLPyTorch/SyntheticGraphGeneration
  3. Build the SyntheticGraphGeneration container.
bash docker_scripts/build_docker.sh
  4. Download the datasets. (It is advisable to run this command inside the Docker interactive container to ensure the environment is set up; see step 6.1.)
bash scripts/get_datasets.sh

Note: This script requires manually downloading 4 datasets (tabformer, ieee, paysim, credit) and placing them in the ./data directory with the correct naming. Instructions for the manual download are printed during preprocessing. If the raw data is not present or the dataset is already preprocessed, preprocessing is skipped.

  5. Run the SyntheticGraphGeneration Jupyter notebook.

5.1. Run the Docker notebook container.

bash docker_scripts/run_docker_notebook.sh

5.2. Open the Jupyter notebook.

http://localhost:9916/tree/demos
  6. Run the SyntheticGraphGeneration CLI.

6.1. Run the Docker interactive container.

bash docker_scripts/run_docker_interactive.sh

6.2. Run Command Line Interface (CLI) command.

The tool contains 3 run commands: preprocess, synthesize, and pretrain.

For example, to synthesize a graph similar to the IEEE dataset, run the following commands:

  1. Convert IEEE into the SynGen format:
syngen preprocess \
--dataset ieee \
--source-path /workspace/data/ieee-fraud/ \
--destination-path /workspace/data/ieee-preprocessed

Note: --source-path points to the location where the IEEE dataset is extracted, and --destination-path points to the location where the IEEE dataset in SynGen format is saved.

  2. Prepare the SynGen configuration manually or by using:
syngen mimic-dataset \
--dataset-path /workspace/data/ieee-preprocessed \
--output-file /workspace/configurations/my_ieee_config.json \
--tab-gen kde \
--edge-scale 1 \
--node-scale 1

Note: In the above command, the kde tabular generator will be used to generate all tabular features.

  3. Generate synthetic IEEE:
syngen synthesize \
--config-path /workspace/configurations/my_ieee_config.json \
--save-path /workspace/data/ieee-generated

Note: --save-path points to the location where the generated data in SynGen format is saved.

After running the above commands, the pretrain command can be used to pretrain or fine-tune a model on the generated sample.

syngen pretrain \
--model gat_ec \
--hidden-dim 64 \
--out-dim 32 \
--n-layers 1 \
--n-heads 2 \
--weight-decay 0.0 \
--learning-rate 0.0005 \
--batch-size 256 \
--pretrain-epochs 5 \
--finetune-epochs 5 \
--data-path /workspace/data/ieee-preprocessed \
--edge-name user-product \
--pretraining-data-path /workspace/data/ieee-generated  \
--pretraining-edge-name user-product \
--task ec \
--target-col isFraud \
--num-classes 2 \
--log-interval 1

Note: The current set of tasks and models is provided solely as use-case examples of how to use the generated synthetic data to pretrain/fine-tune on a downstream task. They would generally need extensions or modifications to accommodate very large graphs or arbitrary models.

For the complete CLI usage of the synthesize command run:

syngen synthesize --help

Similarly, for the pretrain, mimic-dataset, and preprocess commands, run:

syngen <COMMAND> --help
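The preprocess, mimic-dataset, and synthesize steps above can also be chained from a script. The following is a minimal sketch: the helper syngen_command is hypothetical (not part of the tool) and only assembles the flags and paths already shown in this guide. It prints the commands instead of running them.

```python
import shlex


def syngen_command(command, **options):
    """Build an argument list such as ['syngen', 'synthesize', '--config-path', ...]."""
    args = ["syngen", command]
    for name, value in options.items():
        # Python keyword names use underscores; the CLI flags use dashes.
        args += ["--" + name.replace("_", "-"), str(value)]
    return args


data_dir = "/workspace/data"
config_file = "/workspace/configurations/my_ieee_config.json"

pipeline = [
    syngen_command("preprocess", dataset="ieee",
                   source_path=f"{data_dir}/ieee-fraud/",
                   destination_path=f"{data_dir}/ieee-preprocessed"),
    syngen_command("mimic-dataset",
                   dataset_path=f"{data_dir}/ieee-preprocessed",
                   output_file=config_file,
                   tab_gen="kde", edge_scale=1, node_scale=1),
    syngen_command("synthesize", config_path=config_file,
                   save_path=f"{data_dir}/ieee-generated"),
]

for args in pipeline:
    print(shlex.join(args))
    # To execute inside the container instead of printing, one could use
    # subprocess.run(args, check=True) after `import subprocess`.
```

Keeping each command as an argument list avoids shell-quoting problems if a path contains spaces.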

Advanced

Repository structure

.
├── demos            # Directory with all the Jupyter examples
├── docker_scripts   # Directory with Docker scripts
├── scripts          # Directory with datasets scripts
├── syngen              # Directory with Synthetic Graph Generation source code
│  ├── analyzer             # Directory with tools for getting graph visualisation and statistics
│  │   ├── graph                    # Directory with graph structure analyzer
│  │   └── tabular                  # Directory with tabular features analyzer
│  ├── benchmark            # Directory with pretraining tools
│  │   ├── data_loader              # Directory with pre-defined node and edge classification datasets
│  │   ├── models                   # Directory with GNN model definitions
│  │   └── tasks                    # Directory with set of tasks that are supported for training
│  ├── cli                  # Directory with all cli commands
│  ├── configuration        # Directory with SynGen formats
│  ├── generator            # Directory with all the generators
│  │   ├── graph                    # Directory with graph generators and graph
│  │   └── tabular                  # Directory with tabular generators
│  │       ├── data_transformer         # Directory with tabular data transformations used by generators
│  │       └── transforms               # Directory with tabular column transforms
│  ├── graph_aligner      # Directory with all the aligners
│  ├── preprocessing        # Directory with the preprocessings for the supported datasets
│  │   └── datasets                 # Directory with example dataset preprocessing scripts used to generate data
│  ├── synthesizer          # Directory with all the synthesizers
│  └── utils                # Directory with the utilities
│      └── types                    # Directory with common data types used in the tool

Important scripts and files

  • scripts/get_datasets.sh - Bash script that downloads and preprocesses the supported datasets
  • docker_scripts/build_docker.sh - Bash script that builds the Docker image
  • docker_scripts/run_docker_notebook.sh - Bash script that runs Jupyter notebook in the Docker container
  • docker_scripts/run_docker_interactive.sh - Bash script that runs the Docker container in interactive mode
  • syngen/synthesizer/configuration_graph_synthesizer.py - Python file with graph synthesizer

Parameters

For the synthesis process, refer to the parameters in the following table.

| Scope | Parameter | Comment | Default value |
|---|---|---|---|
| preprocess | --dataset DATASET_NAME | Dataset to preprocess into SynGen format. Available datasets: [cora, epinions, ogbn_mag, ogbn_mag240m, ieee, tabformer] | Required |
| preprocess | -sp, --source-path SOURCE_PATH | Path to the downloaded raw dataset | Required |
| preprocess | -dp, --destination-path DESTINATION_PATH | Path to store the preprocessed dataset in SynGen format | SOURCE_PATH/syngen_preprocessed |
| preprocess | --cpu | Runs all operations on CPU | |
| preprocess | --use-cache | Does nothing if the target preprocessed dataset exists | |
| preprocess | --download | Downloads the dataset to the specified SOURCE_PATH | |
| mimic-dataset | -dp, --dataset-path DATASET_PATH | Path to the dataset in SynGen format | |
| mimic-dataset | -of, --output-file OUTPUT_FILE | Path to the generated SynGen configuration | |
| mimic-dataset | -tg, --tab-gen TABULAR_GENERATOR | Tabular generator used to generate all tabular features (you can always modify OUTPUT_FILE afterwards). Available options: [kde, random, gaussian, uniform, ctgan] | kde |
| mimic-dataset | -rsg, --random-struct-gen | Generates a random structure based on the Erdos-Renyi model instead of mimicking | |
| mimic-dataset | -es, --edge-scale EDGE_SCALE | Multiplies the number of edges to generate by the provided number | |
| mimic-dataset | -en, --node-scale NODE_SCALE | Multiplies the number of nodes to generate by the provided number | |
| synthesize | -cp, --config-path CONFIG_PATH | Path to the SynGen configuration file that describes how to generate a graph | Required |
| synthesize | -sp, --save-path SAVE_PATH | Save path to dump generated files | Current directory |
| synthesize | --verbose | Displays generation process progress | |
| synthesize | --cpu | Runs all operations on CPU. [Attention] Alignment is not available on CPU | |
| synthesize | --timer-path FILE_PATH | Saves generation process timings to the specified file | Required |
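The effect of --edge-scale and --node-scale on graph density can be worked out with simple arithmetic. The sketch below uses hypothetical graph sizes and assumes plain multiplication with rounding; how the tool itself rounds or distributes the scaled counts is not specified here.

```python
def scaled_sizes(num_nodes, num_edges, node_scale, edge_scale):
    """Target node and edge counts after applying the two scale factors."""
    return round(num_nodes * node_scale), round(num_edges * edge_scale)


def average_degree(num_nodes, num_edges):
    """Average number of edge endpoints per node."""
    return 2 * num_edges / num_nodes


# Hypothetical original graph: 1,000 nodes and 5,000 edges.
nodes, edges = 1000, 5000
new_nodes, new_edges = scaled_sizes(nodes, edges, node_scale=2, edge_scale=4)

print(new_nodes, new_edges)                  # 2000 20000
print(average_degree(nodes, edges))          # 10.0
print(average_degree(new_nodes, new_edges))  # 20.0
```

The average degree changes by the ratio edge_scale / node_scale, so equal scale factors keep the density of the original graph while unequal ones make the generated graph denser or sparser.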

For pretraining, refer to the command-line options, as the parameters depend on the model choice.

Define the synthesizer pipeline

In this example, we show how to define the synthesizer pipeline for the IEEE dataset. A full example can be found in ieee_notebook.

Prepare data

  • The preprocessing class is used to convert the IEEE dataset into SynGen format.
preprocessed_path = '/workspace/data/ieee_preprocessed'  # reused below as the data source path
preprocessing = IEEEPreprocessing(source_path='/workspace/data/ieee-fraud', destination_path=preprocessed_path)
feature_spec = preprocessing.transform()

Prepare SynGen Configuration

  • The SynGen Configuration is used to specify all generation details. We use the original dataset feature spec as a base for the configuration.
feature_spec_for_config = feature_spec.copy()
  • Tabular generator is used to generate tabular features.
feature_spec_for_config[MetaData.EDGES][0][MetaData.TABULAR_GENERATORS] = [
    {
        MetaData.TYPE: "kde",
        MetaData.FEATURES_LIST: -1, # copies all tabular features from the original dataset
        MetaData.DATA_SOURCE: {
            MetaData.TYPE: "configuration", 
            MetaData.PATH: preprocessed_path,
            MetaData.NAME: "user-product", 
        },
        MetaData.PARAMS: {}
    }
]
  • Structure generator is used to generate graph structure.
feature_spec_for_config[MetaData.EDGES][0][MetaData.STRUCTURE_GENERATOR] = {
    MetaData.TYPE: "RMAT",
    MetaData.DATA_SOURCE: {
        MetaData.TYPE: "cfg",  # the equivalent of 'configuration'
        MetaData.PATH: preprocessed_path,
        MetaData.NAME: "user-product",
    },
    MetaData.PARAMS: {
        "seed": 42,
    }
}
  • After providing all related information, we create a SynGenConfiguration object. It fills out missing fields and validates provided data.
config = SynGenConfiguration(feature_spec_for_config)

Prepare synthesizer

  • Synthesizer is a class that combines all the generators and allows the user to run end-to-end fitting and generation.
synthesizer = ConfigurationGraphSynthesizer(configuration=config, save_path='/workspace/data/ieee_generated')

  • To start the fitting process, we use the fit method provided by the synthesizer. It automatically loads all required data from disk based on the information provided in config.
synthesizer.fit()

Generate graph

  • To run generation, we call the generate method provided by the synthesizer. We use return_data=False because we only want to store the generated data in the /workspace/data/ieee_generated folder. Otherwise, it also loads the tabular data under the MetaData.FEATURES_DATA key for each node and edge type and the structural data under the MetaData.STRUCTURE_DATA key for edges.
out_feature_spec = synthesizer.generate(return_data=False)

Getting the data

To download the datasets used as examples, use the get_datasets.sh script:

bash scripts/get_datasets.sh

Note: Certain datasets require a Kaggle API key and hence may require manual download. Refer to the links below.

Note: Each user is responsible for checking the content of datasets and the applicable licenses and determining if they are suitable for the intended use.

List of datasets

Supported datasets:

Performance

Our results were obtained by running the notebooks from the demos directory in the PyTorch NGC container on NVIDIA DGX-1 V100 with 8x V100 32GB GPUs. All the notebooks are presented in the table below.

| # | scope | notebook | description |
|---|---|---|---|
| 1. | basic_examples | e2e_cora_demo.ipynb | a complete process of generating a non-bipartite graph dataset with node features |
| 2. | basic_examples | e2e_ieee_demo.ipynb | a complete process of generating a bipartite graph dataset with edge features |
| 3. | basic_examples | e2e_epinions_demo.ipynb | a complete process of generating a heterogeneous bipartite graph dataset with edge features |
| 4. | advanced_examples | big_graph_generation.ipynb | a complete process of mimicking and scaling the MAG240m dataset |
| 5. | performance | struct_generator.ipynb | comparison of SynGen graph structure generators |
| 6. | performance | tabular_generator.ipynb | comparison of SynGen tabular data generators |

Scope refers to the directories in which the notebooks are stored and the functionality that particular notebooks cover. There are three scopes:

  • Basic - basic_examples - notebooks with examples of basic functionality
  • Advanced - advanced_examples - notebooks with examples of advanced functionality
  • Performance - performance - notebooks with the performance experiments

To achieve the same results, follow the steps in the Quick Start Guide.

Results

1. Quality of the content of generated dataset vs. original dataset:

The content quality comparison was conducted on the IEEE dataset (refer to List of datasets for more details) with the corresponding notebook e2e_ieee_demo.ipynb. We compared three modalities: the quality of the generated graph structure, the quality of the generated tabular data, and the quality of aligning the tabular data to the graph structure.

  • Graph structure quality

    • Comparison of degree distribution for the original graph, a properly generated graph, and a random (Erdős–Rényi) graph (figure: degree_distribution_quality)
    • Comparison of basic graph statistics for the original graph, a properly generated graph, and a random (Erdős–Rényi) graph (figure: graph_structure statistics, img/graph_structure statistics.png)
  • Tabular data quality

    • Comparison of the first two components of a PCA of real and generated data (figure: pca_components)

    • Comparison of basic statistics between real and generated data

      | Generator | kl divergence | correlation correlation |
      |---|---|---|
      | GAN | 0.912 | 0.018 |
      | Gaussian | 0.065 | -0.030 |
      | Random | 0.617 | 0.026 |
  • Structure to tabular alignment quality

    • Degree centrality for feature distribution (figure: degree_centrality_feature_distribution)
2. Performance (speed) of the synthetic dataset generation:
  • Performance of graph structure generation (edges/s) (figure: edge_perf)

  • Performance of categorical tabular data generation (samples/s)

    | Dataset (CPU/GPU) | KDE | Uniform | Gaussian | Random |
    |---|---|---|---|---|
    | ieee (CPU) | 371296 | 897421 | 530683 | 440086 |
    | ieee (GPU) | 592132 | 3621726 | 983408 | 6438646 |
3. Synthetic dataset use-case specific quality factors:
  • Performance (batches/s) comparison between original vs. synthetic datasets

    | Dataset | Model | Synthetic | Original |
    |---|---|---|---|
    | ieee | gat | 0.07173 | 0.07249 |
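The quality comparisons above rely on measurements such as degree distributions and KL divergence. The following is a minimal sketch of how two such quantities can be computed for small edge lists. It is an illustration under assumptions, not the tool's analyzer code: the function names are hypothetical, and the exact metric definitions used for the reported numbers may differ.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map degree -> fraction of nodes with that degree (nodes with at least one edge)."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    counts = Counter(degree.values())
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for discrete distributions given as dicts; eps smooths empty bins."""
    return sum(
        prob * math.log((prob + eps) / (q.get(key, 0.0) + eps))
        for key, prob in p.items()
        if prob > 0
    )


star = [(0, 1), (0, 2), (0, 3), (0, 4)]  # one hub connected to four leaves
path = [(0, 1), (1, 2), (2, 3), (3, 4)]  # a simple chain

p = degree_distribution(star)
q = degree_distribution(path)
print(p, q)
print(kl_divergence(p, p), kl_divergence(p, q))
```

A divergence of zero means the two degree distributions are identical; larger values indicate that the generated structure deviates more from the original.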

Release notes

Changelog

August 2023

  • Heterogeneous graph generation
  • Multi-GPU generation

January 2023

  • Initial release

Known issues

There are no known issues with this tool.

Reference

Cite

Cite the following paper if you find this code useful or use it in your own work:

@article{darabi2022framework,
  title={A Framework for Large Scale Synthetic Graph Dataset Generation},
  author={Darabi, Sajad and Bigaj, Piotr and Majchrowski, Dawid and Morkisz, Pawel and Fit-Florea, Alex},
  journal={arXiv preprint arXiv:2210.01944},
  year={2022}
}