This repository implements a tool for generating graphs of arbitrary size, including tabular node and edge features.
Synthetic data generation has become pervasive with exploding amounts of data and the demand to deploy machine learning models leveraging such data. There has been increasing interest in applying graph-based neural network models to graph datasets, though many public datasets are of a much smaller scale than those used in real-world applications. Synthetic graph generation is a common problem in multiple domains for various applications, including the generation of big graphs with properties similar to the original, or anonymizing data that cannot be shared. The Synthetic Graph Generation tool enables users to generate arbitrary graphs based on provided real data.
The tool has the following architecture.
The module is composed of three parts: a structural generator, which fits the graph structure; a feature generator, which fits the feature distribution contained in the graph; and an aligner, which aligns the generated features with the generated graph structure.
The graph structural generator fits the graph structure and generates a corresponding graph containing the nodes and edges.
The feature generator fits the feature distribution contained in the graph and generates the corresponding features. There is the option to allow users to generate features associated with nodes, edges, or both.
The aligner aligns the generated features taken from the feature generator with the graph structure generated by the graph structural generator.
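The three-part architecture described above can be sketched in a few lines of plain Python. This is only an illustration of how the parts fit together: the class names (`RandomStructureGenerator`, `GaussianFeatureGenerator`, `OrderAligner`) and the `synthesize` helper are hypothetical and are not part of the SynGen API, and each component is deliberately simplistic.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


class RandomStructureGenerator:
    """Toy structural generator: learns node and edge counts, then samples random edges."""

    def fit(self, edges: List[Edge]) -> None:
        self.num_nodes = 1 + max(max(u, v) for u, v in edges)
        self.num_edges = len(edges)

    def generate(self, rng: random.Random) -> List[Edge]:
        return [
            (rng.randrange(self.num_nodes), rng.randrange(self.num_nodes))
            for _ in range(self.num_edges)
        ]


class GaussianFeatureGenerator:
    """Toy feature generator: fits the mean and std of one numeric edge feature."""

    def fit(self, values: List[float]) -> None:
        n = len(values)
        self.mean = sum(values) / n
        self.std = (sum((v - self.mean) ** 2 for v in values) / n) ** 0.5

    def generate(self, count: int, rng: random.Random) -> List[float]:
        return [rng.gauss(self.mean, self.std) for _ in range(count)]


class OrderAligner:
    """Toy aligner: pairs the i-th generated feature with the i-th generated edge."""

    def align(self, edges: List[Edge], features: List[float]) -> List[Tuple[Edge, float]]:
        return list(zip(edges, features))


def synthesize(edges: List[Edge], features: List[float], seed: int = 0) -> List[Tuple[Edge, float]]:
    """Fit both generators on the real data, generate, then align."""
    rng = random.Random(seed)
    structure_gen = RandomStructureGenerator()
    structure_gen.fit(edges)
    feature_gen = GaussianFeatureGenerator()
    feature_gen.fit(features)
    new_edges = structure_gen.generate(rng)
    new_features = feature_gen.generate(len(new_edges), rng)
    return OrderAligner().align(new_edges, new_features)


if __name__ == "__main__":
    real_edges = [(0, 1), (1, 2), (2, 0)]
    real_features = [1.0, 2.0, 3.0]
    for edge, feature in synthesize(real_edges, real_features):
        print(edge, round(feature, 3))
```

The point of the sketch is the separation of concerns: structure and features are fitted and generated independently, and only the aligner decides which feature row belongs to which edge.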
This tool supports the following features:
| Feature | Synthetic Graph Generation |
|---|---|
| Non-partite graph generation | Yes |
| N-partite graph generation | Yes |
| Undirected graph generation | Yes |
| Directed graph generation | Yes |
| Self-loops generation | Yes |
| Edge features generation | Yes |
| Node features generation | Yes |
- Non-partite graph generation is the task of generating a graph that doesn't contain any explicit partites (disjoint and independent sets of nodes).
- N-partite graph generation is the task of generating a graph that consists of an arbitrary number of partites.
- Undirected graph generation is the task of generating a graph made up of a set of vertices connected by unordered edges.
- Directed graph generation is the task of generating a graph made up of a set of vertices connected by directed edges.
- Self-loops generation is the task of generating edges that connect a vertex to itself.
- Edge features generation is the task of generating features associated with an edge.
- Node features generation is the task of generating features associated with a node.
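The graph properties defined above can be made concrete with a few small checks on an edge list. The helper names below (`has_self_loops`, `as_undirected`, `respects_partites`) are hypothetical and only illustrate the definitions; they are not functions provided by the tool.

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[int, int]


def has_self_loops(edges: List[Edge]) -> bool:
    """A self-loop is an edge that connects a vertex to itself."""
    return any(u == v for u, v in edges)


def as_undirected(edges: List[Edge]) -> Set[Edge]:
    """Drop edge direction: (u, v) and (v, u) collapse into one unordered edge."""
    return {(min(u, v), max(u, v)) for u, v in edges}


def respects_partites(edges: List[Edge], partite_of: Dict[int, Hashable]) -> bool:
    """Partites are independent sets: no edge may connect two nodes of the same partite."""
    return all(partite_of[u] != partite_of[v] for u, v in edges)


if __name__ == "__main__":
    # A small bipartite "user-product" style graph: users 0-1, products 10-11.
    edges = [(0, 10), (1, 10), (1, 11)]
    partite_of = {0: "user", 1: "user", 10: "product", 11: "product"}
    print(has_self_loops(edges))
    print(respects_partites(edges, partite_of))
    print(as_undirected([(0, 1), (1, 0)]))
```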
Structural graph generation
- RMAT
- Random (Erdos-Renyi)
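As a rough illustration of the two structural models listed above, the sketch below samples edges with the R-MAT recursion and with the Erdos-Renyi G(n, p) model. It is a plain-Python sketch, not the tool's implementation: the function names are hypothetical, and the default quadrant probabilities (0.57, 0.19, 0.19, 0.05) are commonly used illustrative values rather than parameters fitted by SynGen.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


def rmat_edges(scale: int, num_edges: int, a: float = 0.57, b: float = 0.19,
               c: float = 0.19, seed: int = 0) -> List[Edge]:
    """Sample directed edges on 2**scale nodes using the R-MAT recursion.

    At each of the `scale` levels, one quadrant of the adjacency matrix is
    chosen with probabilities a, b, c, d (d = 1 - a - b - c), which fixes one
    more bit of the source and destination node ids.
    """
    d = 1.0 - a - b - c
    if min(a, b, c, d) < 0:
        raise ValueError("quadrant probabilities must be non-negative and sum to 1")
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _ in range(scale):
            src <<= 1
            dst <<= 1
            r = rng.random()
            if r < a:
                pass                # top-left quadrant: both bits stay 0
            elif r < a + b:
                dst |= 1            # top-right quadrant
            elif r < a + b + c:
                src |= 1            # bottom-left quadrant
            else:
                src |= 1            # bottom-right quadrant
                dst |= 1
        edges.append((src, dst))
    return edges


def erdos_renyi_edges(num_nodes: int, p: float, seed: int = 0) -> List[Edge]:
    """Sample an undirected G(n, p) graph: each node pair is kept with probability p."""
    rng = random.Random(seed)
    return [
        (u, v)
        for u in range(num_nodes)
        for v in range(u + 1, num_nodes)
        if rng.random() < p
    ]


if __name__ == "__main__":
    print(rmat_edges(scale=3, num_edges=5))
    print(erdos_renyi_edges(num_nodes=5, p=0.5))
```

With skewed quadrant probabilities, R-MAT concentrates edges on a few low-id nodes, giving a heavy-tailed degree distribution, whereas G(n, p) treats all node pairs identically.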
Tabular features
- KDE
- Gaussian
- Uniform
- Random
- CTGAN (Conditional GAN)
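To give an intuition for how the simpler tabular generators in this list differ, here is a minimal one-dimensional sketch of KDE and uniform sampling using only the standard library. The function names are hypothetical, and the bandwidth uses the normal reference rule of thumb, which is an assumption for illustration and not necessarily what the tool's KDE generator does.

```python
import random
import statistics
from typing import List


def fit_bandwidth(values: List[float]) -> float:
    """Normal reference rule of thumb for a 1-D Gaussian kernel."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def sample_kde(values: List[float], count: int, seed: int = 0) -> List[float]:
    """Sample from a Gaussian KDE: pick a data point, then add kernel noise."""
    rng = random.Random(seed)
    bandwidth = fit_bandwidth(values)
    return [rng.choice(values) + rng.gauss(0.0, bandwidth) for _ in range(count)]


def sample_uniform(values: List[float], count: int, seed: int = 0) -> List[float]:
    """Sample uniformly between the observed minimum and maximum."""
    rng = random.Random(seed)
    low, high = min(values), max(values)
    return [rng.uniform(low, high) for _ in range(count)]


if __name__ == "__main__":
    observed = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]  # two clusters
    print([round(x, 2) for x in sample_kde(observed, 5)])
    print([round(x, 2) for x in sample_uniform(observed, 5)])
```

On the two-cluster example, KDE samples stay close to the observed clusters, while uniform samples also fall in the empty region between them.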
Aligner
- XGBoost
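This document only states that the aligner is based on XGBoost. The sketch below does not use XGBoost; it swaps in plain rank matching against node degree to illustrate the general idea of alignment, that is, deciding which generated feature value is attached to which generated node. Both function names and the choice of degree as the structural signal are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def node_degrees(edges: List[Edge], num_nodes: int) -> List[int]:
    """Count how many edge endpoints touch each node."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [degree[n] for n in range(num_nodes)]


def align_by_rank(edges: List[Edge], num_nodes: int, features: List[float]) -> Dict[int, float]:
    """Give the k-th smallest feature value to the node with the k-th smallest degree."""
    if len(features) != num_nodes:
        raise ValueError("expected exactly one feature value per node")
    degrees = node_degrees(edges, num_nodes)
    nodes_by_degree = sorted(range(num_nodes), key=lambda n: degrees[n])
    return dict(zip(nodes_by_degree, sorted(features)))


if __name__ == "__main__":
    generated_edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
    generated_features = [10.0, 40.0, 20.0, 30.0]
    print(align_by_rank(generated_edges, 4, generated_features))
```

A learned aligner replaces the fixed "higher degree, higher value" rule with a model fitted on the original data, so that the relationship between structure and features is estimated rather than assumed.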
The following section lists the requirements you need to run the Synthetic Graph Generation tool.
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- NVIDIA Ampere Architecture, NVIDIA Volta or NVIDIA Turing based GPU
- NVIDIA Docker
- Custom Docker containers built for this tool. Refer to the steps in the Quick Start Guide.
For more information about how to get started with NGC containers, refer to the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
If you are unable to set up the required environment or create your own container, refer to the versioned NVIDIA Container Support Matrix.
To use the tool, perform the following steps. For the specifics concerning generation and training, refer to the Advanced section.
- Clone the repository.
git clone https://github.com/NVIDIA/DeepLearningExamples
- Go to the `SyntheticGraphGeneration` tool directory within the `DeepLearningExamples` repository:
cd DeepLearningExamples/Tools/DGLPyTorch/SyntheticGraphGeneration
- Build the SyntheticGraphGeneration container.
bash docker_scripts/build_docker.sh
- Download the datasets. (It is advisable to run this command inside the interactive Docker container to ensure the environment is set up; see step 6.1.)
bash scripts/get_datasets.sh
Note: This script requires you to manually download 4 datasets (tabformer, ieee, paysim, credit) and place them in the ./data directory with the correct naming. The instructions for the manual download are printed during preprocessing. If the raw data is not present or the dataset is already preprocessed, the preprocessing is skipped.
- Run the SyntheticGraphGeneration Jupyter notebook.
5.1. Run the Docker notebook container.
bash docker_scripts/run_docker_notebook.sh
5.2. Open the Jupyter notebook.
http://localhost:9916/tree/demos
- Run the SyntheticGraphGeneration CLI.
6.1. Run the Docker interactive container.
bash docker_scripts/run_docker_interactive.sh
6.2. Run Command Line Interface (CLI) command.
The tool provides the following run commands: `preprocess`, `mimic-dataset`, `synthesize`, and `pretrain`.
For example, to synthesize a graph similar to the IEEE dataset, run the following commands:
- Convert IEEE into the SynGen format:
syngen preprocess \
--dataset ieee \
--source-path /workspace/data/ieee-fraud/ \
--destination-path /workspace/data/ieee-preprocessed
Note: --source-path points to the location where the IEEE dataset is extracted, and --destination-path points to the location where the IEEE dataset in SynGen format is saved.
- Prepare SynGen configuration manually or using:
syngen mimic-dataset \
--dataset-path /workspace/data/ieee-preprocessed \
--output-file /workspace/configurations/my_ieee_config.json \
--tab-gen kde \
--edge-scale 1 \
--node-scale 1
Note: In the above command, the kde tabular generator is used to generate all tabular features.
- Generate the synthetic IEEE dataset:
syngen synthesize \
--config-path /workspace/configurations/my_ieee_config.json \
--save-path /workspace/data/ieee-generated
Note: --save-path points to the location where the generated data in SynGen format is saved.
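When the three steps above need to be scripted, one option is to assemble the commands in Python and pass them to `subprocess`. The sketch below only builds and prints the command lines, using the flags shown above; the helper name `build_commands` and its parameters are hypothetical, and it assumes the `syngen` executable is available on the `PATH` inside the container.

```python
import shlex
from typing import List


def build_commands(raw_dir: str, preprocessed_dir: str, config_file: str,
                   generated_dir: str, tab_gen: str = "kde",
                   edge_scale: int = 1, node_scale: int = 1) -> List[List[str]]:
    """Return the preprocess, mimic-dataset, and synthesize commands as argument lists."""
    return [
        ["syngen", "preprocess",
         "--dataset", "ieee",
         "--source-path", raw_dir,
         "--destination-path", preprocessed_dir],
        ["syngen", "mimic-dataset",
         "--dataset-path", preprocessed_dir,
         "--output-file", config_file,
         "--tab-gen", tab_gen,
         "--edge-scale", str(edge_scale),
         "--node-scale", str(node_scale)],
        ["syngen", "synthesize",
         "--config-path", config_file,
         "--save-path", generated_dir],
    ]


if __name__ == "__main__":
    commands = build_commands(
        raw_dir="/workspace/data/ieee-fraud/",
        preprocessed_dir="/workspace/data/ieee-preprocessed",
        config_file="/workspace/configurations/my_ieee_config.json",
        generated_dir="/workspace/data/ieee-generated",
    )
    for command in commands:
        # Dry run: print the command. To execute it instead, use
        # subprocess.run(command, check=True) after importing subprocess.
        print(shlex.join(command))
```

Passing argument lists rather than a single shell string avoids quoting problems when paths contain spaces.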
Following the above command, the pretrain command can be used to pre-train or fine-tune the given generated sample.
syngen pretrain \
--model gat_ec \
--hidden-dim 64 \
--out-dim 32 \
--n-layers 1 \
--n-heads 2 \
--weight-decay 0.0 \
--learning-rate 0.0005 \
--batch-size 256 \
--pretrain-epochs 5 \
--finetune-epochs 5 \
--data-path /workspace/data/ieee-preprocessed \
--edge-name user-product \
--pretraining-data-path /workspace/data/ieee-generated \
--pretraining-edge-name user-product \
--task ec \
--target-col isFraud \
--num-classes 2 \
--log-interval 1
Note: The current set of tasks and models is provided solely as an example of how to use the generated synthetic data to pretrain or fine-tune on a downstream task. It would generally need extensions or modifications to accommodate very large graphs or arbitrary models.
For the complete CLI usage of the synthesize command run:
syngen synthesize --help
Similarly, for the pretrain, mimic-dataset, and preprocess commands, run:
syngen <COMMAND> --help
.
├── demos # Directory with all the Jupyter examples
├── docker_scripts # Directory with Docker scripts
├── scripts # Directory with datasets scripts
├── syngen # Directory with Synthetic Graph Generation source code
│ ├── analyzer # Directory with tools for getting graph visualisation and statistics
│ │ ├── graph # Directory with graph structure analyzer
│ │ └── tabular # Directory with tabular features analyzer
│ ├── benchmark # Directory with pretraining tools
│ │ ├── data_loader # Directory with pre-defined node and edge classification datasets
│ │ ├── models # Directory with GNN model definitions
│ │ └── tasks # Directory with set of tasks that are supported for training
│ ├── cli # Directory with all cli commands
│ ├── configuration # Directory with SynGen formats
│ ├── generator # Directory with all the generators
│ │ ├── graph # Directory with graph generators and graph
│ │ └── tabular # Directory with tabular generators
│ │ ├── data_transformer # Directory with tabular data transformations used by generators
│ │ └── transforms # Directory with tabular column transforms
│ ├── graph_aligner # Directory with all the aligners
│ ├── preprocessing # Directory with the preprocessings for the supported datasets
│ │ └── datasets # Directory with example dataset preprocessing scripts used to generate data
│ ├── synthesizer # Directory with all the synthesizers
│ └── utils # Directory with the utilities
│ └── types # Directory with common data types used in the tool
- `scripts/get_datasets.sh` - Bash script that downloads and preprocesses the supported datasets
- `docker_scripts/build_docker.sh` - Bash script that builds the Docker image
- `docker_scripts/run_docker_notebook.sh` - Bash script that runs Jupyter notebook in the Docker container
- `docker_scripts/run_docker_interactive.sh` - Bash script that runs the Docker container in interactive mode
- `syngen/synthesizer/configuration_graph_synthesizer.py` - Python file with the graph synthesizer
For the synthesis process, refer to the parameters in the following table.
| Scope | parameter | Comment | Default Value |
|---|---|---|---|
| preprocess | --dataset DATASET_NAME | Dataset to preprocess into SynGen format. Available datasets: [cora, epinions, ogbn_mag, ogbn_mag240m, ieee, tabformer] | Required |
| preprocess | -sp, --source-path SOURCE_PATH | Path to the downloaded raw dataset | Required |
| preprocess | -dp, --destination-path DESTINATION_PATH | Path to store the preprocessed dataset in SynGen format | SOURCE_PATH/syngen_preprocessed |
| preprocess | --cpu | Runs all operations on CPU | |
| preprocess | --use-cache | Does nothing if the target preprocessed dataset exists | |
| preprocess | --download | Downloads the dataset to the specified SOURCE_PATH | |
| mimic-dataset | -dp, --dataset-path DATASET_PATH | Path to the dataset in SynGen format | |
| mimic-dataset | -of, --output-file OUTPUT_FILE | Path to the generated SynGen Configuration | |
| mimic-dataset | -tg, --tab-gen TABULAR_GENERATOR | Tabular generator used to generate all tabular features (you can always modify OUTPUT_FILE afterwards). Available options: [kde, random, gaussian, uniform, ctgan] | kde |
| mimic-dataset | -rsg, --random-struct-gen | Generates a random structure based on the Erdos-Renyi model instead of mimicking | |
| mimic-dataset | -es, --edge-scale EDGE_SCALE | Multiplies the number of edges to generate by the provided number | |
| mimic-dataset | -en, --node-scale NODE_SCALE | Multiplies the number of nodes to generate by the provided number | |
| synthesize | -cp, --config-path CONFIG_PATH | Path to the SynGen Configuration file that describes how to generate a graph | Required |
| synthesize | -sp, --save-path SAVE_PATH | Save path to dump generated files | Current directory |
| synthesize | --verbose | Displays generation process progress | |
| synthesize | --cpu | Runs all operations on CPU. [Attention] Alignment is not available on CPU | |
| synthesize | --timer-path FILE_PATH | Saves generation process timings to the specified file | Required |
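The table describes `--edge-scale` and `--node-scale` as multipliers on the number of edges and nodes. A quick way to reason about their combined effect is to look at the resulting edges-per-node ratio. The helper below is a hypothetical sketch of that arithmetic; how the tool itself rounds fractional counts is not stated here, so the use of `round` is an assumption.

```python
from typing import Tuple


def scaled_size(num_nodes: int, num_edges: int,
                node_scale: float = 1.0, edge_scale: float = 1.0) -> Tuple[int, int, float]:
    """Return the scaled node count, scaled edge count, and edges per node."""
    new_nodes = round(num_nodes * node_scale)
    new_edges = round(num_edges * edge_scale)
    return new_nodes, new_edges, new_edges / new_nodes


if __name__ == "__main__":
    # Scaling both by the same factor keeps the edges-per-node ratio unchanged.
    print(scaled_size(1000, 5000, node_scale=2, edge_scale=2))
    # Scaling only the nodes makes the generated graph sparser.
    print(scaled_size(1000, 5000, node_scale=2, edge_scale=1))
```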
For pretraining, refer to the command-line options, as the parameters depend on the model choice.
In this example, we show how to define the synthesizer pipeline for the IEEE dataset. A full example can be found in ieee_notebook.
- Preprocessing class is used to convert the IEEE dataset into SynGen format.
preprocessed_path = '/workspace/data/ieee_preprocessed'  # reused below as the data source path
preprocessing = IEEEPreprocessing(source_path='/workspace/data/ieee-fraud', destination_path=preprocessed_path)
feature_spec = preprocessing.transform()
- SynGen Configuration is used to specify all generation details. We use the original dataset feature spec as a base for the configuration
feature_spec_for_config = feature_spec.copy()
- Tabular generator is used to generate tabular features.
feature_spec_for_config[MetaData.EDGES][0][MetaData.TABULAR_GENERATORS] = [
    {
        MetaData.TYPE: "kde",
        MetaData.FEATURES_LIST: -1,  # copies all tabular features from the original dataset
        MetaData.DATA_SOURCE: {
            MetaData.TYPE: "configuration",
            MetaData.PATH: preprocessed_path,
            MetaData.NAME: "user-product",
        },
        MetaData.PARAMS: {}
    }
]
- Structure generator is used to generate graph structure.
feature_spec_for_config[MetaData.EDGES][0][MetaData.STRUCTURE_GENERATOR] = {
    MetaData.TYPE: "RMAT",
    MetaData.DATA_SOURCE: {
        MetaData.TYPE: "cfg",  # the equivalent of 'configuration'
        MetaData.PATH: preprocessed_path,
        MetaData.NAME: "user-product",
    },
    MetaData.PARAMS: {
        "seed": 42,
    }
}
- After providing all related information, we create a `SynGenConfiguration` object. It fills out missing fields and validates the provided data.
config = SynGenConfiguration(feature_spec_for_config)
- Synthesizer is a class that combines all the generators and allows the user to run end-to-end fitting and generation.
synthesizer = ConfigurationGraphSynthesizer(configuration=config, save_path='/workspace/data/ieee_generated')
- To start the fitting process, we use the `fit` method provided by the synthesizer. It automatically loads all required data from the disk based on the information provided in the config.
synthesizer.fit()
- To run generation, we call the `generate` method provided by the synthesizer. We use `return_data=False` because we only want to store the generated data in the `/workspace/data/ieee_generated` folder. Otherwise, it also loads the tabular data under the `MetaData.FEATURES_DATA` key for each node and edge type, and the structural data under the `MetaData.STRUCTURE_DATA` key for edges.
out_feature_spec = synthesizer.generate(return_data=False)
To download the datasets used in the examples, use the get_datasets.sh script:
bash scripts/get_datasets.sh
Note: Certain datasets require a Kaggle API key and may therefore require a manual download. Refer to the links below.
Note: Each user is responsible for checking the content of datasets and the applicable licenses and determining if they are suitable for the intended use.
Supported datasets:
Our results were obtained by running the notebooks in the demos directory in the PyTorch NGC container on NVIDIA DGX1 V100 with 8x V100 32GB GPUs. All the notebooks are presented in the table below.
| | scope | notebook | description |
|---|---|---|---|
| 1. | basic_examples | e2e_cora_demo.ipynb | a complete process of generating a non-bipartite graph dataset with node features |
| 2. | basic_examples | e2e_ieee_demo.ipynb | a complete process of generating a bipartite graph dataset with edge features |
| 3. | basic_examples | e2e_epinions_demo.ipynb | a complete process of generating a heterogeneous bipartite graph dataset with edge features |
| 4. | advanced_examples | big_graph_generation.ipynb | a complete process of mimicking and scaling the MAG240m dataset |
| 5. | performance | struct_generator.ipynb | comparison of SynGen graph structure generators |
| 6. | performance | tabular_generator.ipynb | comparison of SynGen tabular data generators |
Scope refers to the directories in which the notebooks are stored and the functionality that particular notebooks cover. There are three scopes:
- Basic - basic_examples - notebooks with examples of basic functionality
- Advanced - advanced_examples - notebooks with the examples of advanced functionalities
- Performance - performance - notebooks with the performance experiments
To achieve the same results, follow the steps in the Quick Start Guide.
The content quality comparison was conducted on the IEEE dataset (refer to List of datasets for more details) with the corresponding notebook e2e_ieee_demo.ipynb. We compared three modalities: the quality of the generated graph structure, the quality of the generated tabular data, and the quality of aligning the tabular data to the graph structure.
- Graph structure quality
- Tabular data quality
- Structure to tabular alignment quality
- Performance of categorical tabular data generation (samples/s)
| Dataset (CPU/GPU) | KDE | Uniform | Gaussian | Random |
|---|---|---|---|---|
| ieee (CPU) | 371296 | 897421 | 530683 | 440086 |
| ieee (GPU) | 592132 | 3621726 | 983408 | 6438646 |
- Performance (batches/s) comparison between original vs. synthetic datasets
| Dataset | Model | Synthetic | Original |
|---|---|---|---|
| ieee | gat | 0.07173 | 0.07249 |
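The throughput figures reported above can be turned into GPU-over-CPU speedups with a few lines of Python. The numbers are copied from the results above; the variable names are only illustrative. The speedups range from roughly 1.6x for KDE to roughly 14.6x for the Random generator, and training throughput on the synthetic data is about 99% of that on the original data.

```python
# Samples per second for categorical tabular data generation on the ieee dataset.
cpu = {"KDE": 371296, "Uniform": 897421, "Gaussian": 530683, "Random": 440086}
gpu = {"KDE": 592132, "Uniform": 3621726, "Gaussian": 983408, "Random": 6438646}

# GPU speedup per generator: GPU throughput divided by CPU throughput.
speedup = {name: gpu[name] / cpu[name] for name in cpu}

# Batches per second when training gat on synthetic vs. original ieee data.
synthetic_bps, original_bps = 0.07173, 0.07249
relative_throughput = synthetic_bps / original_bps

for name, value in speedup.items():
    print(f"{name}: {value:.2f}x")
print(f"synthetic/original: {relative_throughput:.4f}")
```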
August 2023
- Heterogeneous graph generation
- Multi-GPU generation
January 2023
- Initial release
There are no known issues with this tool.
Cite the following paper if you find this code useful or use it in your own work:
@article{darabi2022framework,
title={A Framework for Large Scale Synthetic Graph Dataset Generation},
author={Darabi, Sajad and Bigaj, Piotr and Majchrowski, Dawid and Morkisz, Pawel and Fit-Florea, Alex},
journal={arXiv preprint arXiv:2210.01944},
year={2022}
}




