OOM issue with dgl.unbatch on GPU #6542

Closed
wondey-sh opened this issue Nov 8, 2023 · 3 comments · Fixed by #6564
Labels: bug:confirmed (Something isn't working)

Comments


🐛 Bug

GPU memory usage keeps increasing while calling dgl.unbatch on batched graphs on the GPU and copying the split graphs back to the CPU.

To Reproduce

Run the following script and notice that the allocated GPU memory keeps increasing. With a much larger graph dataset, this eventually causes an OOM crash.

import dgl
import torch
from dgl.dataloading import GraphDataLoader


dataset = dgl.data.QM9EdgeDataset()
dataloader = GraphDataLoader(dataset, batch_size=64)

graph_list = []

for batch_graph, _ in dataloader:
    # Move the batched graph to the GPU, split it, and copy the parts back to the CPU.
    batch_graph = batch_graph.to('cuda:0')
    split_graphs = dgl.unbatch(batch_graph)

    graph_list.extend([graph.cpu() for graph in split_graphs])

    # The reported allocation grows on every iteration.
    print(f'memory allocation: {torch.cuda.memory_allocated() / 1024**2} MB')

    torch.cuda.empty_cache()

Expected behavior

The allocated GPU memory should stay roughly constant across batches. How can this memory growth be prevented?
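
One possible workaround (a sketch only, not verified to address the underlying bug) is to move the batched graph back to the CPU before unbatching, so that dgl.unbatch never allocates the per-graph copies on the GPU:

for batch_graph, _ in dataloader:
    batch_graph = batch_graph.to('cuda:0')

    # ... run GPU inference on batch_graph here ...

    # Unbatch on the CPU; the resulting graphs are already CPU graphs.
    graph_list.extend(dgl.unbatch(batch_graph.cpu()))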

Environment

  • DGL Version: 1.1.2+cu117
  • Backend Library & Version: PyTorch 1.13.1
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): conda
  • Build command you used (if compiling from source):
  • Python version: 3.9
  • CUDA/cuDNN version (if applicable): 11.7
  • GPU models and configuration (e.g. V100): V100
  • Any other relevant information: the same behavior was reproduced on a Windows machine (DGL 1.1.2+cu116, Python 3.10)

Additional context

czkkkkkk (Collaborator) commented Nov 8, 2023

Hi @wondey-sh, have you estimated the expected memory usage?

wondey-sh (Author) commented Nov 8, 2023

> Hi @wondey-sh, have you estimated the expected memory usage?

Hi @czkkkkkk, the example code is part of my model inference pipeline, and I expect it to run on a g4dn.xlarge machine. However, OOM happens even when batch_size is 1. The graphs in my dataset are much larger than those in the example code, but they are perfectly fine for training with batch_size=16. Thanks.
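
For reference, a rough per-batch estimate can be obtained by summing the sizes of a graph's feature tensors (graph_memory_bytes is a hypothetical helper; it counts node/edge features only, not the graph structure itself):

def graph_memory_bytes(g):
    # Sum the sizes of all node and edge feature tensors of a DGLGraph.
    total = 0
    for frame in (g.ndata, g.edata):
        for tensor in frame.values():
            total += tensor.element_size() * tensor.numel()
    return total

# Compare against torch.cuda.memory_allocated() inside the loop, e.g.:
# print(f'estimated batch size: {graph_memory_bytes(batch_graph) / 1024**2:.1f} MB')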

czkkkkkk (Collaborator) commented Nov 9, 2023

I see. I reproduced the issue. We will dive deep into it.
