
Memory leak while reading /sys/devices/system/cpu/online inside an Incus container #677

DeyanSG opened this issue Feb 17, 2025 · 6 comments

@DeyanSG

DeyanSG commented Feb 17, 2025

Hello,

We are using Incus + lxcfs in our setup and we’ve run into an issue with the memory consumption of the lxcfs process while reading aggressively from /sys/devices/system/cpu/online.

Versions
We’ve tested with different versions of both lxcfs and libfuse3 and the issue seems to be present even with the latest stable versions:

  • lxcfs 6.0.3
  • libfuse3: both the latest (3.16.2) and the CentOS 9 default (3.10.2)

Setup
We are running an Incus container on a node with 56 CPU cores. The issue seems reproducible even with a single container. In our setup the container's CPU usage is restricted with limits.cpu.allowance: 1200ms/60ms (not particularly relevant, but the effect shows up much faster if the container is allowed to use more CPU).
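
For reference, the allowance can be set with something like the following (c1 is just a placeholder container name):
incus config set c1 limits.cpu.allowance 1200ms/60ms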

Reproducer
To reproduce the issue, compile the following C code, which starts a number of threads inside the container, each repeatedly opening, reading from and then closing a file:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <stdbool.h>

//compile with:
//gcc -pthread -o fuse-stress-poc fuse-stress-poc.c

#define DEFAULT_FILE_PATH "/sys/devices/system/cpu/online"

volatile sig_atomic_t run = 1;

void handle_sigint(int sig) {
    run = 0;
}

// Worker thread: keep opening, reading and closing the target file until interrupted.
void* stress_work(void* arg) {
    const char* file_path = (const char*)arg;
    int fd;
    char buffer[256];
    ssize_t bytes_read;

    while (run) {
        fd = open(file_path, O_RDONLY);
        if (fd != -1) {
            bytes_read = read(fd, buffer, sizeof(buffer) - 1);
            (void)bytes_read; // the content itself is irrelevant, we only exercise the read path
            close(fd);
        }
    }
    return NULL;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <num_threads> <file_path>\n", argv[0]);
        return EXIT_FAILURE;
    }

    int num_threads = atoi(argv[1]);
    if (num_threads <= 0) {
        fprintf(stderr, "Number of threads must be positive.\n");
        return EXIT_FAILURE;
    }

    const char* file_path = argv[2];
    if (access(file_path, R_OK) != 0) {
        fprintf(stderr, "File '%s' does not exist or is not accessible.\n", file_path);
        return EXIT_FAILURE;
    }

    signal(SIGINT, handle_sigint);

    pthread_t* threads = malloc(num_threads * sizeof(pthread_t));
    if (threads == NULL) {
        perror("malloc");
        return EXIT_FAILURE;
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, stress_work, (void*)file_path) != 0) {
            perror("pthread_create");
            num_threads = i; // only join the threads that were actually started
            break;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        pthread_join(threads[i], NULL);
    }

    free(threads);

    return 0;
}

Run it with the following command in a container:
./fuse-stress-poc 400 /sys/devices/system/cpu/online

Monitor the RSS memory usage of lxcfs. We can see it go over 1 GB in about a minute. If we then stop/kill the process inside the container, the RSS usage stays around the same value instead of dropping back to about 2 MB.
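
For example, the RSS of lxcfs can be watched from the host with something like:
watch -n 5 'grep VmRSS /proc/$(pidof lxcfs)/status'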

So far we’ve tried the following:

  • We read /proc/uptime and /proc/cpuinfo in the same way to see whether the leak shows up with those files, but we could not reproduce the issue: RSS stays low (around 2 MB) while reading them (see the example command after this list).
  • We also attempted to find which commit introduced this behavior and, as far as we can tell, somewhat surprisingly it seems to be the one enabling direct_io: c2b4b50
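
For reference, the same reproducer can simply be pointed at those files, e.g.:
./fuse-stress-poc 400 /proc/uptime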

We would appreciate your assistance in verifying if this issue is reproducible on your end, so we can collaborate effectively to identify and implement a solution.

We stumbled upon this while investigating other issues related to hanging lxcfs file operations, which is why we wrote the stress test in the first place. We were unable to reproduce the hang, but we did find what appears to be a memory leak.

Apologies for any confusion caused by me opening, resolving and then recreating the issue; I accidentally clicked the wrong option while typing.

Regards,
Deyan

@stgraber
Member

@mihalicyn

@DeyanSG
Author

DeyanSG commented Mar 7, 2025

Hi everyone,

If there’s any additional information we can provide or if there's any way we can assist in diagnosing the issue, please feel free to let me know. I'm here to help in any way possible.

Thank you!

Regards,
Deyan

@mihalicyn mihalicyn self-assigned this Mar 7, 2025
@mihalicyn
Member

Hey @DeyanSG,

thanks for such a detailed report. I'm starting to look into this.

I'll let you know if any additional info is needed.

@mihalicyn
Member

mihalicyn commented Mar 7, 2025

We also attempted to find which commit introduced this behavior and, as far as we can tell, somewhat surprisingly it seems to be the one enabling direct_io: c2b4b50

The direct_io change will definitely make things worse if there is a memory leak on the LXCFS side, but that doesn't mean the change itself is wrong. direct_io is the only way to go with LXCFS. Just imagine having a page cache on top of procfs ;-)
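
For context, here is a minimal sketch (assuming the libfuse3 high-level API, not the actual LXCFS source) of how direct_io is typically enabled per open file:

/* Minimal illustration, not the actual LXCFS code: the filesystem sets
 * direct_io on the file handle in its open callback, which tells the kernel
 * to bypass the page cache so every read() is forwarded to the FUSE daemon. */
#define FUSE_USE_VERSION 31
#include <fuse.h>

static int example_open(const char *path, struct fuse_file_info *fi)
{
    (void)path;
    fi->direct_io = 1;   /* no page cache for data read through this handle */
    return 0;
}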

Please, can you tell me your kernel version on the host and whether you use ZFS by any chance?

@DeyanSG
Author

DeyanSG commented Mar 10, 2025

Hi, @mihalicyn,

Thank you for looking into the issue!

We also attempted to find which commit introduced this behavior and, as far as we can tell, somewhat surprisingly it seems to be the one enabling direct_io: c2b4b50

The direct_io change will definitely make things worse if there is a memory leak on the LXCFS side, but that doesn't mean the change itself is wrong. direct_io is the only way to go with LXCFS. Just imagine having a page cache on top of procfs ;-)

I am quite sure direct_io should stay enabled: a couple of years ago we were running a version from before that commit and we saw some of the odd behavior the page cache was causing :).
What I found somewhat strange is that when running the benchmark with direct_io turned off, the lxcfs process was still working hard (using a lot of CPU time), yet even when left overnight its RSS usage remained completely unchanged. Then again, caching would tend to hide or at least slow down memory leaks, so this may mean nothing.

Please, can you tell me your kernel version on the host and whether you use ZFS by any chance?

The host is running kernel 6.6.63. While trying to pin down the cause we also tested with our previous kernel, 6.6.21, just to rule out the kernel upgrade, and we were able to reproduce the issue there as well.

We do not use ZFS.

@mihalicyn
Member

I can confirm this. I'm actively investigating this and will fix it soon.

Huge thanks again for such a detailed report.

@mihalicyn mihalicyn added the Bug Confirmed to be a bug label Mar 11, 2025