bigsk1 edited this page Nov 23, 2024 · 1 revision

GPU Monitor Wiki

Welcome to the GPU Monitor Wiki! This comprehensive guide will help you understand, install, and use the GPU Monitor dashboard.

Overview

GPU Monitor is a real-time NVIDIA GPU monitoring dashboard that shows temperature, utilization, memory, and power metrics in a web interface. It is packaged with Docker for easy deployment and cross-platform compatibility, and it adds historical data tracking and configurable alerts.

Features

Real-Time Monitoring

  • Temperature Tracking: Monitor GPU temperature in real-time with color-coded indicators
  • Utilization Metrics: Track GPU usage percentage with historical data
  • Memory Usage: Monitor VRAM usage and availability
  • Power Consumption: Track power usage and efficiency metrics

Historical Data

  • Multiple timeframe views:
    • 15 Minutes
    • 30 Minutes
    • 1 Hour
    • 6 Hours
    • 12 Hours
    • 24 Hours
  • Interactive performance graphs
  • Statistical analysis of historical data

Alert System

  • Configurable threshold alerts for:
    • Temperature
    • GPU Utilization
    • Power Usage
  • Multiple notification methods:
    • Visual alerts
    • Sound notifications
    • Browser notifications
  • Persistent alert settings

Dashboard Components

  • Real-time gauge displays
  • Interactive performance history graph
  • Recent statistics table
  • 24-hour statistics overview

Installation

Prerequisites

  • Docker
  • NVIDIA GPU with drivers installed
  • NVIDIA Container Toolkit

Quick Start

Using Docker Run

docker run -d \
  --name gpu-monitor \
  -p 8081:8081 \
  -e TZ=America/Los_Angeles \
  -v /etc/localtime:/etc/localtime:ro \
  -v "$(pwd)/history":/app/history \
  -v "$(pwd)/logs":/app/logs \
  --gpus all \
  --restart unless-stopped \
  bigsk1/gpu-monitor:latest

Using Docker Compose

version: '3.8'

services:
  gpu-monitor:
    image: bigsk1/gpu-monitor:latest
    container_name: gpu-monitor
    ports:
      - "8081:8081"
    environment:
      - TZ=America/Los_Angeles
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./history:/app/history
      - ./logs:/app/logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    runtime: nvidia

Access

Access the dashboard at: http://localhost:8081
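
To confirm from a script that the dashboard is answering, a small check like the one below may help. It is a minimal sketch that assumes the default URL from this page; the function name `dashboard_is_up` is made up for illustration and is not part of GPU Monitor.

```python
import urllib.error
import urllib.request


def dashboard_is_up(url: str = "http://localhost:8081", timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to the dashboard URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

Calling `dashboard_is_up()` shortly after `docker run` should return True once the container has finished starting; a False result points to the Troubleshooting section.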

Configuration

Time Zone Configuration

Set your local timezone using the TZ environment variable:

-e TZ=America/Los_Angeles

TZ accepts any name from the IANA time zone database, for example Europe/Berlin or Asia/Tokyo.
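
To catch a typo in a time zone name before starting the container, the name can be checked locally. The sketch below uses Python's standard `zoneinfo` module and assumes the machine running it has time zone data installed; `is_valid_tz` is a hypothetical helper, not something GPU Monitor provides.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def is_valid_tz(name: str) -> bool:
    """Return True if `name` resolves to an IANA time zone on this system."""
    try:
        ZoneInfo(name)
        return True
    except (ZoneInfoNotFoundError, ValueError, OSError):
        # Unknown key, malformed key, or a key that maps to a directory.
        return False
```

On a system with time zone data, `is_valid_tz("America/Los_Angeles")` is expected to succeed, while a misspelled name returns False.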

Alert Settings

Configure alert thresholds through the UI:

  1. Temperature (°C)
  2. GPU Utilization (%)
  3. Power Usage (W)

Alert settings persist in the browser's local storage, so they apply per browser and are not shared across devices.
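
The threshold logic can be pictured with a short sketch. This is an illustration only: the actual checks run in the dashboard's frontend, and the default values, the `Thresholds` class, and the strict "greater than" comparison below are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical defaults; the real values are whatever you set in the UI.
    temperature_c: float = 80.0
    utilization_pct: float = 95.0
    power_w: float = 300.0


def triggered_alerts(temperature_c: float, utilization_pct: float,
                     power_w: float, limits: Thresholds) -> list:
    """Return the names of the metrics whose reading exceeds its threshold."""
    checks = [
        ("temperature", temperature_c, limits.temperature_c),
        ("utilization", utilization_pct, limits.utilization_pct),
        ("power", power_w, limits.power_w),
    ]
    # Assumption: an alert fires only when the reading is strictly above the limit.
    return [name for name, reading, limit in checks if reading > limit]
```

For example, a reading of 85 °C, 50 % utilization, and 350 W against these assumed defaults would flag temperature and power but not utilization.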

Data Persistence

Data is persisted through Docker volumes:

volumes:
  - ./history:/app/history    # Historical data
  - ./logs:/app/logs         # Application logs

Usage Guide

Dashboard Navigation

Real-Time Metrics

  • Temperature: Current GPU temperature with color-coded gauge
  • GPU Utilization: Current usage percentage
  • Memory Usage: VRAM usage in MiB
  • Power Usage: Current power consumption in watts

Historical Data View

  1. Select timeframe using the buttons:
    • 15 Minutes
    • 30 Minutes
    • 1 Hour
    • 6 Hours
    • 12 Hours
    • 24 Hours
  2. View performance indicators:
    • Peak Temperature
    • Average Utilization
    • Maximum Memory Usage
    • Power Efficiency
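
The indicators above could be derived from stored samples roughly as follows. This is a sketch under assumptions: the `Sample` layout and the timeframe keys are invented for illustration, and Power Efficiency is left out because this page does not define how it is calculated.

```python
from dataclasses import dataclass
from statistics import mean

# Timeframes offered by the dashboard buttons, in minutes.
TIMEFRAMES_MINUTES = {"15m": 15, "30m": 30, "1h": 60, "6h": 360, "12h": 720, "24h": 1440}


@dataclass(frozen=True)
class Sample:
    timestamp: float        # seconds since the epoch
    temperature_c: float
    utilization_pct: float
    memory_mib: float
    power_w: float


def summarize(samples, timeframe: str, now: float):
    """Compute the indicators for samples inside the chosen timeframe."""
    cutoff = now - TIMEFRAMES_MINUTES[timeframe] * 60
    window = [s for s in samples if s.timestamp >= cutoff]
    if not window:
        return None  # nothing recorded in this timeframe
    return {
        "peak_temperature_c": max(s.temperature_c for s in window),
        "average_utilization_pct": mean(s.utilization_pct for s in window),
        "max_memory_mib": max(s.memory_mib for s in window),
    }
```

Switching from 15 Minutes to 24 Hours only moves the cutoff, so older samples start contributing to the peak and average values.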

Interactive Graph

  • Toggle metrics by clicking on gauge cards
  • View multiple metrics simultaneously
  • Color-coded lines for easy identification:
    • Temperature: Red
    • GPU Usage: Green
    • Memory: White
    • Power: Purple

Alert Management

  1. Open the Alert Settings panel
  2. Configure thresholds:
    • Temperature Threshold (°C)
    • GPU Utilization Threshold (%)
    • Power Threshold (W)
  3. Enable/disable:
    • Sound Alerts
    • Browser Notifications

Components

Backend Components

  • NVIDIA SMI Integration
  • Data Collection Service
  • Web Server (aiohttp)
  • Historical Data Processing
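
To illustrate the NVIDIA SMI integration, the sketch below queries `nvidia-smi` in CSV mode and parses one line per GPU. The query fields shown are standard `nvidia-smi --query-gpu` properties, but whether the project's collector uses exactly these fields is an assumption, and the function names are hypothetical.

```python
import subprocess

# Standard nvidia-smi query properties; the project's own field list may differ.
QUERY_FIELDS = ("temperature.gpu", "utilization.gpu", "memory.used",
                "memory.total", "power.draw")


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    values = [value.strip() for value in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    parsed = {}
    for field, raw in zip(QUERY_FIELDS, values):
        try:
            parsed[field] = float(raw)
        except ValueError:
            parsed[field] = None  # e.g. "[N/A]" when the GPU does not report it
    return parsed


def query_gpus() -> list:
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on a machine without a GPU.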

Frontend Components

  • Real-time Gauges
  • Interactive Charts (Chart.js)
  • Alert System
  • Responsive Design

Monitoring Features

Metrics Collection

  • 4-second update interval for real-time data
  • Efficient data buffering
  • Automatic log rotation
  • Historical data aggregation
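
A bounded buffer is one way to picture how 4-second samples could be held and aggregated. The sketch below is not the project's implementation: the `SampleBuffer` class, the 24-hour capacity, and the per-minute averaging are assumptions chosen to match the figures on this page.

```python
from collections import deque

UPDATE_INTERVAL_S = 4                             # update interval stated above
RETENTION_S = 24 * 3600                           # assumed: longest dashboard timeframe
MAX_SAMPLES = RETENTION_S // UPDATE_INTERVAL_S    # 21600 samples cover 24 hours


class SampleBuffer:
    """Fixed-size buffer that drops the oldest sample once it is full."""

    def __init__(self, maxlen: int = MAX_SAMPLES):
        self._samples = deque(maxlen=maxlen)

    def append(self, timestamp: float, value: float) -> None:
        self._samples.append((timestamp, value))

    def __len__(self) -> int:
        return len(self._samples)

    def minute_averages(self) -> dict:
        """Aggregate samples into averages keyed by the start of each minute."""
        buckets = {}
        for timestamp, value in self._samples:
            minute_start = int(timestamp // 60) * 60
            buckets.setdefault(minute_start, []).append(value)
        return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}
```

At one sample every 4 seconds, a minute bucket holds about 15 readings, so aggregating by minute shrinks a 24-hour history from 21600 points to 1440.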

Data Visualization

  • Color-coded gauges
  • Multi-metric graphing
  • Responsive design
  • Mobile compatibility

Troubleshooting

Common Issues

NVIDIA SMI Not Found

# Verify NVIDIA drivers
nvidia-smi

# Test Docker NVIDIA runtime
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
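
The first check above can also be scripted, for example as part of a host setup routine. This sketch only looks for the binary on PATH and says nothing about whether the driver itself works; `diagnose_nvidia_smi` is a hypothetical helper name.

```python
import shutil


def diagnose_nvidia_smi(tool: str = "nvidia-smi") -> str:
    """Report whether `tool` can be found on the current PATH."""
    path = shutil.which(tool)
    if path is None:
        return f"{tool} not found on PATH; check the NVIDIA driver installation"
    return f"{tool} found at {path}"
```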

Container Fails to Start

  1. Check Docker logs:
docker logs gpu-monitor
  2. Verify GPU access
  3. Check port availability
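
Port availability can be checked by trying to bind to the port before the container starts. This is a minimal sketch that assumes the default host port 8081 from the Quick Start; `port_is_free` is an illustrative name.

```python
import socket


def port_is_free(port: int = 8081, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permissions problem.
            return False
    return True
```

If this returns False while the container is stopped, another process holds the port, and mapping a different host port (for example `-p 8082:8081`) avoids the conflict.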

Dashboard Not Accessible

  1. Verify container is running:
docker ps
  2. Check port mapping
  3. Verify network access

Debug Logging

Enable debug logging by uncommenting the following line in monitor_gpu.sh:

DEBUG=true

Advanced Topics

Custom Alert Sounds

Replace alert.mp3 in the sounds directory with your preferred sound file, keeping the same file name.

Data Retention

Configure log rotation in monitor_gpu.sh:

local max_size=$((10 * 1024 * 1024))  # 10MB
local max_age=$((2 * 24 * 3600))      # 2 days
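
The two limits work out to 10485760 bytes and 172800 seconds. How the script combines them is not stated on this page; the sketch below assumes a log is rotated when either limit is exceeded, and `needs_rotation` is a hypothetical function written only to show the arithmetic.

```python
MAX_SIZE_BYTES = 10 * 1024 * 1024    # mirrors max_size above: 10485760 bytes
MAX_AGE_SECONDS = 2 * 24 * 3600      # mirrors max_age above: 172800 seconds


def needs_rotation(size_bytes: int, age_seconds: float,
                   max_size: int = MAX_SIZE_BYTES,
                   max_age: float = MAX_AGE_SECONDS) -> bool:
    """Assumed policy: rotate when the file is too large OR too old."""
    return size_bytes > max_size or age_seconds > max_age
```

Under this assumed policy, raising the retention to 7 days would mean changing the age expression to `7 * 24 * 3600`.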

Performance Optimization

  • Buffer size configuration
  • Update interval adjustment
  • Log rotation settings

Security Considerations

  • Container isolation
  • Volume permissions
  • Network access control