Troubleshooting Guide#

Common issues, error patterns, and solutions for Graphistry deployments.

Quick Diagnostics#

Before diving into specific issues, run these diagnostic commands:

# Service status
./graphistry ps

# GPU availability
nvidia-smi

# Check for restart loops
docker ps -a | grep -E "(Restarting|Exited)"

# View recent errors across all services
./graphistry logs --tail=50 | grep -i error

Log Configuration#

Environment Variables#

Configure logging in data/config/custom.env:

Variable

Description

Default

LOG_LEVEL

System-wide log level

INFO

GRAPHISTRY_LOG_LEVEL

Graphistry application log level

INFO

Log levels: TRACE, DEBUG, INFO, WARN, ERROR

# Development / Debugging
LOG_LEVEL=DEBUG
GRAPHISTRY_LOG_LEVEL=DEBUG

Notes:

  • TRACE: Slow due to heavy CPU-GPU traffic, may expose secrets

  • DEBUG: Large log volumes, enable log rotation

  • INFO: Recommended for production

  • Always reset to INFO after debugging

Viewing Logs#

# All services with timestamps
./graphistry logs -f -t --tail=100

# Specific services
./graphistry logs -f forge-etl-python streamgl-gpu streamgl-viz

# Filter for errors
./graphistry logs --tail=500 | grep -i "error\|exception\|failed"

# Since specific time
./graphistry logs --since 1h

Log Driver Configuration#

For local log access, ensure Docker uses the JSON log driver. Edit /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}

Then restart Docker: sudo service docker restart

Telemetry for Troubleshooting#

OpenTelemetry provides distributed tracing and metrics for diagnosing complex issues across services.

Enabling Telemetry#

Configure in data/config/telemetry.env:

ENABLE_OPEN_TELEMETRY=true
OTEL_COMPOSE_FILE=telemetry.yml

Restart services: ./graphistry stop && ./graphistry up -d

Using Traces for Debugging#

When telemetry is enabled, access Jaeger for distributed tracing:

  • Default: http://$GRAPHISTRY_HOST:16686/jaeger/

  • Behind Caddy: https://$GRAPHISTRY_HOST/jaeger/

Key trace points:

  • Root span: streamgl-viz: handleFalcorSocketConnections

  • GPU worker actions: play, read, stop spans

  • Data fetching: ETL dataset fetch spans from forge-etl-python

Use cases:

  • Identify which service caused an error

  • Find latency bottlenecks in the rendering pipeline

  • Track request flow from browser to GPU workers

Using Metrics for Debugging#

Access Prometheus for metrics:

  • Default: http://$GRAPHISTRY_HOST:9091/prometheus/

  • Behind Caddy: https://$GRAPHISTRY_HOST/prometheus/

Critical metrics to monitor:

  • worker_read_crashes_total: GPU worker crashes

  • forge_etl_python_upload_*: Upload and dataset creation metrics

GPU Monitoring#

Grafana with NVIDIA DCGM provides GPU health dashboards:

  • Default: http://$GRAPHISTRY_HOST:3000

  • Behind Caddy: https://$GRAPHISTRY_HOST/grafana/

See Telemetry Documentation for full configuration.

NVIDIA Driver Troubleshooting#

Verify Driver Installation#

# Check driver version
nvidia-smi

# Expected output shows driver version, CUDA version, GPU info
# If command not found or fails, driver is not installed/loaded

Driver Not Loaded#

Symptoms: nvidia-smi returns “command not found” or “driver not loaded”

# Check if driver module is loaded
lsmod | grep nvidia

# Check for driver installation
dpkg -l | grep nvidia-driver    # Debian/Ubuntu
rpm -qa | grep nvidia-driver    # RHEL/CentOS

# Check kernel messages for GPU errors
dmesg | grep -i nvidia

Solutions:

# Reinstall driver (Ubuntu example)
sudo apt update
sudo apt install --reinstall nvidia-driver-535

# Or use NVIDIA's official installer
# Download from https://www.nvidia.com/drivers

# After install, reboot
sudo reboot

Driver/CUDA Version Mismatch#

Symptoms: CUDA errors, library version conflicts

Check compatibility:

# Driver's maximum supported CUDA version
nvidia-smi | grep "CUDA Version"

# Installed CUDA toolkit version
nvcc --version

The driver’s CUDA version must be >= the toolkit version. See NVIDIA CUDA Compatibility.

GPU Not Visible#

# List all NVIDIA devices
lspci | grep -i nvidia

# Check if GPU is recognized by driver
nvidia-smi -L

# For multi-GPU, verify all GPUs listed
nvidia-smi --query-gpu=index,name,uuid --format=csv

Solutions:

  • Check PCIe seating and power connections

  • Update system BIOS

  • Check for GPU in BIOS settings (may be disabled)

Persistence Mode#

For production, enable persistence mode to reduce GPU initialization latency:

# Enable persistence mode
sudo nvidia-smi -pm 1

# Make persistent across reboots
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

Docker NVIDIA Runtime#

Verify Runtime Installation#

# Check if nvidia runtime is available
docker info | grep -i runtime

# Should show: Runtimes: io.containerd.runc.v2 nvidia runc

# Test GPU access in container
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Runtime Not Found#

Symptoms: docker: Error response from daemon: could not select device driver

# Install NVIDIA Container Toolkit
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit

# RHEL/CentOS
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit

Configure Docker Runtime#

# Configure Docker to use nvidia runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker

# Verify configuration
cat /etc/docker/daemon.json

Expected /etc/docker/daemon.json:

{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}

Set NVIDIA as Default Runtime#

For Graphistry, set nvidia as the default runtime:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}

Then restart Docker: sudo systemctl restart docker

GPU Access Denied in Container#

Symptoms: Container starts but GPU not accessible

# Check container GPU access
docker exec -it <container> nvidia-smi

# Verify CUDA_VISIBLE_DEVICES
docker exec -it <container> env | grep CUDA

Solutions:

  • Ensure --gpus all or specific GPU assignment in compose

  • Check cgroup permissions

  • Verify nvidia-container-toolkit version matches driver

Kubernetes NVIDIA GPU Operator#

Verify Operator Installation#

# Check GPU Operator pods
kubectl get pods -n gpu-operator

# All pods should be Running/Completed
# Key pods: nvidia-driver-daemonset, nvidia-device-plugin, gpu-feature-discovery

# Check GPU resources on nodes
kubectl describe nodes | grep -A5 nvidia.com/gpu

Operator Not Deploying#

Symptoms: GPU Operator pods stuck in Pending/CrashLoopBackOff

# Check operator logs
kubectl logs -n gpu-operator -l app=gpu-operator

# Check driver daemonset
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# Check events
kubectl get events -n gpu-operator --sort-by='.lastTimestamp'

Common causes:

  • Pre-installed drivers conflict: Use --set driver.enabled=false if host has drivers

  • Kernel headers missing: Install linux-headers-$(uname -r)

  • Unsupported OS/kernel version

Device Plugin Issues#

Symptoms: Nodes show 0 GPU capacity

# Check device plugin
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Verify node labels
kubectl get nodes --show-labels | grep nvidia

# Check allocatable resources
kubectl get nodes -o json | jq '.items[].status.allocatable'

Solutions:

# Restart device plugin
kubectl rollout restart daemonset nvidia-device-plugin-daemonset -n gpu-operator

# If using MIG, ensure MIG mode configured correctly
nvidia-smi mig -lgi

Pod Can’t Access GPU#

Symptoms: Pod pending with “Insufficient nvidia.com/gpu

# Check GPU allocation
kubectl describe node <node-name> | grep -A10 "Allocated resources"

# Check pod resource requests
kubectl describe pod <pod-name> | grep -A5 "Limits"

Pod GPU resource spec:

resources:
  limits:
    nvidia.com/gpu: 1

Time-Slicing Configuration#

For sharing GPUs across pods:

# gpu-operator values
devicePlugin:
  config:
    name: time-slicing-config
    default: any
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

MIG (Multi-Instance GPU) Issues#

# Check MIG status
nvidia-smi mig -lgi

# Enable MIG mode (requires reboot)
sudo nvidia-smi -mig 1

# Create MIG instances
sudo nvidia-smi mig -cgi 9,9,9 -C

For Kubernetes MIG support, configure GPU Operator with MIG strategy:

helm upgrade gpu-operator nvidia/gpu-operator \
  --set mig.strategy=single

Common Issues#

Service Won’t Start#

Symptoms: Container in restart loop, exits immediately

# Check container logs
./graphistry logs <service_name>

# Check service status
./graphistry ps

# Check for port conflicts
docker ps -a
netstat -tulpn | grep <port>

Solutions:

  • Port conflict: ./graphistry down and check for conflicting containers

  • Missing dependencies: Check service startup order

  • Configuration error: Review custom.env for syntax errors

Database Connection Issues#

Symptoms: Nexus won’t start, authentication failures, data not persisting

# Check postgres health
./graphistry logs postgres

# Test database connectivity
docker exec -it compose_postgres_1 pg_isready

# Check connection from nexus
docker exec -it compose_nexus_1 /bin/bash

Solutions:

  • Wait for postgres to be ready before nexus starts

  • Reset database: ./graphistry down -v && ./graphistry up -d (WARNING: deletes all data)

  • Check disk space: docker system df

Memory Issues#

Symptoms: OOM errors, slow performance, container crashes

# Monitor resource usage
docker stats

# Check system memory
free -h

# Check GPU memory
nvidia-smi

Solutions:

  • Clean Docker resources: docker system prune -f

  • Reduce worker counts in custom.env

  • Enable GPU memory watcher (see Performance Tuning)

Network/Connectivity Issues#

Symptoms: Services can’t communicate, API timeouts, WebSocket failures

# Check all services running
./graphistry ps

# Test inter-service connectivity
docker exec -it compose_nexus_1 ping compose_postgres_1

Solutions:

  • Restart affected services

  • Check firewall rules for required ports

  • Verify Caddy configuration

Upload Failures#

Check upload size limits in custom.env:

UPLOAD_MAX_SIZE=1G
GRAPHISTRY_HTTP_AGENT_TIMEOUT_MS=120000

Service-Specific Issues#

forge-etl-python#

# Check worker health
./graphistry logs forge-etl-python --tail=100

# Verify GPU assignment
./graphistry logs forge-etl-python | grep "GPU initialized"

Common issues:

  • cuDF memory errors: Reduce cache sizes or enable GPU watcher

  • Worker crashes: Check FORGE_NUM_WORKERS matches available GPU memory

streamgl-gpu#

# Check GPU worker status
./graphistry logs streamgl-gpu --tail=100

# Verify OpenCL
./graphistry logs streamgl-gpu | grep -i "opencl\|gpu"

Common issues:

  • OpenCL UUID format: Use integer format (0,1,2) not UUID format for STREAMGL_CUDA_VISIBLE_DEVICES

  • PM2 process manager issues: Check for worker spawn errors

streamgl-viz#

./graphistry logs streamgl-viz --tail=100

Common issues:

  • WebSocket connection failures: Check proxy configuration

  • Session timeout: Adjust GRAPHISTRY_VIZ_TO_GPU_TIMEOUT_MS

Recovery Procedures#

Soft Restart#

./graphistry restart <service_name>

Hard Restart#

./graphistry up -d --force-recreate --no-deps <service_name>

Full Environment Reset#

./graphistry down -v   # WARNING: deletes user database
docker system prune -f
./graphistry up -d

Getting Help#

Collect diagnostic information:

./graphistry ps > diagnostics.txt
./graphistry logs --tail=500 >> diagnostics.txt
nvidia-smi >> diagnostics.txt
docker info >> diagnostics.txt
cat VERSION >> diagnostics.txt

See also: