[Error] While reproducing an experiment on AutoDL, I ran into the following error: RuntimeError: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
The GPU is an RTX 3090 24G, and the software environment follows the environment.yaml of instruct-pix2pix.

[Cause] Run nvidia-smi to get information about the GPU, including the driver version, the highest supported CUDA version, and some device details.

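Note that the CUDA Version reported by nvidia-smi is not the CUDA toolkit version installed on the machine; it is the highest CUDA version that the installed driver supports. To check the locally installed CUDA toolkit version, run: nvcc --version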
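The number in the error message (found version 11070) is the driver's supported CUDA version in the integer encoding used by the CUDA driver API, 1000 * major + 10 * minor, so 11070 means CUDA 11.7. The sketch below decodes that number and compares it with the CUDA version a PyTorch build targets. The function names are hypothetical, and the value "12.1" is only an assumed example, since the post does not state which CUDA version the PyTorch build was compiled against. The comparison is also a simplification that ignores CUDA minor-version compatibility.

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Decode the CUDA driver API integer format (1000 * major + 10 * minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_supports(driver_encoded: int, build_cuda: str) -> bool:
    """Return True if the driver's highest CUDA version is at least build_cuda.

    build_cuda is a "major.minor" string such as the one in torch.version.cuda.
    """
    major, minor = (int(part) for part in build_cuda.split(".")[:2])
    return decode_cuda_version(driver_encoded) >= (major, minor)


if __name__ == "__main__":
    print(decode_cuda_version(11070))      # (11, 7)
    # "12.1" is an assumed example value, not taken from the original post.
    print(driver_supports(11070, "12.1"))  # False
```

In a real environment the second argument could come from torch.version.cuda, which holds the CUDA version the installed PyTorch was built with.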

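Following the hint in the error message, I looked up suitable GPU driver versions at http://www.nvidia.com/Download/index.aspx and confirmed that the driver is indeed too old [1][2]: driver version 535.146.02 or later is required, while the server only has 515.76.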
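The same comparison can be scripted. The following is a minimal sketch, assuming nvidia-smi is on the PATH; it reads the installed driver version with the --query-gpu=driver_version option and compares it numerically with the minimum version quoted above. The helper names parse_version and installed_driver_version are hypothetical.

```python
import shutil
import subprocess
from typing import Optional, Tuple

REQUIRED_DRIVER = "535.146.02"  # minimum version quoted in this post


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '535.146.02' into (535, 146, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_driver_version() -> Optional[str]:
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    # With several GPUs there is one line per GPU; the first one is enough here.
    return lines[0].strip() if result.returncode == 0 and lines else None


if __name__ == "__main__":
    current = installed_driver_version()
    if current is None:
        print("nvidia-smi not available; cannot read the driver version")
    elif parse_version(current) < parse_version(REQUIRED_DRIVER):
        print(f"driver {current} is older than required {REQUIRED_DRIVER}")
    else:
        print(f"driver {current} satisfies {REQUIRED_DRIVER}")
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "535.98" as newer than "535.146.02".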

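[Solution] I followed "AutoDL私有云 | GPU驱动" to update the driver, but its first step, uninstalling the current driver, could not be executed. The driver can instead be uninstalled by following "How can I uninstall a nvidia driver completely ?".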

After uninstalling the old driver, install the new one, starting with downloading the installer: wget https://us.download.nvidia.com/XFree86/Linux-x86_64/535.98/NVIDIA-Linux-x86_64-535.98.run

The last step failed with the following error: ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
Even after consulting many resources, I could not resolve this [3].
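The installer error refers to kernel modules that are still loaded and to their usage count. As a diagnostic aid only (it does not fix the problem), the sketch below lists the NVIDIA modules recorded in /proc/modules together with their reference counts. It assumes a Linux system where /proc/modules is readable; the function name loaded_nvidia_modules is hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional


def loaded_nvidia_modules(proc_modules_text: str) -> Dict[str, Optional[int]]:
    """Map each loaded module whose name starts with 'nvidia' to its reference count.

    Each /proc/modules line has the form: name size refcount dependencies state address.
    The count is None when the refcount field is not a number.
    """
    modules: Dict[str, Optional[int]] = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("nvidia"):
            continue
        modules[fields[0]] = int(fields[2]) if fields[2].isdigit() else None
    return modules


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        for name, refcount in loaded_nvidia_modules(path.read_text()).items():
            print(f"{name}: reference count {refcount}")
    else:
        print("/proc/modules not found; this check only works on Linux")
```

A nonzero reference count for nvidia_uvm indicates that something still holds the module, which is consistent with the situation the installer message describes.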

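Because this is a remote server, the driver cannot be installed locally, so I recommend switching to a machine with a newer driver version.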


  1. UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10010)

  2. NVIDIA driver too old error #4546

  3. How to solve ‘ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel’?
