Environment Setup
Build an AI development environment for learning.
The setup uses CUDA 11.7 and PyTorch 2.0, and covers configuring the SSH service, setting up NVIDIA drivers, and a test script to verify that PyTorch and the GPU are working correctly.
Updated: an improved version is available at:
https://github.com/Microfish31/ai-env
Installing Docker CE and the Docker Client
- docker-ce
sudo apt-get update
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-cache policy docker-ce
sudo apt install docker-ce
- docker client
When you install Docker CE, the Docker client tools are installed automatically.
- Configuration
This command sequence creates a docker group, adds the current user to the group, and refreshes the session so the user can run Docker commands without sudo immediately:
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
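To confirm the group change took effect, you can run a throwaway container without sudo (hello-world is just a convenient test image):
docker run --rm hello-world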
- Docker Desktop (or use this instead)
https://docs.docker.com/desktop/gpu/
Setting Up NVIDIA Drivers and the NVIDIA Container Toolkit
- To enable GPU usage inside Docker containers, first add the NVIDIA Container Toolkit repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && sudo apt-get update
- Install the NVIDIA Container Toolkit and restart Docker:
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
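After the restart, you can confirm that the nvidia runtime was registered with Docker (the exact output format varies by Docker version):
docker info | grep -i runtime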
Creating the Dockerfile
We will use the official PyTorch CUDA 11.7 container image and install OpenSSH for remote access.
- Create a file named Dockerfile in the current path and paste in the content below:
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

# Set the environment variable
ENV PATH="/opt/conda/bin:$PATH"

# Install openssh-server
RUN apt-get update && apt-get install -y openssh-server

# SSH configuration
RUN mkdir -p /var/run/sshd && \
    echo 'root:root' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Install JupyterLab
RUN pip install jupyterlab

# Expose port 22 for SSH
EXPOSE 22

# Expose port 8888 for JupyterLab
EXPOSE 8888

# Declare the workspace volume (VOLUME takes a container path;
# bind the host directory at run time with -v /home/ubuntu/ai:/workspace)
VOLUME ["/workspace"]

# Start SSH service and add env path
CMD ["sh", "-c", "echo 'export PATH=\"/opt/conda/bin:$PATH\"' >> ~/.bashrc && . ~/.bashrc && /usr/sbin/sshd -D"]
Building and Running the Docker Container
- Pull the PyTorch container image from Docker Hub:
docker pull pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
- Build the Docker image from the Dockerfile:
docker build -t cuda11.7-ssh-jupyter .
- Run the Docker container with GPU support and SSH enabled:
docker run --name my-ai-env --gpus all -p 3131:22 -p 8888:8888 -v /home/ubuntu/ai:/workspace -w /workspace -d cuda11.7-ssh-jupyter
- Connect to the container via SSH (replace the IP address as necessary; the password is root):
ssh root@<container-IP-address> -p 3131
P.S. You can find the IP by running ifconfig.
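Alternatively, you can ask Docker for the container's address directly (note that the 3131 mapping applies when connecting through the host's IP; the container's own IP uses port 22):
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-ai-env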
Verify your GPU in a new container
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
Verify Jupyter
jupyter server --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''
Then visit http://172.31.230.225:8888/lab (substitute your host's IP) in a browser such as Chrome.
Verify GPU with PyTorch
You can run the following Python script to verify that PyTorch is correctly installed and that the GPU is available:
import torch

def check_pytorch_and_gpu():
    # Check if PyTorch is installed
    if torch.__version__:
        print(f"PyTorch version: {torch.__version__} is installed.")
    else:
        print("PyTorch is not installed.")

    # Check if a GPU is available
    if torch.cuda.is_available():
        print(f"GPU is available. GPU name: {torch.cuda.get_device_name(0)}")
        print(f"CUDA version: {torch.version.cuda}")
    else:
        print("GPU is not available. Running on CPU.")

if __name__ == "__main__":
    check_pytorch_and_gpu()
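Beyond checking availability, a quick way to exercise the GPU end to end is a small tensor computation (a minimal sketch; the matrix sizes are arbitrary):

import torch

if torch.cuda.is_available():
    # Move two random matrices to the GPU and multiply them
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b
    torch.cuda.synchronize()  # wait for the kernel to finish
    print(f"Matmul OK on {torch.cuda.get_device_name(0)}; result mean: {c.mean().item():.4f}")
else:
    print("No GPU available; skipping the matmul test.")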
Backing Up Docker Images
- Save docker image
docker image save -o <output-file>.tar <image-name>:<tag>
- Load docker image
docker load -i /path/to/your-image-file.tar
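For example, to back up and restore the image built earlier in this guide (the .tar file name is arbitrary):
docker image save -o cuda11.7-ssh-jupyter.tar cuda11.7-ssh-jupyter:latest
docker load -i cuda11.7-ssh-jupyter.tar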
Setting the Container Path (moved to the Dockerfile)
Since SSH-ing into a container does not initialize the environment as expected, we need to manually add the Anaconda path to ensure that Python packages are accessible.
- To add the Anaconda path to the container, execute the following command:
echo 'export PATH="/opt/conda/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Configuring containerd
- path:
/etc/containerd/config.toml
version = 2

[plugins]
  [plugins."io.containerd.runtime.v1.linux"]
    shim_debug = true
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

[debug]
  level = "info"

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false

[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0

[timeouts]
  task_shutdown = "15s"

[ttrpc]
  address = ""

[proxy_plugins]
  [proxy_plugins."snapshot-overlayfs"]
    type = "snapshot"
    address = "/run/containerd/snapshotter-overlayfs.sock"

[plugins."io.containerd.snapshotter.v1.overlayfs"]
  root_path = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
- After editing the config, restart the containerd service:
sudo systemctl restart containerd
Next -> Reduce image size
Build the image from nvidia/cuda, installing the following yourself (a sketch follows the links below):
- pytorch
- python
- pip
- miniconda
https://hub.docker.com/r/nvidia/cuda
https://pytorch.org/
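A minimal sketch of such a Dockerfile, assuming the nvidia/cuda 11.7.1 runtime image and the official cu117 PyTorch wheel index (tags and versions are illustrative; Miniconda could be installed instead of the system Python):

FROM nvidia/cuda:11.7.1-runtime-ubuntu20.04

# Install Python and pip from the Ubuntu repositories
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

# Install PyTorch built against CUDA 11.7
RUN pip3 install torch==2.0.0 --index-url https://download.pytorch.org/whl/cu117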
Adding a User in Docker
- Add this to the Dockerfile:
FROM python:3.10-slim

ARG UID=1001
RUN useradd -u $UID -m appuser
USER appuser
- Pass the user ID into the Dockerfile at build time:
docker build --build-arg UID=$(id -u) -t myimage .
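To confirm the container runs as the expected UID (id is a standard coreutils command):
docker run --rm myimage id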
Wi-Fi Settings
# Create a new connection profile
sudo nmcli connection add type wifi ifname wlx00ad244780e7 con-name "LIN-static" ssid "LIN"
# Set the password and encryption method
sudo nmcli connection modify "LIN-static" wifi-sec.key-mgmt wpa-psk
sudo nmcli connection modify "LIN-static" wifi-sec.psk "your password"
# Set a static IP (adjust for your network environment)
sudo nmcli connection modify "LIN-static" ipv4.addresses 192.168.1.100/24
sudo nmcli connection modify "LIN-static" ipv4.gateway 192.168.1.1
sudo nmcli connection modify "LIN-static" ipv4.dns "8.8.8.8 1.1.1.1"
sudo nmcli connection modify "LIN-static" ipv4.method manual
# Activate the new profile
sudo nmcli connection up "LIN-static"
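You can verify the profile and device state with standard nmcli queries:
nmcli connection show --active
nmcli device status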
Display Settings
- Tells the X server to use the NVIDIA GTX 1060 for display output
- Explicitly pins the GPU, avoiding confusion with the GTX 750 Ti
- Works well with the current hardware setup
To set up dual monitors, multi-GPU use, or a specific HDMI/DP output later, you can customize this file further.
sudo tee /etc/X11/xorg.conf > /dev/null <<EOF
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0" 0 0
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
EndSection
Section "Files"
EndSection
Section "InputDevice"
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/input/mice"
Option "ZAxisMapping" "4 5 6 7"
EndSection
Section "InputDevice"
Identifier "Keyboard0"
Driver "kbd"
EndSection
Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:1:0:0"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
SubSection "Display"
Depth 24
EndSubSection
EndSection
EOF
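After writing the file, restart the X session (for example by rebooting) and check that the NVIDIA driver claimed the screen; both checks below use standard tools:
grep -i nvidia /var/log/Xorg.0.log
xrandr --listmonitors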
VPN
Tailscale free tier summary:
- Free to use, indefinitely
- Up to 3 users
- Up to 100 devices
- Access to nearly all Tailscale features
- Additional devices: $0.50 per device per month
Official website: https://tailscale.com/
Device A (Client) ---\
                      +--> Tailscale coordination server (exchanges info, helps traverse NAT)
Device B (Client) ---/
=> A and B then attempt to communicate directly (using the WireGuard protocol)
https://login.tailscale.com/admin/machines/
https://tailscale.com/download
https://tailscale.com/blog/how-tailscale-works
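To get started, the official install script and two commands are enough (all from Tailscale's documentation):
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
tailscale ip -4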
Power Control
- Shut down manually
- Use the TP-Link app to turn off the smart plug (cutting power completely)
- Wait about 30 seconds, then turn the power back on
- The server detects that power has been restored and boots automatically
ASUS B85M PLUS
Advanced > APM Configuration > Restore AC Power Loss >
[Power Off]: when power is lost and later restored, the computer stays off and does not boot automatically
[Power On]: when power is lost and later restored, the computer boots automatically without pressing the case power button
[Last State]: when power is lost and later restored, the computer returns to its state before the outage, for example:
a. If the system was on, sleeping, or hibernating before the outage, it returns to the corresponding state once power is restored
b. If the system was off before the outage, it remains off once power is restored
Mail Notification
When your server finishes booting, it sends an email to you.
msmtp is a lightweight SMTP mail-sending tool.
msmtp is an SMTP client; its job is to:
- Connect to a remote SMTP server (e.g., Gmail, Yahoo, a company mail server)
- Log in with your account and password (App Passwords)
- Hand the mail over for delivery
mail composes the email for you, then passes it to msmtp to send.
Flow
A script you wrote, or a manual command
↓
mail (user interface)
↓
msmtp (SMTP client)
↓
smtp.mail.yahoo.com (Yahoo's SMTP server)
↓
Your email inbox 📨
Steps
- Install the SMTP client
sudo apt update
sudo apt install msmtp
- Set up ~/.msmtprc
Get an App Password from https://login.yahoo.com/account/security (generate one under App Passwords), then create ~/.msmtprc with the content below:
defaults
auth on
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
logfile ~/.msmtp.log

account yahoo
host smtp.mail.yahoo.com
port 587
from <your yahoo mail>
user <your yahoo mail>
password <your psw>

account default : yahoo
- Restrict the file permissions
chmod 600 ~/.msmtprc
- Test
echo -e "Subject: 測試信 from msmtp\n\n這是一封測試信" | msmtp <your yahoo mail>
- Create the notification script
#!/bin/bash
# boot_notify_email.sh

# Get time
NOW=$(date +"%Y-%m-%d %H:%M:%S")
SUBJECT_TEXT="🟢 Server Boot Notification $NOW"
TO="<your yahoo mail>"

# MIME-encode the subject (Base64 + UTF-8)
SUBJECT="=?UTF-8?B?$(echo -n "$SUBJECT_TEXT" | base64)?="

# System info
HOSTNAME=$(hostname)
LOCAL_IP=$(hostname -I)
PUBLIC_IP=$(curl -s ifconfig.me)
UPTIME=$(uptime -p)
DATE=$(date)
DISK=$(df -h --output=source,size,used,avail,pcent,target | tail -n +2 | awk 'BEGIN {print "<table border=1 cellpadding=5 cellspacing=0><tr><th>Filesystem</th><th>Size</th><th>Used</th><th>Avail</th><th>Use%</th><th>Mounted on</th></tr>"} {printf "<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>", $1, $2, $3, $4, $5, $6} END {print "</table>"}')
MEM=$(free -h | awk 'NR==1 {print "<table border=1 cellpadding=5 cellspacing=0><tr>"; for(i=1;i<=NF;i++) printf "<th>%s</th>", $i; print "</tr>"} NR==2 || NR==3 {printf "<tr>"; for(i=1;i<=NF;i++) printf "<td>%s</td>", $i; print "</tr>"} END {print "</table>"}')

# Combine into an HTML mail (the blank line separates headers from the body)
BODY=$(cat <<EOF
Content-Type: text/html; charset=UTF-8
Subject: $SUBJECT
To: $TO
From: $TO

<html>
<body style="font-family: sans-serif;">
<h2>✅ The server has booted up!</h2>
<p><strong>🖥️ Hostname:</strong> $HOSTNAME</p>
<p><strong>🌐 Local IP:</strong> $LOCAL_IP</p>
<p><strong>🌍 Public IP:</strong> $PUBLIC_IP</p>
<p><strong>📈 Uptime:</strong> $UPTIME</p>
<p><strong>🕒 Time:</strong> $DATE</p>
<h3>💾 Disk Usage:</h3>
$DISK
<h3>🧠 Memory:</h3>
$MEM
</body>
</html>
EOF
)

# Retry sending a few times in case the network is not up yet
MAX_RETRIES=3
RETRY_DELAY=5
COUNT=0
SUCCESS=0

while [ $COUNT -lt $MAX_RETRIES ]; do
    echo "$BODY" | msmtp --read-envelope-from -t
    if [ $? -eq 0 ]; then
        echo "Mail sent successfully."
        SUCCESS=1
        break
    else
        echo "Send failed. Retrying... ($((COUNT+1))/$MAX_RETRIES)"
        sleep $RETRY_DELAY
        ((COUNT++))
    fi
done

if [ $SUCCESS -ne 1 ]; then
    echo "Failed to send mail after $MAX_RETRIES attempts."
    exit 1
fi
- Set up the start-up script
Make the script executable, then open your crontab:
chmod +x ~/boot_notify_email.sh
crontab -e
Paste in the line below and save:
@reboot /home/your_username/boot_notify_email.sh
References
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/release-notes.html#
https://github.com/NVIDIA/nvidia-container-toolkit/issues/154
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.
https://blog.csdn.net/haima95/article/details/139169784
https://docs.docker.com/engine/install/ubuntu/#uninstall-docker-engine