Environment Setup
Build an AI development environment for learning.
The setup uses CUDA 11.7 and PyTorch 2.0, and covers configuring the SSH service, setting up NVIDIA drivers, and a test script to verify that PyTorch and the GPU are working correctly.
Updated: an improved version is available at:
https://github.com/Microfish31/ai-env
Installing Docker CE and the Docker Client
- docker-ce
sudo apt-get update
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-cache policy docker-ce
sudo apt install docker-ce
- docker client
When you install Docker CE, the Docker client tools are installed automatically.
- Configuration
This command sequence creates a docker group, adds the current user to the group, and refreshes the session so the user can run Docker commands without sudo immediately:
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
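To confirm the group change took effect, you can run a throwaway container without sudo (hello-world is just a convenient test image):
docker run --rm hello-world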
- Docker Desktop (or use this instead)
https://docs.docker.com/desktop/gpu/
Setting Up NVIDIA Drivers and the NVIDIA Container Toolkit
- To enable GPU usage inside Docker containers, first add the NVIDIA Container Toolkit repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && sudo apt-get update
- Install the NVIDIA Container Toolkit and restart Docker:
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
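After the restart, you can confirm that the nvidia runtime was registered with Docker (the exact output format varies by Docker version):
docker info | grep -i runtime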
Creating the Dockerfile
We will use the official PyTorch CUDA 11.7 container image and install OpenSSH for remote access.
- Create a file named Dockerfile in the current path and paste in the content below:
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

# Set the environment variable
ENV PATH="/opt/conda/bin:$PATH"

# Install openssh-server
RUN apt-get update && apt-get install -y openssh-server

# SSH configuration
RUN mkdir -p /var/run/sshd && \
    echo 'root:root' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Install JupyterLab
RUN pip install jupyterlab

# Expose port 22 for SSH
EXPOSE 22

# Expose port 8888 for JupyterLab
EXPOSE 8888

# Declare the workspace volume (VOLUME takes a container path;
# bind the host directory at run time with -v /home/ubuntu/ai:/workspace)
VOLUME ["/workspace"]

# Start SSH service and add env path
CMD ["sh", "-c", "echo 'export PATH=\"/opt/conda/bin:$PATH\"' >> ~/.bashrc && . ~/.bashrc && /usr/sbin/sshd -D"]
Building and Running the Docker Container
- Pull the PyTorch container image from Docker Hub:
docker pull pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
- Build the Docker image from the Dockerfile:
docker build -t cuda11.7-ssh-jupyter .
- Run the Docker container with GPU support and SSH enabled:
docker run --name my-ai-env --gpus all -p 3131:22 -p 8888:8888 -v /home/ubuntu/ai:/workspace -w /workspace -d cuda11.7-ssh-jupyter
- Connect to the container via SSH (replace the IP address as necessary; the password is root):
ssh root@<container-IP-address> -p 3131
P.S. You can find the IP by running ifconfig.
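Alternatively, you can ask Docker for the container's address directly (note that the 3131 mapping applies when connecting through the host's IP; the container's own IP uses port 22):
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-ai-env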
Verify your GPU in a new container
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
Verify Jupyter
jupyter server --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''
Then visit http://172.31.230.225:8888/lab (substitute your host's IP) in a browser such as Chrome.
Verify GPU with PyTorch
You can run the following Python script to verify that PyTorch is correctly installed and that the GPU is available:
import torch

def check_pytorch_and_gpu():
    # Check if PyTorch is installed
    if torch.__version__:
        print(f"PyTorch version: {torch.__version__} is installed.")
    else:
        print("PyTorch is not installed.")

    # Check if a GPU is available
    if torch.cuda.is_available():
        print(f"GPU is available. GPU name: {torch.cuda.get_device_name(0)}")
        print(f"CUDA version: {torch.version.cuda}")
    else:
        print("GPU is not available. Running on CPU.")

if __name__ == "__main__":
    check_pytorch_and_gpu()
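Beyond checking availability, a quick way to exercise the GPU end to end is a small tensor computation (a minimal sketch; the matrix sizes are arbitrary):

import torch

if torch.cuda.is_available():
    # Move two random matrices to the GPU and multiply them
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b
    torch.cuda.synchronize()  # wait for the kernel to finish
    print(f"Matmul OK on {torch.cuda.get_device_name(0)}; result mean: {c.mean().item():.4f}")
else:
    print("No GPU available; skipping the matmul test.")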
Backing Up Docker Images
- Save docker image
docker image save -o <output-file>.tar <image-name>:<tag>
- Load docker image
docker load -i /path/to/your-image-file.tar
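For example, to back up and restore the image built earlier in this guide (the .tar file name is arbitrary):
docker image save -o cuda11.7-ssh-jupyter.tar cuda11.7-ssh-jupyter:latest
docker load -i cuda11.7-ssh-jupyter.tar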
Setting the Container Path (moved to the Dockerfile)
Since SSH-ing into a container does not initialize the environment as expected, we need to manually add the Anaconda path to ensure that Python packages are accessible.
- To add the Anaconda path to the container, execute the following command:
echo 'export PATH="/opt/conda/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Configuring containerd
- path:
/etc/containerd/config.toml
version = 2

[plugins]
  [plugins."io.containerd.runtime.v1.linux"]
    shim_debug = true
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

[debug]
  level = "info"

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false

[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0

[timeouts]
  task_shutdown = "15s"

[ttrpc]
  address = ""

[proxy_plugins]
  [proxy_plugins."snapshot-overlayfs"]
    type = "snapshot"
    address = "/run/containerd/snapshotter-overlayfs.sock"

[plugins."io.containerd.snapshotter.v1.overlayfs"]
  root_path = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
- After editing the config, restart the containerd service:
sudo systemctl restart containerd
Next -> Reduce image size
Build the image from nvidia/cuda, installing the following yourself (a sketch follows the links below):
- pytorch
- python
- pip
- miniconda
https://hub.docker.com/r/nvidia/cuda
https://pytorch.org/
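A minimal sketch of such a Dockerfile, assuming the nvidia/cuda 11.7.1 runtime image and the official cu117 PyTorch wheel index (tags and versions are illustrative; Miniconda could be installed instead of the system Python):

FROM nvidia/cuda:11.7.1-runtime-ubuntu20.04

# Install Python and pip from the Ubuntu repositories
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

# Install PyTorch built against CUDA 11.7
RUN pip3 install torch==2.0.0 --index-url https://download.pytorch.org/whl/cu117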
Adding a User in Docker
- Add this to the Dockerfile:
FROM python:3.10-slim

ARG UID=1001
RUN useradd -u $UID -m appuser
USER appuser
- Pass the user ID into the Dockerfile at build time:
docker build --build-arg UID=$(id -u) -t myimage .
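To confirm the container runs as the expected UID (id is a standard coreutils command):
docker run --rm myimage id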
Wi-Fi Settings
# Create a new connection profile
sudo nmcli connection add type wifi ifname wlx00ad244780e7 con-name "LIN-static" ssid "LIN"
# Set the password and encryption method
sudo nmcli connection modify "LIN-static" wifi-sec.key-mgmt wpa-psk
sudo nmcli connection modify "LIN-static" wifi-sec.psk "your password"
# Set a static IP (adjust for your network environment)
sudo nmcli connection modify "LIN-static" ipv4.addresses 192.168.1.100/24
sudo nmcli connection modify "LIN-static" ipv4.gateway 192.168.1.1
sudo nmcli connection modify "LIN-static" ipv4.dns "8.8.8.8 1.1.1.1"
sudo nmcli connection modify "LIN-static" ipv4.method manual
# Activate the new profile
sudo nmcli connection up "LIN-static"
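You can verify the profile and device state with standard nmcli queries:
nmcli connection show --active
nmcli device status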
Display Settings
- Tells the X server to use the NVIDIA GTX 1060 for display output
- Explicitly pins the GPU, avoiding confusion with the GTX 750 Ti
- Works well with the current hardware setup
To set up dual monitors, multi-GPU use, or a specific HDMI/DP output later, you can customize this file further.
sudo tee /etc/X11/xorg.conf > /dev/null <<EOF
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0" 0 0
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
EndSection
Section "Files"
EndSection
Section "InputDevice"
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/input/mice"
Option "ZAxisMapping" "4 5 6 7"
EndSection
Section "InputDevice"
Identifier "Keyboard0"
Driver "kbd"
EndSection
Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:1:0:0"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
SubSection "Display"
Depth 24
EndSubSection
EndSection
EOF
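After writing the file, restart the X session (for example by rebooting) and check that the NVIDIA driver claimed the screen; both checks below use standard tools:
grep -i nvidia /var/log/Xorg.0.log
xrandr --listmonitors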
VPN
Tailscale free tier summary:
- Free to use, indefinitely
- Up to 3 users
- Up to 100 devices
- Access to nearly all Tailscale features
- Additional devices: $0.50 per device per month
Official website: https://tailscale.com/
Device A (Client) ---\
                      +--> Tailscale coordination server (exchanges info, helps traverse NAT)
Device B (Client) ---/
=> A and B then attempt to communicate directly (using the WireGuard protocol)
https://login.tailscale.com/admin/machines/
https://tailscale.com/download
https://tailscale.com/blog/how-tailscale-works
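To get started, the official install script and two commands are enough (all from Tailscale's documentation):
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
tailscale ip -4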
Power Control
- Shut down manually
- Use the TP-Link app to turn off the smart plug (cutting power completely)
- Wait about 30 seconds, then turn the power back on
- The server detects that power has been restored and boots automatically
ASUS B85M PLUS
Advanced > APM Configuration > Restore AC Power Loss >
[Power Off]: when power is lost and later restored, the computer stays off and does not boot automatically
[Power On]: when power is lost and later restored, the computer boots automatically without pressing the case power button
[Last State]: when power is lost and later restored, the computer returns to its state before the outage, for example:
a. If the system was on, sleeping, or hibernating before the outage, it returns to the corresponding state once power is restored
b. If the system was off before the outage, it remains off once power is restored
Mail Notification
When your server finishes booting, it sends an email to you.
msmtp is a lightweight SMTP mail-sending tool.
msmtp is an SMTP client; its job is to:
- Connect to a remote SMTP server (e.g., Gmail, Yahoo, a company mail server)
- Log in with your account and password (App Passwords)
- Hand the mail over for delivery
mail composes the email for you, then passes it to msmtp to send.
Flow
A script you wrote, or a manual command
↓
mail (user interface)
↓
msmtp (SMTP client)
↓
smtp.mail.yahoo.com (Yahoo's SMTP server)
↓
Your email inbox 📨
Steps
- Install the SMTP client
sudo apt update
sudo apt install msmtp
- Set up ~/.msmtprc
Get an App Password from https://login.yahoo.com/account/security (generate one under App Passwords), then create ~/.msmtprc with the content below:
defaults
auth on
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
logfile ~/.msmtp.log

account yahoo
host smtp.mail.yahoo.com
port 587
from <your yahoo mail>
user <your yahoo mail>
password <your psw>

account default : yahoo
- Restrict the file permissions
chmod 600 ~/.msmtprc
- Test
echo -e "Subject: 測試信 from msmtp\n\n這是一封測試信" | msmtp <your yahoo mail>
- Create the notification script
#!/bin/bash
# boot_notify_email.sh

# Get time
NOW=$(date +"%Y-%m-%d %H:%M:%S")
SUBJECT_TEXT="🟢 Server Boot Notification $NOW"
TO="<your yahoo mail>"

# MIME-encode the subject (Base64 + UTF-8)
SUBJECT="=?UTF-8?B?$(echo -n "$SUBJECT_TEXT" | base64)?="

# System info
HOSTNAME=$(hostname)
LOCAL_IP=$(hostname -I)
PUBLIC_IP=$(curl -s ifconfig.me)
UPTIME=$(uptime -p)
DATE=$(date)
DISK=$(df -h --output=source,size,used,avail,pcent,target | tail -n +2 | awk 'BEGIN {print "<table border=1 cellpadding=5 cellspacing=0><tr><th>Filesystem</th><th>Size</th><th>Used</th><th>Avail</th><th>Use%</th><th>Mounted on</th></tr>"} {printf "<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>", $1, $2, $3, $4, $5, $6} END {print "</table>"}')
MEM=$(free -h | awk 'NR==1 {print "<table border=1 cellpadding=5 cellspacing=0><tr>"; for(i=1;i<=NF;i++) printf "<th>%s</th>", $i; print "</tr>"} NR==2 || NR==3 {printf "<tr>"; for(i=1;i<=NF;i++) printf "<td>%s</td>", $i; print "</tr>"} END {print "</table>"}')

# Combine into an HTML mail (the blank line separates headers from the body)
BODY=$(cat <<EOF
Content-Type: text/html; charset=UTF-8
Subject: $SUBJECT
To: $TO
From: $TO

<html>
<body style="font-family: sans-serif;">
<h2>✅ The server has booted up!</h2>
<p><strong>🖥️ Hostname:</strong> $HOSTNAME</p>
<p><strong>🌐 Local IP:</strong> $LOCAL_IP</p>
<p><strong>🌍 Public IP:</strong> $PUBLIC_IP</p>
<p><strong>📈 Uptime:</strong> $UPTIME</p>
<p><strong>🕒 Time:</strong> $DATE</p>
<h3>💾 Disk Usage:</h3>
$DISK
<h3>🧠 Memory:</h3>
$MEM
</body>
</html>
EOF
)

# Retry sending a few times in case the network is not up yet
MAX_RETRIES=3
RETRY_DELAY=5
COUNT=0
SUCCESS=0

while [ $COUNT -lt $MAX_RETRIES ]; do
    echo "$BODY" | msmtp --read-envelope-from -t
    if [ $? -eq 0 ]; then
        echo "Mail sent successfully."
        SUCCESS=1
        break
    else
        echo "Send failed. Retrying... ($((COUNT+1))/$MAX_RETRIES)"
        sleep $RETRY_DELAY
        ((COUNT++))
    fi
done

if [ $SUCCESS -ne 1 ]; then
    echo "Failed to send mail after $MAX_RETRIES attempts."
    exit 1
fi
- Set up the start-up script
Make the script executable, then open your crontab:
chmod +x ~/boot_notify_email.sh
crontab -e
Paste in the line below and save:
@reboot /home/your_username/boot_notify_email.sh
References
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/release-notes.html#
https://github.com/NVIDIA/nvidia-container-toolkit/issues/154
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.
https://blog.csdn.net/haima95/article/details/139169784
https://docs.docker.com/engine/install/ubuntu/#uninstall-docker-engine