Eisen's Blog


Generating hosts entries from ansible facts

2022 December-06

While going through some old code recently, I noticed that our ansible repository contains templates that generate /etc/hosts from ansible vars. I did a bit of digging to understand how this works. Most of the walkthrough went into the video, so what follows are really just my notes.

What facts are and how they are collected

facts are the various pieces of information ansible collects from each host. You can see exactly what gets collected with the following command:

ansible all -i inventory.yaml -m gather_facts

Running ansible-doc gather_facts shows that this step is executed automatically at the beginning of a playbook run.
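If you only care about one particular fact, the setup module (which is what fact gathering runs under the hood) accepts a filter argument; for example, to look only at the default IPv4 information that the template below relies on:

ansible all -i inventory.yaml -m setup -a 'filter=ansible_default_ipv4'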

Using facts to generate a hosts file

The facts ansible collects are exposed to us through vars. For example, the hostvars variable carries the facts for every host, and it can be dumped with the debug module:

- hosts: "all"
  tasks:
  - name: Debug hostvars
    ansible.builtin.debug:
      var: hostvars
- hosts: "all"
  tasks:
  - name: Debug groups
    ansible.builtin.debug:
      var: groups

Run ansible-playbook -i inventory.yaml generate_hosts.yaml to see what they contain.

Here is a simple template file, hosts.j2:

{% for host in groups["all"] | sort -%}
  {% if hostvars[host]['ansible_default_ipv4'] is defined -%}
    {{ hostvars[host]['ansible_default_ipv4']['address'] }} {{ hostvars[host].hostname }}
  {%- endif %}  
{% endfor %}

With the following playbook we can generate the hosts file (for testing I wrote it to /home/ubuntu/test instead of actually overwriting /etc/hosts).

- hosts: "all"
  tasks:
  - name: Test hosts
    ansible.builtin.blockinfile:
      path: /home/ubuntu/test
      marker: "# -----{mark} NODES IN CLUSTER-----"
      create: true
      block: "{{ lookup('template', 'templates/hosts.j2') }}"
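To manage the real file instead, the same task can point at /etc/hosts with privilege escalation; a sketch of that variant (become added by me, everything else as above):

- hosts: "all"
  become: true   # writing /etc/hosts requires root
  tasks:
  - name: Update /etc/hosts from facts
    ansible.builtin.blockinfile:
      path: /etc/hosts
      marker: "# -----{mark} NODES IN CLUSTER-----"
      block: "{{ lookup('template', 'templates/hosts.j2') }}"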

The inventory.yaml used above

virtualmachines:
  vars:
    ansible_user: ubuntu
  hosts:
    vm01:
      ansible_host: 106.75.236.174
      hostname: vm01
    vm02:
      ansible_host: 113.31.107.13
      hostname: vm02

Note that each VM is given a hostname variable here, which is then also available through hostvars.
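With that inventory, the block written to /home/ubuntu/test ends up looking roughly like the following; the addresses come from each host's ansible_default_ipv4 fact, so the values here are only illustrative:

# -----BEGIN NODES IN CLUSTER-----
192.168.1.11 vm01
192.168.1.12 vm02
# -----END NODES IN CLUSTER-----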



Installing and getting started with unbound

2022 December-06

I learned that unbound can act as a local recursive dns server while also handling local name resolution, so I plan to use it for DNS inside the LAN. Using unbound brings two benefits:

  1. Running your own recursive dns server avoids sending every query to an upstream caching resolver, which goes a long way toward protecting privacy and is generally safer; it works even better combined with something like pi-hole: https://docs.pi-hole.net/guides/dns/unbound/
  2. Besides acting as a recursive dns server, unbound can also take over internal name resolution and be used as an authoritative server (it is not purpose-built for this, but it is perfectly fine for a small network).

Installing unbound

Installation on ubuntu 20.04:

sudo apt update && sudo apt install unbound -y

Basic configuration

Start with a simple configuration:

server:
    # can be uncommented if you do not need user privilege protection
    # username: ""

    # can be uncommented if you do not need file access protection
    # chroot: ""

    # location of the trust anchor file that enables DNSSEC. note that
    # the location of this file can be elsewhere
    auto-trust-anchor-file: "/usr/local/etc/unbound/root.key"
    # auto-trust-anchor-file: "/var/lib/unbound/root.key"

    # send minimal amount of information to upstream servers to enhance privacy
    qname-minimisation: yes

    # specify the interface to answer queries from by ip-address.
    interface: 0.0.0.0
    # interface: ::0

    # addresses from the IP range that are allowed to connect to the resolver
    access-control: 192.168.0.0/16 allow
    # access-control: 2001:DB8/64 allow

Save it as /etc/unbound/unbound.conf.d/myunbound.conf and restart the service with systemctl restart unbound.
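Before restarting, it is worth validating the configuration with unbound-checkconf, which on the Ubuntu package also picks up the files under unbound.conf.d:

sudo unbound-checkconf
sudo systemctl restart unbound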

Resolving the port 53 conflict

Chances are that restarting unbound will fail, with an error roughly saying that port 53 is already in use. Checking port usage with netstat -tulpn shows that systemd-resolved is holding port 53, and a quick search leads to https://unix.stackexchange.com/questions/304050/how-to-avoid-conflicts-between-dnsmasq-and-systemd-resolved. Following that answer, edit /etc/systemd/resolved.conf to set DNSStubListener=no and restart the systemd-resolved service.
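Roughly, the steps look like this; the sed one-liner is just one way to flip the setting, editing the file by hand works equally well:

sudo netstat -tulpn | grep ':53 '      # shows systemd-resolved listening on port 53
sudo sed -i 's/^#\?DNSStubListener=.*/DNSStubListener=no/' /etc/systemd/resolved.conf
sudo systemctl restart systemd-resolved
sudo systemctl restart unbound         # should now come up without the port conflict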

Testing that unbound works

$ dig openbayes.com @127.0.0.1

; <<>> DiG 9.16.1-Ubuntu <<>> openbayes.com @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52191
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;openbayes.com.			IN	A

;; ANSWER SECTION:
openbayes.com.		600	IN	A	106.75.109.110

;; Query time: 1524 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sun Dec 04 14:59:32 CST 2022
;; MSG SIZE  rcvd: 58

The first query is slow, but the second one is fast because the answer is already cached:

$ dig openbayes.com @127.0.0.1

; <<>> DiG 9.16.1-Ubuntu <<>> openbayes.com @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26243
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;openbayes.com.			IN	A

;; ANSWER SECTION:
openbayes.com.		535	IN	A	106.75.109.110

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sun Dec 04 15:00:37 CST 2022
;; MSG SIZE  rcvd: 58

Making unbound the machine's default resolver

The dig commands above have to explicitly name @127.0.0.1 as the resolver, but of course we want unbound to be used by default. For this I followed the unbound docs: https://unbound.docs.nlnetlabs.nl/en/latest/use-cases/home-resolver.html#setting-up-for-a-single-machine.

First, continue editing /etc/systemd/resolved.conf:

[Resolve]
DNS=127.0.0.1
#FallbackDNS=
#Domains=
DNSSEC=yes
#DNSOverTLS=no
#MulticastDNS=no
#LLMNR=no
#Cache=no-negative
DNSStubListener=no
#DNSStubListenerExtra=

Then force /etc/resolv.conf to point at the file generated by systemd-resolved:

ln -fs /run/systemd/resolve/resolv.conf /etc/resolv.conf

Finally, restart the systemd-resolved service:

systemctl restart systemd-resolved

After that, dig uses 127.0.0.1#53 by default.
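A quick way to confirm is to look at the SERVER line in any dig output; with the setup above it should report 127.0.0.1#53:

dig openbayes.com | grep 'SERVER:'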

With that, the basic unbound setup is done.

Adding a local-zone

Finally, use unbound's local-zone configuration to resolve internal domain names:

server:
    # can be uncommented if you do not need user privilege protection
    # username: ""

    # can be uncommented if you do not need file access protection
    # chroot: ""

    # location of the trust anchor file that enables DNSSEC. note that
    # the location of this file can be elsewhere
    # auto-trust-anchor-file: "/usr/local/etc/unbound/root.key"
    # auto-trust-anchor-file: "/var/lib/unbound/root.key"

    # send minimal amount of information to upstream servers to enhance privacy
    qname-minimisation: yes

    # specify the interface to answer queries from by ip-address.
    interface: 0.0.0.0
    # interface: ::0

    # addresses from the IP range that are allowed to connect to the resolver
    access-control: 192.168.0.0/16 allow
    access-control: 10.23.0.0/16 allow
    # access-control: 2001:DB8/64 allow

    local-zone: "home.lan." static
    local-data: "abc.home.lan. A 127.0.0.1"
    local-data: "bbc.home.lan. A 127.0.0.1"

Now dig abc.home.lan shows the name pointing at 127.0.0.1.
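A quick check against the local-data defined above (+short prints only the answer section):

dig abc.home.lan +short
# expected output: 127.0.0.1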


A system tuning story: half-precision performance differences in a k8s gpu cluster

2022 November-17

I recently noticed that when running the pytorch gpu benchmark, an rtx 3090 on an AMD epyc machine was clearly slower than a 3090 on an intel machine, and the gap was large enough to be baffling. It took quite a while to finally track it down to how the cpu is scheduled inside the container. The process meandered a lot, so I will not go into every detail; here is the rough path and the final tuning measures.

The initial tests

Test setup

  • python gpu benchmark: https://github.com/ryujaehun/pytorch-gpu-benchmark
  • model: models.resnet.__all__[1:], i.e. only the resnet models are tested
  • batch size: 64
  • cpu limit: 12
  • memory limit: 30G
  • gpu limit: 1
  • shm: 30G under docker; in k8s it cannot be set directly, so it effectively gets the machine's full memory size

The nvidia docker command

docker run --rm -it --shm-size=30g --cpus=12 --memory=30G \
  --gpus '"device=1"' uhub.service.ucloud.cn/openbayesruntimes/pytorch:1.9.0-py36-cu111.70

Two very different results

Intel platform

start
benchmark start : 2022/11/17 06:15:23
Number of GPUs on current device : 1
CUDA Version : 11.1
Cudnn Version : 8005
Device Name : NVIDIA GeForce RTX 3090
uname_result(system='Linux', node='d1d271bdf102', release='5.4.0-131-generic', version='#147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022', machine='x86_64', processor='x86_64')
                     scpufreq(current=1184.1902125, min=800.0, max=3400.0)
                    cpu_count: 80
                    memory_available: 258789797888
Benchmarking Training half precision type resnet18
/usr/local/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
resnet18 model average train time : 41.39636039733887ms
Benchmarking Training half precision type resnet34
resnet34 model average train time : 55.32251834869385ms
Benchmarking Training half precision type resnet50
resnet50 model average train time : 92.97645568847656ms
Benchmarking Training half precision type resnet101
resnet101 model average train time : 147.91772842407227ms
Benchmarking Training half precision type resnet152
resnet152 model average train time : 209.90628242492676ms
Benchmarking Training half precision type resnext50_32x4d
resnext50_32x4d model average train time : 132.71542072296143ms
Benchmarking Training half precision type resnext101_32x8d
resnext101_32x8d model average train time : 336.4134645462036ms
Benchmarking Training half precision type wide_resnet50_2
wide_resnet50_2 model average train time : 156.14235401153564ms
Benchmarking Training half precision type wide_resnet101_2
wide_resnet101_2 model average train time : 259.703106880188ms
Benchmarking Inference half precision type resnet18
resnet18 model average inference time : 31.02853298187256ms
Benchmarking Inference half precision type resnet34
resnet34 model average inference time : 39.35199737548828ms
Benchmarking Inference half precision type resnet50
resnet50 model average inference time : 41.26767635345459ms
Benchmarking Inference half precision type resnet101
resnet101 model average inference time : 48.41951370239258ms
Benchmarking Inference half precision type resnet152
resnet152 model average inference time : 67.41719722747803ms
Benchmarking Inference half precision type resnext50_32x4d
resnext50_32x4d model average inference time : 44.739885330200195ms
Benchmarking Inference half precision type resnext101_32x8d
resnext101_32x8d model average inference time : 103.05868148803711ms
Benchmarking Inference half precision type wide_resnet50_2
wide_resnet50_2 model average inference time : 49.078497886657715ms
Benchmarking Inference half precision type wide_resnet101_2
wide_resnet101_2 model average inference time : 83.67201805114746ms
benchmark end : 2022/11/17 06:21:41
end

AMD platform

start
benchmark start : 2022/11/17 06:14:11
Number of GPUs on current device : 1
CUDA Version : 11.1
Cudnn Version : 8005
Device Name : NVIDIA GeForce RTX 3090
uname_result(system='Linux', node='925b73b78805', release='5.15.0-43-generic', version='#46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022', machine='x86_64', processor='x86_64')
                     scpufreq(current=2784.047382812501, min=1500.0, max=2200.0)
                    cpu_count: 256
                    memory_available: 485026041856
Benchmarking Training half precision type resnet18
/usr/local/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
resnet18 model average train time : 75.70790767669678ms
Benchmarking Training half precision type resnet34
resnet34 model average train time : 82.6269006729126ms
Benchmarking Training half precision type resnet50
resnet50 model average train time : 111.1276912689209ms
Benchmarking Training half precision type resnet101
resnet101 model average train time : 161.16506576538086ms
Benchmarking Training half precision type resnet152
resnet152 model average train time : 228.9912509918213ms
Benchmarking Training half precision type resnext50_32x4d
resnext50_32x4d model average train time : 143.40569496154785ms
Benchmarking Training half precision type resnext101_32x8d
resnext101_32x8d model average train time : 354.08830165863037ms
Benchmarking Training half precision type wide_resnet50_2
wide_resnet50_2 model average train time : 164.76832389831543ms
Benchmarking Training half precision type wide_resnet101_2
wide_resnet101_2 model average train time : 271.076135635376ms
Benchmarking Inference half precision type resnet18
resnet18 model average inference time : 63.87866973876953ms
Benchmarking Inference half precision type resnet34
resnet34 model average inference time : 68.00977230072021ms
Benchmarking Inference half precision type resnet50
resnet50 model average inference time : 73.05157661437988ms
Benchmarking Inference half precision type resnet101
resnet101 model average inference time : 81.68745994567871ms
Benchmarking Inference half precision type resnet152
resnet152 model average inference time : 87.46984004974365ms
Benchmarking Inference half precision type resnext50_32x4d
resnext50_32x4d model average inference time : 83.56608867645264ms
Benchmarking Inference half precision type resnext101_32x8d
resnext101_32x8d model average inference time : 108.2996940612793ms
Benchmarking Inference half precision type wide_resnet50_2
wide_resnet50_2 model average inference time : 78.30146789550781ms
Benchmarking Inference half precision type wide_resnet101_2
wide_resnet101_2 model average inference time : 90.06356239318848ms
benchmark end : 2022/11/17 06:23:29
end

Notice that the smaller the model, the larger the gap, which made me strongly suspect the cpu.

Checking the clock speeds

The AMD cpu here is a 7773x: 64 cores, 128 threads, which looks like a monster, but server cpus with this many cores tend to run at lower clocks; this one has a 2.2GHz base clock and a 3.5GHz boost clock. Compared with the Intel platform (3.8GHz there) that is a little slower, but not nearly enough to explain such a large gap. Checking the live frequencies via cat /proc/cpuinfo also confirmed that many cores were running around 3.4GHz, so no strange BIOS setting was holding performance back. Clock speed was not the culprit.
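For reference, one way to keep an eye on the live clocks is just grepping /proc/cpuinfo; the sort and head here are only a convenience I added:

watch -n 1 "grep 'cpu MHz' /proc/cpuinfo | sort -rn -k 4 | head -20"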

Digging into AMD's cpu tuning guides

Even with the clocks ruled out, something about the cpu still felt wrong, so I started digging through the BIOS options, getting more confused the further I went since I could not tell what most of them meant. Falling back to a search engine for AMD EPYC configuration manuals, I stumbled on the Kubernetes Container Tuning Guide for AMD EPYC 7003 Series Processors, which was exactly what I needed. Section 3.2, Container Pinning Settings, points out that the default scheduling policy spreads containers across all cpus:

(figure: container pinning settings)

For a dual-socket box with 256 threads in total, having the workload constantly rescheduled across all of those cpus ought to be a sizeable overhead. The static setting mentioned there instead keeps a container on a specific set of cpus. Following that lead, I read a bit further and learned that the static policy is built on the cgroup cpuset subsystem. So why not first try the same idea with plain docker?

Testing with --cpuset-cpus

docker run --rm -it --shm-size=30g --cpuset-cpus 4-15 --cpus=12 --memory=30G \
  --gpus '"device=1"' uhub.service.ucloud.cn/openbayesruntimes/pytorch:1.9.0-py36-cu111.70

The only change is the added --cpuset-cpus 4-15, which tells docker the range of cpus this container may use. Performance improved dramatically and actually overtook the Intel platform, which pretty much confirms that the default scheduling behaviour was the problem.
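On the k8s side, the equivalent knob is the kubelet's CPU manager policy. A minimal sketch of the change with a file-based kubelet configuration, assuming the usual /var/lib/kubelet/config.yaml layout (per the k8s docs, switching policies also means draining the node and deleting the old /var/lib/kubelet/cpu_manager_state before restarting the kubelet):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# the static policy requires explicitly reserving some cpus for system daemons
reservedSystemCPUs: "0-1"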

Resolving the conflict between nvidia-docker and the cpu scheduling policy

Following Changing the CPU Manager Policy, I quickly adjusted one machine and got ready to rerun the benchmark on the k8s platform, only to find that nvidia-smi failed when querying the GPU:

Failed to initialize NVML: Unknown Error

Another round of searching revealed that the cpu-manager-policy=static I had just enabled conflicts with nvidia-docker: under the static policy the kubelet has to modify the container's runtime cpuset, and that is something nvidia docker does not support. The current workaround is to require every Pod that requests a gpu to have Guaranteed-level QoS, with every container given an integer cpu limit. The kubelet treats such Pods specially and skips updating their cpuset (their cpuset is locked anyway, so there is arguably nothing to update). This behaviour only became available in k8s 1.22, so we ended up having to upgrade the cluster to 1.22 as well.
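For completeness, here is a sketch of a Pod that satisfies those constraints: requests equal to limits (Guaranteed QoS), an integer cpu count, plus a memory-backed emptyDir to get the large /dev/shm mentioned in the test setup. The names are illustrative; the image is the one used in the benchmark above:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-benchmark
spec:
  containers:
  - name: benchmark
    image: uhub.service.ucloud.cn/openbayesruntimes/pytorch:1.9.0-py36-cu111.70
    resources:
      requests:
        cpu: "12"              # integer cpu, identical to the limit -> Guaranteed QoS
        memory: 30Gi
        nvidia.com/gpu: 1
      limits:
        cpu: "12"
        memory: 30Gi
        nvidia.com/gpu: 1
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 30Gi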