## Proposal for *An online systematic scheduling algorithm over distributed I/O systems*

In resource allocation for distributed systems on high-performance computers, we do not really know which device, such as a disk or a NIC (network interface card), is more likely to wear out, or which is currently off duty and may therefore delay getting the data ready. The current solutions are random or round-robin scheduling to avoid wear, plus dynamic routing for the fastest speed. We can use the collected data to automate these decisions.

An experienced system administrator may know which parameters to tweak, such as the stride on distributed file systems, the network MTU for InfiniBand cards, and the route used to fetch data. Today, eBPF (extended Berkeley Packet Filter) can record information such as I/O latency on the storage nodes and network latency across the topology as time-series data. We can use this data to predict which topology, stride, and other parameters are likely to be the best way to fetch data.

The data arrives online, so the prediction function can be online reinforcement learning. As in a k-armed bandit, the reward can be a function of the latency gain and a device-wear parameter. The updates can be the real-time latencies of disks and networks. The information given to the RL agents can include where the data is located on the disks, which data is accessed more frequently (DBMS queries or random small files), and how often each disk fails.
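The bandit formulation can be sketched roughly as follows. This is a minimal epsilon-greedy k-armed bandit, not a worked-out design: the arm names, the weights `alpha` and `beta`, and the `reward` function itself are hypothetical and only illustrate a reward that combines latency gain with a wear penalty.

```python
import random


def reward(latency_gain_ms, wear_cost, alpha=1.0, beta=0.5):
    """Hypothetical reward: latency gain minus a weighted wear penalty."""
    return alpha * latency_gain_ms - beta * wear_cost


class EpsilonGreedyBandit:
    """k-armed bandit with incremental sample-average value estimates."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, observed_reward):
        # Incremental mean: Q <- Q + (r - Q) / n
        self.counts[arm] += 1
        self.values[arm] += (observed_reward - self.values[arm]) / self.counts[arm]
```

In use, each arm would be one candidate route or device, `select()` would pick where to send the next request, and `update()` would be fed the reward computed from the measured latency and wear.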

Benchmarks and evaluation can measure the statistical gain in our system's latency and the overall disk wear after stress tests.
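One possible way to quantify the latency gain is the relative reduction in mean and tail latency between a baseline run and a run with the scheduler enabled. The sketch below assumes latency samples are plain lists of numbers; the function names and the choice of p99 are illustrative assumptions, not part of the proposal.

```python
import statistics


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * q // 100))  # ceil(n * q / 100)
    return ordered[int(rank) - 1]


def latency_gain(baseline, candidate):
    """Relative reduction in mean and p99 latency (positive means faster)."""
    def rel(before, after):
        return (before - after) / before

    return {
        "mean": rel(statistics.mean(baseline), statistics.mean(candidate)),
        "p99": rel(percentile(baseline, 99), percentile(candidate, 99)),
    }
```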

## Slurm pitfall diary, for use alongside a certain drama queen

## Installation

    rpmbuild -ta slurm*.tar.bz2


1. The hdf5 dependency in the spec file is problematic. Extract the tarball with `tar -jxvf *` and change the configure section to `--with-hdf5=no`.
2. CC must be the system gcc.
3. For `undefined reference to DRL_MIN`, replace DRL_MIN in the source with 2.2250738585072013830902327173324040642192159804623318306e-308 (the smallest normalized double).

After the build, go to `$HOME/rpmbuild/RPMS/x86_64` and `yum install` the packages.

### In short, there is no escaping reading the source

## Configuration

Create a munge key and put it at `/etc/munge/munge.key`. The `/etc/slurm/slurm.conf` on the compute and control nodes must be identical.

    ClusterName=epyc
    ControlMachine=epyc.node1
    ControlAddr=192.168.100.5
    SlurmUser=slurm1
    MailProg=/bin/mail
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    StateSaveLocation=/var/spool/slurmctld
    SlurmdSpoolDir=/var/spool/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/linuxproc
    #PluginDir=
    #FirstJobId=
    ReturnToService=0
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    #SchedulerAuth=
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
    #
    # LOGGING
    SlurmctldDebug=3
    SlurmctldLogFile=/var/log/slurmctld.log
    SlurmdDebug=3
    SlurmdLogFile=/var/log/slurmd.log
    JobCompType=jobcomp/none
    #JobCompLoc=
    #
    # ACCOUNTING
    #JobAcctGatherType=jobacct_gather/linux
    #JobAcctGatherFrequency=30
    #
    #AccountingStorageType=accounting_storage/slurmdbd
    #AccountingStorageHost=
    #AccountingStorageLoc=
    #AccountingStoragePass=
    #AccountingStorageUser=
    #
    # COMPUTE NODES
    NodeName=epyc.node1 NodeAddr=192.168.100.5 CPUs=256 RealMemory=1024 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 State=IDLE
    NodeName=epyc.node2 NodeAddr=192.168.100.6 CPUs=256 RealMemory=1024 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 State=IDLE
    PartitionName=control Nodes=epyc.node1 Default=YES MaxTime=INFINITE State=UP
    PartitionName=compute Nodes=epyc.node2 Default=NO MaxTime=INFINITE State=UP

Keep an eye on `/var/log/slurm*` as you go; you will discover all sorts of things. I suggest not enabling slurmdbd, because it is hard to configure successfully. sacct does not need this feature.

### A small QoS experiment, based on RDMA traffic

Slurm has load-balancing-based QoS control, and time-series data for RDMA traffic is easy to obtain, so it is easy to tune QoS dynamically.

    $ sudo opensm -g 0x98039b03009fcfd6 -F /etc/opensm/opensm.conf -B
message table affinity

    $ numactl --cpunodebind=0 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=0 -D 10
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x8f QPN 0xdd16 PSN 0x25f4a4 RKey 0x0e1848 VAddr 0x002b65b2130000
     remote address: LID 0x8d QPN 0x02c6 PSN 0xdb2c00 RKey 0x17d997 VAddr 0x002b8263ed0000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      0              0.000000           0.000000             0.000000
    ---------------------------------------------------------------------------------------

Change the affinity:

    $ numactl --cpunodebind=1 ib_write_bw -d mlx5_0 -i 1 --report_gbits -F --sl=1 -D 10
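The RDMA traffic time series could be sampled along these lines. This is a rough sketch that assumes the standard InfiniBand sysfs counter `port_xmit_data` is available for the device and that it counts 4-byte words; the function names are hypothetical, and turning the measurement into an actual Slurm QoS change is left out.

```python
import time
from pathlib import Path


def read_xmit_words(device="mlx5_0", port=1):
    """Read the port_xmit_data counter (assumed to count 4-byte words)."""
    path = Path(f"/sys/class/infiniband/{device}/ports/{port}/counters/port_xmit_data")
    return int(path.read_text().strip())


def gbits_per_sec(words_before, words_after, interval_s):
    """Convert a counter delta to Gb/s: words * 4 bytes * 8 bits."""
    return (words_after - words_before) * 4 * 8 / interval_s / 1e9


def sample_bandwidth(device="mlx5_0", port=1, interval_s=1.0):
    """Take two samples interval_s apart and return transmit bandwidth in Gb/s."""
    before = read_xmit_words(device, port)
    time.sleep(interval_s)
    after = read_xmit_words(device, port)
    return gbits_per_sec(before, after, interval_s)
```

Calling `sample_bandwidth()` in a loop gives one data point per interval, which is enough to build the time series that a QoS controller would consume.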



### Finally, put together a playbook
    ---

    slurm_roles: []
    slurm_partitions: []
    slurm_nodes: []
    slurm_config_dir: "{{ '/etc/slurm' }}"
    slurm_configure_munge: yes

    slurmd_service_name: slurmd
    slurmctld_service_name: slurmctld
    slurmdbd_service_name: slurmdbd

    __slurm_user_name: "{{ (slurm_user | default({})).name | default('slurm') }}"
    __slurm_group_name: "{{ (slurm_user | default({})).group | default(omit) }}"

    __slurm_config_default:
      AuthType: auth/munge
      CryptoType: crypto/munge
      SlurmUser: "{{ __slurm_user_name }}"
      ClusterName: cluster
      ProctrackType: proctrack/linuxproc
      # slurmctld options
      SlurmctldPort: 6817
      SlurmctldLogFile: "{{ '/var/log/slurm/slurmctld.log' }}"
      SlurmctldPidFile: >-
        {{
        '/var/run/slurm/slurmctld.pid'
        }}
      StateSaveLocation: >-
        {{
        '/var/lib/slurm/slurmctld'
        }}
      # slurmd options
      SlurmdPort: 6818
      SlurmdLogFile: "{{ '/var/log/slurm/slurmd.log' }}"
      SlurmdPidFile: "{{ '/var/run/slurm/slurmd.pid' }}"
      SlurmdSpoolDir: "{{ '/var/spool/slurm/slurmd' }}"

    __slurm_packages:
      client: [slurm, munge]
      slurmctld: [munge, slurm, slurm-slurmctld]
      slurmd: [munge, slurm, slurm-slurmd]
      slurmdbd: [munge, slurm-slurmdbd]

    __slurmdbd_config_default:
      AuthType: auth/munge
      DbdPort: 6819
      SlurmUser: "{{ __slurm_user_name }}"
      LogFile: "{{ '/var/log/slurm/slurmdbd.log' }}"
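To illustrate how a defaults mapping such as `__slurm_config_default` might end up as `Key=Value` lines in `slurm.conf`, here is a minimal sketch. It is not the playbook's actual template; `render_slurm_conf` and the `overrides` dictionary are hypothetical names used only for this example.

```python
def render_slurm_conf(config):
    """Render a mapping as slurm.conf 'Key=Value' lines, skipping None values."""
    lines = [f"{key}={value}" for key, value in config.items() if value is not None]
    return "\n".join(lines) + "\n"


defaults = {
    "AuthType": "auth/munge",
    "ClusterName": "cluster",
    "ProctrackType": "proctrack/linuxproc",
    "SlurmctldPort": 6817,
    "SlurmdPort": 6818,
}
overrides = {"ClusterName": "epyc"}  # site-specific values win over defaults

print(render_slurm_conf({**defaults, **overrides}), end="")
```

Merging the two dictionaries before rendering keeps the defaults in one place while letting each cluster change only the keys it needs.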


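## SR-IOV configuration

The ConnectX-3 IB NIC on the NC24rs (version 1) goes through SR-IOV. The virtual function does not leave much room to hack, but I took the chance to learn a bit more about PCIe virtualization.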

## VSCC20 summary

Poster

TL;DR

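Azure CycleCloud can scale whatever computation you want to run very well, but we have little experience with machines beyond 3000 W. I may have had some during my internship, but that was only operating clusters that were already running. I am not that good at computing cost precisely, and nobody else on our team seemed to understand elastically deployed machines well either. When writing the final architecture, we carried over our old mindset: because of the budget limit, we planned to spend at the same rate at all times. But in a cloud competition it is best to run each application on the machine that suits it best. For example, if CESM needs HB120rs, just start a new machine for it. Before the competition we had only put all the compiled programs on an always-on NFS server backed by XFS with built-in backup. PBS and Grafana were barely usable, and for the machines we started later we were too lazy to configure them and wrote no scripts. A complaint about CycleCloud's garbage front end: the HPC option only offers Ubuntu 18.04 and RHEL 7. Later we also set up a Lustre template to try running IO500, but its performance was worse than a single disk, so we gave up.

NFS went down once while we were installing the OMED driver: stale handle detected. We restarted the PBS cluster and the NFS cluster. Luckily all the data was still there. Around that time we uploaded all existing results with azcopy, one after another, and found that our 60 s HPCG result was not valid. Professor Yin's face turned ashen. With 32 machines, provisioning them through the pipeline took us more than 2 hours, and then we spent another hour on assorted OpenMPI problems. In the end we could not run HPCG again within the remaining hour. The budget was also about to be exceeded, so we had to shut down the 32 P100 machines and go back to sleep. The gap between us and Tsinghua in this competition was that we did not finish the benchmarks, and did not finish CESM either.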

## On the Professors of China's Future Caltech

I don't often comment on other people, and my written English is god-damn stupid, but I decided to comment on the professors who each taught me for one semester at SIST, ShanghaiTech.

ShanghaiTech professors usually impose their strong personal ideas on the writing assignments, paper readings and coding projects in order to weed out the caiji like me and select excellent students for their labs.

## Chun Dong

As a person of random personality, I don't really think highly of someone with OCD (obsessive-compulsive disorder). But Chun Dong really is a person of great precision in every aspect of his life. Take his comments on my first paper-reading assignment as an example: he circled literally every minor grammar mistake I made. Admittedly, for a decent researcher, being a bug-free author should be interpreted as a good quality. I noticed that he was an editor for a journal in the U.S., so his craziness is not that hard for me to understand.

He is really well read in computer architecture and the surrounding literature. Once, when I asked him about a CPU microarchitecture problem, he could name the year and the author. Once, when I asked about an algorithm, he could tell me its strengths and weaknesses. Indeed, these up-and-coming researchers from the Chinese world are hard-working, intelligent and hard on others.

## Shu Yin

I'm older than you, but I'll literally take all your ability and fame.

As the teacher of the hardest required class, he easily gets the most intelligent students in the school. But after cooperating with him on some minor things, I think he cannot shake off his small-town ways of thinking. He's smart and he pushes people. I was pushed to do things that I did not think were necessary or of any benefit to me.

He's a senior professor at ShanghaiTech. He deserves the progress he has made. But the dream he realized may not be one that every student who works for him should share. I'm not making allegations against anyone. Where there are people, there is politics. I don't shy away from talking about anything that looks wrong, even if it may draw fire on me.

Good Day.

## Zi Yu

I impose a lot on you; I don't really teach you but let you learn by yourself.

A tutor is one who imposes mottos on his students so that they do not wander or waste the precious time of their youth. I doubted Zi Yu's skills of expression and his ability of boosting, until I watched his latest masterpiece. Shame on me.

I may not have a well-grounded ability to solve probability problems, but I'll always remember his motto, his slang, his style.

## On the opponents we will face at SC20

// But it feels like the truly strong students no longer go abroad these days.

## Fixing `spack install gromacs+cuda+mpi` being incompatible with CUDA 11

As a temporary workaround, you can download the upstream release that fixes the CUDA 11 incompatibility and put it at `/path/to/spack/var/spack/cache/_source-cache/archive/cd/cd12bd4977e19533c1ea189f1c069913c86238a8a227c5a458140b865e6e3dc5.tar.gz`.

The checksum can be obtained with `sha256sum gromacs-prepare-2020.4.tar.gz`. You should also update the 2020.4 checksum in `/path/to/spack/var/spack/repos/builtin/packages/gromacs/package.py` by adding `version('2020.4', sha256='cd12bd4977e19533c1ea189f1c069913c86238a8a227c5a458140b865e6e3dc5')`.

Then you can simply run `spack install gromacs@2020.4+mpi+cuda`.
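The checksum and cache-path steps can also be scripted. The sketch below assumes the cache layout is exactly the one quoted above (`archive/<first two hex characters>/<checksum>.tar.gz`); `sha256_of` and `source_cache_path` are hypothetical helper names, not part of Spack.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def source_cache_path(spack_root, checksum, suffix=".tar.gz"):
    """Build archive/<first two hex chars>/<checksum><suffix> under the source cache."""
    cache = Path(spack_root) / "var/spack/cache/_source-cache/archive"
    return cache / checksum[:2] / f"{checksum}{suffix}"
```

Copying the downloaded tarball to `source_cache_path("/path/to/spack", sha256_of(tarball))` reproduces the manual step, and the same digest is the value to paste into `package.py`.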