Installation Instructions
Installation with USB
Lang: English
Location: Portugal
Locales: en_US.UTF-8 (required by some progs)
Keymap: Portuguese
Network config: anything works, will change later
Hostname: {pokemon-x}
Domain: cluster.di.fct.unl.pt
Root passwd: none, repeat: none
Full name of new user: Operator
Username: op
Password: xxxxxx
Location: Lisbon
Partition:
Guided - use entire disk
sda
All files in one partition
Mirror country: Portugal
Mirror: deb.debian.org
Proxy: none
Popularity: yes, we are nice
Software: ssh, standard system utilities
Grub (if asked): yes - sda
Initial setup
sudo apt update && sudo apt full-upgrade
sudo apt install vim gnupg curl apt-transport-https ifenslave nfs-common
- Install microcode (pick the one matching the CPU vendor):
amd64-microcode
or intel-microcode
Network
Remove the 127.0.1.1 line from /etc/hosts:
sudo sed -i '/127.0.1.1/d' /etc/hosts
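A minimal sketch of what that sed delete does, run against a throwaway copy instead of the real /etc/hosts (the hostnames/IPs are made-up examples):

```shell
# Demo: only the 127.0.1.1 line is removed; all other entries survive.
tmp=$(mktemp)
printf '127.0.0.1 localhost\n127.0.1.1 pokemon-x\n172.30.10.2 frontend\n' > "$tmp"
sed -i '/127.0.1.1/d' "$tmp"
cat "$tmp"
rm -f "$tmp"
```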
Disable IPv6 in /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
sudo sysctl -p
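If these steps get scripted, appending the three keys can be made idempotent so re-running the setup doesn't duplicate them. A sketch, using a temp file as a stand-in for /etc/sysctl.conf:

```shell
# Append each disable_ipv6 key only if it is not already present verbatim.
conf=$(mktemp)   # on the real node: /etc/sysctl.conf (written with sudo tee -a)
for key in all default lo; do
    line="net.ipv6.conf.${key}.disable_ipv6 = 1"
    grep -qxF "$line" "$conf" || echo "$line" >> "$conf"
done
cat "$conf"
rm -f "$conf"
```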
/etc/network/interfaces:
#IFDEF COMPUTER_HAS_MORE_THAN_ONE_NETWORK_INTERFACE
source /etc/network/interfaces.d/*
auto lo
iface lo inet loopback
auto bond0
iface bond0 inet dhcp
slaves eno5np0 eno6np1
bond-mode 802.3ad
#ELSE
/etc/network/interfaces:
Change allow-hotplug to auto
#ENDIF
Add nodes to /etc/ethers and /etc/hosts on the frontend (the bond MAC should be the MAC of the first interface).
Reboot and check that everything works.
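The shape of those frontend entries can be sketched with a small helper; the node name, MAC, and IP below are made-up placeholders:

```shell
# Emit the /etc/ethers line (MAC hostname) and the /etc/hosts line
# (IP hostname) for one node, from a "name mac ip" triple.
emit_node() {  # usage: emit_node <name> <bond-mac> <ip>
    printf '%s %s\n' "$2" "$1"   # -> /etc/ethers
    printf '%s %s\n' "$3" "$1"   # -> /etc/hosts
}
emit_node pokemon-x 00:11:22:33:44:55 172.30.10.11
```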
Management
Add sudo permissions to admin group in visudo:
# Members of the admin group may gain root privileges
%admin ALL=(ALL) ALL
# Add NOPASSWD for specific commands for op:
op ALL=(ALL) NOPASSWD: /usr/bin/systemctl
op ALL=(ALL) NOPASSWD: /usr/bin/apt
NFS + RAMFS
Create dirs
sudo mkdir /var/management
sudo mkdir /mnt/share
sudo mkdir /mnt/ramdisk
Setup
/etc/fstab
:
172.30.10.2:/management /var/management nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
172.30.10.2:/home_mounts/op /home/op nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
172.30.10.2:/share /mnt/share nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
none /mnt/ramdisk ramfs noauto,user,size=4G,mode=0770 0 0
sudo mount -a
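Before running `sudo mount -a`, the NFS lines can be eyeballed with a quick awk filter; a sketch over sample fstab text (same entries as above):

```shell
# Print "mountpoint -> options" for every nfs line of fstab-style input,
# as a sanity check that fields are in the right columns.
printf '%s\n' \
  '172.30.10.2:/management /var/management nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0' \
  '172.30.10.2:/share /mnt/share nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0' \
  'none /mnt/ramdisk ramfs noauto,user,size=4G,mode=0770 0 0' \
| awk '$3 == "nfs" { print $2, "->", $4 }'
```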
LDAP Client
(Instructions from: https://wiki.debian.org/LDAP/NSS -> NSS Setup with libnss-ldapd)
Install ldap:
sudo apt-get install libnss-ldapd   # NOT libnss-ldap
Config:
ldap://172.30.10.1
dc=di,dc=fct,dc=unl,dc=pt (without "dc=cluster")
passwd, group, shadow
Disable ldap ssh logins, allow only local users, in /etc/pam.d/sshd (can be at the very top):
# Custom rules:
account sufficient pam_localuser.so
account required pam_listfile.so item=user sense=allow file=/etc/sshd/sshd.allow onerr=fail
Create sshd.allow file
sudo mkdir /etc/sshd && sudo touch /etc/sshd/sshd.allow
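pam_listfile matches one username per line, so populating sshd.allow is just appending names without duplicates. A sketch against a temp file (the usernames besides op are placeholders):

```shell
# Add each user to the allow list exactly once (stands in for
# /etc/sshd/sshd.allow; pam_listfile expects one username per line).
allow=$(mktemp)
for u in op admin2; do
    grep -qxF "$u" "$allow" || echo "$u" >> "$allow"
done
cat "$allow"
rm -f "$allow"
```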
CGROUPS (Possibly temporary, not sure…)
Revert to v1, instead of v2 (required for oar) -> https://blog.christophersmart.com/2019/12/15/enabling-docker-in-fedora-31-by-reverting-to-cgroups-v1/
sudo sed -i '/^GRUB_CMDLINE_LINUX/ s/"$/ systemd.unified_cgroup_hierarchy=0"/' /etc/default/grub
sudo update-grub
sudo reboot
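What the sed does, shown on a typical GRUB_CMDLINE_LINUX line instead of the real /etc/default/grub:

```shell
# The closing quote is replaced with the cgroup-v1 flag plus a new
# closing quote, so the flag lands inside the existing kernel cmdline.
echo 'GRUB_CMDLINE_LINUX="quiet"' \
| sed '/^GRUB_CMDLINE_LINUX/ s/"$/ systemd.unified_cgroup_hierarchy=0"/'
# -> GRUB_CMDLINE_LINUX="quiet systemd.unified_cgroup_hierarchy=0"
```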
OAR
(Instructions from http://oar.imag.fr/docs/2.5/admin/installation.html)
Install:
sudo apt install oar-node
Link keys (oar .ssh folder):
sudo ln -s /var/management/oar_keys /var/lib/oar/.ssh
sudo chown oar:oar /var/lib/oar/.ssh
Allow user environment in
/etc/ssh/sshd_config
:
PermitUserEnvironment yes
Link epilogue, prologue and scripts:
sudo rm /etc/oar/epilogue && sudo ln -s /var/management/job_scripts/epilogue /etc/oar/epilogue
sudo rm /etc/oar/prologue && sudo ln -s /var/management/job_scripts/prologue /etc/oar/prologue
sudo ln -s /var/management/job_scripts/tools /etc/oar/tools
Add nodes to oar in frontend:
# Create a file with node names, one per line, and run:
vim /tmp/new_nodes
sudo oar_resources_init -x /tmp/new_nodes
# Set node properties:
sudo oarnodesetting -h nodeX -p "cluster=Y"
sudo oarnodesetting -h nodeX -p "schedule_order=X"
sudo oarnodesetting -h nodeX -p "available_upto=0"
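When many nodes are added at once, the per-node oarnodesetting commands can be generated from the same node list and reviewed before execution. A sketch (the cluster value and node names are placeholders, and schedule_order is assumed to simply follow list order):

```shell
# Print the oarnodesetting commands for each node read from stdin,
# so they can be inspected and then piped to sh.
nodes_cmds() {  # usage: nodes_cmds <cluster> < node-list
    local i=1
    while read -r node; do
        echo "sudo oarnodesetting -h $node -p \"cluster=$1\""
        echo "sudo oarnodesetting -h $node -p \"schedule_order=$i\""
        i=$((i + 1))
    done
}
printf 'node1\nnode2\n' | nodes_cmds 1
```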
Docker + compose
Make sure LDAP is set up and the user docker exists, to prevent docker from creating a local group.
Follow the instructions from: https://docs.docker.com/engine/install/debian/
docker.socket needs to wait for LDAP. It also needs the dependency on sockets.target removed, to avoid a dependency cycle.
sudo systemctl edit docker.socket
[Unit]
After=nslcd.service sysinit.target
Requires=nslcd.service sysinit.target
Before=shutdown.target
Conflicts=shutdown.target
DefaultDependencies=no
Software
Runtimes
sudo apt install \
default-jdk maven scala gradle \
golang ocaml \
python3-dev python3-pip python3-venv pipenv virtualenv fabric
Utils
sudo apt install \
zsh unzip zip htop iperf3 tmux screen dtach lshw dnsutils tcpdump linux-perf rsync psmisc tree ethtool
Libraries & Misc
sudo apt install \
make automake bison build-essential cmake g++ gcc \
valgrind libpq-dev httpie flex libtool \
bwm-ng gcc-multilib libomp-dev libomp-11-doc libevent-dev libtbb-dev libgdk3.0-cil-dev protobuf-compiler \
libuv1-dev libssl-dev libhwloc-dev msr-tools pdsh libboost-all-dev libgmp-dev wiredtiger \
gmp-doc libgmp10-doc libmpfr-dev snmpd liblz4-dev libsnappy-dev libbz2-dev liblapacke \
libboost-dev libboost-test-dev libboost-program-options-dev libboost-system-dev \
libboost-filesystem-dev pkg-config libboost-thread-dev \
zlib1g-dev
#——————————- KUBERNETES ——————————-
# Install kubeadm
# Requires a custom repo -> https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
# Install Helm
# Requires a custom repo -> https://helm.sh/docs/intro/install/
curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
sudo apt-get install helm
sudo reboot
#—————————– NAGIOS ————————————-
source nagios/installcommands.txt
# Put the support ip in allowed hosts
# Uncomment checkapt
# Copy commands from other node…
sudo cp nagios/check* /usr/local/nagios/libexec/
sudo chmod +x /usr/local/nagios/libexec/check_*
# support: /usr/local/nagios/etc/servers -> duplicate entry and adapt ip addr
#—————————– PROMETHEUS ——————————-
TODO @jcleitao
All nodes: sudo apt install prometheus-node-exporter
#—————————— SPARK ——————————–
sudo cp -r spark-3.1.1-bin-hadoop3.2 /usr/lib/spark
sudo vim /etc/profile.d/spark_env.sh
==> Insert the following:
source /usr/lib/hadoop/etc/hadoop/hadoop-env.sh
export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin
sudo cp -r hadoop-3.2.2 /usr/lib/hadoop
sudo vim /usr/lib/hadoop/etc/hadoop/hadoop-env.sh
==> Uncomment and change the following lines to:
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=/usr/lib/hadoop
==> Uncomment:
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
#——————————- GPU ———————————–
Start by adding the components contrib and non-free to all repositories in /etc/apt/sources.list
sudo apt update && sudo apt dist-upgrade
sudo apt install firmware-linux-nonfree
sudo apt install nvidia-driver
sudo reboot
sudo apt install \
nvidia-opencl-common nvidia-opencl-dev \
nvidia-opencl-icd opencl-headers nvidia-cuda-dev \
nvidia-cuda-toolkit-doc nvidia-cuda-gdb nvidia-cuda-mps \
nvidia-cuda-toolkit
reboot
#to support docker executing tensorflow with access to the GPUs
# To trick the next command, use "debian10" as the distribution
distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
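How the $distribution value is derived can be shown against a sample os-release file (a temp file stands in for /etc/os-release; overriding it to debian10 is what the note above is about):

```shell
# Source a sample os-release and concatenate ID + VERSION_ID, exactly as
# the one-liner above does; the shell strips the quotes around "10".
osr=$(mktemp)
printf 'ID=debian\nVERSION_ID="10"\n' > "$osr"
distribution=$(. "$osr"; echo $ID$VERSION_ID)
echo "$distribution"   # -> debian10
rm -f "$osr"
```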
sudo apt update
sudo apt install -y nvidia-docker2
sudo systemctl restart docker
## Bug that will probably be fixed soon: in /etc/nvidia-container-runtime/config.toml, remove the @ from ldconfig = "@/sbin/ldconfig"