Installation Instructions

Install instructions for new nodes

Installation with USB

Lang: English
Location: Portugal
Locales: en_US.UTF-8 (required by some programs)
Keymap: Portuguese
Network config: anything works, will change later
Hostname: {pokemon-x}
Domain: cluster.di.fct.unl.pt
Root passwd: none,  repeat: none
Full name of new user: Operator
Username: op
Password: xxxxxx
Location: Lisbon
Partition:
	Guided - use entire disk
	sda
	All files in one partition
Mirror country: Portugal
Mirror: deb.debian.org
Proxy: none
Popularity: yes, we are nice
Software: ssh, standard system utilities
Grub (if asked): yes - sda

Initial setup

  • sudo apt update && sudo apt full-upgrade

  • sudo apt install vim gnupg curl apt-transport-https ifenslave nfs-common


  • Install microcode: amd64-microcode or intel-microcode
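
    The right package depends on the CPU vendor; a small helper (hypothetical,
    not part of the original instructions) can pick it from /proc/cpuinfo:

    ```shell
    # Map the vendor_id string from /proc/cpuinfo to the Debian microcode
    # package. pick_microcode is a made-up helper name, not an official tool.
    pick_microcode() {
        case "$1" in
            GenuineIntel) echo intel-microcode ;;
            AuthenticAMD) echo amd64-microcode ;;
            *) echo "unknown CPU vendor: $1" >&2; return 1 ;;
        esac
    }

    # On a node:
    #   vendor=$(awk '/vendor_id/ {print $3; exit}' /proc/cpuinfo)
    #   sudo apt install "$(pick_microcode "$vendor")"
    ```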

Network

  • Remove 127.0.1.1 line from /etc/hosts -> sudo sed -i '/127.0.1.1/d' /etc/hosts

  • Disable IPV6 in /etc/sysctl.conf:

    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
    

    sudo sysctl -p

  • /etc/network/interfaces:

#IFDEF COMPUTER_HAS_MORE_THAN_ONE_NETWORK_INTERFACE

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback

auto bond0
iface bond0 inet dhcp
	bond-slaves eno5np0 eno6np1
	bond-mode 802.3ad

#ELSE

/etc/network/interfaces:
	Change allow-hotplug to auto

#ENDIF

  • Add nodes to /etc/ethers and /etc/hosts on the frontend (the bond MAC should be the MAC of the first slave interface)
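
    For example (hostname, MAC and address below are made up, not real cluster
    values):

    ```
    # /etc/ethers on the frontend: MAC -> hostname
    aa:bb:cc:dd:ee:01    pikachu

    # /etc/hosts on the frontend: IP -> names
    172.30.10.21    pikachu.cluster.di.fct.unl.pt    pikachu
    ```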

  • Reboot and check if working

Management

  • Add sudo permissions to admin group in visudo:

    # Members of the admin group may gain root privileges
    %admin ALL=(ALL) ALL
    
    #Allow op to run specific commands without a password:
    op ALL=(ALL) NOPASSWD: /usr/bin/systemctl
    op ALL=(ALL) NOPASSWD: /usr/bin/apt
    

NFS + RAMFS

  • Create dirs

    sudo mkdir /var/management
    sudo mkdir /mnt/share
    sudo mkdir /mnt/ramdisk
    
  • Setup /etc/fstab:

    172.30.10.2:/management    /var/management   nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
    172.30.10.2:/home_mounts/op   /home/op   nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
    172.30.10.2:/share   /mnt/share   nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
    none    /mnt/ramdisk    ramfs    noauto,user,size=4G,mode=0770    0    0
    
  • sudo mount -a
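
    A quick sanity check before mounting: every non-comment line in /etc/fstab
    should have exactly six fields. A sketch (check_fstab is a hypothetical
    helper, not from the original runbook):

    ```shell
    # Print any fstab line that is neither blank nor a comment and does not
    # have exactly 6 fields; exit non-zero if such a line exists.
    check_fstab() {
        awk 'NF && $1 !~ /^#/ && NF != 6 { bad = 1; print "malformed: " $0 }
             END { exit bad }' "$1"
    }

    # Usage on a node:
    #   check_fstab /etc/fstab && sudo mount -a
    ```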

LDAP Client

(Instructions from: https://wiki.debian.org/LDAP/NSS -> NSS Setup with libnss-ldapd)

  • Install ldap: sudo apt-get install libnss-ldapd #NOT libnss-ldap

  • Config:

    ldap://172.30.10.1
    dc=di,dc=fct,dc=unl,dc=pt (without "dc=cluster")
    passwd, group, shadow
    
  • Disable ldap ssh, allow only local users, in /etc/pam.d/sshd (can be at the very top):

    #Custom rules:
    account    sufficient     pam_localuser.so
    account    required        pam_listfile.so item=user sense=allow file=/etc/sshd/sshd.allow onerr=fail
    
  • Create the sshd.allow file: sudo mkdir /etc/sshd && sudo touch /etc/sshd/sshd.allow
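
    Adding users to the allow list can be made idempotent; a sketch
    (allow_ssh_user is a made-up helper, not part of the original
    instructions):

    ```shell
    # Append a username to the sshd allow list only if it is not already there.
    allow_ssh_user() {    # allow_ssh_user USER [FILE]
        user=$1
        file=${2:-/etc/sshd/sshd.allow}
        grep -qxF "$user" "$file" 2>/dev/null || echo "$user" >> "$file"
    }

    # Usage on a node (as root, because of the file location):
    #   allow_ssh_user op
    ```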

CGROUPS (Possibly temporary, not sure…)

OAR

(Instructions from http://oar.imag.fr/docs/2.5/admin/installation.html)

  • Install: sudo apt install oar-node

  • Link keys (oar .ssh folder):

    sudo ln -s /var/management/oar_keys /var/lib/oar/.ssh
    sudo chown oar:oar /var/lib/oar/.ssh
    
  • Allow user environment in /etc/ssh/sshd_config: PermitUserEnvironment yes

  • Link epilogue, prologue and scripts:

    sudo rm /etc/oar/epilogue && sudo ln -s /var/management/job_scripts/epilogue /etc/oar/epilogue
    sudo rm /etc/oar/prologue && sudo ln -s /var/management/job_scripts/prologue /etc/oar/prologue
    sudo ln -s /var/management/job_scripts/tools /etc/oar/tools
    
  • Add nodes to oar in frontend:

    
    #Create a file with node names, one per line, and run:
    vim /tmp/new_nodes
    sudo oar_resources_init -x /tmp/new_nodes
    
    #Set nodes cluster
    sudo oarnodesetting -h nodeX -p "cluster=Y"
    sudo oarnodesetting -h nodeX -p "schedule_order=X"
    sudo oarnodesetting -h nodeX -p "available_upto=0"
    

Docker + compose

  • Make sure LDAP is set up and the docker group exists, to prevent the Docker package from creating a local group.
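
    One way to verify this before installing (the check itself is an
    assumption, not from the original instructions): `getent group docker`
    should resolve through LDAP, and /etc/group should not define it locally.
    A minimal helper for the local check:

    ```shell
    # Succeed if GROUP is defined in a /etc/group-style FILE (local definition).
    local_group_defined() {    # local_group_defined GROUP FILE
        grep -q "^$1:" "$2"
    }

    # On a node, before installing Docker:
    #   getent group docker || echo "docker group missing from LDAP"
    #   local_group_defined docker /etc/group && echo "WARNING: local docker group"
    ```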

  • Follow instructions from: https://docs.docker.com/engine/install/debian/

  • docker.socket needs to wait for LDAP (nslcd). It must also drop its dependency on sockets.target to avoid a dependency cycle.

sudo systemctl edit docker.socket

[Unit]
After=nslcd.service  sysinit.target
Requires=nslcd.service sysinit.target
Before=shutdown.target
Conflicts=shutdown.target
DefaultDependencies=no

Software

Runtimes

sudo apt install \
    default-jdk maven scala gradle \
    golang ocaml \
    python3-dev python3-pip python3-venv pipenv virtualenv fabric

Utils

sudo apt install \
    zsh unzip zip htop iperf3 tmux screen dtach lshw dnsutils tcpdump linux-perf rsync psmisc tree ethtool

Libraries & Misc

sudo apt install \
    make automake bison build-essential cmake g++ gcc \
    valgrind libpq-dev httpie flex libtool \
    bwm-ng gcc-multilib libomp-dev libomp-11-doc libevent-dev libtbb-dev libgdk3.0-cil-dev protobuf-compiler \
    libuv1-dev libssl-dev libhwloc-dev msr-tools pdsh libboost-all-dev libgmp-dev wiredtiger \
    gmp-doc libgmp10-doc libmpfr-dev snmpd liblz4-dev libsnappy-dev libbz2-dev liblapacke \
    libboost-dev libboost-test-dev libboost-program-options-dev libboost-system-dev \
    libboost-filesystem-dev pkg-config libboost-thread-dev zlib1g-dev

#——————————-KUBERNETES——————————-
#Install Kubeadm
#Requires custom repo -> https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF

#Install Helm
#Requires custom repo -> https://helm.sh/docs/intro/install/
curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
sudo apt-get install helm

sudo reboot

#—————————–NAGIOS————————————-
source nagios/installcommands.txt
#put support ip in allowed hosts
#uncomment checkapt
#copy commands from other node…
sudo cp nagios/check* /usr/local/nagios/libexec/
sudo chmod +x /usr/local/nagios/libexec/check_*
#support /usr/local/nagios/etc/servers -> duplicate entry and adapt ip addr

#—————————– PROMETHEUS ——————————-

TODO @jcleitao

All nodes: sudo apt install prometheus-node-exporter
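
node-exporter serves metrics on port 9100; after installing, a quick health
check is to fetch /metrics and look for node_* series. A sketch (the helper
name is made up; assumes curl is available on the node):

```shell
# Succeed if the given metrics payload contains at least one node_* metric.
has_node_metrics() {
    printf '%s\n' "$1" | grep -q '^node_'
}

# On a node:
#   has_node_metrics "$(curl -s http://localhost:9100/metrics)" && echo OK
```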

#—————————— SPARK ——————————–

sudo cp -r spark-3.1.1-bin-hadoop3.2 /usr/lib/spark
sudo vim /etc/profile.d/spark_env.sh
==> Insert the following:
source /usr/lib/hadoop/etc/hadoop/hadoop-env.sh
export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin

sudo cp -r hadoop-3.2.2 /usr/lib/hadoop
sudo vim /usr/lib/hadoop/etc/hadoop/hadoop-env.sh
==> Uncomment and change the following lines to:
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=/usr/lib/hadoop
==> Uncomment: export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

#——————————- GPU ———————————–
Start by adding the components contrib and non-free to all repositories in /etc/apt/sources.list

sudo apt update && sudo apt dist-upgrade
sudo apt install firmware-linux-nonfree
sudo apt install nvidia-driver
sudo reboot
sudo apt install \
    nvidia-opencl-common nvidia-opencl-dev \
    nvidia-opencl-icd opencl-headers nvidia-cuda-dev \
    nvidia-cuda-toolkit-doc nvidia-cuda-gdb nvidia-cuda-mps \
    nvidia-cuda-toolkit

reboot

#to support docker executing tensorflow with access to the GPUs
#To trick the next command, use "debian10" as the distribution
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-docker2
sudo systemctl restart docker

##Bug that probably will be fixed soon: /etc/nvidia-container-runtime/config.toml -> remove @ from ldconfig = "@/sbin/ldconfig"


Last modified 30.05.2022