My project in 2024

Background

CMALab has substantial compute resources: 9 racks across 3 sites.

  • 5 racks at the main site
  • 2 racks at a second site
  • 2 racks at a third site

Until 2023

Our lab's compute infrastructure was badly outdated:

  • L2 switches only, with a static IP per server
  • no security
    • SSH access with a short, shared password
    • no bastion host
  • server ownership managed manually by a single person

What I did in 2024

Network

Designed and deployed the network for our lab's cluster:

  • management fabric
    • OSPF over WireGuard site-to-site
    • 1-10G leaf-spine at each site
  • data fabric (for the 5-rack main site)
    • OSPF + ECMP leaf-spine with L3 HW offloading
    • 10-100G with jumbo frames
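
To illustrate what ECMP does in the data fabric, here is a minimal sketch of per-flow path selection: the 5-tuple of a flow is hashed, and the hash picks one of the equal-cost next hops, so packets of one flow stay on one path while different flows spread across the spines. The spine addresses, the flow values, and the function name `ecmp_next_hop` are hypothetical, and real switch hardware typically uses a simpler hash (CRC or XOR based) than the SHA-256 used here for clarity.

```python
import hashlib


def ecmp_next_hop(flow, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow 5-tuple.

    flow: (src_ip, dst_ip, protocol, src_port, dst_port)
    next_hops: list of equal-cost next-hop addresses
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 4 bytes of the digest as an unsigned integer hash.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical spine addresses and flow, for illustration only.
spines = ["10.0.0.1", "10.0.0.2"]
flow = ("10.1.1.10", "10.1.2.20", 6, 40000, 6789)

# The same flow always maps to the same spine, which avoids reordering.
print(ecmp_next_hop(flow, spines))
```

Because the choice depends only on the flow's header fields, a single large flow never exceeds the bandwidth of one path; the aggregate gain comes from having many flows.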

Security

  • Bastion host per site
  • SSH access via key pair only

Storage

  • Deployed an all-flash Ceph storage cluster on the data fabric
  • One full rack: 10 nodes, 4+2 erasure coding (EC), 70 TiB
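
As a rough sketch of what a 4+2 EC profile implies for capacity: each object is split into 4 data chunks plus 2 coding chunks, so 4/6 of raw space holds user data and any 2 chunks can be lost. The slide does not say whether 70 TiB is raw or usable, so the example below computes both readings; the helper name `ec_profile` is hypothetical and the numbers ignore Ceph's own overhead and full-ratio headroom.

```python
def ec_profile(k, m):
    """Summarize a k+m erasure-coding profile (k data chunks, m coding chunks)."""
    return {
        "usable_fraction": k / (k + m),   # share of raw space holding user data
        "raw_per_usable": (k + m) / k,    # raw bytes written per usable byte
        "tolerated_failures": m,          # chunks that can be lost without data loss
    }


profile = ec_profile(4, 2)
capacity_tib = 70  # figure from the slide; raw vs. usable is not stated

# Reading 1: 70 TiB is raw capacity.
usable_if_raw = capacity_tib * profile["usable_fraction"]
# Reading 2: 70 TiB is usable capacity.
raw_if_usable = capacity_tib * profile["raw_per_usable"]

print(f"usable fraction: {profile['usable_fraction']:.3f}")
print(f"if 70 TiB is raw, usable is about {usable_if_raw:.1f} TiB")
print(f"if 70 TiB is usable, raw is about {raw_if_usable:.1f} TiB")
```

For comparison, 3-way replication stores 3 raw bytes per usable byte, against 1.5 for 4+2 EC.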

Management

Deployed a management system for the whole cluster:

  • monitoring via node-exporter, Prometheus and Grafana
  • automatic alerts via Grafana and a Discord webhook
  • on-demand, self-service machine provisioning via Canonical MAAS
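
For readers unfamiliar with Discord webhooks, the sketch below shows the shape of the request an alert ends up as: an HTTP POST with a JSON body whose `content` field is the message text. This is only an illustration; Grafana ships its own Discord contact point, so the setup described above does not need a script like this. The function names, the alert fields, and the User-Agent string are assumptions made for the example, and the webhook URL must be supplied by the reader.

```python
import json
import urllib.request

DISCORD_CONTENT_LIMIT = 2000  # Discord rejects messages longer than this


def build_discord_payload(alert_name, instance, summary):
    """Build the JSON body for a Discord webhook message (hypothetical format)."""
    content = f"[ALERT] {alert_name} on {instance}: {summary}"
    return {"content": content[:DISCORD_CONTENT_LIMIT]}


def send_to_discord(webhook_url, payload):
    """POST the payload to a Discord webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent, since the urllib default
            # may be rejected by some endpoints.
            "User-Agent": "lab-alert-sketch/0.1",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical alert values, for illustration only.
payload = build_discord_payload("NodeDown", "rack3-node07", "node-exporter unreachable for 5m")
print(json.dumps(payload))
```

Calling `send_to_discord(url, payload)` with a real webhook URL would deliver the message to the channel that owns the webhook.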