My project in 2024
Prerequisite
CMALab has substantial compute resources: 9 racks across 3 sites.
- 5 racks at the main site
- 2 racks at a second site
- 2 racks at a third site
Until 2023
Our lab's compute infrastructure was badly outdated:
- only L2 switches, with a static IP for each server
- no security controls
- SSH access with a short, shared password
- no bastion host
- server ownership tracked manually by a single person
What I did in 2024
Network
Designed and deployed the network for our lab's cluster:
- management fabric
  - OSPF over WireGuard site-to-site tunnels
  - 1-10G leaf-spine at each site
- data fabric (for the 5-rack site)
  - OSPF + ECMP leaf-spine with L3 hardware offloading
  - 10-100G links with jumbo frames
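
As a rough illustration of how ECMP spreads traffic across equal-cost spine links, here is a minimal Python sketch. It assumes a hash over the flow 5-tuple; real switch ASICs use their own hash functions and fields. The names `Flow`, `pick_next_hop`, and the spine addresses are hypothetical.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    """5-tuple that identifies one flow (illustrative only)."""
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Hash the 5-tuple and map it onto one of the equal-cost next hops.

    Every packet of the same flow gets the same next hop, so packets
    within a flow are not reordered across paths.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical spine addresses and an arbitrary TCP flow (proto 6).
SPINES = ["10.0.0.1", "10.0.0.2"]
flow = Flow("10.1.1.10", "10.1.2.20", 6, 40000, 6800)
print(pick_next_hop(flow, SPINES))
```

Because the choice is made per flow, a single large flow still uses one path; the load balances only across many flows.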
Security
- one bastion host per site
- SSH access by key pair only
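
One way to check that a host really is key-only is to audit its sshd configuration. The sketch below is an assumption about how such a check could look, not the tooling used here. It parses sshd_config-style text, ignores `Match` blocks, and conservatively treats any unset option as enabled. The function names and the sample config are hypothetical.

```python
def effective_sshd_options(config_text: str) -> dict[str, str]:
    """Return global options; the first value seen for a keyword wins."""
    options: dict[str, str] = {}
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip()
        if key == "match":
            break  # simplification: conditional blocks are not evaluated
        options.setdefault(key, value)
    return options


def is_key_only(config_text: str) -> bool:
    """True if password-style logins are off and public keys are on."""
    opts = effective_sshd_options(config_text)
    password_off = opts.get("passwordauthentication", "yes").lower() == "no"
    kbd = opts.get(
        "kbdinteractiveauthentication",
        opts.get("challengeresponseauthentication", "yes"),  # older name
    )
    kbd_off = kbd.lower() == "no"
    pubkey_on = opts.get("pubkeyauthentication", "yes").lower() == "yes"
    return password_off and kbd_off and pubkey_on


# Hypothetical config fragment for a key-only host.
SAMPLE = """
# key pair only
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes
"""
print(is_key_only(SAMPLE))
```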
Storage
- Deployed an all-flash Ceph storage cluster on the data fabric
- One full rack: 10 nodes, 4+2 erasure coding (EC), 70 TiB
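
The capacity trade-off of a 4+2 EC profile can be worked out with a few lines of Python. The slide does not say whether 70 TiB is raw or usable capacity, so this sketch computes both readings. It ignores Ceph metadata overhead and full-ratio headroom, and the function name `ec_profile` is hypothetical.

```python
def ec_profile(k: int, m: int) -> dict[str, float]:
    """Basic arithmetic for a k+m erasure-coded pool."""
    return {
        "usable_fraction": k / (k + m),   # share of raw space holding data
        "raw_per_usable": (k + m) / k,    # raw bytes written per data byte
        "tolerated_failures": m,          # chunks that can be lost
    }


profile = ec_profile(4, 2)

CAPACITY_TIB = 70  # figure from the slide; raw vs. usable is not stated
usable_if_raw = CAPACITY_TIB * profile["usable_fraction"]
raw_if_usable = CAPACITY_TIB * profile["raw_per_usable"]

print(f"usable fraction: {profile['usable_fraction']:.3f}")
print(f"if 70 TiB is raw, usable is about {usable_if_raw:.1f} TiB")
print(f"if 70 TiB is usable, raw is about {raw_if_usable:.1f} TiB")
```

For comparison, 3-way replication stores 3 raw bytes per data byte, while 4+2 stores 1.5 and still survives the loss of any two chunks.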
Management
Deployed a management system for the whole cluster:
- monitoring via node-exporter, Prometheus, and Grafana
- automatic alerts via Grafana and a Discord webhook
- on-demand, self-service resource provisioning via Canonical MAAS
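
Grafana can post to Discord on its own, so the following is only a sketch of what a webhook alert looks like on the wire, assuming a plain JSON body with a `content` field. The function names, the User-Agent string, and the example values are hypothetical, and nothing is sent unless `send_alert` is called with a real webhook URL.

```python
import json
import urllib.request


def build_alert_payload(host: str, metric: str, value: float,
                        threshold: float) -> dict[str, str]:
    """Format one alert as a Discord webhook message body."""
    content = (f"[ALERT] {host}: {metric} = {value:.1f} "
               f"(threshold {threshold:.1f})")
    return {"content": content[:2000]}  # Discord caps message length


def build_request(webhook_url: str,
                  payload: dict[str, str]) -> urllib.request.Request:
    """Build the HTTP POST without sending it."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Explicit agent; some endpoints reject the urllib default.
            "User-Agent": "lab-alert-sketch/0.1",
        },
        method="POST",
    )


def send_alert(webhook_url: str, payload: dict[str, str]) -> int:
    """Send the alert and return the HTTP status code."""
    with urllib.request.urlopen(build_request(webhook_url, payload),
                                timeout=10) as resp:
        return resp.status


# Hypothetical alert; the metric name and values are made up.
payload = build_alert_payload("node01", "cpu_temp_celsius", 91.3, 85.0)
print(payload["content"])
```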