Team:Sysadmin/Lessons Learned

From MCH2022 wiki

Team:Sysadmin Lessons learned

Team:

  • Pretalx is moody and CPU-heavy, and its program (static schedule) exporter does not work properly. Do not use it at the next event, or provide it without any support from Sysadmin.
  • Set up multiple networks beforehand to separate sysadmin and non-sysadmin services.
    • We should be able to add more networks on demand.
  • Set up monitoring as one of the first services
  • During early setup, outline a policy on semi-managed/unmanaged VMs for other teams
  • During early setup, outline a policy on private mailboxes on the event domain
  • Explicitly tell teams they are banned from using non-MCH mailboxes for visitor-facing inbound mail addresses
    • Team:Fire/First-Aid/Security used a shared gmail mailbox during the event
  • Zammad's mail daemon sometimes gets stuck: set up monitoring on the unprocessed queue size
  • Keycloak only handles authentication on its own side, not authorization. 2FA also seems to be an afterthought: use something else next time
  • Consider running web services (e.g. Engelsystem) in a cluster, e.g. K8s
    • At the time of writing:
      • The Angelsystem (Engelsystem) is not yet capable of handling read-only MySQL nodes...
      • Mediawiki supports running with read-only MySQL nodes.
  • Monitor usage of services
    • No usage means decommissioning them, which decreases the maintenance burden
  • Outline a change authorization flow and agree upon it with PL.
    • E.g. 1st line contacts of teams are allowed to authorize changes for their team's services. Make sure these 1st line contacts are defined/approved by PL (not by the team itself) and are limited in number (e.g. max 2 for larger teams, otherwise 1).
  • Have frequent team meetings, e.g. monthly, to check up on progress
  • Ensure everyone actively carries the same level of responsibility
  • Be wary of latecomers in the team: they still require getting familiar with the infra which is hard to do on short notice.
  • Be prepared for hardware failure or unscheduled downtimes: datacenter outages can and will happen
    • E.g. have a presence in another datacenter for mission-critical things such as mail
  • During the event set up a standby schedule
  • Keep in mind we need to stretch VLANs between hypervisors
    • Use VXLAN w/ Open vSwitch for example, or (maybe better?) BGP-EVPN w/ VXLAN
      • VXLAN with classic Linux Bridges (on one of the hosts) seems to have some performance (throughput) issues. Investigate before considering a mixed set-up
  • Try to minimize the maintenance burden of individual services: look into things such as Docker.
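
The Zammad item above (monitoring the unprocessed queue size) could be covered by a small check script. The following is only a minimal sketch: it assumes unprocessed mail shows up as one file per message in a spool directory, and that directory is passed in as a parameter because its location depends on the Zammad version and install method. The function names and thresholds are hypothetical; the exit codes follow the common Nagios plugin convention (0 OK, 1 warning, 2 critical, 3 unknown).

```python
"""Nagios-style check: alert when a mail spool directory grows too large."""
from __future__ import annotations

from pathlib import Path

# Exit codes as used by Nagios-compatible monitoring plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_queued(spool_dir: Path) -> int:
    """Count regular files directly inside spool_dir.

    Assumption: one file corresponds to one unprocessed mail.
    """
    return sum(1 for entry in spool_dir.iterdir() if entry.is_file())


def evaluate(queued: int, warn: int, crit: int) -> tuple[int, str]:
    """Map a queue size to an (exit code, status message) pair."""
    if queued >= crit:
        return CRITICAL, f"CRITICAL - {queued} unprocessed mails"
    if queued >= warn:
        return WARNING, f"WARNING - {queued} unprocessed mails"
    return OK, f"OK - {queued} unprocessed mails"


def main(argv: list[str]) -> int:
    """Entry point: argv is [program, SPOOL_DIR, WARN, CRIT]."""
    if len(argv) != 4:
        print("usage: check_mail_queue.py SPOOL_DIR WARN CRIT")
        return UNKNOWN
    spool_dir = Path(argv[1])
    if not spool_dir.is_dir():
        print(f"UNKNOWN - {spool_dir} is not a directory")
        return UNKNOWN
    try:
        warn, crit = int(argv[2]), int(argv[3])
    except ValueError:
        print("UNKNOWN - WARN and CRIT must be integers")
        return UNKNOWN
    code, message = evaluate(count_queued(spool_dir), warn, crit)
    print(message)
    return code
```

To use it as a plugin, a wrapper would call `sys.exit(main(sys.argv))` and the monitoring system would run it periodically with the real spool path and thresholds that fit normal mail volume. A queue that stays above the warning threshold for several runs is the signal that the mail daemon may be stuck.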

Organizational:

  • Make it explicit to other teams that, when requesting services, they should file a ticket. Also list what details are needed for these tickets.
  • Set a deadline on new requests close to the event: a failure to plan should not be our problem
  • Make sure a team is responsible for the content, programming and design of the main website. Sysadmin maintains services, it does not build websites.
    • Note: Having people overlap between these teams is fine, but having this responsibility within Sysadmin isn't

Event:

  • Ensure web services have an appropriate session lifetime
  • Power CAN and WILL fail in the field: either mains power or a UPS is REQUIRED for on-site presence
  • Ensure NOC has backup power on critical network infrastructure from our POV
    • Hosting services on-site is of no use if the network hosting it goes down
  • Request a proper office space on the field (desk + proper chair, maybe bring a display/keyboard/mouse?). Handling incidents in Heaven etc. is not a good alternative.