Team:Sysadmin/Lessons Learned

From MCH2022 wiki

Team:Sysadmin Lessons learned

Team:

Pretalx is moody, CPU-heavy, and its program (static schedule) exporter does not function properly. Do not use it at the next event, or provide it without any support from Sysadmin's side.
Set up multiple networks beforehand to separate sysadmin/non-sysadmin services.
- We should be able to add more networks on demand.
Set up monitoring as one of the earlier services
During early setup, outline a policy on semi-managed/unmanaged VMs for other teams
During early setup, outline a policy on private email boxes on the event domain
Explicitly tell teams they are banned from using non-MCH mailboxes for visitor-facing inbound mail addresses
- Team:Fire/First-Aid/Security used a shared Gmail mailbox during the event
Zammad's mail daemon sometimes gets stuck: set up monitoring on the unprocessed queue size
Keycloak only handles authentication on its own side, not authorization. 2FA also seems to be an afterthought: use something else next time
Consider running web services (e.g. Engelsystem) in a cluster, for example K8s
- At the time of writing:
  - Engelsystem is not yet capable of handling read-only MySQL nodes...
  - MediaWiki supports running with read-only MySQL nodes.
Monitor usage of services
- No usage means decommissioning them, which decreases the maintenance burden
Outline a change authorization flow and agree upon it with PL.
- e.g. 1st line contacts of teams are allowed to authorize changes for their team's services. Make sure these 1st line contacts are defined/approved by PL (not by the team itself) and are limited in number (for example: max 2 for larger teams, otherwise 1).
Have frequent team meetings, e.g. monthly, as a check-up on progress
Ensure everyone active in the team has the same level of responsibility
Be wary of latecomers in the team: they still need to get familiar with the infra, which is hard to do on short notice.
Be prepared for hardware failure or unscheduled downtimes: datacenter outages can and will happen
- e.g. have a presence in another datacenter for mission-critical things such as mail
During the event set up a standby schedule
Keep in mind we need to stretch VLANs between hypervisors
- Use VXLAN w/ Open vSwitch for example, or (maybe better?) BGP-EVPN w/ VXLAN
  - VXLAN with classic Linux Bridges (on one of the hosts) seems to have some performance (throughput) issues. Investigate before considering a mixed set-up
Try to minimize the maintenance burden of individual services: look into things such as Docker.
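
As an illustration of the Zammad item above (monitoring the unprocessed queue size), here is a minimal Python sketch of a stuck-queue check. It assumes the queue size is sampled periodically by some other means; how to read it from Zammad is deliberately left out. The class name `QueueWatch` and the thresholds `max_size` and `stall_samples` are hypothetical and would need tuning.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueueWatch:
    """Tracks recent queue-size samples and flags a probably stuck queue."""

    max_size: int = 50       # hypothetical: above this many unprocessed mails is critical
    stall_samples: int = 3   # hypothetical: window of samples used to detect a stall
    history: deque = field(default_factory=deque)

    def observe(self, size: int) -> str:
        """Record one sample and return 'ok', 'warning' or 'critical'."""
        self.history.append(size)
        # Keep only the most recent `stall_samples` samples.
        while len(self.history) > self.stall_samples:
            self.history.popleft()

        if size > self.max_size:
            return "critical"

        recent = list(self.history)
        window_full = len(recent) == self.stall_samples
        # Stalled: the queue was non-empty for the whole window and never shrank.
        stalled = (
            window_full
            and recent[0] > 0
            and all(later >= earlier for earlier, later in zip(recent, recent[1:]))
        )
        return "warning" if stalled else "ok"


if __name__ == "__main__":
    watch = QueueWatch()
    for sample in [0, 4, 4, 5, 2, 80]:
        print(sample, watch.observe(sample))
```

The stall check matters more than the absolute threshold here: a daemon that is stuck shows up as a small queue that never drains, long before the queue becomes large.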

Organizational:

Make it explicit to other teams that, when requesting services, they should file a ticket. Also list what details are needed for these tickets.
Set a deadline on new requests close to the event: a failure to plan should not be our problem
Make sure a team is responsible for the content, programming and design of the main website. Sysadmin maintains services, it does not build websites.
- Note: Having people overlap between these teams is fine, but having this responsibility within Sysadmin isn't

Event:

Ensure web services have an appropriate session lifetime
Power CAN and WILL fail in the field: either mains power or a UPS is REQUIRED for on-site presence
Ensure NOC has backup power on critical network infrastructure from our POV
- Hosting services on-site is of no use if the network hosting it goes down
Request a proper office space on the field (desk + proper chair, maybe bring a display/keyboard/mouse?). Handling incidents in Heaven etc. is not a good alternative.

Retrieved from "https://wiki.mch2022.org/index.php?title=Team:Sysadmin/Lessons_Learned&oldid=18444"