
Job Information
Space Exploration Technologies Corp. Kubernetes Platform Site Reliability Engineer (Starlink) in Redmond, Washington
SpaceX was founded under the belief that a future where humanity is out exploring the stars is fundamentally more exciting than one where we are not. Today SpaceX is actively developing the technologies to make this possible, with the ultimate goal of enabling human life on Mars.

KUBERNETES PLATFORM SITE RELIABILITY ENGINEER (STARLINK)

At SpaceX we're leveraging our experience in building rockets and spacecraft to deploy Starlink, the world's most advanced broadband internet system. Starlink is the world's largest satellite constellation and is providing fast, reliable internet to 5M+ users worldwide. We design, build, test, and operate all parts of the system: thousands of satellites, consumer receivers that allow users to connect within minutes of unboxing, and the software that brings it all together. We've only begun to scratch the surface of Starlink's potential global impact and are looking for best-in-class engineers to help maximize Starlink's utility for communities and businesses around the globe.

As an engineer focused on Starlink's software and network infrastructure, you will design, operate, and scale the infrastructure we use to run the world's largest satellite constellation and manage a network that handles millions of daily users worldwide. These positions cover a variety of areas, ranging from Site Reliability Engineering to Developer Operations to our internal Kubernetes platforms. You will develop automation to deploy and manage on-premise compute resources, create highly scalable and maintainable software products, and collaborate directly with engineering teams across the company.

RESPONSIBILITIES:
Develop automation to deploy and manage on-premise Kubernetes clusters
Deploy and manage core infrastructure such as databases, monitoring, and distributed storage
Collaborate closely with software engineers to create highly scalable, operable, and maintainable products
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
Monitor and alert on supporting systems to maintain high availability
Perform hands-on integration and troubleshooting across the entire Starlink stack
Identify areas for improvement and create innovative solutions that enable high system availability

BASIC QUALIFICATIONS:
Bachelor's degree in computer science, information systems/IT, or an engineering discipline and 1+ years of professional experience in Site Reliability Engineering or DevOps; OR 3+ years of professional experience in Site Reliability Engineering or DevOps in lieu of a degree
1+ years of professional experience with Linux operating systems
Experience with Terraform, Ansible, or other infrastructure tools
Experience with containerization technologies (e.g., OCI containers, Kubernetes)
Experience scripting in Bash, Python, or other similar languages
Development experience in Python, C++, or Go

PREFERRED SKILLS AND EXPERIENCE:
1+ years of experience with Python and Python-based development frameworks
Experience managing Kubernetes clusters, not just using them
Knowledge of the Linux boot process and systems configuration
Deep understanding of testing, continuous integration, build, deployment, and continuous monitoring
Understanding of relevant build technologies, such as Bazel and Makefiles
Focus on performance bottlenecks and performance improvement techniques
Understanding of distributed databases and data modeling
Experience with automatically managing dozens, hundreds, or thousands of servers (e.g., with Terraform or Ansible)
Strong networking knowledge of TCP/IP
Excellent communication skills, with the ability to communicate with customers, peers, management, etc. in both formal and informal situations

ADDITIONAL REQUIREMENTS:
Must be willing to work extended hours and weekends as needed

COMPENSATION AND BENEFITS: