We are hiring a Site Reliability Engineer who will build reliable, high-capacity, and well-performing systems in support of our mission to reimagine learning for millions of students and learners worldwide.
As a Site Reliability Engineer, you will care about telemetry, cost, security, performance, and reliability in infrastructure. You will collaborate in a DevOps model with product development teams, designing, deploying, and managing automation tools that increase predictability and shorten time to market while reducing cost.
Code: Java, PHP, Node.js, and Go
RDBMS: Oracle, PostgreSQL, MySQL
Cache: Couchbase, Redis, ElastiCache, DynamoDB
Containers: ECS, K8S, Docker
Cloud: AWS
Telemetry: New Relic, CloudWatch
Build: Jenkins, CircleCI, GitHub Actions
Run: PagerDuty, Exigence
Hands-on design, analysis, development, and troubleshooting of highly distributed large-scale production systems and event-driven, cloud-based services
Ensure repeatability, traceability, and transparency of our infrastructure automation (infrastructure-as-code, monitoring-as-code)
Participate in continual learning of the AWS ecosystem, game-day scenarios, and professional conferences
Collaborate with development teams on enterprise application solutions built on our software stack
Produce Base AMIs and rotate/patch all hosts every 30 days
Actively monitor AWS Cost Explorer and utilize optimizer to decrease costs while maintaining Service Level Objectives
Observability Engineering
Ownership of reliability, uptime, system security, cost, operations, capacity, and resiliency, including performance analysis of each
Define, monitor, and report on service level indicators for application workloads
Support on-call rotations for operational duties that have not been addressed with automation, with an eye for correcting issues that result in on-call alarms
Maintain telemetry that improves visibility into our applications' performance and business metrics, and keep operational workload in check
Develop, communicate, collaborate on, and monitor standard processes to promote the long-term health and sustainability of operational development tasks
Support healthy software development practices, including complying with agile software development methodology, building standards for code reviews, work packaging, and continuous delivery
Partner with CyberSecurity and develop plans and automation to respond to new risks and vulnerabilities
Collaborate with Systems Admins to coordinate middleware, network, storage, database, Windows, Linux, and VMware maintenance
Automate legacy on-prem system maintenance and migrate to the cloud via thoughtful redesign
Collaborate with dev teams to identify failure points and blast radius of systems
Validate the effectiveness of monitoring and observability configurations
Coordinate failure injection testing
Observe and document steady-state production levels and growth patterns
Plan and forecast for seasonal growth, communicate trend lines with leadership, enhance infrastructure scaling plans to accommodate 2x planned load
Coordinate improvements of existing software and infrastructure to meet resiliency goals
Experience as a software engineer, with practical experience developing, debugging, and deploying enterprise applications
Experience with infrastructure automation technologies (like Terraform, Puppet, Ansible)
Expertise in container/container-fleet-orchestration technologies like ECS or Kubernetes
Cloud and container-native Linux administration/build/management skills (AWS AMIs, Packer, etc.)
Versatility with troubleshooting diverse sets of hosting technologies strongly desired. These include web server platforms, application platforms, operating systems, network components, virtualization technologies, storage, and database platforms.
Expertise with continuous-deployment based software development lifecycles (e.g. CI/CD)
Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora)
Experience with application caching strategies and high concurrency workloads
Expertise with Lean/Agile deployment processes (Blue/Green, ZDT, Canary, load balancers/DNS strategies)
Familiarity with telemetry SaaS systems like New Relic
Strong problem solving, root cause analysis, and systems engineering skills
Excellent presentation and communication skills
Ability to design and manage escalation response plans driven by monitoring, and to react, respond, remediate, and retrospect in culturally aligned (proactive, customer-focused, collaborative, data-driven) ways
Demonstrated expertise building and managing highly scaled production infrastructure in the cloud (AWS required; GCP, Azure, OpenStack a plus)
Expertise with SDLC branching, SCM, and code deployment systems (git/git-flow, Jenkins, CircleCI, TravisCI, etc.)
Nice to Have
Ability to translate between development, operations, security, product, and management dialects is a highly sought skill
Ability to translate knowledge and ideas into written documentation