Dealer.com
As a member of the SRE team at Dealer.com, you will bring a collaborative style to efforts that raise the maturity of engineering practices across all agile teams delivering our products. The tools and use cases are diverse, and our challenge is to increase development velocity and application stability by optimizing various parts of the pipeline. We look to instill core SRE practices in the engineering teams, including measuring SLIs/SLOs, increasing visibility and observability through monitoring tools, guiding chaos engineering efforts to improve overall resiliency, and leading Gameday/Production Readiness reviews across all engineering disciplines. We’re experts in AWS, and we use cutting-edge tools developed in-house alongside open-source software to enable teams to deploy faster with zero downtime.
We are looking for a Senior Site Reliability Engineer who is passionate about meeting customer expectations through smart engineering that produces resilient improvements to platform reliability and reduces burden.
Primary Responsibilities And Essential Functions
As a Senior Site Reliability Engineer at Cox Automotive you will:
- Design and assist in the setup and maintenance of application monitoring, alerting, and insights
- Facilitate Gamedays and Production Readiness reviews to continue increasing resiliency in our applications
- Reduce mean time to identify (MTTI) by helping teams create dependency
- Reduce mean time to recovery (MTTR) by helping teams troubleshoot, monitor, alert, and automate recovery
- Improve mean time between failures (MTBF) by helping teams define SLIs/SLOs and prioritize proactive investment tasks
- Have a natural tendency to avoid toil and want to automate it away
- Take complex and possibly ill-defined problems and come up with technically reasonable solutions
- Take ownership of processes or solutions that can be shared across teams globally
- Build and roll out solutions to be consumed by multiple teams
- Have innate curiosity about how things work
- Engage with product/capability engineering teams to ensure best practices are implemented
- Improve predictability and reliability of software releases, workflows, and operating software
- Provide consulting expertise in AWS, cloud design, and operations
- Identify new technologies that can improve our area of responsibility, design and conduct proofs-of-concept, and communicate results throughout the organization
Minimum Qualifications
- Bachelor’s degree (4-year degree program) in Computer Science or related field, and 4+ years of relevant experience.
- Expertise in analyzing metrics and logs of large-scale distributed systems and setting alerts on them
- Ability to debug, optimize code, and automate routine tasks
- Systematic problem-solving approach, coupled with effective communication skills and a sense of drive
- Understanding of Linux/Windows operating systems
- Experience with Python, PowerShell, or related scripting languages
- Experience rolling out highly available, mission-critical applications
- Experience with Git version control and branching strategies
- Experience with cloud computing platforms (AWS, Kubernetes)
- Experience in release engineering / automation with cloud environments
- Experience with security and network / distributed computing concepts
- Experience with continuous integration tools (Jenkins, GitHub Actions, Artifactory preferred)
- Experience with database server infrastructure (RDS, Aurora, DynamoDB, Oracle, etc.)
- Experience with agile development, continuous integration, and automated testing
- Experience with Infrastructure as Code (Terraform or AWS CDK)
- Excellent written communication, problem solving, and process management skills