Dealer.com
Site Reliability Engineering (SRE) is about driving a culture focused on empowering our Product and Engineering teams to deliver the reliable outcomes our customers expect. We are looking for someone who has passion and vision for what it means to lead Product and Engineering toward a culture focused on reliability, who has lived the challenges of influencing others in this space, and who enjoys working with and enabling our teams while aligning with and influencing our Enterprise Strategy.
As Director of Site Reliability Engineering, you will play a critical part in both establishing and executing our site reliability strategy and F&I. You will lead a team of engineers focused on partnering with and influencing our Product, Architecture, and Engineering teams to deliver on the strategy and establish SRE practices. In addition to establishing SRE practices, you will lead a team of L1.5 support engineers who provide timely support to our customers and act as a liaison between the level 1 customer service department and engineering.
You will need a strong engineering background with DevOps and formal SRE experience. Extensive technical knowledge in the development, delivery, and implementation of critical solutions, along with expertise in the value and principles of SRE (SLI/SLO, Error Budgets, Toil, Observability, Release Engineering), is a must to be successful in this role. You will have a demonstrated ability to develop, communicate, and execute your vision, resulting in the adoption of practices, tooling, and mentorship that will ultimately strengthen and maintain our leadership in the automotive marketplace.
Responsibilities
- Establish a comprehensive strategy and roadmap in partnership with the enterprise SRE team. This involves continually defining reliability goals, then measuring and improving services against them.
- Build a team to execute on the roadmap to enforce automation, monitoring and resiliency.
- Establish SLAs and SLOs, formalize them, and track performance against them in partnership with vendors, application teams, infrastructure teams, and business stakeholders.
- Evaluate the current service tiers of our applications, along with our reliability standards and practices, and define steps to continuously improve them.
- Conduct blameless postmortems on priority incidents affecting top-tier critical applications.
- Promote best practices that improve our service levels, and present recommendations with strong justification for funding approval.
- Create dashboards and reports to communicate key metrics.
- Establish and lead a community of practice to foster continuous improvement of system performance and reliability and to share knowledge and lessons learned across the IT organization.
- Own level one production support by working with and developing a team of engineers in an onsite/offshore model.
- Create and enforce site reliability standards, and work with the Infrastructure organization and Application delivery teams to continuously improve production stability and resilience while reducing our risk profile over time.
- Collaborate with development teams to promote the concept of reliability engineering during all phases of the software development lifecycle to detect and correct performance issues and meet availability goals.
- Identify, evaluate, and recommend monitoring tools and diagnostic techniques to improve system observability.
- Perform analytics on previous incidents to understand root causes and better predict and prevent future issues.
- Draft, implement, and execute policies and procedures to facilitate a quality level 2 customer service experience.
- Establish performance metrics for level 1.5 customer service representatives.
- Establish service levels and requirements for the department.
- Develop and implement methods to record, assess, and analyze customer feedback.
- Act as a liaison between the level 1 customer service department and engineering.
Qualifications
- Bachelor’s degree in Computer Science, Information Technology or a relevant field
- Minimum of five (5) years of management experience
- Minimum of ten (10) years of experience managing large-scale digital software systems
- Understanding of DevOps and SRE best practices
- Expertise in cloud development and deployment technologies, including containerization and multi-cloud configurations
- Knowledge of common persistence frameworks