MetLife’s Application Maintenance & Support (AMS) organization is responsible for maintaining and improving a global portfolio of diverse applications. Within AMS, the Application Improvement team is empowered to proactively identify and implement changes that improve quality and stability in production. We are looking for pure technologists who have a track record of using cutting-edge technologies to identify opportunities to improve applications that run in very complex environments. We are seeking a Site Reliability Engineer to join the team. SREs are a rare mix of sysadmins and development engineers, and as such can understand and explain the effect of product architecture decisions on an application's ability to run as a distributed system. They are driven by professional curiosity and a desire to develop a deep understanding of their services and the technologies they depend upon.
Defining, implementing and responding to key business transaction alerts and performing triage to diagnose and resolve critical issues.
Analyzing problems in the environment with a focus on restoring service now and automating prevention in the future.
Communicating across the organization to maintain awareness, direct restoration efforts and identify the right subject-matter experts for the situation.
Working closely with your team lead and other staff to become educated on the wide array of systems and their interdependencies.
Creating apps or scripts to automate operational tasks and incorporate the solutions into our platform.
Collaborating with support and development teams.
Identifying existing metrics and monitoring systems to prevent future production incidents.
Speaking with confidence during conference calls of 50+ participants.
Identifying opportunities to build innovative tools and solve unique operations problems.
Prioritizing problems, incidents and service requests using business understanding; advising on break-fix coding and documenting break-fix code changes.
Senior Site Reliability Engineer
Essential Business Experience and Technical Skills:
5+ years of problem-solving and debugging experience across a variety of integrated platforms
Self-motivated, with the desire and ability to explore and understand difficult concepts inside large architectures, the aptitude to be a good team player, and the willingness to learn and implement new technologies as needed
Expert in using application performance monitoring tools (AppDynamics or other) to identify recurring patterns of reliability and performance issues and provide proactive tuning recommendations for fixing them
Ability to dive into unfamiliar code and root-cause problems; a strong object-oriented design and programming background using all the latest technologies is a must
Expert knowledge of large-scale web operations, web-based Java/J2EE architectures and JVM configurations
Ability to understand interrelated production problems: you must have the willingness and aptitude to learn what business needs the applications we support serve, and then relate that to the overall technological environment
Bachelor’s degree required; a major in a related field is preferred
Experience contributing to architecture design reviews to improve scalability, reliability, capacity and performance
Background identifying and driving opportunities to improve automation
Ability to scale systems sustainably through mechanisms like automation, and to evolve systems by pushing for changes that improve reliability and velocity
Experience with cloud, SaaS, PaaS, IaaS and containers
Understanding of DevOps Culture
Experience working with databases such as Oracle and MongoDB