Mid Site Reliability Engineer

Job Locations: RO-B-Bucharest
Job area: IT & Digital
Employment type: Permanent
Workplace: Hybrid

Overview

Expleo is a global engineering, technology, and consulting service provider that partners with leading organizations to guide them through their business transformation, helping them achieve operational excellence and future-proof their businesses.

Expleo benefits from more than 50 years of experience developing complex products in automotive and aerospace, optimizing manufacturing processes, and ensuring the quality of information systems. Leveraging its deep sector knowledge and wide-ranging expertise in fields including AI engineering, digitalization, automation, cybersecurity and data science, the group’s mission is to fast-track innovation through each step of the value chain.

With a presence in 30 countries, Expleo operates excellence centers around the world, including in Romania since 1994.

Responsibilities

Monitoring and Reliability:

Monitor the performance, availability, and reliability of systems and applications, ensuring adherence to SLAs.
Implement and maintain monitoring and alerting systems to proactively identify issues.
Maintain a holistic end-to-end view of services: application, underlying infrastructure, and all dependencies.
Ensure services are installed, configured, changed, and operated in compliance with regulatory, security, and service level requirements.
Drive automation improvements across the CI/CD pipeline and operational processes.
Participate in and validate service designs and changes to maintain quality of operations.

Incident Management:

Lead incident response efforts during outages or major incidents, coordinating with cross-functional teams.
Conduct post-mortem analyses to identify root causes and implement corrective actions.
Be available for on-call activities outside of business hours (stand-by).

Reporting and SLA Management:

Ensure that second-line support has all the information it needs to manage incidents and problems.
Gather and analyze metrics and events from infrastructure and applications for capacity planning, performance improvements, and incident analysis.
Analyze communication and ticketing tools to define and implement a common approach.

Qualifications

Solid experience with Linux, PostgreSQL, and high-availability/disaster-recovery (HA/DR) technologies.
Proficiency with cloud management tools: Terraform, Puppet, GitLab.
Experience with monitoring tools: Prometheus, Grafana, ELK (Elasticsearch, Logstash, Kibana).
Experience with CI/CD pipelines and network management (debugging network issues).
GCP knowledge (recommended).
Experience with other cloud providers is a plus (e.g., AWS, Azure).
Fluent in English.

Benefits

Benefit Platform
Holiday Voucher
Private medical insurance
Performance bonus
Easter and Christmas bonus
Employee referral bonus
Bookster subscription
7card
Work from home options depending on project
