Technical Skills
Must‑Have
Observability & Reliability Engineering
· Strong hands‑on experience across core observability pillars including metrics, traces, service health and distributed systems visibility
· Practical experience implementing OpenTelemetry across application, platform and infrastructure layers
· Ability to design, deploy and operate end‑to‑end observability pipelines (collector‑to‑backend, agent management, data flows, routing and filtering)
· Strong understanding of SLI/SLO frameworks, error budgets and reliability‑focused operating models
· Experience defining alerting strategy, tuning thresholds and reducing operational noise through effective signal engineering
Observability Platforms & Tooling
· Hands‑on expertise in one or more enterprise‑grade observability platforms (Dynatrace, Splunk Observability, Datadog or equivalent)
· Proficiency with Prometheus ecosystem components including Alertmanager
· Experience designing clear, insightful dashboards and visualisations using Grafana
· Strong troubleshooting capability using metrics, traces and dependency insights to diagnose performance and availability issues
Cloud & Platform Monitoring
· Strong technical experience with at least one major public cloud (AWS, Azure or GCP)
· Monitoring fundamentals across cloud‑native services including compute, storage, networking, load balancers and managed services
· Solid understanding of cloud networking constructs (VPC/VNet, subnets, routing, NAT, firewalls and security groups)
Containers & Kubernetes
· Working knowledge of Kubernetes objects (pods, services, deployments) and operational lifecycle
· Experience monitoring containerised/app‑modernisation workloads
· Basic experience with Helm or Kustomize for packaging, configuration and deployment
· Ability to troubleshoot application behaviour and platform-level issues within container environments
Programming & Automation
· Proficiency in one or more languages (Python, Go, Java) to support automation and tooling
· Experience writing automation scripts and utilities supporting observability and SRE practices
· Awareness of how to integrate observability checks into CI/CD pipelines
· Comfort with shell scripting for diagnostics and operational tasks
Data & Analytics
· Strong understanding of time‑series data and telemetry characteristics
· Hands‑on experience with PromQL, SignalFlow, Metrics Explorer or equivalent query languages and tools
· Ability to analyse latency percentiles (p95/p99), error rates and throughput metrics
· Working knowledge of SQL for querying telemetry backends or data stores