Software Engineer

at Caterpillar

Caterpillar | Bangalore, Karnataka | Posted 2026-06-04

Job description

Career Area: Technology, Digital and Data

Job Description:

Your Work Shapes the World at Caterpillar Inc.

When you join Caterpillar, you're joining a global team who cares not just about the work we do – but also about each other. We are the makers, problem solvers, and future world builders who are creating stronger, more sustainable communities. We don't just talk about progress and innovation here – we make it happen, with our customers, where we work and live. Together, we are building a better world, so we can all enjoy living in it.

- Own production reliability for assigned services through proactive monitoring, alerting, and operational excellence.
- Participate in 24x7 on-call rotation, leading P1/P2 incident triage, stabilization, and resolution.
- Ensure adherence to SLOs, SLIs, SLAs, and availability targets.

Alerting & Monitoring

- Design, implement, and tune actionable alerts to reduce noise and false positives.
- Build and maintain alerting using tools such as:
  - Datadog / Dynatrace / AppDynamics / Broadcom
  - CloudWatch, Azure Monitor
  - Synthetic monitoring tools (ThousandEyes or equivalents)
- Create and maintain operational dashboards for application, infrastructure, and business KPIs.
- Drive alert rationalization and standardization across teams.

Incident Management & RCA

- Lead or contribute to Root Cause Analysis (RCA) and Post-Incident Reviews (PIRs).
- Perform event correlation across metrics, logs, traces, and deployments.
- Identify recurring issues and partner with engineering teams for permanent fixes.
- Produce clear RCA documentation including timeline, impact, root cause, and corrective actions.

Observability & Tooling

- Implement and operate observability platforms covering:
  - Metrics, logs, traces
  - Service topology and dependency mapping
- Work with OpenTelemetry-based pipelines where applicable.
- Improve visibility into upstream/downstream dependencies.
- Support onboarding of applications into standard SRE tooling and frameworks.

Automation & Toil Reduction

- Identify manual and repetitive operational tasks and automate them using scripting or workflows.
- Contribute to self-healing and auto-remediation solutions.
- Improve MTTR through automation, runbooks, and tooling enhancements.

Collaboration & Governance

- Work closely with application teams, platform teams, and cloud engineers.
- Review application designs from a reliability and operability perspective.
- Contribute to SRE standards, best practices, and documentation.

Required Skills & Experience

Core Experience

- 5–6 years of experience in SRE, DevOps, Production Support, or Platform Engineering
- Strong experience handling production incidents (P1/P2) and RCAs

Monitoring & Alerting

- Hands-on experience with monitoring and alerting tools such as:
  - Datadog, Dynatrace, AppDynamics, Broadcom
  - CloudWatch, Azure Monitor
  - Synthetic monitoring tools (ThousandEyes or similar)
- Experience designing noise-free, service-impact-based alerts

RCA & Troubleshooting

- Strong skills in log analysis, metric correlation, and distributed tracing
- Experience performing structured RCAs and postmortems
- Understanding of incident patterns, failure modes, and resilience

Cloud & Infrastructure

- Experience with AWS and/or Azure
- Working knowledge of containers and orchestration (ECS/EKS/Kubernetes)
- Experience with databases (Postgres, Oracle, or similar)

Automation & Programming

- Proficiency in at least one scripting language: Python, Bash, or JavaScript
- Familiarity with CI/CD pipelines and IaC concepts (Terraform, CloudFormation – good to have)

Nice to Have

- Experience with OpenTelemetry
- Exposure to AIOps / event correlation / AI-assisted RCA
- Experience with service maps, dependency graphs, and topology modeling
- Prior experience supporting mission-critical or customer-facing platforms

Behavioral & Soft Skills

- Strong problem-solving and analytical mindset
- Clear communication during high-pressure incidents
- Ability to collaborate across engineering, product, and operations teams
- Ownership mindset with a focus on long-term reliability over short-term fixes

What Success Looks Like in This Role

- Reduced alert noise and faster incident detection
- Improved MTTR and fewer repeat incidents
- High-quality RCAs leading to permanent improvements
- Strong operational readiness of onboarded applications

Posting Dates: June 4, 2026 - June 4, 2026

Caterpillar is an Equal Opportunity Employer. Qualified applicants of any age are encouraged to apply.

Not ready to apply? Join our Talent Community.
