Staff Platform Engineer, AI/ML Infrastructure

at Pfizer

Pfizer | 10 Locations | Posted 2026-06-04

Job description

Staff Platform Engineer, AI/ML Infrastructure

Department: AI Software & Operations

Role Summary

The Staff Platform Engineer, AI/ML Infrastructure will provide technical leadership for the cloud platforms, deployment systems, and operational foundations that power enterprise-scale generative AI applications.

This role will define and evolve the infrastructure architecture for AI/ML platforms running across AWS, Kubernetes, serverless, and containerized environments. The engineer will lead platform standards for reliability, scalability, observability, CI/CD, security, and developer enablement, while partnering closely with software engineering, AI engineering, security, and operations teams.

The ideal candidate combines deep hands-on cloud engineering experience with staff-level technical influence. They are comfortable designing infrastructure patterns, writing infrastructure-as-code, improving delivery pipelines, mentoring engineers, and making architectural decisions that raise the operational maturity of AI platforms across multiple teams.

Key Responsibilities

- Define and drive the technical strategy for AI/ML platform infrastructure supporting generative AI applications, LLM integrations, model routing, and enterprise AI services.
- Architect, build, and operate scalable cloud platforms using AWS services such as EKS, ECS Fargate, Lambda, DynamoDB, S3, OpenSearch, Secrets Manager, CloudWatch, ALB, and MWAA.
- Establish reusable infrastructure patterns using CloudFormation, Helm, and Terraform to support reliable multi-environment and multi-region deployments.
- Lead CI/CD architecture using GitHub Actions, reusable workflows, OIDC-based AWS authentication, automated quality gates, deployment promotion, and environment approvals.
- Design and improve observability across AI platforms, including CloudWatch dashboards, logs, alarms, Prometheus/Grafana, OpenSearch, Langfuse, and LLM-specific operational metrics.
- Build platform capabilities for GenAI workloads, including model availability monitoring.
- Partner with software engineering teams to improve deployment reliability, rollback strategies, health checks, autoscaling, load testing, and runtime performance.
- Define and enforce security and compliance practices for infrastructure, including IAM permission boundaries, Secrets Manager usage, secret scanning, audit logging, tagging standards, and change-management controls.
- Provide technical leadership for cost optimization, capacity planning, environment standardization, and operational resilience across development, test, production, and sandbox environments.
- Mentor engineers, review architecture and infrastructure designs, and influence platform engineering practices across teams.

Basic Qualifications

- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field, or equivalent practical experience.
- 7+ years of experience in DevOps, platform engineering, cloud infrastructure, site reliability engineering, or software engineering roles.
- Strong hands-on experience with AWS/Azure/GCP infrastructure and services, including container, serverless, networking, storage, observability, and security services.
- Experience designing and operating production systems on Kubernetes, ECS/Fargate, or comparable container orchestration platforms.
- Proficiency with infrastructure-as-code, especially CloudFormation, Terraform, Helm, or similar tooling.
- Strong CI/CD experience with GitHub Actions or similar platforms, including reusable workflows, automated testing, deployment gates, and cloud authentication.
- Experience building and operating observability solutions using CloudWatch, Prometheus/Grafana, OpenSearch, or similar tools.
- Strong understanding of cloud security practices, IAM, secrets management, least-privilege access, audit logging, and compliance requirements.
- Experience supporting distributed systems, microservices, APIs, asynchronous workloads, and multi-environment deployments.
- Demonstrated ability to lead technical design, mentor engineers, and influence engineering practices across teams.

Preferred Qualifications

- Experience supporting AI/ML or generative AI platforms, including LLM gateways, model routing, prompt observability, token metering, or model failover.
- Experience operating platforms in regulated enterprise environments, ideally healthcare, pharmaceutical, finance, or life sciences.
- Experience with multi-account, multi-region AWS architectures and enterprise governance patterns.
- Experience with cost optimization, autoscaling strategies, capacity planning, and cloud budget monitoring.
- Experience with load testing and performance validation using tools such as Locust or comparable frameworks.
- Strong Python or scripting skills for platform automation, operational tooling, and CI/CD extensions.
- Ability to communicate complex technical decisions clearly to engineering, security, operations, and leadership audiences.

Technical Environment

This role works across a modern AI platform ecosystem including:

- Cloud: AWS EKS, ECS Fargate, Lambda, DynamoDB, S3, OpenSearch, CloudWatch, Secrets Manager, ALB, VPC, IAM
- Infrastructure-as-Code: CloudFormation, Helm, Terraform
- CI/CD: GitHub Actions, reusable workflows, OIDC federation, environment approvals, automated release promotion
- AI/ML Platform: AWS Bedrock, Azure OpenAI, LiteLLM, Langfuse
- Observability: CloudWatch dashboards and alarms, Prometheus, Grafana, OpenSearch, Langfuse, custom metrics
- Security & Governance: IAM permission boundaries, secret scanning, audit logging, tagging compliance, change-management automation
- Engineering Practices: Docker, Python, pre-commit, automated testing, load testing, code quality gates, monorepo service standards

Leadership Expectations

As a J090 Staff-level engineer, this role is expected to operate beyond individual delivery. The engineer will identify systemic platform gaps, define technical direction, create reusable standards, and raise engineering maturity across multiple teams.

Success in this role requires strong judgment, ownership, and communication. The