ML Ops Engineer - Augury
- Company: Augury
- Location: Augury Israel
- Technologies: Python, Databricks, MLflow, Airflow
Job description
Design and evolve production MLOps capabilities across the full ML lifecycle, including datasets, features, models, evaluations, deployments, monitoring, retraining, and feedback signals.
Build systems for experiment tracking, artifact management, reproducibility, versioning, lineage, promotion workflows, and production readiness.
Develop reusable platform tooling, golden paths, and engineering standards that improve consistency and delivery velocity across teams.
Build operational infrastructure for LLM and agentic systems, including prompts, tools, traces, evaluations, observability, safety boundaries, and production monitoring.
Design evaluation and monitoring frameworks for AI systems, covering answer quality, latency, grounding, reliability, and operational regressions.
Build and optimize large-scale training pipelines supporting heterogeneous data sources and scalable compute patterns.
Write clean, modular, production-grade Python services and platform libraries.
Drive engineering quality through automated testing, CI/CD, observability, deployment standards, and operational best practices.
5+ years of professional software engineering, MLOps, or ML platform engineering experience in production environments.
Significant experience building or owning production ML infrastructure and lifecycle systems.
Strong Python engineering skills with production-grade architecture, modular design, testing, packaging, and robust error handling.
Strong understanding of the end-to-end ML lifecycle, including training, deployment, monitoring, retraining, reproducibility, and lineage.
Experience working with large-scale data platforms such as Databricks, Spark, Delta Lake, or equivalent ecosystems.
Experience with ML platform and MLOps frameworks such as MLflow, Metaflow, Kubeflow, or equivalent ML lifecycle-management systems.
Proven ability to design reusable workflow orchestration using Airflow, Metaflow, or Databricks, covering automation, scheduling, dependency management, and production reliability.
Familiarity with operational patterns for LLMOps, AgentOps, and production AI systems.
Strong written and verbal communication skills in English.
Experience with industrial, IoT, or manufacturing platforms.
Experience with feature stores, model registries, dataset versioning, and lineage systems.
Experience with AI agents, RAG systems, production GenAI applications, or evaluation frameworks.