I build ML pipelines for processing vast amounts of data and serving hundreds of data science models with very strict QoS parameters. This blog post is not an Airflow tutorial; rather, it is about my journey building and running highly scalable and reliable ML pipelines on Kubernetes with Airflow, and it assumes some understanding of both Airflow and Kubernetes. What would we do without you, Kubernetes?

Airflow has a mature Celery executor that allows us to run tasks concurrently across multiple worker nodes. Then why do we need something more Kubernetes native?

The Celery executor requires a message broker like RabbitMQ or Redis. We probably also need Flower for monitoring the Celery workers.

Unless we are using the DockerOperator or KubernetesPodOperator (which we will discuss shortly), our DAGs and all their dependencies must be present on all the Celery worker nodes. We will either have to use a shared volume or synchronize the code across the nodes ourselves. For example, if a PythonOperator task depends on TensorFlow, we need to have TensorFlow installed on every worker node. Sure, we could automate all of this, but it is still a pain when we have to deal with thousands of Airflow tasks with different, often conflicting, dependencies.

What happens when we need to run legacy code that depends on TensorFlow 1.x alongside modern code that depends on TensorFlow 2.x? One way is to have Airflow create a virtual environment and run the code inside it using something like the PythonVirtualenvOperator. But the PythonVirtualenvOperator has severe limitations: it can only run simple functions (not object-oriented code), and start-up times are atrocious when huge Python packages like TensorFlow and PyTorch have to be installed into a fresh virtual environment every time the task is executed. Another way is to containerize our Airflow tasks (code plus dependencies) and use the DockerOperator or KubernetesPodOperator.

Targeting tasks to particular nodes in the Celery cluster

What happens when some tasks require a GPU to do their work? By default, all tasks are pushed to a default queue, and all Celery workers are configured to listen to that queue, so a task could get scheduled on any node. We have to create a separate queue for the GPU tasks, pass the newly created queue as a parameter to whatever Airflow operator we are using, and then configure the GPU nodes (which are also Celery workers) to wait for tasks on this new queue. Wouldn't it be nice if we didn't have to deal with all of these queues ourselves and could simply label a node as GPU and have Airflow schedule GPU tasks on that node?

For bursty or batch workloads, it is important to be able to quickly add worker nodes to the cluster to handle the increased load. Adding additional Celery worker nodes is straightforward, but the workers have to be configured, packages and dependencies have to be installed, virtual environments have to be created, and code has to be synced before Airflow tasks can run on the new node, all of which is cumbersome, to say the least, unless we are using the DockerOperator or KubernetesPodOperator. It is equally important to be able to easily scale the cluster back down when we are done, to reduce wasted resources and operating costs (think of all those idling GPU nodes). Otherwise, the result is "always on" Celery workers.

A common pattern emerges from these problems. Containerization (Docker images) used along with the DockerOperator or KubernetesPodOperator solves most of the major issues, but it restricts us to those two operators. Although we could do everything we ever want to do with just these two operators, we lose the expressiveness we would otherwise have in our Airflow DAGs if we could use the entire gamut of Airflow operators; we can't, for example, use the PythonOperator or BashOperator. What we need, therefore, is a way to combine the expressiveness of individual Airflow operators with the ability to run each task in an isolated, containerized environment. This is exactly what we can achieve by running Airflow on Kubernetes with the Kubernetes executor.
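To make the containerized-task approach concrete, here is a sketch of a DAG whose tasks each run in their own image, so a TensorFlow 1.x job and a TensorFlow 2.x job can coexist in one pipeline. The image names, task ids, and namespace are hypothetical, and the KubernetesPodOperator import path varies across Airflow and provider versions, so treat this as an illustrative DAG fragment rather than a drop-in file:

```python
# Illustrative DAG fragment: each task is isolated in its own container image.
# Image names, ids, namespace, and the provider import path are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="containerized_tasks", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    legacy = KubernetesPodOperator(
        task_id="legacy_tf1_job",
        name="legacy-tf1-job",
        namespace="airflow",
        image="registry.example.com/legacy-model:tf1",  # hypothetical image
        cmds=["python", "train.py"],
    )
    modern = KubernetesPodOperator(
        task_id="modern_tf2_job",
        name="modern-tf2-job",
        namespace="airflow",
        image="registry.example.com/modern-model:tf2",  # hypothetical image
        cmds=["python", "train.py"],
    )
    legacy >> modern
```

Because each task brings its own image, nothing has to be pre-installed or synced on the worker nodes themselves.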
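In Airflow, this kind of targeting boils down to setting the `queue` argument on an operator and starting the GPU workers so that they listen on that queue. The underlying routing idea can be sketched in plain Python; the class and names below are hypothetical and are not Airflow's API:

```python
# Minimal sketch of queue-based task routing, as used by the Celery executor:
# tasks carry a queue name and are only picked up by workers that subscribed
# to that queue. All names here are hypothetical.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.queues = defaultdict(list)

    def push(self, task, queue="default"):
        self.queues[queue].append(task)

    def pull(self, listen_queues):
        # A worker drains only the queues it subscribed to.
        tasks = []
        for q in listen_queues:
            tasks.extend(self.queues.pop(q, []))
        return tasks

broker = Broker()
broker.push("etl_task")                      # goes to the default queue
broker.push("train_model", queue="gpu")      # explicitly targeted at GPU workers

cpu_worker_tasks = broker.pull(["default"])  # a plain Celery worker
gpu_worker_tasks = broker.pull(["gpu"])      # a worker started on a GPU node

print(cpu_worker_tasks)  # ['etl_task']
print(gpu_worker_tasks)  # ['train_model']
```

With the real Celery executor, the equivalent is passing `queue="gpu"` to the operator and starting the worker on the GPU node with something like `airflow celery worker --queues gpu`.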
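With the Kubernetes executor, the idea of simply labeling a node as GPU and letting the scheduler do the rest can be expressed as a pod override on an individual task. The following is a sketch only: it assumes the `kubernetes` Python client is installed, that nodes were labeled beforehand (e.g. `kubectl label node <name> accelerator=gpu`, with a label key of my choosing), and `executor_config` behavior as in recent Airflow 2.x:

```python
# Illustrative fragment: with the Kubernetes executor, a task can request a
# GPU-labeled node directly via a nodeSelector, with no Celery queues to
# manage. The label key, dag/task ids, and GPU count are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

gpu_pod = k8s.V1Pod(
    spec=k8s.V1PodSpec(
        node_selector={"accelerator": "gpu"},  # hypothetical node label
        containers=[
            k8s.V1Container(
                name="base",  # the container Airflow runs the task in
                resources=k8s.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    )
)

def train():
    print("running on a GPU node")

with DAG(dag_id="gpu_scheduling", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    PythonOperator(
        task_id="train_on_gpu",
        python_callable=train,
        executor_config={"pod_override": gpu_pod},
    )
```

The task pod only starts on a node carrying the matching label, and Kubernetes tears it down when the task finishes, so there are no always-on workers to babysit.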