

In the Performance Marketing department, we run paid advertisement campaigns for Zalando. To do so, we build services that allow us to manage campaigns, optimize and distribute content, and measure the performance of the campaigns at scale. Talking about measurement, one of the core systems we've built and continuously extended over the years is our so-called marketing ROI (return on investment) pipeline.

The ROI pipeline is a batch-based data and machine learning pipeline powered by Databricks Spark and orchestrated by Apache Airflow. It consists of various sub-pipelines (components), some of which are built using our Python SDK zFlow. Examples of such components are our input data preparation, the marketing attribution model, and the incremental profit forecast for our campaigns. These components are owned and developed by different cross-functional teams (applied science, engineering, product) within Performance Marketing. You can read more about how we measure campaign effectiveness from a functional perspective in our previous blog post.
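To make the structure concrete, the sketch below shows how such components could be chained as tasks in an Airflow DAG. The DAG id, task names, and callables are hypothetical placeholders rather than our actual code (our real components are full sub-pipelines, some built with zFlow):

```python
# Minimal sketch of a component-based ROI DAG; all names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def prepare_input_data(**_):
    ...  # build the input Spark tables the downstream components consume


def run_attribution_model(**_):
    ...  # attribute conversions to marketing campaigns


def forecast_incremental_profit(**_):
    ...  # forecast incremental profit per campaign


with DAG(
    dag_id="marketing_roi_pipeline",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow >= 2.4 argument; older versions use schedule_interval
    catchup=False,
) as dag:
    prepare = PythonOperator(
        task_id="prepare_input_data", python_callable=prepare_input_data
    )
    attribution = PythonOperator(
        task_id="attribution_model", python_callable=run_attribution_model
    )
    forecast = PythonOperator(
        task_id="incremental_profit_forecast", python_callable=forecast_incremental_profit
    )

    prepare >> attribution >> forecast
```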
Problem Statement

A recurring problem we faced during development relates to the nature of the marketing ROI, which lacks a ground truth [1]. It means that while we oftentimes have assumptions on what impact a change to the input data or to one of our components has on the ROI, we require the new version of the ROI pipeline to be run end-to-end to confirm them. Since different teams work on different components of the ROI pipeline in parallel, being able to evaluate the impact of a change on the final ROI in isolation is required to work effectively. As mentioned earlier, we are using Airflow to orchestrate the overall pipeline, and the Airflow code is stored in a GitHub repository. The following section explains the problem in more depth.
We have two servers, production and test. When a pull request is opened, the Airflow pipeline is deployed to the test server; on merge to the main branch, we deploy to the production server. In this setup, we have two so-called pipeline environments: a production (live) environment and a test environment.
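A minimal sketch of this trigger-based deployment, assuming the CI system passes the target environment to a small helper script and that each Airflow server picks up its DAGs from an environment-specific S3 location. The bucket names, script, and sync mechanism are assumptions for illustration; our actual deployment may differ:

```python
# deploy_dags.py -- hypothetical CI helper: `python deploy_dags.py test` runs
# on a pull request, `python deploy_dags.py production` on merge to main.
import pathlib
import sys

import boto3

# Hypothetical S3 locations each Airflow server reads its DAGs from.
DAG_BUCKETS = {
    "test": "airflow-dags-test",
    "production": "airflow-dags-production",
}


def deploy_dags(environment: str, dag_dir: str = "dags") -> None:
    bucket = DAG_BUCKETS[environment]
    s3 = boto3.client("s3")
    for path in sorted(pathlib.Path(dag_dir).rglob("*.py")):
        # Mirror the repository layout into the environment's bucket.
        s3.upload_file(str(path), bucket, path.as_posix())


if __name__ == "__main__":
    deploy_dags(sys.argv[1])
```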

As our data layer, we are mainly using AWS S3 with data organized as Spark tables; a set of Spark tables represents a data environment. The live pipeline uses the live data environment, while the test pipeline uses the test data environment. Only one version of an Airflow DAG such as our marketing ROI pipeline can exist in each environment. When multiple features are developed at the same time, they therefore have to share the test environment, which oftentimes leads to conflicts since testing in isolation is not possible.
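The data-environment idea can be illustrated with a small PySpark sketch: the same logical table resolves to a different physical Spark table depending on which environment the pipeline runs in. The environment variable and the database naming scheme below are assumptions, not our actual convention:

```python
# Sketch: resolve logical table names against the current data environment.
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "live" for the production pipeline, "test" for the test pipeline;
# DATA_ENV and the roi_<env> database names are hypothetical.
DATA_ENV = os.environ.get("DATA_ENV", "test")


def table(name: str) -> str:
    # Each data environment is a set of Spark tables backed by S3,
    # e.g. roi_live.campaigns vs roi_test.campaigns.
    return f"roi_{DATA_ENV}.{name}"


campaigns = spark.read.table(table("campaigns"))
```

Because there is only one test data environment (and one test version of the DAG), every open feature branch reads from and writes to the same tables, which is exactly where the conflicts described above come from.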
