Workflow orchestration is fundamental to data engineering solutions, since data processing tends to be periodic and encapsulated in workflows, the ideal scenario for automation. Thus, a workflow orchestrator is the key tool for making data processing scalable.
However, as a data pipeline grows more complex and elaborate, it becomes increasingly challenging to guarantee that the dependencies between its various data-processing steps work effectively and efficiently. …
Cross-DAG dependencies may reduce cohesion in data pipelines and, without an explicit solution in Airflow or in a third-party plugin, those pipelines tend to become hard to manage. That is why we at QuintoAndar created an intermediate component, called Mediator, to handle relationships across data pipelines, keeping them scalable and maintainable by any team.
At QuintoAndar we seek automation and modularization in our data pipelines and believe that breaking them into many single-responsibility modules (DAGs) improves maintainability, reusability, and the understanding of how data moves from one point to another. However, extending interconnections between DAGs…
Thanks for sharing this article, Tomasz.
These new features in version 2.0 are definitely awesome. I'm especially excited about TaskGroups replacing SubDAGs, since SubDAGs have some weird behavior and bugs with MySQL.
I also see the XCom backend as a powerful feature, and I can imagine wanting different XCom backends for different DAGs (one DAG writing data to Google Cloud Storage and another to S3, for example).
That was my first thought when reading that section, to be honest. This is a suggestion for future releases =]
Since DBeaver does not yet have a native driver for Trino, we need to create a custom one to connect to our databases (the community is implementing it in this PR).
For this tutorial, I am using DBeaver version 7.3.2 and Trino jar version 351.
Creating a new driver
1 - Open the Driver Manager via the menu Database > Driver Manager
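When filling in the new driver, the settings below are the typical values for the Trino 351 JDBC jar; double-check the class name against the jar you actually downloaded.

```
Driver Name:  Trino
Class Name:   io.trino.jdbc.TrinoDriver
URL Template: jdbc:trino://{host}:{port}
Default Port: 8080
```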
At QuintoAndar, we have dozens of data sources that we extract from daily through specific workflows, to feed and enrich the analytical databases with information related to the daily renting and selling events of our business, such as bookings, contracts, visits, house characteristics, etc. We chose Apache Airflow as the main platform for controlling the extract-transform-load (ETL) jobs and also the machine learning pipelines.
In this post, we want to share our recent experience with this tool and parts of our current data architecture.
Airflow: it’s a breeze
Data Engineer @ QuintoAndar. Data and gin lover, always curious, amateur musician, and CrossFit rat.