SOUTHWORKS Dev Team
February 23, 2024
By Rodrigo Bergara, Basilio Farach, Nahuel Perales, SOUTHWORKS
In the realm of data processing, optimization opportunities often lie untapped. This blog post explores the transformation from a serial ETL and notification solution, relying on multithreading and pointers, to a parallelized and highly optimized system using the .NET Dataflow library.
We showcase a case study where the performance of an application processing large datasets is significantly improved by leveraging the .NET Dataflow library.
Data processing is a recurring need in the industry and, on some projects, a missed opportunity for optimization. With the .NET Dataflow library, solutions that call for code optimization through parallelism can be handled easily and in a way that makes sense from the very beginning.
We will present a scenario in which we enhance our application's performance and streamline the processing of a large dataset, around 80 million records, by using .NET Dataflow to parallelize the code.
The initial solution involves a single-threaded Load process that sequentially handles data grouping, uploading batches to storage, and finalizing the remaining data. Although parallelization could enhance this process, it introduces challenges such as data transfer, synchronization, and object locking.
In this particular situation, there exists a Load process that receives data following the Transformation process. The responsibilities of this Load class include:
Grouping the incoming data.
Uploading batches of data to storage.
Finalizing the remaining data.
Sending a notification when a batch has been loaded.
.NET Dataflow Library is a practical toolkit for handling parallel tasks without making things overly complicated, managing the technicalities of parallelization so that you can focus on other development tasks.
Imagine it as a pipeline model that streamlines the heavy lifting, allowing you to concentrate on the essential logic. What's neat? Not only does it take care of parallelization, but it also ensures smooth communication between the different blocks. You get to decide how your data moves and buffers.
Plus, it's proficient in managing threads and optimizing your computer’s capabilities for better efficiency. Whether you prefer a straightforward pipeline or want to experiment with networked graphs, Dataflow is a reliable companion that simplifies the process.
The core behavior of Dataflow blocks is to buffer and process data in different ways.
To create a Dataflow pipeline, you need to complete the following steps:
Create the blocks that make up the pipeline.
Link each block to the next one.
Post or send data to the head of the pipeline.
Mark the head block as complete when there is no more data.
Wait for the last block to finish processing.
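A minimal sketch of those steps, assuming a toy string-processing pipeline rather than the actual Load pipeline (the block names and delegates here are purely illustrative):

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class PipelineSketch
{
    static async Task Main()
    {
        // 1. Create the blocks that make up the pipeline.
        var toUpper = new TransformBlock<string, string>(s => s.ToUpperInvariant());
        var print   = new ActionBlock<string>(s => Console.WriteLine(s));

        // 2. Link each block to the next, propagating completion downstream.
        toUpper.LinkTo(print, new DataflowLinkOptions { PropagateCompletion = true });

        // 3. Post data to the head of the pipeline.
        toUpper.Post("hello dataflow");

        // 4. Signal that no more data will arrive.
        toUpper.Complete();

        // 5. Wait for the last block to finish processing.
        await print.Completion;
    }
}
```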
Prior to engaging with Dataflow, it is crucial to recognize that the underlying model relies on well-defined, decoupled tasks. This implies that the tasks within the system are carefully defined and operate independently, encouraging modularity and flexibility in the parallelized processing. Understanding this foundational aspect is key to harnessing the full potential of Dataflow, as it allows for effective coordination and execution of diverse tasks within the parallelized workflow.
Blocks can be configured to adapt to specific requirements. In this particular instance, the process involves specifying various parameters: we provide the capacity, cancellation token, and maximum degree of parallelism for both the Transform and Action blocks, and for the Batch blocks the configuration covers the maximum number of groups. This configuration approach ensures that each block operates in alignment with the desired parameters, enhancing the adaptability and efficiency of the overall Dataflow system.
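As an illustration, the configuration could be expressed roughly like this; the concrete values are placeholders, not the ones used in the real solution:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks.Dataflow;

var cts = new CancellationTokenSource();

// Options for the Transform and Action blocks: bounded capacity,
// cancellation token, and maximum degree of parallelism.
var executionOptions = new ExecutionDataflowBlockOptions
{
    BoundedCapacity = 1000,                                // placeholder buffer limit (back-pressure)
    CancellationToken = cts.Token,                         // lets the whole pipeline be cancelled
    MaxDegreeOfParallelism = Environment.ProcessorCount    // process multiple items at once
};

// Options for the Batch blocks: maximum number of groups they will produce.
var groupingOptions = new GroupingDataflowBlockOptions
{
    BoundedCapacity = 1000,         // placeholder
    CancellationToken = cts.Token,
    MaxNumberOfGroups = 100         // placeholder
};
```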
The solution with the Dataflow library includes a CustomBlock (Discriminator) that receives the data from the Transformation phase and groups each piece of data received so that processing can continue.
The custom block is composed of an ActionBlock that functions as a router to store data correctly, a BroadcastBlock to share copies of the batches, and a Dictionary of BatchBlocks to organize the data; a sketch of this composition follows the breakdown below.
Initialization: CustomBatchBlock is set up to group incoming data into batches based on batch key and ID, with a specified batch size.
Block Setup: It uses a dictionary to discriminate batches by key, an action block to process and route data, and a broadcast block to share multiple batches.
Processing: Incoming values are processed by the Action Block, ensuring each ID is processed only once, and then sent to the appropriate batch.
Completion: When processing is complete, it triggers the completion of the individual batches, ensuring all data is processed and broadcast.
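A condensed sketch of how such a Discriminator might be put together. The Item record, property names, and completion handling are assumptions made for illustration; the original implementation is not reproduced here:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical data type; the real solution's model is not shown in the post.
public record Item(string BatchKey, string Id, string Payload);

public class CustomBatchBlock
{
    private readonly int _batchSize;
    private readonly ActionBlock<Item> _router;                      // routes each item to its batch
    private readonly Dictionary<string, BatchBlock<Item>> _batches;  // one BatchBlock per key
    private readonly BroadcastBlock<Item[]> _broadcast;              // shares copies of completed batches
    private readonly HashSet<string> _seenIds = new();               // ensures each ID is handled once

    public CustomBatchBlock(int batchSize)
    {
        _batchSize = batchSize;
        _batches = new Dictionary<string, BatchBlock<Item>>();
        _broadcast = new BroadcastBlock<Item[]>(batch => batch);

        // Default MaxDegreeOfParallelism (1) keeps the dictionary and hash set single-threaded.
        _router = new ActionBlock<Item>(item =>
        {
            if (!_seenIds.Add(item.Id)) return;   // process each ID only once

            // Create the per-key BatchBlock on first use and link it to the broadcast block.
            if (!_batches.TryGetValue(item.BatchKey, out var batch))
            {
                batch = new BatchBlock<Item>(_batchSize);
                batch.LinkTo(_broadcast);
                _batches[item.BatchKey] = batch;
            }

            batch.Post(item);
        });
    }

    // Exposed as a target/source pair so the custom block can be linked like any other block.
    public ITargetBlock<Item> Input => _router;
    public ISourceBlock<Item[]> Output => _broadcast;

    public async Task CompleteAsync()
    {
        // Stop accepting input, then flush and complete every per-key batch.
        _router.Complete();
        await _router.Completion;

        foreach (var batch in _batches.Values)
        {
            batch.TriggerBatch();   // emit whatever is still buffered as a final, smaller batch
            batch.Complete();
        }
        await Task.WhenAll(_batches.Values.Select(b => b.Completion));

        _broadcast.Complete();
    }
}
```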
Inside CustomBatchBlock, a BatchBlock is used to group data by ID.
The BatchBlock was used to group data and process it in batches.
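As a standalone illustration of the BatchBlock behavior (independent of the solution above):

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Groups individual items into arrays of up to 100 elements.
var batchBlock = new BatchBlock<int>(batchSize: 100);

// The downstream block receives a whole batch at a time.
var processBatch = new ActionBlock<int[]>(batch =>
    Console.WriteLine($"Processing a batch of {batch.Length} items"));

batchBlock.LinkTo(processBatch, new DataflowLinkOptions { PropagateCompletion = true });

for (var i = 0; i < 250; i++) batchBlock.Post(i);

batchBlock.Complete();            // also flushes the final, partially filled batch
await processBatch.Completion;    // two batches of 100 and one of 50
```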
We use the Broadcast Block to share copies of the blocks.
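A minimal, self-contained example of a BroadcastBlock; the two consumers here (upload and audit) are illustrative names, not the blocks of the actual pipeline:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// BroadcastBlock hands a copy of each message to every linked target.
// The delegate defines how each copy is produced (here, the array is cloned).
var broadcast = new BroadcastBlock<int[]>(batch => (int[])batch.Clone());

var upload = new ActionBlock<int[]>(b => Console.WriteLine($"Uploading {b.Length} records"));
var audit  = new ActionBlock<int[]>(b => Console.WriteLine($"Auditing {b.Length} records"));

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
broadcast.LinkTo(upload, linkOptions);
broadcast.LinkTo(audit, linkOptions);

broadcast.Post(new[] { 1, 2, 3 });
broadcast.Complete();
await Task.WhenAll(upload.Completion, audit.Completion);
```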
The TransformBlock data structure is designed to take in a single input item, which is buffered. A delegate processes the item, and a return task then produces a buffered response. The block is not considered done with an item until the return task finishes its work.
The TransformBlock was used to manipulate data.
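A small example of a TransformBlock, with a placeholder delegate standing in for the real transformation work:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Takes one buffered input item, runs the delegate, and buffers one output item.
var transform = new TransformBlock<string, int>(async line =>
    {
        await Task.Delay(10);    // stand-in for real work (parsing, enrichment, ...)
        return line.Length;
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

var sink = new ActionBlock<int>(length => Console.WriteLine($"Length: {length}"));
transform.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

transform.Post("hello dataflow");
transform.Complete();
await sink.Completion;
```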
The TransformManyBlock has the distinctive feature of accepting a single input item and yielding zero to many results. Within this block, two tasks unfold: the first takes charge of processing the input item and generating a collection of resulting items, while the second extracts items from this collection one by one and delivers them to the output buffer. The delegates can be used both synchronously and asynchronously.
The TransformManyBlock was used to manipulate data.
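A small example of a TransformManyBlock, where one input line yields zero or more output words:

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// One input item can yield zero, one, or many output items.
var splitWords = new TransformManyBlock<string, string>(line =>
    line.Split(' ', StringSplitOptions.RemoveEmptyEntries));

var sink = new ActionBlock<string>(word => Console.WriteLine(word));
splitWords.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

splitWords.Post("dataflow makes pipelines simple");
splitWords.Complete();
await sink.Completion;
```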
An ActionBlock is used as the last step of the process: when a batch is loaded, it sends a notification at the end of the pipeline.
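A sketch of such a terminal block; NotifyBatchLoadedAsync is a hypothetical helper, since the post does not show the actual notification mechanism:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical notifier; the real notification mechanism is not shown here.
static Task NotifyBatchLoadedAsync(int recordCount)
{
    Console.WriteLine($"Batch of {recordCount} records loaded");
    return Task.CompletedTask;
}

// Terminal block of the pipeline: consumes each loaded batch and sends a notification.
var notify = new ActionBlock<int[]>(async batch => await NotifyBatchLoadedAsync(batch.Length));

notify.Post(new[] { 1, 2, 3 });
notify.Complete();
await notify.Completion;
```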
Linking the blocks to create the pipeline is really easy.
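For example, with illustrative block names rather than the actual pipeline's stages, the linking looks like this:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var parse  = new TransformBlock<string, int>(s => int.Parse(s));
var batch  = new BatchBlock<int>(batchSize: 10);
var upload = new ActionBlock<int[]>(b => Console.WriteLine($"Uploaded {b.Length} records"));

// PropagateCompletion lets Complete() on the head flow through the whole pipeline.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
parse.LinkTo(batch, linkOptions);
batch.LinkTo(upload, linkOptions);

parse.Post("42");
parse.Complete();
await upload.Completion;
```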
A visual representation of the pipeline with the used blocks.
To start the processing, data is sent to _batch, a BatchBlock that buffers the data initially.
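For instance, feeding the head block might look like this (using SendAsync so that a bounded buffer applies back-pressure instead of dropping data):

```csharp
using System.Threading.Tasks.Dataflow;

// _batch is the BatchBlock at the head of the pipeline.
var _batch = new BatchBlock<int>(batchSize: 100);

// SendAsync waits asynchronously while the block is full (honoring BoundedCapacity),
// whereas Post would simply return false.
for (var i = 0; i < 1_000; i++)
{
    await _batch.SendAsync(i);
}
```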
To end the processing, a last batch is received and used to trigger the remaining buffered data and complete the blocks.
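A sketch of that shutdown sequence, using placeholder blocks:

```csharp
using System;
using System.Threading.Tasks.Dataflow;

var batch = new BatchBlock<int>(batchSize: 100);
var sink  = new ActionBlock<int[]>(b => Console.WriteLine($"Got {b.Length} items"));
batch.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

for (var i = 0; i < 250; i++) batch.Post(i);

// Flush the last, partially filled batch and shut the pipeline down.
batch.TriggerBatch();      // emits the 50 remaining items as a final batch
batch.Complete();          // no more input; completion propagates to the sink
await sink.Completion;     // wait until everything has been processed
```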
With the help of .NET Dataflow's features, we were able to divide the tasks in the process and execute them simultaneously. This approach to parallelization enabled us to use the full potential of the available hardware resources, leading to a significant decrease in the overall execution time.

The benefits of parallelization are most noticeable in situations where the workload is inherently parallelizable. Tasks that were previously carried out sequentially can now be processed concurrently, resulting in a considerable increase in throughput. This improvement not only enhances the user experience by providing faster results but also enables the handling of larger datasets or more simultaneous requests.

Additionally, the simplicity of expressing parallelism through .NET Dataflow made the codebase more maintainable and readable. The framework's high-level abstractions simplified the implementation of parallel processing, allowing the development team to concentrate on the logic of individual tasks rather than the intricate details of managing concurrency.