Publications
Publications by category in reverse chronological order.
Journal Articles
2024
- On the Performance of Malleable APGAS Programs and Batch Job Schedulers. Patrick Finnerty, Jonas Posner, Janek Bürger, Leo Takaoka, and Takuma Kanzaki. Springer Nature Computer Science, 2024
Malleability—the ability for applications to dynamically adjust their resource allocations at runtime—presents great potential to enhance the efficiency and resource utilization of modern supercomputers. However, applications are rarely capable of growing and shrinking their number of nodes at runtime, and batch job schedulers provide only rudimentary support for such features. While numerous approaches have been proposed to enable application malleability, these typically focus on iterative computations and require complex code modifications. This amplifies the challenges for programmers, who already wrestle with the complexity of traditional MPI inter-node programming. Asynchronous Many-Task (AMT) programming presents a promising alternative. In AMT, computations are split into many fine-grained tasks, which are processed by workers. This makes transparent task relocation via the AMT runtime system possible, thus offering great potential for enabling efficient malleability. In this work, we propose an extension to an existing AMT system, namely APGAS for Java. We provide easy-to-use malleability programming abstractions, requiring only minor application code additions from programmers. Runtime adjustments, such as process initialization and termination, are automatically managed by our malleability extension. We validate our malleability extension by adapting a load balancing library handling multiple benchmarks. We show that both shrinking and growing operations incur only low execution time overhead. In addition, we demonstrate compatibility with potential batch job schedulers by developing a prototype batch job scheduler that supports malleable jobs. Through extensive executions of real-world job batches on up to 32 nodes, involving rigid, moldable, and malleable programs, we evaluate the impact of deploying malleable APGAS applications on supercomputers. Using scheduling algorithms such as FCFS, Backfilling, Easy-Backfilling, and one exploiting malleable jobs, the experimental results highlight a significant improvement regarding several metrics for malleable jobs. We show a 13.09% makespan reduction (the time needed to schedule and execute all jobs), a 19.86% increase in node utilization, and a 3.61% decrease in job turnaround time (the time a job takes from its submission to completion) when using 100% malleable jobs in combination with our prototype batch job scheduler compared to the best-performing scheduling algorithm with 100% rigid jobs.
@article{FinnertyMalleableSNCS24, author = {Finnerty, Patrick and Posner, Jonas and B\"urger, Janek and Takaoka, Leo and Kanzaki, Takuma}, title = {On the Performance of Malleable APGAS Programs and Batch Job Schedulers}, journal = {Springer Nature Computer Science}, year = {2024}, doi = {10.1007/s42979-024-02641-7}, google_scholar_id = {0EnyYjriUFMC} }
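A purely illustrative sketch of the kind of malleability abstraction described in the entry above: a handler the runtime could invoke around grow and shrink events. The interface and method names below are hypothetical and not taken from the paper.

```java
// Hypothetical handler interface, for illustration only; the actual abstractions
// of the APGAS malleability extension may use different names and signatures.
public interface MalleableHandlerSketch {
  // Invoked before new processes are started, e.g., to prepare work for redistribution.
  void preGrow(int numberOfNewPlaces);

  // Invoked after the new processes have joined the computation.
  void postGrow(int numberOfNewPlaces);

  // Invoked before places are released, e.g., to evacuate their remaining tasks.
  void preShrink(int numberOfLeavingPlaces);

  // Invoked after the places have been released.
  void postShrink(int numberOfLeavingPlaces);
}
```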
2022
- Task-Level Resilience: Checkpointing vs. Supervision. Jonas Posner, Lukas Reitz, and Claudia Fohry. Special Issue International Journal of Networking and Computing (IJNC), 2022
With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by task-based parallel programming implemented with work stealing. At the task level, resilience can be provided by two principal approaches, namely checkpointing and supervision. For both, particular algorithms have been worked out recently. They perform local recovery and continue the program execution on a reduced set of resources. The checkpointing algorithms regularly save task descriptors explicitly, while the supervision algorithms exploit their natural duplication during work stealing and may be coupled with steal tracking to minimize the number of task re-executions. Thus far, the two groups of algorithms have been targeted at different task models: checkpointing algorithms at dynamic independent tasks, and supervision algorithms at nested fork-join programs. This paper transfers the most advanced supervision algorithm to the dynamic independent tasks model, thus enabling a comparison between checkpointing and supervision. Our comparison includes experiments, running time predictions, and simulations of job set executions. Results consistently show typical resilience overheads below 1% for both approaches. The overheads are lower for supervision in practically relevant cases, but checkpointing takes over on the order of millions of processes.
@article{PosnerCheckpointingIJNC22, author = {Posner, Jonas and Reitz, Lukas and Fohry, Claudia}, title = {Task-Level Resilience: Checkpointing vs. Supervision}, journal = {Special Issue International Journal of Networking and Computing (IJNC)}, year = {2022}, volume = {12}, number = {1}, pages = {47--72}, doi = {10.15803/ijnc.12.1_47}, google_scholar_id = {_FxGoFyzp5QC} }
2019
- A Comparison of Application-Level Fault Tolerance Schemes for Task Pools. Jonas Posner, Lukas Reitz, and Claudia Fohry. Future Generation Computer Systems (FGCS), 2019
Fault tolerance is an important requirement for successful program execution on exascale systems. The common approach, checkpointing, regularly saves a program’s state, such that the execution can be restarted after permanent node failures. Checkpointing is often performed on system level, but its deployment on application level can reduce the running time overhead. The drawback of application-level checkpointing is a higher programming expense. It pays off if the checkpointing is applied to reusable patterns. We consider task pools, which exist in many variants. The paper supposes that tasks are generated dynamically and are free of side effects. Further, the final result must be computed from individual task results by reduction. Moreover, the pools must be distributed with private queues, and adopt work stealing. The paper describes and evaluates three application-level fault tolerance schemes for task pools. All use uncoordinated checkpointing and regularly save information in a resilient store. The first scheme (called AllFT) saves descriptors of all open tasks; the second scheme (called IncFT) selectively and incrementally saves only part of them; and the third scheme (called LogFT) logs stealing events and writes checkpoints in parallel to task processing. All schemes have been implemented by extending the Global Load Balancing (GLB) library of the “APGAS for Java” programming system. In experiments with the UTS, NQueens, and BC benchmarks with up to 672 workers, the running time overhead during failure-free execution, compared to a non-resilient version of GLB, was typically below 6%. The recovery cost was negligible, and there was no clear winner among the three schemes. A more detailed performance analysis with synthetic benchmarks revealed that IncFT and LogFT are superior in scenarios with large task descriptors.
@article{PosnerFaultToleranceFGCS19, author = {Posner, Jonas and Reitz, Lukas and Fohry, Claudia}, title = {A Comparison of Application-Level Fault Tolerance Schemes for Task Pools}, journal = {Future Generation Computer Systems (FGCS)}, year = {2019}, volume = {105}, pages = {119--134}, doi = {10.1016/j.future.2019.11.031}, google_scholar_id = {Tyk-4Ss8FVUC} }
2018
- Hybrid Work Stealing of Locality-Flexible and Cancelable Tasks for the APGAS Library. Jonas Posner and Claudia Fohry. The Journal of Supercomputing, 2018
Since large parallel machines are typically clusters of multicore nodes, parallel programs should be able to deal with both shared memory and distributed memory. This paper proposes a hybrid work stealing scheme, which combines the lifeline-based variant of distributed task pools with the node-internal load balancing of Java’s Fork/Join framework. We implemented our scheme by extending the APGAS library for Java, which is a branch of the X10 project. APGAS programmers can now spawn locality-flexible tasks with a new asyncAny construct. These tasks are transparently mapped to any resource in the overall system, so that the load is balanced over both nodes and cores. Unprocessed asyncAny-tasks can also be cancelled. In performance measurements with up to 144 workers on up to 12 nodes, we observed near linear speedups for four benchmarks and a low overhead for cancellation-related bookkeeping.
@article{PosnerHybridSuper18, author = {Posner, Jonas and Fohry, Claudia}, title = {Hybrid Work Stealing of Locality-Flexible and Cancelable Tasks for the APGAS Library}, journal = {The Journal of Supercomputing}, publisher = {Springer}, year = {2018}, pages = {1435--1448}, doi = {10.1007/s11227-018-2234-8}, google_scholar_id = {IjCSPb-OGe4C} }
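The asyncAny construct described in the abstract above extends APGAS for Java's standard tasking constructs. The sketch below uses only the library's documented constructs (finish, asyncAt, here, place, places); the locality-flexible call is shown in comments because the exact signature of the authors' extension is not reproduced here.

```java
import static apgas.Constructs.*; // finish, async, asyncAt, here, place, places

public class LocalityFlexibleTasksSketch {
  public static void main(String[] args) {
    // Standard APGAS: the programmer pins every task to an explicit place (node).
    finish(() -> {
      for (int i = 0; i < places().size(); i++) {
        asyncAt(place(i), () -> System.out.println("task pinned to " + here()));
      }
    });

    // Locality-flexible tasks as described in the paper (sketch only): the runtime
    // decides where each task runs and balances the load over nodes and cores,
    // and unprocessed tasks can be cancelled.
    //
    // finish(() -> {
    //   for (int i = 0; i < 1_000; i++) {
    //     asyncAny(() -> expensiveComputation());
    //   }
    // });
  }
}
```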
- A Java Task Pool Framework providing Fault-Tolerant Global Load Balancing. Jonas Posner and Claudia Fohry. Special Issue on the International Journal of Networking and Computing (IJNC), 2018
Fault tolerance is gaining importance in parallel computing, especially on large clusters. Traditional approaches handle the issue on system-level. Application-level approaches are becoming increasingly popular, since they may be more efficient. This paper presents a fault-tolerant work stealing technique on application level, and describes its implementation in a generic reusable task pool framework for Java. When using this framework, programmers can focus on writing sequential code to solve their actual problem. The framework is written in Java and utilizes the APGAS library for parallel programming. It implements a comparatively simple algorithm that relies on a resilient data structure for storing backups of local pools and other information. Our implementation uses Hazelcast’s IMap for this purpose, which is an automatically distributed and fault-tolerant key-value store. The number of backup copies is configurable and determines how many simultaneous failures can be tolerated. Our algorithm is shown to be correct in the sense that failures are either tolerated and the computed result is the same as in non-failure case, or the program aborts with an error message.
@article{PosnerFaultToleranceIJNC18, author = {Posner, Jonas and Fohry, Claudia}, title = {A Java Task Pool Framework providing Fault-Tolerant Global Load Balancing}, journal = {Special Issue on the International Journal of Networking and Computing (IJNC)}, year = {2018}, volume = {8}, number = {1}, pages = {2--31}, doi = {10.15803/ijnc.8.1_2}, google_scholar_id = {eQOLeE2rZwMC} }
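The resilient store mentioned above is Hazelcast's IMap with a configurable number of backup copies. A minimal configuration sketch (Hazelcast 3.x-style imports; the map name and per-worker keying are assumptions made here for illustration, not the framework's actual layout):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ResilientStoreSketch {
  public static void main(String[] args) {
    // Keep two synchronous backup copies of every entry on other cluster members,
    // so the map tolerates two simultaneous node failures.
    Config config = new Config();
    config.getMapConfig("pool-backups").setBackupCount(2);

    HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    IMap<Integer, byte[]> backups = hz.getMap("pool-backups");

    // Illustrative: store a serialized snapshot of worker 0's local task pool.
    backups.put(0, new byte[0]);

    hz.shutdown();
  }
}
```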
2015
- Fault Tolerance Schemes for Global Load Balancing in X10. Claudia Fohry, Marco Bungart, and Jonas Posner. Scalable Computing: Practice and Experience (SCPE), 2015
Scalability postulates fault tolerance to be efficient. One approach handles permanent node failures at user level. It is supported by Resilient X10, a Partitioned Global Address Space language that throws an exception when a place fails. We consider task pools, which are a widely used pattern for load balancing of irregular applications, and refer to the variant that is implemented in the Global Load Balancing framework GLB of X10. Here, each worker maintains a private pool and supports cooperative work stealing. Victim selection and termination detection follow the lifeline scheme. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. We consider a single worker per node, and assume that failures are rare and uncorrelated. The paper introduces two fault tolerance schemes. Both are based on regular backups of the local task pool contents, which are written to the main memory of another worker and updated in the event of stealing. The first scheme mainly relies on synchronous communication. The second scheme deploys asynchronous communication, and significantly improves on the first scheme in both efficiency and robustness. Both schemes have been implemented by extending the GLB source code. Experiments were run with the Unbalanced Tree Search (UTS) and Betweenness Centrality benchmarks. For UTS on 128 nodes, for instance, we observed an overhead of about 81% with the synchronous scheme and about 7% with the asynchronous scheme. The protocol overhead for a place failure was negligible.
@article{FohryFaultToleranceSCPE15, author = {Fohry, Claudia and Bungart, Marco and Posner, Jonas}, title = {Fault Tolerance Schemes for Global Load Balancing in X10}, journal = {Scalable Computing: Practice and Experience (SCPE)}, year = {2015}, volume = {16}, number = {2}, pages = {169--186}, doi = {10.12694/scpe.v16i2.1088}, google_scholar_id = {2osOgNQ5qMEC} }
Dissertation
2021
- Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems. Jonas Posner. University of Kassel, Germany, 2021
High-Performance Computing (HPC) enables solving complex problems from various scientific fields including key societal problems such as COVID-19. Recently, traditional simulations have been joined by more diverse workloads, including irregular ones limiting the predictability of the computations. Workloads are run on HPC machines that comprise an increasing number of hardware components, and serve multiple users simultaneously. To enable efficient and productive programming of today’s HPC machines and beyond, it is essential to address a variety of issues, including: load balancing (i.e., utilizing all resources equally), fault tolerance (i.e., coping with hardware failures), and resource elasticity (i.e., allowing the addition/release of resources). In this thesis, we address these issues in the context of Asynchronous Many-Task (AMT) programming. In AMT, programmers split a computation into many fine-grained execution units (called tasks), which are dynamically mapped to processing units (e.g., threads) by a runtime system. While AMT is becoming established for single computers, we are focusing on cluster AMTs, which are currently mere prototypes with limited functionality. Regarding load balancing, we propose a work stealing technique that transparently schedules tasks to resources of the overall system, balancing the workload over all processing units. In this context, we introduce several tasking constructs. Experiments show good scalability, and a productivity evaluation shows intuitive use. Regarding fault tolerance, we propose four techniques to protect programs transparently. All perform localized recovery and continue the program execution with fewer resources. Three techniques write uncoordinated checkpoints in a resilient store: One saves descriptors of all open tasks; the second saves only part of them; and the third logs stealing events to reduce the number of checkpoints. The fourth technique does not write checkpoints at all, but exploits natural task duplication of work stealing. Experiments show no clear winner between the techniques. For instance, the first one has a failure-free running time overhead below 1% and a recovery overhead below 0.5 seconds, both for smooth weak scaling. Simulations of job set executions show that the completion time can be reduced by up to 97%. Regarding resource elasticity, we propose a technique to enable the addition and release of nodes at runtime by transparently relocating tasks accordingly. Experiments show costs for adding and releasing nodes below 0.5 seconds. Additionally, simulations of job set executions show that the completion time can be reduced by up to 20%.
@phdthesis{PosnerPhD22, author = {Posner, Jonas}, title = {Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems}, school = {University of Kassel, Germany}, year = {2021}, doi = {10.17170/kobra-202207286542}, google_scholar_id = {hqOjcs7Dif8C} }
Conference and Workshop Articles
2025
- Dynamic Resource Management: Comparison of Asynchronous Many-Task (AMT) and Dynamic Processes with PSets (DPP). Jonas Posner, Nick Bietendorf, Dominik Huber, Martin Schreiber, and Martin Schulz. In Workshop on Asynchronous Many-Task Systems and Applications (WAMTA), 2025
Dynamic resource management allows programs running on supercomputers to adjust resource allocations at runtime. This dynamism offers potential improvements in both individual program efficiency and overall supercomputer utilization. Despite growing interest in recent years, the adoption of dynamic resource management remains limited due to inadequate support from widely used resource managers, such as Slurm, and programming environments, such as MPI. Furthermore, developing flexible programs introduces substantially higher programming complexity compared to static programs. While recent research has improved MPI’s resource flexibility, significant programmability challenges remain. Additionally, MPI-based solutions rely on low-level message-passing primitives, which are particularly challenging to use for non-iterative workloads. Asynchronous Many-Task (AMT) programming offers a promising alternative to MPI. By decomposing computations into tasks that are dynamically scheduled by the runtime system, AMT is well suited to handling irregular and dynamic workloads. AMT’s transparent resource management is ideal for dynamic resources, allowing the runtime system to seamlessly redistribute tasks in response to node changes without requiring additional programmer effort. In this work, we compare the “Dynamic Processes with PSets (DPP)” design principle implemented in an MPI-based environment and the APGAS+GLB AMT runtime system. We implement benchmarks in both environments to evaluate programmability and perform experiments on up to 16 nodes to analyze the performance of static and flexible programs. Results demonstrate that GLB simplifies programming with built-in load balancing and resource flexibility. In contrast, the MPI-DPP implementation achieves superior performance in handling node changes but at the cost of increased programming complexity.
@inproceedings{DPPvsAPGASWAMTA25, author = {Posner, Jonas and Bietendorf, Nick and Huber, Dominik and Schreiber, Martin and Schulz, Martin}, title = {{Dynamic Resource Management: Comparison of Asynchronous Many-Task (AMT) and Dynamic Processes with PSets (DPP)}}, booktitle = {Workshop on Asynchronous Many-Task Systems and Applications (WAMTA)}, year = {2025}, }
2024
- The Impact of Evolving APGAS Programs on HPC Clusters. Jonas Posner. In Proceedings Euro-Par Parallel Processing Workshops (DynResHPC), 2024
High-performance computing (HPC) clusters are traditionally managed statically, i.e., user jobs maintain a fixed number of computing nodes for their entire execution. This approach becomes inefficient with the increasing prevalence of dynamic and irregular workloads, which have unpredictable computation patterns that result in fluctuating resource needs at runtime. For instance, nodes cannot be released when they are not needed, limiting the overall supercomputer performance. However, the realization of jobs that can grow and shrink their number of node allocations at runtime is hampered by a lack of support in both resource managers and programming environments. This work leverages evolving programs that grow and shrink autonomously through automated decision-making, making them well-suited for dynamic and irregular workloads. The Asynchronous Many-Task (AMT) programming model has recently shown promise in this context. In AMT, computations are decomposed into many fine-grained tasks, enabling the runtime system to transparently migrate these tasks across nodes. Our study builds on the APGAS AMT runtime system, which supports evolving capabilities, i.e., handles process initialization and termination automatically and requires only minimal user code additions. We enable communication between APGAS and a prototype resource manager as well as extend the Easy-Backfilling job scheduling algorithm to support evolving jobs. We conduct extensive real-world job batch executions on 10 nodes—involving a mix of rigid, moldable, and evolving programs—to evaluate the impact of evolving APGAS programs on supercomputers. Our experimental results demonstrate a 23% reduction in job batch makespan and a 29% reduction in job turnaround time for evolving jobs.
@inproceedings{PosnerEvolvingDynRes24, author = {Posner, Jonas}, title = {The Impact of Evolving APGAS Programs on HPC Clusters}, booktitle = {Proceedings Euro-Par Parallel Processing Workshops (DynResHPC)}, year = {2024}, }
- Evolving APGAS Programs: Automatic and Transparent Resource Adjustments at Runtime. Jonas Posner, Raoul Goebel, and Patrick Finnerty. In Proceedings Workshop on Asynchronous Many-Task Systems and Applications (WAMTA), 2024
In the rapidly evolving field of High-Performance Computing (HPC), the need for resource elasticity is paramount, particularly in addressing the dynamic nature of irregular computational workloads. A key area of elasticity lies within programming models that typically offer limited support. Fully elastic programs are both malleable—capable of dynamically adjusting resources in response to external job scheduler requests—and evolving—autonomously deciding when and how to adjust resources, e.g., through automated decision-making. Previous elasticity approaches typically relied on iterative workloads and required complex code modifications. Asynchronous Many-Task (AMT) programming is emerging as a powerful alternative. In AMT, computations are split into fine-grained tasks, allowing transparent task relocation by the runtime system and unlocking significant potential for efficient elasticity. This work-in-progress proposes an extension to the existing AMT system APGAS, which recently incorporated malleability. Our extension adds evolving capabilities that provide automatic and transparent resource adjustments to meet changing computational workloads at runtime. Our easy-to-use abstractions require only minimal code additions; adjustments such as process initialization and termination are managed automatically. Our extension is validated via a load-balancing library for irregular workloads. We propose two heuristics for automatic computational load detection: one that uses CPU loads provided by the operating system, and another that exploits detailed insights into task loads. We evaluate our approach using a novel synthetic benchmark that starts with a single task evolving into two irregular trees connected by a long sequential branch. Preliminary results are promising, indicating that the CPU-based and the task-based heuristics show similar efficiency.
@inproceedings{PosnerEvolvingWAMTA24, author = {Posner, Jonas and Goebel, Raoul and Finnerty, Patrick}, title = {Evolving APGAS Programs: Automatic and Transparent Resource Adjustments at Runtime}, booktitle = {Proceedings Workshop on Asynchronous Many-Task Systems and Applications (WAMTA)}, year = {2024}, doi = {10.1007/978-3-031-61763-8_15}, google_scholar_id = {MXK_kJrjxJIC} }
2023
- Enhancing Supercomputer Performance with Malleable Job Scheduling Strategies. Jonas Posner, Fabian Hupfeld, and Patrick Finnerty. In Proceedings Euro-Par Parallel Processing Workshops (PECS), 2023
In recent years, supercomputers have experienced significant advancements in performance and have grown in size, now comprising several thousand nodes. To unlock the full potential of these machines, efficient resource management and job scheduling—assigning parallel programs to nodes—are crucial. Traditional job scheduling approaches employ rigid jobs that use the same set of resources throughout their lifetime, resulting in significant resource under-utilization. By employing malleable jobs that are capable of changing their number of resources during execution, the performance of supercomputers has the potential to increase. However, designing algorithms for scheduling malleable jobs is challenging since it requires complex strategies to determine when and how to reassign resources among jobs while maintaining fairness. In this work, we extend a recently proposed malleable job scheduling algorithm by introducing new strategies. Specifically, we propose three priority orders to determine which malleable job to consider for resource reassignments and the number of nodes when starting a job. Additionally, we propose three reassignment approaches to handle the delay between scheduling decisions and the actual transfer of resources between jobs. This results in nine algorithm variants. We then evaluate the impact of deploying malleable jobs scheduled by our nine algorithm variants. For that, we simulate the scheduling of job sets containing varying proportions of rigid and malleable jobs on a hypothetical supercomputer. The results demonstrate significant improvements across several metrics. For instance, with 20% of malleable jobs, the overall completion time is reduced by 11% while maintaining high node utilization and fairness.
@inproceedings{PosnerSchedulingPECS23, author = {Posner, Jonas and Hupfeld, Fabian and Finnerty, Patrick}, title = {Enhancing Supercomputer Performance with Malleable Job Scheduling Strategies}, booktitle = {Proceedings Euro-Par Parallel Processing Workshops (PECS)}, year = {2023}, publisher = {Springer}, doi = {10.1007/978-3-031-48803-0_14}, google_scholar_id = {MXK_kJrjxJIC} }
- Malleable APGAS Programs and their Support in Batch Job Schedulers. Patrick Finnerty, Reo Takaoka, Takuma Kanzaki, and Jonas Posner. In Proceedings Euro-Par Parallel Processing Workshops (AMTE), 2023
Malleability—the ability for applications to dynamically adjust their resource allocations at runtime—presents great potential to enhance the efficiency and resource utilization of modern supercomputers. However, applications are rarely capable of growing and shrinking their number of nodes at runtime, and batch job schedulers provide only rudimentary support for these features. While numerous approaches have been proposed for enabling application malleability, these typically focus on iterative computations and require complex code modifications. This amplifies the challenges for programmers, who already wrestle with the complexity of traditional MPI inter-node programming. Asynchronous Many-Task (AMT) programming presents a promising alternative. Computations are split into many fine-grained tasks, which are processed by workers. This way, AMT enables transparent task relocation via the runtime system, thus offering great potential for efficient malleability. In this paper, we propose an extension to an existing AMT system, namely APGAS for Java, that provides easy-to-use malleability. More specifically, programmers enable application malleability with only minimal code additions, thanks to the simple abstractions we provide. Runtime adjustments, such as process initialization and termination, are automatically managed. We demonstrate the ease of integration between our extension and future batch job schedulers through the implementation of a simplistic malleable batch job scheduler. Additionally, we validate our extension through the adaptation of a load balancing library handling multiple benchmarks. Finally, we show that even a simplistic scheduling strategy for malleable applications improves resource utilization, job throughput, and overall job response time.
@inproceedings{FinnertyMalleableAMTE24, author = {Finnerty, Patrick and Takaoka, Reo and Kanzaki, Takuma and Posner, Jonas}, title = {Malleable APGAS Programs and their Support in Batch Job Schedulers}, booktitle = {Proceedings Euro-Par Parallel Processing Workshops (AMTE)}, year = {2023}, publisher = {Springer}, doi = {10.1007/978-3-031-48803-0_8}, google_scholar_id = {8k81kl-MbHgC} }
2021
- Transparent Resource Elasticity for Task-Based Cluster Environments with Work Stealing. Jonas Posner and Claudia Fohry. In Proceedings International Conference on Parallel Processing (ICPP) Workshops (P2S2), 2021
Resource elasticity allows the resources of running jobs to be changed dynamically, which may significantly improve the throughput on supercomputers. Elasticity requires support from both job schedulers and user applications. Whereas the adaptation of traditional programs requires additional programmer effort, task-based programs can be made elastic in a transparent way. In this paper, we propose a corresponding technique for implementation in a runtime system. We refer to a work stealing-based runtime for clusters, which uses the lifeline scheme for victim selection, combines inter-node work stealing with intra-node work sharing, and handles dynamic independent tasks, i.e., tasks that may spawn child tasks but do not otherwise cooperate. We experimentally assess the elasticity overhead of our scheme and find that adding/releasing up to 64 nodes takes less than 0.5 seconds. This value is determined with the help of a new formula that estimates the overhead-free running time of work-stealing programs with a changing number of workers. Using this result, we then quantify the gain of deploying elastic jobs. For that, we simulate the execution of job sets that contain some percentage of elastic jobs on two hypothetical supercomputers. We use an existing elastic job scheduler, which we concretize, e.g., by a new heuristic to determine the minimum, maximum, and preferred number of nodes for a job. Results show that the makespan can be reduced by up to 20% if most jobs are elastic.
@inproceedings{PosnerElasticityP2S221, author = {Posner, Jonas and Fohry, Claudia}, title = {Transparent Resource Elasticity for Task-Based Cluster Environments with Work Stealing}, booktitle = {Proceedings International Conference on Parallel Processing (ICPP) Workshops (P2S2)}, year = {2021}, publisher = {ACM}, pages = {1--10}, doi = {10.1145/3458744.3473361}, google_scholar_id = {Se3iqnhoufwC} }
- Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks. Jonas Posner, Lukas Reitz, and Claudia Fohry. In Proceedings International Parallel and Distributed Processing Symposium (IPDPS) Workshops (APDCM), 2021
With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by task-based parallel programming coupled with work stealing. At the task level, resilience can be provided by two principal approaches, namely checkpointing and supervision. For both, particular algorithms have been worked out recently. They perform local recovery and continue the program execution on a reduced set of resources. The checkpointing algorithms regularly save task descriptors explicitly, while the supervision algorithms exploit their natural duplication during work stealing and may be coupled with steal tracking to minimize the number of task re-executions. Thus far, the two groups of algorithms have been targeted at different task models: checkpointing algorithms at dynamic independent tasks, and supervision algorithms at nested fork-join programs. This paper transfers the most advanced supervision algorithm to the dynamic independent tasks model, thus enabling a comparison between checkpointing and supervision. Our comparison includes experiments and running time predictions. Results consistently show typical resilience overheads below 1% for both approaches. The overheads are lower for supervision in practically relevant cases, but checkpointing takes over on the order of millions of processes.
@inproceedings{PosnerCheckpointingAPDCM21, author = {Posner, Jonas and Reitz, Lukas and Fohry, Claudia}, title = {Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks}, booktitle = {Proceedings International Parallel and Distributed Processing Symposium (IPDPS) Workshops (APDCM)}, year = {2021}, publisher = {IEEE}, doi = {10.1109/IPDPSW52791.2021.00089}, google_scholar_id = {LkGwnXOMwfcC} }
2020
- System-Level vs. Application-Level Checkpointing. Jonas Posner. In International Conference on Cluster Computing (CLUSTER), 2020
Fault tolerance is becoming increasingly important since the probability of permanent hardware failures increases with machine size. A typical resilience approach to fail-stop failures today is checkpointing, which can be performed at system or application level. Both levels come in many variants, but they fundamentally differ. At system level, no code changes are required, full program states are saved, and after a failure the program must be restarted from the last checkpoint. In contrast, at application level, only user-defined data are checkpointed, which requires some programming effort. Thereby, the running time overhead may be reduced significantly, and programs may continue execution after failures.
@inproceedings{PosnerDMTCPCluster20, author = {Posner, Jonas}, title = {System-Level vs. Application-Level Checkpointing}, booktitle = {International Conference on Cluster Computing (CLUSTER)}, publisher = {IEEE}, year = {2020}, pages = {404--405}, doi = {10.1109/CLUSTER49012.2020.00051}, google_scholar_id = {WF5omc3nYNoC} }
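A generic illustration of the application-level side of this comparison (not code from the paper): the program serializes only its user-defined state and, after a restart, resumes from the latest checkpoint.

```java
import java.io.*;

public class AppLevelCheckpointSketch {
  // Save only the user-defined state, here an iteration counter and a partial result.
  static void writeCheckpoint(File file, int iteration, long[] partialResult) throws IOException {
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
      out.writeInt(iteration);
      out.writeObject(partialResult);
    }
  }

  // Restore the saved state; the computation then continues from the returned iteration.
  static int readCheckpoint(File file, long[][] partialResultHolder) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
      int iteration = in.readInt();
      partialResultHolder[0] = (long[]) in.readObject();
      return iteration;
    }
  }
}
```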
2018
- Comparison of the HPC and Big Data Java Libraries Spark, PCJ and APGAS. Jonas Posner, Lukas Reitz, and Claudia Fohry. In Proceedings International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops (PAW-ATM), 2018
Although Java is rarely used in HPC, there are a few notable libraries. Use of Java may help to bridge the gap between HPC and big data processing. This paper compares the big data library Spark, and the HPC libraries PCJ and APGAS, regarding productivity and performance. We refer to Java versions of all libraries. For APGAS, we include both the original version and our own extension with locality-flexible tasks. We consider three benchmarks: Calculation of π from HPC, Unbalanced Tree Search (UTS) from HPC, and WordCount from the big data domain. In performance measurements with up to 144 workers, the extended APGAS library was the clear winner. With 144 workers, APGAS programs were up to a factor of more than two faster than Spark programs, and up to about 30% faster than PCJ programs. Regarding productivity, the extended APGAS programs consistently needed the lowest number of different library constructs. Spark ranged second in productivity and PCJ third.
@inproceedings{PosnerSparkPAW18, author = {Posner, Jonas and Reitz, Lukas and Fohry, Claudia}, title = {Comparison of the HPC and Big Data Java Libraries Spark, PCJ and APGAS}, booktitle = {Proceedings International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops (PAW-ATM)}, publisher = {ACM}, year = {2018}, pages = {11--22}, doi = {10.1109/PAW-ATM.2018.00007}, google_scholar_id = {9yKSN-GCB0IC} }
- A Selective and Incremental Backup Scheme for Task Pools. Claudia Fohry, Jonas Posner, and Lukas Reitz. In Proceedings International Conference on High Performance Computing & Simulation (HPCS), 2018
Checkpointing is a common approach to prevent loss of a program’s state after permanent node failures. When it is performed at application level, less data need to be saved. This paper suggests an uncoordinated application-level checkpointing technique for task pools. It selectively and incrementally saves only those tasks that have stayed in the pool during some period of time and that have not been saved before. The checkpoints are held in a resilient in-memory data store. Our technique applies to any task pool variant in which workers operate at the top of local pools, and work stealing operates at the bottom. Furthermore, the tasks must be free of side effects, and the final result must be calculated by reduction from individual task results. We implemented the technique for the lifeline-based global load balancing variant of task pools. This variant couples random victim selection with an overlay graph for termination detection. A fault-tolerant realization already exists in the form of a Java library, called JFT_GLB. It uses the APGAS and Hazelcast libraries underneath. Our implementation modifies JFT_GLB by replacing its nonselective checkpointing scheme with our new one. In experiments, we compared the overhead of the new scheme to that of JFT_GLB, with UTS, BC and two synthetic benchmarks. The new scheme required slightly more running time when local pools were small, and paid off otherwise.
@inproceedings{FohryIncrementalHPCS18, author = {Fohry, Claudia and Posner, Jonas and Reitz, Lukas}, title = {A Selective and Incremental Backup Scheme for Task Pools}, booktitle = {Proceedings International Conference on High Performance Computing {\&} Simulation (HPCS)}, year = {2018}, pages = {621--628}, doi = {10.1109/HPCS.2018.00103}, google_scholar_id = {qjMakFHDy7sC} }
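The selection criterion described above (save only tasks that have stayed in the pool for some time and have not been saved before) can be illustrated with an epoch counter and a saved flag. This sketch uses invented names and is not the JFT_GLB implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: each entry records the checkpoint epoch in which the task
// entered the pool and whether it has already been written to the resilient store.
public class SelectiveBackupSketch {
  static final class Entry {
    final Runnable task;
    final long enqueueEpoch;
    boolean saved;
    Entry(Runnable task, long enqueueEpoch) { this.task = task; this.enqueueEpoch = enqueueEpoch; }
  }

  private final Deque<Entry> pool = new ArrayDeque<>();
  private long epoch;

  void add(Runnable task) { pool.addFirst(new Entry(task, epoch)); }

  // At each checkpoint, save only tasks that survived at least one full
  // checkpoint interval and have not been saved before.
  void checkpoint() {
    epoch++;
    for (Entry e : pool) {
      if (!e.saved && epoch - e.enqueueEpoch >= 2) {
        // write e.task to the resilient store here
        e.saved = true;
      }
    }
  }
}
```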
- A Combination of Intra- and Inter-place Work Stealing for the APGAS Library. Jonas Posner and Claudia Fohry. In Proceedings Parallel Processing and Applied Mathematics (PPAM) Workshops (WLPP), 2018
Since today’s clusters consist of nodes with multicore processors, modern parallel applications should be able to deal with shared and distributed memory simultaneously. In this paper, we present a novel hybrid work stealing scheme for the APGAS library for Java, which is a branch of the X10 project. Our scheme extends the library’s runtime system, which traditionally performs intra-node work stealing with the Java Fork/Join framework. We add an inter-node work stealing scheme that is inspired by lifeline-based global load balancing. The extended functionality can be accessed from the APGAS library with new constructs. Most important, locality-flexible tasks can be submitted with asyncAny, and are then automatically scheduled over both nodes and cores. In experiments with up to 144 workers on up to 12 nodes, our system achieved near linear speedups for three benchmarks.
@inproceedings{PosnerCombinationWLPP18, author = {Posner, Jonas and Fohry, Claudia}, title = {A Combination of Intra- and Inter-place Work Stealing for the APGAS Library}, booktitle = {Proceedings Parallel Processing and Applied Mathematics (PPAM) Workshops (WLPP)}, publisher = {Springer}, year = {2018}, pages = {234--243}, doi = {10.1007/978-3-319-78054-2_22}, google_scholar_id = {UeHWp8X0CEIC} }
2017
- Fault Tolerance for Cooperative Lifeline-Based Global Load Balancing in Java with APGAS and Hazelcast. Jonas Posner and Claudia Fohry. In International Parallel and Distributed Processing Symposium (IPDPS) Workshops (APDCM), 2017
Fault tolerance is a major issue for parallel applications. Approaches on application-level are gaining increasing attention because they may be more efficient than system-level ones. In this paper, we present a generic reusable framework for fault-tolerant parallelization with the task pool pattern. Users of this framework can focus on coding sequential tasks for their problem, while respecting some framework contracts. The framework is written in Java and deploys the APGAS library as well as Hazelcast’s distributed and fault-tolerant IMap. Our fault-tolerance scheme uses two system-wide maps, in which it stores, e.g., backups of local task pools. Framework users may configure the number of backup copies to control how many simultaneous failures are tolerated. The algorithm is correct in the sense that the computed result is the same as in non-failure case, or the program aborts with an error message. In experiments with up to 128 workers, we compared the framework’s performance with that of a non-fault-tolerant variant during failure-free operation. For the UTS and BC benchmarks, the overhead was at most 35%. Measured values were similar to those for a related, but less flexible, fault-tolerant X10 framework, without a clear winner. Raising the number of backup copies to six only marginally increased the overhead.
@inproceedings{PosnerFaultToleranceAPDCM17, author = {Posner, Jonas and Fohry, Claudia}, title = {Fault Tolerance for Cooperative Lifeline-Based Global Load Balancing in Java with APGAS and Hazelcast}, booktitle = {International Parallel and Distributed Processing Symposium (IPDPS) Workshops (APDCM)}, year = {2017}, publisher = {IEEE}, pages = {854--863}, doi = {10.1109/ipdpsw.2017.31}, google_scholar_id = {u-x6o8ySG0sC} }
2016
- Cooperation vs. Coordination for Lifeline-Based Global Load Balancing in APGAS. Jonas Posner and Claudia Fohry. In Proceedings of the 6th ACM SIGPLAN Workshop on X10, 2016
Work stealing can be implemented in either a cooperative or a coordinated way. We compared the two approaches for lifeline-based global load balancing, which is the algorithm used by X10’s Global Load Balancing framework GLB. We conducted our study with the APGAS library for Java, to which we ported GLB in a first step. Our cooperative variant resembles the original GLB framework, except that strict sequentialization is replaced by Java synchronization constructs such as critical sections. Our coordinated variant enables concurrent access to local task pools by using a split queue data structure. In experiments with modified versions of the UTS and BC benchmarks, the cooperative and coordinated APGAS variants had similar execution times, without a clear winner. Both variants outperformed the original GLB when compiled with Managed X10. Experiments were run on up to 128 nodes, to which we assigned up to 512 places.
@inproceedings{PosnerCooperationX1016, author = {Posner, Jonas and Fohry, Claudia}, title = {Cooperation vs. Coordination for Lifeline-Based Global Load Balancing in APGAS}, booktitle = {Proceedings of the 6th ACM SIGPLAN Workshop on X10}, year = {2016}, publisher = {ACM}, pages = {13--17}, doi = {10.1145/2931028.2931029}, google_scholar_id = {zYLM7Y9cAGgC} }
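The split queue of the coordinated variant separates a worker-private segment from a shared segment that thieves may access concurrently. The following Java sketch only illustrates this idea; it is not the authors' data structure.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative split task queue: the owning worker pushes and pops at the top
// without synchronization, thieves steal from the bottom under a lock, and the
// owner periodically releases part of its private tasks into the shared segment.
public class SplitQueueSketch<T> {
  private final Deque<T> privateTop = new ArrayDeque<>();   // accessed only by the owner
  private final Deque<T> sharedBottom = new ArrayDeque<>(); // guarded by this

  public void push(T task) { privateTop.addFirst(task); }

  public T pop() {
    T t = privateTop.pollFirst();
    if (t != null) return t;
    synchronized (this) { return sharedBottom.pollFirst(); }
  }

  // Called by a thief on another worker's queue.
  public synchronized T steal() { return sharedBottom.pollLast(); }

  // The owner occasionally moves half of its private tasks into the shared segment.
  public synchronized void release() {
    int n = privateTop.size() / 2;
    for (int i = 0; i < n; i++) {
      sharedBottom.addFirst(privateTop.pollLast());
    }
  }
}
```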
2015
- Towards an Efficient Fault-Tolerance Scheme for GLB. Claudia Fohry, Marco Bungart, and Jonas Posner. In Proceedings of the ACM SIGPLAN Workshop on X10, 2015
X10’s Global Load Balancing framework GLB implements a user-level task pool for inter-place load balancing. It is based on work stealing and deploys the lifeline algorithm. A single worker per place alternates between processing tasks and answering steal requests. We have devised an efficient fault-tolerance scheme for this algorithm, improving on a simpler resilience scheme from our own previous work. Among the base ideas of the new scheme are incremental backups of “stable” tasks and an actor-like communication structure. The paper reports on our ongoing work to extend the GLB framework accordingly. While details of the scheme are left out, we discuss implementation issues and preliminary experimental results.
@inproceedings{FohryFaultToleranceX1015, author = {Fohry, Claudia and Bungart, Marco and Posner, Jonas}, title = {Towards an Efficient Fault-Tolerance Scheme for GLB}, booktitle = {Proceedings of the ACM SIGPLAN Workshop on X10}, year = {2015}, publisher = {ACM}, pages = {27--32}, doi = {10.1145/2771774.2771779}, google_scholar_id = {YsMSGLbcyi4C} }
2014
- Fault-Tolerant Global Load Balancing in X10. Marco Bungart, Claudia Fohry, and Jonas Posner. In Proceedings International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014
Scalability postulates fault tolerance to be effective. We consider a user-level fault tolerance technique to cope with permanent node failures. It is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages. In Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which are often used by irregular applications to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the standard library of X10. We ran experiments with the Unbalanced Tree Search (UTS) and Betweenness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was negligible.
@inproceedings{BungartFaultToleranceSYNASC14, author = {Bungart, Marco and Fohry, Claudia and Posner, Jonas}, title = {Fault-Tolerant Global Load Balancing in X10}, booktitle = {Proceedings International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)}, year = {2014}, pages = {471--478}, publisher = {IEEE}, doi = {10.1109/synasc.2014.69}, google_scholar_id = {W7OEmFMy1HYC} }
Posters and Extended Abstracts
2024
- Resource Adaptivity at Task-Level. Jonas Posner. At Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM), 2024
@conference{ResAdapPAW24, author = {Posner, Jonas}, title = {{Resource Adaptivity at Task-Level}}, booktitle = {Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM)}, year = {2024}, addendum = {Extended Abstract}, doi = {10.5281/zenodo.14211666} }
- Project Wagomu: Elastic HPC Resource Management. Jonas Posner and Patrick Finnerty. At ISC High Performance Conference, 2024
@conference{PosnerWagomuISC24, author = {Posner, Jonas and Finnerty, Patrick}, title = {Project Wagomu: Elastic HPC Resource Management}, booktitle = {ISC High Performance Conference}, year = {2024}, }
2022
- Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems. Jonas Posner. At International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2022
@conference{PosnerAMTSC22, author = {Posner, Jonas}, title = {Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems}, booktitle = {International Conference on High Performance Computing, Networking, Storage and Analysis (SC)}, year = {2022}, }
- Asynchronous Many-Tasking (AMT): Load Balancing, Fault Tolerance, Resource Elasticity. Jonas Posner. At ISC High Performance Conference, 2022
@conference{PosnerAMTISC22, author = {Posner, Jonas}, title = {Asynchronous Many-Tasking (AMT): Load Balancing, Fault Tolerance, Resource Elasticity}, booktitle = {ISC High Performance Conference}, year = {2022}, }
2021
- Resource Elasticity at Task-Level. Jonas Posner. At International Parallel and Distributed Processing Symposium (IPDPS), Ph.D. Forum, 2021
@conference{PosnerElasticity21, author = {Posner, Jonas}, title = {Resource Elasticity at Task-Level}, booktitle = {Proceedings International Parallel and Distributed Processing Symposium (IPDPS), Ph.D. Forum}, year = {2021}, publisher = {IEEE}, doi = {10.1109/IPDPSW52791.2021.00160}, addendum = {Extended Abstract} }
- Locality-Flexible and Cancelable Tasks for the APGAS Library. Jonas Posner. At EuroHPC Summit Week, PRACEdays, 2021
@conference{PosnerLocalityPrace21, author = {Posner, Jonas}, title = {Locality-Flexible and Cancelable Tasks for the APGAS Library}, booktitle = {EuroHPC Summit Week, PRACEdays}, year = {2021}, }
2017
- A Generic Reusable Java Framework for Fault-Tolerant Parallelization with the Task Pool Pattern. Jonas Posner. At International Parallel and Distributed Processing Symposium (IPDPS), Ph.D. Forum, 2017
@conference{PosnerFaultToleranceIPDPS17, author = {Posner, Jonas}, title = {A Generic Reusable Java Framework for Fault-Tolerant Parallelization with the Task Pool Pattern}, booktitle = {International Parallel and Distributed Processing Symposium (IPDPS), Ph.D. Forum}, year = {2017}, }