2024: ScaRLib: Towards a hybrid toolchain for aggregate computing and many-agent reinforcement learning

Simulation and detailed description are publicly available at https://github.com/ScaRLib-group/SCP-ScaRLib-flock-demo. Related to the paper "ScaRLib: Towards a hybrid toolchain for aggregate computing and many-agent reinforcement learning", published in the journal Science of Computer Programming (DOI: 10.1016/j.scico.2024.103176).

Abstract

ScaRLib is a Scala-based framework designed to merge macroprogramming and multi-agent reinforcement learning (MARL) in large-scale cyber-physical systems. It integrates Alchemist, ScaFi, and PyTorch into a modular toolchain that allows developers to define and train swarm behaviors through high-level abstractions and declarative configurations.

In essence, ScaRLib makes it possible to teach simulated agents collective behaviors — such as coordination, cohesion, and collision avoidance — by combining aggregate computing models with deep reinforcement learning techniques. This hybrid approach paves the way for adaptive, scalable, and self-organizing swarm systems.

Experiment description

Objective

This experiment demonstrates how reinforcement learning can be combined with aggregate computing to produce flocking behavior — where agents (e.g., drones) learn to maintain cohesion while avoiding collisions.

Setup

Agents are deployed in a 2D continuous environment simulated by Alchemist, each capable of moving in eight directions (N, S, E, W, and the four diagonals). Each agent perceives only its local neighborhood (its five nearest agents) and decides its next move based on its learned policy.
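
As a concrete picture of this action space, the Scala sketch below models the eight moves as unit displacement vectors. The names (Vec2, Actions, move) are illustrative and not taken from the ScaRLib or Alchemist APIs.

    // Illustrative sketch (not the ScaRLib/Alchemist API): the eight discrete
    // actions modeled as unit displacement vectors in the 2D environment.
    final case class Vec2(x: Double, y: Double) {
      def +(other: Vec2): Vec2 = Vec2(x + other.x, y + other.y)
      def *(s: Double): Vec2 = Vec2(x * s, y * s)
    }

    object Actions {
      private val d = 1.0 / math.sqrt(2.0) // component of a unit-length diagonal move

      // N, S, E, W and the four diagonals, each as a unit vector.
      val all: Map[String, Vec2] = Map(
        "N"  -> Vec2(0, 1),  "S"  -> Vec2(0, -1),
        "E"  -> Vec2(1, 0),  "W"  -> Vec2(-1, 0),
        "NE" -> Vec2(d, d),  "NW" -> Vec2(-d, d),
        "SE" -> Vec2(d, -d), "SW" -> Vec2(-d, -d)
      )

      // Apply the chosen action to a position, scaled by the simulation step size.
      def move(position: Vec2, action: String, step: Double): Vec2 =
        position + (all(action) * step)
    }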

The ScaFi aggregate program defines how agents perceive their surroundings — computing distances to neighbors and building their state representation. Learning is managed through the Centralized Training, Decentralized Execution (CTDE) model in ScaRLib, where agents are trained collectively but act independently during simulation.
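
Conceptually, the state each agent feeds to its policy is the vector of distances to its closest neighbors. The plain-Scala sketch below illustrates this idea; it is not the actual ScaFi aggregate program from the repository, and the names (StateBuilder, buildState) and the padding strategy are assumptions.

    // Conceptual sketch of the state construction (not the actual ScaFi program):
    // each agent observes the distances to its k nearest neighbors (k = 5 here).
    object StateBuilder {
      type Position = (Double, Double)

      private def distance(a: Position, b: Position): Double =
        math.hypot(a._1 - b._1, a._2 - b._2)

      // The local state fed to the policy: the sorted distances to the k closest
      // agents, padded with a large sentinel value if fewer neighbors are visible.
      def buildState(self: Position, others: Seq[Position], k: Int = 5): Vector[Double] = {
        val nearest = others.map(distance(self, _)).sorted.take(k).toVector
        nearest ++ Vector.fill(k - nearest.size)(Double.MaxValue)
      }
    }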

Reward Design

The reinforcement signal is defined by two components:

  • Cohesion factor: rewards agents that remain close to their neighbors.

  • Collision factor: penalizes agents that move too close to a neighbor (i.e., closer than a target distance $\delta$).

This balance encourages the emergence of smooth, coordinated group motion.
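
To make the two factors concrete, the Scala sketch below shows one plausible shape of this reward, computed from an agent's neighbor distances. The exact formulation and weights are defined in the repository, so the names (Reward, cohesionFactor, collisionFactor) and the linear penalties used here are assumptions.

    // Illustrative reward sketch; treat the penalty shapes as assumptions, not
    // as the exact function used in the experiment.
    object Reward {
      // Cohesion: penalize the agent when its farthest neighbor drifts beyond
      // the target distance delta (no penalty while the group stays compact).
      def cohesionFactor(neighborDistances: Seq[Double], delta: Double): Double =
        neighborDistances.maxOption match {
          case Some(farthest) if farthest > delta => -(farthest - delta)
          case _                                  => 0.0
        }

      // Collision: penalize every neighbor closer than the target distance delta.
      def collisionFactor(neighborDistances: Seq[Double], delta: Double): Double =
        neighborDistances.collect { case d if d < delta => -(delta - d) }.sum

      // Total reinforcement signal: the balance of the two factors.
      def compute(neighborDistances: Seq[Double], delta: Double): Double =
        cohesionFactor(neighborDistances, delta) + collisionFactor(neighborDistances, delta)
    }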

Results

After training for 1000 epochs with 50 agents, the learned policy produces robust flocking behavior:

  • Agents naturally form cohesive clusters while avoiding collisions.

  • The same policy scales effectively to 100 and 200 agents, maintaining stable inter-agent distances ($\simeq 2 \times \delta$). Performance analysis confirms polynomial scalability, with Alchemist efficiently handling simulations with hundreds of agents.

Images

Snapshots (a)–(f) of the learned flocking policy over time.