Row-based parallel sorting on a random shared-memory vector
============================================================

Data model
----------
shm_generator allocates a flat 1D uint8_t buffer in SysV shared memory
and fills it with random bytes (in parallel). The schedulers view that
buffer as numRows rows of length n:

    row r occupies bytes [r*n, (r+1)*n) of the SHM segment.

Each scheduler attaches to SHM, wraps each row r in a row<uint8_t> with
copy_data=false (so it operates in place on SHM), and calls quick_sort.
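
The row addressing above can be sketched as follows. This is a minimal
illustration, not the project's code: row_begin and sort_row_in_place
are hypothetical helpers, a plain byte pointer stands in for the
attached SHM segment, and std::sort stands in for quick_sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: first byte of row r in a flat buffer that holds
// rows of n bytes each, i.e. byte offset r*n from the start.
inline std::uint8_t* row_begin(std::uint8_t* base, std::size_t n, std::size_t r) {
    return base + r * n;
}

// Sort row r in place. Nothing is copied out of the buffer, so the
// result is visible to every process attached to the same memory.
// std::sort is a stand-in for the project's quick_sort.
inline void sort_row_in_place(std::uint8_t* base, std::size_t numRows,
                              std::size_t n, std::size_t r) {
    assert(r < numRows);
    std::uint8_t* first = row_begin(base, n, r);
    std::sort(first, first + n);  // touches bytes [r*n, (r+1)*n) only
}
```

Sorting row 1 of a two-row buffer with n = 4 reorders bytes 4..7 and
leaves bytes 0..3 untouched.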

Workflow
--------
    ./shm_generator <numRows> <n>      # create + fill
    ./<some_scheduler> [args]          # sort rows in place in SHM
    ./shm_print <numRows> <n>          # peek at row 0
    ./shm_generator delete             # release SHM

The schedulers
--------------
1. static_scheduler
   Block partition with block = numRows/k (integer division). Thread tid
   gets rows [tid*block, (tid+1)*block); the last thread also takes the
   numRows % k leftover rows. A balanced variant (sort_rows_static_b)
   instead hands out the leftover rows one per thread.
   CLI: ./static_scheduler numRows n k

2. dynamic_scheduler
   One row per pop under a single mutex. Best load balance, since no
   thread ever holds more than one unsorted row, but one lock
   acquisition per row gives the highest contention on the mutex.

3. chunk_scheduler
   Threads grab CHUNK rows per pop instead of one. Trades a little
   load-balance for a lot less mutex traffic.

4. chunk_steal_scheduler
   Per-thread deques pre-seeded round-robin with CHUNK-aligned starts.
   A thread pops from the FRONT of its own deque; when empty it tries
   to steal from the BACK of a victim's deque (LIFO steal).

5. guided_scheduler
   chunk = max(1, remaining / k). Large bites at the start, smaller
   ones as work runs out. This reduces mutex traffic vs. dynamic
   while keeping tail load-balance.

6. adaptive_scheduler
   Like guided, but each thread's chunk is multiplied by
   (global_avg_ms_per_row / my_avg_ms_per_row). Faster threads
   take bigger bites; slow ones take less.

7. aimd
   Additive-increase / multiplicative-decrease on the cap of
   concurrently in-flight chunks (contention_window), driven by
   the EWMA load signal from UtilizationMonitor (which reads
   /proc/stat in the background). Chunk size is fixed at CHUNK_ROWS;
   the AIMD logic tunes how many chunks can be live at once.
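
For static_scheduler (1), the two partitions can be sketched as pure
range functions. This assumes the block size is the integer quotient
numRows / k and that the balanced variant gives the extra rows to the
lowest-numbered threads; neither detail is confirmed by the sources,
and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Plain variant: equal blocks of numRows / k rows; the last thread
// additionally takes the numRows % k leftover rows.
inline Range static_range(std::size_t tid, std::size_t k, std::size_t numRows) {
    assert(k > 0 && tid < k);
    const std::size_t block = numRows / k;
    const std::size_t lo = tid * block;
    const std::size_t hi = (tid + 1 == k) ? numRows : lo + block;
    return {lo, hi};
}

// Balanced variant (assumption): the first numRows % k threads each
// take one extra row, so range sizes differ by at most one.
inline Range static_range_balanced(std::size_t tid, std::size_t k,
                                   std::size_t numRows) {
    assert(k > 0 && tid < k);
    const std::size_t block = numRows / k;
    const std::size_t rem = numRows % k;
    const std::size_t lo = tid * block + std::min(tid, rem);
    const std::size_t hi = lo + block + (tid < rem ? 1 : 0);
    return {lo, hi};
}
```

With numRows = 10 and k = 4 the plain variant yields range sizes
2, 2, 2, 4, while the balanced one yields 3, 3, 2, 2.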
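
Schedulers 2, 3, 5 and 6 differ mainly in how many rows a thread asks
for per pop. One way to picture that is a shared cursor under a mutex
plus a pluggable chunk-size policy. RowDispenser, guided_chunk and
adaptive_chunk are hypothetical names, and clamping the request to
[1, remaining] is an assumption, not something the sources state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Hypothetical dispenser: one shared cursor guarded by one mutex.
class RowDispenser {
public:
    explicit RowDispenser(std::size_t numRows) : numRows_(numRows) {}

    // chunk_for(remaining) says how many rows the caller would like.
    // An empty range (first == second) means no work is left.
    template <class Policy>
    Range grab(Policy chunk_for) {
        std::lock_guard<std::mutex> lock(m_);
        const std::size_t remaining = numRows_ - next_;
        if (remaining == 0) return {numRows_, numRows_};
        std::size_t want = chunk_for(remaining);
        want = std::min(std::max<std::size_t>(want, 1), remaining);
        const std::size_t lo = next_;
        next_ += want;
        return {lo, next_};
    }

private:
    std::mutex m_;
    std::size_t next_ = 0;
    std::size_t numRows_;
};

// guided: remaining / k, never less than one row.
inline std::size_t guided_chunk(std::size_t remaining, std::size_t k) {
    assert(k > 0);
    return std::max<std::size_t>(1, remaining / k);
}

// adaptive: the guided size scaled by global_avg / my_avg, so a thread
// that is faster than average (smaller my_avg) asks for more rows.
inline std::size_t adaptive_chunk(std::size_t remaining, std::size_t k,
                                  double global_avg_ms_per_row,
                                  double my_avg_ms_per_row) {
    assert(my_avg_ms_per_row > 0.0);
    const double scaled = static_cast<double>(guided_chunk(remaining, k)) *
                          (global_avg_ms_per_row / my_avg_ms_per_row);
    return std::max<std::size_t>(1, static_cast<std::size_t>(scaled));
}
```

In this picture dynamic is the policy that always returns 1 and chunk
is the policy that always returns CHUNK. With 100 rows and k = 4, the
first two guided pops cover rows [0, 25) and [25, 43).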
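
The aimd control loop (7) can be illustrated with two small functions.
Everything numeric here is an assumption made for the sketch: the step
size, the decrease factor, the overload threshold and the window bounds
are not given in this README, and AimdParams, ewma and aimd_step are
hypothetical names. The sketch also assumes that a high load reading is
what triggers the multiplicative decrease.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Exponentially weighted moving average; alpha weights the newest sample.
inline double ewma(double previous, double sample, double alpha) {
    assert(alpha > 0.0 && alpha <= 1.0);
    return alpha * sample + (1.0 - alpha) * previous;
}

// Hypothetical tuning knobs; the real values are not stated here.
struct AimdParams {
    std::size_t min_window = 1;
    std::size_t max_window = 64;
    std::size_t add_step = 1;      // additive increase per update
    double decrease_factor = 0.5;  // multiplicative decrease
    double high_load = 0.85;       // EWMA load above this counts as overload
};

// One update of the cap on concurrently in-flight chunks.
inline std::size_t aimd_step(std::size_t window, double load_ewma,
                             const AimdParams& p) {
    if (load_ewma > p.high_load) {
        // Overloaded: shrink the window multiplicatively, never below min.
        const std::size_t shrunk = static_cast<std::size_t>(
            static_cast<double>(window) * p.decrease_factor);
        return std::max(p.min_window, shrunk);
    }
    // Otherwise probe for more parallelism, one step at a time.
    return std::min(p.max_window, window + p.add_step);
}
```

Under these assumed defaults a window of 8 grows to 9 at a load of
0.50 and is halved to 4 at a load of 0.95. The chunk size itself stays
at CHUNK_ROWS throughout.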

The row<T> class is exercised in every scheduler: each row gets wrapped
in a row<uint8_t> bound directly to the SHM bytes (no copy), sorted in
place, and then the wrapper is torn down.
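
The wrap / sort / tear-down cycle is sketched below inside a
chunk_steal-style worker loop. This is an illustration under stated
assumptions: RowView is a hypothetical stand-in for row<uint8_t> with
copy_data=false, std::sort stands in for quick_sort, and ChunkQueue,
seed_round_robin, next_chunk and worker_loop are invented names. The
victim order (next thread first) is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Non-owning view: destroying it leaves the underlying bytes untouched.
struct RowView {
    std::uint8_t* data;
    std::size_t n;
    void sort() { std::sort(data, data + n); }  // stand-in for quick_sort
};

struct ChunkQueue {
    std::mutex m;
    std::deque<std::size_t> starts;  // first row of each pending chunk
};

// Deal chunk starts 0, chunk, 2*chunk, ... to the deques round-robin.
inline void seed_round_robin(std::vector<ChunkQueue>& qs, std::size_t numRows,
                             std::size_t chunk) {
    assert(!qs.empty() && chunk > 0);
    std::size_t i = 0;
    for (std::size_t s = 0; s < numRows; s += chunk)
        qs[i++ % qs.size()].starts.push_back(s);
}

// Pop from the FRONT of our own deque; if it is empty, try to steal
// from the BACK of each other deque in turn.
inline bool next_chunk(std::vector<ChunkQueue>& qs, std::size_t self,
                       std::size_t& start) {
    {
        std::lock_guard<std::mutex> lock(qs[self].m);
        if (!qs[self].starts.empty()) {
            start = qs[self].starts.front();
            qs[self].starts.pop_front();
            return true;
        }
    }
    for (std::size_t i = 1; i < qs.size(); ++i) {
        ChunkQueue& victim = qs[(self + i) % qs.size()];
        std::lock_guard<std::mutex> lock(victim.m);
        if (!victim.starts.empty()) {
            start = victim.starts.back();
            victim.starts.pop_back();
            return true;
        }
    }
    return false;  // every deque was empty when we looked
}

// Body that each worker thread would run.
inline void worker_loop(std::vector<ChunkQueue>& qs, std::size_t self,
                        std::uint8_t* base, std::size_t numRows,
                        std::size_t n, std::size_t chunk) {
    std::size_t start = 0;
    while (next_chunk(qs, self, start)) {
        const std::size_t end = std::min(start + chunk, numRows);
        for (std::size_t r = start; r < end; ++r) {
            RowView row{base + r * n, n};  // wrap row r, no copy
            row.sort();                    // sort in place
        }                                  // view goes out of scope here
    }
}
```

Because chunks are only ever removed after seeding, a worker can stop
as soon as it finds every deque empty. In a real run each of the k
threads would call worker_loop with its own tid.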