"pma_sam" utility to aid re-sparsification of pma heap files
Copyright (C) 2022  Terence Kelly
Contact:  tpkelly @ { acm.org, cs.princeton.edu, eecs.umich.edu }

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.


As of November 2022 pma_sam is distributed separately from the pma
persistent memory allocator.  Separate distribution is a stopgap
measure; it is anticipated that pma_sam will be bundled with the next
release of pma ("Avon 9"), which is expected sometime in 2023.  For
now the standalone pma_sam is available at the main pma Web site:
http://web.eecs.umich.edu/~tpkelly/pma/     [optionally "https"]

pma_sam addresses a request that some pma users, and particularly
some pm-gawk users, have voiced:  It would be nice if de-allocated
persistent memory consumed no *storage* resources beneath the pma
heap file.  It's a bit of a long story, with several caveats.

The pma heap file typically begins life as a logically large *sparse*
file, i.e., a file whose footprint upon underlying storage resources
is *zero* and whose logical pages consist entirely of zero bytes.  As
pma fills its persistent heap with data, the file system allocates
storage beneath the heap file as necessary, gradually making the file
denser.  If data on the persistent heap are de-allocated, the storage
beneath such free'd data is not released back to the storage system,
so the heap file can have a large storage footprint even though
relatively few data on the pma persistent heap are "live" (in-use,
allocated but not yet free'd), which is wasteful.
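
To make the distinction between logical size and storage footprint
concrete, here is a minimal Python sketch (not part of pma or
pma_sam).  The helper names footprint() and make_sparse_file() are
made up for illustration, and the code assumes that st_blocks counts
512-byte units, as it does on Linux.  On a file system that supports
sparse files, the allocated figure for a freshly created file should
be far smaller than its logical size.

```python
import os
import tempfile


def footprint(path):
    """Return (logical_bytes, allocated_bytes) for the file at path.

    Assumption: st_blocks counts 512-byte units, as on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, logical_size):
    """Create a file of the given logical size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(logical_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        heap = os.path.join(d, "demo.heap")  # hypothetical file name
        make_sparse_file(heap, 64 * 1024 * 1024)
        logical, allocated = footprint(heap)
        print("logical bytes:  ", logical)
        print("allocated bytes:", allocated)
```

Running footprint() on a real heap file before and after
re-sparsification is one way to see whether any storage was released.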

pma_sam is a simple command-line utility that uses a pma interface,
pma_set_avail_mem(), to set de-allocated persistent memory on pma's
free lists to zero bytes.  This allows a separate command, "fallocate
--dig-holes", to re-sparsify the heap file in-place.  If fallocate
doesn't deliver the expected results on your system, try copying the
heap file with "cp --sparse=always" (which may or may not work).  The
pma_sam_run.csh script compiles pma_sam and uses it to sparsify a
pm-gawk heap file; you'll need pma (release "Avon 8" or later) and
pm-gawk to run this script.
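
The hole-digging step can be wrapped in a few lines of Python.  This
is only a sketch, and it is not what pma_sam_run.csh does:  the
function names are hypothetical, it assumes that free'd memory in the
heap file has already been set to zero bytes, and it assumes Linux's
512-byte st_blocks units.  The only external command it relies on is
"fallocate --dig-holes" from util-linux.

```python
import os
import shutil
import subprocess


def allocated_bytes(path):
    """Storage footprint of path (assumes 512-byte st_blocks units)."""
    return os.stat(path).st_blocks * 512


def dig_holes(path):
    """Run "fallocate --dig-holes" on path.

    Returns (before, after) allocated byte counts, or None if the
    fallocate command is missing or reports failure, e.g. because the
    file system does not support hole punching.
    """
    exe = shutil.which("fallocate")
    if exe is None:
        return None
    before = allocated_bytes(path)
    result = subprocess.run(
        [exe, "--dig-holes", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        return None
    return before, allocated_bytes(path)
```

A None result means nothing could be attempted; a pair whose second
number is smaller than the first means storage was released.  Either
way the file's logical size and contents are left unchanged.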

Not all file systems support sparse files, and not all of those
support "fallocate --dig-holes", so there's no guarantee that pma_sam
will work as intended.  If it does work, it can be very slow:  In the
worst case it requires time proportional to the *logical* size of the
heap file, which may be terabytes.

Furthermore the re-sparsification that pma_sam and fallocate attempt
is possible only if entire pages of free'd memory can be found, where
a page is typically 4 KiB.  There exist patterns of persistent memory
allocation and de-allocation that intersperse "live" and free'd
persistent memory blocks in such a way that very little memory is
live, yet every page of memory contains a tiny bit of live data.  In
such cases no page can be released, so re-sparsification reclaims
nothing.
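
The fragmentation problem is easy to reproduce with a toy model.  The
Python sketch below is purely illustrative and does not use pma:  it
treats a bytearray as the heap, represents live data as nonzero bytes
and free'd (zeroed) memory as zero bytes, and assumes a 4 KiB page.
All function names are invented for the example.

```python
PAGE_SIZE = 4096  # typical page size; assumed, not queried from the OS


def zero_pages(heap, page_size=PAGE_SIZE):
    """Count pages of heap that consist entirely of zero bytes."""
    count = 0
    for start in range(0, len(heap), page_size):
        page = heap[start:start + page_size]
        if page.count(0) == len(page):
            count += 1
    return count


def scattered_heap(pages, live_bytes_per_page):
    """Toy heap: every page begins with a few live bytes; the rest is free."""
    heap = bytearray(pages * PAGE_SIZE)
    for p in range(pages):
        start = p * PAGE_SIZE
        heap[start:start + live_bytes_per_page] = b"\x01" * live_bytes_per_page
    return heap


def compact_heap(pages, live_bytes_per_page):
    """Toy heap: the same amount of live data, packed at the start."""
    heap = bytearray(pages * PAGE_SIZE)
    total = pages * live_bytes_per_page
    heap[:total] = b"\x01" * total
    return heap


if __name__ == "__main__":
    print("scattered:", zero_pages(scattered_heap(8, 16)), "zero pages")
    print("compact:  ", zero_pages(compact_heap(8, 16)), "zero pages")
```

With 8 pages and 16 live bytes per page, less than half a percent of
the heap is live in both layouts.  The scattered layout has no
all-zero page, so nothing could be reclaimed, whereas the compact
layout leaves 7 of the 8 pages entirely zero.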

Read about pma here:

Terence Kelly, Zi Fan Tan, Jianan Li, and Haris Volos, "Persistent
Memory Allocation," ACM _Queue_ magazine, Vol. 20 No. 2 (March/April
2022).
PDF:   https://dl.acm.org/doi/pdf/10.1145/3534855
HTML:  https://queue.acm.org/detail.cfm?id=3534855

Read about pm-gawk in the User Manual, available at

http://web.eecs.umich.edu/~tpkelly/pma/

