Gzipping or not Gzipping HEALPix maps


March 6, 2023

HEALPix maps are generally stored in FITS format, they can be Gzipped to save disk space. Reading gzipped maps is natively supported in healpy, however, it takes longer because maps are uncompressed on the fly.

As usual, the best way to assess this is to test. In this case, I am testing using the JupyterHub@NERSC running on a Cori shared CPU server, the maps have about 30% of unobserved pixels and are stored in double precision.

For this case, the performance loss due to compression is significant, in particular if accessing just a subset of the maps. Therefore it is hard to justify compression, unless storage is really a limiting factor.

This is heavily case-dependent, so a new test (possibly reusing this notebook) should be performed on a different dataset and machine.

import healpy as hp
cd /global/cfs/cdirs/cmbs4/dc/dc1/
!rm -r test_readgzip || true
%mkdir test_readgzip
cd test_readgzip

Stage data

folders = ["test1", "test2",  "test1gz", "test2gz"]
folders_string = " ".join(folders)
!mkdir -p $folders_string
input_folder = "../staging/noise_sim/outputs_rk/coadd/LAT0_CHLAT"
m = "coadd_LAT0_CHLAT_f150_001of001_map.fits"
cov = "coadd_LAT0_CHLAT_f150_001of001_cov.fits"
for f in folders:
    !cp $input_folder/$m $input_folder/$cov $f


for f in folders:
    if f.endswith("gz"):
        !time gzip $f/$m
        !time gzip $f/$cov

real    3m24.095s
user    3m13.681s
sys 0m4.717s

real    5m27.430s
user    5m13.233s
sys 0m7.657s

real    3m22.517s
user    3m12.237s
sys 0m4.510s

real    5m21.693s
user    5m7.241s
sys 0m7.664s
ls test1gz
for each in [m, cov]:
    fits = !stat -c "%s" test1/$each
    each += ".gz"
    gz = !stat -c "%s" test1gz/$each
    print(f"{each} compression factor: {int(gz[0])/int(fits[0]):.0%}")
coadd_LAT0_CHLAT_f150_001of001_map.fits.gz compression factor: 63%
coadd_LAT0_CHLAT_f150_001of001_cov.fits.gz compression factor: 42%

Benchmark map access


_ = hp.read_map(f"test1/{m}", (0,1,2))
CPU times: user 33.2 s, sys: 26.4 s, total: 59.6 s
Wall time: 1min 6s

_ = hp.read_map(f"test1gz/{m}.gz", (0,1,2))
CPU times: user 1min 36s, sys: 24.7 s, total: 2min 1s
Wall time: 2min 2s

_ = hp.read_map(f"test2/{m}", 0)
CPU times: user 10.6 s, sys: 9.98 s, total: 20.6 s
Wall time: 27.3 s

_ = hp.read_map(f"test2gz/{m}.gz", 0)
CPU times: user 1min 12s, sys: 11.8 s, total: 1min 24s
Wall time: 1min 25s

Benchmark noise covariance access


_ = hp.read_map(f"test1/{cov}", (0,1,2,3,4,5))
CPU times: user 1min 13s, sys: 50 s, total: 2min 3s
Wall time: 2min 23s

_ = hp.read_map(f"test1gz/{cov}.gz", (0,1,2,3,4,5))
CPU times: user 3min 10s, sys: 54.9 s, total: 4min 5s
Wall time: 4min 6s

_ = hp.read_map(f"test2/{cov}", 0)
CPU times: user 10.9 s, sys: 13.5 s, total: 24.4 s
Wall time: 44.2 s

_ = hp.read_map(f"test2gz/{cov}.gz", 0)
CPU times: user 2min 7s, sys: 19.5 s, total: 2min 27s
Wall time: 2min 28s