Large-volume TIFF → PDF/A combine (memory-safe, background, FTP intake)
What changed and why
Combining many scanned TIFF pages into one PDF/A used to run a single ImageMagick
convert over all the files at once, in the web request. For large documents
(e.g. 258 TIFFs at ~42 MB each ≈ 11 GB) that ran the server out of memory and timed
out the request — the combine "failed on volumes."
It is now:
- Memory-safe. Each page is converted to its own single-page PDF with bounded
  ImageMagick limits (and -compress JPEG), the pages are concatenated with qpdf in
  batches, then a single Ghostscript pass produces the PDF/A. Peak memory is one
  page's worth (~100 MB) regardless of page count. Output is always PDF/A (the
  archival target; pdfa-2b by default).
- Background. The web "Process" / "Recreate" buttons only set the job to
  queued; a cron worker (ahg:tiff-pdf-process) runs the merge and emails the
  job's user on completion/failure.
- FTP-friendly. Two intake paths feed the same queue (below).
The combined PDF/A is attached to the record as a master digital object, and the
ahg:optimize-pdfs task then generates its fast-loading .web.pdf sibling.
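The memory-safe steps described above (one bounded convert per page, batched qpdf concatenation, one Ghostscript pass) can be sketched in shell roughly as follows. This is an illustration under stated assumptions, not the plugin's own code: the names combine_tiffs and run, the limit values, the batch size of 50 and the DRY_RUN switch are all hypothetical. A fully conformant PDF/A usually also needs an ICC profile and a PDFA_def.ps definition file passed to Ghostscript, which this sketch leaves out.

```bash
# Sketch only: function names, limit values and BATCH_SIZE are assumptions.
BATCH_SIZE=${BATCH_SIZE:-50}    # pages per qpdf call

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# combine_tiffs <empty-work-dir> <output.pdf> <page1.tif> [<page2.tif> ...]
combine_tiffs() {
  local work="$1" out="$2" i=0 b=0 batches=0 tif page
  shift 2
  [ "$#" -ge 1 ] || { echo "no input pages" >&2; return 1; }
  mkdir -p "$work" || return 1

  # 1. One bounded convert per page, so memory stays near one page's worth.
  #    Each page file name carries its batch number so step 2 can glob it.
  for tif in "$@"; do
    b=$(( i / BATCH_SIZE + 1 ))
    i=$(( i + 1 ))
    page=$(printf '%s/page-%04d-%05d.pdf' "$work" "$b" "$i")
    run convert -limit memory 256MiB -limit map 512MiB \
        "$tif" -compress JPEG "$page" || return 1
  done
  batches=$b

  # 2. Concatenate each batch with qpdf, then join the batch files.
  #    Zero-padded names keep the glob expansion in page order.
  b=1
  while [ "$b" -le "$batches" ]; do
    run qpdf --empty --pages "$work"/page-"$(printf '%04d' "$b")"-*.pdf \
        -- "$(printf '%s/batch-%04d.pdf' "$work" "$b")" || return 1
    b=$(( b + 1 ))
  done
  run qpdf --empty --pages "$work"/batch-*.pdf -- "$work/merged.pdf" || return 1

  # 3. One Ghostscript pass to PDF/A-2 (ICC profile and PDFA_def.ps omitted).
  run gs -dPDFA=2 -dPDFACompatibilityPolicy=1 -dBATCH -dNOPAUSE \
      -sDEVICE=pdfwrite -sColorConversionStrategy=RGB \
      -sOutputFile="$out" "$work/merged.pdf"
}
```

With DRY_RUN=1 the function only prints the commands it would run, which is a cheap way to check page order and batch boundaries before touching 11 GB of scans. The work directory should start empty, since the glob in step 2 picks up any page or batch file already present.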
Intake paths
- From a record (Link digital object): auto-link. Launch the TIFF→PDF merge in a
  record context (information_object set). The combined PDF/A is attached to
  that record automatically.
- Manual server folder. POST to tiffpdfmerge/importFolder with folder=<path>
  (and optional information_object_id). The files are referenced in place (no
  browser upload, no copy) and the job is queued. The folder must sit under the
  allowed base (app_tiff_combine_import_base, default = web dir).
- FTP drop-folder: create now, link later. Drop a folder of page TIFFs under the
  watch base, named with the target record reference:
    <watch-base>/<record-ref>/page0001.tif, page0002.tif, ...
  ahg:tiff-combine-watch (cron) detects a ready folder (a .ready marker, or no
  file modified for --stable-minutes), maps <record-ref> → record (slug, then
  identifier), queues the combine, and writes a .queued marker. If no record
  matches, the PDF/A is still created so it can be linked later. Watch base =
  app_tiff_combine_watch_dir (default <web-dir>/uploads/tiff-combine).
- Recreate / retry. tiffpdfmerge/recreate (job_id) re-queues a completed or
  failed job. Use it to regenerate after an upload finishes or after a convert
  failure.
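The drop-folder readiness rule described above (a .ready marker, or no file modified for a number of minutes) can be approximated in shell as below. This is a hedged sketch, not the watcher's implementation: the function name folder_is_ready, the 10-minute default, the requirement that at least one TIFF be present, and the skip on an existing .queued marker are assumptions. It relies on find options available in GNU findutils.

```bash
# Sketch only: folder_is_ready and its 10-minute default are assumptions.
# folder_is_ready <folder> [<stable-minutes>]
# Succeeds when the folder has a .ready marker, or holds at least one TIFF
# and no file in it was modified within the last <stable-minutes> minutes.
folder_is_ready() {
  local dir="$1" stable="${2:-10}"
  [ -d "$dir" ] || return 1
  if [ -e "$dir/.queued" ]; then return 1; fi   # assumed: already queued, skip
  if [ -e "$dir/.ready" ]; then return 0; fi    # explicit marker wins

  # Need at least one page to combine.
  [ -n "$(find "$dir" -maxdepth 1 -type f \
          \( -iname '*.tif' -o -iname '*.tiff' \) -print -quit)" ] || return 1

  # A file touched less than <stable> minutes ago suggests the upload is
  # still in progress; succeed only when find reports no such file.
  [ -z "$(find "$dir" -maxdepth 1 -type f -mmin "-$stable" -print -quit)" ]
}
```

An uploader that knows when it has finished can skip the waiting period by creating the marker itself, for example with touch "<watch-base>/<record-ref>/.ready" once the last page has been transferred.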
Host setup
Needs Ghostscript + qpdf + ImageMagick (convert) on PATH (all standard):
sudo apt-get install -y ghostscript qpdf imagemagick
Use a temp dir on real disk with ample free space (intermediate page PDFs are small
with -compress JPEG, but the merged + PDF/A copies need room). Configure via the
temp_directory job setting (default /tmp/tiff-pdf-merge); point it at non-tmpfs
storage if /tmp is RAM-backed.
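One way to check whether a candidate temp directory is RAM-backed, and how much space it has, is sketched below. The helper names is_ram_backed_fstype and check_temp_dir are hypothetical, and the sketch assumes GNU coreutils (stat -f -c %T and df -Pk), as found on the Debian/Ubuntu hosts the install command above targets.

```bash
# Sketch only: helper names are assumptions; relies on GNU stat and df.

# is_ram_backed_fstype <fstype>: succeeds for RAM-backed filesystem types.
is_ram_backed_fstype() {
  case "$1" in
    tmpfs|ramfs) return 0 ;;
    *) return 1 ;;
  esac
}

# check_temp_dir <dir>: print filesystem type and free space.
# Returns 1 if the directory cannot be inspected, 2 if it is RAM-backed.
check_temp_dir() {
  local dir="$1" fstype free_kb
  fstype=$(stat -f -c %T "$dir") || return 1
  free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  echo "$dir: fstype=$fstype free=$(( free_kb / 1024 )) MiB"
  if is_ram_backed_fstype "$fstype"; then
    echo "warning: $dir is RAM-backed; point temp_directory elsewhere" >&2
    return 2
  fi
}
```

Running check_temp_dir /tmp/tiff-pdf-merge before the first large job shows at a glance whether the default location is suitable, or whether temp_directory should be moved to a disk-backed path.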
Cron
# Process queued combine jobs (memory-safe, notifies on done)
* * * * * www-data cd /usr/share/nginx/<instance> && php symfony ahg:tiff-pdf-process >> /var/log/atom-tiff-pdf.log 2>&1
# Auto-queue FTP drop-folders
*/5 * * * * www-data cd /usr/share/nginx/<instance> && php symfony ahg:tiff-combine-watch >> /var/log/atom-tiff-pdf.log 2>&1
Backfill / manual run
cd /usr/share/nginx/<instance>
php symfony ahg:tiff-pdf-process # process whatever is queued now
php symfony ahg:tiff-combine-watch # scan the drop-folder once
Notes
- Plugin-only; no AtoM base (apps/qubit/...) changes.
- Twin of the Heratio (Laravel) ahg:optimize-pdfs / TIFF-PDF tooling.
- Masters/originals are never modified.