[liberationtech] [open-science] Removing watermarks from pdfs (pdfparanoia)

jvoisin julien.voisin at dustri.org
Sun Feb 10 18:07:49 PST 2013


Hello,
I am the developer of the previously mentioned MAT
(https://mat.boum.org). I just want to add my 2 cents, based on what I
learned while developing metadata-anonymisation processes.

Visible marks such as lines of text or pictures can be detected
visually and removed with the help of some pdfminer-fu, so I will
rather talk about hidden metadata/watermarks.

Since PDF is a pretty complex format to process, I render each
document onto a cairo[1] surface and then save this surface to a PDF file.
Because this produces a completely new PDF, it strips a large part of
(if not all) hidden watermarks/metadata, without transforming the text
into pictures. The whole process is implemented in MAT [2].
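
A minimal sketch of the render-and-save idea described above. This is
not the actual MAT code: it assumes the PyGObject Poppler bindings and
pycairo are installed, and the names file_uri and rerender_pdf are
made up for illustration.

```python
from pathlib import Path


def file_uri(path):
    """Return a file:// URI; Poppler's GLib binding opens documents by URI."""
    return Path(path).resolve().as_uri()


def rerender_pdf(src, dst):
    """Draw every page of src onto a cairo PDF surface written to dst."""
    # Imported here so the helper above works without the bindings
    # (assumption: pycairo and the Poppler GObject bindings are available).
    import cairo
    import gi
    gi.require_version("Poppler", "0.18")
    from gi.repository import Poppler

    document = Poppler.Document.new_from_file(file_uri(src), None)

    # The initial size is a placeholder; it is set per page below.
    surface = cairo.PDFSurface(str(dst), 1, 1)
    context = cairo.Context(surface)

    for index in range(document.get_n_pages()):
        page = document.get_page(index)
        width, height = page.get_size()
        surface.set_size(width, height)
        # Vector rendering: text stays text, it is not rasterised.
        page.render_for_printing(context)
        context.show_page()

    surface.finish()
```

Usage would look like rerender_pdf("in.pdf", "out.pdf"). Only what
Poppler actually draws ends up in the output, so whether a given
hidden mark survives should still be checked case by case.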

This could be added to pdfparanoia to counter hidden threats.


1. http://www.cairographics.org/
2. https://gitweb.torproject.org/user/jvoisin/mat.git/blob/HEAD:/MAT/office.py#l141


