PDF editor for Linux

Gareth · 25 August 2022 11:15

As we have a few Linux users here, is anyone aware of any PDF editors that work under Linux? More specifically, I need to redact some info in a document.

I’ve seen some online web apps, but the content of the document is fairly sensitive so I’d rather not just have to blindly trust they’re not harvesting docs that people upload.

toryroo · 25 August 2022 11:25

Hubby was having a play for me a while ago. I’ll ask him what he used. Knowing him it wouldn’t have been an online thing!

graham · 25 August 2022 11:45

Can you pull the document into LibreOffice, edit it with the redactions required, then export it again as a PDF?

This LibreOffice help page explains how to redact in LO.
LibreOffice is FOSS, so it is free to use and available across platforms.

toryroo · 25 August 2022 12:03

Xournal, he said. It doesn’t work on all PDFs, though, but worth a try!

graham · 25 August 2022 12:08

It certainly seems to be available for Ubuntu Linux.

A cautionary note:

The most important thing to note when editing PDFs with Xournal is that to save your changes, you don’t just click Save but instead click File→Export to PDF. If you are editing a PDF, I suggest exporting into a new document—I’ve noticed in the past that sometimes if I save on top of an existing PDF, it will overwrite only a particular layer, so I’ll see a blank document apart from my changes. If you export to a new PDF, you can avoid this risk while also preserving your original, unedited PDF.

Gareth · 25 August 2022 13:03

Thanks both. Will give this a whirl.

graham · 25 August 2022 13:24

I’ve just loaded Xournal++ on my Ubuntu system.
The next step I took was to select a PDF document in Files, right-click on it, choose Open with other application and from there select Xournal++.
Once the document is loaded, go to Tools→Eraser Options and select Whiteout. The eraser tool then does its job in redacting anything under the tool, as you can see from this partly redacted form… The size of the eraser tool can be adjusted.
In all, a good utility to use for PDFs. Good call @toryroo

digitracker · 25 August 2022 14:27

Nice call with Xournal, thanks Tory. Just installed it on Arch Linux. If anyone needs to work on a read-only doc, there are a couple of workarounds in this Ask Ubuntu thread: Edit Read Only PDF Form.

RicePudding · 25 August 2022 14:34

MasterPDF
One of the rare, reasonably priced paid-licence PDF editing tools that runs on multiple OSes.

RicePudding · 25 August 2022 14:36

I used to use Xournal, but I seem to recall it actually creates an overlay over the original PDF file as a separate file.

RicePudding · 25 August 2022 14:39

LibreOffice isn’t too bad providing the PDF document to be imported isn’t too complex in terms of layout structure, as text areas are imported as textbox objects, which don’t always have the same flow as you would see in the original PDF, nor an identical font display when editing and re-exporting. It is one of those things where you would need to play around with it to see if it suits your purpose.

billybutcher · 25 August 2022 20:46

PDF is a dreadful format for editing - the “millions of textboxes” output being one issue, sometimes one per line of text in the document, which is massively painful.

The “good” PDF to (for example) Word converters actually work by rendering the page, then using OCR to convert to Word, though you still get the problem that forcing Word to reproduce an exact document layout rarely goes well.

graham · 26 August 2022 08:47

Does anything with windoze go well?

RicePudding · 26 August 2022 09:14

I use ABBYY FineReader PDF (on macOS) when I need to preserve layout. It really isn’t too bad, but it’s not cheap.
Unfortunately, the Linux OCR offerings I have tried so far are pretty lame in comparison, usually only managing to extract the text into a text file output, and none of the layout. Even then, there are often quite a few recognition errors requiring manual post-recognition correction.

graham · 26 August 2022 09:31

In Linux, if I need to preserve formats (such as I have done in my tax help guides) I use the application Shutter to capture the window, save it and edit out (redact) any confidential bits.

RicePudding · 26 August 2022 10:37

Yes, but then you are painstakingly and inefficiently bitmap editing a rasterized image of the original document.

There are surely much more efficient ways to achieve the intended result; it is just that the Linux software development community still hasn’t addressed that issue, which has already been solved on other OSes for nearly 30 years.

On a side note, one of my regular work-related issues that requires OCR is that I receive PDF documents containing text, images, and formulae/equations, often formatted in a particular way (more particularly, conformant with, inter alia, WIPO standards for patent documents).

Attempting to do this on a Linux workstation is an exercise in frustration and a complete waste of time.

An XML subset of SGML was tried for a number of years prior to this and, on the whole, failed to gain majority acceptance because users found it too hard to read/understand/implement with their usual word processor.

The patent profession and processing institutions (patent offices, etc.) are now forcing a move to DOCX-based solutions for the preparation, filing and processing of patent applications. The major issues with this are the disparities in DOCX format in the various software implementations, including in Microsoft’s own products over the years, and naturally the lack of any cohesive validation system for proving that what someone actually typed in a DOCX file and submitted is what was actually received and correctly rendered at the other end. The way the patent offices attempt to get around this problem is by converting the DOCX to PDF using XSLT processing and presenting that PDF to the user as a cross-check before validating the submission.

Unfortunately, whilst the patent offices have imposed this new format, it is up to the users to check, each time they submit a document, that the submitted document as received by the patent office corresponds to what the user intended to submit. Depending on the technical field, this can be a near impossible task, with some patent applications representing hundreds, if not thousands, of pages (e.g. in the biotech sector). Additionally, the inclusion of images and mathematical formulae requires further care to ensure that nothing has gone missing or been changed during the conversion.
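As a rough illustration of the text part of that cross-check, here is a minimal Python sketch using only the standard library. It pulls the paragraph text out of a DOCX file (a zip archive whose main body lives in word/document.xml) and lists the paragraphs that differ from a second list of paragraphs, for example text copied out of the office's PDF rendition. The function names docx_paragraphs and text_differences are made up for this example, and nothing here reflects any patent office's actual tooling. It ignores images, formulae, tables layout and formatting entirely, so it only narrows down where to look.

```python
import difflib
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a DOCX file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the text of each non-empty paragraph in a DOCX (path or file object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        text = " ".join(text.split())  # collapse runs of whitespace
        if text:
            paragraphs.append(text)
    return paragraphs


def text_differences(submitted, received):
    """Return (kind, submitted_part, received_part) for each differing stretch."""
    matcher = difflib.SequenceMatcher(a=submitted, b=received, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, submitted[i1:i2], received[j1:j2]))
    return changes
```

An empty list from text_differences only means the paragraph text matched after whitespace was collapsed; it says nothing about figures, equations or rendering.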

Some patent offices require that the whole DOCX file be resubmitted when making later amendments to content (corrections, deletions, etc.). Here again, the risk of getting things wrong at one end or the other of the submission/reception process is fairly significant, and can prove fatal to the patentee in the event of litigation. These concerns are unfortunately largely dismissed by the patent offices, which have become reliant on the DOCX format as the “Holy Grail” way forward to reducing their own operating costs, as PDF/TIFF processing (the USPTO would only accept TIFF files for years) was inevitably very labour intensive. In time, the aim is almost certainly to remove all use of the PDF format completely.

graham · 26 August 2022 10:53

It works for me… I’ve done this a few years running now - it’s surprising how quickly I can produce those slides for the tax help guides using LO Draw and then exporting the result directly as a PDF for use by SF’ers.

billybutcher · 26 August 2022 11:25

I’m curious why you think “The Linux Software Community” should provide a solution to this problem, which seems to have a relatively narrow focus.

Software solutions for Linux are either commercial, which tend to focus on the server side of things, where Linux is very successful and there is money around to employ people to do the development work, or aimed at the needs of the smaller group of desktop Linux enthusiasts.

I don’t think desktop Linux is ever going to be as ubiquitous as Windows - there will always be the chicken-and-egg situation that it is too niche for commercial software developers to write or port their Windows applications over, but unless they do, desktop Linux will remain niche.

There are, of course, stand-out applications which are very nearly as good as commercial offerings on Windows (Blender, for example) but there are an enormous number of half finished, half working apps that someone threw together for their own purposes and then lost interest.

In between are good apps that don’t quite match the functionality of their Windows counterparts (I’d put the GIMP here and probably LibreOffice as well).

The big thing about Linux for me is that I don’t need a “Redhat Account” to use it, the spyware component is non-existent (or can be zapped completely) and I don’t have to keep on paying for it (I don’t mind paying for software, I don’t mind paying for upgrades as long as I decide the point in time that I need the features; I object to being locked into subscription models to be allowed to continue using my computer).

So you are restricted to Windows or MacOS, and a subscription model.

If you are working commercially that’s probably OK, you want a reliable tool and it’s just a business expense.

I’m curious though how you wound up with what sounds like a bit of a niche product - how many other OCR packages did you evaluate on Windows/MacOS and how did they shape up?

Although I “grew up” on Unix and still prefer Linux for most of my day-to-day computing, I don’t have anything much against the technology involved in Windows.

Well, apart from the rather dire 9x/ME series - I mean anything NT derived, and post NT4 as well which is the point it more-or-less worked properly.

It’s the upgrade cycle lock-in and spying that I object to.

graham · 26 August 2022 11:42

IIRC a number of cyber attacks of late have taken place on NHS facilities (to name but one - with thousands of potential sites) who I think continue to use windoze but I’m not aware of the same issue with sites (are there any?) who use Unix/Linux…
Unbelievably, ISTR one NHS authority still using windoze 95 in recent times

Gareth · 26 August 2022 11:50

Just following up to say that I tried Xournal++ and something called Okular. Both allowed me to redact the document but, as @RicePudding said, the underlying text was still there so if you highlighted it you could copy and paste into a text editor. To get around it, I redacted it then opened the exported PDF in Chrome and did “print to PDF” but chose the rasterize option. It’s clunky because it creates a massive file, but in this instance I needed something quick so it had to do.
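For anyone wanting to check the same thing without highlighting text by hand, here is a minimal Python sketch. It assumes the pdftotext command from poppler-utils is installed; the function names extract_text and leaked_terms are just made up for this example. It dumps whatever text layer the PDF still carries and reports which of the supposedly redacted strings can still be found in it.

```python
import subprocess


def extract_text(pdf_path):
    """Dump the PDF's text layer using poppler's pdftotext (assumed installed)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def leaked_terms(extracted_text, sensitive_terms):
    """Return the sensitive terms that still appear in the extracted text.

    Matching ignores case and collapses whitespace, so a term that wraps
    across a line break in the PDF is still found.
    """
    haystack = " ".join(extracted_text.split()).casefold()
    return [
        term
        for term in sensitive_terms
        if " ".join(term.split()).casefold() in haystack
    ]
```

You would call it along the lines of leaked_terms(extract_text("redacted.pdf"), ["John Smith"]), where the file name and the term are placeholders. An empty list means none of the terms were found in the text layer, which is what you would hope for after a rasterized export; it does not tell you anything about what is visible in the page images themselves.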

Thanks for the Xournal++ suggestion, @toryroo, and for the comments, everyone.