Searching through a pile of PDF files
from Darkassassin07@lemmy.ca to selfhosted@lemmy.world on 19 Aug 09:03
https://lemmy.ca/post/50048974

I have a pile of part lists for tools I’m maintaining, in PDF format, and I’m looking for a good way to take a part number, search through the collection of PDFs, and output which files contain that number. Essentially letting me match random unknown part numbers to a tool in our fleet.

I’m pretty sure the majority of them are actual text you can select and copy+paste, so searching those shouldn’t be too difficult; but I do know there are at least a couple in there that are just a string of JPEGs packed into a PDF file. Those will probably need OCR, but tbh I can probably live with skipping over them altogether.

I’ve been thinking of spinning up an instance of paperless-ngx and stuffing them all in there so I can let it index the contents, including using OCR, then use its search feature; but that also seems a tad overkill.

I’m wondering if you fine folks have any better ideas. What do you think?

#selfhosted

tofu@lemmy.nocturnal.garden on 19 Aug 09:09

The OCR thing is its own task, but for just searching for a string in PDFs, pdfgrep is very good.

pdfgrep -ri CoolNumber69 /path/to/folder
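
If you also want to see where in each document it matched, -n prefixes every match with its page number:

pdfgrep -rin CoolNumber69 /path/to/folder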

Darkassassin07@lemmy.ca on 19 Aug 09:26

Interesting; that would be much simpler. I’ll give that a shot in the morning, thanks!

hoppolito@mander.xyz on 19 Aug 15:22

In case you are already using ripgrep (rg) instead of grep, there is also ripgrep-all (rga), which lets you search through a whole bunch of file types like PDFs quickly. And it’s cached, so while the first indexing takes a moment, any further search is lightning fast.

It supports a whole truckload of file types (pdf, odt, xlsx, tar.gz, mp4, and so on), but I mostly used it to quickly search through thousands of research papers. It takes around 5 minutes to index everything for my 4000 PDFs on the first run, then it’s smooth sailing for any further searches from there.
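
Basic usage is the same as rg, so for your case something like this should do (the part number and folder here are made up):

rga -i "PN-12345" /path/to/partlists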

Darkassassin07@lemmy.ca on 19 Aug 22:28

That works magnificently. I added -l so it spits out a list of files instead of listing each matching line in each file, then set it up with an alias. Now I can ssh in from my phone and search the whole collection for any string with a single command.
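
For anyone after the same thing, it’s roughly this (the alias name is just what I picked):

# in ~/.bashrc on the server
alias partsearch='rga -il'
# usage: partsearch CoolNumber69 /path/to/partlists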

Thanks again!

tofu@lemmy.nocturnal.garden on 20 Aug 06:20

Glad to hear that!

Brkdncr@lemmy.world on 19 Aug 14:57

On Windows you may need to add an IFilter; Adobe’s is pretty good. Then Windows Search will be able to search PDF contents.

hoppolito@mander.xyz on 19 Aug 15:44

For the OCR process you can probably wrangle up a simple bash pipeline with ocrmypdf and just let it run in the background once until all your PDFs have a text layer.

With that tool it should be doable with something like a simple while loop:

find . -type f -name '*.pdf' -print0 |
    while IFS= read -r -d '' file; do
        echo "Processing $file ..."
        # --skip-text leaves pages that already have a text layer alone,
        # so your already-searchable PDFs pass through instead of erroring out
        ocrmypdf --skip-text "$file" "$file"
        # ocrmypdf --skip-text "$file" "${file%.pdf}_ocr.pdf"   # if you want a new file instead of overwriting the old
    done

If you need additional languages or other options, you’ll have to delve a little deeper into the ocrmypdf documentation, but this should be enough duct tape to whip up a full OCR cycle.
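
For example, extra OCR languages are just the -l flag with Tesseract language codes (assuming the corresponding Tesseract language packs are installed):

ocrmypdf -l eng+deu --skip-text input.pdf output.pdf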

Darkassassin07@lemmy.ca on 20 Aug 02:04

That’s a neat little tool that seems to work pretty well. Turns out the files I thought I’d need it for already have embedded OCR data, so I didn’t end up needing it. Definitely one I’ll keep in mind for the future though.

MysteriousSophon21@lemmy.world on 19 Aug 19:26

You might want to check out Docspell - it’s lighter than paperless-ngx but still handles PDF indexing and searching really well, plus it can do basic OCR on those image-based PDFs without much setup.

lIlIllIlIIIllIlIlII@lemmy.zip on 19 Aug 21:46

Try paperless-ngx. It can do OCR and has search.