[SOLVED] Using Switch for automatic OCR recognition?

Post by magnussandstrom »

Hi, we have a large project where we scan documents, make them searchable with an OCR 'overlay', and save them as PDF/A. Today we're using Acrobat Pro DC with an Action made in the Action Wizard that batch-runs a folder of files. But since the project is so big, I want to automate this process.

Is there a way to set up a Switch flow (2020 spring release) for automatic OCR recognition from scanned pages and output as PDF/A?

After some googling I can't find any obvious solution. But maybe it's possible to do it with Switch and Acrobat Pro DC, or maybe by installing Tesseract with some ingenious scripting? Any ideas?

And sorry if this has been asked before, but I'm not allowed to search for OCR (a three-letter word) in this forum.

Best regards,
Magnus
Last edited by magnussandstrom on Wed Aug 12, 2020 9:24 am, edited 2 times in total.

Post by freddyp »

Tesseract was also the first thing that sprang to my mind. Ingenious scripting? Tesseract is a command line tool, so you can integrate it with "Execute command". If the options are always the same, and I would think they probably are, you do not need a script. If the choice of options is more involved because of some if-then-else logic, it is not a lot of work for a Switch script developer to do that and if the project is that big, then that would be money well spent.
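
For anyone who wants to try the Tesseract call outside Switch first, here is a minimal Python sketch of the kind of command that "Execute command" would run. The `tesseract <image> <outputbase> -l <lang> pdf` form is Tesseract's normal command line syntax for writing a searchable PDF; the function names, the file names and the `eng` language code are placeholders chosen for illustration, not something taken from an actual flow.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> list[str]:
    """Return the argument list for OCR-ing one image into a searchable PDF.

    Tesseract appends ".pdf" to output_base itself, so the path is passed
    without an extension.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_image_to_pdf(image: Path, output_base: Path, lang: str = "eng") -> Path:
    """Run Tesseract on one image and return the path of the resulting PDF."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    return output_base.with_name(output_base.name + ".pdf")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_tesseract_command(Path("page_001.jpg"), Path("page_001"))))
```

Because the options are the same for every file, the argument list is fixed apart from the input and output names, which is why no script is needed inside Switch itself.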

Post by magnussandstrom »

The options are always the same (no if-then-else logic). Tesseract seems to support only image files, and we output multipage PDFs from our scanner. This could probably be solved with Ghostscript or a similar tool.
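
A rough sketch of the Ghostscript idea, assuming Ghostscript is installed and its executable is called `gs` (on 64-bit Windows it is usually `gswin64c`). The switches are standard Ghostscript options; the 300 dpi resolution, the PNG device and the output file pattern are assumptions made for this example.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_ghostscript_command(
    pdf: Path,
    output_pattern: str = "page-%03d.png",
    dpi: int = 300,
    executable: str = "gs",  # assumption: use "gswin64c" on 64-bit Windows
) -> list[str]:
    """Return a Ghostscript call that renders every PDF page to its own PNG."""
    return [
        executable,
        "-dNOPAUSE",  # do not pause between pages
        "-dBATCH",  # exit once the input has been processed
        "-dSAFER",  # restrict file access during rendering
        "-sDEVICE=png16m",  # 24-bit colour PNG output device
        f"-r{dpi}",  # rendering resolution in dpi
        f"-sOutputFile={output_pattern}",  # %03d becomes the page number
        str(pdf),
    ]


def split_pdf_to_images(pdf: Path, **options) -> None:
    """Render all pages of one PDF; raises CalledProcessError on failure."""
    subprocess.run(build_ghostscript_command(pdf, **options), check=True)
```

Each resulting image can then be passed to Tesseract on its own.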

Maybe I was hoping for an out-of-the-box solution. I'm guessing I'm not the first to do this in Switch?

I also found a few software packages, like OCRvision, that seem to do what I want.

Post by Padawan »

I've automated Tesseract before.

It does have the limitation that it only takes image files such as JPEG as input, which means you can't feed it multipage PDFs directly. However, you should be able to build a flow that solves this without any scripting:

- Input folder where multipage PDFs are dropped
- Remember the original filename and the number of pages in private data
- PitStop PDF2Image (or another tool) to convert the PDF to JPEGs
- OCR the JPEGs via Tesseract, which gives one searchable PDF per page
- Use an Assemble job with a custom scheme. The job identifier is the original PDF filename, which is stored in private data, and the number of files is the original number of pages, which is also stored in private data
- Use Merge PDF to merge the single-page PDFs

Output should be a multipage OCR'd PDF.
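
To make the Assemble step in the list above more concrete, here is a small Python simulation of the grouping logic. It is not Switch code and uses no Switch API: the `PageJob` and `Assembler` classes and the private data keys `OriginalName` and `PageCount` are hypothetical names invented for this sketch.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageJob:
    """One OCR'd single-page PDF plus the values carried as private data."""

    filename: str
    private_data: dict[str, str] = field(default_factory=dict)


class Assembler:
    """Collects page jobs until a group has its expected number of files."""

    def __init__(self, id_key: str = "OriginalName", count_key: str = "PageCount") -> None:
        self.id_key = id_key
        self.count_key = count_key
        self._pending: dict[str, list[PageJob]] = defaultdict(list)

    def add(self, job: PageJob) -> list[PageJob] | None:
        """Add a page; return the complete group once all pages have arrived."""
        group_id = job.private_data[self.id_key]  # KeyError if the key is misspelled
        expected = int(job.private_data[self.count_key])
        group = self._pending[group_id]
        group.append(job)
        if len(group) < expected:
            return None
        del self._pending[group_id]
        # Plain string sort: page numbers need zero padding to sort correctly.
        return sorted(group, key=lambda j: j.filename)


assembler = Assembler()
pages = [
    PageJob(f"scan_{n:03d}.pdf", {"OriginalName": "scan.pdf", "PageCount": "3"})
    for n in (2, 1, 3)  # pages may arrive out of order
]
results = [assembler.add(page) for page in pages]
```

Only the last call returns a group. Both values have to travel with every page as private data, otherwise the grouping has nothing to match on.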

These are the execute command settings I have:
[Attachment: Screenshot 2020-07-31 at 12.02.43 copy.jpg, showing the Execute command settings]

Switch is more of a build-it-yourself solution than an out-of-the-box one. It takes more effort than an out-of-the-box solution, but it can do so much more. That's what makes it so awesome :)

Post by magnussandstrom »

Thanks Padawan I will give it a try!

Post by magnussandstrom »

I think I have almost figured it out, but I'm stuck at the assemble step. The PDF files don't contain the private data anymore.
[Attachment: assemble.jpg]

Post by jan_suhr »

Have you tried setting the "Merge Metadata" property to Yes in the Assemble job?
Jan Suhr
Color Consult AB
Sweden
=============
Check out my apps

Post by magnussandstrom »

jan_suhr wrote: Fri Jul 31, 2020 4:15 pm Have you tried the property "Merge Metadata" and set it to Yes in the Assemble job.
Yes, and I get the same result. I've also tried to ungroup the job after PitStop PDF2Image, and tried to use the Ungroup job in the Assemble element, with no luck either.

Post by Padawan »

Can you let the job move step by step through the flow by placing connections on hold, and in this way test in which element the private data gets lost? You can check the contents of the private data via "Inspect jobs".

Post by magnussandstrom »

Padawan wrote: Fri Jul 31, 2020 4:58 pm Can you let the job move step by step thru the flow by placing connections on hold and this way test in which element the private data gets lost? You can check the contents of the private data via "Inspect jobs"
It's working now! I found a typo in one of the private data parameters in the Assemble element while checking where the private data got lost... :oops:

Thanks for all your help - Great first impression of this forum!

/ Magnus
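
A misspelled private data key is easy to overlook, so a small check like the one below can help when debugging outside Switch. It is only a sketch in plain Python with no Switch API involved; the function name and the key names in the example are made up.

```python
from __future__ import annotations

import difflib


def find_missing_keys(required: list[str], private_data: dict[str, str]) -> dict[str, list[str]]:
    """Map each required key that is absent to similarly spelled keys that exist."""
    return {
        key: difflib.get_close_matches(key, list(private_data), n=3, cutoff=0.6)
        for key in required
        if key not in private_data
    }


# Hypothetical example: the page count was stored under a misspelled key.
job_data = {"OriginalName": "scan.pdf", "PageCuont": "3"}
problems = find_missing_keys(["OriginalName", "PageCount"], job_data)
```

Here `problems` reports `PageCount` as missing and lists `PageCuont` as the likely intended key.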

Post by jan_suhr »

Great, now you can sleep well on your vacation :D

Post by magnussandstrom »

Thanks Jan! :lol:

Here is the final Flow for anyone looking for a solution to this in the future.

The Tesseract (v5.0) installer can be downloaded here: https://github.com/UB-Mannheim/tesseract/wiki
[Attachment: tesseract2.jpg, screenshot of the final flow]