Merge two text files

bkromer
Member
Posts: 74
Joined: Thu Jul 11, 2019 10:41 am
Location: Germany

Post by bkromer » Thu Dec 10, 2020 3:15 pm

Hello,

I need to run OCR on PDF files and want to use Tesseract. Tesseract needs images, so I built a flow that converts the PDFs to JPEGs and runs OCR on them.
Tesseract outputs .txt files. I then want to read those text files in my script, use RegEx to sort them by specific identifiers, and finally upload them to my database via an API.

When the PDF has multiple pages, the "Enfocus PitStop Server PDF2Image" module outputs multiple JPEGs (2 here) with the page number appended to the name, like "documentname_1.pdf, documentname_2.pdf".

How can I combine them into one text file? The problem, as I see it, is that Switch handles the files one by one.
How can I wait for all the .txt files that belong to the original PDF file?
[Attachment: Bildschirmfoto 2020-12-10 um 14.43.38.png (123.38 KiB)]
Benjamin Kromer

jan_suhr
Advanced member
Posts: 420
Joined: Fri Nov 04, 2011 1:12 pm
Location: Nyköping, Sweden

Re: Merge two text files

Post by jan_suhr » Thu Dec 10, 2020 3:27 pm

There are a few apps that extract the text directly from PDF files.
Jan Suhr
Color Consult AB
Sweden

cstevens
Member
Posts: 100
Joined: Tue Feb 12, 2013 8:42 pm

Re: Merge two text files

Post by cstevens » Thu Dec 10, 2020 11:26 pm

If you can group all the text files together beforehand (using Assemble job or something similar) and pass a folder containing all the files to the script, then it would look something like this:

//Create a Directory object for the incoming folder	
var inFolder = new Dir(job.getPath());
//Get the list of text files in the folder sorted by age
var textFiles = inFolder.entryList("*.txt", Dir.Files, Dir.Time);
//Create a new file path with .txt extension	
var newFileLoc = job.createPathWithExtension(".txt", false);	
//Create a new file at that location and open it in append mode	
var newFile = new File(newFileLoc, 'UTF-8');
newFile.open(File.Append);
//For each text file in the incoming folder open the file and read the contents then write them to the new file.
for (var i=0; i<textFiles.length; i++){
	var inString = File.read(job.getPath() + '/' + textFiles[i]);
	newFile.write(inString);
}
//close the new file and send it to the output connection
newFile.close();
job.sendToSingle(newFileLoc);
I agree that pulling the text directly from the PDF would be easier and more reliable than using OCR though.
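
One caveat with the script above: Dir.Time orders the files by modification time, which may or may not match the page order. If the page number is the trailing "_N" in the file name, as in the naming described in the first post, the list could be reordered before the loop. This is only a rough sketch in plain JavaScript. pageNumber and sortByPage are hypothetical helper names, not part of the Switch scripting API, and I have not run it inside Switch itself.

```javascript
// Hypothetical helper (not a Switch API call): pulls the trailing page
// number out of a name such as "documentname_2.txt". Names without a
// "_<digits>" suffix before the extension get page number 0.
function pageNumber(fileName) {
    var match = /_(\d+)\.[^.]+$/.exec(fileName);
    return match ? parseInt(match[1], 10) : 0;
}

// Returns a new array ordered by numeric page number, so "_10" sorts
// after "_2". The input array is left unchanged.
function sortByPage(fileNames) {
    return fileNames.slice().sort(function (a, b) {
        return pageNumber(a) - pageNumber(b);
    });
}
```

Assuming entryList returns a plain array of file names, it would be used as textFiles = sortByPage(textFiles); right before the for loop. A plain alphabetical sort would not be enough here, because "_10" sorts before "_2" as text.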
