Hi all!
I'm trying to split a multipage PDF into parts based on its content. Every page has a number in the same position, and I'd like to split the PDF whenever that number changes. I'm currently trying to do this with the Split PDF Pages app, using the Strategy property 'Action list'. This requires me to log the pages where a new number starts, but I can't figure out how to compare one page with the next. Is this the right way forward, or is there a better one? Any help is much appreciated!
Split PDF based on content
Re: Split PDF based on content
You cannot detect a change in numbers from one page to another in an Action List, so this approach will not work. This is how it can work, but it will require a small script:
- Run PitStop Server with an Action List that uses "Select objects inside or outside region" - "Log text properties" and create a JSON report dataset
- Run a script that reads the JSON dataset, loops over the pages to find the differences and defines a piece of private data with the correct page range in the form 1-3,4-8,9-17,18-42
- Run the Split PDF pages app with the strategy "Page range" and the value coming from the piece of private data
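The core of the script step above, turning per-page numbers into a page-range string, could look roughly like this. It is only a sketch of the logic, written in Python for readability; an actual Switch script would be written in the Switch scripting language, and reading the JSON dataset and setting the private data are left out because the report layout is not shown in this thread. The function name `page_ranges` and the input list are made up for illustration.

```python
def page_ranges(values):
    """Build a string such as '1-3,4-5' from the number found on each page.

    values[0] is the number read from page 1, values[1] from page 2, etc.
    A new range starts whenever the number differs from the previous page.
    """
    ranges = []
    start = 1  # first page of the run currently being collected (1-based)
    for page in range(1, len(values) + 1):
        is_last = page == len(values)
        # values[page] is the number on the NEXT page (the list is 0-based).
        if is_last or values[page] != values[page - 1]:
            ranges.append(f"{start}-{page}")
            start = page + 1
    return ",".join(ranges)


# Hypothetical input: the number logged for each of five pages.
print(page_ranges(["100", "100", "100", "101", "101"]))  # prints 1-3,4-5
```

Note that a part consisting of a single page comes out as "4-4" in this sketch; whether the "Page range" strategy prefers that or a plain "4" is something to check.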
Re: Split PDF based on content
Hi Freddy, I unfortunately do not have the Scripting module available to me. However, I managed it with an Action list. I basically do the first part of your suggestion. Then I read the message PitStop provides through the report, where it specifies the pages it found in the text properties, using a little regex thing to isolate the ranges. Using the TextIndexed property, I can then collect them all and pass them to Split PDF pages. Works consistently so far but I've not tested it with too many files yet. This is the variable I'm using:
[Metadata.TextIndexed:Dataset="Log",Model="XML",Path="/EnfocusReport/PreflightReport/Informations/PreflightReportItem/Message",Separator=",",Search="\d+-\d+(?=\))"]
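To show what the Search expression in that variable picks up, here is a small Python sketch that applies the same regex to a couple of report messages and joins the hits with a comma, roughly imitating the Separator setting. The message texts are invented for illustration; the real wording of the PitStop report is not shown in this thread, and the only assumption is that each page range is directly followed by a closing parenthesis.

```python
import re

# Same expression as in the Search attribute: digits, a hyphen, digits,
# and only when a closing parenthesis follows immediately.
RANGE_PATTERN = re.compile(r"\d+-\d+(?=\))")


def collect_ranges(messages):
    """Return all page ranges found in the messages as one comma-separated string."""
    found = []
    for message in messages:
        found.extend(RANGE_PATTERN.findall(message))
    return ",".join(found)


# Hypothetical report messages, not actual PitStop output.
messages = [
    "Text properties logged (pages 1-3)",
    "Text properties logged (pages 4-8)",
]
print(collect_ranges(messages))  # prints 1-3,4-8
```

The lookahead `(?=\))` keeps the parenthesis out of the result, so the joined string can be passed on as a page range without further cleanup.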
Re: Split PDF based on content
Well done!
Just for my understanding: the area where the number is, is only filled for each new split, right? I was thinking that there would always be a number and that the split had to happen when the number changed. In that case every page would be logged and the splitting would not work.
Assuming I am right, you can probably do it with the Action List strategy in the Split PDF pages app anyhow.
Select objects inside or outside region
Select text by key phrase (use a regex to match the number)
AND
Log selection