A little AppleScript to automate splitting pdfs
Category: LaTeX; Category: TeXShop; Category: FileMaker Pro; Category: AppleScript; Category: Mac
Summary: In which Jon surprises himself by writing a working AppleScript that takes LaTeX code from records of a FileMaker Pro database and transfers the code to TeXShop for typesetting. The outcome is a few long PDF files being automatically divided into many smaller, annotated PDF files, split at page numbers specified in the database.
I've never really figured out AppleScript, Apple's English-like scripting language. It looks so simple and easy, yet I haven't come across a simple and easy set of instructions and examples I can use to get into the language. One of these days I may have to break down and *buy* a book on the subject. The little example I describe here should illustrate how useful AppleScript can be, and why a little more learning on my part could save me a lot of time in the long run.
At the moment Laura and I are getting all the back issues of the journal Notornis ready for publishing online (launch due June 2006). The Ornithological Society of New Zealand has provided us with PDF files of all back issues of Notornis, one file per issue, with an underlying layer of error-ridden OCR text. One of our jobs is to get all the citation information for all the articles in all these issues (yawn). Luckily that job has been sped up somewhat by the existence of a 50-year index (although all of that computer-read text still needed proof-reading). Another job now is to split up all the issue PDF files into separate article PDF files. We are certainly not going to do all that by hand for >3,000 articles!
Here's what I've come up with, using AppleScript to combine our citation information in a FileMaker Pro database with the awesome PDF handling capabilities of the LaTeX typesetting language, specifically the package pdfpages as applied within the glorious TeXShop application. There will undoubtedly be all sorts of ways of doing this and I don't claim that my solution is the best. I tend to get something working then move on to the next thing.
The following LaTeX file, when typeset in TeXShop, creates a PDF file of the same name that contains only the selected pages of the original PDF.
\documentclass[11pt,a4paper]{article}
\usepackage{graphicx}
\usepackage[final]{pdfpages}
\usepackage[pdftex,colorlinks]{hyperref}
\hypersetup{%
pdftitle={Title of the Notornis article},
pdfauthor={Ornithological Society of New Zealand},
pdfsubject={Notornis Volume(issue) year, \copyright Ornithological Society of New Zealand},
pdfkeywords={Notornis, New Zealand, birds},
bookmarksnumbered,
pdfstartview={FitH},
urlcolor=cyan,
}%
\begin{document}
\includepdf[pages=2-4]{/Users/jon/NZES/Salisbury/OSNZpart/VOLUME_49-2002/Notornis_49_1.pdf}
\end{document}
There are a few things to note here. The pdfpages package provides the \includepdf command that drives everything. The hyperref package is optional; I use it to add annotations to the PDF files (which can be seen with Document Properties in Acrobat Reader and Get Info in the OS X 10.4 version of Preview). The \includepdf line specifies both the pages to be extracted and the path to the PDF file. The ".pdf" extension on the end of the filename in this path is optional. It is important that there are no spaces in the path to the file. (I also found out that you need OS X 10.4 for this to handle PDF files newer than version 1.4.)
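As a small aside, the pages option accepts more than a single range. The following is only a sketch of the selection syntax as I understand it from the pdfpages documentation; the filename issue.pdf is a placeholder (not one of the Notornis files) and is assumed to have at least seven pages.

```latex
\documentclass[11pt,a4paper]{article}
\usepackage[final]{pdfpages}
\begin{document}
% issue.pdf is a placeholder name; it is assumed to have at least 7 pages.
% One range, as in the example above:
\includepdf[pages=2-4]{issue.pdf}
% A comma-separated list of ranges and single pages needs braces:
\includepdf[pages={2-4,7}]{issue.pdf}
% An open-ended range runs to the last page:
\includepdf[pages=5-]{issue.pdf}
% A bare hyphen selects every page:
\includepdf[pages=-]{issue.pdf}
\end{document}
```

Leaving out the pages option altogether includes only the first page of the file, so the range needs to be given explicitly when extracting an article.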
Once I had this working, the trick was to automate it for >3,000 articles, one for each record in our FileMaker Pro database of citation information. The first step was to make a field in FileMaker which will generate the above code for each record. Here is the calculation that does this, for a field I called "LaTeX".
"\documentclass[11pt,a4paper]{article}¶
\usepackage{graphicx}¶
\usepackage[final]{pdfpages}¶
¶
\usepackage[pdftex,colorlinks]{hyperref}¶
\hypersetup{%¶
pdftitle={" & title & "},¶
pdfauthor={" & authors & "},¶
pdfsubject={Notornis " & volume_issue_bracketed & ":" & first_page & "--" & last_page & " (" & publications::pub_year & ") \copyright Ornithological Society of New Zealand, Inc.},¶
pdfkeywords={" & If ( IsEmpty ( keywords ) ; "" ; Substitute ( keywords ; ";" ; "," ) ) & ", Ornithological Society of New Zealand, Notornis, Science Journal, New Zealand, birds},¶
bookmarksnumbered,¶
pdfstartview={FitH},¶
urlcolor=cyan,¶
}%¶
¶
\begin{document}¶
¶
\includepdf[pages=" & first_page & "-" & last_page & "]{/Users/jon/NZES/Salisbury/OSNZpart/" & pdf_issue_folder & "/" & pdf_issue_filename & "}¶
¶
\end{document}"
You will see that this is nothing more than the above LaTeX code with values included from the database fields. Easy.
The trick then was to automatically typeset this code for each record of the database. The following AppleScript, written in Apple's Script Editor, typesets the code from one record of the database.
-- Fetch the generated LaTeX code and the target filename from the current record
tell application "FileMaker Pro"
    set mylatex to cell "LaTeX" of current record as styled text
    set filename to cell "pdf_filename_trimmed" of current record as styled text
end tell
tell application "TeXShop"
    activate
    -- Create a new document and fill it with the LaTeX code
    make new document at beginning with properties {name:filename}
    set the text of document filename to mylatex
    -- Closing with "saving yes" is what writes the .tex file to disk
    close document filename saving in file ("Salisbury:Users:jon:NZES:Salisbury:OSNZpart:" & filename & ".tex") saving yes
    -- Reopen the saved file, typeset it, then close it again
    open file ("Salisbury:Users:jon:NZES:Salisbury:OSNZpart:" & filename & ".tex")
    typeset document (filename & ".tex")
    close document (filename & ".tex") saving no
end tell
AppleScript is so cleverly close to English that this doesn't really need explaining. It works, and it's blink-of-an-eye quick, even on my old 600MHz G3 iBook.
That didn't make it easy to write though. My first head-banging session was when I tried to export the LaTeX field from FileMaker into a text file to then be opened by TeXShop. Try as I might, I couldn't get FileMaker to output a clean, simple text file with Unix line breaks that TeXShop could use. Giving up on that approach, I then realised that I could manually copy-and-paste clean code from FileMaker into TeXShop. The above AppleScript is written to do this. I then took a while to realise that "styled text" was essential for the line breaks in the LaTeX code to be retained in the copy-and-paste (really a set-and-set maneuver in AppleScript). Once I'd got over this major hurdle, it took me surprisingly, frustratingly long on Google to figure out how to save the TeXShop document I created so I could typeset it. The code is still a bit klutzy here, since I could only figure out how to save the file by closing it. The AppleScript then has to open it again to typeset it.
Running this script (e.g., via a button in FileMaker) takes the code of the current database record to TeXShop, typesets it, and closes the new TeXShop file. This produces a PDF file of the appropriate pages (with the specified annotations). Nice.
That was the hard part, which I've only just finished. Looping the database to do this for each record should be a simple little FileMaker script. I'll add that here once I've done it. But first, we have to finish the citation database.
