Searching and Splitting PDF Files in Java

It’s quite fulfilling to get something working when all the odds seemed stacked against it.

Reading and manipulating text files in Java is no big deal; surely this is one of the basic lab modules any Computer Science student has to go through. After a couple of years working with Java, though, it’s only now that I ran into a problem: we needed to check the contents of a PDF file. It is possible given the plethora of JARs and open source code available on the Web, but where do I start?

Thankfully there’s SourceForge, and the Apache Software Foundation (ASF). PDFBox, a library that started out on SourceForge and was eventually absorbed by the ASF, allows for text extraction, creation and modification of PDFs. It took me about a day to understand the API before I came up with a solution.

I was somewhat worried that this library might already require Java 5 or later, and the current site confirmed it. They still have the legacy version for Java 1.4.2, though, which is good because if you’re tied to Oracle’s WebLogic 8.1, there’s no chance in heaven (or hell) that you can use Java 5 at all.😦

Anyway, the code that I made is shown in the listing below. I admit it isn’t very efficient, but it gets the job done. I’d appreciate it if you could suggest ways to improve it after taking it for a test drive.🙂

For this to work, you will need the PDFBox 0.7.3 and FontBox 0.1.0 packages. If you’re on Java 5 or later, you may choose to use a later version of PDFBox instead (1.2.1 as of this posting).

// PDFBox 0.7.3 package names; the later Apache releases use org.apache.pdfbox.* instead
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.List;

import org.pdfbox.cos.COSDocument;
import org.pdfbox.exceptions.COSVisitorException;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDPage;
import org.pdfbox.util.PDFTextStripper;

/**
 * Truncates the input PDF by removing all pages after the last page where the specified
 * searchString is found. This makes use of Apache Software Foundation's PDFBox JAR
 * to manipulate and read PDF files.
 *
 * @author Ronx Ronquillo
 * @param originalFile -- The File from which the PDF is to be taken
 * @param searchString -- The string that will serve as indicator where the new PDF will end.
 * @return the truncated PDF in the temp directory, or null if searchString was not found
 *         or the file could not be processed
 */
private File truncatePDF(File originalFile, String searchString) //, String password
{
    // The PDF representations as Java objects
    PDDocument originalPDF = null, newPDF = null;
    File newPDFFile = null;
    int endPage = -1;        // Index of the page where the new PDF will end

    try {
        originalPDF = PDDocument.load(originalFile);

        // If the PDF is password-encrypted, you have to uncomment and invoke this
        //originalPDF.decrypt(password);

        List pages = originalPDF.getDocumentCatalog().getAllPages();

        SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd-HHmmss");
        for (int i = 0; i < pages.size(); i++) {
            File tempFile = new File(System.getProperty("java.io.tmpdir"),
                    "PDFMaker" + sdf.format(new java.util.Date()) + ".pdf");

            // Save the current page as a one-page temp PDF so it can be parsed on its own
            PDPage page = (PDPage) pages.get(i);
            PDDocument temp = new PDDocument();
            temp.addPage(page);
            FileOutputStream tempOut = new FileOutputStream(tempFile);
            try {
                temp.save(tempOut);
            } finally {
                temp.close();
                tempOut.close();
            }

            // Parse the temp PDF and extract its text
            FileInputStream tempIn = new FileInputStream(tempFile);
            try {
                PDFParser parser = new PDFParser(tempIn);
                parser.parse();

                COSDocument cosDoc = parser.getDocument();
                PDFTextStripper pdfStripper = new PDFTextStripper();
                PDDocument pdDoc = new PDDocument(cosDoc);
                String parsedText = pdfStripper.getText(pdDoc);

                // The last page of the new PDF will be where searchString is found.
                // Later matches overwrite earlier ones, so the last occurrence wins.
                if (parsedText.indexOf(searchString) >= 0) endPage = i;

                pdDoc.close();
                cosDoc.close();
            } finally {
                // Minimize resource wastage: close the stream and delete the temp file when done!
                tempIn.close();
                tempFile.delete();
            }
        }

        originalPDF.close();
        originalPDF = null;   // already closed, so the finally block skips it

        if (endPage >= 0) {
            newPDF = PDDocument.load(originalFile);
            // If the PDF is password-encrypted, you have to uncomment and invoke this
            //newPDF.decrypt(password);

            // Remove pages after endPage until removePage reports there is nothing left to remove
            while (newPDF.removePage(endPage + 1));

            File outFile = new File(System.getProperty("java.io.tmpdir"), "trunc_" + originalFile.getName());
            FileOutputStream out = new FileOutputStream(outFile);
            try {
                newPDF.save(out);
            } finally {
                out.close();
            }

            // Only hand the file back once it has been written completely
            newPDFFile = outFile;
            newPDFFile.deleteOnExit();

            System.out.println("The new file is created successfully.");
        } else {
            System.out.println("No changes were made to the file.");
        }

    } catch (IOException ie) {
        System.err.println("Problem with creating / reading the file.");
    } catch (COSVisitorException ce) {
        System.err.println("COSVisitorException");
    }
    /*catch(CryptographyException cce){
        // If there is encryption involved later, use these
        // Reading up on BouncyCastle (bouncycastle.org) might be helpful

        System.err.println("Failed decryption.");
    }catch(InvalidPasswordException iie){
        System.err.println("Invalid password!");
    }*/
    finally {
        try {
            if (originalPDF != null) originalPDF.close();
        } catch (IOException ie) {
            System.err.println("Cannot close originalPDF");
        }
        try {
            if (newPDF != null) newPDF.close();
        } catch (IOException ie) {
            System.err.println("Cannot close newPDF");
        }
    }

    // null when searchString was not found or an error occurred
    return newPDFFile;
}

This is based on my sandbox code. The method returns a new File object, a PDF file, that has been stripped of any pages after the last occurrence of a certain String searchString. It takes in the original PDF file and the search string. The PDF file is loaded, then each page is parsed and searched for the string.
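To make the "last occurrence" rule easier to see, here is a minimal sketch of just the search step, kept separate from PDFBox so it can be tried on its own. PageText and PageSearch are hypothetical names made up for this illustration; they are not part of PDFBox. In real code, textOf would have to be backed by something that extracts one page's text; one option that might avoid the temp file per page is PDFTextStripper with setStartPage and setEndPage set to the same (1-based) page number, but I have not tested that here.

```java
/** Hypothetical abstraction: anything that can return the text of one page (0-based index). */
interface PageText {
    int pageCount();
    String textOf(int pageIndex);
}

/** Hypothetical helper holding only the search logic; no PDFBox calls are made here. */
final class PageSearch {
    private PageSearch() {}

    /** Returns the 0-based index of the last page containing searchString, or -1 if none does. */
    static int lastPageContaining(PageText pages, String searchString) {
        // Scan from the back so we can stop at the first hit
        for (int i = pages.pageCount() - 1; i >= 0; i--) {
            String text = pages.textOf(i);
            if (text != null && text.indexOf(searchString) >= 0) {
                return i;
            }
        }
        return -1;
    }

    /** Number of trailing pages to drop so that endPage becomes the last page. */
    static int pagesToRemove(int pageCount, int endPage) {
        return endPage < 0 ? 0 : pageCount - (endPage + 1);
    }

    /** In-memory PageText, handy for trying the logic without a real PDF. */
    static PageText of(final String[] texts) {
        return new PageText() {
            public int pageCount() { return texts.length; }
            public String textOf(int pageIndex) { return texts[pageIndex]; }
        };
    }
}
```

Scanning backwards means only the pages after the match need to be read, whereas the listing above parses every page. A return value of -1 plays the same role as endPage staying at -1 in the listing.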

Also, there’s a commented-out part there on passwords. Yes, PDFs can be password-protected too, in case you weren’t aware. Unfortunately, I haven’t gone as far as exploring that part of PDFBox (yet). As an aside, reading up on BouncyCastle encryption might help.😉

After creating the File, it can then be served to the user via direct download. See my previous post on how to go about it.🙂
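For completeness, the core of serving the file is copying its bytes to an output stream. The sketch below shows only that copy step using plain java.io; FileStreamer is a hypothetical name, and in a web application the OutputStream would presumably be the one obtained from the response object, with the content type and headers set beforehand.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper: streams a file's bytes to any OutputStream. */
final class FileStreamer {
    private FileStreamer() {}

    /**
     * Copies source to out and returns the number of bytes written.
     * The caller owns out, so it is flushed but not closed here.
     */
    static long copy(File source, OutputStream out) throws IOException {
        InputStream in = new FileInputStream(source);
        try {
            byte[] buffer = new byte[8192];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
            out.flush();
            return total;
        } finally {
            // Always release the file handle, even if the copy fails midway
            in.close();
        }
    }
}
```

The try/finally form is used instead of newer language features so that it stays usable on an old runtime such as Java 1.4.2.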

Thanks also to Prasanna Seshadri’s PDFTextParser.java post for additional clues on PDFBox.
