Class PDFTextStripperByArea


public class PDFTextStripperByArea extends PDFTextStripper
This will extract text from one or more named rectangular regions of a PDF page.
Version:
$Revision: 1.5 $
Author:
Ben Litchfield
  • Constructor Details

    • PDFTextStripperByArea

      public PDFTextStripperByArea() throws IOException
      Instantiate a new PDFTextStripperByArea object.
      Throws:
      IOException - If there is an error loading properties.
    • PDFTextStripperByArea

      public PDFTextStripperByArea(Properties props) throws IOException
      Instantiate a new PDFTextStripperByArea object, loading all of the operator mappings from the properties object that is passed in. The output text is not converted to an encoding-specific form.
      Parameters:
      props - The properties containing the mapping of operators to PDFOperator classes.
      Throws:
      IOException - If there is an error reading the properties.
    • PDFTextStripperByArea

      public PDFTextStripperByArea(String encoding) throws IOException
      Instantiate a new PDFTextStripperByArea object. This object will load properties from PDFTextStripper.properties and will apply encoding-specific conversions to the output text.
      Parameters:
      encoding - The encoding that the output will be written in.
      Throws:
      IOException - If there is an error reading the properties.
  • Method Details

    • addRegion

      public void addRegion(String regionName, Rectangle2D rect)
      Add a new region to group text by.
      Parameters:
      regionName - The name of the region.
      rect - The rectangle area to retrieve the text from.
    • removeRegion

      public void removeRegion(String regionName)
      Delete a region to group text by. If the region does not exist, this method does nothing.
      Parameters:
      regionName - The name of the region to delete.
    • getRegions

      public List<String> getRegions()
      Get the list of regions that have been set up.
      Returns:
      A list of java.lang.String objects identifying the regions by name.
    • getTextForRegion

      public String getTextForRegion(String regionName)
      Get the text for the region. This should be called after extractRegions().
      Parameters:
      regionName - The name of the region to get the text from.
      Returns:
      The text that was identified in that region.
    • extractRegions

      public void extractRegions(PDPage page) throws IOException
      Process the page to extract the region text.
      Parameters:
      page - The page to extract the regions from.
      Throws:
      IOException - If there is an error while extracting text.
    • processTextPosition

      protected void processTextPosition(TextPosition text)
      This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
      Overrides:
      processTextPosition in class PDFTextStripper
      Parameters:
      text - The text to process.
    • writePage

      protected void writePage() throws IOException
      This will print the processed page text to the output stream.
      Overrides:
      writePage in class PDFTextStripper
      Throws:
      IOException - If there is an error writing the text.
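
The methods above imply a call order: register regions with addRegion, process a page with extractRegions, then read each result with getTextForRegion. The following is a minimal, library-free sketch of that idea, not PDFBox's actual implementation. The class RegionGroupingSketch and its Glyph type are hypothetical stand-ins (Glyph plays the role of a simplified TextPosition), and it assumes a piece of text belongs to a region when its (x, y) point lies inside the rectangle.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of grouping text by named regions; not PDFBox code. */
final class RegionGroupingSketch {

    /** Hypothetical positioned piece of text, a simplified stand-in for TextPosition. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Insertion-ordered maps, so region names come back in the order they were added.
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> regionText = new LinkedHashMap<>();

    void addRegion(String regionName, Rectangle2D rect) {
        regions.put(regionName, rect);
    }

    void removeRegion(String regionName) {
        // Removing an unknown name does nothing, as the removeRegion entry describes.
        regions.remove(regionName);
        regionText.remove(regionName);
    }

    List<String> getRegions() {
        return new ArrayList<>(regions.keySet());
    }

    /** Assigns every glyph of the "page" to each region whose rectangle contains it. */
    void extractRegions(List<Glyph> page) {
        regionText.clear();
        for (String name : regions.keySet()) {
            regionText.put(name, new StringBuilder());
        }
        for (Glyph glyph : page) {
            for (Map.Entry<String, Rectangle2D> entry : regions.entrySet()) {
                if (entry.getValue().contains(glyph.x, glyph.y)) {
                    regionText.get(entry.getKey()).append(glyph.text);
                }
            }
        }
    }

    /** Returns the collected text; the empty string for unknown names is this sketch's own choice. */
    String getTextForRegion(String regionName) {
        StringBuilder text = regionText.get(regionName);
        return text == null ? "" : text.toString();
    }

    public static void main(String[] args) {
        RegionGroupingSketch sketch = new RegionGroupingSketch();
        sketch.addRegion("header", new Rectangle2D.Double(0, 0, 100, 50));
        sketch.addRegion("body", new Rectangle2D.Double(0, 50, 100, 50));

        List<Glyph> page = Arrays.asList(
                new Glyph("H", 10, 10),
                new Glyph("i", 20, 10),
                new Glyph("!", 10, 60));

        // Results are only meaningful after extractRegions has run.
        sketch.extractRegions(page);
        System.out.println(sketch.getTextForRegion("header")); // prints "Hi"
        System.out.println(sketch.getTextForRegion("body"));   // prints "!"
    }
}
```

With the real class the same three steps apply, except that extractRegions takes a PDPage and can throw IOException.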