Package org.apache.pdfbox.util
Class PDFTextStripperByArea
java.lang.Object
org.apache.pdfbox.util.PDFStreamEngine
org.apache.pdfbox.util.PDFTextStripper
org.apache.pdfbox.util.PDFTextStripperByArea
This will extract text from a specified region in the PDF.
- Version:
- $Revision: 1.5 $
- Author:
- Ben Litchfield
-
Field Summary
Fields inherited from class org.apache.pdfbox.util.PDFTextStripper
charactersByArticle, document, output, outputEncoding, systemLineSeparator
-
Constructor Summary
Constructors
PDFTextStripperByArea()
- Constructor.
PDFTextStripperByArea(String encoding)
- Instantiate a new PDFTextStripperByArea object.
PDFTextStripperByArea(Properties props)
- Instantiate a new PDFTextStripperByArea object.
-
Method Summary
void addRegion(String regionName, Rectangle2D rect)
- Add a new region to group text by.
void extractRegions(PDPage page)
- Process the page to extract the region text.
List<String> getRegions()
- Get the list of regions that have been set up.
String getTextForRegion(String regionName)
- Get the text for the region. This should be called after extractRegions().
protected void processTextPosition(TextPosition text)
- This will process a TextPosition object and add the text to the list of characters on a page.
void removeRegion(String regionName)
- Delete a region to group text by.
protected void writePage()
- This will print the processed page text to the output stream.
Methods inherited from class org.apache.pdfbox.util.PDFTextStripper
endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageSeparator, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getText, getWordSeparator, handleLineSeparation, inspectFontEncoding, isParagraphSeparation, matchListItemPattern, matchPattern, processPage, processPages, resetEngine, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageSeparator, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageSeperator, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeText, writeWordSeparator
Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
-
Constructor Details
-
PDFTextStripperByArea
Constructor.
- Throws:
IOException
- If there is an error loading properties.
-
PDFTextStripperByArea
Instantiate a new PDFTextStripperByArea object, loading all of the operator mappings from the properties object that is passed in. This constructor does not apply encoding-specific conversions to the output text.
- Parameters:
props
- The properties containing the mapping of operators to PDFOperator classes.
- Throws:
IOException
- If there is an error reading the properties.
-
PDFTextStripperByArea
Instantiate a new PDFTextStripperByArea object. This object will load properties from PDFTextStripper.properties and will apply encoding-specific conversions to the output text.
- Parameters:
encoding
- The encoding that the output will be written in.
- Throws:
IOException
- If there is an error reading the properties.
-
-
Method Details
-
addRegion
Add a new region to group text by.
- Parameters:
regionName
- The name of the region.
rect
- The rectangle area to retrieve the text from.
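The rect argument is a java.awt.geom.Rectangle2D. As a minimal sketch of how such a rectangle can be built and tested against a point, the example below uses only the JDK. The class RegionRectSketch, its two helpers, and the coordinate values are made up for illustration and are not part of PDFBox; the sketch also makes no claim about which page coordinate origin PDFBox uses when it matches text against a region.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper class for illustration; not part of PDFBox.
class RegionRectSketch {
    // Builds a rectangle from the corner with the smallest x and y, plus a width and height.
    static Rectangle2D makeRegion(double x, double y, double width, double height) {
        return new Rectangle2D.Double(x, y, width, height);
    }

    // True when the point lies inside the rectangle, as decided by Rectangle2D.contains.
    static boolean inRegion(Rectangle2D rect, double px, double py) {
        return rect.contains(px, py);
    }

    public static void main(String[] args) {
        Rectangle2D header = makeRegion(50, 40, 300, 60); // illustrative values only
        System.out.println(inRegion(header, 100, 70));  // true: inside the rectangle
        System.out.println(inRegion(header, 100, 100)); // false: on the edge at y + height
        System.out.println(inRegion(header, 10, 70));   // false: left of the rectangle
    }
}
```

Rectangle2D.contains treats the edges at x + width and y + height as outside the rectangle, so two regions that share an edge do not both claim a point lying on it.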
-
removeRegion
Delete a region to group text by. If the region does not exist, this method does nothing.
- Parameters:
regionName
- The name of the region to delete.
-
getRegions
Get the list of regions that have been set up.
- Returns:
- A list of java.lang.String objects to identify the region names.
-
getTextForRegion
Get the text for the region. This should be called after extractRegions().
- Parameters:
regionName
- The name of the region to get the text from.
- Returns:
- The text that was identified in that region.
-
extractRegions
Process the page to extract the region text.
- Parameters:
page
- The page to extract the regions from.
- Throws:
IOException
- If there is an error while extracting text.
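The intended call order (register regions, process a page, then read each region's text) can be illustrated with a small JDK-only model. This is a sketch under stated assumptions, not PDFBox's implementation: RegionModel, Glyph, and extract are hypothetical names, the glyph list stands in for the text positions found on a page, and the model simply concatenates glyph text in input order, whereas the real class inherits its output formatting from PDFTextStripper.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of grouping text by named rectangles; not PDFBox code.
class RegionModel {
    // A piece of text and the point where it is placed (stand-in for a text position).
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> regionText = new LinkedHashMap<>();

    void addRegion(String name, Rectangle2D rect) {
        regions.put(name, rect);
    }

    // Removing an unknown name does nothing, mirroring the documented removeRegion behavior.
    void removeRegion(String name) {
        regions.remove(name);
        regionText.remove(name);
    }

    List<String> getRegions() {
        return new ArrayList<>(regions.keySet());
    }

    // Appends each glyph to every region whose rectangle contains the glyph's point.
    void extract(List<Glyph> page) {
        regionText.clear();
        for (String name : regions.keySet()) {
            regionText.put(name, new StringBuilder());
        }
        for (Glyph g : page) {
            for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
                if (e.getValue().contains(g.x, g.y)) {
                    regionText.get(e.getKey()).append(g.text);
                }
            }
        }
    }

    // In this model, returns null before extract has run or for an unknown region.
    String getTextForRegion(String name) {
        StringBuilder sb = regionText.get(name);
        return sb == null ? null : sb.toString();
    }

    public static void main(String[] args) {
        RegionModel model = new RegionModel();
        model.addRegion("title", new Rectangle2D.Double(0, 0, 200, 50));
        model.addRegion("body", new Rectangle2D.Double(0, 50, 200, 500));

        List<Glyph> page = new ArrayList<>();
        page.add(new Glyph("Hi", 10, 20));
        page.add(new Glyph("text", 10, 120));

        model.extract(page);
        System.out.println(model.getTextForRegion("title")); // prints "Hi"
        System.out.println(model.getTextForRegion("body"));  // prints "text"
    }
}
```

The model makes the ordering constraint visible: region text only exists after a page has been processed, which is why getTextForRegion is documented as something to call after extractRegions().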
-
processTextPosition
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
- Overrides:
processTextPosition in class PDFTextStripper
- Parameters:
text
- The text to process.
-
writePage
This will print the processed page text to the output stream.
- Overrides:
writePage in class PDFTextStripper
- Throws:
IOException
- If there is an error writing the text.
-