Package org.apache.pdfbox.pdfparser
Class BaseParser
java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
- Direct Known Subclasses:
ConformingPDFParser, PDFObjectStreamParser, PDFParser, PDFStreamParser, PDFXrefStreamParser, VisualSignatureParser
This class is used to contain parsing logic that will be used by both the
PDFParser and the COSStreamParser.
- Author:
- Ben Litchfield
-
Field Summary
Fields (modifier and type, field, description):
static final String DEF
    This is a string constant that will be used for comparisons.
protected COSDocument document
    This is the document that will be parsed.
static final byte[] ENDOBJ
    This is a byte array that will be used for comparisons.
static final byte[] ENDSTREAM
    This is a byte array that will be used for comparisons.
protected final boolean forceParsing
    Flag to skip malformed or otherwise unparseable input where possible.
protected PushBackInputStream pdfSource
    This is the stream that will be read from.
static final String PROP_PUSHBACK_SIZE
    System property that defines the size of the push back buffer.
-
Constructor Summary
Constructors (modifier, constructor, description):
BaseParser()
    Default constructor.
protected BaseParser(byte[] input)
    Constructor.
BaseParser(InputStream input)
    Constructor.
BaseParser(InputStream input, boolean forceParsingValue)
    Constructor.
-
Method Summary
Methods (modifier and type, method, description):
void clearResources()
    Release all used resources.
protected boolean isClosing()
    This will tell if the next character is a closing bracket (close of PDF array).
protected boolean isClosing(int c)
    This will tell if the next character is a closing bracket (close of PDF array).
protected boolean isEndOfName(char ch)
    Determine if a character terminates a PDF name.
protected boolean isEOL()
    This will tell if the next byte to be read is an end of line byte.
protected boolean isEOL(int c)
    This will tell if the next byte to be read is an end of line byte.
protected boolean isWhitespace()
    This will tell if the next byte is whitespace or not.
protected boolean isWhitespace(int c)
    This will tell if the next byte is whitespace or not.
protected COSBoolean parseBoolean()
    This will parse a boolean object from the stream.
protected COSArray parseCOSArray()
    This will parse a PDF array object.
protected COSDictionary parseCOSDictionary()
    This will parse a PDF dictionary.
protected COSName parseCOSName()
    This will parse a PDF name from the stream.
protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file)
    This will read a COSStream from the input stream.
protected COSString parseCOSString()
    This will parse a PDF string.
protected COSString parseCOSString(boolean isDictionary)
    Deprecated. Not needed anymore.
protected COSBase parseDirObject()
    This will parse a direct object from the stream.
protected String readExpectedString(String theString)
    This will read bytes until the end of line marker occurs.
protected int readGenerationNumber()
    This will read an integer from the stream and throw an IllegalArgumentException if the integer value exceeds the maximum object revision (i.e. is bigger than GENERATION_NUMBER_THRESHOLD).
protected int readInt()
    This will read an integer from the stream.
protected String readLine()
    This will read bytes until the first end of line marker occurs.
protected long readLong()
    This will read a long from the stream.
protected long readObjectNumber()
    This will read a long from the stream and throw an IllegalArgumentException if the long value has more than 10 digits (i.e. is bigger than OBJECT_NUMBER_THRESHOLD).
protected String readString()
    This will read the next string from the stream.
protected String readString(int length)
    This will read the next string from the stream up to a certain length.
protected final StringBuilder readStringNumber()
    This method is used to read a token by the readInt() method and the readLong() method.
protected void readUntilEndStream(OutputStream out)
    This method will read through the current stream object until we find the keyword "endstream", meaning we're at the end of this object.
void setDocument(COSDocument doc)
    Set the document for this stream.
protected void skipSpaces()
    This will skip all spaces and comments that are present.
-
Field Details
-
PROP_PUSHBACK_SIZE
public static final String PROP_PUSHBACK_SIZE
System property that defines the size of the push back buffer.
-
ENDSTREAM
public static final byte[] ENDSTREAM
This is a byte array that will be used for comparisons.
-
ENDOBJ
public static final byte[] ENDOBJ
This is a byte array that will be used for comparisons.
-
DEF
public static final String DEF
This is a string constant that will be used for comparisons.
-
pdfSource
protected PushBackInputStream pdfSource
This is the stream that will be read from.
-
document
protected COSDocument document
This is the document that will be parsed.
-
forceParsing
protected final boolean forceParsing
Flag to skip malformed or otherwise unparseable input where possible.
-
-
Constructor Details
-
BaseParser
public BaseParser()
Default constructor.
-
BaseParser
public BaseParser(InputStream input, boolean forceParsingValue) throws IOException
Constructor.
- Parameters:
input - The input stream to read the data from.
forceParsingValue - Flag to skip malformed or otherwise unparseable input where possible.
- Throws:
IOException - If there is an error reading the input stream.
- Since:
Apache PDFBox 1.3.0
-
BaseParser
public BaseParser(InputStream input) throws IOException
Constructor.
- Parameters:
input - The input stream to read the data from.
- Throws:
IOException - If there is an error reading the input stream.
-
BaseParser
protected BaseParser(byte[] input) throws IOException
Constructor.
- Parameters:
input - The array to read the data from.
- Throws:
IOException - If there is an error reading the byte data.
-
-
Method Details
-
setDocument
public void setDocument(COSDocument doc)
Set the document for this stream.
- Parameters:
doc - The current document.
-
parseCOSDictionary
protected COSDictionary parseCOSDictionary() throws IOException
This will parse a PDF dictionary.
- Returns:
The parsed dictionary.
- Throws:
IOException - If there is an error reading the stream.
-
parseCOSStream
protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file) throws IOException
This will read a COSStream from the input stream.
- Parameters:
file - The file to write the stream to when reading.
dic - The dictionary that goes with this stream.
- Returns:
The parsed PDF stream.
- Throws:
IOException - If there is an error reading the stream.
-
readUntilEndStream
protected void readUntilEndStream(OutputStream out) throws IOException
This method will read through the current stream object until we find the keyword "endstream", meaning we're at the end of this object. Some PDF files, however, omit some endstream tags and just close off objects with an "endobj" tag, so we have to handle this case as well. This method is optimized using buffered IO and a reduced number of byte compare operations.
- Parameters:
out - The stream we write out to.
- Throws:
IOException
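The fallback from "endstream" to "endobj" described above can be illustrated with a minimal sketch. It is not the PDFBox implementation: it works on an in-memory byte array with a plain byte-by-byte scan instead of buffered IO, and the class name EndStreamScanSketch and the method copyUntilEndStream are hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of PDFBox.
final class EndStreamScanSketch {
    static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    // True if keyword occurs in data starting exactly at pos.
    static boolean matchesAt(byte[] data, int pos, byte[] keyword) {
        if (pos + keyword.length > data.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (data[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }

    // Copies bytes starting at 'from' into 'out' until "endstream" begins,
    // or until "endobj" begins when the endstream tag is missing.
    // Returns the index of the keyword, or data.length if neither was found.
    static int copyUntilEndStream(byte[] data, int from, ByteArrayOutputStream out) {
        int pos = from;
        while (pos < data.length
                && !matchesAt(data, pos, ENDSTREAM)
                && !matchesAt(data, pos, ENDOBJ)) {
            out.write(data[pos]);
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        byte[] data = "BT ET\nendobj".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int keywordStart = copyUntilEndStream(data, 0, out);
        System.out.println(out.size() + " bytes copied, keyword at " + keywordStart);
    }
}
```

A scan like this treats any occurrence of the keyword bytes as the end of the stream, so it is only a teaching aid for the fallback logic.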
-
parseCOSString
protected COSString parseCOSString(boolean isDictionary) throws IOException
Deprecated. Not needed anymore. Use parseCOSString() instead. PDFBOX-1437
This will parse a PDF string.
- Parameters:
isDictionary - Indicates if the stream is a dictionary or not.
- Returns:
The parsed PDF string.
- Throws:
IOException - If there is an error reading from the stream.
-
parseCOSString
protected COSString parseCOSString() throws IOException
This will parse a PDF string.
- Returns:
The parsed PDF string.
- Throws:
IOException - If there is an error reading from the stream.
-
parseCOSArray
protected COSArray parseCOSArray() throws IOException
This will parse a PDF array object.
- Returns:
The parsed PDF array.
- Throws:
IOException - If there is an error parsing the stream.
-
isEndOfName
protected boolean isEndOfName(char ch)
Determine if a character terminates a PDF name.
- Parameters:
ch - The character.
- Returns:
true if the character terminates a PDF name, otherwise false.
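As an illustration of what "terminates a PDF name" can mean, the sketch below treats the white-space and delimiter characters defined in ISO 32000-1:2008 as terminators. This is an assumption made for the example: the exact set of characters checked by BaseParser is not listed on this page and may differ. The class name NameTerminatorSketch is hypothetical.

```java
// Hypothetical illustration, not part of PDFBox.
final class NameTerminatorSketch {
    // Assumes a name ends at any white-space or delimiter character
    // as defined by ISO 32000-1:2008.
    static boolean isEndOfName(char ch) {
        switch (ch) {
            // White-space characters.
            case 0x00:
            case '\t':
            case '\n':
            case '\f':
            case '\r':
            case ' ':
            // Delimiter characters.
            case '(':
            case ')':
            case '<':
            case '>':
            case '[':
            case ']':
            case '{':
            case '}':
            case '/':
            case '%':
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        String input = "/Type/Page";
        // Scan the name that starts after the leading '/'.
        int end = 1;
        while (end < input.length() && !isEndOfName(input.charAt(end))) {
            end++;
        }
        System.out.println(input.substring(1, end));
    }
}
```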
-
parseCOSName
protected COSName parseCOSName() throws IOException
This will parse a PDF name from the stream.
- Returns:
The parsed PDF name.
- Throws:
IOException - If there is an error reading from the stream.
-
parseBoolean
protected COSBoolean parseBoolean() throws IOException
This will parse a boolean object from the stream.
- Returns:
The parsed boolean object.
- Throws:
IOException - If an IO error occurs during parsing.
-
parseDirObject
protected COSBase parseDirObject() throws IOException
This will parse a direct object from the stream.
- Returns:
The parsed object.
- Throws:
IOException - If there is an error during parsing.
-
readString
protected String readString() throws IOException
This will read the next string from the stream.
- Returns:
The string that was read from the stream.
- Throws:
IOException - If there is an error reading from the stream.
-
readExpectedString
protected String readExpectedString(String theString) throws IOException
This will read bytes until the end of line marker occurs.
- Parameters:
theString - The next expected string in the stream.
- Returns:
The characters between the current position and the end of the line.
- Throws:
IOException - If there is an error reading from the stream or theString does not match what was read.
-
readString
protected String readString(int length) throws IOException
This will read the next string from the stream up to a certain length.
- Parameters:
length - The length to stop reading at.
- Returns:
The string that was read from the stream, of length 0 to length.
- Throws:
IOException - If there is an error reading from the stream.
-
isClosing
protected boolean isClosing() throws IOException
This will tell if the next character is a closing bracket (close of PDF array).
- Returns:
true if the next byte is ']', false otherwise.
- Throws:
IOException - If an IO error occurs.
-
isClosing
protected boolean isClosing(int c)
This will tell if the next character is a closing bracket (close of PDF array).
- Parameters:
c - The character to check against the closing bracket.
- Returns:
true if c is ']', false otherwise.
-
readLine
protected String readLine() throws IOException
This will read bytes until the first end of line marker occurs. Note: if you later unread the results of this function, you'll need to add a newline character to the end of the string.
- Returns:
The characters between the current position and the end of the line.
- Throws:
IOException - If there is an error reading from the stream.
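The note about unreading a line can be sketched with the JDK's java.io.PushbackInputStream standing in for the parser's pdfSource. This is a simplified illustration, not the PDFBox code: the class ReadLineSketch, its helper source, and the choice to treat CR LF as a single marker are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of PDFBox.
final class ReadLineSketch {
    static boolean isEOL(int c) {
        return c == 0x0A || c == 0x0D;
    }

    // Reads up to, but not including, the first end of line marker.
    // The marker is consumed; CR LF is treated as one marker (an assumption).
    static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && !isEOL(c)) {
            line.append((char) c);
        }
        if (c == 0x0D) {
            int next = in.read();
            if (next != 0x0A && next != -1) {
                in.unread(next);
            }
        }
        return line.toString();
    }

    // Wraps text in a stream with room to push back up to 64 bytes.
    static PushbackInputStream source(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new PushbackInputStream(new ByteArrayInputStream(bytes), 64);
    }

    public static void main(String[] args) throws IOException {
        PushbackInputStream in = source("%PDF-1.4\r\n1 0 obj\n");
        String first = readLine(in);
        // The returned line has no end of line marker, so one must be
        // appended before pushing the line back onto the stream.
        in.unread((first + "\n").getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(readLine(in));
        System.out.println(readLine(in));
    }
}
```

Without the appended newline, the unread line would run straight into the bytes that follow it, and the next read would return both as a single line.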
-
isEOL
protected boolean isEOL() throws IOException
This will tell if the next byte to be read is an end of line byte.
- Returns:
true if the next byte is 0x0A or 0x0D.
- Throws:
IOException - If there is an error reading from the stream.
-
isEOL
protected boolean isEOL(int c)
This will tell if the next byte to be read is an end of line byte.
- Parameters:
c - The character to check against end of line.
- Returns:
true if c is 0x0A or 0x0D.
-
isWhitespace
protected boolean isWhitespace() throws IOException
This will tell if the next byte is whitespace or not.
- Returns:
true if the next byte in the stream is a whitespace character.
- Throws:
IOException - If there is an error reading from the stream.
-
isWhitespace
protected boolean isWhitespace(int c)
This will tell if the next byte is whitespace or not. These values are specified in table 1 (page 12) of ISO 32000-1:2008.
- Parameters:
c - The character to check against whitespace.
- Returns:
true if c is a whitespace character.
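For reference, the six white-space characters listed in table 1 of ISO 32000-1:2008 can be checked as in the sketch below. The class name PdfWhitespaceSketch is hypothetical, and the code illustrates the table rather than reproducing the PDFBox source.

```java
// Hypothetical illustration, not part of PDFBox.
final class PdfWhitespaceSketch {
    // White-space characters from table 1 of ISO 32000-1:2008.
    static boolean isWhitespace(int c) {
        return c == 0x00   // null
            || c == 0x09   // horizontal tab
            || c == 0x0A   // line feed
            || c == 0x0C   // form feed
            || c == 0x0D   // carriage return
            || c == 0x20;  // space
    }

    public static void main(String[] args) {
        System.out.println(isWhitespace(' '));
        System.out.println(isWhitespace('A'));
    }
}
```

Note that end of stream (-1) is not white space under this check, so a loop that skips white space stops there.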
-
skipSpaces
protected void skipSpaces() throws IOException
This will skip all spaces and comments that are present.
- Throws:
IOException - If there is an error reading from the stream.
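Skipping spaces and comments can be sketched as follows, again with java.io.PushbackInputStream standing in for pdfSource. The sketch assumes that a comment starts with '%' and runs to the end of the line, as defined by the PDF specification. The class SkipSpacesSketch is a hypothetical name, and the code is an approximation rather than the PDFBox source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of PDFBox.
final class SkipSpacesSketch {
    static boolean isEOL(int c) {
        return c == 0x0A || c == 0x0D;
    }

    static boolean isWhitespace(int c) {
        return c == 0x00 || c == 0x09 || c == 0x0A
            || c == 0x0C || c == 0x0D || c == 0x20;
    }

    // Consumes white space and '%' comments, leaving the stream
    // positioned at the first byte of the next token.
    static void skipSpaces(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (isWhitespace(c) || c == '%') {
            if (c == '%') {
                // A comment runs to the end of the line.
                c = in.read();
                while (c != -1 && !isEOL(c)) {
                    c = in.read();
                }
            } else {
                c = in.read();
            }
        }
        if (c != -1) {
            // Put back the first byte that belongs to the next token.
            in.unread(c);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "  % a comment\n  /Type".getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(bytes));
        skipSpaces(in);
        System.out.println((char) in.read());
    }
}
```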
-
readObjectNumber
protected long readObjectNumber() throws IOException
This will read a long from the stream and throw an IllegalArgumentException if the long value has more than 10 digits (i.e. is bigger than OBJECT_NUMBER_THRESHOLD).
- Returns:
The object number being read.
- Throws:
IOException - If an I/O error occurs.
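The 10-digit limit can be illustrated with the sketch below. The value assigned to OBJECT_NUMBER_THRESHOLD here (10,000,000,000, the smallest number with more than 10 digits) is an assumption inferred from the description above, not a value stated on this page. The class ObjectNumberSketch and its helpers are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of PDFBox.
final class ObjectNumberSketch {
    // Assumed value: the smallest number with more than 10 digits.
    static final long OBJECT_NUMBER_THRESHOLD = 10_000_000_000L;

    // Reads a run of decimal digits and parses it as a long.
    static long readLong(PushbackInputStream in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            // Put back the byte that ended the number.
            in.unread(c);
        }
        try {
            return Long.parseLong(digits.toString());
        } catch (NumberFormatException e) {
            throw new IOException("Expected a long but found '" + digits + "'", e);
        }
    }

    // Rejects values that have more than 10 digits.
    static long readObjectNumber(PushbackInputStream in) throws IOException {
        long value = readLong(in);
        if (value >= OBJECT_NUMBER_THRESHOLD) {
            throw new IllegalArgumentException(
                "Object number '" + value + "' has more than 10 digits");
        }
        return value;
    }

    static PushbackInputStream source(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new PushbackInputStream(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readObjectNumber(source("12 0 obj")));
    }
}
```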
-
readGenerationNumber
protected int readGenerationNumber() throws IOException
This will read an integer from the stream and throw an IllegalArgumentException if the integer value exceeds the maximum object revision (i.e. is bigger than GENERATION_NUMBER_THRESHOLD).
- Returns:
The generation number being read.
- Throws:
IOException - If an I/O error occurs.
-
readInt
protected int readInt() throws IOException
This will read an integer from the stream.
- Returns:
The integer that was read from the stream.
- Throws:
IOException - If there is an error reading from the stream.
-
readLong
protected long readLong() throws IOException
This will read a long from the stream.
- Returns:
The long that was read from the stream.
- Throws:
IOException - If there is an error reading from the stream.
-
readStringNumber
protected final StringBuilder readStringNumber() throws IOException
This method is used to read a token by the readInt() method and the readLong() method.
- Returns:
The token to parse as integer or long by the calling method.
- Throws:
IOException - Thrown by the pdfSource methods.
-
clearResources
public void clearResources()
Release all used resources.
-