Package org.htmlcleaner
Class HtmlTokenizer
- java.lang.Object
-
- org.htmlcleaner.HtmlTokenizer
-
public abstract class HtmlTokenizer extends java.lang.Object
Main HTML tokenizer. Its task is to parse HTML and produce a list of valid tokens: open tag tokens, end tag tokens, contents (text) and comments. As soon as a new item is added to the token list, the cleaner is invoked to clean the current list at the end.
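To make the four token kinds concrete, here is a minimal, self-contained sketch of the idea. It is not HtmlCleaner's implementation: the class name TokenizerSketch, the string-based token format and the parsing rules are all assumptions made for illustration, and it ignores attributes, self-closing tags, doctype handling and script context.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified tokenizer; not HtmlCleaner's implementation. */
class TokenizerSketch {
    /** Splits markup into tagged strings such as "OPEN:p" or "TEXT:Hi". */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.startsWith("<!--", pos)) {
                // Comment: runs to the closing "-->" or to the end of input.
                int end = html.indexOf("-->", pos + 4);
                int stop = (end < 0) ? html.length() : end;
                tokens.add("COMMENT:" + html.substring(pos + 4, stop));
                pos = (end < 0) ? html.length() : end + 3;
            } else if (html.startsWith("</", pos)) {
                // End tag: the name runs up to '>'.
                int end = html.indexOf('>', pos);
                int stop = (end < 0) ? html.length() : end;
                tokens.add("END:" + html.substring(pos + 2, stop).trim());
                pos = (end < 0) ? html.length() : end + 1;
            } else if (html.charAt(pos) == '<' && pos + 1 < html.length()
                    && Character.isLetter(html.charAt(pos + 1))) {
                // Open tag: keep only the name, drop everything after the first space.
                int end = html.indexOf('>', pos);
                int stop = (end < 0) ? html.length() : end;
                String body = html.substring(pos + 1, stop).trim();
                int space = body.indexOf(' ');
                tokens.add("OPEN:" + (space < 0 ? body : body.substring(0, space)));
                pos = (end < 0) ? html.length() : end + 1;
            } else {
                // Text content: everything up to the next '<'.
                int next = html.indexOf('<', pos + 1);
                int stop = (next < 0) ? html.length() : next;
                tokens.add("TEXT:" + html.substring(pos, stop));
                pos = stop;
            }
        }
        return tokens;
    }
}
```

In this sketch a stray "<" that does not start a tag or comment is treated as text, which is one plausible way to keep malformed input from being lost.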
-
-
Field Summary
Fields
Modifier and Type Field Description
private boolean
_asExpected
private TagToken
_currentTagToken
private DoctypeToken
_docType
private boolean
_isScriptContext
private int
_len
private int
_pos
private java.io.BufferedReader
_reader
private char[]
_saved
private int
_savedLen
private java.util.List<BaseToken>
_tokenList
private char[]
_working
private java.lang.StringBuilder
commonStr
private boolean
isAllowHtmlInsideAttributes
private boolean
isAllowMultiWordAttributes
private boolean
isNamespacesAware
private boolean
isOmitComments
private boolean
isOmitDeprecatedTags
private boolean
isOmitUnknownTags
private boolean
isTreatDeprecatedTagsAsContent
private boolean
isTreatUnknownTagsAsContent
private CleanerProperties
props
private ITagInfoProvider
tagInfoProvider
private CleanerTransformations
transformations
private static int
WORKING_BUFFER_SIZE
-
Constructor Summary
Constructors
Constructor Description
HtmlTokenizer(java.io.Reader reader, CleanerProperties props, CleanerTransformations transformations, ITagInfoProvider tagInfoProvider)
Constructor - creates an instance of the parser with the specified content.
-
Method Summary
Modifier and Type Method Description
private boolean
addSavedAsContent()
private void
addToken(BaseToken token)
private java.lang.String
attributeValue()
Parses a single tag attribute - it is expected to be in one of the forms: name=value, name="value", name='value' or name.
private void
comment()
private boolean
content()
(package private) abstract TagNode
createTagNode(java.lang.String name)
private void
doctype()
private char
getCurrentChar()
DoctypeToken
getDocType()
(package private) java.util.List<BaseToken>
getTokenList()
private void
go()
private void
go(int step)
private java.lang.String
identifier()
Parses an identifier from the current position.
private void
ignoreUntil(char ch)
private boolean
isAllRead()
Checks if end of the content is reached.
private boolean
isChar(char ch)
Checks if character at current runtime position is equal to specified char.
private boolean
isChar(int position, char ch)
Checks if character at specified position is equal to specified char.
private boolean
isCharEquals(char ch)
private boolean
isCharSimple(char ch)
private boolean
isIdentifierChar()
Checks if character at current runtime position can be identifier part.
private boolean
isIdentifierStartChar()
Checks if character at current runtime position can be identifier start.
private boolean
isIdentifierStartChar(int position)
Checks if character at specified position can be identifier start.
private boolean
isReservedTag(java.lang.String tagName)
Checks if specified tag name is one of the reserved tags: HTML, HEAD or BODY.
private boolean
isValidXmlChar()
private boolean
isValidXmlCharSafe()
private boolean
isWhitespace()
Checks if character at current runtime position is whitespace.
private boolean
isWhitespace(int position)
Checks if character at specified position is whitespace.
private boolean
isWhitespaceSafe()
(package private) abstract void
makeTree(java.util.List<BaseToken> tokenList)
private void
readIfNeeded(int neededChars)
private void
save(char ch)
Saves specified character to the temporary buffer.
private void
saveCurrent()
Saves character at current runtime position to the temporary buffer.
private void
saveCurrent(int size)
Saves specified number of characters at current runtime position to the temporary buffer.
private void
saveCurrentSafe()
private void
skipWhitespaces()
Skips whitespaces at current position and moves forward until a non-whitespace character is found or the end of content is reached.
(package private) void
start()
Starts parsing HTML.
private boolean
startsWith(java.lang.String value)
Checks if content starts with specified value at the current position.
private boolean
startsWithSimple(java.lang.String value)
private void
tagAttributes()
Parses the list of tag attributes from the current position.
private void
tagEnd()
Parses end of the tag.
private void
tagStart()
Parses start of the tag.
-
-
-
Field Detail
-
WORKING_BUFFER_SIZE
private static final int WORKING_BUFFER_SIZE
- See Also:
- Constant Field Values
-
_reader
private java.io.BufferedReader _reader
-
_working
private char[] _working
-
_pos
private transient int _pos
-
_len
private transient int _len
-
_saved
private transient char[] _saved
-
_savedLen
private transient int _savedLen
-
_docType
private transient DoctypeToken _docType
-
_currentTagToken
private transient TagToken _currentTagToken
-
_tokenList
private transient java.util.List<BaseToken> _tokenList
-
_asExpected
private boolean _asExpected
-
_isScriptContext
private boolean _isScriptContext
-
props
private CleanerProperties props
-
isOmitUnknownTags
private boolean isOmitUnknownTags
-
isTreatUnknownTagsAsContent
private boolean isTreatUnknownTagsAsContent
-
isOmitDeprecatedTags
private boolean isOmitDeprecatedTags
-
isTreatDeprecatedTagsAsContent
private boolean isTreatDeprecatedTagsAsContent
-
isNamespacesAware
private boolean isNamespacesAware
-
isOmitComments
private boolean isOmitComments
-
isAllowMultiWordAttributes
private boolean isAllowMultiWordAttributes
-
isAllowHtmlInsideAttributes
private boolean isAllowHtmlInsideAttributes
-
transformations
private CleanerTransformations transformations
-
tagInfoProvider
private ITagInfoProvider tagInfoProvider
-
commonStr
private java.lang.StringBuilder commonStr
-
-
Constructor Detail
-
HtmlTokenizer
public HtmlTokenizer(java.io.Reader reader, CleanerProperties props, CleanerTransformations transformations, ITagInfoProvider tagInfoProvider) throws java.io.IOException
Constructor - creates an instance of the parser with the specified content.
- Parameters:
reader -
props -
transformations -
tagInfoProvider -
- Throws:
java.io.IOException
-
-
Method Detail
-
addToken
private void addToken(BaseToken token)
-
makeTree
abstract void makeTree(java.util.List<BaseToken> tokenList)
-
createTagNode
abstract TagNode createTagNode(java.lang.String name)
-
readIfNeeded
private void readIfNeeded(int neededChars) throws java.io.IOException
- Throws:
java.io.IOException
-
getTokenList
java.util.List<BaseToken> getTokenList()
-
go
private void go() throws java.io.IOException
- Throws:
java.io.IOException
-
go
private void go(int step) throws java.io.IOException
- Throws:
java.io.IOException
-
startsWith
private boolean startsWith(java.lang.String value) throws java.io.IOException
Checks if content starts with specified value at the current position.
- Parameters:
value -
- Returns:
- true if the content starts with the specified value, false otherwise.
- Throws:
java.io.IOException
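A prefix check of this kind can be sketched against a plain character buffer. The following is an illustration only, not the library's code: the class name, the explicit buffer, pos and len parameters, and the case-insensitive comparison are all assumptions (this page does not state whether the real method ignores case). The sketch also omits the on-demand reading that explains the IOException in the real signature.

```java
/** Hypothetical buffer-based prefix check; not the library's code. */
class StartsWithSketch {
    /**
     * Returns true if the characters starting at buffer[pos] match value,
     * ignoring case. len is the number of valid characters in buffer.
     */
    static boolean startsWith(char[] buffer, int pos, int len, String value) {
        if (pos < 0 || pos + value.length() > len) {
            return false; // not enough characters left to match
        }
        for (int i = 0; i < value.length(); i++) {
            char expected = Character.toLowerCase(value.charAt(i));
            char actual = Character.toLowerCase(buffer[pos + i]);
            if (expected != actual) {
                return false;
            }
        }
        return true;
    }
}
```

Checking the remaining length first means the loop never reads past the valid part of the buffer.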
-
startsWithSimple
private boolean startsWithSimple(java.lang.String value) throws java.io.IOException
- Throws:
java.io.IOException
-
isWhitespace
private boolean isWhitespace(int position)
Checks if character at specified position is whitespace.
- Parameters:
position -
- Returns:
- true if whitespace, false otherwise.
-
isWhitespace
private boolean isWhitespace()
Checks if character at current runtime position is whitespace.
- Returns:
- true if whitespace, false otherwise.
-
isWhitespaceSafe
private boolean isWhitespaceSafe()
-
isChar
private boolean isChar(int position, char ch)
Checks if character at specified position is equal to specified char.
- Parameters:
position -
ch -
- Returns:
- true if equal, false otherwise.
-
isChar
private boolean isChar(char ch)
Checks if character at current runtime position is equal to specified char.
- Parameters:
ch -
- Returns:
- true if equal, false otherwise.
-
isCharSimple
private boolean isCharSimple(char ch)
-
getCurrentChar
private char getCurrentChar()
- Returns:
- The current character to be read; the caller must first check that it exists. This method exists for performance reasons, to be used instead of isChar(...).
-
isCharEquals
private boolean isCharEquals(char ch)
-
isIdentifierStartChar
private boolean isIdentifierStartChar(int position)
Checks if character at specified position can be identifier start.
- Parameters:
position -
- Returns:
- true if it may be an identifier start, false otherwise.
-
isIdentifierStartChar
private boolean isIdentifierStartChar()
Checks if character at current runtime position can be identifier start.
- Returns:
- true if it may be an identifier start, false otherwise.
-
isIdentifierChar
private boolean isIdentifierChar()
Checks if character at current runtime position can be identifier part.
- Returns:
- true if it may be an identifier part, false otherwise.
-
isValidXmlChar
private boolean isValidXmlChar()
-
isValidXmlCharSafe
private boolean isValidXmlCharSafe()
-
isAllRead
private boolean isAllRead()
Checks if end of the content is reached.
-
save
private void save(char ch)
Saves specified character to the temporary buffer.
- Parameters:
ch -
-
saveCurrent
private void saveCurrent()
Saves character at current runtime position to the temporary buffer.
-
saveCurrentSafe
private void saveCurrentSafe()
-
saveCurrent
private void saveCurrent(int size) throws java.io.IOException
Saves specified number of characters at current runtime position to the temporary buffer.
- Throws:
java.io.IOException
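The save methods describe appending characters to a temporary buffer, which the field list suggests is backed by a char[] (_saved) and a length counter (_savedLen). A growable buffer of that shape might look like the sketch below. Everything in it is hypothetical: the class name, the initial capacity, the doubling strategy and the takeSaved() helper are assumptions for illustration, not the library's behaviour.

```java
import java.util.Arrays;

/** Hypothetical growable save buffer; not the library's code. */
class SavedBufferSketch {
    private char[] saved = new char[4]; // small capacity so growth is exercised
    private int savedLen = 0;

    /** Appends one character, doubling the array when it is full. */
    void save(char ch) {
        if (savedLen == saved.length) {
            saved = Arrays.copyOf(saved, saved.length * 2);
        }
        saved[savedLen++] = ch;
    }

    /** Appends size characters of source, starting at pos. */
    void save(char[] source, int pos, int size) {
        for (int i = 0; i < size; i++) {
            save(source[pos + i]);
        }
    }

    /** Returns the saved text and resets the buffer (hypothetical helper). */
    String takeSaved() {
        String result = new String(saved, 0, savedLen);
        savedLen = 0;
        return result;
    }
}
```

Resetting only the length counter, rather than reallocating, lets the same array be reused for the next run of characters.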
-
skipWhitespaces
private void skipWhitespaces() throws java.io.IOException
Skips whitespaces at current position and moves forward until a non-whitespace character is found or the end of content is reached.
- Throws:
java.io.IOException
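The skipping loop described for skipWhitespaces() can be sketched over an in-memory buffer. This is an assumption-laden illustration rather than the real method: the class name and parameters are hypothetical, Character.isWhitespace stands in for whatever whitespace test the library uses, and the sketch returns the new position instead of updating internal state or reading more input.

```java
/** Hypothetical whitespace skipper; not the library's code. */
class SkipWhitespaceSketch {
    /**
     * Returns the index of the first non-whitespace character at or after
     * pos, or len if the rest of the content is whitespace.
     */
    static int skipWhitespaces(char[] buffer, int pos, int len) {
        while (pos < len && Character.isWhitespace(buffer[pos])) {
            pos++;
        }
        return pos;
    }
}
```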
-
addSavedAsContent
private boolean addSavedAsContent()
-
start
void start() throws java.io.IOException
Starts parsing HTML.
- Throws:
java.io.IOException
-
isReservedTag
private boolean isReservedTag(java.lang.String tagName)
Checks if specified tag name is one of the reserved tags: HTML, HEAD or BODY.
- Parameters:
tagName -
- Returns:
- true if the tag name is reserved, false otherwise.
-
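For isReservedTag, the check against HTML, HEAD and BODY could look like the sketch below. It is illustrative only: the class name is hypothetical, and treating the comparison as case-insensitive and a null name as not reserved are assumptions that this page does not confirm.

```java
import java.util.Locale;

/** Hypothetical reserved-tag check; not the library's code. */
class ReservedTagSketch {
    /** Returns true for "html", "head" or "body" in any letter case. */
    static boolean isReservedTag(String tagName) {
        if (tagName == null) {
            return false; // assumption: a missing name is not reserved
        }
        String name = tagName.toLowerCase(Locale.ROOT);
        return name.equals("html") || name.equals("head") || name.equals("body");
    }
}
```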
tagStart
private void tagStart() throws java.io.IOException
Parses start of the tag. It expects that current position is at the "<" after which the tag's name follows.
- Throws:
java.io.IOException
-
tagEnd
private void tagEnd() throws java.io.IOException
Parses end of the tag. It expects that current position is at the "<" after which "/" and the tag's name follow.
- Throws:
java.io.IOException
-
identifier
private java.lang.String identifier() throws java.io.IOException
Parses an identifier from the current position.
- Throws:
java.io.IOException
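Identifier parsing combines the "identifier start" and "identifier part" checks documented above. The sketch below shows one way this could work; it is not the library's code. Which characters count as start or part characters is an assumption here (letters and "_" to start; letters, digits, "_", "-", ":" and "." to continue), as are the class name and the choice to return an empty string when no identifier is found.

```java
/** Hypothetical identifier parser; not the library's code. */
class IdentifierSketch {
    /** Assumption: identifiers start with a letter or underscore. */
    static boolean isIdentifierStart(char ch) {
        return Character.isLetter(ch) || ch == '_';
    }

    /** Assumption: later characters may also be digits, '-', ':' or '.'. */
    static boolean isIdentifierPart(char ch) {
        return Character.isLetterOrDigit(ch)
                || ch == '_' || ch == '-' || ch == ':' || ch == '.';
    }

    /** Returns the identifier starting at pos, or "" if none starts there. */
    static String identifier(String content, int pos) {
        if (pos < 0 || pos >= content.length()
                || !isIdentifierStart(content.charAt(pos))) {
            return "";
        }
        int end = pos + 1;
        while (end < content.length() && isIdentifierPart(content.charAt(end))) {
            end++;
        }
        return content.substring(pos, end);
    }
}
```

Allowing ":" inside a name lets a namespaced tag such as xsl:value-of come back as a single identifier in this sketch.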
-
tagAttributes
private void tagAttributes() throws java.io.IOException
Parses the list of tag attributes from the current position.
- Throws:
java.io.IOException
-
attributeValue
private java.lang.String attributeValue() throws java.io.IOException
Parses a single tag attribute - it is expected to be in one of the forms: name=value, name="value", name='value' or name.
- Throws:
java.io.IOException
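The four attribute forms listed above (name=value, name="value", name='value' and a bare name) can be handled by a small parser like the sketch below. It is a standalone illustration, not HtmlCleaner's implementation: the class and method names are hypothetical, it works on a plain string rather than the tokenizer's buffer, and mapping a bare name to an empty value is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical attribute parser; not the library's code. */
class AttributeSketch {
    /** Parses whitespace-separated attributes into an ordered map. */
    static Map<String, String> parseAttributes(String text) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int n = text.length();
        int pos = 0;
        while (pos < n) {
            // Skip whitespace between attributes.
            while (pos < n && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= n) break;
            // Read the name up to whitespace or '='.
            int nameStart = pos;
            while (pos < n && text.charAt(pos) != '='
                    && !Character.isWhitespace(text.charAt(pos))) pos++;
            String name = text.substring(nameStart, pos);
            String value = ""; // assumption: a bare name has an empty value
            if (pos < n && text.charAt(pos) == '=') {
                pos++; // consume '='
                if (pos < n && (text.charAt(pos) == '"' || text.charAt(pos) == '\'')) {
                    // Quoted value: runs to the matching quote or end of input.
                    char quote = text.charAt(pos);
                    int close = text.indexOf(quote, pos + 1);
                    int stop = (close < 0) ? n : close;
                    value = text.substring(pos + 1, stop);
                    pos = (close < 0) ? n : close + 1;
                } else {
                    // Unquoted value: runs to the next whitespace.
                    int valueStart = pos;
                    while (pos < n && !Character.isWhitespace(text.charAt(pos))) pos++;
                    value = text.substring(valueStart, pos);
                }
            }
            if (!name.isEmpty()) {
                attrs.put(name, value);
            }
        }
        return attrs;
    }
}
```

For example, parsing `id=main class="a b" title='x' disabled` with this sketch yields four entries, with `class` mapped to `a b` and `disabled` mapped to an empty string.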
-
content
private boolean content() throws java.io.IOException
- Throws:
java.io.IOException
-
ignoreUntil
private void ignoreUntil(char ch) throws java.io.IOException
- Throws:
java.io.IOException
-
comment
private void comment() throws java.io.IOException
- Throws:
java.io.IOException
-
doctype
private void doctype() throws java.io.IOException
- Throws:
java.io.IOException
-
getDocType
public DoctypeToken getDocType()
-
-