Class HtmlTokenizer


  • public abstract class HtmlTokenizer
    extends java.lang.Object
    Main HTML tokenizer.

    Its task is to parse HTML and produce a list of valid tokens: open tag tokens, end tag tokens, content (text) tokens, and comments. As soon as a new item is added to the token list, the cleaner is invoked to clean the current list at its end.
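The token categories described above can be illustrated with a small, self-contained sketch. This is not HtmlCleaner's implementation: the class name `TokenSketch`, the method `tokenize`, and the string prefixes `OPEN:`, `END:`, `TEXT:`, and `COMMENT:` are hypothetical stand-ins for the real token classes, and the cleaning step that runs after each token is added is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified tokenizer; not HtmlCleaner's implementation.
final class TokenSketch {
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        int n = html.length();
        while (pos < n) {
            if (html.startsWith("<!--", pos)) {
                // Comment: everything up to the closing "-->" (or end of input).
                int end = html.indexOf("-->", pos + 4);
                int stop = end < 0 ? n : end;
                tokens.add("COMMENT:" + html.substring(pos + 4, stop));
                pos = end < 0 ? n : end + 3;
            } else if (html.startsWith("</", pos)) {
                // End tag: "</" followed by the tag name.
                int end = html.indexOf('>', pos);
                int stop = end < 0 ? n : end;
                tokens.add("END:" + html.substring(pos + 2, stop).trim());
                pos = end < 0 ? n : end + 1;
            } else if (html.charAt(pos) == '<' && pos + 1 < n
                    && Character.isLetter(html.charAt(pos + 1))) {
                // Open tag: "<" followed by a letter; attributes are kept unparsed.
                int end = html.indexOf('>', pos);
                int stop = end < 0 ? n : end;
                tokens.add("OPEN:" + html.substring(pos + 1, stop).trim());
                pos = end < 0 ? n : end + 1;
            } else {
                // Text content: everything up to the next "<".
                int next = html.indexOf('<', pos + 1);
                int stop = next < 0 ? n : next;
                tokens.add("TEXT:" + html.substring(pos, stop));
                pos = stop;
            }
        }
        return tokens;
    }
}
```

For the input `<p>Hi<!-- c --></p>` this sketch yields four tokens in order: an open tag, text, a comment, and an end tag.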

    • Field Detail

      • _reader

        private java.io.BufferedReader _reader
      • _working

        private char[] _working
      • _pos

        private transient int _pos
      • _len

        private transient int _len
      • _saved

        private transient char[] _saved
      • _savedLen

        private transient int _savedLen
      • _currentTagToken

        private transient TagToken _currentTagToken
      • _tokenList

        private transient java.util.List<BaseToken> _tokenList
      • _asExpected

        private boolean _asExpected
      • _isScriptContext

        private boolean _isScriptContext
      • isOmitUnknownTags

        private boolean isOmitUnknownTags
      • isTreatUnknownTagsAsContent

        private boolean isTreatUnknownTagsAsContent
      • isOmitDeprecatedTags

        private boolean isOmitDeprecatedTags
      • isTreatDeprecatedTagsAsContent

        private boolean isTreatDeprecatedTagsAsContent
      • isNamespacesAware

        private boolean isNamespacesAware
      • isOmitComments

        private boolean isOmitComments
      • isAllowMultiWordAttributes

        private boolean isAllowMultiWordAttributes
      • isAllowHtmlInsideAttributes

        private boolean isAllowHtmlInsideAttributes
      • commonStr

        private java.lang.StringBuilder commonStr
    • Constructor Detail

      • HtmlTokenizer

        public HtmlTokenizer​(java.io.Reader reader,
                             CleanerProperties props,
                             CleanerTransformations transformations,
                             ITagInfoProvider tagInfoProvider)
                      throws java.io.IOException
        Constructor - creates an instance of the parser with the specified content.
        Parameters:
        reader -
        props -
        transformations -
        tagInfoProvider -
        Throws:
        java.io.IOException
    • Method Detail

      • addToken

        private void addToken​(BaseToken token)
      • makeTree

        abstract void makeTree​(java.util.List<BaseToken> tokenList)
      • createTagNode

        abstract TagNode createTagNode​(java.lang.String name)
      • readIfNeeded

        private void readIfNeeded​(int neededChars)
                           throws java.io.IOException
        Throws:
        java.io.IOException
      • getTokenList

        java.util.List<BaseToken> getTokenList()
      • go

        private void go()
                 throws java.io.IOException
        Throws:
        java.io.IOException
      • go

        private void go​(int step)
                 throws java.io.IOException
        Throws:
        java.io.IOException
      • startsWith

        private boolean startsWith​(java.lang.String value)
                            throws java.io.IOException
        Checks if content starts with specified value at the current position.
        Parameters:
        value -
        Returns:
        true if starts with specified value, false otherwise.
        Throws:
        java.io.IOException
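A prefix check at the current position can be sketched as follows. The buffer, position, and length parameters mirror the `_working`, `_pos`, and `_len` fields only by analogy; the class name `PrefixSketch` is hypothetical. The documentation above does not say whether the real comparison is case-sensitive, so this sketch assumes an exact match, and it does not refill the buffer from the reader.

```java
// Hypothetical sketch of a prefix check against a character buffer.
final class PrefixSketch {
    // working: buffered characters; pos: current position; len: number of valid chars.
    static boolean startsWith(char[] working, int pos, int len, String value) {
        int n = value.length();
        if (pos + n > len) {
            return false; // not enough buffered characters left to match
        }
        for (int i = 0; i < n; i++) {
            if (working[pos + i] != value.charAt(i)) {
                return false; // assumption: exact, case-sensitive comparison
            }
        }
        return true;
    }
}
```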
      • startsWithSimple

        private boolean startsWithSimple​(java.lang.String value)
                                  throws java.io.IOException
        Throws:
        java.io.IOException
      • isWhitespace

        private boolean isWhitespace​(int position)
        Checks if character at specified position is whitespace.
        Parameters:
        position -
        Returns:
        true if whitespace, false otherwise.
      • isWhitespace

        private boolean isWhitespace()
        Checks if character at current runtime position is whitespace.
        Returns:
        true if whitespace, false otherwise.
      • isWhitespaceSafe

        private boolean isWhitespaceSafe()
      • isChar

        private boolean isChar​(int position,
                               char ch)
        Checks if character at specified position is equal to specified char.
        Parameters:
        position -
        ch -
        Returns:
        true if equal, false otherwise.
      • isChar

        private boolean isChar​(char ch)
        Checks if character at current runtime position is equal to specified char.
        Parameters:
        ch -
        Returns:
        true if equal, false otherwise.
      • isCharSimple

        private boolean isCharSimple​(char ch)
      • getCurrentChar

        private char getCurrentChar()
        Returns:
        The current character to be read. The caller must first check that it exists. This method is provided for performance reasons, to be used instead of isChar(...).
      • isCharEquals

        private boolean isCharEquals​(char ch)
      • isIdentifierStartChar

        private boolean isIdentifierStartChar​(int position)
        Checks if character at specified position can be identifier start.
        Parameters:
        position -
        Returns:
        true if the character may be an identifier start, false otherwise.
      • isIdentifierStartChar

        private boolean isIdentifierStartChar()
        Checks if character at current runtime position can be identifier start.
        Returns:
        true if the character may be an identifier start, false otherwise.
      • isIdentifierChar

        private boolean isIdentifierChar()
        Checks if character at current runtime position can be identifier part.
        Returns:
        true if the character may be an identifier part, false otherwise.
      • isValidXmlChar

        private boolean isValidXmlChar()
      • isValidXmlCharSafe

        private boolean isValidXmlCharSafe()
      • isAllRead

        private boolean isAllRead()
        Checks if end of the content is reached.
      • save

        private void save​(char ch)
        Saves specified character to the temporary buffer.
        Parameters:
        ch -
      • saveCurrent

        private void saveCurrent()
        Saves character at current runtime position to the temporary buffer.
      • saveCurrentSafe

        private void saveCurrentSafe()
      • saveCurrent

        private void saveCurrent​(int size)
                          throws java.io.IOException
        Saves specified number of characters at current runtime position to the temporary buffer.
        Throws:
        java.io.IOException
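The save methods above append characters to a temporary buffer. A minimal sketch of such a buffer is shown here, loosely modelled on the `_saved` and `_savedLen` fields. The class name `SavedBufferSketch`, the initial capacity, the doubling growth policy, and the `flush` method are assumptions made for illustration and are not taken from the documented class.

```java
import java.util.Arrays;

// Hypothetical growable character buffer; not HtmlCleaner's implementation.
final class SavedBufferSketch {
    private char[] saved = new char[4]; // assumed initial capacity
    private int savedLen = 0;

    // Appends one character, doubling the array when it is full.
    void save(char ch) {
        if (savedLen >= saved.length) {
            saved = Arrays.copyOf(saved, saved.length * 2);
        }
        saved[savedLen++] = ch;
    }

    // Returns the saved characters as a string and empties the buffer.
    String flush() {
        String result = new String(saved, 0, savedLen);
        savedLen = 0;
        return result;
    }
}
```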
      • skipWhitespaces

        private void skipWhitespaces()
                              throws java.io.IOException
        Skips whitespace at the current position and moves forward until a non-whitespace character is found or the end of the content is reached.
        Throws:
        java.io.IOException
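The skipping behaviour described above can be sketched over a plain character array. The class name `WhitespaceSketch` and the parameter names are hypothetical, and the use of `Character.isWhitespace` as the whitespace test is an assumption; the real method also reads more input when needed, which is left out here.

```java
// Hypothetical sketch of whitespace skipping over a character buffer.
final class WhitespaceSketch {
    // Returns the index of the first non-whitespace character at or after pos,
    // or len if only whitespace remains.
    static int skipWhitespaces(char[] working, int pos, int len) {
        while (pos < len && Character.isWhitespace(working[pos])) {
            pos++;
        }
        return pos;
    }
}
```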
      • addSavedAsContent

        private boolean addSavedAsContent()
      • start

        void start()
            throws java.io.IOException
        Starts parsing HTML.
        Throws:
        java.io.IOException
      • isReservedTag

        private boolean isReservedTag​(java.lang.String tagName)
        Checks if the specified tag name is one of the reserved tags: HTML, HEAD or BODY.
        Parameters:
        tagName -
        Returns:
        true if the tag name is reserved, false otherwise.
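A reserved-tag check of this kind could look like the sketch below. The class name `ReservedTagSketch` is hypothetical, and treating the comparison as case-insensitive and a null name as not reserved are assumptions, since the documentation above does not specify either.

```java
import java.util.Locale;

// Hypothetical sketch of a reserved-tag check for HTML, HEAD and BODY.
final class ReservedTagSketch {
    static boolean isReservedTag(String tagName) {
        if (tagName == null) {
            return false; // assumption: null is never reserved
        }
        // Assumption: tag names are compared case-insensitively.
        String lower = tagName.toLowerCase(Locale.ROOT);
        return lower.equals("html") || lower.equals("head") || lower.equals("body");
    }
}
```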
      • tagStart

        private void tagStart()
                       throws java.io.IOException
        Parses the start of a tag. It expects the current position to be at the "<", after which the tag's name follows.
        Throws:
        java.io.IOException
      • tagEnd

        private void tagEnd()
                     throws java.io.IOException
        Parses the end of a tag. It expects the current position to be at the "<", after which "/" and the tag's name follow.
        Throws:
        java.io.IOException
      • identifier

        private java.lang.String identifier()
                                     throws java.io.IOException
        Parses an identifier from the current position.
        Throws:
        java.io.IOException
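Identifier parsing builds on the identifier-start and identifier-part checks documented earlier. The sketch below shows one way the pieces could fit together. The class name `IdentifierSketch` and the exact character sets are assumptions: a letter or underscore starts an identifier, and letters, digits, `_`, `-`, `.`, and `:` may continue it. The real class may accept a different set.

```java
// Hypothetical sketch of identifier parsing; the character rules are assumed.
final class IdentifierSketch {
    static boolean isIdentifierStart(char c) {
        return Character.isLetter(c) || c == '_';
    }

    static boolean isIdentifierPart(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.' || c == ':';
    }

    // Returns the identifier starting at pos, or null if none starts there.
    static String identifier(String s, int pos) {
        if (pos >= s.length() || !isIdentifierStart(s.charAt(pos))) {
            return null;
        }
        int end = pos + 1;
        while (end < s.length() && isIdentifierPart(s.charAt(end))) {
            end++;
        }
        return s.substring(pos, end);
    }
}
```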
      • tagAttributes

        private void tagAttributes()
                            throws java.io.IOException
        Parses the list of tag attributes from the current position.
        Throws:
        java.io.IOException
      • attributeValue

        private java.lang.String attributeValue()
                                         throws java.io.IOException
        Parses a single tag attribute. It is expected to be in one of the forms: name=value, name="value", name='value', or name.
        Throws:
        java.io.IOException
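The four attribute forms can be handled by a small parser such as the sketch below. The class name `AttributeSketch` and the method `parseAttribute` are hypothetical, the input is assumed to be a single attribute already isolated as a string, and a bare name is assumed to map to an empty value. How the real tokenizer treats a valueless attribute is not stated in the documentation above.

```java
// Hypothetical sketch that splits one attribute into a {name, value} pair.
final class AttributeSketch {
    static String[] parseAttribute(String attr) {
        int eq = attr.indexOf('=');
        if (eq < 0) {
            // Form: name (assumption: the value is treated as empty)
            return new String[] { attr.trim(), "" };
        }
        String name = attr.substring(0, eq).trim();
        String value = attr.substring(eq + 1).trim();
        // Forms: name="value" and name='value' -> strip the matching quotes.
        if (value.length() >= 2) {
            char quote = value.charAt(0);
            boolean quoted = (quote == '"' || quote == '\'')
                    && value.charAt(value.length() - 1) == quote;
            if (quoted) {
                value = value.substring(1, value.length() - 1);
            }
        }
        // Form: name=value is returned unchanged.
        return new String[] { name, value };
    }
}
```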
      • content

        private boolean content()
                         throws java.io.IOException
        Throws:
        java.io.IOException
      • ignoreUntil

        private void ignoreUntil​(char ch)
                          throws java.io.IOException
        Throws:
        java.io.IOException
      • comment

        private void comment()
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • doctype

        private void doctype()
                      throws java.io.IOException
        Throws:
        java.io.IOException