Class PDFText2HTML


public class PDFText2HTML extends PDFTextStripper
Wraps stripped text in simple HTML and tries to form HTML paragraphs. Paragraphs broken by pages, columns, or figures are not mended.
  • Constructor Details

    • PDFText2HTML

      public PDFText2HTML() throws IOException
      Creates a new PDFText2HTML instance.
      Throws:
      IOException - If there is an error during initialization.
  • Method Details

    • writeHeader

      @Deprecated protected void writeHeader() throws IOException
      Deprecated.
      Write the header to the output document. Now also writes the tag defining the character encoding.
      Throws:
      IOException - If there is a problem writing out the header to the document.
    • startDocument

      protected void startDocument(PDDocument document) throws IOException
      Description copied from class: PDFTextStripper
      This method is available for subclasses of this class. It will be called before processing of the document starts.
      Overrides:
      startDocument in class PDFTextStripper
      Parameters:
      document - The PDF document that is being processed.
      Throws:
      IOException - If an IO error occurs.
    • endDocument

      public void endDocument(PDDocument document) throws IOException
      This method is available for subclasses of this class. It will be called after processing of the document finishes.
      Overrides:
      endDocument in class PDFTextStripper
      Parameters:
      document - The PDF document that is being processed.
      Throws:
      IOException - If an IO error occurs.
    • getTitle

      protected String getTitle()
      This method will attempt to guess the title of the document using either the document properties or the first lines of text.
      Returns:
      The guessed title.
    • startArticle

      protected void startArticle(boolean isLTR) throws IOException
      Write out the article separator (div tag) with proper text direction information.
      Overrides:
      startArticle in class PDFTextStripper
      Parameters:
      isLTR - true if direction of text is left to right
      Throws:
      IOException - If there is an error writing to the stream.
    • endArticle

      protected void endArticle() throws IOException
      Write out the article separator.
      Overrides:
      endArticle in class PDFTextStripper
      Throws:
      IOException - If there is an error writing to the stream.
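      The article separators described above can be illustrated with a small standalone sketch. It does not use PDFBox; the class ArticleMarkupSketch, its render helper, and the exact dir attribute value are assumptions made for illustration, not the library's confirmed output.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper, not part of PDFBox: mimics the kind of markup
// that startArticle and endArticle are documented to write.
final class ArticleMarkupSketch {

    // Opens an article block. The dir="RTL" attribute is an assumed way
    // of carrying the text direction for right-to-left content.
    static void startArticle(Writer out, boolean isLTR) throws IOException {
        out.write(isLTR ? "<div>" : "<div dir=\"RTL\">");
    }

    // Closes the article block opened by startArticle.
    static void endArticle(Writer out) throws IOException {
        out.write("</div>");
    }

    // Convenience wrapper that renders one article into a String.
    static String render(String body, boolean isLTR) {
        StringWriter out = new StringWriter();
        try {
            startArticle(out, isLTR);
            out.write(body);
            endArticle(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

      In this sketch a left-to-right article gets a plain div, while a right-to-left article carries the direction on the opening tag, so the closing tag is the same in both cases.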
    • writeString

      protected void writeString(String text, List<TextPosition> textPositions) throws IOException
      Write a string to the output stream, maintain font state, and escape some HTML characters. The font state is only preserved per word.
      Overrides:
      writeString in class PDFTextStripper
      Parameters:
      text - The text to write to the stream.
      textPositions - the corresponding text positions
      Throws:
      IOException - If there is an error writing to the stream.
    • writeString

      protected void writeString(String chars) throws IOException
      Write a string to the output stream and escape some HTML characters.
      Overrides:
      writeString in class PDFTextStripper
      Parameters:
      chars - String to be written to the stream
      Throws:
      IOException - If there is an error writing to the stream.
    • writeParagraphEnd

      protected void writeParagraphEnd() throws IOException
      Writes the paragraph end tag "</p>" to the output and clears the font state.
      Overrides:
      writeParagraphEnd in class PDFTextStripper
      Throws:
      IOException - If there is an error writing to the stream.
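      One plausible reading of "clear the font state" is that any font tags still open are closed before the paragraph end is written, so the HTML stays well nested. The sketch below shows that idea only. FontStateSketch and its methods are hypothetical names, and the assumption that the state is a stack of open tags is not taken from this page.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a font state: a stack of currently open tags.
final class FontStateSketch {
    private final Deque<String> openTags = new ArrayDeque<>();

    // Records a tag as open and returns its opening markup.
    String open(String tag) {
        openTags.push(tag);
        return "<" + tag + ">";
    }

    // Closes every open tag, innermost first, and empties the state.
    String clear() {
        StringBuilder closing = new StringBuilder();
        while (!openTags.isEmpty()) {
            closing.append("</").append(openTags.pop()).append('>');
        }
        return closing.toString();
    }

    // Assumed paragraph end: close open font tags, then the paragraph.
    String endParagraph() {
        return clear() + "</p>";
    }
}
```

      Closing in reverse order of opening is what keeps the tags balanced; after the call the state is empty, so the next paragraph starts without inherited formatting.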
    • escape

      private static String escape(String chars)
      Escape some HTML characters.
      Parameters:
      chars - String to be escaped
      Returns:
      The escaped string.
    • appendEscaped

      private static void appendEscaped(StringBuilder builder, char character)
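      The page does not say which characters escape and appendEscaped handle. A minimal sketch of one common approach follows, assuming the four markup-significant ASCII characters become named entities and anything outside printable ASCII becomes a numeric character reference. HtmlEscapeSketch is a hypothetical class and may differ from the library's actual behavior.

```java
// Hypothetical stand-in for the private escape helpers; the chosen
// character set is an assumption, not the documented behavior.
final class HtmlEscapeSketch {

    // Escapes a whole string one char at a time.
    static String escape(String chars) {
        StringBuilder builder = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            appendEscaped(builder, chars.charAt(i));
        }
        return builder.toString();
    }

    // Appends one char, escaped if needed.
    static void appendEscaped(StringBuilder builder, char character) {
        if (character < 32 || character > 126) {
            // Control and non-ASCII chars: numeric character reference.
            // Surrogate pairs are emitted as two separate references here.
            builder.append("&#").append((int) character).append(';');
            return;
        }
        switch (character) {
            case '"':
                builder.append("&quot;");
                break;
            case '&':
                builder.append("&amp;");
                break;
            case '<':
                builder.append("&lt;");
                break;
            case '>':
                builder.append("&gt;");
                break;
            default:
                builder.append(character);
        }
    }
}
```

      Splitting the work into a per-string method and a per-character method mirrors the two signatures listed above and lets callers that already hold a StringBuilder avoid intermediate strings.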