Sharkysoft home

lava.text.html
Class HtmlParser

java.lang.Object
  |
  +--lava.text.html.HtmlParser

public class HtmlParser
extends java.lang.Object

Parses HTML source.

Details: This class parses HTML source by separating the source components into tags, text, and comments. HtmlParser reads text from a PushbackReader and returns a stream of objects representing parsed entities. Each of the objects is an instance of HtmlComponent, which has many subclasses (refer to the see-also section).

The following sample program shows how HtmlParser parses and tokenizes HTML source. It prints the class name and source text of each parsed component in two columns. Try it on your favorite URL.

import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import lava.io.UnlimitedPushbackReader;
import lava.io.UrlInputStream;
import lava.string.StringEncoder;
import lava.text.ParallelColumnsWriter;
import lava.text.html.HtmlComponent;
import lava.text.html.HtmlParser;



class parseHtml
{



    public static void main (String[] args) throws Exception
    {
        if (args . length != 1)
        {
            System.out . println ("usage: parseHtml <url>");
            return;
        }
        HtmlParser hp = new HtmlParser
        (
            new UnlimitedPushbackReader
            (
                new InputStreamReader
                (
                    new UrlInputStream (args [0])
                )
            )
        );
        ParallelColumnsWriter pcw = new ParallelColumnsWriter
        (
            new OutputStreamWriter (System.out),
            new int[] {15, 60}
        );
        while (true)
        {
            HtmlComponent c = hp . parse ();
            if (c == null)
                break;
            String clname = c . getClass () . getName ();
            clname = clname . substring (clname . lastIndexOf ('.') + 1);
            pcw . writeln
            (
                new String[]
                {
                    clname,
                    StringEncoder.encodeAsciiJavaString (c . getSource ())
                }
            );
        }
        pcw . close ();
        hp . close ();
    }



}

Changes:

2002.04.18
Values in name=value pairs may be enclosed in single quotes.
2000.12.21
Added peek ().

Version:
2002.04.18
See Also:
HtmlComponent, HtmlText, HtmlRegularTag, HtmlOpenTag, HtmlCloseTag, HtmlSpecialTag, HtmlComment, HtmlError

Constructor Summary
HtmlParser(java.io.PushbackReader in)
          Sets HTML source.
 
Method Summary
 void close()
          Closes source input stream.
static boolean isCloseTag(HtmlComponent c, java.lang.String type)
           
static boolean isOpenTag(HtmlComponent c, java.lang.String type)
           
 HtmlComponent parse()
          Parses one HTML element.
 HtmlComponent peek()
          Peeks at next component without consuming.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlParser

public HtmlParser(java.io.PushbackReader in)
Sets HTML source.

Details: This constructor sets the PushbackReader from which this HtmlParser reads.

Parameters:
in - the PushbackReader from which HTML source is read
Method Detail

parse

public HtmlComponent parse()
                    throws java.io.IOException
Parses one HTML element.

Details: This method parses one element from the HTML source stream and returns it. Use the instanceof operator to determine the type of element that was parsed. parse returns null if no more elements can be parsed.

Returns:
the parsed element
Throws:
java.io.IOException - if the source stream cannot be read
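
The instanceof test described for parse can be sketched as follows. This is an illustration only. To keep it compilable without the lava library, it declares minimal stand-in classes that borrow the names HtmlComponent, HtmlText, HtmlOpenTag, HtmlCloseTag, and HtmlComment from the see-also list. The flat hierarchy, the constructors, and the ComponentKinds helper are assumptions, not part of the documented API.

```java
// Stand-in types so this sketch compiles without the lava library.
// Only the class names and getSource() come from this page; the flat
// hierarchy and the constructors are assumptions made for illustration.
abstract class HtmlComponent {
    private final String source;
    HtmlComponent(String source) { this.source = source; }
    String getSource() { return source; }
}
class HtmlText extends HtmlComponent { HtmlText(String s) { super(s); } }
class HtmlOpenTag extends HtmlComponent { HtmlOpenTag(String s) { super(s); } }
class HtmlCloseTag extends HtmlComponent { HtmlCloseTag(String s) { super(s); } }
class HtmlComment extends HtmlComponent { HtmlComment(String s) { super(s); } }

// Hypothetical helper, not part of lava.text.html.
class ComponentKinds {
    // Maps a parsed component to a short label by testing its runtime type.
    static String describe(HtmlComponent c) {
        if (c instanceof HtmlComment) return "comment";
        if (c instanceof HtmlOpenTag) return "open tag";
        if (c instanceof HtmlCloseTag) return "close tag";
        if (c instanceof HtmlText) return "text";
        return "other"; // any component type this sketch does not model
    }
}
```

With the real library, the same kind of test would be applied to each non-null object returned by parse, in a loop like the one in the sample program.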

peek

public HtmlComponent peek()
                   throws java.io.IOException
Peeks at next component without consuming.

Details: This method determines the next component without consuming it. The object returned by this method is the same physical object that will be returned by parse the next time it is called.

Returns:
the next component
Throws:
java.io.IOException - if an I/O error occurs
Since:
2000.12.21
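
The contract between peek and parse (peek returns the same object that the next call to parse will return) matches a one-element look-ahead buffer. The sketch below shows that pattern in isolation. LookAhead is a hypothetical class written for this illustration, and nothing here is taken from HtmlParser's actual implementation. Its next method plays the role of parse, and the wrapped Supplier is assumed to return null once the input is exhausted.

```java
import java.util.function.Supplier;

// Hypothetical one-element look-ahead wrapper, not part of the lava library.
class LookAhead<T> {
    private final Supplier<T> source; // assumed to return null when exhausted
    private T buffered;               // item fetched by peek() but not yet consumed
    private boolean hasBuffered;      // distinguishes "nothing buffered" from a buffered null

    LookAhead(Supplier<T> source) { this.source = source; }

    // Returns the next item without consuming it.
    T peek() {
        if (!hasBuffered) {
            buffered = source.get();
            hasBuffered = true;
        }
        return buffered;
    }

    // Consumes and returns the next item: the very object peek() returned, if peek() was called.
    T next() {
        T result = peek();
        buffered = null;
        hasBuffered = false;
        return result;
    }
}
```

Because the buffered object is handed out unchanged, a caller can compare the results of peek and the following next with ==, which is the "same physical object" guarantee described above.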

close

public void close()
           throws java.io.IOException
Closes source input stream.

Details: This method closes the HTML source input stream. Of course, no more HTML tokens can be parsed after this method is called.

Throws:
java.io.IOException - if an I/O error occurs

isOpenTag

public static boolean isOpenTag(HtmlComponent c,
                                java.lang.String type)

isCloseTag

public static boolean isCloseTag(HtmlComponent c,
                                 java.lang.String type)
