Sharkysoft home

Package lava.text.html

HTML parsing.


Class Summary
HtmlCloseTag Closing HTML tag.
HtmlComment HTML comment.
HtmlComponent Parsed HTML element.
HtmlEntities Translate HTML entities.
HtmlError Malformed HTML source.
HtmlOpenTag Opening HTML tag.
HtmlParser Parses HTML source.
HtmlRegularTag HTML tag.
HtmlSpecialTag Unusual HTML tag.
HtmlText Uninterpreted HTML text.

Package lava.text.html Description

HTML parsing.

Details: This package is useful for parsing simple HTML documents. It allows you to turn an HTML text stream into a stream of Java objects that represent the parsed contents of the stream. The classes of objects that can occur in the object stream are listed in the Class Summary above.

All objects are instances of HtmlComponent, but since HtmlComponent is an abstract class, each object in the stream also belongs to one of its concrete subclasses.
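The relationship between the abstract base class and its subclasses can be illustrated with a minimal, self-contained sketch. The classes Component, OpenTag, and Text below are hypothetical stand-ins invented for this illustration; they are not the lava.text.html classes, and their fields are assumptions. The sketch only shows the general pattern of walking a stream of base-class objects and using instanceof to pick out one subclass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT the lava.text.html classes.
abstract class Component { }

final class OpenTag extends Component {
    final String type;
    OpenTag(String type) { this.type = type; }
}

final class Text extends Component {
    final String text;
    Text(String text) { this.text = text; }
}

class StreamDispatch {
    // Collects the type of every open tag and skips all other components.
    static List<String> openTagTypes(List<Component> stream) {
        List<String> types = new ArrayList<>();
        for (Component comp : stream) {
            if (comp instanceof OpenTag) {
                types.add(((OpenTag) comp).type);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        List<Component> stream = Arrays.<Component>asList(
            new OpenTag("P"), new Text("hello"), new OpenTag("IMG"));
        System.out.println(openTagTypes(stream));
    }
}
```

With the real package, the same instanceof test would presumably be applied to each HtmlComponent returned by the parser.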

HtmlParser is responsible for converting the text stream into the stream of objects. This package is ideal for applications that must extract information from machine-generated web pages.

Here is a simple example that reads a web page and prints the (relative) URLs of all of the images displayed in it.*

    // The imports for UrlInputStream, AsciiInputStreamReader, and
    // UnlimitedPushbackReader are elided in this listing.
    import lava.text.html.HtmlComponent;
    import lava.text.html.HtmlOpenTag;
    import lava.text.html.HtmlParser;

    class listimages
    {
        public static void main (String[] args) throws Exception
        {
            if (args.length != 1)
            {
                System.out.println ("usage: listimages <url>");
                return;
            }
            HtmlParser parser = null;
            try
            {
                // Wrap the URL's byte stream in the readers that the parser consumes.
                parser = new HtmlParser (
                    new UnlimitedPushbackReader (
                        new AsciiInputStreamReader (
                            new UrlInputStream (args [0]))));
                while (true)
                {
                    // parse () returns null once the stream is exhausted.
                    HtmlComponent comp = parser.parse ();
                    if (comp == null)
                        break;
                    // Only opening <IMG> tags are of interest here.
                    if ((comp instanceof HtmlOpenTag)
                        && ((HtmlOpenTag) comp).getType ().equals ("IMG"))
                    {
                        String location = ((HtmlOpenTag) comp).getAttribute ("SRC");
                        if (location != null)
                            System.out.println (location);
                    }
                }
            }
            finally
            {
                if (parser != null)
                    parser.close ();
            }
        }
    }


*Actually, this claim is not entirely true. The example only lists images displayed using the <IMG> tag. It does not include the background image (which comes from the <BODY> tag) or images brought in by other means. We wanted to keep the example simple.
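One way to cover the background image as well is to look up which attribute carries the image URL for each tag type. The helper below is a minimal sketch, not part of this package: the class name ImageAttributes and the method imageAttributeFor are hypothetical, and it assumes the background image is given by the BACKGROUND attribute of <BODY>, as in classic HTML. It still ignores images brought in by other means.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of lava.text.html.
class ImageAttributes {
    // Returns the name of the attribute that holds an image URL for the
    // given tag type, or null if this sketch does not know the tag.
    static String imageAttributeFor(String tagType) {
        if (tagType == null) {
            return null;
        }
        switch (tagType.toUpperCase(Locale.ROOT)) {
            case "IMG":
                return "SRC";        // inline image
            case "BODY":
                return "BACKGROUND"; // page background image (assumed attribute)
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        for (String type : new String[] {"IMG", "body", "P"}) {
            System.out.println(type + " -> " + imageAttributeFor(type));
        }
    }
}
```

In the example's loop, the result could then be passed to getAttribute on the HtmlOpenTag whenever it is not null, in place of the hard-coded "IMG" and "SRC" strings.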
