Instructions
Assignment 5 (due October 29, 2008)
There are two main models for parsing XML documents: DOM and SAX. In this assignment you are asked to write a program that uses the DOM to parse, access, and navigate through the nodes of the input XML file. Your program should provide the following functions:
1. Provide statistical information about the input XML document such as:
* the total number of nodes of the input XML document
(count only nodes of type Element, Attribute, Text, CDATA, Comment, and ProcessingInstruction)
* the number of nodes of each node type (Element / Attribute / Text) in the input XML document
* the maximal height of the XML tree (longest path from root to a leaf; only considering element and text nodes)
* the maximal length of any sibling list (maximal number of children of any node, only considering element and text nodes)
* the number of distinct element names
* the number of distinct attribute names
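The statistics listed above can all be gathered in a single recursive walk over the DOM tree. The following is only a sketch under stated assumptions, not a reference solution: the class name `DomStats` and its fields are hypothetical, height is counted in nodes (a lone root element has height 1), CDATA, comment, and processing-instruction nodes are counted in the total only, and the parser is obtained through the standard JAXP `DocumentBuilderFactory`, which is expected to pick up Xerces when its jar is on the classpath.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; names are illustrative only.
class DomStats {
    int total, elements, attributes, texts, maxSiblings, height;
    final Set<String> elementNames = new HashSet<>();
    final Set<String> attributeNames = new HashSet<>();

    // A text node is "empty" if it holds nothing but blanks, '\n' and '\t'.
    static boolean isEmptyText(Node n) {
        return n.getNodeType() == Node.TEXT_NODE && n.getNodeValue().trim().isEmpty();
    }

    // Counts n and its subtree; returns the subtree height in nodes,
    // or 0 if n is neither an element nor a non-empty text node.
    int visit(Node n) {
        if (isEmptyText(n)) return 0;                 // ignored everywhere
        short type = n.getNodeType();
        boolean structural = false;                   // element or text node?
        if (type == Node.ELEMENT_NODE) {
            total++; elements++; structural = true;
            elementNames.add(n.getNodeName());
            NamedNodeMap attrs = n.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                total++; attributes++;
                attributeNames.add(attrs.item(i).getNodeName());
            }
        } else if (type == Node.TEXT_NODE) {
            total++; texts++; structural = true;
        } else if (type == Node.CDATA_SECTION_NODE || type == Node.COMMENT_NODE
                || type == Node.PROCESSING_INSTRUCTION_NODE) {
            total++;                                  // counted in the total only
        }
        int siblings = 0, deepest = 0;
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            int h = visit(c);
            if (h > 0) { siblings++; deepest = Math.max(deepest, h); }
        }
        maxSiblings = Math.max(maxSiblings, siblings);
        if (type == Node.DOCUMENT_NODE) return deepest; // document node adds no level
        return structural ? deepest + 1 : 0;
    }

    // Parses an XML string and collects the statistics.
    static DomStats of(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            DomStats s = new DomStats();
            s.height = s.visit(doc);
            return s;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DomStats s = DomStats.of("<a x=\"1\"><b>hi</b> <b/></a>");
        System.out.println("total=" + s.total + " height=" + s.height
                + " maxSiblings=" + s.maxSiblings);
    }
}
```

For the real assignment the document would be parsed from the file named on the command line rather than from a string; the traversal itself stays the same.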
2. Serialize the DOM tree back to an XML document using the following three user options:
* no linebreaks/whitespace at all between the tree nodes (option "p1")
* separate line for each item (option "p2")
* pretty print with indentation for (nested) tree nodes (option "p3"); this means that after N start-element tags (and before the corresponding end-element tag) each line should start with N indentation characters ("\t"), and after each start/end-element tag or (non-empty, non-attribute) node there is a return ("\n").
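The three serialization options differ only in what is written before and after each item, so one recursive printer can cover all of them. The sketch below is one possible reading of the options, not a definitive implementation: the class name `DomPrinter` and its methods are hypothetical, an end tag is assumed to be indented at the same level as its start tag, empty elements are written as a start/end tag pair, and text nodes are trimmed with empty ones skipped.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; names are illustrative only.
class DomPrinter {
    private final String mode;                        // "p1", "p2" or "p3"
    private final StringBuilder out = new StringBuilder();

    DomPrinter(String mode) { this.mode = mode; }

    // Writes one item: tabs only for p3, a newline for p2 and p3.
    private void item(String s, int depth) {
        if (mode.equals("p3")) for (int i = 0; i < depth; i++) out.append('\t');
        out.append(s);
        if (!mode.equals("p1")) out.append('\n');
    }

    // Re-escapes characters that the parser has already unescaped.
    private static String escape(String s, boolean inAttribute) {
        s = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return inAttribute ? s.replace("\"", "&quot;") : s;
    }

    void print(Node n, int depth) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE:
                for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) print(c, depth);
                break;
            case Node.ELEMENT_NODE: {
                StringBuilder tag = new StringBuilder("<").append(n.getNodeName());
                NamedNodeMap attrs = n.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node a = attrs.item(i);
                    tag.append(' ').append(a.getNodeName()).append("=\"")
                       .append(escape(a.getNodeValue(), true)).append('"');
                }
                item(tag.append('>').toString(), depth);
                for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) print(c, depth + 1);
                item("</" + n.getNodeName() + ">", depth);
                break;
            }
            case Node.TEXT_NODE: {
                String t = n.getNodeValue().trim();   // drop leading/trailing whitespace
                if (!t.isEmpty()) item(escape(t, false), depth);
                break;
            }
            case Node.CDATA_SECTION_NODE:
                item("<![CDATA[" + n.getNodeValue() + "]]>", depth);
                break;
            case Node.COMMENT_NODE:
                item("<!--" + n.getNodeValue() + "-->", depth);
                break;
            case Node.PROCESSING_INSTRUCTION_NODE:
                item("<?" + n.getNodeName() + " " + n.getNodeValue() + "?>", depth);
                break;
            default:
                break;                                // e.g. DOCTYPE: not handled in this sketch
        }
    }

    // Parses an XML string and serializes it in the given mode.
    static String serialize(String xml, String mode) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            DomPrinter p = new DomPrinter(mode);
            p.print(doc, 0);
            return p.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(serialize("<a x=\"1\">\n  <b> hi </b>\n</a>", "p3"));
    }
}
```

Keeping the mode check inside `item` means the tree walk does not need to know which option was chosen.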
Note: For s, p1, p2, p3: Remove/ignore all empty text nodes. A text node is empty if it contains no character, or only blank characters and/or '\n' and '\t' characters.
For p1, p2, p3: Remove all leading and trailing whitespace in all text nodes.
For s, when computing the total number of nodes, do not count the extra document node provided by DOM (only count element, processing instruction, comment, attribute, text, and CDATA nodes).
For p2, an "item" means an opening/closing tag, or some text.
For p3, indent using "\t" (TAB).
For p1, p2, p3: It is OK to always print as first line "
Your program should take a command line option which is one of [s|p1|p2|p3] (with "s" being the default) and as argument the file name of an XML document.
Write your program in Java and use the corresponding Xerces DOM parser libraries. Run your program on the following sample XML file and submit the results together with the source code.
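A minimal sketch of the command-line handling and DOM loading might look as follows. It rests on assumptions: the class name `RunDOMParserArgs` is hypothetical, the option and the file name are accepted in either order, "s" is used when no option is given, and the parser is created through the standard JAXP `DocumentBuilderFactory` rather than by naming a Xerces class directly.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class; names are illustrative only.
class RunDOMParserArgs {
    static final List<String> MODES = Arrays.asList("s", "p1", "p2", "p3");
    final String mode;
    final String fileName;

    RunDOMParserArgs(String[] args) {
        String m = "s";                               // default option
        String f = null;
        for (String a : args) {
            if (MODES.contains(a)) m = a; else f = a; // anything else is the file name
        }
        if (f == null) {
            throw new IllegalArgumentException("usage: java RunDOMParser [s|p1|p2|p3] file.xml");
        }
        mode = m;
        fileName = f;
    }

    public static void main(String[] args) throws Exception {
        RunDOMParserArgs parsed = new RunDOMParserArgs(args);
        // Build the DOM tree for the input file.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new File(parsed.fileName));
        System.out.println(parsed.mode + ": root element is <"
                + doc.getDocumentElement().getNodeName() + ">");
    }
}
```

A file literally named `s`, `p1`, `p2`, or `p3` would be mistaken for an option in this sketch; a fixed argument order avoids that.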
NOTE: The program should take 2 parameters by default, e.g.
> java RunDOMParser arizona.xml p1
Answers
- Emailed to prof
