src/reportlab/lib/rparsexml.py
author rgbecker
Wed, 03 Sep 2008 16:10:51 +0000
changeset 2964 32352db0d71e
parent 2945 reportlab/lib/rparsexml.py@a6fdc0a2035b
child 3028 082f5208644e
permissions -rw-r--r--
reportlab-2.2: second stage of major re-org
"""Radically simple xml parsing

Example parse

<this type="xml">text <b>in</b> xml</this>

( "this",
  {"type": "xml"},
  [ "text ",
    ("b", None, ["in"], None),
    " xml"
    ],
   None )

{ 0: "this",
  "type": "xml",
  1: ["text ",
      {0: "b", 1:["in"]},
      " xml"]
}

I.e., an xml tag translates to a tuple:
 (name, dictofattributes, contentlist, miscellaneousinfo)

where miscellaneousinfo can be anything, (but defaults to None)
(with the intention of adding, eg, line number information)

special cases: name of "" means "top level, no containing tag".
Top level parse always looks like this

   ("", None, list, None)

 contained text of None means <simple_tag/>

In order to support stuff like

   <this></this><one></one>

AT THE MOMENT &amp; ETCETERA ARE IGNORED. THEY MUST BE PROCESSED
IN A POST-PROCESSING STEP.

PROLOGUES ARE NOT UNDERSTOOD.  OTHER STUFF IS PROBABLY MISSING.
"""

RequirePyRXP = 0        # set this to 1 to disable the nonvalidating fallback parser.

import string
try:
    #raise ImportError, "dummy error"
    simpleparse = 0
    import pyRXPU
    def warnCB(s):
        print s
    pyRXP_parser = pyRXPU.Parser(
                        ErrorOnValidityErrors=1,
                        NoNoDTDWarning=1,
                        ExpandCharacterEntities=1,
                        ExpandGeneralEntities=1,
                        warnCB = warnCB,
                        srcName='string input',
                        ReturnUTF8 = 1,
                        )
    def parsexml(xmlText, oneOutermostTag=0,eoCB=None,entityReplacer=None,parseOpts={}):
        pyRXP_parser.eoCB = eoCB
        p = pyRXP_parser.parse(xmlText,**parseOpts)
        return oneOutermostTag and p or ('',None,[p],None)
except ImportError:
    simpleparse = 1

NONAME = ""
NAMEKEY = 0
CONTENTSKEY = 1
CDATAMARKER = "<![CDATA["
LENCDATAMARKER = len(CDATAMARKER)
CDATAENDMARKER = "]]>"
replacelist = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&")] # amp must be last
#replacelist = []
def unEscapeContentList(contentList):
    result = []
    from string import replace
    for e in contentList:
        if "&" in e:
            for (old, new) in replacelist:
                e = replace(e, old, new)
        result.append(e)
    return result
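
# Illustrative aside, not part of rparsexml: a minimal sketch of why the
# "&amp;" pair has to come last in replacelist. The names REPLACEMENTS,
# unescape and amp_first are hypothetical, and str.replace stands in for
# string.replace so the snippet also runs on newer Pythons.

```python
# Same pairs, in the same order, as replacelist above.
REPLACEMENTS = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&")]

def unescape(text, pairs=REPLACEMENTS):
    """Apply each (old, new) replacement in the order given."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text

# Hypothetical reordering with "&amp;" handled first instead of last.
amp_first = [REPLACEMENTS[2], REPLACEMENTS[0], REPLACEMENTS[1]]

escaped = "&amp;lt;"                   # escaped form of the literal text "&lt;"
correct = unescape(escaped)            # one level of unescaping
wrong = unescape(escaped, amp_first)   # unescapes twice by accident
```

# With "&amp;" last, the text is unescaped exactly once; with it first, the
# "&" it produces is picked up again by the "&lt;" replacement.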

def parsexmlSimple(xmltext, oneOutermostTag=0,eoCB=None,entityReplacer=unEscapeContentList):
    """official interface: discard unused cursor info"""
    if RequirePyRXP:
        raise ImportError, "pyRXP not found, fallback parser disabled"
    (result, cursor) = parsexml0(xmltext,entityReplacer=entityReplacer)
    if oneOutermostTag:
        return result[2][0]
    else:
        return result

if simpleparse:
    parsexml = parsexmlSimple

def parseFile(filename):
    raw = open(filename, 'r').read()
    return parsexml(raw)
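
# Illustrative aside, not part of rparsexml: a minimal sketch of how a caller
# might walk the (name, attributes, contents, misc) tuples described in the
# module docstring. iter_text and example are hypothetical names; the example
# tuple is copied from the docstring rather than produced by a real parse.

```python
def iter_text(node):
    """Yield the text fragments under a parse tuple, in document order."""
    name, attrs, contents, misc = node
    if contents is None:          # a <simple_tag/> has no content list
        return
    for child in contents:
        if isinstance(child, tuple):
            # nested tag: recurse into its own content list
            for piece in iter_text(child):
                yield piece
        else:
            yield child           # plain text fragment

example = ("this", {"type": "xml"},
           ["text ", ("b", None, ["in"], None), " xml"],
           None)
flat = "".join(iter_text(example))
```

# The same walk applies to the unnamed top-level wrapper, whose name is ""
# and whose content list holds the outermost tags.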

verbose = 0

def skip_prologue(text, cursor):
    """skip any prologue found after cursor, return index of rest of text"""
    ### NOT AT ALL COMPLETE!!! definitely can be confused!!!
    from string import find
    prologue_elements = ("!DOCTYPE", "?xml", "!--")
    done = None
    while done is None:
        #print "trying to skip:", repr(text[cursor:cursor+20])
        openbracket = find(text, "<", cursor)
        if openbracket<0: break
        past = openbracket+1
        found = None
        for e in prologue_elements:
            le = len(e)
            if text[past:past+le]==e:
                found = 1
                cursor = find(text, ">", past)
                if cursor<0:
                    raise ValueError, "can't close prologue %s" % `e`
                cursor = cursor+1
        if found is None:
            done=1
    #print "done skipping"
    return cursor
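
# Illustrative aside, not part of rparsexml: a minimal sketch of the same
# prologue-skipping loop written with string methods, to make its behaviour
# (and the limitation flagged in the comment above) easy to try out.
# skip_prologue_sketch and PROLOGUE_ELEMENTS are hypothetical names.

```python
PROLOGUE_ELEMENTS = ("!DOCTYPE", "?xml", "!--")

def skip_prologue_sketch(text, cursor=0):
    """Return the index just past any leading prologue items."""
    while True:
        openbracket = text.find("<", cursor)
        if openbracket < 0:
            break                              # no more tags at all
        past = openbracket + 1
        if not text.startswith(PROLOGUE_ELEMENTS, past):
            break                              # first real tag reached
        close = text.find(">", past)
        if close < 0:
            raise ValueError("can't close prologue near %r"
                             % text[openbracket:openbracket + 20])
        cursor = close + 1                     # resume after the ">"
    return cursor

doc = '<?xml version="1.0"?><!DOCTYPE a><a/>'
rest = doc[skip_prologue_sketch(doc):]

# Known weakness: the scan stops at the first ">", even inside a comment.
tricky = '<!-- a > b --><a/>'
tricky_rest = tricky[skip_prologue_sketch(tricky):]
```

# For the well-formed input the remainder is the first real tag; for the
# comment containing ">" the remainder starts in the middle of the comment.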

def parsexml0(xmltext, startingat=0, toplevel=1,
        # snarf in some globals
        strip=string.strip, split=string.split, find=string.find, entityReplacer=unEscapeContentList,
        #len=len, None=None
        #LENCDATAMARKER=LENCDATAMARKER, CDATAMARKER=CDATAMARKER
        ):
    """simple recursive descent xml parser...
       return (dictionary, endcharacter)
       special case: comment returns (None, endcharacter)"""
    #from string import strip, split, find
    #print "parsexml0", `xmltext[startingat: startingat+10]`
    # DEFAULTS
    NameString = NONAME
    ContentList = AttDict = ExtraStuff = None
    if toplevel is not None:
        #if verbose: print "at top level"
        #if startingat!=0:
        #    raise ValueError, "have to start at 0 for top level!"
        xmltext = strip(xmltext)
    cursor = startingat
    #look for interesting starting points
    firstbracket = find(xmltext, "<", cursor)
    afterbracket2char = xmltext[firstbracket+1:firstbracket+3]
    #print "a", `afterbracket2char`
    #firstampersand = find(xmltext, "&", cursor)
    #if firstampersand>0 and firstampersand<firstbracket:
    #    raise ValueError, "I don't handle ampersands yet!!!"
    docontents = 1
    if firstbracket<0:
            # no tags
            #if verbose: print "no tags"
            if toplevel is not None:
                #D = {NAMEKEY: NONAME, CONTENTSKEY: [xmltext[cursor:]]}
                ContentList = [xmltext[cursor:]]
                if entityReplacer: ContentList = entityReplacer(ContentList)
                return (NameString, AttDict, ContentList, ExtraStuff), len(xmltext)
            else:
                raise ValueError, "no tags at non-toplevel %s" % `xmltext[cursor:cursor+20]`
    #D = {}
    L = []
    # look for start tag
    # NEED to force always outer level is unnamed!!!
    #if toplevel and firstbracket>0:
    #afterbracket2char = xmltext[firstbracket:firstbracket+2]
    if toplevel is not None:
            #print "toplevel with no outer tag"
            NameString = name = NONAME
            cursor = skip_prologue(xmltext, cursor)
            #break
    elif firstbracket<0:
            raise ValueError, "non top level entry should be at start tag: %s" % repr(xmltext[:10])
    # special case: CDATA
    elif afterbracket2char=="![" and xmltext[firstbracket:firstbracket+9]=="<![CDATA[":
            #print "in CDATA", cursor
            # skip straight to the close marker
            startcdata = firstbracket+9
            endcdata = find(xmltext, CDATAENDMARKER, startcdata)
            if endcdata<0:
                raise ValueError, "unclosed CDATA %s" % repr(xmltext[cursor:cursor+20])
            NameString = CDATAMARKER
            ContentList = [xmltext[startcdata: endcdata]]
            cursor = endcdata+len(CDATAENDMARKER)
            docontents = None
    # special case COMMENT
    elif afterbracket2char=="!-" and xmltext[firstbracket:firstbracket+4]=="<!--":
            #print "in COMMENT"
            endcommentdashes = find(xmltext, "--", firstbracket+4)
            if endcommentdashes<firstbracket:
                raise ValueError, "unterminated comment %s" % repr(xmltext[cursor:cursor+20])
            endcomment = endcommentdashes+2
            if xmltext[endcomment]!=">":
                raise ValueError, "invalid comment: contains double dashes %s" % repr(xmltext[cursor:cursor+20])
            return (None, endcomment+1) # shortcut exit
    else:
            # get the rest of the tag
            #if verbose: print "parsing start tag"
            # make sure the tag isn't in doublequote pairs
            closebracket = find(xmltext, ">", firstbracket)
            noclose = closebracket<0
            startsearch = closebracket+1
            pastfirstbracket = firstbracket+1
            tagcontent = xmltext[pastfirstbracket:closebracket]
            # shortcut, no equal means nothing but name in the tag content
            if '=' not in tagcontent:
                if tagcontent[-1]=="/":
                    # simple case
                    #print "simple case", tagcontent
                    tagcontent = tagcontent[:-1]
                    docontents = None
                name = strip(tagcontent)
                NameString = name
                cursor = startsearch
            else:
                if '"' in tagcontent:
                    # check double quotes
                    stop = None
                    # not inside double quotes! (the split should have odd length)
                    if noclose or len(split(tagcontent+".", '"'))% 2:
                        stop=1
                    while stop is None:
                        closebracket = find(xmltext, ">", startsearch)
                        startsearch = closebracket+1
                        noclose = closebracket<0
                        tagcontent = xmltext[pastfirstbracket:closebracket]
                        # not inside double quotes! (the split should have odd length)
                        if noclose or len(split(tagcontent+".", '"'))% 2:
                            stop=1
                if noclose:
                    raise ValueError, "unclosed start tag %s" % repr(xmltext[firstbracket:firstbracket+20])
                cursor = startsearch
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   242
                #cursor = closebracket+1
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   243
                # handle simple tag /> syntax
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   244
                if xmltext[closebracket-1]=="/":
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   245
                    #if verbose: print "it's a simple tag"
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   246
                    closebracket = closebracket-1
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   247
                    tagcontent = tagcontent[:-1]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   248
                    docontents = None
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   249
                #tagcontent = xmltext[firstbracket+1:closebracket]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   250
                tagcontent = strip(tagcontent)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   251
                taglist = split(tagcontent, "=")
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   252
                #if not taglist:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   253
                #    raise ValueError, "tag with no name %s" % repr(xmltext[firstbracket:firstbracket+20])
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   254
                taglist0 = taglist[0]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   255
                taglist0list = split(taglist0)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   256
                #if len(taglist0list)>2:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   257
                #    raise ValueError, "bad tag head %s" % repr(taglist0)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   258
                name = taglist0list[0]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   259
                #print "tag name is", name
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   260
                NameString = name
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   261
                # now parse the attributes
                attributename = taglist0list[-1]
                # put a fake att name at end of last taglist entry for consistent parsing
                taglist[-1] = taglist[-1]+" f"
                AttDict = D = {}
                taglistindex = 1
                lasttaglistindex = len(taglist)
                #for attentry in taglist[1:]:
                while taglistindex<lasttaglistindex:
                    #print "looking for attribute named", attributename
                    attentry = taglist[taglistindex]
                    taglistindex = taglistindex+1
                    attentry = strip(attentry)
                    if attentry[0]!='"':
                        raise ValueError, "attribute value must start with double quotes" + repr(attentry)
                    while '"' not in attentry[1:]:
                        # must have an = inside the attribute value...
                        # no entries left to join means the value was never closed
                        if taglistindex>=lasttaglistindex:
                            raise ValueError, "unclosed value " + repr(attentry)
                        nextattentry = taglist[taglistindex]
                        taglistindex = taglistindex+1
                        attentry = "%s=%s" % (attentry, nextattentry)
                    attentry = strip(attentry) # only needed for while loop...
                    attlist = split(attentry)
                    nextattname = attlist[-1]
                    attvalue = attentry[:-len(nextattname)]
                    attvalue = strip(attvalue)
                    try:
                        first = attvalue[0]; last=attvalue[-1]
                    except:
                        raise ValueError, "attvalue,attentry,attlist="+repr((attvalue, attentry,attlist))
                    if first==last=='"' or first==last=="'":
                        attvalue = attvalue[1:-1]
                    #print attributename, "=", attvalue
                    D[attributename] = attvalue
                    attributename = nextattname
    # pass over other tags and content looking for end tag
    if docontents is not None:
        #print "now looking for end tag"
        ContentList = L
    while docontents is not None:
            nextopenbracket = find(xmltext, "<", cursor)
            if nextopenbracket<cursor:
                #if verbose: print "no next open bracket found"
                if name==NONAME:
                    #print "no more tags for noname", repr(xmltext[cursor:cursor+10])
                    docontents=None # done
                    remainder = xmltext[cursor:]
                    cursor = len(xmltext)
                    if remainder:
                        L.append(remainder)
                else:
                    raise ValueError, "no close bracket for %s found after %s" % (name,repr(xmltext[cursor: cursor+20]))
            # is it a close bracket?
            elif xmltext[nextopenbracket+1]=="/":
                #print "found close bracket", repr(xmltext[nextopenbracket:nextopenbracket+20])
                nextclosebracket = find(xmltext, ">", nextopenbracket)
                if nextclosebracket<nextopenbracket:
                    raise ValueError, "unclosed close tag %s" % repr(xmltext[nextopenbracket: nextopenbracket+20])
                closetagcontents = xmltext[nextopenbracket+2: nextclosebracket]
                closetaglist = split(closetagcontents)
                #if len(closetaglist)!=1:
                    #print closetagcontents
                    #raise ValueError, "bad close tag format %s" % repr(xmltext[nextopenbracket: nextopenbracket+20])
                # name should match
                closename = closetaglist[0]
                #if verbose: print "closetag name is", closename
                if name!=closename:
                    prefix = xmltext[:cursor]
                    endlinenum = len(split(prefix, "\n"))
                    prefix = xmltext[:startingat]
                    linenum = len(split(prefix, "\n"))
                    raise ValueError, \
                       "at lines %s...%s close tag name doesn't match %s...%s %s" %(
                       linenum, endlinenum, `name`, `closename`, repr(xmltext[cursor: cursor+100]))
                remainder = xmltext[cursor:nextopenbracket]
                if remainder:
                    #if verbose: print "remainder", repr(remainder)
                    L.append(remainder)
                cursor = nextclosebracket+1
                #print "for", name, "found close tag"
                docontents = None # done
            # otherwise we are looking at a new tag, recursively parse it...
            # first record any intervening content
            else:
                remainder = xmltext[cursor:nextopenbracket]
                if remainder:
                    L.append(remainder)
                #if verbose:
                #    #print "skipping", repr(remainder)
                #    #print "--- recursively parsing starting at", xmltext[nextopenbracket:nextopenbracket+20]
                (parsetree, cursor) = parsexml0(xmltext, startingat=nextopenbracket, toplevel=None, entityReplacer=entityReplacer)
                if parsetree:
                    L.append(parsetree)
        # maybe should check for trailing garbage?
        # toplevel:
        #    remainder = strip(xmltext[cursor:])
        #    if remainder:
        #        raise ValueError, "trailing garbage at top level %s" % repr(remainder[:20])
    if ContentList:
        if entityReplacer: ContentList = entityReplacer(ContentList)
    t = (NameString, AttDict, ContentList, ExtraStuff)
    return (t, cursor)

import types
def pprettyprint(parsedxml):
    """pretty printer mainly for testing"""
    st = types.StringType
    if type(parsedxml) is st:
        return parsedxml
    (name, attdict, textlist, extra) = parsedxml
    if not attdict: attdict={}
    join = string.join
    attlist = []
    for k in attdict.keys():
        v = attdict[k]
        attlist.append("%s=%s" % (k, `v`))
    attributes = join(attlist, " ")
    if not name and attributes:
        raise ValueError, "name missing with attributes???"
    if textlist is not None:
        # with content
        textlistpprint = map(pprettyprint, textlist)
        textpprint = join(textlistpprint, "\n")
        if not name:
            return textpprint # no outer tag
        # indent it
        nllist = string.split(textpprint, "\n")
        textpprint = "   "+join(nllist, "\n   ")
        return "<%s %s>\n%s\n</%s>" % (name, attributes, textpprint, name)
    # otherwise must be a simple tag
    return "<%s %s/>" % (name, attributes)

dump = 0
def testparse(s):
    from time import time
    from pprint import pprint
    now = time()
    D = parsexmlSimple(s)
    print "DONE", time()-now
    if dump&4:
        pprint(D)
    #pprint(D)
    if dump&1:
        print "============== reformatting"
        p = pprettyprint(D)
        print p

def test():
    testparse("""<this type="xml">text &lt;&gt;<b>in</b> <funnytag foo="bar"/> xml</this>
                 <!-- comment -->
                 <![CDATA[
                 <this type="xml">text <b>in</b> xml</this> ]]>
                 <tag with="<brackets in values>">just testing brackets feature</tag>
                 """)

filenames = [ #"../../reportlab/demos/pythonpoint/pythonpoint.xml",
              "samples/hamlet.xml"]

#filenames = ["moa.xml"]

dump=1
if __name__=="__main__":
    test()
    from time import time
    now = time()
    for f in filenames:
        t = open(f).read()
        print "parsing", f
        testparse(t)
    print "elapsed", time()-now
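
The start-tag scan in parsexml0 accepts a ">" as the end of a tag only when the tag text before it splits on '"' into an odd number of pieces, that is, when it contains an even number of double quotes. The following is a minimal Python 3 sketch of that check in isolation, not the module's own code: `outside_quotes` and `find_tag_end` are hypothetical helper names introduced here, and like the original the sketch only considers double quotes.

```python
def outside_quotes(tag_content):
    """Return True when tag_content holds an even number of double quotes.

    Splitting on '"' yields (number of quotes + 1) pieces, so an odd
    piece count means every opened quote has been closed again.
    """
    return len(tag_content.split('"')) % 2 == 1


def find_tag_end(xmltext, firstbracket):
    """Return the index of the '>' that closes the start tag opening at
    firstbracket, skipping any '>' inside a double-quoted attribute value.
    Returns -1 when no such bracket exists (hypothetical convention)."""
    startsearch = firstbracket + 1
    while True:
        closebracket = xmltext.find(">", startsearch)
        if closebracket < 0:
            return -1  # unclosed start tag
        if outside_quotes(xmltext[firstbracket + 1:closebracket]):
            return closebracket
        # this '>' sits inside a quoted value: keep scanning after it
        startsearch = closebracket + 1


if __name__ == "__main__":
    sample = '<tag with="<brackets in values>">just testing brackets feature</tag>'
    end = find_tag_end(sample, 0)
    print(sample[:end + 1])
```

Run on the bracket test case from test(), the scan passes over the ">" inside the attribute value and stops at the one that follows the closing quote.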