src/reportlab/lib/rparsexml.py
author rptlab
Tue, 30 Apr 2013 14:28:14 +0100
branch py33
changeset 3723 99aa837b6703
parent 3721 0c93dd8ff567
child 3782 bb8cb5194b0f
permissions -rw-r--r--
second stage of port to Python 3.3; working hello world
"""Very simple and fast XML parser, used for intra-paragraph text.

Devised by Aaron Watters in the bad old days before Python had fast
parsers available.  Constructs the lightest possible in-memory
representation; parses most files we have seen in pure python very
quickly.

The output structure is the same as the one produced by pyRXP,
our validating C-based parser, which was written later.  It will
use pyRXP if available.

This is used to parse intra-paragraph markup.

Example parse::

    <this type="xml">text <b>in</b> xml</this>

    ( "this",
      {"type": "xml"},
      [ "text ",
        ("b", None, ["in"], None),
        " xml"
        ],
       None )

    { 0: "this",
      "type": "xml",
      1: ["text ",
          {0: "b", 1:["in"]},
          " xml"]
    }

I.e., an xml tag translates to a tuple::

    (name, dictofattributes, contentlist, miscellaneousinfo)

where miscellaneousinfo can be anything (but defaults to None),
with the intention of adding, e.g., line number information.

Special cases: a name of "" means "top level, no containing tag".
A top level parse always looks like this::

    ("", None, list, None)

A contentlist of None means <simple_tag/>.

The unnamed top level is there in order to support stuff like::

    <this></this><one></one>

AT THE MOMENT &amp; ETCETERA ARE IGNORED. THEY MUST BE PROCESSED
IN A POST-PROCESSING STEP.

PROLOGUES ARE NOT UNDERSTOOD.  OTHER STUFF IS PROBABLY MISSING.
"""

RequirePyRXP = 0        # set this to 1 to disable the nonvalidating fallback parser.

import string
try:
    #raise ImportError, "dummy error"
    simpleparse = 0
    import pyRXPU
    def warnCB(s):
        print(s)
    pyRXP_parser = pyRXPU.Parser(
                        ErrorOnValidityErrors=1,
                        NoNoDTDWarning=1,
                        ExpandCharacterEntities=1,
                        ExpandGeneralEntities=1,
                        warnCB = warnCB,
                        srcName='string input',
                        ReturnUTF8 = 1,
                        )
    def parsexml(xmlText, oneOutermostTag=0,eoCB=None,entityReplacer=None,parseOpts={}):
        pyRXP_parser.eoCB = eoCB
        p = pyRXP_parser.parse(xmlText,**parseOpts)
        return oneOutermostTag and p or ('',None,[p],None)
except ImportError:
    simpleparse = 1

NONAME = ""
NAMEKEY = 0
CONTENTSKEY = 1
CDATAMARKER = "<![CDATA["
LENCDATAMARKER = len(CDATAMARKER)
CDATAENDMARKER = "]]>"
replacelist = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&")] # amp must be last
#replacelist = []
def unEscapeContentList(contentList):
    result = []
    for e in contentList:
        if "&" in e:
            for (old, new) in replacelist:
                # use the str method: string.replace() no longer exists in Python 3
                e = e.replace(old, new)
        result.append(e)
    return result

def parsexmlSimple(xmltext, oneOutermostTag=0,eoCB=None,entityReplacer=unEscapeContentList):
    """official interface: discard unused cursor info"""
    if RequirePyRXP:
        raise ImportError("pyRXP not found, fallback parser disabled")
    (result, cursor) = parsexml0(xmltext,entityReplacer=entityReplacer)
    if oneOutermostTag:
        return result[2][0]
    else:
        return result

if simpleparse:
    parsexml = parsexmlSimple

def parseFile(filename):
    # close the file promptly instead of relying on garbage collection
    with open(filename, 'r') as f:
        raw = f.read()
    return parsexml(raw)
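
# Illustrative sketch, not part of the original module: one way a caller might
# walk the (name, attributes, contents, misc) tuples described in the module
# docstring. The helper name collectText is hypothetical, and it assumes the
# content list holds only strings and nested 4-tuples.
def collectText(node):
    """Return the concatenated text found in a parsed tuple tree."""
    name, attrs, contents, misc = node
    if contents is None:
        # contents of None means <simple_tag/>, so there is no text to add
        return ""
    parts = []
    for item in contents:
        if isinstance(item, tuple):
            # nested tag: recurse into its own content list
            parts.append(collectText(item))
        else:
            parts.append(item)
    return "".join(parts)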

verbose = 0

def skip_prologue(text, cursor):
    """skip any prologue found after cursor, return index of rest of text"""
    ### NOT AT ALL COMPLETE!!! definitely can be confused!!!
    # str.find method is used: string.find() no longer exists in Python 3
    prologue_elements = ("!DOCTYPE", "?xml", "!--")
    done = None
    while done is None:
        #print "trying to skip:", repr(text[cursor:cursor+20])
        openbracket = text.find("<", cursor)
        if openbracket<0: break
        past = openbracket+1
        found = None
        for e in prologue_elements:
            le = len(e)
            if text[past:past+le]==e:
                found = 1
                cursor = text.find(">", past)
                if cursor<0:
                    raise ValueError("can't close prologue %r" % e)
                cursor = cursor+1
        if found is None:
            done=1
    #print "done skipping"
    return cursor

def parsexml0(xmltext, startingat=0, toplevel=1,
        # snarf in some globals
        # (str methods: string.strip/split/find no longer exist in Python 3)
        strip=str.strip, split=str.split, find=str.find, entityReplacer=unEscapeContentList,
        #len=len, None=None
        #LENCDATAMARKER=LENCDATAMARKER, CDATAMARKER=CDATAMARKER
        ):
    """simple recursive descent xml parser...
       return (tuple, endcharacter)
       special case: comment returns (None, endcharacter)"""
    #from string import strip, split, find
    #print "parsexml0", repr(xmltext[startingat: startingat+10])
    # DEFAULTS
    NameString = NONAME
    ContentList = AttDict = ExtraStuff = None
    if toplevel is not None:
        #if verbose: print "at top level"
        #if startingat!=0:
        #    raise ValueError, "have to start at 0 for top level!"
        xmltext = strip(xmltext)
    cursor = startingat
    #look for interesting starting points
    firstbracket = find(xmltext, "<", cursor)
    afterbracket2char = xmltext[firstbracket+1:firstbracket+3]
    #print "a", repr(afterbracket2char)
    #firstampersand = find(xmltext, "&", cursor)
    #if firstampersand>0 and firstampersand<firstbracket:
    #    raise ValueError, "I don't handle ampersands yet!!!"
    docontents = 1
    if firstbracket<0:
            # no tags
            #if verbose: print "no tags"
            if toplevel is not None:
                #D = {NAMEKEY: NONAME, CONTENTSKEY: [xmltext[cursor:]]}
                ContentList = [xmltext[cursor:]]
                if entityReplacer: ContentList = entityReplacer(ContentList)
                return (NameString, AttDict, ContentList, ExtraStuff), len(xmltext)
            else:
                raise ValueError("no tags at non-toplevel %s" % repr(xmltext[cursor:cursor+20]))
    #D = {}
    L = []
    # look for start tag
    # NEED to force always outer level is unnamed!!!
    #if toplevel and firstbracket>0:
    #afterbracket2char = xmltext[firstbracket:firstbracket+2]
    if toplevel is not None:
            #print "toplevel with no outer tag"
            NameString = name = NONAME
            cursor = skip_prologue(xmltext, cursor)
            #break
    elif firstbracket<0:
            raise ValueError("non top level entry should be at start tag: %s" % repr(xmltext[:10]))
    # special case: CDATA
    elif afterbracket2char=="![" and xmltext[firstbracket:firstbracket+9]=="<![CDATA[":
            #print "in CDATA", cursor
            # skip straight to the close marker
            startcdata = firstbracket+9
            endcdata = find(xmltext, CDATAENDMARKER, startcdata)
            if endcdata<0:
                raise ValueError("unclosed CDATA %s" % repr(xmltext[cursor:cursor+20]))
            NameString = CDATAMARKER
            ContentList = [xmltext[startcdata: endcdata]]
            cursor = endcdata+len(CDATAENDMARKER)
            docontents = None
    # special case COMMENT
    elif afterbracket2char=="!-" and xmltext[firstbracket:firstbracket+4]=="<!--":
            #print "in COMMENT"
            endcommentdashes = find(xmltext, "--", firstbracket+4)
            if endcommentdashes<firstbracket:
                raise ValueError("unterminated comment %s" % repr(xmltext[cursor:cursor+20]))
            endcomment = endcommentdashes+2
            if xmltext[endcomment]!=">":
                raise ValueError("invalid comment: contains double dashes %s" % repr(xmltext[cursor:cursor+20]))
            return (None, endcomment+1) # shortcut exit
    else:
            # get the rest of the tag
            #if verbose: print "parsing start tag"
            # make sure the tag isn't in doublequote pairs
            closebracket = find(xmltext, ">", firstbracket)
            noclose = closebracket<0
            startsearch = closebracket+1
            pastfirstbracket = firstbracket+1
            tagcontent = xmltext[pastfirstbracket:closebracket]
            # shortcut, no equal means nothing but name in the tag content
            if '=' not in tagcontent:
                if tagcontent[-1]=="/":
                    # simple case
                    #print "simple case", tagcontent
                    tagcontent = tagcontent[:-1]
                    docontents = None
                name = strip(tagcontent)
                NameString = name
                cursor = startsearch
            else:
                if '"' in tagcontent:
                    # check double quotes
                    stop = None
                    # not inside double quotes! (the split should have odd length)
                    if noclose or len(split(tagcontent+".", '"'))% 2:
                        stop=1
                    while stop is None:
                        closebracket = find(xmltext, ">", startsearch)
                        startsearch = closebracket+1
                        noclose = closebracket<0
                        tagcontent = xmltext[pastfirstbracket:closebracket]
                        # not inside double quotes! (the split should have odd length)
                        if noclose or len(split(tagcontent+".", '"'))% 2:
                            stop=1
                if noclose:
                    raise ValueError("unclosed start tag %s" % repr(xmltext[firstbracket:firstbracket+20]))
                cursor = startsearch
                #cursor = closebracket+1
                # handle simple tag /> syntax
                if xmltext[closebracket-1]=="/":
                    #if verbose: print "it's a simple tag"
                    closebracket = closebracket-1
                    tagcontent = tagcontent[:-1]
                    docontents = None
                #tagcontent = xmltext[firstbracket+1:closebracket]
aaron_watters
parents:
diff changeset
   261
                tagcontent = strip(tagcontent)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   262
                taglist = split(tagcontent, "=")
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   263
                #if not taglist:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   264
                #    raise ValueError, "tag with no name %s" % repr(xmltext[firstbracket:firstbracket+20])
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   265
                taglist0 = taglist[0]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   266
                taglist0list = split(taglist0)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   267
                #if len(taglist0list)>2:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   268
                #    raise ValueError, "bad tag head %s" % repr(taglist0)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   269
                name = taglist0list[0]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   270
                #print "tag name is", name
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   271
                NameString = name
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   272
                # now parse the attributes
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   273
                attributename = taglist0list[-1]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   274
                # put a fake att name at end of last taglist entry for consistent parsing
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   275
                taglist[-1] = taglist[-1]+" f"
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   276
                AttDict = D = {}
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   277
                taglistindex = 1
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   278
                lasttaglistindex = len(taglist)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   279
                #for attentry in taglist[1:]:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   280
                while taglistindex<lasttaglistindex:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   281
                    #print "looking for attribute named", attributename
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   282
                    attentry = taglist[taglistindex]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   283
                    taglistindex = taglistindex+1
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   284
                    attentry = strip(attentry)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   285
                    if attentry[0]!='"':
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   286
                        raise ValueError("attribute value must start with double quotes: " + repr(attentry))
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   287
                    while '"' not in attentry[1:]:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   288
                        # must have an = inside the attribute value...
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   289
                        if taglistindex>=lasttaglistindex: # taglist[lasttaglistindex] would be out of range
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   290
                            raise ValueError("unclosed value " + repr(attentry))
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   291
                        nextattentry = taglist[taglistindex]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   292
                        taglistindex = taglistindex+1
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   293
                        attentry = "%s=%s" % (attentry, nextattentry)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   294
                    attentry = strip(attentry) # only needed for while loop...
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   295
                    attlist = split(attentry)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   296
                    nextattname = attlist[-1]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   297
                    attvalue = attentry[:-len(nextattname)]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   298
                    attvalue = strip(attvalue)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   299
                    try:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   300
                        first = attvalue[0]; last=attvalue[-1]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   301
                    except IndexError: # attvalue is empty
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   302
                        raise ValueError("attvalue,attentry,attlist="+repr((attvalue, attentry,attlist)))
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   303
                    if first==last=='"' or first==last=="'":
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   304
                        attvalue = attvalue[1:-1]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   305
                    #print attributename, "=", attvalue
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   306
                    D[attributename] = attvalue
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   307
                    attributename = nextattname
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   308
    # pass over other tags and content looking for end tag
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   309
    if docontents is not None:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   310
        #print "now looking for end tag"
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   311
        ContentList = L
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   312
    while docontents is not None:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   313
            nextopenbracket = find(xmltext, "<", cursor)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   314
            if nextopenbracket<cursor:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   315
                #if verbose: print "no next open bracket found"
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   316
                if name==NONAME:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   317
                    #print "no more tags for noname", repr(xmltext[cursor:cursor+10])
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   318
                    docontents=None # done
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   319
                    remainder = xmltext[cursor:]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   320
                    cursor = len(xmltext)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   321
                    if remainder:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   322
                        L.append(remainder)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   323
                else:
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   324
                    raise ValueError("no close bracket for %s found after %s" % (name,repr(xmltext[cursor: cursor+20])))
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   325
            # is it a close bracket?
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   326
            elif xmltext[nextopenbracket+1]=="/":
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   327
                #print "found close bracket", repr(xmltext[nextopenbracket:nextopenbracket+20])
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   328
                nextclosebracket = find(xmltext, ">", nextopenbracket)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   329
                if nextclosebracket<nextopenbracket:
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   330
                    raise ValueError("unclosed close tag %s" % repr(xmltext[nextopenbracket: nextopenbracket+20]))
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   331
                closetagcontents = xmltext[nextopenbracket+2: nextclosebracket]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   332
                closetaglist = split(closetagcontents)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   333
                #if len(closetaglist)!=1:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   334
                    #print closetagcontents
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   335
                    #raise ValueError, "bad close tag format %s" % repr(xmltext[nextopenbracket: nextopenbracket+20])
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   336
                # name should match
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   337
                closename = closetaglist[0]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   338
                #if verbose: print "closetag name is", closename
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   339
                if name!=closename:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   340
                    prefix = xmltext[:cursor]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   341
                    endlinenum = len(split(prefix, "\n"))
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   342
                    prefix = xmltext[:startingat]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   343
                    linenum = len(split(prefix, "\n"))
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   344
                    raise ValueError("at lines %s...%s close tag name doesn't match %s...%s %s" %(
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   345
                       linenum, endlinenum, repr(name), repr(closename), repr(xmltext[cursor: cursor+100])))
1771
105572a4222f Whitespace and tab character cleanup
andy_robinson
parents: 1724
diff changeset
   346
                remainder = xmltext[cursor:nextopenbracket]
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   347
                if remainder:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   348
                    #if verbose: print "remainder", repr(remainder)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   349
                    L.append(remainder)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   350
                cursor = nextclosebracket+1
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   351
                #print "for", name, "found close tag"
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   352
                docontents = None # done
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   353
            # otherwise we are looking at a new tag, recursively parse it...
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   354
            # first record any intervening content
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   355
            else:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   356
                remainder = xmltext[cursor:nextopenbracket]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   357
                if remainder:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   358
                    L.append(remainder)
1771
105572a4222f Whitespace and tab character cleanup
andy_robinson
parents: 1724
diff changeset
   359
                #if verbose:
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   360
                #    #print "skipping", repr(remainder)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   361
                #    #print "--- recursively parsing starting at", xmltext[nextopenbracket:nextopenbracket+20]
1984
daa064c0eeb1 Allow for no entuty replacement
rgbecker
parents: 1771
diff changeset
   362
                (parsetree, cursor) = parsexml0(xmltext, startingat=nextopenbracket, toplevel=None, entityReplacer=entityReplacer)
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   363
                if parsetree:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   364
                    L.append(parsetree)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   365
        # maybe should check for trailing garbage?
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   366
        # toplevel:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   367
        #    remainder = strip(xmltext[cursor:])
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   368
        #    if remainder:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   369
        #        raise ValueError, "trailing garbage at top level %s" % repr(remainder[:20])
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   370
    if ContentList:
1984
daa064c0eeb1 Allow for no entuty replacement
rgbecker
parents: 1771
diff changeset
   371
        if entityReplacer: ContentList = entityReplacer(ContentList)
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   372
    t = (NameString, AttDict, ContentList, ExtraStuff)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   373
    return (t, cursor)
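The attribute loop in parsexml0 splits the tag content on "=" and treats the last whitespace-separated word of each piece as the name of the next attribute, with a fake name appended to the final piece so every value is followed by a name. The following is a minimal standalone sketch of that idea, not the module's own code: the function name `parse_tag` is hypothetical, and it uses Python 3 string methods instead of the module-level `split` and `strip` helpers.

```python
def parse_tag(tagcontent):
    """Return (name, attributes) for the text between '<' and '>'.

    Hypothetical helper mirroring the split-on-'=' technique.
    """
    pieces = tagcontent.strip().split("=")
    head = pieces[0].split()
    name = head[0]
    attname = head[-1]          # word just before the first '='
    pieces[-1] += " f"          # fake trailing attribute name (sentinel)
    attrs = {}
    i = 1
    while i < len(pieces):
        entry = pieces[i].strip()
        i += 1
        if not entry.startswith('"'):
            raise ValueError("attribute value must start with double quotes: %r" % entry)
        # No closing quote yet: the value itself contained '=', so re-join pieces.
        while '"' not in entry[1:]:
            if i >= len(pieces):
                raise ValueError("unclosed value %r" % entry)
            entry = "%s=%s" % (entry, pieces[i])
            i += 1
        entry = entry.strip()
        nextname = entry.split()[-1]            # name of the following attribute
        value = entry[:-len(nextname)].strip()  # everything before that name
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        attrs[attname] = value
        attname = nextname
    return name, attrs
```

For example, `parse_tag('a href="x=y" id="z"')` re-joins the pieces around the embedded "=" before extracting the value of `href`.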
1771
105572a4222f Whitespace and tab character cleanup
andy_robinson
parents: 1724
diff changeset
   374
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   375
import types
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   376
def pprettyprint(parsedxml):
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   377
    """pretty printer mainly for testing"""
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   378
    st = bytes
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   379
    if isinstance(parsedxml, (st, str)): # text nodes are str under Python 3
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   380
        return parsedxml
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   381
    (name, attdict, textlist, extra) = parsedxml
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   382
    if not attdict: attdict={}
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   383
    join = lambda seq, sep: sep.join(seq) # string.join does not exist in Python 3
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   384
    attlist = []
3723
99aa837b6703 second stage of port to Python 3.3; working hello world
rptlab
parents: 3721
diff changeset
   385
    for k in attdict.keys():
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   386
        v = attdict[k]
3326
ce725978d11c Initial Python3 compatibility fixes
damian
parents: 3138
diff changeset
   387
        attlist.append("%s=%s" % (k, repr(v)))
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   388
    attributes = join(attlist, " ")
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   389
    if not name and attributes:
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   390
        raise ValueError("name missing with attributes???")
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   391
    if textlist is not None:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   392
        # with content
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   393
        textlistpprint = list(map(pprettyprint, textlist))
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   394
        textpprint = join(textlistpprint, "\n")
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   395
        if not name:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   396
            return textpprint # no outer tag
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   397
        # indent it
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   398
        nllist = textpprint.split("\n") # string.split does not exist in Python 3
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   399
        textpprint = "   "+join(nllist, "\n   ")
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   400
        return "<%s %s>\n%s\n</%s>" % (name, attributes, textpprint, name)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   401
    # otherwise must be a simple tag
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   402
    return "<%s %s/>" % (name, attributes)
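pprettyprint walks the (name, attributes, content list, extra) tuples recursively, returning text nodes as they are and treating a content list of None as a simple tag. The same traversal can be sketched on a hand-built tree; the tree below is an assumption written out by hand to mirror that shape, not actual parser output, and `collect_text` is a hypothetical helper.

```python
def collect_text(node):
    """Concatenate all text found in a tuple tree, depth first."""
    if isinstance(node, str):
        return node                      # text node
    name, attrs, children, extra = node
    if children is None:
        return ""                        # simple tag such as <funnytag/>
    return "".join(collect_text(child) for child in children)


# Hand-built tree in the 4-tuple shape (assumed, for illustration only).
tree = ("this", {"type": "xml"},
        ["text ",
         ("b", None, ["in"], None),
         " ",
         ("funnytag", {"foo": "bar"}, None, None),
         " xml"],
        None)

print(repr(collect_text(tree)))
```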
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   403
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   404
dump = 0
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   405
def testparse(s):
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   406
    from time import time
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   407
    from pprint import pprint
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   408
    now = time()
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   409
    D = parsexmlSimple(s)
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   410
    print("DONE", time()-now)
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   411
    if dump&4:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   412
        pprint(D)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   413
    #pprint(D)
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   414
    if dump&1:
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   415
        print("============== reformatting")
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   416
        p = pprettyprint(D)
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   417
        print(p)
1771
105572a4222f Whitespace and tab character cleanup
andy_robinson
parents: 1724
diff changeset
   418
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   419
def test():
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   420
    testparse("""<this type="xml">text &lt;&gt;<b>in</b> <funnytag foo="bar"/> xml</this>
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   421
                 <!-- comment -->
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   422
                 <![CDATA[
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   423
                 <this type="xml">text <b>in</b> xml</this> ]]>
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   424
                 <tag with="<brackets in values>">just testing brackets feature</tag>
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   425
                 """)
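The last case in test() puts angle brackets inside an attribute value. The parser copes with this by counting pieces after splitting the tag content on double quotes: an odd count means the candidate ">" lies outside any quoted value. A minimal sketch of that check, assuming Python 3 string methods; the names `outside_double_quotes` and `find_tag_end` are hypothetical and not part of the module.

```python
def outside_double_quotes(tagcontent):
    """True when the end of tagcontent is not inside a double-quoted value."""
    # The trailing "." keeps a final quote from producing an empty last piece
    # being confused with no piece at all; balanced quotes give an odd count.
    return len((tagcontent + ".").split('"')) % 2 == 1


def find_tag_end(xmltext, pastfirstbracket):
    """Index of the first '>' after pastfirstbracket that is outside quotes."""
    startsearch = pastfirstbracket
    while True:
        closebracket = xmltext.find(">", startsearch)
        if closebracket < 0:
            raise ValueError("unclosed start tag %r" % xmltext[pastfirstbracket:pastfirstbracket + 20])
        if outside_double_quotes(xmltext[pastfirstbracket:closebracket]):
            return closebracket
        startsearch = closebracket + 1   # that '>' was inside a value; keep looking
```

Applied to `<tag with="<brackets in values>">`, the first ">" is rejected because the text before it contains a single, still open, double quote.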
1771
105572a4222f Whitespace and tab character cleanup
andy_robinson
parents: 1724
diff changeset
   426
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   427
filenames = [ #"../../reportlab/demos/pythonpoint/pythonpoint.xml",
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   428
              "samples/hamlet.xml"]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   429
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   430
#filenames = ["moa.xml"]
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   431
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   432
dump=1
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   433
if __name__=="__main__":
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   434
    test()
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   435
    from time import time
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   436
    now = time()
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   437
    for f in filenames:
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   438
        with open(f) as fh: # close the file once it has been read
            t = fh.read()
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   439
        print("parsing", f)
1724
75b6a5a0b406 added to support new paragraph
aaron_watters
parents:
diff changeset
   440
        testparse(t)
3721
0c93dd8ff567 initial changes from 2to3-3.3
rptlab
parents: 3326
diff changeset
   441
    print("elapsed", time()-now)