This material is no longer maintained. You will find updated material at: http://borres.hiof.no/wep/

DOM
Børre Stenseth

DOM in Python

What
DOM programming in Python, minidom

Python has machinery for handling DOM: it defines a programming interface that lets us build and operate on DOM trees. The standard library does not implement the full DOM, only what it calls minidom. For functionality beyond what minidom offers we must install a third-party library, e.g. lxml [1]. Note that lxml exposes an ElementTree-style API rather than the W3C DOM interface.

Minidom implements a subset of the W3C definition of the DOM programming interface. It handles most of the practical tasks we want to solve, although the code sometimes becomes a little verbose.
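As a first orientation, here is a minimal sketch of the minidom calls used throughout this page: parse a string, look up elements by tag name, read an attribute, and collect the text nodes of an element. It is written for Python 3 (the listings further down are Python 2), and the sample document and the function name list_greetings are made up for illustration.

```python
import xml.dom.minidom

# Hypothetical sample document, not part of the course data.
SAMPLE = ('<greetings>'
          '<greeting lang="en">Hello</greeting>'
          '<greeting lang="no">Hei</greeting>'
          '</greetings>')


def list_greetings(xml_text):
    """Return (lang, text) pairs for every <greeting> element."""
    dom = xml.dom.minidom.parseString(xml_text)
    try:
        pairs = []
        for elt in dom.getElementsByTagName("greeting"):
            # The text of an element lives in its child text nodes.
            text = "".join(node.data for node in elt.childNodes
                           if node.nodeType == node.TEXT_NODE)
            pairs.append((elt.getAttribute("lang"), text))
        return pairs
    finally:
        # Release the tree when we are done with it.
        dom.unlink()


if __name__ == "__main__":
    print(list_greetings(SAMPLE))
```

Even this small task needs an explicit loop over child nodes to get at the text, which is the kind of verbosity mentioned above.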

Example: Olympic data

We use the results file from the Olympics example, see the modules Olympiade and Noen datasett. The results are collected in one XML file: all_results.xml

We will do two exercises on this file:

  1. Produce an HTML file. This is in principle the same transformation as the one done with XSLT in the module XML2HTML.
  2. Search for a particular participant in all events in both Olympics.

Exercise 1

We start from the following Python program (written for Python 2):

"""
 Simple demo of dom.
 produce rudimentary html from xml-file with IOC-results
 B. Stenseth  2009
 Use:
 doit(infile,outfile)
 See default files below
"""
import xml.dom.minidom
#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None
def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename
eol='\n'
#---------------------------
# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc
HTMLFile="""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <META http-equiv="Content-Type" content="text/html; 
        charset=iso-8859-1\">
  <title>Olympiade</title>
</head>
<body>
%s
</body>
</html>
"""
 
def handleIOC(doc):
    S=''
    games=doc.getElementsByTagName("OlympicGame")
    for game in games:
        S+=handleGame(game)
        S+=eol
    return S
def handleGame(game):
    S= '<h2>%s</h2>\n' %game.getAttribute('place').encode('ISO-8859-1') 
    events=game.getElementsByTagName("event")
    for event in events:
        S+=handleEvent(event)
        S+=eol
    return S
    
def handleEvent(event):
    S= '<h3>%s</h3>\n' %event.getAttribute('dist').encode('ISO-8859-1')  
    participants=event.getElementsByTagName("athlet")
    for athlet in participants:
        S+=handleAthlet(athlet)
        S+=eol
    return S
def handleAthlet(athlet):
    name=athlet.getElementsByTagName("name")[0]
    S= "<p>Name:%s<br/>" %getText(name.childNodes)
    result=athlet.getElementsByTagName("result")[0]
    S+= "Result:%s</p>" %getText(result.childNodes)
    return S
# default file for demopurposes, change it
def doit(infile,outfile):
    document=getTextFile(infile)
    if(document!=None):
        dom = xml.dom.minidom.parseString(document)    
        T=handleIOC(dom)
        storeTextFile(outfile,HTMLFile%T)
        # clean up, only when a tree was actually built
        dom.unlink()
    else:
        print "sorry, something went wrong"
# basic testing 
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\all_results.xml',
         'c:\\web\\dw\\pydom\\py_results1.html')

The program performs a simple transformation of an XML structure into a rudimentary HTML string. Compare this code with the corresponding XSLT transformation described in the Olympics example:

_olymptohtml.xslt
Result: http://www.it.hiof.no/~borres/dw/pydom/py_results1.html
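For readers on Python 3, the same nested walk (games, events, athletes) might be sketched as follows. In Python 3 strings are already Unicode, so the encode calls disappear. The sample data, the root element name IOC, and the helper names text_of, first_text and results_to_html are assumptions made for this sketch; the call to html.escape is an addition that the original program does not have.

```python
import html
import xml.dom.minidom

# Hypothetical miniature of all_results.xml. The root element name "IOC"
# and all values are assumptions; only the element and attribute names
# OlympicGame/place, event/dist, athlet, name, result come from the program.
SAMPLE = """<IOC>
  <OlympicGame place="Barcelona">
    <event dist="100m">
      <athlet><name>Runner A</name><result>9.96</result></athlet>
    </event>
  </OlympicGame>
</IOC>"""


def text_of(element):
    """Concatenate the text nodes directly below element."""
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)


def first_text(parent, tag):
    """Text of the first descendant of parent with the given tag."""
    return text_of(parent.getElementsByTagName(tag)[0])


def results_to_html(xml_text):
    """Return an HTML fragment with one heading per game and event."""
    dom = xml.dom.minidom.parseString(xml_text)
    parts = []
    for game in dom.getElementsByTagName("OlympicGame"):
        parts.append("<h2>%s</h2>" % html.escape(game.getAttribute("place")))
        for event in game.getElementsByTagName("event"):
            parts.append("<h3>%s</h3>" % html.escape(event.getAttribute("dist")))
            for athlet in event.getElementsByTagName("athlet"):
                parts.append("<p>Name:%s<br/>Result:%s</p>" % (
                    html.escape(first_text(athlet, "name")),
                    html.escape(first_text(athlet, "result"))))
    dom.unlink()
    return "\n".join(parts)


if __name__ == "__main__":
    print(results_to_html(SAMPLE))
```

Escaping the text before it is placed in the HTML keeps characters such as < and & in the data from breaking the page.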

Exercise 2

We write a program that takes our Olympic data and tries to answer the question: "In which events has NN participated in these Olympics?" This means that we must move both down and up in the tree. First we locate all occurrences of the runner in question, then we move up in the tree to find the event and the Olympic game.

"""
 Simple demo of dom.
 find: report which events an athlete has participated in
 B. Stenseth  2009
 Use:
 find(runner,afile)
 See default parameters below
"""
import xml.dom.minidom
#-------------------------------------------------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None
# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc
def searchIOC(doc,theName):
    athletnamelist=doc.getElementsByTagName("name")
    for athletname in athletnamelist:
        txtname=getText(athletname.childNodes)
        if txtname==theName:
            event=athletname.parentNode.parentNode
            game=event.parentNode
            print game.getAttribute('place').encode('ISO-8859-1')
            print ' - '+event.getAttribute('dist').encode('ISO-8859-1')
# default parameters for demopurposes
def find(runner,afile):
    document=getTextFile(afile)
    if(document!=None):
        dom = xml.dom.minidom.parseString(document)
        searchIOC(dom,runner)
    else:
        print "something went wrong"
# basic testing 
if __name__=="__main__":
    find('Frank Fredericks','c:\\web\\dw\\pydom\\all_results.xml')
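The down-and-up navigation in the program above can be tried without the data file. The following Python 3 sketch uses a small, made-up document with the same element names; the root element IOC, the runner names, the result values and the function name events_for are assumptions for illustration.

```python
import xml.dom.minidom

# Hypothetical data with the same element names as the program above.
SAMPLE = """<IOC>
  <OlympicGame place="Barcelona">
    <event dist="100m">
      <athlet><name>Runner A</name><result>10.02</result></athlet>
      <athlet><name>Runner B</name><result>10.04</result></athlet>
    </event>
    <event dist="200m">
      <athlet><name>Runner A</name><result>20.13</result></athlet>
    </event>
  </OlympicGame>
  <OlympicGame place="Atlanta">
    <event dist="100m">
      <athlet><name>Runner B</name><result>9.99</result></athlet>
    </event>
  </OlympicGame>
</IOC>"""


def events_for(xml_text, wanted_name):
    """Return (place, dist) pairs for every event wanted_name took part in."""
    dom = xml.dom.minidom.parseString(xml_text)
    found = []
    # Going down: every <name> element in the whole document.
    for name_elt in dom.getElementsByTagName("name"):
        text = "".join(node.data for node in name_elt.childNodes
                       if node.nodeType == node.TEXT_NODE)
        if text.strip() == wanted_name:
            # Going up: name -> athlet -> event -> OlympicGame.
            event = name_elt.parentNode.parentNode
            game = event.parentNode
            found.append((game.getAttribute("place"),
                          event.getAttribute("dist")))
    dom.unlink()
    return found


if __name__ == "__main__":
    for place, dist in events_for(SAMPLE, "Runner A"):
        print(place, "-", dist)
```

Whitespace between elements becomes sibling text nodes, not parents, so the parentNode chain is unaffected by how the file is indented.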

Example: Book data

The data source is a text file with book descriptions, one book per line. The book data are described in the module Noen datasett. Book list as text: bokliste.txt. Empty lines and lines that begin with // are to be ignored.
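The line filtering described here can be sketched on its own, before any DOM is involved. The Python 3 sketch below assumes the field order given in a comment in the exercise 1 program (title, author, publisher, year, isbn, pages, course, category, comment); the function name parse_book_lines and the sample text are made up.

```python
# Field order as listed in the comment of the exercise 1 program.
FIELDS = ("title", "author", "publisher", "year", "isbn",
          "pages", "course", "category", "comment")


def parse_book_lines(text):
    """Return one dict per usable line; skip blanks, // comments and bad lines."""
    books = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        pieces = line.split(",")
        if len(pieces) != len(FIELDS):
            # Wrong number of fields: treat as a bad line and ignore it.
            continue
        books.append(dict(zip(FIELDS, pieces)))
    return books


if __name__ == "__main__":
    # Hypothetical input; the values are placeholders.
    sample = "// a comment\n\nT,A,P,2001,123,300,C,Cat,Note\nbad,line\n"
    print(parse_book_lines(sample))
```

A plain split on commas only works as long as no field contains a comma itself; the csv module in the standard library would be the more robust choice for quoted fields.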

We will do two exercises on these data:

  1. Build an XML file from the text file (the CSV file).
  2. Change the structure of the file we build in exercise 1.

Exercise 1

We write a Python program that takes a text file with book descriptions and produces an XML file.

import StringIO,xml.dom.minidom,codecs
"""
 Demo of MINIDOM.
 Building a DOM-tree based on a text-file, writing result as XML
 Building each node and inserting it into the tree
 Data is described on
 http://www.ia.hiof.no/~borres/ml/pydom/p-pydom.html
 Usage: doit(textfilename,xmlfilename)
 B. Stenseth  2009
"""
#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None
def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename    
#------------------------
# the job
def doit(infile,outfile):
    txt=getTextFile(infile)
    if(txt==None):
        return
    # prepare this string for unicode in a domtree
    txt=txt.decode('ISO-8859-1')
    lines=txt.split('\n')
    # set up basic document
    doc=xml.dom.minidom.Document()
    root_elt=doc.createElement('booklist')
    doc.appendChild(root_elt)
    # walk the linelist
    linecount=0
    for line in lines:
        line=line.strip()
        # skip the blanks and the comments
        if len(line) <3:
            continue
        if line[0:2]=="//":
            continue
        # we will use it
        # title,author,publisher,year,isbn,pages,course,category,comment
        pieces=line.split(',')
        if len(pieces)!=9:
            # bad line
            print "ignore: " + line
            continue
        
        # make book
        book_elt_node=doc.createElement('book')
        book_elt_node.setAttribute('isbn',pieces[4])
        book_elt_node.setAttribute('pages',pieces[5])
        root_elt.appendChild(book_elt_node)
        new_elt_node=doc.createElement('title')
        new_elt_node.appendChild(doc.createTextNode(pieces[0]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('course')
        new_elt_node.appendChild(doc.createTextNode(pieces[6]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('category')
        new_elt_node.appendChild(doc.createTextNode(pieces[7]))
        book_elt_node.appendChild(new_elt_node)
    
        new_elt_node=doc.createElement('author')
        new_elt_node.appendChild(doc.createTextNode(pieces[1]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('publisher')
        new_elt_node.appendChild(doc.createTextNode(pieces[2]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('year')
        new_elt_node.appendChild(doc.createTextNode(pieces[3]))
        book_elt_node.appendChild(new_elt_node)
        new_elt_node=doc.createElement('comment')
        new_elt_node.appendChild(doc.createTextNode(pieces[8]))
        book_elt_node.appendChild(new_elt_node)

    # raw print while testing
    # print doc.toxml().encode('ISO-8859-1')
    # get it on file
    # need the domtree, doc, as a ISO-8859-1 encoded string
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    # some dirty formatting, take care
    s=s.getvalue().replace('>','>\n')
    s=s.replace('<book','\n\n<book')
    # fix prolog
    prolog="""<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE booklist SYSTEM "bokdok.dtd">"""
    s=s.replace('<?xml version="1.0" ?>',prolog)
    # while testing
    #print s
    
    storeTextFile(outfile,s)
    doc.unlink()
# basic testing 
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\bokliste.txt',
         'c:\\web\\dw\\pydom\\bokliste2.xml')

This is done by building a DOM tree and inserting nodes generated from the text. This Python code does in principle the same as the code described in the module HTML og XML, which presents a program that does the same job as pure text processing, without using DOM.

Result: http://www.it.hiof.no/~borres/dw/pydom/bokliste2.xml
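In Python 3 the serialization step can be shorter, because toprettyxml accepts an encoding argument and then writes the encoding into the XML declaration itself. The sketch below builds the same kind of tree from a dictionary; the sample book and the function names build_booklist and to_xml_bytes are assumptions, and the DOCTYPE line that the program above adds is left out.

```python
import xml.dom.minidom

# Hypothetical book record; all values are placeholders.
SAMPLE_BOOK = {
    "title": "Bølger", "author": "N. N.", "publisher": "Forlag",
    "year": "2000", "isbn": "0-00-000000-0", "pages": "100",
    "course": "kurs", "category": "fag", "comment": "ingen",
}


def build_booklist(books):
    """Build a <booklist> document with one <book> element per dict."""
    doc = xml.dom.minidom.Document()
    root = doc.createElement("booklist")
    doc.appendChild(root)
    for data in books:
        book = doc.createElement("book")
        book.setAttribute("isbn", data["isbn"])
        book.setAttribute("pages", data["pages"])
        root.appendChild(book)
        # Same child order as in the program above.
        for tag in ("title", "course", "category", "author",
                    "publisher", "year", "comment"):
            child = doc.createElement(tag)
            child.appendChild(doc.createTextNode(data[tag]))
            book.appendChild(child)
    return doc


def to_xml_bytes(doc):
    """Serialize with indentation, encoded as ISO-8859-1 bytes."""
    return doc.toprettyxml(indent="  ", encoding="ISO-8859-1")


if __name__ == "__main__":
    document = build_booklist([SAMPLE_BOOK])
    print(to_xml_bytes(document).decode("ISO-8859-1"))
    document.unlink()
```

Because the result is a bytes object, it should be written to a file opened in binary mode ('wb').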

Exercise 2

We write a program that takes an XML file as built in exercise 1 and changes its structure: one element (title) is turned into an attribute, and one attribute (pages) is turned into an element.

import StringIO,codecs,xml.dom.minidom
"""
 Demo of MINIDOM.
 Changing the structure of an XML-file
 Data is described on
 http://www.ia.hiof.no/~borres/ml/python/p-python.html
 change it to make:
 all titles an attribute instead of an element
 all pages an element instead of an attribute
 B. Stenseth  2002
 Use: doit(infile,outfile)
"""

#-----------------------
# file io
def getTextFile(filename):
    try:
        file=open(filename,'r')
        intext=file.read()
        file.close()
        return intext
    except:
        print 'Error reading file ',filename
        return None
def storeTextFile(filename,txt):
    try:
        outfile=open(filename,'w')
        outfile.write(txt)
        outfile.close()
    except:
        print 'Error writing file ',filename    
# collect all text in a node
def getText(nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc
def getStrippedText(nodelist):
    # collect all text in a node, without surrounding whitespace
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.strip()
            rc += t.encode('ISO-8859-1')
    return rc
    
def doit(infile,outfile):
    txt=getTextFile(infile)
    if(txt==None):
        return
    # prepare this string for unicode in a domtree
    # txt=txt.decode('ISO-8859-1')
    doc = xml.dom.minidom.parseString(txt)
    books=doc.getElementsByTagName('book')
    for book in books:
        # pick up the title-element
        title_elt=book.getElementsByTagName('title')[0]
        title_str=getStrippedText(title_elt.childNodes)
        # make the title an attribute
        book.setAttribute('title',title_str.decode('ISO-8859-1'))
        # remove the title element
        book.removeChild(title_elt)
        # pick up the pages-attribute
        page_str=book.getAttribute('pages')
        # make the element
        page_elt=doc.createElement('pages')
        # make the text child node
        page_elt.appendChild(doc.createTextNode(page_str))
        book.appendChild(page_elt)
        # remove pages-attribute
        book.removeAttribute('pages')
    # get it on file
    # need the domtree, doc, as a ISO-8859-1 encoded string
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    s=s.getvalue()
    # fix prolog
    prolog='<?xml version="1.0" encoding="ISO-8859-1" ?>'
    s=s.replace('<?xml version="1.0" ?>',prolog)
    s=s.replace('bokdok.dtd','bokdok2.dtd')
    # while testing
    # print s
    
    storeTextFile(outfile,s)
    doc.unlink()
# basic testing 
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom\\bokliste2.xml',
         'c:\\web\\dw\\pydom\\bokliste3.xml')

Result: http://www.it.hiof.no/~borres/dw/pydom/bokliste3.xml
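The restructuring itself is only a handful of DOM calls per book, and all of them must run inside the loop over the book elements. A Python 3 sketch on made-up data follows; the sample document and the function name swap_title_and_pages are assumptions.

```python
import xml.dom.minidom

# Hypothetical input with the structure produced in exercise 1
# (only the parts that matter here).
SAMPLE = ('<booklist>'
          '<book isbn="0-00-000000-0" pages="100"><title> First title </title></book>'
          '<book isbn="0-00-000000-1" pages="200"><title>Second title</title></book>'
          '</booklist>')


def swap_title_and_pages(xml_text):
    """Turn <title> into an attribute and the pages attribute into an element."""
    doc = xml.dom.minidom.parseString(xml_text)
    for book in doc.getElementsByTagName("book"):
        # Element -> attribute: read the text, set it, drop the element.
        title_elt = book.getElementsByTagName("title")[0]
        title = "".join(node.data for node in title_elt.childNodes
                        if node.nodeType == node.TEXT_NODE).strip()
        book.setAttribute("title", title)
        book.removeChild(title_elt)
        # Attribute -> element: create the element, then drop the attribute.
        pages_elt = doc.createElement("pages")
        pages_elt.appendChild(doc.createTextNode(book.getAttribute("pages")))
        book.appendChild(pages_elt)
        book.removeAttribute("pages")
    result = doc.documentElement.toxml()
    doc.unlink()
    return result


if __name__ == "__main__":
    print(swap_title_and_pages(SAMPLE))
```

Note that removeChild only works on direct children, which holds here because title is placed directly below book.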

Example: Skating data

The topic is speed skating, with separate text files giving results from 500 m, 1500 m, 5000 m and 10000 m. These text files are very simple and contain one name and one result per line. The files are named s500.txt, s1500.txt, s5000.txt and s10000.txt. We write a program that does the following:

  1. Reads the four files and builds a DOM tree for each of them
  2. Merges the four trees into one
  3. Calculates the total points for each skater
  4. Sorts all skater nodes by calculated result
  5. Produces an HTML file where the skaters are listed sorted by result

This is hardly an optimal way to solve the problem, but it can serve as a DOM exercise. The 5 steps are marked in the Python code.
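Step 3 is plain arithmetic and can be checked separately from the DOM code. The program converts each time to hundredths of a second, divides the 1500 m, 5000 m and 10000 m times by 3, 10 and 20 respectively so that every distance counts as a 500 m equivalent, and adds them up. A Python 3 sketch of that calculation follows; the function names hundredths and total_points are made up, and unlike the program it raises an error on a malformed time instead of substituting a large value.

```python
def hundredths(time_text):
    """Convert 'ss.hh' or 'mm.ss.hh' to hundredths of a second."""
    parts = [int(piece) for piece in time_text.split(".")]
    if len(parts) == 3:
        minutes, seconds, hundredth = parts
        return 6000 * minutes + 100 * seconds + hundredth
    if len(parts) == 2:
        seconds, hundredth = parts
        return 100 * seconds + hundredth
    raise ValueError("error in time format: " + time_text)


def total_points(res500, res1500, res5000, res10000):
    """Sum of the 500 m equivalents of the four times, in seconds."""
    total = (hundredths(res500)
             + hundredths(res1500) / 3
             + hundredths(res5000) / 10
             + hundredths(res10000) / 20)
    return total / 100


if __name__ == "__main__":
    # The times from the structure example in the program's docstring.
    print("%.3f" % total_points("40.00", "1.50.00", "6.40.00", "13.40.00"))
```

For these four times the terms are 4000, 11000/3, 4000 and 4100 hundredths, so the formatted result is 157.667.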

import StringIO,xml.dom.minidom,codecs
"""
 Demo of MINIDOM.
 NOTE that this may not be the smartest or fastest way to
 solve this problem. It is written to demonstrate minidom
 function makeCompleteXML(catalog)
 Building an XML-file based on four text-files:
 Results from 500, 1500, 5000, 10000 m speedskating
 each with lines of the form(not sorted):
 name,result
 Filenames are s500.txt, s1500.txt, s5000.txt, s10000.txt
 Returns a tree with following structure:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<skatingevent>
<skater name="olsen">
   <res500>40.00</res500>
   <res1500>1.50.00</res1500>
   <res5000>6.40.00</res5000>
   <res10000>13.40.00</res10000>
   <points>87559</points>
</skater>
 ...
</skatingevent>
 Function doit(catalog)
 calls storeXMLFile and produces a sorted html-file: 
  skaters.html
 Job is done in 5 commented steps:
   Read the 4 txtfiles and establish a DOM-tree for each 
   Joins the 4 trees to one tree 
   Calculates aggregated points for each skater 
   Sort skaters on points 
   Make an HTML-file of sorted skaters 
 Usage: doit(catalog)
 B. Stenseth  2009
"""
def getText(nodelist):
    # collect all text in a node
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            t=node.data.encode('ISO-8859-1')
            rc += t
    return rc

def makeTree(catalog,distanse):
    # read a file from the catalog and establish tree
    try:
        filename=catalog+'\\s'+distanse+'.txt'
        # sample: c:\myskatingfiles\s500.txt
        file=open(filename,'r')
        intxt=file.read()
        file.close()
        intxt=intxt.decode('ISO-8859-1')
        doc=xml.dom.minidom.Document()
        root_elt=doc.createElement('skatingevent')
        doc.appendChild(root_elt)
        lines=intxt.split('\n')
        for line in lines:
            pieces=line.split(',')
            if len(pieces)==2:
                skater_elt=doc.createElement('skater')
                skater_elt.setAttribute('name',pieces[0])
                result_elt=doc.createElement('res'+distanse)
                result_elt.appendChild(doc.createTextNode(pieces[1]))
                skater_elt.appendChild(result_elt)
                root_elt.appendChild(skater_elt)
        return doc
    except:
        print 'Error building: '+distanse
        return ''
        
def storeXMLFile(filename,doc):
    # storing an xmlfile from a tree  
    s=StringIO.StringIO()
    doc.writexml(codecs.getwriter('ISO-8859-1')(s))
    t=s.getvalue()
    # some dirty formatting, take care
    t=t.replace('<skater','\n<skater')
    t=t.replace('<res','\n<res')
    # fix prolog
    prolog='<?xml version="1.0" encoding="ISO-8859-1" ?>'
    t=t.replace('<?xml version="1.0" ?>',prolog)
    # print while storing if you want to test
    # print t
    try:
        outf=open(filename,'w')
        outf.write(t)
        outf.close()
    except:
        print 'Error in writing tree at:'+ filename
    
def makeCompleteXML(catalog='c:\\articles\\ml\\dom'):
    # produce the complete tree with results from all distances
    # strategy is to make a tree for each distance and then join them
    # make a tree for each distance
    
    #--------------------------------------
    # STEP 1 make 4 DOM-trees
    t500=makeTree(catalog,'500')
    t1500=makeTree(catalog,'1500')
    t5000=makeTree(catalog,'5000')
    t10000=makeTree(catalog,'10000')
    #--------------------------------------
    # store them and print them while testing
    #storeXMLFile(catalog+'\\xml500.xml',t500)
    #storeXMLFile(catalog+'\\xml1500.xml',t1500)
    #storeXMLFile(catalog+'\\xml5000.xml',t5000)
    #storeXMLFile(catalog+'\\xml10000.xml',t10000)
    #--------------------------------------
    # STEP 2 join trees to one tree
    # use t500 as master and assemble results from the three others
    d500=t500.getElementsByTagName('skater')
    d1500=t1500.getElementsByTagName('skater')
    d5000=t5000.getElementsByTagName('skater')
    d10000=t10000.getElementsByTagName('skater')
    for p500 in d500:
        name500=p500.getAttribute('name')
        for p1500 in d1500:
            if name500==p1500.getAttribute('name'):
                p500.appendChild(p1500.getElementsByTagName('res1500')[0] )
                break
        for p5000 in d5000:
            if name500==p5000.getAttribute('name'):
                p500.appendChild(p5000.getElementsByTagName('res5000')[0] )
                break
        for p10000 in d10000:
            if name500==p10000.getAttribute('name'):
                p500.appendChild(p10000.getElementsByTagName('res10000')[0] )
                break
    #--------------------------------------
            
    # now we have all results in t500
    # write it if you want
    # storeXMLFile(catalog+'\\sall.xml',t500)
    
    #--------------------------------------
    # STEP 3 calculated aggregated points for each skater
    # we want to calculate points for each skater and add
    # an element points to each skater
    skaters=t500.getElementsByTagName('skater')
    for skater in skaters:
        # calculate timepoints in 1/100 seconds
        s=getText(skater.getElementsByTagName('res500')[0].childNodes)
        hsecs=makeSeconds(s)
        s=getText(skater.getElementsByTagName('res1500')[0].childNodes)
        hsecs+=makeSeconds(s)/3.0
        s=getText(skater.getElementsByTagName('res5000')[0].childNodes)
        hsecs+=makeSeconds(s)/10.0
        s=getText(skater.getElementsByTagName('res10000')[0].childNodes)
        hsecs+=makeSeconds(s)/20.0
        points='%.3f' %(hsecs/100.0)
                
        point_elt=t500.createElement('points')
        skater.appendChild(point_elt)
        point_elt.appendChild(t500.createTextNode(points))
    #--------------------------------------
    # and you may save it again while testing
    # storeXMLFile(catalog+'\\sallpoints.xml',t500)
    
    # clean up
    t1500.unlink()
    t5000.unlink()
    t10000.unlink()
    return t500
def makeSeconds(s):
    # calculate 1/100 seconds from s
    # s in form mm.ss.hh ( minutes, seconds, 1/100 seconds)
    # print s
    parts=s.split('.')
    hsecs=0
    if len(parts)==3:
        hsecs=6000*int(parts[0])+100*int(parts[1])+int(parts[2])
    elif len(parts)==2:
        hsecs=100*int(parts[0])+int(parts[1])
    else:
        print 'error in timeformat: ' + s
        hsecs=9999999
    return hsecs

def compareSkaters(s1,s2):
    # used while sorting
    s1pnt=s1.getElementsByTagName('points')[0]
    s2pnt=s2.getElementsByTagName('points')[0]
    # compare as floats so skaters less than one point apart are ordered correctly
    v1=float(getText(s1pnt.childNodes))
    v2=float(getText(s2pnt.childNodes))
    return cmp(v1,v2)
    
def doit(catalog):
    # make the complete job from 4 text-files to html-file   
    # first we build the complete xml-tree
    # including points and all results and calculated points
    #--------------------------------------
    # STEPS 1,2,3 as commented on top of script
    doc=makeCompleteXML(catalog)
    #--------------------------------------
    
    # and we may save it just to test
    storeXMLFile(catalog+'\\sallpoints.xml',doc)
    #--------------------------------------
    # STEP 4 sort skaters according to calculated points   
    # we want to sort on points
    skaters=doc.getElementsByTagName('skater')
    skaters.sort(compareSkaters)
    #--------------------------------------
    #--------------------------------------
    # STEP 5 produce a HTML-page
    # now we want to produce some html-output with results
    T="""<html>
    <head> <title>resultater</title> </head>
    <body>
    <h1 style="font-size:14px">Resultater</h1>
    """
    # run through sorted list of skaters
    T+='<table cellpadding="2">\n'
    for skater in skaters:
        T+='<tr><td style="font-size:12px">'
        T+=skater.getAttribute('name').encode('ISO-8859-1')
        T+='</td><td style="font-size:12px">'
        T+=getText(skater.getElementsByTagName('points')[0].childNodes)
        T+='</td></tr>\n'
        
    T+='</table>\n</body>\n</html>\n'
    filename=catalog+'\\skaters.html'
    outf=open(filename,'w')
    outf.write(T)
    outf.close()
    #--------------------------------------
    doc.unlink()
# basic testing 
if __name__=="__main__":
    doit('c:\\web\\dw\\pydom')

Result: http://www.it.hiof.no/~borres/dw/pydom/skaters.html
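Step 4 relies on a comparison function passed to sort, which Python 3 no longer accepts directly. A Python 3 sketch of the same step with a key function is shown below; the sample data and the names points_of and ranked_names are assumptions for illustration.

```python
import xml.dom.minidom

# Hypothetical tree with the structure built by the program above.
SAMPLE = """<skatingevent>
<skater name="A"><points>160.250</points></skater>
<skater name="B"><points>157.667</points></skater>
<skater name="C"><points>160.100</points></skater>
</skatingevent>"""


def points_of(skater):
    """Read the <points> child of a skater element as a float."""
    elt = skater.getElementsByTagName("points")[0]
    return float("".join(node.data for node in elt.childNodes
                         if node.nodeType == node.TEXT_NODE))


def ranked_names(xml_text):
    """Return skater names ordered by points, lowest (best) first."""
    doc = xml.dom.minidom.parseString(xml_text)
    # sorted() builds a new list and leaves the tree itself unchanged.
    skaters = sorted(doc.getElementsByTagName("skater"), key=points_of)
    names = [skater.getAttribute("name") for skater in skaters]
    doc.unlink()
    return names


if __name__ == "__main__":
    print(ranked_names(SAMPLE))
```

Sorting on the float value matters when two skaters are less than one point apart, as A and C are in this sample.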

References
  1. lxml - XML and HTML with Python lxml.de/ 03-08-2011
  2. Python and XML sourceforge.net pyxml.sourceforge.net/ 14-03-2010
  3. 4Suite XML in Python 4Suite.org 4suite.org/index.xhtml 14-03-2010

Python code and raw data are quoted in the text, except the skating results: skoytefiler.zip

Maintenance
Børre Stenseth, Nov 2002. Revised Python code Feb 2003, extended March 2003