Sep 4, 2009

XML and Python

The simplest XML rules:

<?xml version="1.0" encoding='UTF-8'?>
<painting>
  <img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
  <caption>This is Raphael's "Foligno" Madonna, painted in
  <date>1511</date>-<date>1512</date>.</caption>
</painting>

Tag

A markup construct that begins with "<" and ends with ">". Tags come in three flavors: start-tags, for example <section>, end-tags, for example </section>, and empty-element tags, for example <line-break/>.

Element

A logical component of a document which either begins with a start-tag and ends with an end-tag, or consists only of an empty-element tag. The characters between the start- and end-tags, if any, are the element's content, and may contain markup, including other elements, which are called child elements. An example of an element is <Greeting>Hello, world.</Greeting>. Another is <line-break/>.

Attribute

A markup construct consisting of a name/value pair that exists within a start-tag or empty-element tag. In this example, the name of the attribute is "number" and the value is "3": <step number="3">Connect A to B.</step> This element has two attributes, src and alt: <img src="madonna.jpg" alt='by Raphael'/> An element must not have two attributes with the same name.
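To make the three terms concrete, here is a minimal sketch that parses the painting example with xml.dom.minidom from the standard library (the module used later in this post). The variable names are my own, chosen only for illustration.

```python
from xml.dom import minidom

PAINTING_XML = """<?xml version="1.0" encoding="UTF-8"?>
<painting>
  <img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
  <caption>This is Raphael's "Foligno" Madonna, painted in
  <date>1511</date>-<date>1512</date>.</caption>
</painting>"""

doc = minidom.parseString(PAINTING_XML)

# The root element is <painting>.
root_name = doc.documentElement.tagName

# <img .../> is an empty-element tag that carries two attributes.
img = doc.getElementsByTagName('img')[0]
img_src = img.getAttribute('src')
img_alt = img.getAttribute('alt')

# The <date> elements are child elements of <caption>.
dates = [d.firstChild.nodeValue for d in doc.getElementsByTagName('date')]

print(root_name, img_src, img_alt, dates)
```

Note that single and double quotes around attribute values are interchangeable, so both src and alt come back as plain strings.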

XML in Python
There are two models: event-based SAX and object-based DOM.
See Python in a Nutshell for reference.
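A rough sketch of the difference between the two models, assuming a small made-up document: with SAX you write a handler that reacts to events as the parser streams through the input, while with DOM you get the whole tree back and query it. DateCollector is a hypothetical class name; xml.sax and xml.dom.minidom are both in the standard library.

```python
import xml.sax
from xml.dom import minidom

XML = b"<caption>painted in <date>1511</date>-<date>1512</date>.</caption>"


class DateCollector(xml.sax.ContentHandler):
    """Hypothetical SAX handler: collects the text of every <date> element."""

    def __init__(self):
        super().__init__()
        self.in_date = False
        self.dates = []

    def startElement(self, name, attrs):
        if name == 'date':
            self.in_date = True
            self.dates.append('')

    def characters(self, content):
        # characters() may be called several times for one text run.
        if self.in_date:
            self.dates[-1] += content

    def endElement(self, name):
        if name == 'date':
            self.in_date = False


# SAX: events are pushed to the handler while parsing.
handler = DateCollector()
xml.sax.parseString(XML, handler)
sax_dates = handler.dates

# DOM: the whole document is loaded first, then queried.
dom = minidom.parseString(XML)
dom_dates = [d.firstChild.nodeValue for d in dom.getElementsByTagName('date')]

print(sax_dates, dom_dates)
```

SAX keeps memory use low on large files, at the cost of tracking state yourself. DOM is simpler to work with when the document fits in memory.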
DOM:
The xml.dom.minidom Module
The rest of this post covers the DOM module.
The most important piece is the Node class.
Documents, elements, attributes, text content, and so on are all nodes.
<excerpt>
  <!-- Framespan 1:5030 -->
  <filename>
    MCTTR0902h.mov.deint.mpeg
  </filename>
  <begin>0.0</begin>
  <duration>
    201.20
  </duration>
  <sample_rate>
    25
  </sample_rate>
  <language>
    english
  </language>
  <source_type>
    surveillance
  </source_type>
</excerpt>
Note:
Node.nodeType
An integer representing the node type. Symbolic constants for the types are on the Node object: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, NOTATION_NODE
Text content is a node too, and its nodeValue is the string we want. Also, for elements like the ones above, the line breaks count as part of the text content.
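A small sketch of what this looks like in practice, using a shortened copy of the excerpt above. It walks the direct children of <excerpt> and labels each by nodeType. The labels in kinds are my own, not part of the API.

```python
from xml.dom import minidom, Node

EXCERPT_XML = """<excerpt>
  <!-- Framespan 1:5030 -->
  <filename>
    MCTTR0902h.mov.deint.mpeg
  </filename>
  <begin>0.0</begin>
</excerpt>"""

excerpt = minidom.parseString(EXCERPT_XML).documentElement

# Classify every direct child of <excerpt> by its nodeType.
kinds = []
for child in excerpt.childNodes:
    if child.nodeType == Node.ELEMENT_NODE:
        kinds.append('element:' + child.tagName)
    elif child.nodeType == Node.COMMENT_NODE:
        kinds.append('comment')
    elif child.nodeType == Node.TEXT_NODE:
        kinds.append('text')

# The text node inside <filename> still holds the line breaks and indentation.
filename = excerpt.getElementsByTagName('filename')[0]
raw = filename.firstChild.nodeValue
clean = raw.strip()

print(kinds)
print(repr(raw), repr(clean))
```

The whitespace between the child elements shows up as text nodes of its own, which is why checking nodeType (and calling strip()) matters.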
Code

from xml.dom import minidom

# Frame rate used to turn a duration in seconds into a frame count.
# 25 matches the <sample_rate> in the excerpt above.
FRAMES_PER_SECOND = 25

xmldoc = minidom.parse('E:/eclipse/xml/expt_2009_retroED_EVAL09_ENG_s-camera_NIST_2.xml')

reflist = xmldoc.getElementsByTagName('excerpt')

# One output line per <excerpt>: filename, begin, duration, frame count,
# separated by tabs.
with open('video.txt', 'w') as of:
    for ref in reflist:
        # firstChild is the text node; strip() drops the surrounding line breaks.
        for node in ref.getElementsByTagName('filename'):
            of.write(node.firstChild.nodeValue.strip())
        of.write('\t')

        for node in ref.getElementsByTagName('begin'):
            of.write(node.firstChild.nodeValue.strip())
        of.write('\t')

        for node in ref.getElementsByTagName('duration'):
            duration = node.firstChild.nodeValue.strip()
            of.write(duration)
            of.write('\t')
            framespan = int(float(duration) * FRAMES_PER_SECOND)
            of.write(str(framespan))
        of.write('\n')
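Lessons learned:
  • Know the XML you need to process well.
  • Use getElementsByTagName; if you are sure there is only one matching child, you can index with [0].
  • If not, process the result as a list.
  • Use nodeType to tell the kinds of node apart.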
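The points above can be rolled into one small helper. This is only a sketch: child_text is a hypothetical function of my own, not part of minidom, and it assumes you want the first matching element and a default when there is none.

```python
from xml.dom import minidom, Node


def child_text(parent, tag, default=''):
    """Hypothetical helper: stripped text of the first <tag> under parent."""
    matches = parent.getElementsByTagName(tag)
    if not matches:
        # No such element: fall back instead of failing on [0].
        return default
    # Keep only text-like children; skip comments and nested elements.
    parts = [
        node.nodeValue
        for node in matches[0].childNodes
        if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    ]
    return ''.join(parts).strip()


SAMPLE = """<excerpt>
  <filename>
    MCTTR0902h.mov.deint.mpeg
  </filename>
  <duration>
    201.20
  </duration>
</excerpt>"""

excerpt = minidom.parseString(SAMPLE).documentElement
filename = child_text(excerpt, 'filename')
duration = child_text(excerpt, 'duration')
language = child_text(excerpt, 'language', default='unknown')

print(filename, duration, language)
```

Joining all text children instead of reading only firstChild also covers the case where the parser splits the text into more than one node.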
