Help on package rt:

NAME
    rt

FILE
    d:\ivan\workspace\infoget\src\rt\__init__.py

DESCRIPTION
    rt (Reverse Template) is a module that enables extracting data from an
    HTML page. The principle of getting data from an existing page is
    similar to the way (dynamic) pages are created in most current
    environments: there is a template page, which contains HTML markup
    augmented with some special symbols/markup, and it is filled from a
    data model, with the special symbols replaced by actual data. The
    result is an HTML page that can be shown in a browser.

    Reverse Templates are used in the opposite case: we already have an
    HTML page and we want to extract the data that was inserted into it.
    A Reverse Template also uses basic HTML markup and special symbols (in
    this case additional tag attributes with specific prefixes). Here the
    special symbols are used to capture data from places identified by the
    surrounding markup in the template. In other words, a Reverse Template
    is matched against an HTML page by following similar markup, and
    wherever the template has an additional special symbol (attribute),
    data is extracted and made available.

    Before going into more detail, here is a small sample of how Reverse
    Templates work:

    page="""

Sample

Home ' """

    template="""

Sample

Home ' """

    ready_template=rt.read_template(template)
    databag=rt.read_page(page, ready_template)
    print databag

    And the result would be:

    {'header': 'Sample', 'link': 'http://localhost'}

    The rt module has two main functions:

    read_template - reads a Reverse Template from a string or stream
    read_page     - reads an HTML page, matches it against the given
                    template and extracts data into a databag object

    Reverse Templates are created from pages by annotating them with
    special attributes that carry one of two distinguishing prefixes:

    rt  - marks attributes that initiate a data gathering action
    rtf - marks attributes used to create a filter that modifies captured
          data

    Both data gathering actions and filters are easily extensible; others
    can be defined in the modules Actions.py and Filters.py respectively.

    Once a template is created, it can be loaded by the read_template
    function, which parses it into an in-memory structure: a document tree
    with actions and filters attached to particular nodes/tags. This
    structure is later used by the read_page function, which tries to match
    the page to the template tag by tag and, if any actions are attached to
    a tag in the template, executes them with data from the page.

    Tags are matched by name and by the values of these three attributes
    (if they are present in the template): id, name, class. Attribute
    values in the template can contain the wildcard character *, which
    matches any substring when compared with the attribute value on the
    page.

    Special action attributes:

    rt:conditional (no value) - marks the containing tag as conditional; it
        can be skipped if not present in the page
    rt:match_attribs (no value) - for the tag containing this attribute to
        match, all attributes AFTER this attribute, and their values, must
        match on the page. Again the wildcard * can be used.

    Regular action attributes:

    rt:dummy (no value) - just prints trace messages to stdout
    rt:loop (value is name/key in databag) - indicates a loop; the
        application will repeatedly try to parse this element and all its
        children.
        In the databag, a list is created under the name given as the
        attribute value. Each item of this list is a dictionary containing
        the values from one repetition.
    rt:loop_start (value is name/key in databag) - indicates a loop that
        spans several sibling elements until rt:loop_end is met. Behaviour
        is otherwise similar to rt:loop.
    rt:loop_end (no value) - indicates the end of a loop that spans several
        sibling elements
    rt:read_text (value is name/key in databag) - reads all text contained
        in the element and saves it in the databag under the key given as
        the value of this attribute
    rt:read_text_end (name of action) - if any parent element of this
        element has started a read_text action with the name given as the
        value of this attribute, this action ends the reading at this
        point, so the rest of the text within the parent elements is not
        included in the read_text value
    rt:read_text_start (value is name/key in databag) - starts a read_text
        action on the parent element. Can be used to skip some text/markup
        at the beginning of the element content.
    rt:read_attrib (value is name/key in databag) - reads the value of the
        attribute immediately following this attribute and saves it in the
        databag under the key given as the value of this attribute

    Values of actions can be modified by so-called filters, which are
    expressed as attributes starting with the rtf: prefix that may follow
    an action attribute.
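
    Because filters are tied to the action attribute they follow, the
    order of attributes on a template tag carries meaning. The sketch
    below shows one way this pairing could be pictured. It is not the rt
    parser: the function name group_actions, the (name, value) pair
    representation and the choice to skip plain HTML attributes are all
    assumptions made for illustration.

```python
def group_actions(attrs):
    """Pair each rt: action with the rtf: filters that follow it.

    attrs is an ordered list of (name, value) pairs as written on a
    template tag. Hypothetical helper, not part of the rt package.
    """
    actions = []
    current = None  # the action that following rtf: attributes attach to
    for name, value in attrs:
        if name.startswith("rt:"):
            current = {"action": name[3:], "value": value, "filters": []}
            actions.append(current)
        elif name.startswith("rtf:") and current is not None:
            current["filters"].append((name[4:], value))
        # Assumption: plain HTML attributes (id, class, href, ...) and
        # rtf: attributes that precede any action are simply skipped.
    return actions


# Attributes of an imagined template tag, in source order.
attrs = [
    ("class", "title*"),
    ("rt:read_text", "header"),
    ("rtf:strip", None),
    ("rtf:lower", None),
]
grouped = group_actions(attrs)
```

    Here grouped holds a single read_text action with the value 'header'
    and the filters strip and lower, in that order.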
    The following filter attributes are available:

    rtf:lower (no value) - converts the string to lower case
    rtf:upper (no value) - converts the string to upper case
    rtf:strip (no value) - strips whitespace from the beginning and end of
        the captured string
    rtf:regex (value is regular expression) - searches the action value
        for the regular expression given as the value of this attribute
        and returns the matching text
    rtf:group (value is number or name of a group within the regex) - if it
        follows an rtf:regex filter, returns the value of the indicated
        group rather than the full match

    As noted previously, the read_page function returns a databag object,
    which is basically a Python dictionary that can contain lists or other
    dictionaries. The databag contains values captured from the HTML page,
    available under the keys given in the action definitions.

    A few samples of how to access values in the databag:

    If rt:read_text="name1" is defined outside of a loop, the captured
    value can be accessed in the databag as db['name1'].

    If rt:read_text="name1" is defined inside the loop rt:loop="myloop",
    then:
    * the value of the text in the first repetition of the loop is
      db['myloop'][0]['name1']
    * the number of repetitions of the loop is len(db['myloop'])
    * iteration over all repetitions of the loop:
          for item in db['myloop']:
              print item['name1']

PACKAGE CONTENTS
    Actions
    DataRepr
    Entity
    Filters
    Parsers
    test (package)

FUNCTIONS
    read_page(source, template, encoding=None)

    read_template(source, encoding=None)

    set_trace(trace=True)

DATA
    __all__ = ['read_template', 'read_page', 'set_trace']
    __version__ = '0.1.1'

VERSION
    0.1.1
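
The filters listed in the description (lower, upper, strip, regex, group) can be approximated with the standard re module. The following is a minimal sketch of the described behaviour, not the code in Filters.py: the function apply_filters, the (name, argument) list format and the decision to return None when the regular expression does not match are assumptions.

```python
import re


def apply_filters(value, filters):
    """Apply (name, argument) filters to a captured string, in order.

    Hypothetical stand-in for the rtf: filters; not the rt implementation.
    """
    match = None  # last regex match, consulted by a following "group"
    for name, arg in filters:
        if name == "lower":
            value = value.lower()
        elif name == "upper":
            value = value.upper()
        elif name == "strip":
            value = value.strip()
        elif name == "regex":
            match = re.search(arg, value)
            if match is None:
                return None  # assumption: a failed search yields None
            value = match.group(0)  # full match
        elif name == "group":
            if match is None:
                raise ValueError("group must follow a regex filter")
            # Digits select a group by number, anything else by name.
            key = int(arg) if str(arg).isdigit() else arg
            value = match.group(key)
        else:
            raise ValueError("unknown filter: %s" % name)
    return value


pattern = r"(\d+) (?P<cur>\w+)"
raw = "  Price: 42 EUR "

full = apply_filters(raw, [("strip", None), ("regex", pattern)])
amount = apply_filters(raw, [("regex", pattern), ("group", "1")])
currency = apply_filters(raw, [("regex", pattern), ("group", "cur"), ("lower", None)])
```

With this input, full is '42 EUR', amount is '42' and currency is 'eur'.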