Help on package rt:
NAME
rt
FILE
d:\ivan\workspace\infoget\src\rt\__init__.py
DESCRIPTION
rt (Reverse Template) is a module for extracting data from HTML pages.
The principle of getting data from an existing page mirrors the way
(dynamic) pages are created in most current environments. There, a template
page contains HTML markup augmented with special symbols/markup and is filled
from a data model, with the special symbols replaced by actual data. The
result is an HTML page that can be shown in a browser.
Reverse Templates are used in the opposite case: we already have an HTML page
and we want to extract the data that was inserted into it.
A Reverse Template also uses basic HTML markup and special symbols (in this
case additional tag attributes with specific prefixes). Here the special
symbols capture data from places identified by the surrounding markup in the
template. In other words, a Reverse Template is matched against an HTML page
by following similar markup, and wherever the template has an additional
special symbol (attribute), data are extracted and made available.
Before going into more detail, here is a small sample of how Reverse Templates work:
page="""
Sample
Home
'
"""
template="""
Sample
Home
'
"""
import rt
ready_template = rt.read_template(template)
databag = rt.read_page(page, ready_template)
print(databag)
And the result would be:
{'header': 'Sample', 'link': 'http://localhost'}
The rt module has two main functions:
read_template - reads a Reverse Template from a string or stream
read_page - reads an HTML page, matches it against the given template
    and extracts data into a databag object.
Reverse Templates are created from pages by adding special attributes
with two distinct prefixes:
rt - marks attributes that initiate a data gathering action
rtf - marks attributes that define a filter modifying the captured data
Both data gathering actions and filters are easily extensible; new ones can
be defined in the modules Actions.py and Filters.py respectively.
Once a template is created, it can be loaded with the read_template function,
which parses it into an in-memory structure: a document tree with actions and
filters attached to particular nodes/tags. This structure is later used by
the read_page function, which tries to match the page to the template tag by
tag and, if any actions are attached to a tag in the template, executes them
with data from the page. Tags are matched by name and by the values of these
three attributes (if they are present in the template): id, name, class.
Attribute values in the template can contain the wildcard character *, which
matches any substring when compared to the attribute value on the page.
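The matching rule described above can be sketched in a few lines of Python.
This is only an illustration of the rule, not the code used inside rt: the
names wildcard_match and tag_matches are hypothetical, a tag is modelled as a
(name, attribute dictionary) pair, and it is assumed that matching is
case-sensitive and that * may also match an empty substring.

```python
import re

# Attributes that take part in tag matching when present in the template.
MATCHED_ATTRIBUTES = ("id", "name", "class")


def wildcard_match(pattern, value):
    """Return True if value fits pattern, where * stands for any substring."""
    # Escape every literal piece, then join the pieces with ".*".
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.fullmatch(".*".join(parts), value, re.DOTALL) is not None


def tag_matches(template_tag, page_tag):
    """Compare a template tag with a page tag by name and id/name/class."""
    template_name, template_attrs = template_tag
    page_name, page_attrs = page_tag
    if template_name != page_name:
        return False
    for attr in MATCHED_ATTRIBUTES:
        if attr not in template_attrs:
            continue  # attribute not constrained by the template
        if attr not in page_attrs:
            return False
        if not wildcard_match(template_attrs[attr], page_attrs[attr]):
            return False
    return True


template_tag = ("div", {"class": "news*"})
page_tag = ("div", {"class": "news-item", "style": "color: red"})
print(tag_matches(template_tag, page_tag))
```

Attributes other than id, name and class (style in this example) are ignored
by the sketch, which corresponds to the default behaviour described above.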
Special action attributes:
rt:conditional (no value) - marks the containing tag as conditional; it
    can be skipped if not present in the page
rt:match_attribs (no value) - for a tag containing this attribute to match,
    all attributes and their values AFTER this attribute must match on the
    page. Again, the wildcard * can be used.
Regular action attributes:
rt:dummy (no value) - just prints trace messages to stdout
rt:loop (value is name/key in databag) - indicates a loop; the application
    will repeatedly try to parse this element and all its children.
    A list is created in the databag under the name given in the attribute
    value. Each item of this list is a dictionary containing the values
    from one repetition.
rt:loop_start (value is name/key in databag) - indicates a loop that spans
    several sibling elements until rt:loop_end is met.
    Behaviour is otherwise similar to rt:loop.
rt:loop_end (no value) - indicates the end of a loop that spans several
    sibling elements
rt:read_text (value is name/key in databag) - reads all text contained
    in the element and saves it in the databag under the key given
    as the value of this attribute.
rt:read_text_end (name of action) - if any parent element of this element has
    started a read_text action with the name given as the value of
    this attribute, reading ends here, so the rest of the text within
    the parent element is not included in the read_text value
rt:read_text_start (value is name/key in databag) - starts a read_text action
    on the parent element. Can be used to skip some text/markup at
    the beginning of the element content
rt:read_attrib (value is name/key in databag) - reads the value of the
    attribute immediately following this attribute and saves it in the
    databag under the key given as the value of this attribute
Values captured by actions can be modified by so-called filters, which are
expressed as attributes with the rtf: prefix that follow the action attribute.
The following filter attributes are available:
rtf:lower (no value) - converts the string to lower case
rtf:upper (no value) - converts the string to upper case
rtf:strip (no value) - strips whitespace from the beginning and end of the
    captured string
rtf:regex (value is regular expression) - searches the action value for the
    regular expression given as the value of this attribute and returns the
    matching text
rtf:group (value is number or name of group within regex) - if it follows an
    rtf:regex filter, returns the value of the indicated group rather than
    the full match
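The effect of a filter chain can be approximated with plain Python string
methods and the standard re module. The sketch below is an assumption about
the behaviour, not the code in Filters.py: apply_filters is a hypothetical
helper, filters are given as (name, value) pairs in the order the rtf:
attributes would appear, rtf:regex is assumed to search anywhere in the
string, and a failed match is assumed to yield None.

```python
import re


def apply_filters(value, filters):
    """Apply (name, argument) filter pairs to a captured string, in order."""
    match = None  # last regex match, consulted by a following "group" filter
    for name, arg in filters:
        if name == "lower":
            value = value.lower()
        elif name == "upper":
            value = value.upper()
        elif name == "strip":
            value = value.strip()
        elif name == "regex":
            match = re.search(arg, value)
            value = match.group(0) if match else None
        elif name == "group":
            if match is None:
                value = None  # no preceding successful regex filter
            else:
                # A numeric argument selects a group by index, else by name.
                key = int(arg) if arg.isdigit() else arg
                value = match.group(key)
        else:
            raise ValueError("unknown filter: %s" % name)
        if value is None:
            break  # nothing left to filter
    return value


captured = "  Price: 42 EUR \n"
pattern = r"(\d+)\s*(?P<currency>[A-Z]+)"
print(apply_filters(captured, [("strip", None), ("regex", pattern)]))
print(apply_filters(captured, [("regex", pattern), ("group", "1")]))
print(apply_filters(captured, [("regex", pattern), ("group", "currency")]))
```

With rtf:regex alone the full match is kept; adding rtf:group narrows the
result to a single numbered or named group of the same match.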
As noted previously, the read_page function returns a databag object, which is
basically a Python dictionary that can contain lists or other dictionaries. The
databag contains values captured from the HTML page, available under the keys
given in the action definitions. A few samples of how to access values in it:
If rt:read_text="name1" is defined outside of a loop, the captured value
can be accessed in the databag as db['name1'].
If rt:read_text="name1" is defined inside the loop rt:loop="myloop", then:
* the value of the text in the first repetition of the loop is db['myloop'][0]['name1']
* the number of repetitions of the loop is len(db['myloop'])
* iteration over all repetitions of the loop:
for item in db['myloop']:
    print(item['name1'])
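To make the shape of a databag concrete, the following sketch builds one by
hand instead of calling read_page. The keys title, myloop and name1 and all
values are invented for illustration; a real databag would hold whatever
keys the template's action attributes name.

```python
# Hand-built stand-in for the result of rt.read_page (illustrative data only).
db = {
    "title": "Sample",          # e.g. from rt:read_text="title" outside a loop
    "myloop": [                 # e.g. from rt:loop="myloop"
        {"name1": "first"},     # values captured in the first repetition
        {"name1": "second"},    # values captured in the second repetition
    ],
}

top_level = db["title"]                           # value captured outside a loop
first = db["myloop"][0]["name1"]                  # value from the first repetition
count = len(db["myloop"])                         # number of repetitions
names = [item["name1"] for item in db["myloop"]]  # values from all repetitions

print(top_level, first, count, names)
```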
PACKAGE CONTENTS
Actions
DataRepr
Entity
Filters
Parsers
test (package)
FUNCTIONS
read_page(source, template, encoding=None)
read_template(source, encoding=None)
set_trace(trace=True)
DATA
__all__ = ['read_template', 'read_page', 'set_trace']
__version__ = '0.1.1'
VERSION
0.1.1