Reverse Template

Reverse Template is a handy Python module for extracting data from HTML pages. Several times in the past I needed to process HTML pages and get data from them into a program. I used various techniques, such as regular expressions and HTML parsers, which always required some page-specific programming to extract the data. Recently I decided to create a small framework that makes this quick, effective and easy, with minimal programming.

The approach that looked most valuable to me was to use a template that identifies which data need to be read from the page. The template should be close to the page's HTML code, so that it is easy to create.

So the steps are:

  • Load the page's HTML source
  • Edit this text and mark which data should be captured by adding some simple extensions to the HTML code (I decided to use additional attributes)
  • Load this template into a Python program
  • Use it to grab data from live page(s); the gathered data are available in an "easy to work with" Python object, an extension of the regular dictionary object.

This approach is roughly the opposite of how templates work for dynamic pages, which is why it is called Reverse Template.
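
The idea behind these steps can be illustrated with a minimal sketch built only on the standard library's html.parser. It is not the Reverse Template module itself and does not use its API: the attribute name rt-capture, the class PathCollector and the functions collect and extract are all hypothetical names chosen for this illustration. The sketch also assumes that the template and the live page share the same element structure and that non-void tags are explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical marker attribute; the real module's template syntax may differ.
CAPTURE_ATTR = "rt-capture"
# Elements without a closing tag are skipped so they do not disturb the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Record every element's structural path, attributes and direct text."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open elements as (path, child tag counters)
        self.root_counts = {}  # tag counters for top-level elements
        self.elements = {}     # path -> {"attrs": dict, "text": str}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_path, counts = self.stack[-1] if self.stack else ((), self.root_counts)
        index = counts.get(tag, 0)  # position among same-named siblings
        counts[tag] = index + 1
        path = parent_path + ((tag, index),)
        self.elements[path] = {"attrs": dict(attrs), "text": ""}
        self.stack.append((path, {}))

    def handle_endtag(self, tag):
        # Close the innermost element only if it matches the end tag.
        if self.stack and self.stack[-1][0][-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            self.elements[self.stack[-1][0]]["text"] += data


def collect(html):
    """Parse HTML and return a mapping of element paths to their details."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


def extract(template_html, page_html):
    """Return the text of page elements that the template marks for capture."""
    template = collect(template_html)
    page = collect(page_html)
    result = {}
    for path, element in template.items():
        name = element["attrs"].get(CAPTURE_ATTR)
        if name and path in page:
            result[name] = page[path]["text"].strip()
    return result


# The template is a copy of the page source with capture attributes added.
TEMPLATE = """<html><body>
<h1 rt-capture="title">Sample title</h1>
<div><span>Price:</span> <span rt-capture="price">0.00</span></div>
</body></html>"""

# Stand-in for a live page with the same structure but different content.
PAGE = """<html><body>
<h1>Reverse Template 0.1.1</h1>
<div><span>Price:</span> <span>12.50</span></div>
</body></html>"""

data = extract(TEMPLATE, PAGE)
print(data)  # {'title': 'Reverse Template 0.1.1', 'price': '12.50'}
```

The sketch returns a plain dictionary, whereas the module described here returns an extension of the dictionary object. Matching elements by their exact structural path is the simplest possible strategy and breaks as soon as the live page's layout differs from the template.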

Documentation

An extract from the Python docstring is available here.

 

Sample Reverse Templates

RT for Google

RT for IMDb

Source code

Again, available under the GPL license.

The current version is 0.1.1 (beta quality).

Source code is available here.

Last site update on 30/08/2013