Thursday, May 18, 2006

Parsing Wikipedia

On spanoflife, I set up a MySQL database and started building it out according to the ERD I had drawn up, then realized I should really prototype some interface ideas first. So I went looking for sample data and discovered that it is really hard to find any in a machine-readable format. Most biographies and timelines are written as paragraph text. The richest source of this data seems to be Wikipedia. So the real next step is to build a parser that can take a page on, say, Galileo, and read the events of his life out into database entries. I guess I'll use Python, just because I have it on hand. I found a parsing library, so I can turn things like:

In 1611, he went to Rome, where he joined the Accademia dei Lincei and observed sunspots. In 1612, opposition arose to the Copernican theories, which Galileo supported. In 1614, from the pulpit of Santa Maria Novella, Father Tommaso Caccini (1574-1648) denounced Galileo's opinions on the motion of the Earth, judging them dangerous and close to heresy.

Into:

EVENT_START_TIME EVENT_NAME
1611 He went to Rome, where he joined the Accademia dei Lincei and observed sunspots.

And so on and so forth.
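
To make the idea concrete, here is a minimal sketch of that transformation. It is not the parsing library mentioned above; it swaps in Python's standard `re` module, and the names `EVENT_RE`, `SAMPLE`, and `extract_events` are made up for illustration. It assumes every event is a sentence that opens with "In <year>," and ends at the first period followed by whitespace, so abbreviations like "St." in the middle of a sentence would trip it up.

```python
import re

# Sample paragraph from the Galileo text quoted above.
SAMPLE = (
    "In 1611, he went to Rome, where he joined the Accademia dei Lincei "
    "and observed sunspots. In 1612, opposition arose to the Copernican "
    "theories, which Galileo supported. In 1614, from the pulpit of Santa "
    "Maria Novella, Father Tommaso Caccini (1574-1648) denounced Galileo's "
    "opinions on the motion of the Earth, judging them dangerous and close "
    "to heresy."
)

# Assumed pattern (hypothetical): "In <3-4 digit year>, <rest of sentence>."
# The lazy match stops at the first period followed by whitespace or end of text.
EVENT_RE = re.compile(r"In (\d{3,4}), (.+?\.)(?=\s|$)")


def extract_events(text):
    """Return a list of (EVENT_START_TIME, EVENT_NAME) tuples found in text."""
    events = []
    for match in EVENT_RE.finditer(text):
        year = int(match.group(1))
        name = match.group(2)
        # Capitalize the first letter so the event reads as its own sentence.
        events.append((year, name[0].upper() + name[1:]))
    return events


if __name__ == "__main__":
    print("EVENT_START_TIME EVENT_NAME")
    for year, name in extract_events(SAMPLE):
        print(year, name)
```

Each tuple lines up with one row of the EVENT_START_TIME / EVENT_NAME layout above, so loading them into the MySQL table would be a separate, later step. Real Wikipedia text will have plenty of sentences that don't fit this one pattern, which is the reason to reach for a proper parsing library in the end.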
