For this Web site (W3M) I wanted a search engine so that one could search through all my data. However, everything out there was either too specific or too bulky, and I already had some infrastructure in place (as RDF). So I figured it wouldn't be too hard to write a search engine myself.
It turns out that it isn't, but I had to write it in two steps. First, I had to create a program that would read through all my
This code was ugly in version 0.1, so I rewrote it. What happens now is that I recurse through all the directories and, when I find a file whose name begins with "meta." and ends with ".xml", I send it to a parser. The parser returns an array of indexed data, one file per array element. These are written to the search database.
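The directory walk described above could be sketched roughly as follows. This is Python rather than the Perl of the original program, and the function name find_meta_files is an assumption made for illustration; only the naming rule (begins with "meta.", ends with ".xml") comes from the text.

```python
import os


def find_meta_files(root):
    """Yield the path of every file under root whose name begins with
    "meta." and ends with ".xml". (Hypothetical helper, not the original code.)"""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Naming rule taken from the text; everything else is an assumption.
            if name.startswith("meta.") and name.endswith(".xml"):
                yield os.path.join(dirpath, name)
```

Each path found this way would then be handed to the parser, and whatever the parser returns would be written to the search database.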
The parser was the ugliest part of version 0.1. Now, through the use of Perl's hash data type, it is somewhat more beautiful. All hash elements are initially set to nothing. Then I go through the RDF line by line. After removing any excess newlines, element endings (/>), or extra spaces, I check to see if all quotes match. Each key in the hash corresponds to an RDF element, so if I am able to substitute the RDF element with nothing, I set the hash value to the current line. If the quotes don't match, I set the nextline flag.
If, on a run through the next line of input, the nextline flag is set, I check to make sure that no other RDF element is in the current line of input and append the current line to the hash value. I then set the nextline flag to nothing if all quotes match.
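A rough sketch of this line-by-line approach, in Python rather than the original Perl, might look like the following. The field names, the name="value" layout of each line, and the helper names are all assumptions made for illustration, since the actual RDF vocabulary is not shown here. The variable pending plays the role of the nextline flag, and "all quotes match" is read as "the value collected so far contains an even number of double quotes".

```python
# Hypothetical field names; the real RDF element names are not given in the text.
FIELDS = ["keywords", "title", "description", "author", "updated", "language", "url"]


def clean(line):
    """Strip newlines, a trailing element ending (/>), and extra spaces."""
    line = line.strip()
    if line.endswith("/>"):
        line = line[:-2]
    return " ".join(line.split())


def quotes_match(text):
    """Assumed meaning of 'all quotes match': an even number of double quotes."""
    return text.count('"') % 2 == 0


def parse_record(lines):
    record = {name: "" for name in FIELDS}  # every element starts out empty
    pending = None  # stands in for the nextline flag
    for raw in lines:
        line = clean(raw)
        if not line:
            continue
        if pending is not None:
            # Treat the line as a continuation only if no other field starts on it.
            if not any(line.startswith(name + "=") for name in FIELDS):
                record[pending] += " " + line
                if quotes_match(record[pending]):
                    pending = None
                continue
            pending = None
        for name in FIELDS:
            prefix = name + "="
            if line.startswith(prefix):
                # Removing the element name leaves the value.
                record[name] = line[len(prefix):]
                if not quotes_match(record[name]):
                    pending = name  # the value continues on the next line
                break
    return {name: value.strip('"') for name, value in record.items()}
```

For example, a value that opens a quote on one line and closes it on the next is collected into a single field, with the two pieces joined by a space.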
Then the data are assembled in order of keywords, title, description, author, last updated, language, then URL. All of these assembled data are pushed onto an array, which is returned.
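The assembly step could be sketched like this, again in Python rather than Perl. The field order is the one listed above; the dictionary keys, the tab separator, and the function names are assumptions made for illustration, since the actual database format is not described.

```python
# Order taken from the text; the key names themselves are hypothetical.
FIELD_ORDER = ["keywords", "title", "description", "author", "updated", "language", "url"]


def assemble(record, sep="\t"):
    """Join one file's fields in the fixed order (tab separator is an assumption)."""
    return sep.join(record.get(name, "") for name in FIELD_ORDER)


def build_index(records):
    """Return one assembled entry per parsed file, like the array the parser returns."""
    return [assemble(record) for record in records]
```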
Finally I output whether or not it worked.
All of this greatly simplifies searching; most searching is just grepping, which is a geeky way of saying looking for a word in a group of words (well, actually there is more behind the word grep, but that is the general idea). All I do is input the query, open the search database, print the results page, and grep the search database, line-by-line. If the line contains something that I am querying, I get the data from the search database, remove any newlines, assemble it in the order I want, and print it. Quite simple, really.
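The grep-style search can be sketched in a few lines. This is Python, not the original Perl, and it assumes (hypothetically) that the search database holds one tab-separated entry per line; the function name search is also an assumption. The match is a plain, case-sensitive substring test.

```python
def search(query, db_lines):
    """Return the fields of every database line that contains the query."""
    hits = []
    for line in db_lines:
        if query in line:
            # Drop the trailing newline, then split into fields (assumed tab-separated).
            hits.append(line.rstrip("\n").split("\t"))
    return hits
```

In a real script db_lines could simply be the open database file, since iterating over a file yields it line by line, and each hit would be printed on the results page in whatever order is wanted.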
Download it in version 0.2.
Download version 0.1 to see my mistakes.
Wish I had recovered this file so I could see my XML parser written using nothing but regular expressions.