Legeon

This is an attempt to build an open-source, community-driven, distributed, non-commercial search engine as an alternative to Google.

Legeon is an Old Russian numeral meaning 10^5 (or 10^12). That is the difference from Google, whose name comes from googol, meaning 10^100: while Google aims at a larger quantity of results at the cost of quality, Legeon should provide a smaller number of higher-quality results.

The code of the search engine will be developed in accordance with standards opposite to some modern trends in software development. These standards are tentatively named Craft Programming (“craft” as in “beer”) and will also be explained here.

The implementation of this initiative is influenced by the ideas of the ALVIS project and WAIS. It includes a crawler suite and web interface written in Perl, with Zebra as the database backend.
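As an illustration of how these parts talk to each other (a minimal sketch, assuming the ZOOM-Perl bindings from CPAN; the host, port and database name are assumptions, not the actual setup), the web interface could query the Zebra backend over Z39.50 roughly like this:

    #!/usr/bin/perl
    # Minimal sketch: querying a local Zebra backend over Z39.50
    # through the ZOOM-Perl bindings. Host, port and database name
    # are assumptions; adjust to the actual installation.
    use strict;
    use warnings;
    use ZOOM;

    my $conn = ZOOM::Connection->new('localhost:2100/legeon');

    # A Type-1 query in PQF; Bib-1 use attribute 1016 means "any".
    my $rs = $conn->search_pq('@attr 1=1016 craft');

    printf "%d hits\n", $rs->size();
    print $rs->record(0)->render(), "\n" if $rs->size() > 0;

    $conn->destroy();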

The implementation’s rationale

Can you imagine, for example, traffic regulations changing every year or even every quarter? Here in Russia we can also compare it with the continuous constitution changes and the squall of legislative inventions that have distressed all decent people over the last decade. Can we imagine new orthography rules being put in force several times a year? Would anyone like being forced to upgrade a car or a flat every month or even every week because of any slight chance of “vulnerability” or “insecurity”?

Unfortunately, this is what happens now in the world of programming languages, Internet standards and software.

For stability reasons, we therefore accept only mature languages and technologies for development. I would actually prefer only languages and technologies that are no less than 20-25 years old. My personal favourites are Perl, C, Lisp, and Z39.50 as the query protocol (not to mention gopher as a supported UI).

There is also a commercial consideration: unlike mainstream commercial search engines, I cannot afford, and do not wish to have, a staff of hired workers to maintain the code. This consideration also underlies the whole craft programming philosophy, which is focused on non-commercial, non-payment-oriented programming.

Since in the future I plan to employ some distributed storage, a field that seems to have made considerable progress in recent years, other technologies may still be adopted; but even then I would prefer a C implementation of such a system.

Why Perl?

Although modern Perl, with its new development strategy, can no longer be called stable (some 20 minor releases over about 12 years; lots of new, unsuccessful additions later deprecated; worse, but more “secure”, implementations of things that had already been working well; and so on), the backward compatibility of both the language and third-party modules remains at a decent level, and writing code compatible with both older and newer versions does not pose a problem. (For the same reason, Python cannot be considered stable and acceptable: the recent switch to the backward-incompatible Python 3 and the deprecation of Python 2 caused an immense amount of work to be done only to bring the existing enormous code base, otherwise mature and stable enough, as Python proponents have always claimed, up to the new standard. But even with Python 2, incompatibilities of specific software with specific minor versions were quite ubiquitous.) It is also always possible to switch to more robust forks like stableperl, cperl or rperl; the very existence of these shows that at least some developers understand the problem.
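One common idiom for such cross-version compatibility (a minimal, illustrative sketch, not project code) is to pin the oldest perl the code is tested against and probe for optional extras at run time instead of demanding a recent release:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.006;   # oldest perl this code is expected to run on

    # Probe for an optional module and degrade gracefully if it is
    # absent, instead of requiring a newer perl or a fixed module set.
    my $have_hires = eval { require Time::HiRes; 1 };

    sub now {
        return $have_hires ? Time::HiRes::time() : time();
    }

    printf "started at %.3f\n", now();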

Perl has also been especially widely used in the text-processing area, including search engines, since the early 1990s. So it is also a matter of tradition.

Why not Sphinx?

At the time of active development of the search engine (2007-2012), Sphinx could not be called stable enough: new releases appeared too often, with lots of new features and possibly backward-incompatible changes. I could not afford to maintain the thing properly, keeping it up to date with the latest releases; the compatibility of the Perl interface, which was updated less regularly than Sphinx itself, was especially problematic. And I needed software that could be installed once and then work for decades.

Now things may have changed, and given the impressive performance and sane architecture of Sphinx, I may reconsider the possibility of adopting it.

Another reason was that Sphinx did not support Z39.50, which is required as the standard of interaction with the search server. In practice, two standards are needed: one for the search query sent by the end user (and possibly composed by the web frontend) to the web server, and one for the query sent by the web server to the search engine backend. The former, as far as I know, is not yet standardized at all, despite several attempts; for the latter, nothing really superior has been invented since the times of Z39.50. Again, these standards should be protected from “developments” and changes, which would make me spend the rest of my life keeping already well-performing code up to date with changing “standards” (how can such a thing be called a standard?). The best solution is to take a solid, aged and proven protocol such as Z39.50. Nowadays it is used mainly in library interfaces, but the version of it employed by WAIS was commonly used to build general-purpose search engines until web search engines finally won.
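To make the second of these two layers concrete, here is a hypothetical sketch of how the web server might translate an end user's space-separated terms into a Z39.50 Type-1 query written in PQF (the Bib-1 use attribute 1016, “any”, is an illustrative choice, not a fixed design):

    use strict;
    use warnings;

    # Build a conjunctive PQF query from space-separated terms.
    sub terms_to_pqf {
        my @terms = map { qq{\@attr 1=1016 "$_"} } split ' ', shift;
        return '' unless @terms;
        my $pqf = pop @terms;
        $pqf = "\@and $_ $pqf" for reverse @terms;
        return $pqf;
    }

    print terms_to_pqf('craft programming'), "\n";
    # prints: @and @attr 1=1016 "craft" @attr 1=1016 "programming"

The resulting string would then be handed to the backend exactly as in the earlier sketch, via search_pq().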

Differences from other search engines