MG4J search engine
==================

http://mg4j.dsi.unimi.it/

A tutorial with a small document base of short documents
(ca. 90K docs, 6M words)

Index:
1) download
2) search using UNIX tools
3) index
4) query
6) changing indexing
7) changing the scorer
8) last words

IMPORTANT: Please answer the numbered QUESTIONS in a single text file
and submit it to your teacher (subject:

1) DOWNLOAD
-----------

HINT: you can copy the commands (ctrl-c) and then paste them into the
terminal (ctrl-shift-v)

- create a working directory
  $ mkdir mg4j
  $ cd mg4j
- download MG4J:
  $ wget http://ixa2.si.ehu.es/~jibotusa/mg4j/mg4j_lib.tar.gz
- uncompress (creating a "lib" directory):
  $ tar xzvf mg4j_lib.tar.gz
- create "documents" directory, download documents and uncompress
  $ mkdir documents
  $ cd documents
  $ wget http://ixa2.si.ehu.es/~jibotusa/mg4j/09405a_Ag_UK_ELocal.tar.gz
  $ tar xzf 09405a_Ag_UK_ELocal.tar.gz

2) Statistics and search using UNIX tools
-----------------------------------------

- Statistics of the document base (guess what the numbers mean; consult
  the "man" pages or search the Internet to find out what each command
  does: find, wc, egrep, ...)
  $ cd ..
  $ ls -hl documents/09405a_Ag_UK_ELocal.tar.gz
  $ find documents -name '*.html' -print | wc -l
  $ find documents -name '*.html' -print | xargs cat | wc -w
- Searching for documents without any index using UNIX tools
  $ egrep -R --color 'Picasso' documents
  $ egrep -R --color 'Picasso' documents | wc -l
  $ egrep -R --color 'Leeds' documents
  $ egrep -R --color 'Leeds' documents | wc -l
- Seems like enough even for queries with two terms (¿¿¿!!!???)
  $ egrep -R --color 'Picasso' documents | egrep --color 'Greece'
  $ egrep -R --color 'Picasso.*Greece' documents
  $ egrep -R --color 'Greece.*Picasso' documents
- Please think about the limitations of this approach.
- For instance, is it fast?
  Try the following:
  $ time egrep -R --color 'the' documents | tail
  $ time egrep -R --color 'the' documents/09405a_Ag_UK_ELocal/3 | tail
- How long would it take to grep 100 million short documents?
  HINT: use the result of time above and the number of docs in the
  "3" directory:
  $ find documents/09405a_Ag_UK_ELocal/3 -name '*.html' -print | wc -l

3) index using MG4J
-------------------

- create "index" directory inside mg4j (run the cd only if you are
  still inside "documents"; all paths below are relative to mg4j)
  $ cd ..
  $ mkdir index
- add paths of all .jar files into CLASSPATH
  $ export CLASSPATH=lib/classpathx-jaf-1.0.jar:lib/colt-1.2.0.jar:lib/dsiutils-1.0.8.jar:lib/fastutil5-5.1.5.jar:lib/hugoUtils_0.jar:lib/jakarta-commons-collections-3.2.jar:lib/jakarta-commons-configuration-1.4.jar:lib/jakarta-commons-digester-1.8.jar:lib/jakarta-commons-io-1.4.jar:lib/jakarta-commons-lang-2.3.jar:lib/jakarta-commons-logging-1.1.jar:lib/jal-20031117.jar:lib/javacc-4.0.jar:lib/jetty5-5.1.12.jar:lib/jsap-2.0.jar:lib/junit-3.8.2.jar:lib/log4j-1.2.14.jar:lib/mailapi-1.3.1.jar:lib/mg4j.jar:lib/pdfbox-0.7.1.jar:lib/sux4j-1.0.1.jar:lib/tomcat5-servlet-2.4-api-5.5.25.jar:lib/velocity-1.5.jar:lib/velocity-tools-1.3.jar:lib/xalan-j2-serializer-2.7.0.jar
- compile the list of files that we would like to index
  $ find documents/09405a_Ag_UK_ELocal/ -iname \*.html -type f > index/europeana.files
- create index (two steps)
  $ java it.unimi.dsi.mg4j.document.FileSetDocumentCollection -f HtmlDocumentFactory -p encoding=UTF-8 index/europeana.collection < index/europeana.files
  $ java -server it.unimi.dsi.mg4j.tool.IndexBuilder --downcase -S index/europeana.collection index/europeana
- run query server
  $ java it.unimi.dsi.mg4j.query.Query -h -i FileSystemItem -c index/europeana.collection index/europeana-title index/europeana-text

4) run queries
--------------

There are two ways to run queries

A) Directly on the query server (command line)
   {title, text}> $
   (to see all possibilities, more on this later)

B) Use the web server
   Type the following in a web browser:
   http://localhost:4242/Query (or
   http://127.0.0.1:4242/Query)

Documentation: http://mg4j.di.unimi.it/man/manual
Specifically, querying: http://mg4j.di.unimi.it/man/manual/ch01s04.html
(Read the first section, up to "more sophisticated queries")
More details: http://mg4j.di.unimi.it/docs/it/unimi/di/mg4j/search/package-summary.html

* EXERCISES ***********

A.- Explore the contents of the collection using queries (you can also
    open the files directly in the documents directory).
    These records are a simplified version of a sample of the contents
    of the following website: http://www.europeana.eu

B.- Try several query options, and check whether they do what you expect:
    AND, OR, NOT, phrase, proximity, ordered AND, wildcard search, (),
    index specifiers, range queries

C. QUESTION I: Does the engine follow a boolean or a vector space model?

D. QUESTION II: Does the engine index stopwords? What about stemming
   (chimney OR chimneys)?

* END EXERCISES ***************

- Check the files created by the indexer (open them with a text editor
  like emacs)
  index/europeana-text.terms (Note that there is no stemming: chimney, chimneys)
  index/europeana-text.stats
  index/europeana-text.properties

6) changing indexing
--------------------

- index without downcasing (create new index)
  $ java -server it.unimi.dsi.mg4j.tool.IndexBuilder -S index/europeana.collection index/europeana2
- index using a stemmer (create yet another index)
  $ java -server it.unimi.dsi.mg4j.tool.IndexBuilder -t PorterStemmer -S index/europeana.collection index/europeana3

* EXERCISES ***********

A. QUESTION III: Compare the three indexes and list 3 differences in
   each, explaining the source of the differences:
   index/europeana-text.terms
   index/europeana2-text.terms
   index/europeana3-text.terms

B. QUESTION IV: Assume that your information need is the following:
   "Names of women whom Henry VIII divorced".
   Run the following query: Henry OR VIII OR divorced.
   Evaluate precision@3 for the query above using each of the three
   indexes.
   In order to try each index, kill the server (Ctrl-C) and run the
   server with a different index, e.g.:
   $ java it.unimi.dsi.mg4j.query.Query -h -i FileSystemItem -c index/europeana.collection index/europeana2-title index/europeana2-text

* END EXERCISES ***************

7) changing the scorer
--------------------------------

- type $ and see the options available in the query server

  {title, text}> $
  $                                                       prints this help.
  $mode [time|short|long|snippet|trec <topicNo> <runTag>] chooses display mode.
  $select [<maxIntervals> <maxLength>] [all]              installs or removes an interval selector.
  $limit <max>                                            output at most <max> results per query.
  $divert [<filename>]                                    diverts output to <filename> or to stdout.
  $weight {index:weight}                                  set index weights (unspecified weights are set to 1).
  $mplex [<on>|<off>]                                     set/unset multiplex mode.
  $score {<scorerClass>(<arg>,...)[:<weight>]}            order documents according to <scorerClass>.
  $expand {<expanderClass>(<arg>,...)}                    expand terms and prefixes according to <expanderClass>.
  $quit                                                   quits.

* EXERCISES ***********

Back to the main index: europeana

A.- We can give more weight to one index. Try some values and see how
    the results change for the query: henry OR viii OR divorce
    . normal values (title:1 text:1)
    . > $weight title:1 text:10
    . > $weight title:10 text:1

B. QUESTION V: Assume that your information need is the following:
   "Names of women whom Henry VIII divorced".
   Evaluate precision@3 for the query above using each of the
   weighting schemes.
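
   As a reminder of how precision@3 is computed, here is a minimal
   shell sketch. The relevance judgements in it are hypothetical
   placeholders, not results from this collection: replace them with
   your own judgement of the top 3 documents returned under each
   setting.

```shell
# Hypothetical relevance judgements for the top 3 results, in rank order:
# 1 = relevant to the information need, 0 = not relevant.
judgements="1 0 1"

k=3
relevant=0
for j in $judgements; do
    relevant=$((relevant + j))   # count the relevant documents
done

# precision@k = (relevant documents among the top k) / k
precision=$(awk -v r="$relevant" -v k="$k" 'BEGIN { printf "%.2f", r / k }')
echo "precision@$k = $precision"
```

   With the placeholder judgements above (2 relevant out of 3) this
   prints "precision@3 = 0.67".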
* END EXERCISES ***************

- When the query server starts, MG4J reports the default parameters:

  Welcome to the MG4J query class (setup with $mode snippet, $score BM25Scorer VignaScorer, $mplex on, $equalize 1000, $select 4 40)

- Check the different scorers available:
  http://mg4j.di.unimi.it/docs/it/unimi/di/mg4j/search/score/package-summary.html

- For instance
  $score TfIdfScorer
     log(N / f) * c / l
     (N = number of documents, f = number of documents containing the
      term, c = occurrences of the term in the document, l = document
      length)

  check and think about the following scores:
     kandinsky 2 times (f = 2)        idf = log_10(93000/2)
     first document for kandinsky     0.12 ~ idf*1/30
     second document for kandinsky    0.06 ~ idf*1/60

  check the scores for: picasso OR kandinsky
     picasso 6 times (f = 6)          log_10(93000/6)

* EXERCISES ***********

A. Check the following scorers and think of some queries where one
   would do better than the other
   $score CountScorer
   $score TfIdfScorer
   $score BM25Scorer

B. QUESTION VI: Which scorer do you like best? Show some examples that
   support your opinion.

* END EXERCISES ***************

8) Last words
-------------

This self-contained search engine is great for education and research
(no dependencies except Java, open source).

But industrial applications need more robust software that is still
open source:

LUCENE / SOLR: http://lucene.apache.org/solr/