MG4J search engine
==================

http://mg4j.dsi.unimi.it/

A tutorial with a small document base of short documents (ca. 90K docs, 6M words)

Index:

1) download
2) statistics and search using UNIX tools
3) index
4) query
5) changing indexing
6) changing the scorer
7) last words

IMPORTANT: Please answer the numbered QUESTIONS in a single text file and submit it to your teacher (subject: 

1) DOWNLOAD 
-----------

  HINT: you can copy (ctrl-c), and then paste the commands into the
        terminal (ctrl-shift-v)

- create a working directory
  $ mkdir mg4j
  $ cd mg4j

- download MG4J:
  $ wget http://ixa2.si.ehu.es/~jibotusa/mg4j/mg4j_lib.tar.gz

- uncompress (creating a "lib" directory):
  $ tar xzvf mg4j_lib.tar.gz

- create "documents" directory, download documents and uncompress
  $ mkdir documents
  $ cd documents
  $ wget http://ixa2.si.ehu.es/~jibotusa/mg4j/09405a_Ag_UK_ELocal.tar.gz
  $ tar xzf 09405a_Ag_UK_ELocal.tar.gz

2) Statistics and search using UNIX tools
-----------------------------------------

- Statistics of the document base (work out what the numbers mean: consult the "man" pages
  or search the Internet to find out what each command does: find, wc, egrep, ...)

  $ cd ..
  $ ls -hl documents/09405a_Ag_UK_ELocal.tar.gz
  $ find documents -name '*.html' -print | wc -l
  $ find documents -name '*.html' -print | xargs cat | wc -w 

- Searching for documents without any index using UNIX tools

  $ egrep -R --color 'Picasso' documents
  $ egrep -R --color 'Picasso' documents | wc -l

  $ egrep -R --color 'Leeds' documents
  $ egrep -R --color 'Leeds' documents | wc -l

- This seems to be enough, even for queries with two terms (or is it?)

  $ egrep -R --color 'Picasso' documents | egrep --color 'Greece'  
  $ egrep -R --color 'Picasso.*Greece' documents 
  $ egrep -R --color 'Greece.*Picasso' documents 
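
- The pipelines above match single lines, so a document that mentions
  Picasso on one line and Greece on another is not found. A minimal
  sketch of a per-document search follows. The function name
  docs_with_both is hypothetical, and the options assume GNU grep and
  GNU xargs:

```shell
# List the files that contain BOTH terms anywhere in the file.
# Assumes GNU grep and GNU xargs: -Z / -0 keep unusual file names intact,
# -r skips the second grep when the first one finds nothing.
docs_with_both() {
  # $1 = first term, $2 = second term, $3 = directory to search
  grep -rlZ -- "$1" "$3" | xargs -0 -r grep -l -- "$2"
}

# Usage (from the "mg4j" directory):
#   docs_with_both Picasso Greece documents | wc -l
```

  Comparing this count with the counts of the line-based commands gives
  a first idea of what plain grep misses.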

- Please think about what the limitations are.

- For instance, is it fast? Try the following:

  $ time egrep -R --color 'the' documents | tail
  $ time egrep -R --color 'the' documents/09405a_Ag_UK_ELocal/3 | tail

- How long would it take to grep 100 million short documents?

  HINT: use the result of time above and the number of docs in the "3"
      directory:

  $ find documents/09405a_Ag_UK_ELocal/3 -name '*.html' -print | wc -l
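
- The estimate is a simple linear extrapolation. A minimal sketch,
  assuming that grep time grows in proportion to the number of
  documents; the function name and the measurement (2 seconds for 1000
  documents) are hypothetical, so replace them with your own figures:

```shell
# Linear extrapolation: if grep needs T seconds for N documents,
# estimate the time for M documents as T * M / N.
extrapolate_grep_time() {
  # $1 = measured seconds, $2 = documents scanned, $3 = target documents
  awk -v t="$1" -v n="$2" -v m="$3" 'BEGIN {
    s = t * m / n
    printf "%.0f seconds (%.1f hours)\n", s, s / 3600
  }'
}

# Hypothetical measurement: 2 seconds for 1000 documents
extrapolate_grep_time 2 1000 100000000
```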

3) index using MG4J 
-------------------

- create "index" directory (you should still be in the "mg4j" working
  directory, next to "lib" and "documents")

  $ mkdir index

- add paths of all .jar files into CLASSPATH
  $ export CLASSPATH=lib/classpathx-jaf-1.0.jar:lib/colt-1.2.0.jar:lib/dsiutils-1.0.8.jar:lib/fastutil5-5.1.5.jar:lib/hugoUtils_0.jar:lib/jakarta-commons-collections-3.2.jar:lib/jakarta-commons-configuration-1.4.jar:lib/jakarta-commons-digester-1.8.jar:lib/jakarta-commons-io-1.4.jar:lib/jakarta-commons-lang-2.3.jar:lib/jakarta-commons-logging-1.1.jar:lib/jal-20031117.jar:lib/javacc-4.0.jar:lib/jetty5-5.1.12.jar:lib/jsap-2.0.jar:lib/junit-3.8.2.jar:lib/log4j-1.2.14.jar:lib/mailapi-1.3.1.jar:lib/mg4j.jar:lib/pdfbox-0.7.1.jar:lib/sux4j-1.0.1.jar:lib/tomcat5-servlet-2.4-api-5.5.25.jar:lib/velocity-1.5.jar:lib/velocity-tools-1.3.jar:lib/xalan-j2-serializer-2.7.0.jar
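
- The same CLASSPATH can be built without typing every file name. A
  minimal sketch, assuming that "lib" contains only the .jar files
  listed above; the function name build_classpath is hypothetical:

```shell
# Join every .jar file found directly under a directory with ":".
build_classpath() {
  # $1 = directory that holds the .jar files (here: lib)
  find "$1" -maxdepth 1 -name '*.jar' | sort | paste -sd: -
}

# Usage (from the "mg4j" directory):
#   export CLASSPATH=$(build_classpath lib)
```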

- compile list of files that we would like to index
  $ find documents/09405a_Ag_UK_ELocal/ -iname \*.html -type f > index/europeana.files

- create index (two steps)
  $ java it.unimi.dsi.mg4j.document.FileSetDocumentCollection -f HtmlDocumentFactory -p encoding=UTF-8 index/europeana.collection < index/europeana.files

  $ java -server it.unimi.dsi.mg4j.tool.IndexBuilder --downcase -S index/europeana.collection index/europeana

- run query server
  $ java it.unimi.dsi.mg4j.query.Query -h -i FileSystemItem -c index/europeana.collection index/europeana-title index/europeana-text

4) run queries 
--------------

There are two ways to run queries

A) Directly on query server (command line)

  {title, text}> $
  (type $ at the prompt to see all the options; more on this later)

B) Use the web server

Type the following in a web browser: http://localhost:4242/Query
  (or http://127.0.0.1:4242/Query)

Documentation:  http://mg4j.di.unimi.it/man/manual

  Specifically, querying:

      http://mg4j.di.unimi.it/man/manual/ch01s04.html
      (Read the first section, up to "more sophisticated queries")

      More details: http://mg4j.di.unimi.it/docs/it/unimi/di/mg4j/search/package-summary.html

* EXERCISES
***********

  A.- explore the contents of the collection using queries
      (you can also open the files directly in the documents directory)
    
      These records are a simplified version of a sample of the
      contents in the following website:

        http://www.europeana.eu

  B.- Try several query options, and check if they do what you expect:
      AND, OR, NOT, phrase, proximity, ordered AND, wildcard search, 
      (), index specifiers, range queries

  C. QUESTION I: Does the engine follow a Boolean or a vector space model?

  D. QUESTION II: Does the engine index stopwords? What about stemming (chimney OR chimneys)?


* END EXERCISES
***************

- Check the files created by the indexer (open with a text editor like emacs)

  index/europeana-text.terms (Note that there is no stemming: chimney, chimneys)     
  index/europeana-text.stats
  index/europeana-text.properties

5) changing indexing
--------------------

- index without downcasing (create new index)

  $ java -server it.unimi.dsi.mg4j.tool.IndexBuilder -S index/europeana.collection index/europeana2

- index using a stemmer (create yet another index)

  $ java -server it.unimi.dsi.mg4j.tool.IndexBuilder -t PorterStemmer -S index/europeana.collection index/europeana3

* EXERCISES
***********

  A. QUESTION III: compare the three indexes and list 3 differences in
     each, explaining the source of the differences:

     index/europeana-text.terms
     index/europeana2-text.terms
     index/europeana3-text.terms

  B. QUESTION IV: Assume that your information need is the following:
     "Names of women whom Henry VIII divorced". Run the following
     query: Henry OR VIII OR divorced. Evaluate precision@3 for the
     query above using each of the three indexes.

     In order to try each index, kill the server (Ctrl-C) and run the
     server with a different index, e.g.: 

     $ java it.unimi.dsi.mg4j.query.Query -h -i FileSystemItem -c index/europeana.collection index/europeana2-title index/europeana2-text

* END EXERCISES
***************
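
- precision@3 is the fraction of the top 3 results that are relevant to
  the information need. A minimal sketch of the arithmetic; the function
  name and the relevance judgments (1 = relevant, 0 = not relevant) are
  hypothetical, and you have to judge the real results yourself:

```shell
# precision@k = (relevant documents among the top k results) / k
precision_at_k() {
  # $1 = k; remaining arguments = judgments in rank order (1 relevant, 0 not)
  k=$1; shift
  relevant=0; i=0
  for judgment in "$@"; do
    [ "$i" -ge "$k" ] && break
    relevant=$((relevant + judgment))
    i=$((i + 1))
  done
  awk -v r="$relevant" -v k="$k" 'BEGIN { printf "%.2f\n", r / k }'
}

# Hypothetical judgments: results 1 and 3 are relevant, result 2 is not
precision_at_k 3 1 0 1
```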

6) changing the scorer
----------------------

- type $ and see the options available in the query server

  {title, text}> $

  $                                                       prints this help.
  $mode [time|short|long|snippet|trec <topicNo> <runTag>] chooses display mode.
  $select [<maxIntervals> <maxLength>] [all]              installs or removes an interval selector.
  $limit <max>                                            output at most <max> results per query.
  $divert [<filename>]                                    diverts output to <filename> or to stdout.
  $weight {index:weight}                                  set index weights (unspecified weights are set to 1).
  $mplex [<on>|<off>]                                     set/unset multiplex mode.
  $score {<scorerClass>(<arg>,...)[:<weight>]}            order documents according to <scorerClass>.
  $expand {<expanderClass>(<arg>,...)}                    expand terms and prefixes according to <expanderClass>.
  $quit                                                   quits.

* EXERCISES
***********

  Go back to the main index, europeana (restart the query server as in section 3)

  A.- We can give more weight to one index. Try some values and see
      how the results change for the query: henry OR viii OR divorce

      . normal values (title:1 text:1)
      . > $weight title:1 text:10
      . > $weight title:10 text:1

  B. QUESTION V: Assume that your information need is the following:
     "Names of women whom Henry VIII divorced". Evaluate precision@3
     for the query above using each of the weighting schemes.

* END EXERCISES
***************
  
- When the query server starts, MG4J reports the default parameters:

     Welcome to the MG4J query class (setup with $mode snippet, $score
        BM25Scorer VignaScorer, $mplex on, $equalize 1000, $select 4
        40)

- Check the different scorers available:

     http://mg4j.di.unimi.it/docs/it/unimi/di/mg4j/search/score/package-summary.html

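- For instance:

  $score TfIdfScorer
 
  score = log(N / f) * c / l

  where N is the number of documents in the collection, f the number
  of documents that contain the term, c the number of occurrences of
  the term in the document, and l the length of the document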

  Check and think about the following scores:
     kandinsky occurs in 2 documents     idf = log_10(93000/2)

     first document for kandinsky     0.12 ~ idf*1/30
     second document for kandinsky    0.06 ~ idf*1/60

  Check the scores for:  picasso OR kandinsky
     picasso occurs in 6 documents       idf = log_10(93000/6)
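
  The formula can be tried out by hand. A minimal sketch, assuming a
  base-10 logarithm as in the notes above; the function name tfidf and
  the numbers are hypothetical. The log base and the exact document
  lengths used by MG4J may differ, so do not expect the absolute values
  to match the scores shown by the server. What matters is how the
  score changes when one factor changes:

```shell
# score = log10(N / f) * c / l
tfidf() {
  # $1 = N (documents in the collection), $2 = f (documents containing the term)
  # $3 = c (occurrences of the term in the document), $4 = l (document length)
  awk -v N="$1" -v f="$2" -v c="$3" -v l="$4" 'BEGIN {
    printf "%.4f\n", log(N / f) / log(10) * c / l
  }'
}

# Hypothetical documents: same term, same count, the second is twice as long
tfidf 1000 10 1 20
tfidf 1000 10 1 40
```

  Doubling the document length halves the score, which is the same
  pattern as in the two kandinsky documents.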

* EXERCISES
***********

  A. check the following scorers and think of some queries where one
     would do better than the other

     $score CountScorer
     $score TfIdfScorer
     $score BM25Scorer

  B. QUESTION VI: Which scorer would you like best? Show some examples
     that support your opinion.

* END EXERCISES
***************

7) Last words
-------------

This self-contained search engine is great for education and research
(open source, no dependencies other than Java).

Industrial applications, however, need more robust software, which is
also available as open source:

   LUCENE / SOLR:   http://lucene.apache.org/solr/