Multilingual Central Repository 3.0
-----------------------------------

Version 3.0 of the Multilingual Central Repository (MCR 3.0) is the
result of the 5th Framework MEANING project (IST-2001-34460) and the
Spanish government KNOW (TIN2006-15049-C03) and KNOW2
(TIN2009-14715-C04-01) projects.

The MCR 3.0 integrates wordnets from five different languages into the
same EuroWordNet framework: English, Spanish, Catalan, Basque and
Galician. The Inter-Lingual-Index (ILI) connects words in one language
to their equivalent translations in any of the other languages. The
current ILI version corresponds to Princeton WordNet 3.0. Furthermore,
the MCR is enriched with semantically tagged glosses:
http://wordnet.princeton.edu/glosstag.shtml

The MCR also integrates WordNet Domains, new versions of the Base
Concepts and the Top Ontology, and the AdimenSUMO ontology. In this
way, the MCR constitutes a natural, large-scale multilingual semantic
resource for semantic processes that need large amounts of
multilingual knowledge to be effective.

The current content of the MCR 3.0 can be consulted using the Web
EuroWordNet Interface (WEI):
http://adimen.si.ehu.es/web/MCR

For more details on the MCR 3.0 contents, including references to the
original resources, please consult the following paper:

Gonzalez-Agirre A., Laparra E. and Rigau G. Multilingual Central
Repository version 3.0: upgrading a very large lexical knowledge
base. In Proceedings of the Sixth International Global WordNet
Conference (GWC’12). Matsue, Japan. January, 2012.

which can be downloaded at:
http://adimen.si.ehu.es/~rigau/publications/gwc12-glr.pdf


Contents of the distribution
----------------------------

The current distribution of the MCR 3.0 consists of the following
directories and files:

AdimenSUMO/		Mappings from WN to AdimenSUMO classes
catWN/			Catalan WordNet
data/			ILI, relations, relations groups, lexnames
Domains/		Mappings from WN to WN Domains labels
engWN/			English WordNet
eusWN/			Basque WordNet
glgWN/			Galician WordNet
LICENCE.txt		Licenses
Marks/			Variant and Synset marks
README.txt		README file
spaWN/ 			Spanish WordNet
sql/			Instructions to create the database in both MySQL and PostgreSQL
TopOntology/		Mappings from WN to Top Ontology properties

The MCR 3.0 includes WordNets for five languages: English (from
Princeton WordNet 3.0), Catalan, Basque, Galician and Spanish.

Variants
========

WN	 Nouns	Verbs	Adjectives	Adverbs	Synsets

catWN	 51605	11577 	7679		2	46033
engWN	 147358	25051	30004		5580	118435
eusWN	 40939	9470	148		0	30615
glgWN	 18949	1416	6773		0	19312
spaWN	 39142	10824	6967		1051	38702

Glosses
=======

WN	Nouns	Verbs	Adjectives	Adverbs	Synsets

catWN	6294	44	840		1	7179
engWN	82383	13767	18156		3621	117927
eusWN	2690	2	0		0	2692
glgWN	4997	2	3111		0	8111
spaWN	12533	3325	1917		670	18445

Ontologies
==========

AdimenSUMO	121181 assignments to 896 AdimenSUMO classes.
Top Ontology 	339582 assignments to 66 Top Ontology properties.
WordNet Domains 146905 assignments to 170 domain labels.


Database design of the MCR 3.0
------------------------------

The MCR 3.0 is structured as a relational database consisting of 39
tables. The main table of the MCR 3.0, wei_ili_record, is in the data directory
and it provides the Inter-Lingual Index (ILI).

-  wei_ili_record: Contains the ILI identifier, in the format
   'ili-30-xxxxxxxx-y', where 'xxxxxxxx' is an 8-digit offset number and
   'y' represents the part of speech: 'n' corresponds to noun, 'v' to
   verb, 'a' to adjective and 'r' to adverb. Each entry also displays
   the source of the ILI (the WordNet of origin), whether it is a base
   concept or not, the lexicographic file from WordNet, and whether it
   is an instance.
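
The identifier format described above can be checked mechanically. The
following Python sketch is only an illustration: the function name
parse_ili and the keys of the returned dictionary are assumptions made
for this example and are not part of the MCR distribution.

```python
import re

# Pattern for 'ili-30-xxxxxxxx-y': an 8-digit offset followed by a
# part-of-speech letter.
ILI_PATTERN = re.compile(r"ili-30-([0-9]{8})-([nvar])")

# Part-of-speech letters as listed in this README.
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}


def parse_ili(identifier):
    """Split an ILI identifier into its offset and part of speech.

    Raises ValueError if the string does not match the documented format.
    """
    match = ILI_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid MCR 3.0 ILI identifier: {identifier!r}")
    offset, pos = match.groups()
    return {"offset": offset, "pos": pos, "pos_name": POS_NAMES[pos]}


# The offset below is a made-up placeholder, not a real ILI record.
print(parse_ili("ili-30-00000001-n"))
```

The offset is kept as a string so that leading zeros are preserved.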

The rest of the tables in the data directory are the following:

-  wei_relations: This table contains the relations offered by the MCR
   3.0. Every relation has an identifier, name, properties and a note
   (optional). Other attributes indicate the inverse of the relation (if
   any) and the group to which the relation belongs. The ID that
   appears in this table is later used in the 'wei_$LANG-30_relation'
   tables to identify each relation.

   For more details of the EuroWordNet relations consult the following paper:

   Piek Vossen. EuroWordNet General Document. Version 3.
   Deliverables D032D033. EuroWordNet project.

   which can be downloaded at:
   http://www.vossen.info/docs/2002/EWNGeneral.pdf

   The props attribute is a four-character string coding four different
   properties: 't' means that the relation is transitive, 's' that the
   original relation in WordNet is between word senses, 'i' that the
   relation has an inverse relation (named in the inverse attribute)
   and 'n' that the relation itself is not encoded in the database
   (its inverse relation is encoded instead). For instance,
   'has_hyponym' is transitive, holds between synsets, has the inverse
   relation 'has_hyperonym' and is the one encoded in the database.
   Thus, 'has_hyperonym' is not encoded.

-  wei_relations_group: This table stores the supergroups of relations
   (Synonyms, Hyperonyms, Meronyms, Causes, ...). The supergroup to
   which each relation belongs is referenced in the 'wei_relations'
   table described above.

-  wei_lexnames: This table lists the WordNet lexicographic files.
   Each entry has a code, which is later used in the 'semf' attribute
   of the 'wei_ili_record' table, plus a descriptive name.
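
As an illustration of the props attribute of 'wei_relations' described
above, the following Python sketch turns a props string into named
flags. It rests on an assumption that this README does not confirm: a
property holds when its letter appears anywhere in the string, and any
other character is an unset placeholder. The function name, the flag
names and the sample value 'ti--' are hypothetical.

```python
# Meaning of each flag letter, as described for the props attribute.
PROPS_FLAGS = {
    "t": "transitive",
    "s": "between_word_senses",
    "i": "has_inverse",
    "n": "not_encoded",
}


def decode_props(props):
    """Return one boolean per documented flag.

    Assumption: a flag is set when its letter occurs anywhere in the
    string; every other character is treated as an unset placeholder.
    """
    return {name: letter in props for letter, name in PROPS_FLAGS.items()}


# 'ti--' is a hypothetical value, not copied from the database.
flags = decode_props("ti--")
print(flags)
```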

Every language included in the MCR 3.0 (including English) is linked
to the ILI. Each WordNet is composed of 5 tables. Each language has
its own 3-letter code, indicated by $LANG, in the following tables:

-  wei_$LANG-30_to_ili: It establishes a correspondence between the ILI
   and the synset offset for each of the 5 languages of the MCR 3.0.
   This way, all 5 languages are connected.

-  wei_$LANG-30_relation: This table contains the relations for each
   language. Each relation has the following attributes: the type of relation,
   as indicated in the catalogue of relations listed in the table 'wei_relations',
   the direction of the relation (source synset and target synset),
   the value of the confidence score, and the WordNet of origin.

-  wei_$LANG-30_synset: Properties of every synset for each language
   including an identifier, total number of descendants, gloss (if any),
   maximum number of levels in its hierarchy, the level number counting
   from the top, and finally the mark of the synset.

-  wei_$LANG-30_variant: The variants are stored in this table. Each
   entry represents a single variant and stores the following
   information: word, sense, the synset offset, the confidence score,
   the experiment it comes from (optional), and finally the mark and
   the note of the variant. The confidence score ranges from 49 to 99
   and rates the association between the variant and the synset. Its
   value depends on the method used to acquire the association.
   Manually revised associations usually have a confidence score of
   99.

-  wei_$LANG-30_examples: This table contains examples (if any) for
   each synset. Each example is identified by the synset offset, pos,
   word and sense.
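
The way the per-language tables connect two languages through the ILI
can be sketched in a few lines of Python. The dictionaries below are
toy stand-ins for 'wei_$LANG-30_to_ili' and 'wei_$LANG-30_variant'. All
offsets, ILI identifiers and rows are invented for illustration, and
the function name translate is hypothetical. A real application would
query the database instead.

```python
# Toy stand-ins for two of the per-language tables. Every offset, ILI
# identifier and row below is invented for illustration only.
TO_ILI = {  # wei_$LANG-30_to_ili: synset offset -> ILI identifier
    "eng": {"eng-0001-n": "ili-30-00000001-n"},
    "spa": {"spa-0001-n": "ili-30-00000001-n"},
}
VARIANT = {  # wei_$LANG-30_variant: (word, sense, synset offset)
    "eng": [("dog", 1, "eng-0001-n")],
    "spa": [("perro", 1, "spa-0001-n"), ("can", 1, "spa-0001-n")],
}


def translate(word, source, target):
    """Return the target-language variants that share an ILI with `word`."""
    # 1. Synsets of the word in the source language, mapped to ILI records.
    ilis = {
        TO_ILI[source][offset]
        for variant, _sense, offset in VARIANT[source]
        if variant == word and offset in TO_ILI[source]
    }
    # 2. Target-language synsets linked to any of those ILI records.
    target_offsets = {
        offset for offset, ili in TO_ILI[target].items() if ili in ilis
    }
    # 3. Variants of those target synsets, in alphabetical order.
    return sorted(
        variant
        for variant, _sense, offset in VARIANT[target]
        if offset in target_offsets
    )


print(translate("dog", "eng", "spa"))  # ['can', 'perro']
```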

Anyone interested in adding a new language to the MCR 3.0 needs to
create the same 5 tables that each of the 5 WordNet directories
contains. The tables should follow the same naming pattern, with the
3-letter code of the new language in place of $LANG. The 3-letter
codes follow ISO 639-2:

http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
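
As a small aid for the naming pattern, the following Python sketch
lists the five table names expected for a given language code. The
function name tables_for_language is hypothetical, and the check for
three lowercase letters is an assumption based on the codes used in
this distribution (cat, eng, eus, glg, spa).

```python
import re

# The five per-language table suffixes described in this README.
TABLE_SUFFIXES = ("to_ili", "relation", "synset", "variant", "examples")


def tables_for_language(code):
    """Return the five table names expected for a 3-letter language code."""
    if not re.fullmatch(r"[a-z]{3}", code):
        raise ValueError(f"expected a 3-letter lowercase code, got {code!r}")
    return [f"wei_{code}-30_{suffix}" for suffix in TABLE_SUFFIXES]


# 'por' (Portuguese) is only an example; it is not part of the MCR 3.0.
for name in tables_for_language("por"):
    print(name)
```

Because these names contain a hyphen, they have to be quoted when used
in SQL statements (backticks in MySQL, double quotes in PostgreSQL).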

Furthermore, the MCR 3.0 integrates ontological knowledge from three
different sources: AdimenSUMO, Top Ontology and WordNet Domains.
The mappings between these ontological resources and the ILI are
language independent.

Domains:

-  wei_domains: This table represents the WordNet domains hierarchy
   using source-target tuples.

-  wei_ili_to_domains:  Each entry links a domain label to an ILI. It
   also indicates the WordNet of origin. This table is shared by all
   languages: the domain information is general and not
   language-dependent.
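
A hierarchy stored as source-target tuples, as in 'wei_domains', can be
explored with a simple traversal. The Python sketch below is only an
illustration: the rows and labels are hypothetical, the function names
are invented, and it assumes that the source is the broader label and
the target the narrower one, which should be checked against the
actual table.

```python
from collections import defaultdict

# Hypothetical rows in the style of wei_domains. The labels and the
# direction (source = broader label, target = narrower label) are assumed.
ROWS = [
    ("top", "sciences"),
    ("sciences", "biology"),
    ("biology", "zoology"),
    ("top", "arts"),
]


def build_children(rows):
    """Index the tuples as: source label -> list of target labels."""
    children = defaultdict(list)
    for source, target in rows:
        children[source].append(target)
    return children


def descendants(children, label):
    """Collect every label reachable from `label` (iterative traversal)."""
    found = []
    seen = set()
    stack = list(children.get(label, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against duplicate rows or cycles
        seen.add(current)
        found.append(current)
        stack.extend(children.get(current, []))
    return sorted(found)


children = build_children(ROWS)
print(descendants(children, "sciences"))  # ['biology', 'zoology']
```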

AdimenSUMO:

-  wei_sumo_relations: This table represents the AdimenSUMO hierarchy
   using source-target tuples. It also has a field that indicates
   whether it is a subclass.

-  wei_ili_to_sumo:  Each entry links an AdimenSUMO label to an ILI. It
   also indicates the WordNet of origin. This information is language
   independent.

Top Ontology:

-  wei_to_relations: This table represents the Top Ontology hierarchy
   using source-target tuples. It also has a field that indicates the
   type of the relation.

-  wei_ili_to_to: Each entry establishes a correspondence between an
   ILI and a property in the Top Ontology and the source WordNet.

-  wei_to_record:  It offers a short description for each type of
   Top Ontology property.

Marks:

-  mark_values_synset: Possible values for synset marks and their
   descriptions.

-  mark_values_variant: Possible values for variant marks and their
   descriptions.


Additional information
----------------------

Ongoing development work on the MCR is done by a small group of
researchers. Since our resources are VERY limited, we request that
you please confine correspondence to MCR topics only. Please check
this documentation and other resources carefully for an answer to
your question or problem before contacting us.

English Princeton WordNet:
http://wordnet.princeton.edu

EuroWordNet project:
http://www.illc.uva.nl/EuroWordNet

WordNet Domains:
http://wndomains.fbk.eu

AdimenSUMO:
http://adimen.si.ehu.es/web/adimenSUMO

Meaning project:
http://nlp.lsi.upc.edu/projectes/meaning

KNOW project:
http://ixa.si.ehu.es/know

KNOW2 project:
http://ixa.si.ehu.es/know2

Multilingual Central Repository:
http://adimen.si.ehu.es/web/MCR


Research groups involved
------------------------

GRIAL 	http://grial.uab.es
IULA	http://www.iula.upf.edu
IXA	http://ixa.si.ehu.es
NLPG	http://nlp.lsi.upc.edu
SLI 	http://webs.uvigo.es/sli

Contact information
-------------------

German Rigau
IXA Group
University of the Basque Country
E-20018 San Sebastián

mcr-users@googlegroups.com