We divide the project into the following tasks, each briefly described and listed with the researchers involved:

  1. Corpora construction: Moreira, Laranjeira, Boos, Ramisch, Besacier, Caseli, Vieira, Seno, Polastri
    1. acquisition of parallel corpora (speech-text, text-text), e.g. the TED talks corpus
    2. acquisition of monolingual and comparable corpora from the web
    3. cleaning and sharing the corpora
  2. Protocols for translation evaluation: Esperança-Rodier, Besacier, Caseli, Martins
    1. design of an evaluation protocol for the translation of expressions in context
    2. validation of the protocol in different language pairs and types of MWE
      1. Evaluation for the shared task
        1. in vs. out of context, manual vs. automatic, with synonyms/paraphrasing

To be discussed with Helena during the trip to Brazil, in São Carlos.

  3. Pre-processing of corpora: Wilkens, Prestes, Laranjeira, Zanette, Gamboa, Boos, Correa
    1. data cleaning and filtering
    2. part-of-speech tagging and/or parsing
    3. formatting and standardisation
    4. release of the resulting corpora
  4. Lexical and ontological resource construction for French, Portuguese and English: Ramisch, Villavicencio, Wilkens, Prestes, Vicari, Primo, Penteado, Caseli, Taba, Ribeiro, Vieira, Hruschka, Dinarelli, Poibeau, Tellier, Barchi, Gala
    1. automatic induction of monolingual and bilingual lexicons
      1. Carlos, Aline, Marco, Thierry, Silvio, Sandra, Pautr (English, Portuguese, French)
    2. semi-automatic data-driven construction of ontologies
  5. Cognitive and data-intensive models of MWE learning: Favre, Dinarelli, Poibeau, Tellier, Omodei, Dupont, Idiart, Villavicencio, Machado, Wilkens, Hruschka, Barchi
    1. analysis of MWE acquisition in naturalistic child-directed datasets
      1. Marco, Aline ???
    2. machine learning models of MWE learning
      1. Benoit, Fred, Carlos, Thierry and Marco D.
    3. evaluation of the resulting models
      1. Benoit, Fred, Carlos, Thierry and Marco D.

  6. Lexical models of MWEs: Ramisch, Villavicencio, Idiart, Prestes, Caseli, Vieira, Seno, Dinarelli, Poibeau, Tellier, Polastri, Gala
    1. representation in dictionaries
    2. identification of expressions in text
      1. Thierry, Marco D., Carlos, Aline, Nuria ??
  7. Syntactic models of MWEs: Nasr, Ramisch, Zanette, Wilkens
    1. characterise syntactically flexible expressions in a cross-lingual framework
      1. Alexis, Carlos
    2. integrate such representations in a parser (e.g. in the MICA toolkit developed by the LIF)
      1. Alexis, Carlos

  8. Semantic models of MWEs: Ramisch, Villavicencio, Idiart, Padró, Prestes, Gamboa, Correa, Castellanos, Caseli, Taba, Ribeiro, Seno, Dinarelli, Poibeau, Tellier, Polastri, Gala
    1. use of very large corpora to model the semantics of MWEs
      1. Carlos, Marco, Aline
    2. application of distributional methods
      1. Carlos, Marco, Aline, Kassius ?, Thierry ?
    3. application of paraphrasing methods
      1. Sandra, Carlos, Aline, Marco
    4. evaluation of the resulting models
    5. integration with lower linguistic levels (lexical, syntactic)
  9. Automatic speech recognition systems: Bechet, Besacier (??, depending on internship availability, 3rd year)
    1. development of speech recognition systems for the languages of the project
    2. quantitative study on the recognition of multiword units
    3. integration of MWEs in the language models and acoustic models
    4. evaluation

  10. Machine translation: Besacier, Bechet, Favre, Boitet, Ramisch, Caseli, Vieira, Martins
    1. development of machine translation systems for the target language pairs
    2. quantitative study on the translation of MWEs
      1. Carlos, Laurent
    3. integration of MWEs in the language models and translation models
    4. evaluation
  11. Reports, papers and articles: French and Brazilian groups
    1. writing of reports and scientific papers

MWE Types

  • *VPCs - Carlos, Aline, Marco, Laurent
    • Languages: English to French and Portuguese
  • *Compounds - Aline, Marco, Thierry
    • Languages: English, French and Portuguese
  • *Light Verb Constructions - Carlos
    • Languages: English to French and Portuguese
  • *Idiomatic expressions - Laurent, Carlos, Aline
    • Languages: English to French and Portuguese


  • Parallel Corpora: Subtitles (English, French, Portuguese)
  • Monolingual Corpora: brWaC, ukWaC, frWaC, BNF (restricted access), Wikipedias (French, English, Portuguese??)


  • Type Annotation in corpora (4 months, Sílvio)
  • Token identification
  • MWE alignment


Lattice - student or Marco Dinarelli, Isabelle Tellier (to be confirmed)

Marseille - Fred ??, Carlos (Cameleon)

Grenoble - Christian (Cameleon), Laurent



  • Carlos from 24/10
  • Laurent from 29/10 (30, 31 in São Carlos, 3 and 4/11 in POA)
  • Helena is unavailable from 1/11??

Joint short-term projects


  1. mwetoolkit - Carlos and Aline
  2. MWE in SMT quality - Laurent and Helena/Aline
    1. *corpus already collected
    2. *focus on methods
  3. Light verb constructions - student: Sandra, Carlos
    1. Variability - permutation variation