Research Internships

AIM-WEST partners propose the following internship projects for French or Brazilian Master's students. Each project is co-supervised by a Brazilian and a French professor. Please read on and contact us if you are interested!

1) Statistical Machine Translation (SMT) of Subtitles for English, French and Portuguese Languages and better handling of Multi Word Expressions in SMT
Supervisors : L. Besacier & H. Caseli (Laurent.BesacierATimag.fr – helenacaseliATdc.ufscar.br)
This internship is part of a larger project that aims to propose innovative and efficient methods for handling Multi-Word Expressions (MWEs) in Statistical Machine Translation (SMT). A corpus for specifically evaluating this aspect will first have to be collected for SMT between English, French and Portuguese.

This corpus could be extracted from the TED corpus (of subtitles – https://wit3.fbk.eu/ ), and the first machine translation systems will be built using the Moses toolkit. After that, research work on better handling of Multi Word Expressions in SMT will follow. Some ideas for this project are: work on data pre-processing to improve word-alignment and translation models, propose sparse features (related to MWEs) in the SMT model, work on N-best list re-ranking or graph re-decoding, etc.
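
As an illustration of the N-best re-ranking idea, here is a minimal sketch in Python. It is not tied to Moses: the function name, the tuple layout of the N-best list, the MWE lexicon and the feature weight are all hypothetical, and the single feature simply counts known MWE translations found in a hypothesis.

```python
def rerank_nbest(source, nbest, mwe_lexicon, weight=1.0):
    """Re-rank (hypothesis, model_score) pairs with one extra MWE feature.

    mwe_lexicon maps a source-side MWE to an expected target-side translation.
    All names and the scoring scheme are illustrative assumptions.
    """
    padded_source = f" {source} "
    rescored = []
    for hypothesis, model_score in nbest:
        padded_hyp = f" {hypothesis} "
        # Feature: number of lexicon entries whose source side occurs in the
        # source sentence and whose target side occurs in the hypothesis.
        feature = sum(
            1
            for src_mwe, tgt_mwe in mwe_lexicon.items()
            if f" {src_mwe} " in padded_source and f" {tgt_mwe} " in padded_hyp
        )
        rescored.append((hypothesis, model_score + weight * feature))
    # Higher score is better; sorted() is stable, so ties keep the decoder order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


nbest = [("ele chutou o balde", -4.0), ("ele morreu", -4.5)]
lexicon = {"kicked the bucket": "morreu"}
reranked = rerank_nbest("he kicked the bucket", nbest, lexicon)
```

In a real system the weight would be tuned on held-out data together with the other model weights, and the feature would more likely be computed from lemmas or alignments than from surface strings.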

2) Extending the MWEtoolkit with token identification – TAKEN
Supervisors: Carlos Ramisch, Aline Villavicencio (carlos.ramischATlif.univ-mrs.fr – avillavicencioATinf.ufrgs.br)
The mwetoolkit is a tool for extracting lists of MWEs from texts (http://mwetoolkit.sourceforge.net). The goal is to extend it so that the output is identical to the input corpus, but with the MWEs marked using a special markup. The expected outcome is a fully functional tool that produces a corpus with token-level MWE annotation (as opposed to MWE lists independent of the text, as is currently the case).
Tasks:

  • Understand the mwetoolkit and run some toy experiments
  • Define an output format for the markup, something like <mwe id="1">get</mwe> it <mwe id="1">off</mwe>
  • Develop a tool “project.py”, which takes as input the corpus and the list of MWEs (possibly obtained using candidates.py with the -S option) and projects/annotates the MWEs onto the corpus.
  • Evaluate the annotation on a sample corpus of phrasal verbs in English
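
The projection step in the tasks above could look roughly like the following sketch. It is not the mwetoolkit implementation: the function names, the max_gap parameter and the matching on lowercased surface forms are assumptions made for illustration (a real tool would probably match on lemmas and part-of-speech tags, and read the toolkit's own file formats).

```python
def match_from(lowered, tags, mwe, start, max_gap):
    """Return token positions of `mwe` starting at `start`, or None."""
    if lowered[start] != mwe[0] or tags[start] is not None:
        return None
    positions = [start]
    for word in mwe[1:]:
        prev = positions[-1]
        # Allow at most `max_gap` intervening tokens before the next MWE word.
        window = range(prev + 1, min(prev + 2 + max_gap, len(lowered)))
        hit = next((i for i in window
                    if lowered[i] == word and tags[i] is None), None)
        if hit is None:
            return None
        positions.append(hit)
    return positions


def project_mwes(tokens, mwe_list, max_gap=3):
    """Mark every token that belongs to a listed MWE with <mwe id="N">."""
    lowered = [t.lower() for t in tokens]
    tags = [None] * len(tokens)   # MWE occurrence id per token, if any
    next_id = 1
    for mwe in mwe_list:
        for start in range(len(tokens)):
            positions = match_from(lowered, tags, mwe, start, max_gap)
            if positions is None:
                continue
            for p in positions:
                tags[p] = next_id
            next_id += 1
    return " ".join(
        tok if tag is None else f'<mwe id="{tag}">{tok}</mwe>'
        for tok, tag in zip(tokens, tags)
    )


annotated = project_mwes("get it off".split(), [("get", "off")])
```

The matching is greedy and left to right, and each token is assigned to at most one MWE occurrence; both choices are simplifications that the internship would need to revisit.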

3) Extending the MWE annotation of parallel corpora to noun compounds/idioms
Supervisors: Carlos Ramisch, Aline Villavicencio (carlos.ramischATlif.univ-mrs.fr – avillavicencioATinf.ufrgs.br)
The project already has some parallel data on phrasal verbs. The idea of this internship would be to annotate the same parallel corpora with other types of MWEs, namely noun compounds and idioms. Noun compounds are sequences of nouns (and/or other elements) that represent a single noun, like telephone booth, washing machine, cable car. Idioms are expressions which cannot be interpreted semantically word-by-word; we will focus on verb-noun idioms like foot the bill, spill the beans, kick the bucket. The student will use a pre-existing tool such as the mwetoolkit and the respective patterns. We focus on a small set of expressions. Some manual validation might be required on the test sets created.

4) Modelling the variability of support verb constructions
Supervisors: Carlos Ramisch, Aline Villavicencio (carlos.ramischATlif.univ-mrs.fr – avillavicencioATinf.ufrgs.br)
Support verb constructions like take a picture, pay attention, make a call allow less variability than similar constructions like take a bus, pay money, make a cake. The idea of the internship is to automatically generate variants for support verb constructions and then calculate the entropy of this set of variants. It is based on similar work developed for VPCs (Villavicencio et al. 2008 CoNLL). Variability information can be used as a feature for MWE identification and even for better translation.
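
The entropy computation mentioned above can be sketched as follows, assuming that variant frequencies have already been collected from a corpus or from the web. The function name and the counts are invented for illustration only and are not results.

```python
import math


def variant_entropy(counts):
    """Shannon entropy (in bits) of a variant frequency distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)


# Hypothetical counts for variants obtained by swapping the verb.
picture_variants = {"take a picture": 970, "make a picture": 20, "do a picture": 10}
cake_variants = {"make a cake": 400, "bake a cake": 350, "buy a cake": 250}

h_picture = variant_entropy(picture_variants)
h_cake = variant_entropy(cake_variants)
```

With these made-up counts the probability mass of the support verb construction is concentrated on one variant, so its entropy is lower than that of the freer combination, which is the kind of contrast the variability feature is meant to capture.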

5) Creating a dataset for phrasal verb compositionality based on WordNet synonyms
Supervisors: Carlos Ramisch, Aline Villavicencio (carlos.ramischATlif.univ-mrs.fr – avillavicencioATinf.ufrgs.br)
Annotating compositionality for a phrasal verb like take off is nearly impossible out of context, or using a numeric scale or an enumeration of classes. However, finding synonyms, paraphrases and equivalents is a more natural task that native speakers and linguists are comfortable with. Our goal is to assign a compositionality judgement to phrasal verbs in context (sentences) implicitly, by asking the annotator to replace the phrasal verb with other verbs.

6) Evaluation protocol for assessing MWEs in Automatic Speech Recognition 
Supervisors : L. Besacier & A. Villavicencio (Laurent.BesacierATimag.fr – avillavicencioATinf.ufrgs.br)

This work would consist of building an evaluation protocol for assessing MWEs in Automatic Speech Recognition. To the best of our knowledge, this has not been done before. Our idea would be to record a specific read speech corpus in English containing a large number of phrasal verbs (PVs). This could be done by having 3 speakers each record the 500 utterances of the LIG-LIDILEM phrasal verb corpus (and simultaneously record similar sentences without phrasal verbs). After that, the LIG ASR system (based on the KALDI speech recognition toolkit) would be applied, and the word error rates (WER) on the PV and non-PV corpora would be compared. Any differences in performance would then be investigated in more depth.
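
The comparison described above relies on the word error rate, i.e. the word-level edit distance between reference and ASR output divided by the number of reference words. Here is a minimal, toolkit-independent sketch; the function names and the two toy sentence pairs are invented for illustration, and in practice the scoring tools shipped with the ASR toolkit would normally be used.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """WER over a list of (reference, hypothesis) pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words if words else 0.0


# Toy data only: one PV sentence and one comparable non-PV sentence.
pv_pairs = [("she turned the offer down", "she turned the offer town")]
non_pv_pairs = [("she refused the offer yesterday", "she refused the offer yesterday")]

wer_pv = corpus_wer(pv_pairs)
wer_non_pv = corpus_wer(non_pv_pairs)
```

Errors are pooled over the whole corpus before dividing, so long utterances weigh more than short ones; computing a per-speaker WER as well would help separate speaker effects from PV effects.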

7) Unsupervised extraction of MWEs from word alignments and integration with MWEtoolkit
Supervisors : C. Ramisch & H. Caseli (carlos.ramischATlif.univ-mrs.fr – helenacaseliATdc.ufscar.br)

The idea is to develop a tool (integrated into the mwetoolkit) that takes as input automatically word-aligned parallel text (e.g. the output produced by GIZA++) and detects patterns that indicate the presence of multiword expressions. The intuition behind this idea is that MWEs are generally not translated word by word. As a first step, the student can reimplement the simple methods proposed by Caseli et al. (2010) and Tsvetkov and Wintner (2011). Then, he/she can use some contextual measure to detect when some words in the source language are systematically aligned with some words in the target language in a specific context and not in other contexts. The focus will probably be on idiomatic constructions, since they are generally not translated word by word.
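
To make the intuition concrete, the sketch below extracts contiguous source word sequences that are all aligned to a single target word. It is a simplified illustration, not a reimplementation of the cited methods. It assumes the alignment has already been converted to the "i-j" pair notation commonly used with Moses (GIZA++'s native output would need conversion first), and all function names are hypothetical.

```python
from collections import defaultdict


def parse_alignment(line):
    """Parse 'i-j' pairs (source index, target index) into integer tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def many_to_one_candidates(src_tokens, tgt_tokens, alignment):
    """Return (source phrase, target word) pairs for n:1 alignments, n >= 2."""
    by_target = defaultdict(list)
    for s, t in alignment:
        by_target[t].append(s)
    candidates = []
    for t, sources in sorted(by_target.items()):
        sources = sorted(set(sources))
        if len(sources) < 2:
            continue
        # Keep only source words that form one uninterrupted span.
        if sources[-1] - sources[0] + 1 == len(sources):
            phrase = " ".join(src_tokens[s] for s in sources)
            candidates.append((phrase, tgt_tokens[t]))
    return candidates


src = "he kicked the bucket yesterday".split()
tgt = "ele morreu ontem".split()
links = parse_alignment("0-0 1-1 2-1 3-1 4-2")
candidates = many_to_one_candidates(src, tgt, links)
```

Candidates found this way would still need filtering, for example by frequency or by part-of-speech patterns, since automatic alignments are noisy and many n:1 links are not MWEs.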