MWE MT Eval Workshop Notes Thursday 30th October

Short Notes

-Laurent presented Zied’s corpus in more detail => it has to be cleaned; maybe we should run the sentences through the MWE toolkit to filter them and detect the MWEs they contain

-The shared data will be put on the web site’s intranet

-Agnès presented the French MWE-annotated corpus in more detail
=>it could be useful to have it in Portuguese, following the same methodology
=>Agnès also mentioned that we could extract bilingual dictionaries from Wiktionary

-Christian presented a Pt-Fr dictionary, which is available in .epub format… He would like to recover this dictionary in an electronic form; a demonstration of AxiMAG was also given, and it could be used for the AIM-WEST website and as a post-editing platform; finally, a Pt parser was presented by Christian & Paltonio

-Carlos briefly presented recent developments in the MWE toolkit

-Helena presented the LALIC machine translation portal
Pt-En and En-Pt systems have been developed on corpora of scientific texts
=>maybe the annotation methodology we followed for French could also be followed on the Portuguese side

-Possible ways to collect Pt-BR-Fr parallel corpora were discussed
=> WIT (TED) corpus : see https://wit3.fbk.eu/ and http://hltshare.fbk.eu/EAMT2012/html/Papers/59.pdf
=> OPENSUBTITLES corpus

-Isabelle presented her work (with Matthieu Constant) on jointly segmenting and POS tagging MWEs

Work Plan for 2015

Task 1 Release a usable Fr-En-Pt-BR corpus: it could be extracted from TED, then used to train first SMT systems and include them in the LALIC portal => summer internship from Polytech’ Grenoble (RICM4) on this topic, co-supervised by Helena & Laurent

Task 2 Organize an internal Fr-En MT shared task: the goal is to evaluate the behavior of several MT systems when translating several types of MWEs
a) Choose a corpus (potentially annotated with MWEs): we can take the French corpus annotated in Grenoble; it has 548 sentences and 700 MWEs
(it might be good to separate sentences by MWE category; if a single sentence contains multiple MWEs, we duplicate it)

(Christian’s idea: translate the corpus into Portuguese)

b) Translate it with several systems (LIF, LIG, Systran, Google)

c) Define metrics (automatic; manual) & evaluate; use Sectra_W as the PE platform, for instance
(En is the target language, so we can use TERpa, METEOR, etc.)
automatic evaluation: LIG will try to come up with a proposal
post-editing: collect post-edits, then go back to the automatic evaluation metrics, using the post-edits as references
manual: Agnès thinks we need it
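
The corpus split mentioned in a) and the post-edit-based scoring mentioned in c) could be sketched roughly as follows. This is only an illustration under assumptions: the data layout (a sentence paired with a list of (MWE, category) tuples), the category names, and the choice to duplicate a sentence once per distinct category are all hypothetical, and the edit rate is a simplified word-level measure without block shifts, not TERpa or METEOR.

```python
from collections import defaultdict


def split_by_category(sentences):
    """Group sentences by MWE category.

    `sentences` is a list of (text, [(mwe, category), ...]) pairs
    (hypothetical layout). A sentence with MWEs from several categories
    is duplicated: it appears once under each distinct category.
    """
    buckets = defaultdict(list)
    for text, mwes in sentences:
        for category in sorted({cat for _, cat in mwes}):
            buckets[category].append(text)
    return dict(buckets)


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_rate(mt_output, post_edit):
    """Edits needed to turn the MT output into its post-edit,
    divided by the post-edit length (simplified: no block shifts)."""
    hyp, ref = mt_output.split(), post_edit.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return word_edit_distance(hyp, ref) / len(ref)
```

With a made-up example, `edit_rate("he kicked the bucket yesterday", "he died yesterday")` counts one substitution and two deletions against a three-word post-edit. Averaging such scores per MWE category, using the split from `split_by_category`, would give one possible per-category view of system behavior.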

Possible Schedule :
-January 2015 – Train/Test data released
-30th March 2015 – MT system outputs collected
-April-May 2015 – Post-editing of system outputs (Sectra_W; discuss with Lingxiao)
-AIM-WEST workshop in October 2015 (Porto Alegre)

Task 3 Extend the MWE annotation done in French to English (Carlos+Agnès) and Portuguese (Helena’s group) using a similar typology (Agnès will provide the English typology + examples of expressions in several languages)

Task 4 POS tagging or morphological analysis of non-contiguous MWEs

-extend Tellier & Constant’s work to non-contiguous MWEs
-Christian & Paltonio work on handling non-contiguous MWEs in the morphological analysis of Portuguese

Speech processing aspects are postponed to 2016!! (to prepare for this, start thinking about taking lattices into account in the MWE toolkit?)

Task 5 Using word alignments to detect MWEs in parallel corpora

The idea is to propose an internship to extend the work Helena, Aline and Carlos started in 2010 (LRE special issue paper). We would like to use automatic word alignment to detect MWEs automatically and integrate this into the MWE toolkit. A detailed internship subject will be posted online soon.
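
To make the idea concrete, here is a minimal sketch of one simple alignment-based heuristic: flag any run of two or more adjacent source words that are all aligned to the same target word. This is an assumption made for illustration, not necessarily the method of the 2010 work nor anything that exists in the MWE toolkit; the function names are hypothetical, and the input is assumed to be in the common `i-j` (source index, target index) link format that word aligners can output.

```python
from collections import defaultdict


def parse_alignment(line):
    """Parse links such as '0-0 1-1 2-1' into (source, target) index pairs."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]


def mwe_candidates(source_tokens, links):
    """Return source-side word sequences in which two or more adjacent
    source words are aligned to the same target word."""
    by_target = defaultdict(set)
    for src, tgt in links:
        by_target[tgt].add(src)

    candidates = []
    for tgt in sorted(by_target):
        indices = sorted(by_target[tgt])
        run = [indices[0]]
        # The trailing None forces the last run to be flushed.
        for idx in indices[1:] + [None]:
            if idx is not None and idx == run[-1] + 1:
                run.append(idx)
                continue
            if len(run) >= 2:
                candidates.append(" ".join(source_tokens[i] for i in run))
            if idx is not None:
                run = [idx]
    return candidates
```

For a made-up pair such as "il a cassé sa pipe hier" / "he died yesterday" with links `0-0 1-1 2-1 3-1 4-1 5-2`, the four source words aligned to "died" come out as a single candidate. A real extractor would need further filtering (frequency thresholds, POS patterns) and some treatment of non-contiguous expressions, which this sketch ignores.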