2nd AiM-WEst Detailed Program

Workshop on Analysis and Integration of MultiWord Expressions in Speech and Translation

October 20-21

Room 220, building 43412 – 65

Institute of Informatics

Federal University of Rio Grande do Sul, Porto Alegre

October 20, 2015 – Day 1

Recent Work at LIG on the AIM-WEST Project

Laurent Besacier (LIG, Université Joseph Fourier, France)

Abstract: I will present the work carried out during the past year on the AIM-WEST project. This mostly includes: (a) the organization of an internal MWE translation shared task for the French-English language pair; (b) an annotation guide for MWE annotation; and (c) a proposal for a new MT evaluation metric based on the LIG semantic resource DBnary.

About the Speaker: Laurent Besacier has been a professor at UJF since September 2009. He defended his PhD thesis in Computer Science, “A parallel model for automatic speaker recognition”, in April 1998 at the University of Avignon (France). He then spent one and a half years at IMT (Switzerland) as an associate researcher working on the M2VTS European project (Multimodal Person Authentication). He joined the University Joseph Fourier (Grenoble, France) as an associate professor in Computer Science in September 1999 and became a full professor in 2009. From September 2005 to October 2006, he was an invited scientist at the IBM Watson Research Center, working on speech-to-speech translation. His research interests fall into two main areas: (1) multilingual speech recognition and (2) more recently, machine translation. Laurent Besacier has supervised or co-supervised 15 PhD students and 20 Master's students. He has also been involved in several national and international projects, among others the NESPOLE European project on speech-to-speech translation and the M2VTS European project on multimodal biometrics, as well as evaluation campaigns organized by NIST, DARPA and other organizations: RT, TRECvid, TRANSTAC, WMT, IWSLT.

On the interplay between complex function words and parsing

Carlos Ramisch (LIF, Aix-Marseille Université, France)

Abstract: In this talk I will describe recent work of the LIF-TALEP team on modeling complex function words in treebanks and corpora. I will first describe experiments done for French adv-que and de-det constructions (Nasr et al. 2015). Then I will describe the generalization of this work to other constructions and languages. Finally, I will discuss topics on the orchestration of MWE identification and parsing.

About the Speaker: Carlos Ramisch is an assistant professor at Aix-Marseille Université and a researcher in computational linguistics at Laboratoire d’Informatique Fondamentale de Marseille (France). He holds a double PhD in Computer Science from the University of Grenoble (France) and from the Federal University of Rio Grande do Sul (Brazil). His research focuses on integrating multiword expression processing into natural language processing applications. He is also interested in MWE discovery, representation and translation, lexical resources, machine translation, computational semantics and machine learning. Carlos was co-organiser of the 2010, 2011 and 2013 editions of the MWE workshop, area chair for MWEs at *SEM 2012, and guest editor of the 2013 special issue on MWEs of the ACM TSLP journal. He is an active member of PARSEME and the author of a book on MWE processing. Carlos develops and maintains the mwetoolkit, a tool for automatic MWE processing.

Labeling multi word units with CRFs

Isabelle Tellier (Université Paris 3 – Sorbonne Nouvelle, France)

Abstract: The recognition of multi word units (MWUs) in texts can be seen as a sequence labeling task. Some corpora in which MWUs are identified are available, so machine learning approaches can be used. In this talk, I will explain how CRFs (Conditional Random Fields), a now classical machine learning approach, can be applied to this task. I will also focus on how external resources (for example, lists of MWUs) can be taken into account during the learning process. I will report the results of various experiments that have been conducted for French and, more recently, for the recognition of discontinuous (possibly embedded) MWUs in English.
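As a rough illustration of the sequence labeling view described in this abstract, the sketch below encodes contiguous MWUs as one B/I/O tag per token and builds a small feature dictionary of the kind a CRF toolkit could consume. It is not the speaker's system: the function names, the feature set and the toy lexicon are assumptions made for this example, and discontinuous or embedded MWUs would need a richer tag set.

```python
def bio_encode(tokens, mwu_spans):
    """Turn contiguous MWU spans (start, end-exclusive) into one B/I/O tag per token."""
    tags = ["O"] * len(tokens)
    for start, end in mwu_spans:
        tags[start] = "B"              # first token of the unit
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the unit
    return tags


def token_features(tokens, i, lexicon):
    """Hypothetical features for token i; the lexicon flag stands in for an external MWU list."""
    word = tokens[i].lower()
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    return {
        "word": word,
        "next_word": next_word,
        # True when the current and next word open an entry of the (assumed) MWU list
        "starts_lexicon_mwu": (word, next_word) in lexicon,
    }


tokens = ["He", "kicked", "the", "bucket", "in", "spite", "of", "everything"]
spans = [(1, 4), (4, 7)]               # "kicked the bucket", "in spite of"
lexicon = {("in", "spite")}            # toy external resource

tags = bio_encode(tokens, spans)
features = [token_features(tokens, i, lexicon) for i in range(len(tokens))]
```

Each (features, tag) pair would then serve as one training position for a sequence labeler; the lexicon flag shows one simple way an external MWU list could enter the model as a feature.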

About the Speaker: Isabelle Tellier has been a professor of computational linguistics at Université Paris 3 – Sorbonne Nouvelle since 2011. Her background is in computer science and her research interest is machine learning applied to texts. She is a member of the Lattice laboratory in Paris. She has worked in various NLP domains, such as theoretical grammatical inference and sequence labeling for POS tagging, chunking, named entity recognition and multi-word unit recognition, in particular with CRFs.
http://www.lattice.cnrs.fr/sites/itellier/

Supervised Statistical MWE Identification: Dependency Parsing vs. Sequential Labeling

Matthieu Constant (Université Paris-Est Marne-la-Vallée, France)

Abstract: In this talk, we investigate the use of dependency structures to represent lexical segmentations including shallow multiword expressions, independently of any syntactic structure. We compare our hierarchical segmenter on several corpora with sequence labelers. Experimental results show comparable scores for flat structures, and open new perspectives for hierarchical representations of deeper constructions, such as nested and interleaved multiword expressions. This is joint work with Joseph Le Roux (Univ. Paris Nord).

About the Speaker: Matthieu Constant has been an associate professor in Computer Science at Université Paris-Est Marne-la-Vallée, France, since 2006. His research interests lie in the field of Natural Language Processing. In particular, his recent work has focused on the interaction of MWE identification and linguistic analysis (POS tagging, chunking and syntactic parsing) in a supervised statistical framework.

Tutorial I – Distributional semantics: context, vectors, embeddings and similarity

Carlos Ramisch (LIF, Aix-Marseille Université, France)

Abstract: “You shall know a word by the company it keeps”. This simple principle, due to Firth and closely tied to Harris’ distributional hypothesis, is the basis of powerful distributional semantic models in natural language processing. In this tutorial I will present an introduction to distributional semantics. I will start with small, worked-out examples that underline the relation between context and meaning. Then, I will present notions about context representation and vector similarity. These will be tested in practice using a tool called Minimantics. I will conclude the tutorial with a presentation of advanced issues such as neural networks, dimensionality reduction and applications.
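To give a flavour of the notions named in this abstract (context, vectors, similarity), here is a minimal sketch that builds co-occurrence count vectors from a toy corpus and compares them with cosine similarity. It does not use Minimantics and is not taken from the tutorial material; the corpus, the window size and the function names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for each target word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, target in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][sent[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (invented for this example)
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]

vecs = context_vectors(corpus, window=2)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])   # share "the" and "drinks"
sim_cat_car = cosine(vecs["cat"], vecs["car"])   # share only "the"
```

On this toy corpus, "cat" ends up closer to "dog" than to "car" because the first two share more context words. Real models add weighting schemes such as PMI and far larger corpora.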

October 21, 2015 – Day 2

Sala dos Conselhos (Room 220, building 43412 – 65 – Instituto de Informática)

Distributional Methods for Identification of Idiomaticity

Marco Idiart and Aline Villavicencio (Federal University of Rio Grande do Sul, Brazil)

Abstract: In this talk we will present on-going work on identification of the interpretation of multiword expressions using distributional semantic methods. In particular we discuss the automatic identification of idiomaticity in English compound nouns using context-predicting methods.

About the Speakers: Marco Idiart is a Productivity Fellow and a Professor at the Institute of Physics, where he was Head of the Physics Department (2007-2011). He has a PhD in Physics from UFRGS and was a Post-Doctoral Researcher at MIT (USA, 2014-2015) and Brandeis University (USA, 2011-2012 and 1992-1995) in Computational Neuroscience. His research is on neural network models for learning and memory.

Aline Villavicencio is a senior lecturer in Computer Sciences at the Federal University of Rio Grande do Sul (Brazil), and a CNPq fellow. She was a visiting scholar at MIT (USA) in 2011-2012 and 2013-2014, and at Saarland University in 2012-2013. She holds PhD and MPhil degrees from the University of Cambridge, UK. Her research has included work on computational language acquisition and grammar engineering for languages such as English and Portuguese. She has coordinated several projects on these topics, which include collaboration with French, US and Latin American universities. She has organized events including the ACL-2007 and the EACL-2009, 2012 and 2014 Workshops on Cognitive Aspects of Computational Language Acquisition, and the ACL 2003, 2004 and 2011, NAACL-2013 and Coling 2010 workshops on Multiword Expressions, among others.

On the application of Focused Crawling for Statistical Machine Translation Domain Adaptation

Viviane Moreira (Federal University of Rio Grande do Sul, Brazil)

Abstract: I will start by presenting a brief summary of my past and current research and then I will move on to reporting on a recent MSc project that I co-supervised together with Aline Villavicencio and Carlos Ramisch on the application of focused crawling (FC) for acquiring comparable corpora. We proposed and evaluated new FC algorithms based on n-grams and on multiword expressions. We also assessed the viability of using FC for performing domain adaptation for generic Statistical Machine Translation (SMT) systems and whether there is a correlation between the quality of the FC algorithm and of the SMT systems that can be built from the collected data.

About the Speaker: Viviane Moreira has been an Associate Professor at the Institute of Informatics-UFRGS in Brazil since 2006. She received her Ph.D. from Middlesex University (UK, 2004) and her M.Sc. from INF-UFRGS (1999). More recently, she spent a sabbatical year at the University of Utah (USA). Her areas of interest are Information Retrieval, Databases and Data Mining. More specifically, her work has focused on multilingual matching, opinion mining, hidden-web crawling, plagiarism detection, and stemming algorithms.

Studies on text complexity: text simplification for adults and automated essay evaluation in Brazilian Portuguese

Aline Evers and Maria José Finatto (Federal University of Rio Grande do Sul, Brazil)

Abstract: The PorPopular Project is a systematic collection of texts from Brazilian popular newspapers whose intended audience is composed primarily of people with low literacy levels and low income. The main goal of the project is to profile the lexicon and language used in these texts as a possible way to model text simplicity in Brazilian Portuguese in terms of vocabulary and syntax. Within the scope of this project, we will present the results of studies involving text complexity measures, automated text simplification and automated essay evaluation.

About the Speakers: Aline Evers has a Bachelor’s Degree in Linguistics and a Master’s Degree in Lexicology, and is working towards a PhD in Lexicology at the Federal University of Rio Grande do Sul, Brazil. She is also an examiner for the Celpe-Bras Exam (the Brazilian Ministry of Education’s official exam for Portuguese language proficiency) and the ENEM Exam (the High School National Exam in Brazil), and her research covers automated essay evaluation and automated text simplification.

Maria José Finatto has a B.Sc. in Languages and Literature (Portuguese/German) from the Universidade Federal do Rio Grande do Sul (1991), an M.Sc. in Languages, Brazilian Lexicography Studies, from the same university (1993), and a doctoral degree in Language Studies, Linguistics, Terminology and Terminography, also from the Universidade Federal do Rio Grande do Sul (2001). She completed postdoctoral training in Natural Language Processing, on readability assessment, at ICMC, University of São Paulo.

Recent Work at UFSCAR on the AIM-WEST Project

Pablo Botton da Costa (UFSCAR-São Carlos, Brazil)

Abstract: I will present the work carried out during the past year on the AIM-WEST project at UFSCAR.

About the Speaker: Pablo Botton da Costa has been a Master's student at UFSCAR-São Carlos since 2014, supervised by Helena Caseli at LALIC, and is a former undergraduate student of UNIPAMPA. He works on multilingual parsing, monolingual parsing, data transfer, machine learning and a little bit of SDN (Software-Defined Networking).

Multiword text simplification with eXplainText

Rodrigo Wilkens (Federal University of Rio Grande do Sul, Brazil)

About the Speaker: Rodrigo Wilkens is a PhD student at the Institute of Informatics, working on language acquisition and text simplification.

MWE Compositionality: Development of a Gold Standard

Leonardo Zilio (Federal University of Rio Grande do Sul, Brazil)

Abstract: In this presentation, we show the method that was used to generate a gold standard for MWE compositionality in English. Starting with a list of MWEs, we present the steps taken to select MWEs for annotation and the subsequent annotation process via a web interface. Finally, we show the results of the annotation and our gold standard list for MWE compositionality.

About the Speaker: Leonardo Zilio is a post-doctoral researcher at the Institute of Informatics, UFRGS, working on text simplification and multiword expressions. He has a PhD in Linguistics (2015, UFRGS), and has worked with semantic role labeling in technical and general corpora.

Simplification of Numerical Expressions

Susana Bautista Blasco (Federal University of Rio Grande do Sul, Brazil)

Abstract: The way information is written or presented can exclude many people, especially those who have difficulty reading, writing or understanding. This work focuses on automatic text simplification, and particularly on the treatment of numerical information. We study this kind of information as a special case of multiword expressions, so that numerical information can be represented as a single semantic unit.

About the Speaker: Susana Bautista Blasco is a post-doctoral researcher at the Institute of Informatics, UFRGS, working on text simplification and multiword expressions. She received her PhD in Computer Science, in the area of text simplification, from the Universidad Complutense de Madrid (Spain) in 2015, supervised by Pablo Gervás and Raquel Hervás.

Multiword Extraction from Europarl

Luís Möllmann dos Santos (Federal University of Rio Grande do Sul, Brazil)

Abstract: I will present ongoing research aimed at establishing a sizeable multilingual repository of MWEs extracted from aligned corpora. The languages involved are Portuguese, English and French, and the MWE types include nominal and verbal compounds. In a first phase, MWEs will be extracted from the Europarl corpus, while we collect other bilingual or multilingual corpora, such as Folha online. MWE extraction will use two different methods: the first, essentially based on a symbolic parser, is used to extract collocations, while the second, based on statistical tools, is used to extract other types of MWEs.

About the Speaker: Luís Möllmann dos Santos is an undergraduate student at the Institute of Informatics, Federal University of Rio Grande do Sul.

Distributional Thesaurus for Portuguese: Evaluation and Methodologies

Eduardo Ferreira (Federal University of Rio Grande do Sul, Brazil)

Abstract: In recent decades there has been an increase in interest in methods for the automatic construction of distributional thesauri from corpora. Efforts to systematically evaluate and improve the resulting thesauri have been made for languages like English and French, but for Portuguese there is an urgent need for such initiatives. This paper presents a comparative investigation of the two main approaches to thesaurus generation, count-based and predictive methods, focusing on Portuguese. For the evaluation we propose a TOEFL-like test for Portuguese, which was automatically generated from BabelNet using nouns and verbs.

About the Speaker: Eduardo Ferreira is an undergraduate student at the Institute of Informatics, Federal University of Rio Grande do Sul.

Tutorial II – Automatic Speech Recognition

Laurent Besacier (LIG, Université Joseph Fourier – Grenoble, France)

Abstract: I will present a tutorial on Automatic Speech Recognition, covering feature extraction, acoustic modelling, lexical modelling and language modelling. Some current hot topics in the domain will also be reviewed.
